
On my "intranet index machine" running htdig, I've successfully used doc2html.pl as "external parser", which is more of a wrapper than anything else:-). It uses several parsers that convert from a range of formats to html ... which in turn can rather easily be turned into text. You'll find it at http://www.htdig.org/files/contrib/parsers/ ... amongst some others.
rtf2html and catdoc (parsers used by the above) from http://www.45.free.net/~vitus/ice/catdoc/ would fit your bill...:-) Especially catdoc, since it reads M$ Word files and outputs text;-)
You might also be interested in xlHtml and ppt2html (http://www.xlhtml.org).

Regarding pdf ... you have at least two options. Either install the xpdf package (almost all distros have it), which contains the pdftotext program... or do a pdf2ps followed by a ps2ascii ... these are part of ghostscript, which is very commonly installed to handle your PostScript(-ish) printing needs:-). (I'm too slow typing... majorwoo already covered xpdf).
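
A rough sketch of how an indexer could shell out to these tools, assuming pdftotext and catdoc are on the PATH. The helper names (extraction_command, extract_text) are made up for illustration; only the two command lines come from the tools themselves ("-" tells pdftotext to write to stdout, and catdoc prints to stdout by default).

```python
import shutil
import subprocess
from pathlib import Path


def extraction_command(path):
    """Pick a command line that prints the document's text to stdout.

    Hypothetical helper: the extension-to-tool mapping is an assumption.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # "-" as the output file makes pdftotext write to stdout
        return ["pdftotext", str(path), "-"]
    if suffix == ".doc":
        # catdoc writes the extracted text to stdout by default
        return ["catdoc", str(path)]
    raise ValueError(f"no converter known for {path!r}")


def extract_text(path):
    """Run the converter and return its output as text."""
    cmd = extraction_command(path)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} is not installed on this machine")
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Something like extract_text("manual.pdf") would then hand plain text to whatever does the indexing, provided the tools are actually installed.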

:-)
It shouldn't have to come to that... catdoc and pdftotext should do nicely;-).

-- Glenn


Mario_castroAuthor Commented: 2003-10-28

Oops, I have a problem: this software is meant to be installed in phpDig, which is a search application that needs two modules to search in PDF and DOC files. It is developed in PHP, and I need to put this search engine on my site, which is hosted on a non-dedicated server, so I can't install any applications there.

Strings will work somewhat for "normal" M$ doc files (you'll likely index a hefty amount of garbage too)... Pdf is a tougher cookie, since you'll need to "decode" the postscriptish language (so you don't index that), and perhaps also "unscramble" some binary parts... not easily done with sed (or awk ... or perl).
The easy thing is to try and convince the server owner to install some of the utils above.

Why should we doubt Mario's word, willy134? I'm guessing he's setting this up on a hosting service of some kind that simply doesn't provide the tools he needs. Might be wrong though (wouldn't be the first time that happens either:-).

There is no way you can extract text from a PDF file without a PDF parser. There are two problems with using tools like sed or strings. First, depending on the PDF creator, even normal text will be stored as binary data. When using Adobe tools to create PDF documents, the more recent these tools are, the more likely it is that the content is binary (actually compressed). The second problem is the incremental updates PDF allows. You may find text in a file (assuming that the text is actually uncompressed) that is no longer supposed to be in the document: somebody may have removed one or more pages from the document and saved the updated document. A PDF creator will (if not specifically asked to do otherwise) only store the new and updated data, and will leave the original content intact - you can actually recover the old version of the document by stripping off the new part at the end of the file. So using a non-PDF-aware approach for indexing purposes will lead to a corrupt index.
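
To illustrate both problems, here is a small sketch on toy byte strings (not real, complete PDF files). printable_runs is a crude stand-in for strings(1), and first_revision relies on the assumption that every saved revision ends with its own %%EOF marker; all names here are invented for the example.

```python
import re
import zlib


def printable_runs(data, min_len=4):
    """Crude stand-in for strings(1): runs of printable ASCII bytes."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)


def first_revision(data):
    """Cut the file after the first %%EOF, dropping later incremental updates."""
    end = data.find(b"%%EOF")
    return data if end == -1 else data[: end + len(b"%%EOF")]


# Problem 1: a page content stream stored with the Flate (zlib) filter.
content = b"BT /F1 12 Tf 72 720 Td (Quarterly revenue report) Tj ET\n" * 3
compressed = zlib.compress(content)
raw = (
    b"<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream"
)

# A strings-style scan sees the dictionary keywords but not the page text.
found = any(b"Quarterly" in run for run in printable_runs(raw))
# Only after inflating the stream does the text show up again.
recovered = zlib.decompress(compressed)

# Problem 2: an incremental update is appended, the old data stays in the file.
rev1 = b"%PDF-1.4\n(Old page text, since deleted)\n%%EOF"
updated = rev1 + b"\n(update section that removes the page)\n%%EOF"
revisions = updated.count(b"%%EOF")
```

Here found is False while recovered still holds the text, and first_revision(updated) gives back the old version including the "deleted" page, which is exactly what a naive scan would happily index.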

The DOC format has similar mechanisms to retain old information, so it's also necessary to use a tool that is aware of these "hidden" data structures.

No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:
PAQ'd and pts forfeited
Please leave any comments here within the next seven days.

I forgot one reason why this will not work: The PDF standard does not require you to output text as "text strings": You could for example first print all "a" characters on a page, then all "b", and so on... This means that a text extraction tool has to use the relative position of every "text piece" on a page to sort the data back into something that more or less resembles your original text. So, don't do it with strings or sed. I hope we can put this to rest now :-)
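
A toy sketch of that sorting step, assuming each text piece has already been pulled out as an (x, y, text) tuple in PDF page coordinates (y grows upward). The function name, the tuple layout and the line tolerance are assumptions for the example; real extractors also have to deal with fonts, rotation and word spacing.

```python
def reassemble(pieces, line_tolerance=2.0):
    """Group pieces into lines by y position, then order each line by x."""
    lines = []  # each entry: [y of the line, [(x, text), ...]]
    # Visit pieces from the top of the page to the bottom.
    for x, y, text in sorted(pieces, key=lambda p: -p[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, read left to right.
    return "\n".join(
        "".join(text for _, text in sorted(items)) for _, items in lines
    )


# A page that draws every "a" first, then every "b", and so on.
pieces = [
    (10, 700, "a"), (30, 700, "a"), (50, 700, "a"), (10, 680, "a"),
    (0, 700, "b"), (20, 680, "b"),
    (0, 680, "c"),
    (20, 700, "n"), (40, 700, "n"),
]

in_file_order = "".join(text for _, _, text in pieces)
by_position = reassemble(pieces)
```

Reading the pieces in file order gives "aaaabbcnn", while sorting by position gives "banana" on the first line and "cab" on the second.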

Yup. It's a strange beast all right:-). So any perl/php/sed/awk/whatever would need to be a full-fledged (well, not really... You'd just need ... "semirendering", and be damned with those cases that it cannot grok:-) pdf parser. Yucky at best:-).