Tool Time: Export PDF Text with Pdftotext

If you occasionally need to export text from PDF files, pdftotext might be a handy addition to your personal toolbox. Part of Foo Labs' free Xpdf package, pdftotext is a command-line tool that automates the export process.

Using pdftotext is straightforward. To export the text from a file named vmware.pdf, run:

pdftotext vmware.pdf

This command creates a new file named vmware.txt in the same folder as vmware.pdf. Where possible, pdftotext removes embedded hyphenation and line breaks. It also marks each page boundary in the output with a page break (a form feed character); to leave these out, add the -nopgbrk option:

pdftotext vmware.pdf -nopgbrk
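
Because pdftotext derives the output name on its own, converting a whole folder takes only a short loop. The following is a minimal sketch, assuming a POSIX shell and pdftotext on the PATH; the function name export_pdf_text_dir and the example folder are made up for illustration and are not part of Xpdf.

```shell
# Hypothetical helper: convert every .pdf in a folder to a .txt file beside it.
export_pdf_text_dir() {
    dir=${1:-.}                      # default to the current folder
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # no PDFs: the pattern stays unexpanded
        # No output name is given, so report.pdf becomes report.txt.
        pdftotext -nopgbrk "$pdf" || echo "failed: $pdf" >&2
    done
}

# Example call (the folder name is an assumption):
# export_pdf_text_dir ./manuals
```

A failed conversion is reported on standard error and the loop moves on to the next file, so one damaged PDF does not stop the batch.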

To send the text to standard output (normally the screen) instead of a file, add - at the end of the command, in place of an output file name:

pdftotext vmware.pdf -

You can use multiple parameters together as well:

pdftotext vmware.pdf -nopgbrk -
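
Writing to standard output makes it easy to hand the text to other command-line tools. As a rough sketch (the function names count_pdf_words and pdf_grep are hypothetical, and wc and grep are assumed to be available), the combined options can feed a word count or a search:

```shell
# Hypothetical helper: count the words in a PDF's text.
count_pdf_words() {
    # -nopgbrk leaves out page breaks; - sends the text to standard output.
    pdftotext -nopgbrk "$1" - | wc -w
}

# Hypothetical helper: print the lines that contain a word, ignoring case.
pdf_grep() {
    pdftotext -nopgbrk "$2" - | grep -i -- "$1"
}

# Example calls:
# count_pdf_words vmware.pdf
# pdf_grep snapshot vmware.pdf
```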

Pdftotext works only with actual text, so you won't be able to export images or scanned text that hasn't had optical character recognition (OCR) performed on it. However, it works extremely well in its specific niche.
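
One way to spot a scanned file that has not been through OCR is to check whether pdftotext produces any letters or digits at all. This is only a heuristic sketch: the function name has_text_layer is hypothetical, and a scanned PDF that happens to contain a few stray characters would still pass.

```shell
# Hypothetical helper: succeed if the PDF yields any extractable text.
has_text_layer() {
    # grep -q exits with success as soon as it sees one letter or digit.
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Example call:
# has_text_layer scan.pdf || echo "scan.pdf probably needs OCR first"
```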

The Xpdf package contains several other tools that can be useful for working with PDF files. Pdftoppm and pdftops convert PDF files to the Portable Pixmap (PPM) or PostScript format, respectively. Pdfimages extracts all images from a PDF file, pdfinfo returns general PDF metadata, and pdffonts lists the fonts a PDF file uses, which helps diagnose font-related problems. If you work with PDF files and like command-line tools, Xpdf is well worth checking out.
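
As one example of the companion tools, pdfinfo prints its metadata as labeled lines, including a Pages: line, so a page count can be pulled out with awk. A small sketch, assuming pdfinfo and awk are on the PATH; the function name pdf_page_count is hypothetical:

```shell
# Hypothetical helper: print the number of pages in a PDF.
pdf_page_count() {
    # pdfinfo prints lines such as "Pages:          12"; keep the number.
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Example call:
# pdf_page_count vmware.pdf
```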