Update

April 20, 2015: The article and the Visual Studio project are updated and work with the latest PDFBox version (1.8.9). It's also possible to download the project with all dependencies (resolving the dependencies proved to be a bit tricky).

February 27, 2014: This article originally described parsing PDF files using PDFBox. It has been extended to include samples for IFilter and iTextSharp.

How to Parse PDF Files

There are three main approaches to extracting text from PDF files in .NET:

Microsoft IFilter interface and Adobe IFilter implementation

iTextSharp

PDFBox

None of these solutions is perfect; we discuss each of them below.
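To give a feel for the simplest of the three, text extraction with the IKVM-compiled PDFBox assembly takes only a few lines. This is a minimal sketch; the file name is a placeholder:

```csharp
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

class PdfBoxDemo
{
    static void Main()
    {
        // Load the document and run the stock text stripper over every page.
        PDDocument doc = PDDocument.load("input.pdf"); // placeholder path
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            Console.WriteLine(stripper.getText(doc));
        }
        finally
        {
            doc.close(); // always release the document's resources
        }
    }
}
```

Note the lowercase method names (`load`, `getText`, `close`): the .NET assembly exposes the Java API as-is, since it is produced by compiling the Java library with IKVM.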

1. Parsing PDF using Adobe PDF IFilter

To parse PDF files through the IFilter interface, you need an IFilter implementation registered for the .pdf extension, such as the Adobe PDF IFilter.

If you are using the PDF IFilter that comes with Adobe Acrobat Reader, you will need to rename your process to "filtdump.exe"; otherwise the IFilter interface returns the E_NOTIMPL error code. See more at Parsing PDF Files using IFilter [squarepdf.net].
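For completeness, loading the registered IFilter from managed code starts with a P/Invoke of the `LoadIFilter` function exported by query.dll. The declaration below is a minimal sketch only; the full IFilter COM interface definition and the subsequent `Init`/`GetChunk`/`GetText` calls are omitted:

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeMethods
{
    // Returns (via ppIUnk) the IFilter registered for the file's extension.
    // The returned interface pointer must be queried for IFilter and released.
    [DllImport("query.dll", CharSet = CharSet.Unicode)]
    public static extern int LoadIFilter(
        string pwcsPath,
        [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
        ref IntPtr ppIUnk);
}
```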

Comments and Discussions

I tried to use the PDFBox converter program, but it didn't create the txt file.
I put in the name of the PDF file and the name of the txt file.
The PDF file is in the same path as the program.
Can someone help me? Thanks

I downloaded the PDFBox example, and after running the program no txt file was created, and the extracted text was not printed to the console. How do I retrieve the extracted text so I can manipulate it? Can you help me, please?

Can you help me with an example? The way I supply the arguments does not generate the txt file, and the compiler does not report an error. I don't know where to supply the arguments.
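If the sample converter is a console program that takes the input and output paths as its two arguments, the invocation would look something like this (the executable name below is hypothetical, not taken from the article's project):

```shell
PDFBoxConverter.exe C:\docs\input.pdf C:\docs\output.txt
```

When running from Visual Studio, the same arguments go under Project Properties → Debug → Command line arguments, since the IDE does not prompt for them.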

I'm working on a console application that extracts data from specific sections of PDF documents. To do this, I first need to convert each PDF into a string to work with, so I turned to iTextSharp. The PDFs are laid out with two columns per page, so I'm using SimpleTextExtractionStrategy (I tried iTextSharp.text.pdf.parser.LocationTextExtractionStrategy but found it ineffective for this page layout).

The pages I seem to be having trouble with have a "header" posted up on the side of the page. Pages with headers are intermittently dispersed through the document.

It seems that when it finishes looking through the columns on a page, it moves on to that side header. It then jumps to the next page that has a side header, converts that to text, and then starts again from the top of the page where the first header was encountered.
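For reference, a minimal per-page extraction loop with iTextSharp 5 looks like the sketch below (the file name is a placeholder). SimpleTextExtractionStrategy emits text in content-stream order rather than visual order, which is why side headers can appear out of reading order:

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class ExtractDemo
{
    static void Main()
    {
        PdfReader reader = new PdfReader("input.pdf"); // placeholder path
        StringBuilder sb = new StringBuilder();
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page; the strategy accumulates state.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }
        reader.Close();
        Console.WriteLine(sb.ToString());
    }
}
```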

Nice article and nice comparison.
I'd add a comment for anyone who is going to include one of these libraries in a web application: if you have to deploy to a service provider that doesn't allow a full-trust policy, you need to get the library's source code and recompile it with the full-trust flag, or sign your app.

Unfortunately I'm not able to help much here. Extracting non-ASCII text from PDF files is a pain.

As far as I know, earlier versions of the PDF format (such as 1.4), which are still in wide use, didn't support Unicode, so you had to use a specific pre-Unicode encoding (like Windows-1253). That makes text extraction difficult, because the encoding has to be mapped to Unicode properly. If the library of your choice doesn't do that, you can hardly improve it yourself.
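As an illustration of the code-page mapping described above: decoding Windows-1253 bytes in .NET is trivial once you know the source encoding; the hard part in PDF extraction is discovering which encoding applies to a given font's text.

```csharp
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // 0xC1 and 0xE1 are 'Α' and 'α' in the Windows-1253 (Greek) code page.
        byte[] raw = { 0xC1, 0xE1 };
        string text = Encoding.GetEncoding(1253).GetString(raw);
        Console.WriteLine(text); // prints "Αα"
    }
}
```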