List of PDF Extraction Resources

Added 1/15/2014: Some commercial PDF solution vendors have agreed to offer special evaluation versions of their software to hackathon participants. While evaluation licenses are common, they often come with restrictions on the number of pages that can be processed – making them useless for the hackathon. The following vendors are providing versions of their software with high page limits or no page limits at all:

Here is a list of PDF tools that you can use at the hackathon or afterwards. I would like to keep this list complete and up to date, so please use the comments to tell me about tools and technologies not listed thus far. Because this list is becoming very long, I am now denoting (with a ♦) tools that I believe participants should consider first. Selections are based on my own experience applying each tool to a sample PDF, whether the project is still active, and (for commercial tools) whether a liberal evaluation license is available.

PDF2SVG – Java tool developed by Peter Murray-Rust that converts PDFs to Scalable Vector Graphics (SVG) files, which most modern browsers can render. PDF2SVG, which is based on PDFBox, is a component of the larger AMI suite of open source tools created for the purpose of liberating scientific documents. Another component, SVG2XML, converts the SVG files to HTML and is currently under heavy development. Download Page: https://bitbucket.org/petermr/pdf2svg-dev/overview. Repository: https://bitbucket.org/petermr/pdf2svg-dev/src.

How about including MuPDF too? (See http://www.mupdf.com.) It’s dual-licensed under the AGPL and a commercial license, and it extracts text from PDFs into a set of generic data structures that other tools can manipulate.

Thank you very much for the list of technology resources that can help with data extraction from PDFs. This was tremendously helpful in the research that I’ve been conducting for my company. I hope there are more hackathons coming!