PDF Conversions in Java


1. Introduction

In this quick article, we’ll focus on converting programmatically between PDF files and other formats in Java.

More specifically, we’ll describe how to save PDFs as image files such as PNG or JPEG, convert PDFs to Microsoft Word documents, export them as HTML, and extract their text, using several open-source Java libraries.

2. Maven Dependencies

The first library we’ll look at is Pdf2Dom. Let’s start with the Maven dependencies we need to add to our project:
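As a rough sketch, the dependency block might look like the following. The group and artifact IDs are the coordinates under which these libraries are published on Maven Central. The ${...} version properties are placeholders of our own (an assumption, not values taken from this article), so they need to be defined in the properties section of the pom with whichever releases we choose:

```xml
<!-- Sketch only: each version property is a placeholder to define ourselves -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>${pdfbox.version}</version>
</dependency>
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>${pdf2dom.version}</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>${itext.version}</version>
</dependency>
<dependency>
    <groupId>com.itextpdf.tool</groupId>
    <artifactId>xmlworker</artifactId>
    <version>${xmlworker.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>${poi.version}</version>
</dependency>
```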

The latest versions of iText and Apache POI can be found on Maven Central.

3. PDF and HTML Conversions

To work with HTML files, we’ll use Pdf2Dom, a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further.

To convert HTML to PDF, we need XMLWorker, a library provided by iText.

Note that when converting HTML to PDF, we need to ensure that every HTML tag is properly opened and closed; otherwise, the PDF will not be created. The upside of this approach is that the PDF will look exactly the same as the HTML file.
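One way to catch unclosed tags before attempting a conversion is to parse the markup as XML first. The following is a minimal sketch using only the JDK's built-in XML parser, not iText or XMLWorker. It assumes the HTML is written as XHTML without a DOCTYPE or HTML-only entities such as &nbsp;, which a plain XML parser would not resolve. The MarkupCheck class and isWellFormed method are hypothetical names chosen for this illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: checks that every tag in the markup is opened and closed.
class MarkupCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, e.g. because of an unclosed tag
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not a problem with the markup itself, so report it separately
            throw new IllegalStateException("Could not run the XML parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><p>Hello</p></body></html>"));
        System.out.println(isWellFormed("<html><body><p>Hello</body></html>"));
    }
}
```

The JDK parser may also log the parse error to standard error for the malformed input. For a simple pre-check like this one, that output can be ignored.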

4. PDF to Image Conversions

There are many ways to convert PDF files to images. One of the most popular solutions is Apache PDFBox, an open-source Java library for working with PDF documents. For image to PDF conversion, we’ll use iText again.

4.1. PDF to Image

To start converting PDFs to images, we need the pdfbox-tools dependency mentioned in the previous section.

Please note that we can provide an image as a file or load it from a URL, as shown in the example above. Moreover, the output file extensions we can use are jpeg, jpg, gif, tiff, or png.
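Which of these formats can actually be written depends on the image writers registered in the running JVM, assuming the image is written through the standard javax.imageio API. TIFF support, for example, was only added to the standard library in Java 9. The sketch below uses only javax.imageio, not PDFBox itself. It checks each format name and encodes a blank BufferedImage (the type a rendered PDF page is represented as) to PNG. The ImageFormatCheck class and its methods are hypothetical names for this illustration:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: probes the image formats this JVM is able to write.
class ImageFormatCheck {

    // True if an ImageIO writer is registered for the given format name.
    static boolean canWrite(String formatName) {
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }

    // Encodes the image in the given format and returns the resulting bytes.
    static byte[] encode(BufferedImage image, String formatName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // ImageIO.write returns false when no suitable writer is found
            if (!ImageIO.write(image, formatName, out)) {
                throw new IllegalArgumentException("No writer for format: " + formatName);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (String format : new String[] {"jpeg", "jpg", "gif", "tiff", "png"}) {
            System.out.println(format + " writable: " + canWrite(format));
        }
        // Stand-in for a rendered page: a blank 10x10 RGB image
        BufferedImage blank = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        System.out.println("PNG size in bytes: " + encode(blank, "png").length);
    }
}
```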

5. PDF to Text Conversions

To extract the raw text from a PDF file, we’ll use Apache PDFBox again. For text to PDF conversion, we’ll use iText.

5.1. PDF to Text

We created a method named generateTxtFromPDF(…) and divided it into three main parts: loading the PDF file, extracting the text, and creating the final file.

To read a PDF file, we use PDFParser with the “r” (read) option. We then call the parser.parse() method, which parses the PDF as a stream and populates the COSDocument object.

In the first line, we save the COSDocument in the cosDoc variable. It is then used to construct a PDDocument, which is the in-memory representation of the PDF document. Finally, we use PDFTextStripper to return the raw text of the document. After all of these operations, we need to call the close() method to close all the streams we used.

In the last part, we’ll save text into the newly created file using the simple Java PrintWriter:
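As a minimal sketch of this last step, using only the standard library, the following writes a string to a file through a PrintWriter. The TextFileWriter class, the saveText method, the sample text, and the temporary file are all hypothetical stand-ins, since the text would normally come from the extraction step described above:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: saves extracted text to a file using PrintWriter.
class TextFileWriter {

    // Writes the text to the target file as UTF-8, replacing existing content.
    static void saveText(String text, Path target) {
        try (PrintWriter writer = new PrintWriter(
                Files.newBufferedWriter(target, StandardCharsets.UTF_8))) {
            writer.print(text);
            // PrintWriter swallows I/O errors, so we ask for them explicitly
            if (writer.checkError()) {
                throw new UncheckedIOException(new IOException("Failed to write " + target));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the text extracted from the PDF
        String parsedText = "Text extracted from a PDF";
        Path target = Files.createTempFile("pdf-text", ".txt");
        saveText(parsedText, target);
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

The try-with-resources block closes the writer even if writing fails, so no explicit close() call is needed in this part.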

6. PDF to Docx Conversions

Creating a PDF file from a Word document is not easy, and we won’t cover this topic here. We recommend using a third-party library for it, such as jWordConvert.

To create a Microsoft Word file from a PDF, we’ll need two libraries, both of them open source. The first is iText, which we use to extract the text from the PDF file. The second is POI, which we use to create the .docx document.

Please note that with the SimpleTextExtractionStrategy extraction strategy, we’ll lose all formatting. To preserve more of it, we can experiment with the other extraction strategies that iText provides.

7. PDF to X Commercial Libraries

In the previous sections, we described open-source libraries. There are a few more libraries worth noting, but they are commercial:

jPDFImages – jPDFImages can create images from pages in a PDF document and export them as JPEG, TIFF, or PNG images.

JPEDAL – JPedal is an actively developed and very capable native Java PDF library SDK used for printing, viewing, and converting files.

pdfcrowd – another Web/HTML to PDF and PDF to Web/HTML conversion library, with an advanced GUI.

8. Conclusion

In this article, we discussed ways to convert PDF files into various formats.

The full implementation of this tutorial can be found in the GitHub project, which is Maven-based. To test it, simply run the examples and check the results in the output folder.
