Java program to characterise PDF files, looking for preservation concerns.

Detailed description

Currently checks for the following:

Is the document encrypted?
Can the document be printed?
Can the document be amended?
Number of pages.
Embedded JavaScript.
External links, extracting their URIs.
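
As an illustration of the printing and amendment checks: in an encrypted PDF these permissions are stored as bit flags in the /P integer of the standard encryption dictionary, where bit 3 (counting from 1) allows printing and bit 4 allows modifying the contents. The sketch below decodes just those two bits in plain Java. The class and method names are hypothetical and are not taken from this program, which would more likely ask its PDF library for the permissions. An unencrypted document has no /P entry and is not covered here.

```java
/** Hypothetical helper: decodes two permission flags from a PDF /P value. */
final class PdfPermissions {
    // The PDF specification numbers bits from 1, so bit 3 is 1 << 2.
    private static final int PRINT_BIT = 1 << 2;   // bit 3: print the document
    private static final int MODIFY_BIT = 1 << 3;  // bit 4: modify the contents

    private PdfPermissions() {}

    /** True if the print flag is set in the given /P value. */
    static boolean canPrint(int p) {
        return (p & PRINT_BIT) != 0;
    }

    /** True if the modify-contents flag is set in the given /P value. */
    static boolean canModify(int p) {
        return (p & MODIFY_BIT) != 0;
    }

    /** Short human-readable summary of the two flags. */
    static String describe(int p) {
        return "print=" + canPrint(p) + ", modify=" + canModify(p);
    }
}
```

For example, PdfPermissions.describe(-44) yields "print=true, modify=false", since -44 has bit 3 set and bit 4 clear.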

Embedded fonts proved challenging. Here is a summary of why:

Fonts are used in three places:

The document's pages.
The embedded AcroForm (if present).
The form fields on pages.

All three areas therefore have to be crawled to extract the fonts used.
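
The crawl can be pictured as a union over the three sources. The sketch below is a minimal illustration in plain Java, assuming the font names for each source have already been read out by whichever PDF library is in use. FontCrawl, collectFontNames and the parameter names are hypothetical and do not come from this program.

```java
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: merges font names found in the three places fonts can occur. */
final class FontCrawl {
    private FontCrawl() {}

    /**
     * Returns the sorted union of all font names.
     *
     * @param pageFonts     one set of font names per page
     * @param acroFormFonts fonts declared on the AcroForm (empty if there is none)
     * @param fieldFonts    one set of font names per form field found on a page
     */
    static SortedSet<String> collectFontNames(
            List<Set<String>> pageFonts,
            Set<String> acroFormFonts,
            List<Set<String>> fieldFonts) {
        SortedSet<String> all = new TreeSet<>();
        for (Set<String> fonts : pageFonts) {
            all.addAll(fonts);
        }
        all.addAll(acroFormFonts);
        for (Set<String> fonts : fieldFonts) {
            all.addAll(fonts);
        }
        return all;
    }
}
```

Using a sorted set means a font that appears in more than one place is reported once, and the output order is stable between runs.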

The PDFBox API does not provide an easy method to detect whether a font is embedded within a PDF document. The iText and JPod APIs both supply methods that do this, which should allow implementation and automated cross-testing (JPod vs. iText).

There is a final twist to the puzzle at this point. Detecting that a font is embedded isn't enough: the font may be corrupt or incomplete.

The embedded font may be corrupt, so the font itself should be parsed to ensure that it is indeed a valid font (FontBox could be used for this).

PDF allows the embedding of font subsets, so a check needs to be made that all of the characters used in the document are contained in the embedded font.
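
A rough sketch of the subset check, in plain Java, is given here. It relies on the PDF convention that a subset font's name starts with a tag of six uppercase letters followed by a plus sign (for example ABCDEF+Arial). The coverage check assumes that the characters used in the document and the characters supplied by the embedded font have already been extracted as Unicode code points. That extraction is the hard part and is not shown, since a real check has to go through each font's encoding. FontSubsets and its methods are hypothetical names, not part of this program.

```java
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Hypothetical helper: subset-name detection and a simple coverage check. */
final class FontSubsets {
    // Subset tag: exactly six uppercase letters, a plus sign, then the font name.
    private static final Pattern SUBSET_TAG = Pattern.compile("[A-Z]{6}\\+.+");

    private FontSubsets() {}

    /** True if the font name carries a subset tag such as "ABCDEF+". */
    static boolean isSubsetName(String baseFontName) {
        return baseFontName != null && SUBSET_TAG.matcher(baseFontName).matches();
    }

    /** Returns the code points used in the text that the embedded font lacks. */
    static SortedSet<Integer> missingCodePoints(String text, Set<Integer> embedded) {
        SortedSet<Integer> missing = new TreeSet<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!embedded.contains(cp)) {
                missing.add(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return missing;
    }
}
```

An empty result from missingCodePoints would suggest the subset covers the text. Any code points returned are candidates for a preservation warning.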

Solution champion

Carl Wilson

Git link

Group Evaluation Notes

Embedded fonts issue exploration. Solution partial, but interesting discoveries in the journey. This needs to be documented here!