andreas1234567 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I find myself in charge of a large number of PDF documents produced by a document production system. There is currently no automated testing. For every new release of the (buggy) document production system itself, or of its (even more buggy) document templates, we face a time-consuming, error-prone, much-hated and pretty much useless "please look through this pile of documents and report any errors" nightmare of a manual testing process.

Preferably, I would like to add automated tests of both the PDFs' contents and their visual appearance, and humbly ask for the Monks' advice.

PDF content testing

The strategy is to use Xpdf (pdftotext.exe) to convert each PDF to text, and then use Test::File::Contents to check the output. This works reasonably well, but alternative solutions or suggestions are welcome.
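That pipeline can be sketched in a few lines. This is a minimal, hypothetical example (the file names and the regexes are invented; it assumes pdftotext is on the PATH):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;
use Test::File::Contents;

# Hypothetical input/output names -- adjust to your environment.
my $pdf = 'invoice.pdf';
my $txt = 'invoice.txt';

# Convert with Xpdf's pdftotext; -layout keeps a rough column layout,
# which makes the regexes below less fragile.
system('pdftotext', '-layout', $pdf, $txt) == 0
    or die "pdftotext failed: $?";

# Assert on the extracted text.
file_contents_like   $txt, qr/Invoice number: \d+/, 'invoice number present';
file_contents_unlike $txt, qr/\bERROR\b/,           'no template error markers';

done_testing;
```

The nice part is that each such script is an ordinary Test::More test, so the whole pile runs under prove.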

PDF visuals testing

The strategy is ... non-existent. Any assistance or guidance is highly appreciated.

If you can find templates that "are not supposed to change", like the page for a book cover or something like that, maybe you can set up a special single-page document and render it to a bitmap using (yuck) Image::Magick (or, maybe better, Ghostscript directly). Then you can try to use Image::Compare or Image::SubImageFind to locate the unchanging parts again.
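A rough sketch of that idea, assuming Ghostscript is on the PATH and a known-good reference rendering already exists (file names and the threshold value are invented for illustration):

```perl
use strict;
use warnings;
use Image::Compare;

# Render page 1 of the PDF to a 150 dpi PNG with Ghostscript.
system('gs', '-dSAFER', '-dBATCH', '-dNOPAUSE',
       '-sDEVICE=png16m', '-r150',
       '-dFirstPage=1', '-dLastPage=1',
       '-sOutputFile=got.png', 'cover.pdf') == 0
    or die "gs failed: $?";

# Compare against a known-good rendering; THRESHOLD passes if no
# pixel pair differs by more than the given color distance.
my $cmp = Image::Compare->new(
    image1 => 'reference.png',
    image2 => 'got.png',
    method => &Image::Compare::THRESHOLD,
    args   => 25,    # tolerance -- tune for your renderer's jitter
);
print $cmp->compare ? "page matches reference\n" : "page differs\n";
```

Pinning the Ghostscript version matters here: anti-aliasing differences between releases will otherwise trip the threshold for no real reason.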

Of course, maintaining such a library of image-based tests gets really ugly. Maybe you can use wraith by the BBC to manage and compare the "screenshots" whenever a change is detected.

I feel your pain. I have the (mis)fortune of having to deal with this on a daily basis at $WORK.

The strategy is to use pdftotext.exe to convert PDF into text

*yuck*

If that works, more power to you. I always ended up with inconsistently spaced blobs of text whenever I tried that route.
My personal preference is to use pdftohtml.exe. I use the one included in Calibre Portable since it is actively updated.

I use the following command line:
pdftohtml.exe -xml -zoom 1.4 [PDF FILE]

This will rip out all the text elements into an XML file, with attributes for the font, the x/y position on the page, and the text length. (-zoom 1.4 makes the positioning units 100 dpi.)
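To give a concrete idea, here is a self-contained, hypothetical sketch: a trimmed fragment in the shape pdftohtml -xml emits (all values invented), with XML::LibXML asserting that a field sits where the template puts it:

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed sample in the shape of pdftohtml -xml output; the numbers
# and text are made up for illustration.
my $xml = <<'XML';
<pdf2xml>
  <page number="1" width="850" height="1100">
    <text top="120" left="90" width="200" height="16" font="0">Invoice number: 12345</text>
  </page>
</pdf2xml>
XML

my $doc  = XML::LibXML->load_xml(string => $xml);
my ($el) = $doc->findnodes('//text[contains(., "Invoice number")]');
die "invoice number not found" unless $el;

# With -zoom 1.4 the coordinates are roughly 100 dpi units, so we can
# check the field's position with a small tolerance.
my $top  = $el->getAttribute('top');
my $left = $el->getAttribute('left');
print "position ok\n"
    if abs($top - 120) <= 5 && abs($left - 90) <= 5;
```

Having the positions means you can catch the "text is correct but landed in the wrong box" class of template bug, which plain pdftotext checks miss entirely.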

Yes, you can use packages such as PDF::API3 to “dumpster-dive” quite a ways into the “guts” of a PDF file, but if you can identify defects from the text content of the file, your ugly approach might be the most cost-effective. The content of a PDF can be beastly unpredictable, making it difficult to write reliable logic to track down problems.
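Even without dumpster-diving, a few cheap structural sanity checks are easy to script. A sketch using the more widely deployed sibling PDF::API2 (the file name and expected values are hypothetical):

```perl
use strict;
use warnings;
use PDF::API2;
use Test::More;

# Hypothetical document -- adjust name and expectations.
my $pdf = PDF::API2->open('invoice.pdf');

# A release that suddenly emits 2 or 4 pages instead of 3 is
# caught before anyone reads a word of the content.
is $pdf->pages, 3, 'expected page count';

# Page geometry: an A4 media box is 595 x 842 points.
my ($llx, $lly, $urx, $ury) = $pdf->openpage(1)->get_mediabox;
cmp_ok abs(($urx - $llx) - 595), '<=', 1, 'page 1 is A4 width';

done_testing;
```

These checks are crude, but they cost nothing to run on every build and fail loudly on the most embarrassing regressions.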

And in appropriate status meetings, keep oh-so-politely mentioning the ¢o$t of the fact that this system is still not working as the business has reason to expect. Every hour spent ... opportunity costs ...