Check that the METS, OCR, JPEG2000 masters and the PDFs are consistent

Detailed description

As shown in the diagram below, check the image and ALTO file information defined in the METS against the actual files stored in separate Zip files. Also check the number of pages in the PDF file against the number of files in the images Zip file and in the ALTO Zip file. Report any mismatches.
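
The checks described above could be sketched roughly as follows. This is a minimal illustration using only the Java standard library, not the tool's actual code. The class and method names are hypothetical, and it assumes the METS references its files through mets:FLocat elements with an xlink:href attribute. The PDF page count is passed in as a plain number; in practice it might be read with a library such as PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; names and structure are assumptions, not the tool's real code.
class ConsistencySketch {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Base name of a path, e.g. "images/0001.jp2" becomes "0001.jp2". */
    static String baseName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Names of files referenced by mets:FLocat/@xlink:href that end with the given lower-case suffix. */
    static Set<String> filesInMets(InputStream mets, String suffix) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            NodeList locations = factory.newDocumentBuilder().parse(mets)
                    .getElementsByTagNameNS(METS_NS, "FLocat");
            Set<String> names = new TreeSet<>();
            for (int i = 0; i < locations.getLength(); i++) {
                String href = ((Element) locations.item(i)).getAttributeNS(XLINK_NS, "href");
                if (href.toLowerCase().endsWith(suffix)) {
                    names.add(baseName(href));
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse METS", e);
        }
    }

    /** Base names of all non-directory entries in a Zip stream. */
    static Set<String> filesInZip(InputStream zip) throws IOException {
        Set<String> names = new TreeSet<>();
        try (ZipInputStream in = new ZipInputStream(zip)) {
            for (ZipEntry entry; (entry = in.getNextEntry()) != null; ) {
                if (!entry.isDirectory()) {
                    names.add(baseName(entry.getName()));
                }
            }
        }
        return names;
    }

    /** Compares METS entries with Zip contents, and the PDF page count with the Zip file count. */
    static List<String> findMismatches(String label, Set<String> inMets, Set<String> inZip, int pdfPages) {
        List<String> report = new ArrayList<>();
        for (String name : inMets) {
            if (!inZip.contains(name)) {
                report.add(label + ": " + name + " listed in METS but missing from Zip");
            }
        }
        for (String name : inZip) {
            if (!inMets.contains(name)) {
                report.add(label + ": " + name + " in Zip but not listed in METS");
            }
        }
        if (pdfPages != inZip.size()) {
            report.add(label + ": PDF has " + pdfPages + " pages but Zip has " + inZip.size() + " files");
        }
        return report;
    }
}
```

The same comparison would be run once for the images Zip file and once for the ALTO Zip file. An empty list returned by findMismatches means that pair is consistent.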

How well does the solution meet your issue? The solution cross-checks the METS of the .TIFF and the PDF, so it meets the initial requirements of the issue.

What more would you like the solution to do? A possible future development is the generation of machine-readable output.

Do you think you can implement the solution in your organisation?

What further investigation/development/testing would be required before implementation at your organisation?

Are there any process, workflow or technical obstacles to implementation? Believe that, subject to the usual validation checks, it can be implemented in the institution.

Summarise the benefits to your organisation that the solution could provide? With over 150 000 books digitised in the collection, the solution, especially if it generated machine-readable output, would allow the files to be checked without someone having to check each one manually.

What potential exists to apply the solution elsewhere? The principle behind the solution could be applied to other combinations of files involved in a migration.

Tool (link)

The solution has been developed in Java. Several existing Java components have been integrated into the solution, e.g. PDFBox, Apache Commons Compress and dom4j. Thanks also to Carl for sharing Zip processing code.

The user needs to define the list of METS files to be processed in a config file, e.g.