After a file migration (for example PDF to PDF/A, DOC to DOCX, or DOC to PDF/A) we should check that the conversion has been successful and that the significant properties of the object have been maintained. We do not do this consistently at present. We may check a handful of files after a batch process, but that means we are likely to miss the one conversion that has failed. It would be great to have a tool that could open the two documents (the original and the migrated file), compare several quantifiable metrics (for example word count, page count, number of images, paragraph count, and anything else that can be measured) and report on those conversions where the numbers do not match. These could then be assessed by eye individually and re-migrated if necessary.
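
As a rough illustration of the comparison and reporting step described above, here is a minimal Python sketch. It assumes the metrics have already been extracted into plain dictionaries by some other means. The metric names and the functions `compare_metrics` and `report_batch` are hypothetical and chosen only for this example; this is not an existing tool.

```python
# Hypothetical metric names; a real tool would fill these in from the files.
METRICS = ("word_count", "page_count", "image_count", "paragraph_count")


def compare_metrics(original, migrated, metrics=METRICS):
    """Return {metric: (original_value, migrated_value)} for each mismatch.

    A metric missing from one side is reported as None for that side.
    """
    mismatches = {}
    for name in metrics:
        before = original.get(name)
        after = migrated.get(name)
        if before != after:
            mismatches[name] = (before, after)
    return mismatches


def report_batch(pairs):
    """Check a batch of conversions and keep only the suspicious ones.

    pairs: iterable of (label, original_metrics, migrated_metrics).
    Returns {label: mismatches} for every conversion whose numbers differ.
    """
    flagged = {}
    for label, original, migrated in pairs:
        diff = compare_metrics(original, migrated)
        if diff:
            flagged[label] = diff
    return flagged
```

Calling `report_batch` over a whole batch would return only the conversions that need checking by eye, so every file is tested rather than a sample.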

Any other parties who are also interested in applying Issue Solutions to their Datasets

Possible Solution approaches

Apache POI for looking inside MS Word docs - see what metrics can be extracted

Other technology for PDF files
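
To give a feel for which metrics can be pulled out of a word-processing file, here is a small sketch. Note that it does not use Apache POI: it uses only the Python standard library, and it works only for DOCX files (which are ZIP archives containing `word/document.xml`), not for legacy DOC or for PDF. The function name `docx_metrics` is hypothetical, and the counts are approximations; for instance, the image count is simply the number of files under `word/media/`, and page count is not attempted because it depends on layout.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_metrics(path_or_file):
    """Return approximate word, paragraph and image counts for a DOCX file.

    Assumption: counts are rough. Paragraphs nested inside other paragraphs
    (for example in text boxes) may have their words counted more than once.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
        # Embedded media files, used here as a stand-in for "number of images".
        image_count = sum(
            1 for name in archive.namelist() if name.startswith("word/media/")
        )

    word_count = 0
    paragraph_count = 0
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Join the text runs first so a word split across runs counts once.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        words = len(text.split())
        word_count += words
        if words:
            paragraph_count += 1  # only paragraphs that contain text

    return {
        "word_count": word_count,
        "paragraph_count": paragraph_count,
        "image_count": image_count,
    }
```

The same dictionary shape could in principle be produced for the original file by a different extractor (Apache POI for DOC, or a PDF library), so that the two sets of numbers can be compared directly.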

Context

Our ingest manual states that we carry out checks after migration, but we don't always do so. In the past we have discovered some conversions that haven't worked properly. For example, doc to odf used to cause problems with pages being slightly out. We tend to find out about such problems on an ad hoc basis, though, and it would be better if we had a more foolproof method of assessing the success of file conversions.

Lessons Learned

Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice