Friday, 20 January 2017

During two years of operation, more than 3.000 ingests have piled up in the Technical Analyst's workbench of our digital preservation software. The vast majority of them were singled out by the format validation routines, indicating a problem with the standards compliance of these files. Repairing them is a lot of work that, because the repository software doesn't support batch operations for TIFF repairs, would require months of repetitive tasks. Being IT personnel, we did the only sane thing that we could think of: let the computer take care of that. We extracted the files from our repository's working directory, copied them to a safe storage area and ran an automated repair routine on those files. In this article, we want to go into a little detail about how much effort it actually takes to repair a large corpus of inhomogeneously invalid TIFFs, which errors we encountered and which tools we used to repair them.

So, let's first see how big our problem actually is. The Technical Analyst's workbench contains 3.101 submission information packages (SIPs), each of them containing exactly one Intellectual Entity (IE). These SIPs contain 107.218 TIFF files, adding up to a grand total of about 1,95 TB of storage. That's an average of 19,08 MB per TIFF image.
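The average can be re-derived with a few lines of Python. This is only a sanity check on the figures quoted above. Reading "TB" as a binary unit is our assumption, made because it lands close to the quoted 19,08 MB, whereas decimal units give a noticeably lower value.

```python
# Figures taken from the paragraph above.
TOTAL_TB = 1.95
TIFF_COUNT = 107_218


def average_size(total_tb: float, count: int, binary: bool) -> float:
    """Average file size in MB (decimal units) or MiB (binary units)."""
    factor = 1024 ** 2 if binary else 1000 ** 2
    return total_tb * factor / count


decimal_avg = average_size(TOTAL_TB, TIFF_COUNT, binary=False)
binary_avg = average_size(TOTAL_TB, TIFF_COUNT, binary=True)
print(f"decimal units: {decimal_avg:.2f} MB, binary units: {binary_avg:.2f} MiB")
```

The small remaining gap to 19,08 is plausibly explained by the total having been rounded to 1,95.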

While the repository software does show an error message for each invalid file in the WebUI, these messages cannot be extracted automatically, making them useless for our endeavour. Moreover, our preservation repo uses JHove's TIFF-hul module for TIFF validation, which cannot be modified to accommodate local validation policies. We use a policy that is largely based on Baseline TIFF, including a few extensions. To validate TIFFs against this policy (or any other policy that you can think of, for that matter), my colleague Andreas has created the tool checkit_tiff, which is freely (free as in free speech AND free beer) available on GitHub for anyone to use. We used this tool to validate our TIFF files and single out those that didn't comply with our policy. (If you are interested, we used the policy as configured in the config file cit_tiff6_baseline_SLUB.cfg, which implements the conditions laid out in the German document http://www.slub-dresden.de/ueber-uns/slubarchiv/technische-standards-fuer-die-ablieferung-von-digitalen-dokumenten/langzeitarchivfaehige-dateiformate/handreichung-tiff/ as published on 2016-06-08.)

For the correction operations, we used the tool fixit_tiff (also created by Andreas and freely available), the tools tiffset and tiffcp from the libtiff suite and convert from ImageMagick. All of the operations ran on a virtual machine with 2x 2,2 GHz CPUs and 3 GB RAM with a recent and fairly minimal Debian 8 installation. The storage was mounted via NFS 3 from a NetApp enterprise NAS system and connected via 10 GBit Ethernet. Nevertheless, we only got around 35 MB/s throughput during copy operations (and, presumably, also during repair operations), which we'll have to investigate further in the future.

The high-level algorithm for the complete repair task was as follows:

copy all of the master data from the digital repository to a safe storage for backup

duplicate that backup data to a working directory to run the actual validation/repair in

split the whole corpus into smaller chunks of 500 SIPs to keep processing times low and be able to react if something goes wrong

run repair script, looping through all TIFFs in the chunk

validate a TIFF using checkit_tiff

if TIFF is valid, go to next TIFF (step 4), else continue (try to repair TIFF)

parse validation output to find necessary repair steps

run necessary repair operations

validate the corrected TIFF using checkit_tiff to detect errors that haven't been corrected

recalculate the checksums for the corrected files and replace the old checksums in the metadata with the new ones

steps 4-7 are run until only those files are left that cannot be repaired in an automatic workflow
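The steps above can be sketched in Python roughly as follows. This is a minimal illustration and not the script we actually ran: `validate` and `repair` are hypothetical callables standing in for a wrapper around checkit_tiff and for the dispatch to fixit_tiff, tiffset, tiffcp or convert, `max_rounds` is an invented safety limit, and MD5 is only assumed as the checksum algorithm. The demo at the end uses fake files and fake tools so that the sketch runs on its own.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

# Hypothetical interfaces: validate() returns the error messages for one
# TIFF (empty list = valid); repair() tries to fix those errors in place.
Validator = Callable[[Path], List[str]]
Repairer = Callable[[Path, List[str]], None]


def chunked(items: Sequence, size: int = 500) -> Iterator[Sequence]:
    """Split the corpus into chunks (the text uses 500 SIPs per chunk)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def checksum(path: Path, algorithm: str = "md5") -> str:
    """Hash a file block by block so large TIFFs need not fit into RAM."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def repair_chunk(
    tiffs: Sequence[Path],
    validate: Validator,
    repair: Repairer,
    max_rounds: int = 3,
) -> Tuple[Dict[Path, str], List[Path]]:
    """Return (new checksums of repaired files, files that stayed invalid)."""
    new_checksums: Dict[Path, str] = {}
    unrepairable: List[Path] = []
    for tiff in tiffs:
        errors = validate(tiff)
        if not errors:
            continue  # valid from the start, nothing to do
        rounds = 0
        while errors and rounds < max_rounds:
            repair(tiff, errors)
            errors = validate(tiff)  # re-validate after every repair attempt
            rounds += 1
        if errors:
            unrepairable.append(tiff)
        else:
            new_checksums[tiff] = checksum(tiff)
    return new_checksums, unrepairable


# Demo with fake files and fake tools.
with tempfile.TemporaryDirectory() as tmp:
    good, fixable, hopeless = (
        Path(tmp) / name for name in ("good.tif", "fixable.tif", "hopeless.tif")
    )
    good.write_bytes(b"valid")
    fixable.write_bytes(b"broken")
    hopeless.write_bytes(b"broken")

    def fake_validate(path: Path) -> List[str]:
        return [] if path.read_bytes() == b"valid" else ["some policy violation"]

    def fake_repair(path: Path, errors: List[str]) -> None:
        if path.name != "hopeless.tif":
            path.write_bytes(b"valid")

    new_checksums, unrepairable = repair_chunk(
        [good, fixable, hopeless], fake_validate, fake_repair
    )
    repaired_names = sorted(path.name for path in new_checksums)
    unrepairable_names = [path.name for path in unrepairable]

print("repaired:", repaired_names, "unrepairable:", unrepairable_names)
```

Passing the validator and the repair routine in as callables keeps the loop independent of how the command-line tools are actually invoked.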

During the several iterations of validation, failed correction and enhancements to the repair recipes, we found the following correctable errors. Brace yourself, it's a long list. Feel free to scroll past it for more condensed information.

"baseline TIFF should have only one IFD, but IFD0 at 0x00000008 has pointer to IFDn 0x<HEX_ADDRESS>"

This is a multipage TIFF with a second Image File Directory (IFD). Baseline TIFF requires only the first IFD to be interpreted by baseline TIFF readers.

"Invalid TIFF directory; tags are not sorted in ascending order"

This is a violation of the TIFF6 specification, which requires that TIFF tags in an IFD must be sorted ascending by their respective tag number.
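Both of these structural checks, a single IFD and ascending tag order, can be illustrated with a few lines of Python that read IFD0 directly. This is a simplified sketch for classic TIFF only and not how checkit_tiff is implemented. `read_ifd0`, `check_ifd0` and `build_tiff` are names made up for this example, and the test data is synthetic.

```python
import struct
from typing import List, Tuple


def read_ifd0(data: bytes) -> Tuple[List[int], int]:
    """Return (tag numbers in file order, offset of the next IFD) for IFD0."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel")
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola")
    else:
        raise ValueError("not a TIFF: bad byte-order mark")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF: magic number is not 42")
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    # Every IFD entry is 12 bytes long and starts with its 2-byte tag number.
    tags = [
        struct.unpack_from(endian + "H", data, ifd_offset + 2 + 12 * i)[0]
        for i in range(count)
    ]
    # The 4 bytes after the last entry point to the next IFD (0 = none).
    (next_ifd,) = struct.unpack_from(endian + "I", data, ifd_offset + 2 + 12 * count)
    return tags, next_ifd


def check_ifd0(data: bytes) -> List[str]:
    """Report the two structural problems described above."""
    tags, next_ifd = read_ifd0(data)
    problems = []
    if next_ifd != 0:
        problems.append(f"IFD0 has pointer to another IFD at 0x{next_ifd:08x}")
    if tags != sorted(tags):
        problems.append("tags are not sorted in ascending order")
    return problems


def build_tiff(tags: List[int], next_ifd: int = 0) -> bytes:
    """Build a tiny synthetic little-endian TIFF structure for the demo."""
    header = struct.pack("<2sHI", b"II", 42, 8)
    entries = b"".join(struct.pack("<HHII", tag, 3, 1, 0) for tag in tags)
    return header + struct.pack("<H", len(tags)) + entries + struct.pack("<I", next_ifd)


print(check_ifd0(build_tiff([256, 257, 259])))
print(check_ifd0(build_tiff([257, 256, 259])))
print(check_ifd0(build_tiff([256, 257], next_ifd=0x1234)))
```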

"tag 256 (ImageWidth) should have value , but has value (values or count) was not found, but requested because defined"

The tag is required by the baseline TIFF specification, but wasn't found in the file.

"tag 257 (ImageLength) should have value , but has value (values or count) was not found, but requested because defined"

Same here.

"tag 259 (Compression) should have value 1, but has value X"

This is a violation of our internal policy, which requires that TIFFs be stored without any compression. The values for X that we found are 4, 5 and 7, which are CCITT T.6 bi-level encoding, LZW compression and TIFF/EP JPEG baseline DCT-based lossy compression, respectively. The last of these would be a violation of the TIFF6 specification. However, we've noticed that a few files in our corpus were actually TIFF/EPs, where Compression=7 is a valid value.

"tag 262 (Photometric) should have value <0-2>, but has value (values or count) 3"

The pixels in this TIFF are color map encoded. While this is valid TIFF 6, we don't allow it in the context of digital preservation.

"tag 262 (Photometric) should have value , but has value (values or count) was not found, but requested because defined"

The tag isn't present at all, even though it's required by the TIFF6 specification.

"tag 269 (DocumentName) should have value ^[[:print:]]*$, but has value (values or count) XXXXX"

The field is of ASCII type, but contains characters that are not from the 7-Bit ASCII range. Often, these are special characters that are specific to a country/region, like the German "ä, ö, ü, ß".

"tag 270 (ImageDescription) should have value word-aligned, but has value (values or count) pointing to 0x00000131 and is not word-aligned"

The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.
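In TIFF, a word is two bytes, so a word-aligned offset is simply an even number. As a small illustration (the helper names are made up for this example), the offset 0x00000131 from the error message above is odd and would have to move to the next even address:

```python
def is_word_aligned(offset: int) -> bool:
    """A TIFF word is 2 bytes, so aligned offsets are even."""
    return offset % 2 == 0


def next_word_boundary(offset: int) -> int:
    """Round an offset up to the next even address (adds one pad byte if odd)."""
    return offset + (offset & 1)


offset = 0x131  # offset quoted in the error message above
print(hex(offset), is_word_aligned(offset), hex(next_word_boundary(offset)))
```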

"tag 271 (Make) should have value ^[[:print:]]*$, but has value (values or count)"

The Make tag is empty, even though the specification requires it contains a string of the manufacturer's name.

"tag 271 (Make) should have value ^[[:print:]]*$, but has value (values or count) Mekel"

That's a special case where scanners from the manufacturer Mekel write multiple NULL bytes ("\0") at the end of the Make tag, presumably for padding. This, however, violates the TIFF6 specification.
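The rule `^[[:print:]]*$` that appears in these messages can be approximated in Python as follows. This is a sketch under the assumption that `[[:print:]]` means the printable 7-bit ASCII range 0x20 to 0x7E (as in the C locale). The helper names are hypothetical, and the snippet says nothing about how checkit_tiff or fixit_tiff handle the field internally.

```python
import re

# Assumption: [[:print:]] is interpreted as printable 7-bit ASCII (0x20-0x7E).
PRINTABLE_ASCII = re.compile(rb"[\x20-\x7e]*")


def is_printable_ascii(value: bytes) -> bool:
    """True if every byte of the value is printable 7-bit ASCII."""
    return PRINTABLE_ASCII.fullmatch(value) is not None


def strip_nul_padding(raw: bytes) -> bytes:
    """Reduce any run of trailing NUL bytes to the single terminating NUL."""
    return raw.rstrip(b"\0") + b"\0"


print(is_printable_ascii(b"Mekel"))                     # plain ASCII
print(is_printable_ascii("Müller".encode("latin-1")))   # contains an umlaut
print(is_printable_ascii(b"Mekel\0\0"))                 # extra NUL padding
print(strip_nul_padding(b"Mekel\0\0\0"))
```

`fullmatch` is used instead of `^...$` because `$` in Python's `re` module would also accept a trailing newline.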

"tag 272 (Model) should have value ^[[:print:]]*$, but has value (values or count)"

The Model tag is empty, even though the specification requires it contains a string of the scanner device's name.

"tag 273 (StripOffsets) should have value , but has value (values or count) was not found, but requested because defined"

The tag isn't present at all, even though it's required by the TIFF6 specification.

"tag 278 (RowsPerStrip) should have value , but has value (values or count) was not found, but requested because defined"

Same here.

"tag 278 (RowsPerStrip) should have value , but has value (values or count) with incorrect type: unknown type (-1)"

This error results from the previous one: if a field doesn't exist, then checkit_tiff will assume data type "-1", knowing that this is no valid type in the real world.

"tag 278 (RowsPerStrip) was not found, but requested because defined"

The tag isn't present at all, even though it's required by the TIFF6 specification.

"tag 279 (StripByteCounts) should have value , but has value (values or count) was not found, but requested because defined"

The field doesn't contain a value, which violates the TIFF6 specification.

"tag 282 (XResolution) should have value word-aligned, but has value (values or count) pointing to 0x00000129 and is not word-aligned"

The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.

"tag 292 (Group3Options) is found, but is not whitelisted"

As compression is not allowed in our repository, we disallow this field that comes with certain compression types as well.

"tag 293 (Group4Options) is found, but is not whitelisted"

Same here.

"tag 296 (ResolutionUnit) should have value , but has value"

The tag ResolutionUnit is a required field and is set to "2" (inch) by default. However, if the field is completely missing (as was the case here), this is a violation of the TIFF6 specification.

"tag 296 (ResolutionUnit) should have value , but has value (values or count) with incorrect type: unknown type (-1)"

This error results from the previous one: if a field doesn't exist, then checkit_tiff will assume data type "-1", knowing that this is no valid type in the real world.

"tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=0"

The TIFF6 specification states that: "If PageNumber[1] is 0, the total number of pages in the document is not available.". We don't allow this in our repository by local policy.

"tag 306 (DateTime) should have value ^[12][901][0-9][0-9]:[01][0-9]:[0-3][0-9] [012][0-9]:[0-5][0-9]:[0-6][0-9]$, but has value (values or count) XXXXX"

That's one of the most common errors. It's utterly unbelievable how many software manufacturers don't manage to comply with the very clear rules of how the DateTime string in a TIFF needs to be formatted. This is a violation of the TIFF6 specification.

"tag 306 (DateTime) should have value should be "yyyy:MM:DD hh:mm:ss", but has value (values or count) of datetime was XXXXX"

Same here.
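To make the DateTime rule concrete, here is a small Python sketch that applies the pattern from the error message and additionally checks that the value is a real calendar date. The list of alternative input formats used for normalisation is purely hypothetical and does not reflect the formats we actually found or the way fixit_tiff repairs them.

```python
import re
from datetime import datetime
from typing import Optional

TIFF_DATETIME_FORMAT = "%Y:%m:%d %H:%M:%S"  # "yyyy:MM:DD hh:mm:ss"

# Pattern as quoted in the error message above (anchors replaced by fullmatch).
POLICY_PATTERN = re.compile(
    r"[12][901][0-9][0-9]:[01][0-9]:[0-3][0-9] [012][0-9]:[0-5][0-9]:[0-6][0-9]"
)

# Hypothetical examples of non-compliant formats a writer might produce.
ALTERNATIVE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S", "%d.%m.%Y %H:%M:%S")


def is_valid_tiff_datetime(value: str) -> bool:
    """Pattern check plus a calendar check (rejects e.g. month 13)."""
    if POLICY_PATTERN.fullmatch(value) is None:
        return False
    try:
        datetime.strptime(value, TIFF_DATETIME_FORMAT)
    except ValueError:
        return False
    return True


def normalise_datetime(value: str) -> Optional[str]:
    """Return the value in TIFF notation, or None if it cannot be parsed."""
    for fmt in (TIFF_DATETIME_FORMAT,) + ALTERNATIVE_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        return parsed.strftime(TIFF_DATETIME_FORMAT)
    return None


print(is_valid_tiff_datetime("2017:01:20 10:15:00"))
print(is_valid_tiff_datetime("2017-01-20 10:15:00"))
print(normalise_datetime("20.01.2017 10:15:00"))
```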

"tag 306 (DateTime) should have value word-aligned, but has value (values or count) pointing to 0x00000167 and is not word-aligned"

The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.

"tag 315 (Artist) is found, but is not whitelisted"

The tag Artist may contain personal data and is forbidden by local policy.

"tag 317 (Predictor) is found, but is not whitelisted"

The tag Predictor is needed for encoding schemes that are not part of the Baseline TIFF6 specification, so we forbid it by local policy.

"tag 320 (Colormap) is found, but is not whitelisted"

TIFFs with this error message contain a color map instead of being encoded as bilevel/greyscale/RGB images. This is something that is forbidden by policy, hence we need to correct it.

"tag 339 (SampleFormat) is found, but is not whitelisted"

This tag is forbidden by local policy.

"tag 33432 (Copyright) should have value ^[[:print:]]*$, but has value (values or count)"

The Copyright tag is only allowed to have character values from the 7-Bit ASCII range. TIFFs that violate this rule from the TIFF6 specification will throw this error.

"tag 33434 (EXIF ExposureTime) is found, but is not whitelisted"

EXIF tags may never be referenced out of IFD0, but always out of their own ExifIFD. As this probably hasn't happened here, this needs to be seen as a violation of the TIFF6 specification.

"tag 33437 (EXIF FNumber) is found, but is not whitelisted"

Same here.

"tag 33723 (RichTIFFIPTC / NAA) is found, but is not whitelisted"

This tag is not allowed by local policy.

"tag 34665 (EXIFIFDOffset) should have value , but has value"

In all cases that we encountered, the tag EXIFIFDOffset was set to the wrong type. Instead of being of type 4, it was of type 13, which violates the TIFF specification.

"tag 34377 (Photoshop Image Ressources) is found, but is not whitelisted"

This is a juicy one. This error message indicates that something's wrong with the embedded ICC profile. In fact, the TIFF itself might be completely intact, but the ICC profile has the value of the cmmtype field set to a value that is not part of the controlled vocabulary for this field, so the ICC standard is violated.

"tag 34852 (EXIF SpectralSensitivity) is found, but is not whitelisted"

EXIF tags may never be referenced out of IFD0, but always out of their own ExifIFD.

"tag 34858 (TimeZoneOffset (TIFF/EP)) is found, but is not whitelisted"

TIFF/EP tags are not allowed in plain TIFF6 images.

"tag 36867 (EXIF DateTimeOriginal) is found, but is not whitelisted"

EXIF tags may never be referenced out of IFD0, but always out of their own ExifIFD.

"tag 37395 (ImageHistory (TIFF/EP)) is found, but is not whitelisted"

Same here.

Some of the errors, however, could not be corrected by means of an automatic workflow. These images will have to be rescanned from their respective originals:

The XResolution tag contains a value for the image's horizontal resolution that is too low to comply with the policy. In this special case, that policy is not our own, but the one stated in the German Research Foundation's (Deutsche Forschungsgemeinschaft, DFG) "Practical Guidelines for Digitisation" (DFG-Praxisregeln "Digitalisierung", document in German, http://www.dfg.de/formulare/12_151/12_151_de.pdf), where a minimum of 300 dpi is required for digital documents that were scanned from an analog master and are intended for close examination. 1.717 files contained this error.

"tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=2"

This error message indicates that the TIFF has more than one page (in this case two master images), which is forbidden by our internal policy. Five images contained this error.

"tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=3"

Same here. One image contained this error.

"tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=5"

Same here. One image contained this error.

"TIFF Header read error3: Success"

This TIFF was actually broken, had a file size of only 8 Bytes and was already defective when it was ingested into the repository. One image contained this error.

Based on our experience, Andreas created eight new commits for fixit_tiff (commits f51f71d to cf9b824) that made fixit_tiff more capable and more independent of libtiff, which contained quite a few bugs and sometimes even introduced problems into corrected TIFFs that hadn't existed before. He also improved checkit_tiff to vastly increase performance (3-4 orders of magnitude) and helped build correction recipes.

The results are quite stunning and saved us a lot of work:

Only 1.725 out of 107.218 TIFF files have not been corrected and will have to be rescanned. That's about 1,6% of all files. All other files were either correct from the beginning or have successfully been corrected.

26 out of 3.103 SIPs still have incorrect master images in them, which is a ratio of 0,8%.

11 new correction recipes have been created to fix a total of 41 errors (as listed above).

The validation of a subset of 6.987 files took us just 37m:46s (= 2.266 seconds) on the latest checkit_tiff version, which is a rate of about 3,1 files/sec. At this speed, checking all 107.218 files would theoretically take approximately 9,7 hours. However, this version wasn't available throughout the correction work, so the speed was drastically lower in the beginning. We think that 24 - 36 hours would be a more accurate estimate.

UPDATE: After further improvements in checkit_tiff (commit 22ced80), checking 87.873 TIFFs took only 51m 53s, which is 28,2 TIFFs per second (yes, that's 28,2 Hz!), marking a ninefold improvement over the previous version for this commit alone. With this new version, we can validate TIFFs at a stable speed, independent of their actual file size, meaning that we can have TIFF validation practically for free (compared to the effort for things like MD5 calculation).

10.774 out of 107.218 TIFF files were valid from the start, which is pretty exactly 10%.
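The ratios and rates quoted in this list can be re-derived from the raw counts. The following Python snippet is only a sanity check on the arithmetic, using the figures exactly as stated above:

```python
TOTAL_TIFFS = 107_218


def percent(part: int, whole: int) -> float:
    return 100.0 * part / whole


def files_per_second(files: int, minutes: int, seconds: int) -> float:
    return files / (minutes * 60 + seconds)


uncorrected_pct = percent(1_725, TOTAL_TIFFS)      # files to be rescanned
valid_pct = percent(10_774, TOTAL_TIFFS)           # valid from the start
sip_pct = percent(26, 3_103)                       # SIPs with incorrect masters

old_rate = files_per_second(6_987, 37, 46)         # before commit 22ced80
new_rate = files_per_second(87_873, 51, 53)        # after commit 22ced80
full_run_hours = TOTAL_TIFFS / old_rate / 3600     # whole corpus at the old rate
speedup = new_rate / old_rate

print(f"uncorrected: {uncorrected_pct:.1f} %, valid from start: {valid_pct:.1f} %")
print(f"SIPs with incorrect masters: {sip_pct:.1f} %")
print(f"old rate: {old_rate:.1f} files/s, full run: {full_run_hours:.1f} h")
print(f"new rate: {new_rate:.1f} files/s, speedup: {speedup:.1f}x")
```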

The pie chart shows our top ten errors as extracted from all validation runs. The tag IDs are color coded.

This logarithmically scaled graph shows an assembly of all tags that had any errors, regardless of their nature. The X-axis is labelled with the TIFF tag IDs, and the data itself is labeled with the number of error messages for their respective tag IDs.

Up until now, we've invested 26 person days in this matter (not counting script run times, of course); however, we haven't finished it yet. Some steps are still missing before the SIPs can actually be transferred to the permanent storage. First of all, we will revalidate all of the corrected TIFFs to make sure that we haven't made any mistakes while moving corrected data out of the way and replacing it with yet-to-correct data. When this step has completed successfully, we'll reject all of the SIPs from the Technical Analyst's workbench in the repository and re-ingest the SIPs. We hope that there won't be any errors now, but we assume that some will come up and brace for the worst. Also, we'll invest some time to generate some statistics. We hope that this will enable us to make qualified estimates for the costs of repairing TIFF images, for the number of images that are affected by a certain type of error and for the overall quality of our production.

A little hint for those of you that want to try this at home: make sure you run the latest checkit_tiff compliance checker with the "-m" option set to enable memory-mapped operation and get drastically increased performance, especially during batch operation.

For the purpose of analysing TIFF files, checkit_tiff comes with a handy "-c" switch that enables colored output, so you can easily spot any errors on the text output.

I want to use the end of this article to say a few words of warning. On the one hand, we have shown that we are capable of successfully repairing large amounts of invalid or non-compliant files in an automatic fashion. On the other hand, however, this sets a dangerous precedent for all the people who don't want to make the effort to increase quality as early as possible during production, because they find it easier to make others fix their sloppy quality. Please, dear digital preservation community, always demand only the highest quality from your producers. It's nothing less than your job, and it's for their best.