You may have read “scanner insecurity” stories before, and they’ve probably dealt with conventional security problems.

These were probably things such as incompletely deleted data left behind on the hard disks of decommissioned scanners; poor security configuration in network-enabled scanners, such as default passwords; and exploitable vulnerabilities in image-handling code built in to scanners.

This problem is quite different.

(For all we know, this flaw may be present in other vendors’ scanners, too, as it is a consequence of an algorithm chosen for compression. Xerox comes into it simply because that is the brand of scanner in the story.)

A German computer scientist, David Kriesel, was perusing the rooms depicted on some building plans he had scanned on a Xerox WorkCenter scanner.

Normally, when you notice scanning errors, it’s because the quality is poor and the details illegible.

A room that is 15m2 on the original might look like 1▶m2 on the scanned copy, with the 5 scanned so badly it doesn’t even look like a digit.

Or the 15 might be blurred, or have enough stray pixels in it to look like an indecisive 16.

What you don’t expect is that a crisply printed 21m2 on the original would be rendered as a crisply scanned 14m2, say, on the copy.

In other words, given the analog-to-digital nature of the scanning process, you’d expect imperfections, but you’d also expect the unreliable parts to look unreliable, thus making their unreliability self-documenting.

It turns out that the Xerox scanner in question was using a compression scheme called JBIG2, which emerged from the grandly named Joint Bi-level Image Experts Group.

Bi-level images, as the name suggests, have just one bit per pixel, such as the images used in fax machines (if you remember them).

And JBIG2 has a clever yet, with hindsight, very reckless feature: if two “swatches” of the image look enough like each other, the same data is used for both swatches, so that they effectively become identical.
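To see why that is risky, here is a toy sketch of this kind of “soft pattern matching” (a hypothetical illustration of the idea, not Xerox’s or JBIG2’s actual code): swatches within some pixel-difference threshold of a stored glyph are simply replaced by that glyph.

```python
# Hypothetical sketch of lossy glyph reuse (NOT real JBIG2 code):
# every swatch that is "close enough" to a stored glyph decodes
# to that glyph verbatim -- crisply, and possibly wrongly.

def hamming(a, b):
    """Number of differing pixels between two equal-sized 1-bit swatches."""
    return sum(x != y for x, y in zip(a, b))

def compress_swatches(swatches, threshold):
    """Return (glyph_dictionary, glyph_index_per_swatch)."""
    glyphs, indices = [], []
    for s in swatches:
        for i, g in enumerate(glyphs):
            if hamming(s, g) <= threshold:   # "close enough" -- the risky step
                indices.append(i)            # reuse the stored glyph as-is
                break
        else:
            glyphs.append(s)                 # genuinely new glyph: store it
            indices.append(len(glyphs) - 1)
    return glyphs, indices

# Two slightly different 4x5 patches -- think of a smudged 6 next to an 8:
six   = (0,1,1,0, 1,0,0,0, 1,1,1,0, 1,0,1,0, 0,1,1,0)
eight = (0,1,1,0, 1,0,0,1, 0,1,1,0, 1,0,0,1, 0,1,1,0)

glyphs, idx = compress_swatches([six, eight], threshold=6)
print(len(glyphs), idx)   # -> 1 [0, 0]: both digits decode to the SAME glyph
```

With a tight threshold the two patches would be stored separately; with a generous one, both decode to a single crisp glyph, which is exactly how a clearly printed digit can silently turn into a different clearly printed digit.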

This technique works perfectly in lossless compression, e.g. the deflate algorithm used in ZIP files, where the repeat of a string of characters such as NOTEWORTHY would be encoded as “repeat the ten characters I saw 164 bytes ago”, not as another NOTEWORTHY.

But if imperfect matches were allowed, you might find NOTEWORTHY encoded as a repeat of NOT WORTHY, introducing an error that would be very hard to spot, despite the fact that the two phrases mean close to the opposite of each other.
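The contrast can be sketched in a few lines of Python (a toy illustration, not the real deflate or JBIG2 code): exact-match reuse is safe, while fuzzy-match reuse silently substitutes the earlier data.

```python
# Toy illustration (not real deflate/JBIG2): reusing earlier data is safe
# only when the match is EXACT.

def find_reuse(history, phrase, exact=True, max_mismatches=0):
    """Look for an earlier run we could reuse instead of storing `phrase`."""
    n = len(phrase)
    for start in range(len(history) - n + 1):
        candidate = history[start:start + n]
        mismatches = sum(a != b for a, b in zip(candidate, phrase))
        if mismatches == 0 or (not exact and mismatches <= max_mismatches):
            return candidate   # the data the decoder will actually emit
    return None                # no reusable run: store the phrase literally

history = "this copy is NOT WORTHY of the archive. "

# Lossless, deflate-style: no exact match, so NOTEWORTHY is stored as-is.
print(find_reuse(history, "NOTEWORTHY", exact=True))                      # None

# Lossy fuzzy matching: one character off is "close enough", so the decoder
# will emit the earlier run instead -- NOT WORTHY.
print(find_reuse(history, "NOTEWORTHY", exact=False, max_mismatches=1))
```

In the lossless case the encoder falls back to storing the literal text; in the fuzzy case the output is a perfectly crisp, perfectly wrong substitution.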

The “fix”, for our German computer scientist, seems to have been to use TIFF compression instead, a lossless image compression option supported by the scanner he was using.

Update: Xerox emailed us at 2013-08-10T11:15Z to point us at some official advice on the issue. In summary: JBIG2 compression isn’t on by default. If you’re worried someone might have changed the compression settings, a reset to factory defaults will change them back. Also, Xerox will be producing an optional patch that will prevent JBIG2 being turned on at all. (If you aren’t sending faxes, you probably don’t need it.)

The lesson to be learned here, other than that Graham has an excellent eye for interestingly quirky stories, is that algorithm choices are really important.

Imagine this sort of image substitution in a CCTV system that just recorded a crime.

Instead of a blurry and obviously inconclusive image of the perpetrator, which would make it obvious that evidence would have to be sought elsewhere, you might end up with a clear and convincing image of someone who just happened to look like the perpetrator.

Where security is concerned, it’s not just how safely you store what you’ve collected, it’s how reliably you collect it in the first place.

In a seminal paper from the early 1990s entitled “Why cryptosystems fail,” Cambridge cryptographer Ross Anderson argued convincingly that many, if not most, security problems in his field were due to implementation errors, not to underlying flaws in the cryptography used.

I’d say that an algorithm for compressing documents might reasonably be expected to maintain as much accuracy as possible, given the exigencies of the hardware and the degree of compression desired.

So choosing an algorithm that deliberately makes two sections of an image precisely the same when the only thing known for sure is that they are different, to the point that almost undetectable and dangerous errors may be introduced, smells to me like a security flaw 🙂