Does anyone have an algorithm for working out whether a given block of data is likely to be picture data?

I'm not planning to write this on a BBC or under RISC OS (it will actually run on Windows), but I am looking for BBC screen/sprite data and want to tell it apart from code or other data. The data won't necessarily be the full 20K of, say, a MODE 2 screen, as it could be only part of the screen. It sounds easy, but I just can't get my head around how to do it programmatically.

A variant of that might be to count the percentage of pairs of adjacent bytes which differ in 3 bits or fewer (or some other threshold arrived at by experimentation). I'd expect image data to have a far higher percentage of such pairs than machine code, although other types of data might also come out as false positives.
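A rough sketch of that bit-difference count in Python, in case it helps. The function name and the default threshold of 3 are just placeholders of mine, and any cut-off for "probably image data" would still have to come from experimenting on real files:

```python
def close_pair_fraction(data: bytes, max_bits: int = 3) -> float:
    """Fraction of adjacent byte pairs that differ in at most max_bits bits."""
    if len(data) < 2:
        return 0.0  # no pairs to compare
    close = sum(
        1
        for a, b in zip(data, data[1:])
        # XOR leaves a 1 in every bit position where the two bytes differ
        if bin(a ^ b).count("1") <= max_bits
    )
    return close / (len(data) - 1)
```

You would run this over each candidate block (or a sliding window) and compare the result against figures measured on known screen dumps and known code.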

What I can imagine the statistics people doing is to analyse 6502 instructions and build up some kind of probability model (a Markov chain, maybe). Then, running through a sequence of bytes, things that don't match the predicted successor would probably suggest data rather than instructions.

This is all hand-waving, of course, and you'd need to be careful with the operands for instructions, so perhaps any probable instruction would cause following bytes to be considered operands (according to that instruction's requirements), and the next item in the sequence would be obtained from the next instruction location rather than the next byte.
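To make the hand-waving slightly more concrete, here is a minimal Python sketch of the idea: walk the bytes as if they were 6502 code, skip over operands, and score the opcode-to-opcode transitions against counts gathered from known code. Everything here is an assumption for illustration. The length table is deliberately only a small subset of common opcodes (unknown bytes are treated as one-byte instructions), the function names are made up, and the add-one smoothing is just the simplest choice.

```python
import math
from collections import Counter, defaultdict

# Partial opcode -> instruction length table (opcode plus operand bytes).
# Illustrative subset only; a real tool would need the full 6502 table.
INSTRUCTION_LENGTH = {
    0xA9: 2, 0xA2: 2, 0xA0: 2, 0xC9: 2, 0x69: 2,   # LDA/LDX/LDY/CMP/ADC #imm
    0xA5: 2, 0x85: 2,                               # LDA zp, STA zp
    0xAD: 3, 0x8D: 3, 0xBD: 3, 0x9D: 3,             # LDA/STA abs and abs,X
    0x20: 3, 0x4C: 3,                               # JSR abs, JMP abs
    0xD0: 2, 0xF0: 2, 0x90: 2, 0xB0: 2,             # BNE, BEQ, BCC, BCS
    0x60: 1, 0x40: 1, 0xEA: 1, 0x18: 1, 0x38: 1,    # RTS, RTI, NOP, CLC, SEC
    0xE8: 1, 0xCA: 1, 0xC8: 1, 0x88: 1,             # INX, DEX, INY, DEY
    0x48: 1, 0x68: 1,                               # PHA, PLA
}

def opcode_sequence(data: bytes) -> list:
    """Walk data as if it were code, returning opcodes and skipping operands."""
    ops = []
    i = 0
    while i < len(data):
        op = data[i]
        ops.append(op)
        # Assumption: anything not in the table is a one-byte instruction.
        i += INSTRUCTION_LENGTH.get(op, 1)
    return ops

def train_transitions(samples):
    """Count opcode -> next-opcode transitions over known code samples."""
    counts = defaultdict(Counter)
    for sample in samples:
        ops = opcode_sequence(sample)
        for prev, nxt in zip(ops, ops[1:]):
            counts[prev][nxt] += 1
    return counts

def mean_log_likelihood(data: bytes, counts) -> float:
    """Average log-probability of each transition under the trained counts."""
    ops = opcode_sequence(data)
    pairs = list(zip(ops, ops[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for prev, nxt in pairs:
        row = counts.get(prev, Counter())
        # Add-one smoothing over the 256 possible byte values.
        p = (row[nxt] + 1) / (sum(row.values()) + 256)
        total += math.log(p)
    return total / len(pairs)
```

A block that scores much lower than known code does is, under this model, more likely to be data than instructions. How much lower counts as "much" is something you'd have to find by trial on real files.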

Somewhat related to this is part-of-speech tagging which is used in natural language processing to classify each word in a natural language text. I'm not claiming that such taggers are applicable here, but you can get a feel for the kind of thing I was suggesting by reading up a bit on that topic.