A Checksum Algorithm

A checksum is a computed value that allows you to check the validity of something. Typically, checksums are used in data transmission contexts to detect whether the data has been transmitted successfully.

Introduction

Checksums take on various forms, depending upon the nature of the transmission and the needed reliability. For example, the simplest checksum is to sum up all the bytes of a transmission, computing the sum in an 8-bit counter. This value is appended as the last byte of the transmission. The idea is that upon receipt of n bytes, you sum up the first n-1 bytes and see if the answer is the same as the last byte. Since this is a bit awkward, a variant on this theme is to sum up all the bytes on transmission, then (treating the sum as a signed, 8-bit value) negate the checksum byte before transmitting it. This means that the sum of all n bytes should be 0.

These techniques are not terribly reliable. For example, if the packet is known to be 64 bytes in length, and you receive 64 '\0' bytes, the sum is 0, so the packet appears to be correct. Of course, if there is a hardware failure that simply fails to transmit the data bytes (particularly easy on synchronous transmission, where no "start bit" is involved), then the fact that you receive a packet of 64 0 bytes with a checksum result of 0 is misleading; you think you've received a valid packet and you've received nothing at all.

A solution to this is to negate the computed checksum value, subtract 1 from it, and expect that the receiver's sum over all n bytes is 0xFF (-1, as a signed 8-bit value). This means that the 0-lossage problem goes away.
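The two variants described above can be sketched in a few lines. This is only an illustration; the function names (sum8, appendNegatedSum, appendNegatedSumMinusOne, looksValidZero, looksValidFF) are made up for this example and the packet is assumed to be a plain vector of bytes.

```cpp
#include <cstdint>
#include <vector>

// Sum all bytes in an 8-bit counter; overflow simply wraps (mod 256).
std::uint8_t sum8(const std::vector<std::uint8_t>& bytes) {
    std::uint8_t sum = 0;
    for (std::uint8_t b : bytes) {
        sum = static_cast<std::uint8_t>(sum + b);
    }
    return sum;
}

// Variant 1: append the negated sum, so a good packet sums to 0.
std::vector<std::uint8_t> appendNegatedSum(std::vector<std::uint8_t> data) {
    data.push_back(static_cast<std::uint8_t>(0u - sum8(data)));
    return data;
}

// Variant 2: append (negated sum - 1), so a good packet sums to 0xFF.
std::vector<std::uint8_t> appendNegatedSumMinusOne(std::vector<std::uint8_t> data) {
    data.push_back(static_cast<std::uint8_t>(0u - sum8(data) - 1u));
    return data;
}

// Receiver-side checks: sum every received byte, checksum byte included.
bool looksValidZero(const std::vector<std::uint8_t>& packet) {
    return sum8(packet) == 0x00;
}

bool looksValidFF(const std::vector<std::uint8_t>& packet) {
    return sum8(packet) == 0xFF;
}
```

A packet of 64 zero bytes passes looksValidZero but fails looksValidFF, which is exactly the dead-line case the second variant is meant to catch.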

Nonetheless, for all its simplicity, the checksum technique just described is remarkably weak. For example, if you were to transpose two of the characters of the transmission, the sum would be the same, so although the wrong packet is received, the checksum says it is correct. Certain kinds of noise injection on the line can also introduce undetectable errors because the noise that mangles one byte is cancelled by the noise that mangles another byte.

People who care deeply about this have developed a number of much more reliable algorithms. For example, the Cyclic Redundancy Check algorithms, CRC-8, CRC-16, and CRC-32, do fairly complex things to make the checksum sensitive to such problems. For example, using CRC, swapping two bytes in the message will generate a different checksum because the value computed depends not only on the character value, but also on the position in the message in which the byte occurred.
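To make the contrast concrete, here is a small sketch that compares the 8-bit sum with a bit-at-a-time CRC-32. The CRC uses the common reflected polynomial 0xEDB88320 (the one used by Ethernet and zlib); the function names sum8 and crc32Bitwise are invented for this example, and a table-driven version would be the usual choice where speed matters.

```cpp
#include <cstdint>
#include <string>

// The simple additive checksum: order of the bytes does not matter.
std::uint8_t sum8(const std::string& s) {
    std::uint8_t sum = 0;
    for (unsigned char c : s) {
        sum = static_cast<std::uint8_t>(sum + c);
    }
    return sum;
}

// Bitwise CRC-32: initial value 0xFFFFFFFF, reflected polynomial
// 0xEDB88320, final XOR with 0xFFFFFFFF.
std::uint32_t crc32Bitwise(const std::string& s) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char c : s) {
        crc ^= c;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift one bit out; if it was set, fold the polynomial in.
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

With these, sum8("AB") equals sum8("BA"), while crc32Bitwise("AB") and crc32Bitwise("BA") differ, because each byte's contribution to the CRC depends on how many shifts follow it.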

Disk drives often use techniques derived from Hamming Codes (named after Richard Hamming, an AT&T/Bell Laboratories researcher who is probably best known for the techniques he developed for correcting single-bit parity errors in memory, although that is but one of the many applications of his wide-ranging work in mathematical characterizations of data in computer systems). Possibly the best-known of these techniques is the set of codes called Fire Codes, named after the inventor, whose surname is Fire (I can't find any citations to him handily, so I can't give you more information than this). These can do things like reconstruct a sequence of bytes (sometimes as many as 20 bytes, on typical disks) that are lost due to burst noise, typically a bad spot on the disk. These are very powerful data recovery codes. There are some detailed tutorials out there for those of you who wish to pursue this more deeply.

Checksums have many other applications. For example, one feature I find very annoying in many programs is the notion of "change". If I change a value in a dialog, I often get a notification that I have "changed" something, and a save/update/etc. is required. But mostly this is done by detecting whether the user (that is, I) has typed anything at all into a control, changed a selection on a ComboBox, etc. Typically it is a simple Boolean value, maintained by responding to OnChange, OnSelendOK, and similar messages. Of course, if I haven't actually changed the information, or worse still, if I change it back, I get the same warning. I consider such systems primitive beyond recovery, and build systems which are far more user-friendly.

I do this by keeping the information available in some form, most commonly a class. I then compute a checksum on the values when I come in (in OnInitDialog), and recompute it each time there is a change. If the new checksum is the same as the old checksum, I assume that there are no changes. I can then indicate in various ways what is required. In other cases, I'll do this in the CDocument-derived class; whenever a change is effected by the GUI, I recompute the checksum and set the Modified flag according to how the checksum compares to the checksum computed when the document was first created/loaded/whatever, rather than assume any change whatsoever is implicitly a change in content. Thus if the user hits a key in a CEditView then hits the backspace key, I'll end up indicating "no change".

Like most checksum techniques, this decreases in reliability as the number of bytes checksummed increases. This is because the more information you try to pack into a 32-bit value, using an information-losing transformation, the more likely the case where two completely different sequences of values will produce the same 32-bit value. This is one consideration as to why network packets are not sent as megabyte packets; errors in a megabyte transmission might result in the same checksum as the error-free transmission, while for short packet sizes (e.g., 4K, or 1500 bytes) the chances are so low as to not be of any concern in practical networking.

Therefore, my techniques generally are useful when a few thousand bytes of state are involved, such as in a dialog.

I use a technique that has no particular theoretical justification. But I've found it to be reliable for my purposes. The story is that I wanted to use CRC-32 some years ago, but couldn't locate the source code for a CRC-32 algorithm on the Web at that time, so I turned to my Adobe Type 1 Font Handbook and cribbed their encryption algorithm. But rather than encrypt the data, I just used the basic algorithm to create a 32-bit checksum. You can replace my basic algorithm with a 32-bit CRC if you want to. Here's my code, and some commentary on how you might use it.
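A minimal sketch of such a class, written in portable C++ rather than with the MFC types, might look like the following. It is an illustration and not the original listing: the constants 55665, 52845, and 22719 are the ones the Type 1 format uses for its eexec encryption, while the choice of overloads, the use of std::string in place of CString, and the byte order used for 32-bit values are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

class checksum {
public:
    checksum() { clear(); }

    // Reset to the starting state.
    void clear() { sum = 0; r = 55665; c1 = 52845; c2 = 22719; }

    // Core step: "encrypt" one byte, advance the key r, and accumulate
    // the cipher byte into the 32-bit sum.
    void add(std::uint8_t value) {
        const std::uint8_t cipher =
            static_cast<std::uint8_t>(value ^ (r >> 8));
        // Computed in unsigned 32-bit arithmetic and then truncated to
        // 16 bits, which is the "mod 65536" of the Adobe formula.
        r = static_cast<std::uint16_t>(
            (static_cast<std::uint32_t>(cipher) + r) * c1 + c2);
        sum += cipher;
    }

    // Overloads for other data types; each reduces to the byte version.
    void add(const std::uint8_t* bytes, std::size_t length) {
        for (std::size_t i = 0; i < length; ++i) add(bytes[i]);
    }
    void add(std::uint32_t value) {  // low-order byte first (an assumption)
        for (int shift = 0; shift < 32; shift += 8)
            add(static_cast<std::uint8_t>(value >> shift));
    }
    void add(bool flag) { add(static_cast<std::uint8_t>(flag ? 1 : 0)); }
    void add(const std::string& s) {  // the characters, not the address
        for (unsigned char c : s) add(static_cast<std::uint8_t>(c));
    }
    // Without this overload a string literal would convert to bool.
    void add(const char* s) { add(std::string(s)); }

    std::uint32_t get() const { return sum; }

private:
    std::uint16_t r, c1, c2;
    std::uint32_t sum;
};
```

Because the key r evolves with every byte, the same byte contributes a different value depending on what preceded it, so adding "AB" and adding "BA" give different results.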

Don't worry about those "magic constants" you see r, c1, and c2 being initialized to. They are part of the encryption algorithm, and although I'm sure there is some mystical reason they are set to the particular values shown, many other values (perhaps any other values, except possibly 0 or 1) would suffice.

Note the use of overloading to handle various data types. To use the checksum algorithm, create a variable of type checksum, for example, in your desired data structure. You can then call the various add methods to add in the values you wish to checksum. It is rarely the case that doing a checksum of the raw bytes of the structure yields anything useful. For example, some bytes might represent transient computations that do not affect the actual values of importance, and checksumming the bytes would mean that you checksum the pointers to strings, not the contents of the strings, which leads to the situation where two strings that are otherwise identical would produce different checksums because they were at different addresses. So at some point you determine the structure contains all the values you care about (for example, just after you've read the document, or initialized all the values in OnInitDialog), and you apply the following operations to your checksum variable (which I'll call original). It is convenient to package up the checksum algorithm as shown.

Now, at some point, you determine that you have changed something; for example, the user has clicked a checkbox. You capture the Boolean value of the checkbox to the flag variable, then call doChecksum with a new checksum variable, for example, by having the button-clicked, edit-changed, etc. handler call a function computeModification.

Note that if the checkbox was originally unchecked, and the user checks it, we get an indication (at least in the Modified flag) that the document has changed. If the user then unchecks it, and all other values are unchanged, we get an indication that the content has not been modified. This means the checksum has to account for all forms of change, and that's your responsibility!
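The whole scheme can be sketched without any MFC at all. In the sketch below the checksum class is a trimmed-down stand-in so that the example compiles on its own, and OptionsState, captureOriginal, and the members name, scratch, and modified are hypothetical names chosen for illustration; in a real dialog or CDocument-derived class the last line of computeModification would more likely be a call such as SetModifiedFlag.

```cpp
#include <cstdint>
#include <string>

// Trimmed-down stand-in for the checksum class (same basic algorithm).
class checksum {
public:
    checksum() { clear(); }
    void clear() { sum = 0; r = 55665; }
    void add(std::uint8_t value) {
        const std::uint8_t cipher =
            static_cast<std::uint8_t>(value ^ (r >> 8));
        r = static_cast<std::uint16_t>(
            (static_cast<std::uint32_t>(cipher) + r) * 52845u + 22719u);
        sum += cipher;
    }
    void add(bool flag) { add(static_cast<std::uint8_t>(flag ? 1 : 0)); }
    void add(const std::string& s) {
        for (unsigned char c : s) add(static_cast<std::uint8_t>(c));
    }
    std::uint32_t get() const { return sum; }
private:
    std::uint16_t r;
    std::uint32_t sum;
};

// Hypothetical state behind a dialog or document.
struct OptionsState {
    std::string name;       // contents are checksummed, not the address
    bool flag = false;      // e.g. the value captured from a checkbox
    int scratch = 0;        // transient value, deliberately not checksummed
    checksum original;      // checksum taken when the data was loaded
    bool modified = false;

    // Package up "which values matter" in exactly one place.
    void doChecksum(checksum& c) const {
        c.clear();
        c.add(name);
        c.add(flag);
    }

    // Call once the state is fully loaded (e.g. at the end of OnInitDialog).
    void captureOriginal() {
        doChecksum(original);
        modified = false;
    }

    // Call from every button-clicked, edit-changed, etc. handler.
    void computeModification() {
        checksum current;
        doChecksum(current);
        modified = (current.get() != original.get());
    }
};
```

Setting flag to true and calling computeModification marks the state as modified; setting it back to false and calling it again clears the indication, and changes to scratch never register at all.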

I have glanced at the tutorials found by a Web search for "Hamming code", and these three references seemed among the best. Some of this is deep stuff. There was a lot more, but these seem representative.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

First, let me say I spent a while looking for a simple checksum algorithm that could be re-expressed in the test tool 'WinRunner', which uses a subset of 'C'. Yours fitted the bill, for which I'm grateful, although I've had to write my own XOR function using division by 2.

My specific point is this: step 3, p62 of the referenced Adobe document says "Compute the next value of R by the formula ((C + R) * c1 + c2) mod 65536, ...". Neither the example code nor your implementation includes the 'mod' operation. Is this deliberate? Without it, my code generates out-of-bounds errors.

The mod 65536 is implicit if one uses 16-bit (short) integers. Actually, I may be doing the computation in 32-bit arithmetic, that is, mod 2^32 rather than mod 65536. It isn't all that critical.
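The difference between the explicit and the implicit mod can be sketched as follows. The function names nextKeyExplicit and nextKeyImplicit are invented for this example; the constants are the c1 and c2 values from the Adobe formula quoted above.

```cpp
#include <cstdint>

// Explicit form, as the Adobe formula is written. With cipher <= 255 and
// r <= 65535 the intermediate value stays below 2^32, so a 32-bit unsigned
// type never overflows here.
std::uint32_t nextKeyExplicit(std::uint32_t cipher, std::uint32_t r) {
    return ((cipher + r) * 52845u + 22719u) % 65536u;
}

// Implicit form: do the arithmetic in an unsigned type and let the
// conversion to a 16-bit value discard the high-order bits.
std::uint16_t nextKeyImplicit(std::uint8_t cipher, std::uint16_t r) {
    return static_cast<std::uint16_t>(
        (static_cast<std::uint32_t>(cipher) + r) * 52845u + 22719u);
}
```

Both functions return the same key for the same inputs. In an environment that traps on overflow, the explicit form with a sufficiently wide integer type is the one to use.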

If you are running on a system that does not accept arithmetic overflow, welcome back to the stone age. I know of no checksum algorithm that will work, say, in Pascal, because Pascal works under the delusion that integer overflow is fatal. Integer overflow happens so rarely that checking for it is usually a mistake. Especially when checksums, encryption, and other interesting algorithms are being used.