Hash function

A hash function is a function which converts an input from a large domain into an output in a smaller range (often a subset of the integers). Hash functions vary in the domain of their inputs and the range of their outputs and in how patterns and similarities of input data affect output data. Hash functions are used in hash tables, cryptography and data processing.

Formally speaking, a hash function is defined by its domain (typically variable length strings of bytes), its range (typically fixed length bit-sequences) and the defining function (let's call this function H). The key desired characteristics of a hash function in general are that H(x) ≠ H(y) implies x ≠ y and that H(x) = H(y) probably implies that x = y. Unfortunately, the use of probably here is just too inexact to let us succeed. If the set of all possible values of H(x) is much smaller than the set of all possible values of x, then this latter condition cannot be true if all sequences are equally likely. Thus, the second condition is often restated in different ways to make it plausible. For instance, cryptographic hashes use the second condition in the form that given x, it is computationally very difficult to find y such that H(x) = H(y). Additionally, it is desirable that if we are given x and H(x + s) where + can be bit changing or concatenation, then we can't find s short of exhaustive enumeration. In practice, for most applications other than error correction, cryptographic hashes serve very nicely. The MD5 and SHA algorithms are two of the most popular algorithms although any cryptosystem can be used to create a hash function (as, indeed, any cryptographically secure hash can be used to create a cryptosystem).

For error correction, a distribution of likely perturbations is assumed at least approximately. Perturbations to a string are then classified into large (improbable) and small (probable) errors. The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x efficiently if s is small. Such hash functions are known as error correction codes. Important sub-class of these correction codes are cyclic redundancy codes and Reed-Solomon codes.

For audio identification such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all identical items. Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences. Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room.