Many DBAs know that their job is no longer simply a matter of delivering the right answers in the shortest amount of time. If the databases hold sensitive information like Social Security numbers, prescription records, medical files, or banking records, new laws can make it a crime to deliver the right answer to the wrong person. Controlling access is often a bigger and thornier problem than managing the data alone if only because the decisions are more often political than technical. Any one can run a benchmark and clock the speed of a database, but no one can numerically measure the right of some employee to see some records.

The basic access control mechanisms in most databases are quite good, but they have limitations. No product can block traitorous insiders or sophisticated hackers who slip in through the operating system. When the data is sensitive enough, many DBAs are looking for better solutions.

One good tool is the <i> one-way function </i>, a mathematical mechanism that is easy to compute but close to impossible to invert. The best functions like MD5 (message digest 5) or the SHA (Secure Hash Algorithm), can reduce any array of bytes into a scrambled mixture of seemingly random pile of bits in such a way that it’s practically impossible to reconstruct the original input.

Some one-way functions are often called <i> cryptographically secure hash functions </i> because they are quite similar to the hash functions often used in some data structures. Functions like SHA produce a 160-bit value that can act for a quick surrogate for a large block of data. Indexing on the hash of a complex record can be more efficient than working with the data itself.

Cryptographically secure hash functions also offer privacy and security. If it’s virtually impossible to invert one of the 160-bit entries and determine the original result, the hash results can be used as replacements for the data itself. Storing the hash value can often solve the same queries as the original data–without keeping the original sensitive data around.

Password databases are the most familiar use of this technique. Instead of storing the password itself where it could be read by hackers or insiders, the database stores the hash of the password, SHA(password). For most good hash functions, there’s no publicly known way to look at this hash and find the original password without mounting a brute force search. Yet anyone with the correct password can search for a match because it doesn’t take long to compute the function.

This technique is also useful whenever a pseudonym must be created for someone. Medical records, for instance, are often stripped of their name to protect patient privacy. In the past, some researchers assigned a pseudonym to each patient and then guarded a table linking a person’s real name to a pseudonym. Hash functions can remove the need for this table by replacing the name with the hash of the name.

This solution avoids the cost and responsibility of keeping such a table without destroying the ability to add new information to a record. If new data about someone arrives, the right pseudonym can be found by hashing their name. The database is still functional, but casual browsers can’t abuse it. There’s no easy way to find a record of some patient, and then work backwards to find their name.

There are many other problems that can be solved using one-way functions. One of my favorite projects was building a web site to help parents check the schedules of their baby sitters to find the sitter who would be free. The solution helps both parties without keeping personal information about baby sitters or parents. It’s not possible for a stalker or a thief to track an individual sitter or identify an adult-free house.

Peter Wayner is the author of Translucent Databases, a book about how to protect sensitive information. It includes many techniques, like the one above, and more than two dozen examples, with complete Java and SQL code.