Protecting Privacy with Translucent Databases

Last week, officials at Yale University complained to the FBI that admissions officers from Princeton University had broken into a Yale Web site and downloaded admission decisions on 11 students who had applied to both schools. Princeton responded by suspending its associate dean of admissions and launching an investigation. That's a good start, but both colleges should go further, and redesign the way that their databases treat personal information.

As details surrounding the incident have emerged, it's clear that there's a lot of blame to go around. Both Yale and Princeton compete vigorously for the nation's top high school students, and in recent years the competition has become increasingly aggressive. The schools shower the best students not just with phone calls and letters, but even with tuition discounts. As part of that competition, this year Yale unveiled a new Web site designed to let applicants find out if they had been admitted -- no more waiting for either that thin rejection letter or the thick admissions packet.

Unfortunately, the security on the Yale Web site was atrocious: all anybody needed to look up a student's record was that student's name, social security number (SSN), and date of birth. And it just so happened that the officials at Princeton had this same information for the most highly-contested applicants. So in April, when the Web site went live, Princeton's admissions office sprang to action as well, allegedly downloading admissions decisions from the Yale Web site on at least 18 separate occasions.

Who's To Blame

Most of the cyber-security professionals I've spoken with have taken a decidedly "blame-the-victim" approach with this latest story of Web site hackery. Assuming that the allegations are true, it's terrible that an administrator at Princeton would engage in such patently illegal activities. But what's even worse, they say, is that Yale would deploy a Web application so poorly conceived and implemented.

To be sure, Yale is not alone in deploying systems with poor security for personal information. Many banks and credit card companies continue to treat widely-circulated personal information, like SSNs and birthdays, as if it were secret, available only to the bank account holder or credit card applicant. Clearly it is not, as evidenced by the national epidemic in identity fraud. But financial organizations have been stymied in their attempts to find a better means for verifying the identity of account applicants -- people with whom, by definition, the banks have no current relationship.

Poor Design Principles At Play

Yale could have designed a better system: it could have asked each applicant to supply a PIN or a password as part of their application. An even more secure solution would have been for the university to assign a password to every applicant and send it back to the high school students with their confirmation cards. Such an approach would have protected the process against students who would otherwise use the same password for both Yale and Princeton.

To provide even better security, Yale and Princeton could have used what's called a translucent database, a term coined by author and cryptographer Peter Wayner in his new book by the same title.

A translucent database uses cryptographic methods like hash functions and public key cryptography to mathematically protect information so that it cannot be wrongly divulged -- not even to a crooked database administrator. Translucent databases provide for unparalleled protection of sensitive information, be that information personal, corporate, or academic. Yet, with one notable exception, translucent databases are practically unknown and unused in IT today.

The Unix password file is the one translucent database that is in wide use today. When you log into a Unix computer, you're asked to provide a username and a password. If you type the correct information, you're logged in.

Before Unix, most computers had a "password file" that simply listed valid accounts and their corresponding passwords. But there is a big problem with this approach: if an attacker gets access to the file, then everybody's password needs to be changed.

So Robert Morris and Ken Thompson adopted a different approach when they designed the Unix password system. Instead of storing the actual passwords, Unix stores passwords that have been processed with a one-way hash function. Many people call this a one-way encryption function, but it's really not encryption, because there's no way to "decrypt" the password once it is hashed. Instead, when you attempt to log into a Unix system, the computer takes the password you provide, hashes it, and sees if your hashed password is the same as the hashed password that is stored in the password file. If they are, you're allowed to log in. (If you have access to a library, you can read the original article: Morris, R.H., and Thompson, K., "UNIX Password Security", Communications of the ACM, Volume 22, Number 11, November 1979, pp. 594-597. Unfortunately, the article is not available online without a subscription to the ACM's online library.)
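The hash-and-compare login check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the historical Unix code: the salted PBKDF2 construction, the iteration count, and the names make_entry, check_password, and password_file are all assumptions chosen for the example.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, chosen only for this sketch


def make_entry(password):
    """Return (salt, digest) for storage; the password itself is never kept."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(attempt, salt, stored_digest):
    """Hash the attempt the same way and compare it with the stored digest."""
    candidate = hashlib.pbkdf2_hmac("sha256", attempt.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored_digest)


# Hypothetical "password file": account name -> (salt, hashed password).
password_file = {"alice": make_entry("correct horse")}

salt, stored = password_file["alice"]
print(check_password("correct horse", salt, stored))
print(check_password("wrong guess", salt, stored))
```

Someone who copies password_file learns only salts and digests, which is what makes the file translucent rather than transparent.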

Benefits of Using Translucent Databases

In Translucent Databases, Wayner extends this concept of hashing in new and important ways. For example, what if a police department needs to build a database of sexual-assault victims that lets them identify trends but hides personal information? You could use a translucent database where the first column is the hash of the victim's name, the second column is a hash of their full address, and the third column is a hash of their block and street. You can now group incidents together by grouping entries with identical block hashes, and you can see whether incidents involve the same person by checking whether their name hashes match.
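A minimal sketch of that three-column table, assuming MD5 as the hash and a made-up rule for turning a street number into a block label; the names h, make_row, and by_block, and all of the sample data, are hypothetical. Hashes only match when the inputs are byte-for-byte identical, so a real system would need to normalize names and addresses before hashing.

```python
import hashlib
from collections import defaultdict


def h(text):
    """Hex MD5 digest of a string (MD5 is used here only for illustration)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def make_row(name, street_number, street):
    """Build one table row that holds only hashes, never the plaintext."""
    block = f"{street_number // 100 * 100} block of {street}"
    return {
        "name_hash": h(name),
        "address_hash": h(f"{street_number} {street}"),
        "block_hash": h(block),
    }


# Invented sample incidents.
rows = [
    make_row("J. Smith", 112, "Elm St"),
    make_row("A. Jones", 148, "Elm St"),
    make_row("J. Smith", 112, "Elm St"),
    make_row("B. Lee", 903, "Oak Ave"),
]

# Group incidents by block without ever seeing a name or address.
by_block = defaultdict(list)
for row in rows:
    by_block[row["block_hash"]].append(row)

for block_hash, incidents in by_block.items():
    distinct_victims = len({r["name_hash"] for r in incidents})
    print(block_hash[:8], len(incidents), "incidents,", distinct_victims, "distinct victims")
```

An analyst can spot a cluster on one block, and tell repeat victims from separate ones, while the table itself contains nothing readable.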

Wayner's approach makes it possible to let victims update their records without giving anybody else the ability to search by a person's name. You do this by adding a password to the victim's name -- a password known to the victim and nobody else.

For example, if you were to use the MD5 hash function, you could key a victim's report with the value of MD5("J. Smith/color4") where "color4" is Smith's password. If Smith remembers that her password is "color4", then she will be able to update her database entry in the future -- perhaps to tell the database administrators that her perpetrator has been caught. If there is a concern that victims might forget their passwords, the database can have additional columns that are protected with other passwords, known to other people. For example, a second column could be keyed with a password known only to the intake officer. By creating multiple keys using different combinations of data, it's possible to protect a translucent database against browsing while simultaneously providing for people's natural tendency to forget critical pieces of information.
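The keyed-record idea might look like the following sketch. The "name/password" format and the use of MD5 come from the example above; the table layout, the function names, and the officer password "badge-7731" are assumptions made up for illustration.

```python
import hashlib


def record_key(name, password):
    """Compute MD5("name/password"), the lookup key for a report."""
    return hashlib.md5(f"{name}/{password}".encode("utf-8")).hexdigest()


table = []  # each row stores two independent keys plus the report status


def file_report(name, victim_password, officer_password):
    """Add a report keyed separately for the victim and the intake officer."""
    table.append({
        "victim_key": record_key(name, victim_password),
        "officer_key": record_key(name, officer_password),
        "status": "open",
    })


def find_report(name, password):
    """Return the row whose victim or officer key matches, else None."""
    key = record_key(name, password)
    for row in table:
        if key in (row["victim_key"], row["officer_key"]):
            return row
    return None


def update_status(name, password, status):
    """Update a report only if the caller knows one of its passwords."""
    row = find_report(name, password)
    if row is None:
        return False
    row["status"] = status
    return True


file_report("J. Smith", "color4", "badge-7731")
print(update_status("J. Smith", "color4", "suspect caught"))
print(find_report("J. Smith", "badge-7731")["status"])
print(find_report("J. Smith", "guess"))
```

Knowing the name alone is not enough to find the row; either password, combined with the name, is.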

Had either Yale or Princeton adopted Wayner's principles, this nasty little episode might never have happened. We've already seen that Yale could have used a PIN or password to prevent the Princeton admissions office from being able to access the Yale Web site. But if Princeton had used a translucent database for its applications, then the admissions officials accused of browsing wouldn't have had access to the students' SSNs, either.

Although it's terrible that colleges like Yale and Princeton use a social security number as a universal identifier, they do so for a reason: there are occasionally cases where two students who apply have the same name. By using the SSN as a single identifier, it's possible to match up the student's application with their letters of recommendation, their SAT scores, and other information.

But once the match is done, there is no reason for the colleges to retain the number. Keeping around large databases of student names, birthdays, and SSNs merely opens these students up to the threat of identity fraud at some point in the future. It would be far better for the college databases to store the MD5 hash of the SSN, rather than the SSN itself.
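As a rough sketch of matching on a hashed SSN, the snippet below files applications under MD5(SSN) and attaches incoming test scores by recomputing the same hash. The record layout, the function name ssn_hash, and the sample numbers (deliberately invalid SSNs) are assumptions. MD5 is used because it is the function named above; since a nine-digit number can be enumerated, a production design would probably want a keyed or salted construction, which is beyond this sketch.

```python
import hashlib


def ssn_hash(ssn):
    """Normalize an SSN to digits only, then return its hex MD5 digest."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return hashlib.md5(digits.encode("ascii")).hexdigest()


# Applications are filed under the hash; the SSN itself is never stored.
applications = {
    ssn_hash("000-12-3456"): {"name": "J. Smith", "sat": None},
    ssn_hash("000-65-4321"): {"name": "J. Smith", "sat": None},  # same name, different person
}

# Hypothetical score reports arriving later, identified by SSN.
incoming_scores = [("000 12 3456", 1520), ("000-65-4321", 1390)]

for ssn, score in incoming_scores:
    record = applications.get(ssn_hash(ssn))
    if record is not None:
        record["sat"] = score

for key, record in applications.items():
    print(key[:8], record)
```

The two applicants named J. Smith stay distinct, and the score reports still land on the right records.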

There are a lot of other examples and clever tricks in Wayner's book. Together, they make this volume good reading for anybody interested in techniques for making privacy an inherent property of information systems -- rather than simply relying on policies, procedures, and access controls. His best example involves the creation of a database system for a community baby-sitter reservation system. Clearly, there's a lot of damage that somebody could do with a database of parents who are away from home, teenage baby sitters, and vulnerable children. But Wayner shows how you can use a combination of hash functions and digital signatures to store all of that information in a database, so that it's simply not possible for anyone other than authorized users to get it out.