A better way to store password hashes?

Note to reader, this is the first of a two part series. You can find the second part here.

Ever get that dreadful feeling after doing a password reset, when a site kindly emails your password back to you in cleartext? Few things are as stupid as emailing a user their own password, and yet I encounter these sites with depressing regularity.

We all know passwords should be salted and hashed, with a hashing algorithm that runs relatively slowly on current generation hardware. Obvious choices are scrypt, bcrypt or PBKDF2, but this isn’t a religious debate on how to hash. What I’m interested in is, what do you do next?
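For concreteness, here is what a salted, deliberately slow hash might look like in Python using the standard library's PBKDF2. The iteration count and output length are illustrative choices for this sketch, not requirements:

```python
import hashlib
import os

ITERATIONS = 10_000  # illustrative; tune upward for current hardware
HASH_LEN = 20        # 160-bit output, matching the hash sizes discussed below

def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a slow, salted hash from a password with PBKDF2-HMAC-SHA1."""
    return hashlib.pbkdf2_hmac(
        "sha1", password.encode("utf-8"), salt, ITERATIONS, dklen=HASH_LEN
    )

salt = os.urandom(16)                 # random per-user salt
digest = hash_password("hunter2", salt)
```

Same password plus same salt always yields the same digest; change the salt and the digest changes completely.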

So now you have to store the 20 (or 32) byte salt and hash in non-volatile, redundant, disaster-safe, and highly accessible storage. You need an API to access these bytes so your application can compare them later when your users try to log in. OK, so you store them in your database… But how exactly do you store them? In most examples, you see a table with columns for { UserID, Salt, Hash }… or perhaps the Salt and Hash are added directly to the Users table. In either case, the Hash and Salt have a one-to-one relationship with users of the system.

The problem is, even with the salted hash, it’s still too easy for an attacker to run dictionary attacks if they are able to retrieve a specific user’s salt and hash. Even if your attacker can only test 1,000 passwords per minute, you might be surprised how weak the average user’s password actually is, and how many can still be cracked!

We need a way to check whether a password is valid, with the absolute minimum risk of ever disclosing our users' passwords. Assuming an attacker has equal (or better) access to and understanding of our systems, and a great deal of hashing power, can you prevent them from retrieving cleartext passwords? History has shown it’s worth spending a bit of time and money to make this as difficult as possible.

So here’s an idea… instead of storing the password hash with a one-to-one mapping to users, you could store all your hashes in a table that looks like this: { Hash }.

Note the lack of foreign key. Hashes are no longer directly associated with users. Also note, salt is not stored in this table, it would still be directly tied (a one-to-one relationship) to each user.

When you create a new user, or a user changes their password, you randomly generate some salt, store it with the User data, hash their password with it, and INSERT the result into your Hashes table.
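A minimal sketch of that registration flow, with SQLite standing in for your real database and hypothetical Users/Hashes tables:

```python
import hashlib
import os
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (UserID INTEGER PRIMARY KEY, Name TEXT, Salt BLOB)")
conn.execute("CREATE TABLE Hashes (Hash BLOB PRIMARY KEY)")  # no foreign key to Users

def create_user(name: str, password: str) -> None:
    salt = os.urandom(16)  # random and unique per user
    digest = hashlib.pbkdf2_hmac("sha1", password.encode("utf-8"), salt, 10_000, dklen=20)
    # Salt stays with the user; the hash goes into the anonymous pool.
    conn.execute("INSERT INTO Users (Name, Salt) VALUES (?, ?)", (name, salt))
    conn.execute("INSERT INTO Hashes (Hash) VALUES (?)", (digest,))

create_user("alice", "correct horse battery staple")
```

Nothing in the Hashes row says which user (if any) it belongs to.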

When a user logs in, you retrieve the salt for the given user, re-compute the hash as you normally would, and simply check if the resulting value EXISTS in the Hashes table. If it does, you consider the login successful.
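A sketch of the login check, again with SQLite as a stand-in; note the query tests existence in the pool, not equality against a per-user column:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hashes (Hash BLOB PRIMARY KEY)")

# Simulate a previously registered user (in practice the salt is
# fetched from that user's row).
salt = b"\x01" * 16
stored = hashlib.pbkdf2_hmac("sha1", b"hunter2", salt, 10_000, dklen=20)
conn.execute("INSERT INTO Hashes (Hash) VALUES (?)", (stored,))

def check_password(salt: bytes, attempt: str) -> bool:
    digest = hashlib.pbkdf2_hmac("sha1", attempt.encode("utf-8"), salt, 10_000, dklen=20)
    # Existence check: is this hash anywhere in the table?
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM Hashes WHERE Hash = ?)", (digest,)
    ).fetchone()
    return bool(row[0])
```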

Now, instead of directly comparing the calculated hash to a specific one stored with the User, we are simply checking if the calculated hash exists among all the hashes in the database. The first thought is that existence is a much weaker check than equality. This is certainly true, but even if you were to store trillions of hashes, thanks to the size of the hash (160-bit at least), you would still have less than 1 in 10^18 chance of a collision (see: http://preshing.com/20110504/hash-collision-probabilities). Your servers are more likely to be hit by meteors than someone ever successfully logging in with the wrong password.
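The arithmetic is easy to check. Assuming 160-bit hashes, the chance that one wrong guess matches any of n stored hashes is roughly n / 2^160:

```python
# Back-of-the-envelope: probability that a single wrong-password attempt
# collides with any of n stored 160-bit hashes is about n / 2**160.
n = 10**12          # "trillions of hashes"
p = n / 2**160
print(p)            # on the order of 1e-36, comfortably below 1 in 10**18
```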

Note: As one reader pointed out, while it’s always important to randomly generate your salt values, it is especially important in this case to use a random and unique salt value for each user.

The next step is to INSERT as many dummy hash values into your Hashes table as your hardware can possibly handle. For starters you could easily generate a billion random values to fill up the table with ‘fuzz’.
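A sketch of generating that fuzz with batched inserts, again using SQLite; the count here is kept small for illustration:

```python
import os
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hashes (Hash BLOB PRIMARY KEY)")

def add_fuzz(count: int, batch: int = 10_000) -> None:
    """Insert `count` random 20-byte values, indistinguishable from real hashes."""
    remaining = count
    while remaining > 0:
        n = min(batch, remaining)
        rows = [(os.urandom(20),) for _ in range(n)]
        # OR IGNORE covers the astronomically unlikely duplicate.
        conn.executemany("INSERT OR IGNORE INTO Hashes (Hash) VALUES (?)", rows)
        remaining -= n
    conn.commit()

add_fuzz(100_000)   # scale the count up to billions on real hardware
```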

So, in exchange for the minute probability of a hash collision, what are the benefits of having a hash not directly tied to a specific user, and a Hashes table filled with noise?

You can create as many ‘fake’ hashes as you want to throw your attacker off–your random rows are indistinguishable from actual hashes. To brute force even a single user’s password, the attacker would have to search the entire Hashes table after each iteration. This basically adds a big RAM requirement on top of the existing CPU requirement for cracking your passwords, since your data set would have to fit in RAM to be reasonably fast to search. I don’t know enough to say for sure, but I believe this could hamper GPU-based brute force attacks as well.

This also means an attacker would need to steal a large portion of the Hashes table, if not the entire table, in order to likely succeed in brute forcing even a single user’s password. This makes it harder for the attacker, since you can make the Hashes table large enough that anyone trying to read or copy the entire table would be extremely likely to set off alarms before all the data is gone.

Imagine, for example, a Hashes table which starts with 1 billion rows pre-populated with random data, indistinguishable from real hashes. This is 20 GB of random data in your Hashes table (more with database overhead).

The key metrics we are concerned about are lookup time of a hash (password check), and insert time of a hash (new user / password reset). We need these to be fast enough for our site to function, on top of the time we’ll spend doing, for example, 10,000 iterations of PBKDF2 for the hash itself. Note the PBKDF2 will be running on your application servers, while the hash lookup will be handled by your database, so you’re distributing the load which your attacker will need to brute force.

As a quick test, I populated my own Hashes table on a copy of SQL Server running off my laptop, and after bulk inserting for about an hour I had 100 million rows in the table. At that point I was still inserting at well over 1 million rows per minute, and existence checks were still dominated entirely by network and application latency (<1ms query time reported in the profiler). The database file on disk was about 6 GB, of which 2.8 GB was ‘data’ and 3.2 GB was ‘index’. A fairly cheap server with 64–128 GB of RAM should be able to handle many billions of hashes just fine.

So is it all worth it? I think putting a few billion fake hashes in between the attacker and a successfully cracked password is a bit like throwing hay on top of the needle to make it harder to find. I’m picturing how the LinkedIn fiasco could have gone differently.

What Actually Happened

6M hash values, 120MB of data, all of which correspond to real passwords

2+ million passwords cracked in hours

What Would Happen Today (PBKDF2 Salted Hashes)

6M hash values, with 6M salts, 200MB of data, all of which correspond to real passwords

I’m going to guess you could still brute force 1 million passwords with less than $1,000 of EC2 cost

Consider this — while Facebook is reported to store over 100 petabytes of data, the user IDs, salts, and hashes for all its 800 million active users would fit on a single USB stick. One malicious employee and that data could walk right out the door. On the other hand, if Facebook used this technique, they could easily generate enough fake hashes to effectively hide the real hashes inside a petabyte-sized haystack. No one’s going to sneak out of the data center with that tucked in their sock!
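As a sanity check on the USB-stick claim, with illustrative record sizes (an 8-byte user ID, 16-byte salt, and 20-byte hash per user, chosen for this sketch):

```python
# Rough size of the credential data for 800 million users.
users = 800_000_000
bytes_per_user = 8 + 16 + 20          # ID + salt + hash (illustrative sizes)
total_gb = users * bytes_per_user / 10**9
print(round(total_gb))                 # ~35 GB, easily USB-stick sized
```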

Some genius on reddit has labeled this as security through obesity. I have to say, I like it! I will admit this is just a first crack at taking the next step toward protecting users from themselves. Of course, you may have to transition out of this idea in the not too distant future, the same way millions of sites need to transition out of using PBKDF2 and into using scrypt; waiting for all your users to log in so you can re-hash their passwords, and eventually timing out the inactive users into a forced password reset.