Using grep and some simple command-line magic to parse a 13.5GB text file more efficiently.

In a previous blog entry [link], I mentioned I was already working on using Troy Hunt’s list of 320 million hashes of breached passwords.

To recap, those are all SHA1 hashes in hexadecimal format, uppercase.

A single 13.5GB text file containing the entirety of the 320M hashes is far too cumbersome to search efficiently, and in my use case, I can’t afford to have each search take several seconds.

What’s the use case? Whenever a user logs in to a system maintained by my current employer, the system will check if the user’s password is part of those 320M breached passwords. Since Troy Hunt’s list is a list of SHA1 hashes, we just SHA1-hash the user’s password, then check if that hash is in the hash dump.
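To make that concrete, here’s one way to produce a matching hash on the command line – just an illustrative sketch using sha1sum from coreutils, not necessarily what the production system does:

```shell
# SHA1-hash a candidate password and print the digest as uppercase hex,
# the same format as the dump. The -n matters: without it, echo appends
# a newline and that newline gets hashed too.
echo -n 'password' | sha1sum | awk '{print $1}' | tr 'a-f' 'A-F'
# 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8
```

If that string turns up in the dump (and for ‘password’ it certainly will), the password is known to be breached.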

That’s pretty straightforward, but searching a tremendous file like that is too slow. We need to partition it somehow so that we only ever have to search much smaller files.

PARTITIONING

Thankfully, since the file contains nothing but 40-char hexadecimal hashes (SHA1 output), there’s a convenient way to partition them: separate them into files according to their starting characters.

If we want a 16-way split, then we just create a file that contains all hashes starting with ‘0’ (that’s a zero), then a file starting with ‘1’, then a file starting with ‘2’… until we get to the sixteenth file, which will contain all hashes starting with ‘F’.

Of course, those sixteen files are still pretty hefty – as I showed in my previous blog post [link], they are still ~840MB each.

We can do better easily by choosing to partition using double characters, or triple characters.

DOUBLE CHARACTER PARTITIONING

In double character partitioning, we’ll just separate the hashes into files according to their first two characters. That means all hashes that start with ‘00’ go into one file, then hashes that start with ‘01’ go into another file… until the 256th file, which contains all the hashes that start with ‘FF’. That’s 256 files total, since there are 16 possible first characters multiplied by 16 possible second characters (16 x 16).

This is now a lot better – since the hashes are uniformly distributed, we now have 256 files that are only 52MB each. That’s a far cry from 840MB.

For most purposes, this is probably feasible already.

TRIPLE CHARACTER PARTITIONING

We can do better than that if we decide to partition the hash list using the first three characters.

All hashes that start with ‘000’ go into one file, then hashes that start with ‘001’ go into another file, and so on until the last file which contains all hashes that start with ‘FFF’.

In total, this gives us 4,096 files (16 x 16 x 16). That’s a lot more loose files, but each one is only 3.3MB.

SOURCE CODE

I’ll share the code I used to create the partitioned sets so you can jump in immediately and do it for yourself.

This is the bash script I used to create the single-character partitioned files from Troy Hunt’s password dump (you’ll note that he actually released 3 files – the initial one is 12.9 GB; the remaining 600MB is from the next 2 supplements):
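(The script itself didn’t survive in this copy, so here’s a loop-based sketch that produces the same sixteen files. The input filenames are my guesses for Troy Hunt’s three release files, and as described below, the actual version was a flat list of commands rather than a loop. The stand-in inputs are created only so the sketch runs on its own.)

```shell
#!/bin/bash
# Split the combined dump into sixteen files by first hex character.
# The three input filenames are assumptions for Troy Hunt's releases;
# create tiny stand-in dumps if they're absent, so this demo is runnable.
[ -f pwned-passwords-1.0.txt ] || printf '0AAA\n1BBB\nFCCC\n' > pwned-passwords-1.0.txt
[ -f pwned-passwords-update-1.txt ] || printf '2DDD\n' > pwned-passwords-update-1.txt
[ -f pwned-passwords-update-2.txt ] || printf '3EEE\n' > pwned-passwords-update-2.txt

hexchars=( 0 1 2 3 4 5 6 7 8 9 A B C D E F )
for hexchar in "${hexchars[@]}"
do
    # All hashes starting with this character, from all three releases,
    # go into one group file (-h suppresses filename prefixes).
    grep -h "^${hexchar}" pwned-passwords-1.0.txt \
                          pwned-passwords-update-1.txt \
                          pwned-passwords-update-2.txt \
        > pwned_group_"${hexchar}".txt
done
```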

Nothing fancy about this bash script – it’s pretty straightforward, not even a loop to help me along. That’s because I actually began by manually issuing commands in the terminal. A few commands later, I got tired of typing, then waiting, then typing, then waiting… so I just started typing inside an editor, copy-pasted stuff around, and a couple of minutes later it was done and I ran the whole thing as a bash script. It took a while, but after that I had my sixteen 840MB files.

Since these 16 files already have all the hashes from all of Troy’s releases, they became the basis for the double-char and triple-char partitions.

Here’s the bash script for creating the double-char partitioned files:

#!/bin/bash
#This will use the grouped pwned passwords files and break them apart further into
#hashlists categorized according to the initial 2 characters.
#This effectively separates the original, single password hashlist from Troy Hunt (320M passwords)
#into 256 different files, so that processing each individual file to look for a hash is a lot faster.
hexchars=( 0 1 2 3 4 5 6 7 8 9 A B C D E F )
for hexchar1 in "${hexchars[@]}"
do
    for hexchar2 in "${hexchars[@]}"
    do
        pattern=${hexchar1}${hexchar2}
        grep "^${pattern}" pwned_group_"${hexchar1}".txt > hashlist_group_"${pattern}".txt
    done
done

And very similar to that is the bash script for the triple-char partitioning:

#!/bin/bash
#This will use the grouped pwned passwords files and break them apart further into
#hashlists categorized according to the initial 3 characters.
#This effectively separates the original, single password hashlist from Troy Hunt (320M passwords)
#into 4,096 different files, so that processing each individual file to look for a hash is a lot faster.
#Make sure the output directory exists before grep tries to write into it.
mkdir -p triplechar_groups
hexchars=( 0 1 2 3 4 5 6 7 8 9 A B C D E F )
for hexchar1 in "${hexchars[@]}"
do
    for hexchar2 in "${hexchars[@]}"
    do
        for hexchar3 in "${hexchars[@]}"
        do
            pattern=${hexchar1}${hexchar2}${hexchar3}
            grep "^${pattern}" pwned_group_"${hexchar1}".txt > triplechar_groups/hashlist_group_"${pattern}".txt
        done
    done
done

USING GREP TO FIND A MATCH

Now that you have either a double-char or triple-char partitioning, looking for a match is easy:

grep -m 1 [hash] [hash_group_file]

This uses the grep command to search for [hash] (the hash you are looking for) within the specific hash file. The option “-m 1” tells grep to stop searching once it finds a match – since you know there aren’t any duplicates anyway, there’s at most one. For hashes that are closer to the start of the file, that’ll save you time since grep won’t have to scan the rest of the file.

If you’re using the double-char partitioning, for example, then searching for a hash would look like this:
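Here’s a sketch, using the hashlist_group_ naming from the scripts above. (The stand-in group file is created inline so the example is self-contained; in real use, it would already exist from the partitioning step.)

```shell
# 1. Hash the candidate password into uppercase hex.
hash=$(echo -n 'password' | sha1sum | awk '{print $1}' | tr 'a-f' 'A-F')
# 2. The first two characters pick which of the 256 files to search.
prefix=${hash:0:2}
# Stand-in group file so this sketch runs on its own.
printf '%s\n' "$hash" > "hashlist_group_${prefix}.txt"
# 3. Search only that one small file, stopping at the first match.
grep -m 1 "${hash}" "hashlist_group_${prefix}.txt" && echo "Password found in breach list"
```

Instead of scanning 13.5GB, each lookup touches a single ~52MB file, and grep bails out as soon as it hits the match.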