What's the best way to hash a credit card number so that it can be used for fingerprinting (i.e. so that comparing two hashes will let you know if the card numbers match or not)?

Ideally, I'm looking for recommendations of which hash + salt might be well suited for this use-case. The hash would uniquely identify a particular card number. You can use this attribute to check whether two customers who've signed up with you are using the same card number, for example. (since you wouldn't have access to the card numbers in plain text)

To give this some context, see the Stripe API docs and search for fingerprint. It's where I first heard about this concept. The credit card information will be stored on a secure machine (somewhere in the Stripe back-end) that's not accessible by the customer, and the API returns the fingerprint in order to allow the API consumer to make comparisons (e.g. to answer questions like has this card been used before?).

Let's make this one clear, coz I know you're going to ask: I'm not trying to replicate it. I'm just curious to understand how this could be implemented

So salt + hash won't work, because salts are unique. So, two identical card numbers, once salted, will produce different hashes, and you will not be able to determine that the card number matches a pre-existing salted hash of the same card number, since the salts will be different.
–
XanderJul 16 '14 at 16:40

2

I tend to agree with u2707's answer and if you can't safely store the credit card numbers, I don't think it's reasonable to store hashes either. I can't suggest doing this to your customers.
–
XanderJul 16 '14 at 17:19

4

@staticx That isn't a salt, it's a pepper, and it isn't particularly valuable from a security perspective when trying to prevent hash-cracking, as you still only need to compute a single set of hashes to recover all values.
–
XanderJul 16 '14 at 18:18

1

@staticx It's kind of a cheesy term I know, but since by definition a salt needs to be globally unique, a value that's used more than once doesn't qualify.
–
XanderJul 16 '14 at 18:24

2

@staticx If Stripe's server is down and you're using them for your merchant services, you can't process cards/transactions anyway, so it's irrelevant if you happen to depend on them for this function as well.
–
XanderJul 18 '14 at 13:21

7 Answers
7

I cannot comment on how Stripe does this but I can tell you exactly how Braintree does it (because that is where I work). If I had to guess, Stripe probably uses a similar method.

In the Braintree API, we offer a unique number identifier for a credit card. This identifier is a random opaque token that will always be the same for card number stored in our system. The seed for this number is different per merchant in our system so you cannot compare them across merchants.

When a new card comes in, we look it up by comparing it to a hashed + salted column. If it matches that existing column we know we can return the same unique number identifier. If it doesn't match any existing record, we use a cryptographically secure pseudo-random number generator to create a new unique number identifier and ensure it doesn't conflict with an existing one.

This way the hashed + salted value never leaves our backend but we can still provide a way for a merchant to uniquely identify stored credit cards.

@FloatingRock My take is that this basically extends the method I outlined by (a) peppering the PAN before hashing, and then (b) storing an additional unique, randomly generated token (or fingerprint) alongside the hash. You still have the requirements to strongly secure the app and database computing and storing the hashes and tokens, but you can then return the token to the front-end app to keep, and it doesn't reveal any information about the card it's associated with. It's quite nice in that it lets you offload functions to the front-end without any additional exposure to card data.
–
XanderJul 17 '14 at 19:02

1

@staticx I think you missed a bit. "When a new card comes in, we look it up by comparing it to a hashed + salted column" The hashed PAN is stored, by Braintree, and I assume Stripe. The OP wants to know how that implementation works.
–
XanderJul 18 '14 at 13:38

1

Out of curiosity, what prompted you to use a PRNG to generate the token, instead of say an auto-incremented PK, which would not need nearly as computational power and database lookups to ensure lack of collisions?
–
MrMarchelloJul 19 '14 at 15:05

1

For everyone commenting on the PCI implications of this. I agree that a hashed version of a PAN is trivially reversed into the full PAN if it is leaked (just because of the small domain of PANs). Braintree is a PCI Level 1 Service Provider and the hashing of PANs for this purpose is included in our yearly on-site PCI audit. The value add of using a service like Braintree is to help reduce the PCI burden on the merchants we support. That is why we provide a way to uniquely identify cards without needing to store the hash on the merchants end.
–
John DowneyJul 21 '14 at 18:47

Ok, so if this is a business requirement that you must meet (check to see if a card already exists in the system) this is how I would do it.

First, PCI-DSS does allow for a PAN (the primary account number, or credit card number) to be stored in hashed form using "strong cryptography" and they recommend (but do not require) that a salt be used as well. PCI-DSS v3 § 3.4(PDF link) The reason is to inhibit the ability of a malicious party who gets ahold of the hash from determining the PAN that was used as the input for the hash.

Second, your plan to store this in a separate, secure system and expose it only via an API is a good defense in depth measure. If I were implementing this, I wouldn't have it send a hash back to the primary system, I would simply take the hash as a parameter for the API call and have the API return "true" or "false" depending on whether the hash already exists in the system or not.

For the hashing function, the key to preventing the recovery of the inputs (as required by PCI-DSS) is to chose an algorithm that allows you to make the process computationally hard, so the task of brute-forcing the relatively small input space of valid PANs reasonably slow. SHA-512, suggested by another answer, is not this algorithm. Looking at the test matrix for hash cracking provided by hashcat.net the fastest machine in their matrix can calculate nearly 2 billion SHA-512 hashes per second. With an estimated input space of approximately 30,000,000,000,000 cards**, that means that unsalted, you could compute the SHA-512 of each of those PANs in just over 4 hours, and have the card numbers for ever card in the database, should it ever be stolen.

So, what we need is a slow hash function, however the definition of slow changes all the time as computers get faster and technology evolved. This means that a secure hash function will be one with an adjustable work factor, so over time you can increase the computational effort required to compute the hash of a given input. There are several candidate algorithms that meet this criteria including PBKDF2, bcrypt, and scrypt. The general consenus for password hashing is bcrypt, and that is what I would suggest here. For the cost, set it as high as your application allows. It likely isn't something you're going to be computing terribly frequently (I'd imagine, depending on your application, you're going to be looking at x hashes per minute or hour or day rather than per second) and the higher the cost, the harder it will be for an adversary who captures the hash to brute force.

** My estimate is based on a 9 digit account number, a computable check digit that doesn't increase complexity and a WAG of 30,000 active bank identification numbers (the first 6 digits of the card) based on a list I found.

I think you're too optimistic. Bcrypt doest resist FPGAs very well, and I'd assume the last 4 digits are known to the attacker.
–
CodesInChaosJul 16 '14 at 23:27

@CodesInChaos Well, overall, my advice to not to do this at all in practice still stands. In theory though, if following PCI, the truncated card numbers are well separated from the hashes and a breach of one doesn't compromise the other. In practice... Shrug. In a theoretical implementation though, vis-a-vis the lack of resistance to FPGA-based attacks you mentioned, would you suggest scrypt is a better option?
–
XanderJul 16 '14 at 23:34

3

credit card numbers are insecure in the first place anyway. Most serial number systems for games have more entropy than that. When will we get rid of them?
–
pqnetJul 17 '14 at 3:01

@CodesInChaos Because BCrypt uses a different salt every time, you wouldn't really be able to compare the resulting hashes. Unless, of course, you used the same salt, which doesn't seem very BCryptish to me.
–
corsiKaJul 18 '14 at 15:04

Hashing credit card numbers is not a substitute for securing the data. If your system isn't secure enough to store raw credit card numbers then it's not secure enough to store CC hashes.

Same thing for anything that's a fixed size and limited character set: date of birth, phone number, zip, etc.

Credit cards are all sixteen digits and while 10^16 seems like a big number, it's pretty straightforward for compute resources. If I have your table of CC# hashes, I can easily attack that with a rainbow table and get a full list of credit cards. Salting the values before you hash makes searching impossible.

All possible CC values: something less than 10^16.

For comparison, imagine your users used a reasonable 8 character password with a 20-byte salt: 62^28.

(If I was your PCI auditor I would consider hashing credit cards to be tantamount to storing the card numbers themselves.)

One addition and one pedantic point. The pedantic point: Some credit card numbers are 15 digits, not 16. Not that this changes your answer much, but I wanted to note it. Additionally, you wouldn't have to brute-force the entire space. The first set of digits in a CC number are one of a few known prefixes, and the last is a check-digit, so the space of possible valid numbers is quite a bit smaller.
–
XanderJul 16 '14 at 16:17

Not enough rep to comment yet, but this is mainly directed at the previous answers suggesting on 10^16 as the "max" number of "randomness" - it's not really close to 10^16, since out of the 16 digits that make up a credit card number, only the digits 7-15 (inclusive) identify the account - the others identify the type (mastercard, visa, etc.), the issuer (bank XXX), the specific type of card (I.E. Delta gold rewards). Finally there is a "checksum" which means you can (quickly for a computer relative to computing a hash) eliminate another digit if that doesn't add up, so realistically you are l
–
user2813274Jul 16 '14 at 17:50

2

@Xander: Yes, and its smaller than people think because the entire card number has to pass the Luhn algorithm.
–
staticxJul 16 '14 at 18:14

"something less than 10^16" -- yep, definitely something smaller than 10^16, just too lazy to do the math.
–
u2702Jul 17 '14 at 18:41

The answer will disappoint you: if a service/oracle/hash can tell you if an input number is on file, it can be broken by a very small amount of brute force.

Consider that in much of the country, credit cards are issued by relatively few banks in their geographical region. You can learn the BIN numbers for the most popular of these banks quite easily. Let's say you have a receipt from a Memphis store, and it says VISA ****1234. Chances are about 20% that it will be one of ten different BINs that belong to the largest banks in the Memphis area. That's 10^1 guesses for the first six digits. Next, you know the last 4, because they're printed on the receipt. That's now only 10^1 guesses that cover 10 of the 16 digits. Of the remaining six digits, iterate over 5 and compute the last using Luhn. That's 10^6 tests that will yield a valid account number 20% of the time.

All you have to do to test a number is loop through the service 100,000 times. It doesn't matter if the service uses a hash or encryption or a magic wand to protect the account numbers, as long as it tells you whether or not your input was a match.

You may note that this works only about 20% of the time. Consider that the guy who stole from Target took 40 million account numbers, and he couldn't sell them all, as there weren't enough crooks in the world to buy that many. 20% of that is 8 million, and he probably didn't sell that many, either. So 20% is good enough odds for an attacker.

tl;dr:
It is not possible to do such a thing in a truly secure manner, but it can be done in a "probably good enough" way for being usable and acceptable.

Your options are basically some form of encrypted storage (with the risk of encryption keys being stolen, as they need to be present for decryption), or something involving hashing, with the well-known issues of hashing passwords (a credit card number is basically a "password").

I would implement this as as two-level system where the first level, would be a simple CRC32 and the second level would be a many times iterated secure hash function. Yes, CRC32 is not a cryptographically secure hash, but that doesn't matter in this particular case, since the CRC is not the weak link.
If the thought of using a not-cryptographically-secure algorithm scares you too much, you can still replace the CRC with a cryptographically secore algorithm and only use e.g. the least significant 32 (or 16, for a bigger margin) bits from the resulting hash.

The reason for this approach is that the range of possible inputs only spans around 39 bits[1]. It is practically impossible to hash this keyspace in a truly secure way while keeping the system operational for the designated use.
A brute-force attack on a 39 bit keyspace is not only theoretically feasible, but in fact quite trivial -- unless each single operation is extremely expensive.

The main threat you need to defend against is theft of your database followed by brute-forcing your hashes to retrieve valid card numbers. The next best thing an attacker could do would be querying online whether some unknown number that hashes identically to a randomly submitted number has been used before, or querying whether an already known valid number has been used (neither one is very useful, though!).

Credit cards are usually valid for 5 years, thus for the system to be secure, it must withstand at least 5 years of brute-force. In other words, the work performed for a single verification must be so computionally complex that a dedicated multi-GPU cracking machine cannot perform more than 2^39/(5*365*86400) = 3486.5 verifications per second.

Taking the well-known 348-billion-per-second figure from 2012, the only really secure approach would thus have to store the value of having hashed at least 100 million times. Taking Moore's Law into account, that'd be 200 million by now, and we're not even considering the possibility that someone owns 2 or 3 such machines (or maybe 10).

However, this still does not take into account the fact that banks are inherently stupid and will simply increment the second last digit for a new valid number (and update the check digit). All my credit cards issued during the last 20 years have had the same number with the exception of the second-last digit incrementing 0, 1, 2, 3, ...
Which means that planning for 5 years still isn't enough. Someone knowning an expired credit card number can trivially derive the valid number of the follower card. You therefore need to plan rather for something like 30 years, not 5 years. I'll leave extrapolating Moore's Law to another 20-30 years for homework.

In conclusion, for the system to be really, really secure, your server would have to perform the equivalent of a trillion hashes (or more) per verification, which at a rate of, say 10 million per second for a typical CPU implementation, would take over a day. In other words, the system is completely unusable for its designated purpose.

Something that's several orders of magnitude faster is needed. Most queries will be negative queries, and a simple, fast hash could be used to prune out 99.9% of all negatives, if only it doesn't render the already weak security even weaker. This is where CRC32 comes in.

If the CRC32s of two credit card numbers don't match, it is guaranteed that the numbers are not the same. Most of the time, that will be the case, and you can immediately return "No!". You can skip 99.9% of the work without giving anything away.
If the CRCs do match, it might be a hit, but there is a chance of a hash collision. Now, the secure cryptographic hash (which would have at least 160, better 256 bits) is calculated and compared. The likelihood of a collision is neglegible now. If the hash matches, it is certain that the number matches.

CRC32 contains 32 bits of information. It is impossible to restore 39 bits of information from 32 bits, no matter what computional resources you may own. Therefore, the CRC is safe to use in this case (or at least, it is not less safe than anything else).

As pointed out by Ben Voigt, it is of course still possible to guess the information that cannot be restored, and 7 bits to guess aren't a terrible lot (concatenating the expiry date will make this 15 bits). That is true, but there is not much you can do about it.

What about CRC and security anyway? CRC is not cryptographically secure. It is relatively easy to tamper with bits to get a desired output (not a problem here, would be with logins), and the function can be trivially reversed (4 table lookups, plus a few shifts and xor) to a 32 bit input, a somewhat bigger problem. Luckily, but this bit pattern is pretty worthless since it isn't a valid number, nor a subset of any 39-bit pattern that might be a valid number (other than by sheer coincidence).

Brute-forcing the complete 39-bit keyspace is more straightforward than trying to play tricks on the CRC, since, unluckily, this is easy enough and fast: A SSE4.2 implementation uses about half a cycle per byte (so, 2 cycles per hash), which means a desktop computer could do the complete keyspace in a second or two (assuming the keys to compare with all fit in L1).

The exhaustive search would reveal all 39-bit patterns that have the same CRC, which means they could possibly be credit card numbers, and using Luhn's algorithm the attacker can immediately discard 9 out of 10 numbers which cannot be valid. That's bad, admittedly, but it is really just a consequence of not having too many input choices and having a check digit implanted on top!

Still, the CRC is actually stronger in this scenario than a cryptographically secure algorithm: Given 2-3 billion credit cards in circulation worldwide, the chance of a collision using a 160-bit hash is somewhere around 1 to 1020, and don't even think about the chances using a 256-bit hash. Which means that any hit during your exhaustive search of the 39-bit keyspace will immediately give you the one and only guaranteed valid card number. A hit on the CRC32 will only give you one of many possible numbers.

One thing you could do to trade some speed for security would be to further throw away some information from the CRC, for example, you could just discard the most significant 8 or 16 bits of the CRC, which would add these to the bits that an attacker will have to guess. Of course this will somewhat reduce the effectiveness of the pruning, but not too drastically. Instead of pruning 99.99% of the work, you might end up pruning 99% or 98%, which is still mighty fine.

The second level of verification, as stated above, needs to be a cryptographically secure hash that runs at least a million iterations much like it is done in a key stretching scheme (actually it needs much more than that, but the system needs to remain practically usable, too).
The implementor needs to trade risk versus usability here, a server should at least still be able to process a few dozen requests per second. Customers/merchants will not consent if looking up one number takes several minutes (or longer), no matter what the security implications are. Nor is it practical to have more or less one dedicated server dedicated per merchant.

Since matching is only interesting on a per-merchant base, one could use a constant-per-merchant salt and store card numbers redundantly along with an index on the merchant's ID, which should be within the powers of any reasonable database system.

[1] A credit card number has 16 digits of which the last is a checksum derived from the others (Luhn algorithm). The second but last digit is usually (not necessarily, but usually) the sequential card number, starting at zero for the account owner's first card, 1 for the spouse or a replacement card, 2 for the next card thereafter, and so on.
Further, the first three digits encode the card issuer and the bank, and only use about half of the possible number of 3-digit combinations.
All in all, this leaves us with betwen 12 and 13 complete decimal digits of entropy, which is anywhere between slightly over 36 bits, and 39 bits.

I would recommend concatenating the expiry date to the number, as this will make this situation a bit less miserable. My credit cards are always valid for 5 years (not sure if that is the universal rule, but I'd assume so), so adding the expiry date would add 5*12 possibilities, or 5.9 bits (these come pretty much for free, since you need the expiry data in a transaction anyway!).

Thanks Damon. Just to clarify, RE: (only a query of the type "did this particular person use that number before" would be possible). If I'm a merchant using Stripe, I don't care if the card was used with another merchant or not. The scope would be limited to the cards used on one particular merchant (i.e. Me).
–
FloatingRockJul 17 '14 at 13:46

This is an absolutely terrible approach. Sure you cannot recover 39 bits from a CRC-32. But the space you need to brute-force is reduced to 7 bits :( It doesn't matter how large your cryptographic hash is, anything larger than the input space isn't expanding the search.
–
Ben VoigtJul 17 '14 at 21:56

1.1 Do not store sensitive authentication data after authorization (even if encrypted): Sensitive authentication data includes the data
as cited in the following Requirements 1.1.1 through
1.1.3.

In the adjacent column it states:

If this payment application stores sensitive authentication data,
verify that the application is intended only for issuers and/or
companies that support issuing services

I assume you are not an issuer or a company that support issuing services. Keep in mind that Target Corporation was recently hacked and credit card information was stolen in real-time and from servers who did not encrypt the data end to end. It is my understanding that it is being rectified and made more secure. Frankly, do not store it period. Even if you use SHA512 hash, your implementation might be flawed. With the best countermeasures, I would not want to be on the end of a lawsuit.

Edit:

It technically is permitted in section 2.3 of PA-DSS v2.0. But don't do it.

@Xander: Yea, I was hoping to avoid saying that and just say carte blanche don't do it :)
–
staticxJul 16 '14 at 18:26

2

Actually, it's permitted, because the PAN isn't on that list of sensitive data (see page 12.) Sensitive data is either full track data, CAV2/CVC2/CVV2/CID, or PIN blocks. (Disclaimer: I am not a QSA, consult your QSA for your own situation.)
–
John DetersJul 17 '14 at 0:41

@JohnDeters: Yes, but the PAN is included in Track 1 and 2. It's a bad paragraph really. They shouldn't allow it
–
staticxJul 17 '14 at 10:44

There's no such thing as 'SHA512 encryption'. Hashing and encryption are very different, and it's dangerous to confuse them.
–
sapiJul 17 '14 at 12:45

If you're going to do it, then add some kind of obfuscation. I just guaranteed myself downvotes, but obfuscation is not without value like some will insist.

If you assume the system is known, then there is no value. But if the system isn't known, there is absolutely value. An IV used to 'pepper', as Xander put it, would be a form of obfuscation, while not increasing collisions, and maintaining the ability to query.

A pepper wouldn't be a very effective obfuscation however, unless the pepper is large enough to not be bruteforce guessed in a reasonable amount of time. 100 random bytes would may be overkill but theres no reason not to go big.

Another thing that might be effective obfuscation would be to not assume that both the pepper and the message are attained at the same time, so that if you periodically rehash the hashes in a nightly batch (keeping track of how many rounds), they'd need another piece of info - how many rounds, or how the rounds are calculated.

Keep in mind that I don't think this is a good idea, I'm just thinking out loud here. If your environment is secure enough for credit cards, not only would I not hash them, I'd store them in a table designed for the fastest lookup performance.

Check out the accepted answer .. it uses a random number generator that reveals nothing about the card number.
–
FloatingRockJul 17 '14 at 18:28

1

Oh, well I read this all yesterday and there wasn't any context or use case, which changes everything.
–
Andrew HoffmanJul 17 '14 at 18:33

@FloatingRock: Your weak point is the transit of the card from the application to Braintree. Also, there is a weakness at the swipe reader to the PC to the application. There are solutions out there that encrypt end-to-end but it costs money.
–
staticxJul 22 '14 at 11:23