When Hashes Collide

If there was any doubt in my mind that data deduplication is a mainstream technology, it was wiped out when I saw--in the business section of The New York Times last week--a full-page ad from Symantec touting its deduplication technology. Even so, I still occasionally run into people who consider deduplication to be a dangerous form of black magic that is likely to mangle their data and end their careers. This attitude represents an overestimation of the likelihood of a hash collision in dedupli

Curtis even had a math Ph.D. create a spreadsheet to calculate the odds of a hash collision, which you can download from his Website at BackupCentral.com/hashodds.xls. In order for the probability of a hash collision to equal the 10^-15 odds of a disk read error, you would need 5x10^16 data blocks, or 432 exabytes of data in 8K blocks. I cheated and used the high-precision calculator at www.ttmath.org/online_calculator to compute that, for a deduping system with four petabytes of stored data in 8K blocks, the probability of a hash collision is 4.5x10^-26, or about the same as a tape read error with perfect media.
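For anyone who would rather check this kind of arithmetic than take it on faith, here is a rough sketch of the birthday-bound estimate in Python. It is not Curtis's spreadsheet, and it rests on assumptions of mine: a 160-bit hash (SHA-1 sized; the hash isn't named here), "8K" read as 8,192 bytes, and a petabyte read as 10^15 bytes. The function names are made up for illustration.

```python
from fractions import Fraction
from math import isqrt

HASH_BITS = 160          # assumption: SHA-1-sized hash, not stated in the post
BLOCK_BYTES = 8 * 1024   # assumption: "8K blocks" means 8,192 bytes
PETABYTE = 10 ** 15      # assumption: decimal petabytes


def collision_probability(num_blocks, hash_bits=HASH_BITS):
    """Birthday-bound approximation: p ~ n(n-1) / 2^(bits+1).

    Exact rational arithmetic avoids floating-point underflow
    when the result is astronomically small.
    """
    return Fraction(num_blocks * (num_blocks - 1), 2 ** (hash_bits + 1))


def blocks_for_probability(p, hash_bits=HASH_BITS):
    """Invert the approximation: n ~ sqrt(2 * p * 2^bits)."""
    return isqrt(int(2 * Fraction(p) * 2 ** hash_bits))


def blocks_in(stored_bytes, block_bytes=BLOCK_BYTES):
    """Number of whole fixed-size blocks in a given amount of stored data."""
    return stored_bytes // block_bytes


if __name__ == "__main__":
    # Blocks needed before the collision odds match a 1-in-10^15 read error.
    n = blocks_for_probability(Fraction(1, 10 ** 15))
    print(f"blocks for p = 1e-15: {n:.2e}")
    print(f"data at 8K per block: {n * BLOCK_BYTES:.2e} bytes")

    # Collision odds for a few store sizes.
    for pb in (3, 4):
        p = collision_probability(blocks_in(pb * PETABYTE))
        print(f"{pb} PB stored: p ~ {float(p):.1e}")
```

Under those assumptions the first figure comes out a bit over 5x10^16 blocks, and a 3PB store lands in the mid 10^-26 range. Change HASH_BITS or BLOCK_BYTES and the numbers move, but they stay absurdly small for any hash of this size.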

Now, it's true that people tend to avoid catastrophic events, even if they're very unlikely, while accepting much higher probabilities of events that have lesser consequences. As a result, we mine coal for electricity knowing miners will die and people will get asthma, but we won't build nuclear power plants. But a hash collision doesn't ruin all your backup data. It just means that one block of data will be restored with the wrong data, just like a tape or disk read error.

One hash collision, one corrupt file to restore once every 10^26 times you back up 3PB of data. Seems like a reasonable risk to me. After all, I cross the street every morning to walk the dog, and I could get run over by a streetcar--or by Jessica Alba, if she reads this blog. No, I won't calculate the probability of that.

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M. and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage ...