How Does Dropbox Store Data?

A very good blog post from Stephen Foskett (thank you) on how Dropbox stores data, with some de-duplication involved.

Dropbox recently clarified (via their blog and privacy policy) that they “de-duplicate” user files. This has been known for quite a while, and is obvious to anyone who’s had a large file “upload” instantly. But how exactly does Dropbox store files? Are they really de-duplicated or just single-instanced? I set out to discover the answer.

Single Instance Storage

It’s fairly simple for a system to eliminate duplicate data by storing only a single instance of multiple identical files. In other words, if you and I both upload “Presentation.pptx” and it’s bit-for-bit identical, it would be a simple matter to store just one copy.
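To make the idea concrete, here is a minimal sketch of single-instance storage. It is not Dropbox's implementation: the `SingleInstanceStore` class, its method names, and the choice of SHA-256 as the fingerprint are all assumptions made for illustration. It only shows how identical bytes uploaded under different names or by different users can map to one stored copy.

```python
import hashlib


class SingleInstanceStore:
    """Toy store that keeps one copy of each distinct file content (hypothetical)."""

    def __init__(self):
        self.blobs = {}  # fingerprint -> file bytes, stored once
        self.names = {}  # (user, filename) -> fingerprint

    def upload(self, user, filename, data):
        """Record the file; return True only if new bytes had to be stored."""
        # Assumption: SHA-256 stands in for whatever fingerprint a real service uses.
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.blobs
        if is_new:
            self.blobs[fingerprint] = data
        # The name always points at the fingerprint, new or not.
        self.names[(user, filename)] = fingerprint
        return is_new

    def download(self, user, filename):
        """Look up the fingerprint for this name and return the shared bytes."""
        return self.blobs[self.names[(user, filename)]]


store = SingleInstanceStore()
slides = b"identical presentation bytes"
first = store.upload("you", "Presentation.pptx", slides)
second = store.upload("me", "Presentation.pptx", slides)
renamed = store.upload("me", "Copy of Presentation.pptx", slides)
```

In this sketch only the first upload stores any bytes. The second user's upload and the renamed copy just add name entries, which is the kind of behavior that would make an "upload" appear instant.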

Copy the file with a new name to the folder and notice that it “uploads” instantly

Dropbox is at least doing single-instance storage. This helps users, since it speeds up uploads and reduces bandwidth usage. It helps Dropbox in the same way, and more, since they still “charge” files against your account whether they’re single-instanced or not.

Clashing MD5 Hashes?

Three files with identical sizes and MD5 hashes but different names? Creepy!

A global single-instance storage system sounds great, but it opens the door to hash collision issues. Imagine if you and I both uploaded identical files. Both would have the same “fingerprint” and Dropbox would only store it once. Now imagine instead that, by coincidence or malice, I uploaded a file with the same fingerprint as yours but different contents. This is not as far-fetched as it seems, and could lead to all sorts of security nightmares. Read on in Stephen Foskett’s original post.
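The collision scenario can be sketched in a few lines. To make a collision easy to find, this example uses a deliberately weak fingerprint (only the first byte of a SHA-256 digest). That weakness is an assumption made purely for illustration, and every function name here is hypothetical. A store that trusts the fingerprint alone hands back the wrong contents, while a byte-for-byte comparison catches the mismatch.

```python
import hashlib
from itertools import count


def weak_fingerprint(data):
    """Deliberately weak fingerprint: only the first byte of SHA-256."""
    return hashlib.sha256(data).digest()[:1]


def find_collision(target):
    """Search for different bytes that share target's weak fingerprint."""
    wanted = weak_fingerprint(target)
    for i in count():
        candidate = b"attacker file %d" % i
        if candidate != target and weak_fingerprint(candidate) == wanted:
            return candidate


def naive_upload(blobs, data):
    """Trust the fingerprint alone: keep whatever was stored first."""
    key = weak_fingerprint(data)
    blobs.setdefault(key, data)
    return blobs[key]


def checked_upload(blobs, data):
    """Compare bytes before treating a fingerprint match as a duplicate."""
    key = weak_fingerprint(data)
    stored = blobs.setdefault(key, data)
    if stored != data:
        raise ValueError("fingerprint collision: contents differ")
    return stored


yours = b"your file"
mine = find_collision(yours)  # different contents, same weak fingerprint
blobs = {}
naive_upload(blobs, yours)
served = naive_upload(blobs, mine)  # the naive store hands back your bytes
```

Here `served` holds your file even though I uploaded mine, so the two of us would silently end up sharing one set of contents. Calling `checked_upload` on the same data raises an error instead. Whether Dropbox performs any such comparison is not something this sketch can tell you.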