Posted by samzenpus on Monday March 31, 2014 @07:02AM
from the lets-see-what-you-have-there dept.

Advocatus Diaboli (1627651) writes "This weekend a small corner of the Internet exploded with concern that Dropbox was going too far, actually scanning users' private and directly peer-shared files for potential copyright issues. What's actually going on is a little more complicated than that, but shows that sharing a file on Dropbox isn't always the same as sharing that file directly from your hard drive over something like e-mail or instant messenger. The whole kerfuffle started yesterday evening, when one Darrell Whitelaw tweeted a picture of an error he received when trying to share a link to a Dropbox file with a friend via IM. The Dropbox web page warned him and his friend that 'certain files in this folder can't be shared due to a takedown request in accordance with the DMCA.'"

Publicly shared files that match known hashes are restricted, but not deleted, and any file can be shared to anyone privately without restriction, just not publicly to the world. Not much of a story. Read TFA.

This is news, in the sense that Dropbox now actively crawls your files (DMCA takedowns already applied to publicly listed files anyway).

You obviously didn't bother to read the article.

The truth is that they always scan every single file uploaded to make sure they do not already have a copy of that file stored on their network. If they do, they throw your copy in the bin and just add an extra link to that stored copy in your account. That keeps their data usage lower as it means they never store duplicate copies of the same file, even if they are uploaded by completely different people.
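For anyone who wants to see what that kind of deduplication looks like, here is a minimal sketch. It is not Dropbox's actual code: the `DedupStore` class, its method names, and the use of a whole-file SHA-256 digest as the lookup key are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one stored copy per distinct content."""

    def __init__(self):
        self.blobs = {}     # hex digest -> file bytes, stored once
        self.accounts = {}  # user -> {filename: hex digest}

    def upload(self, user, filename, data):
        """Link the file into the user's account; return True if new bytes were stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data  # first time this content has been seen
        # Either way, the account only holds a reference to the stored copy.
        self.accounts.setdefault(user, {})[filename] = digest
        return is_new

    def download(self, user, filename):
        """Follow the account's reference back to the single stored copy."""
        return self.blobs[self.accounts[user][filename]]
```

Two users uploading identical bytes under different filenames end up with two account entries pointing at one stored blob, which is the storage saving described above.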

So there is no crawling involved; this was done at the point of upload. They found that the same file had already been uploaded by someone else and shared, and that user's shared copy had been DMCA'd. Once a file has been DMCA'd in their system, it seems it is blocked from being shared, so only people who uploaded that file can download it.
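A rough sketch of how such a block could work, assuming the service keeps a set of digests named in takedown requests and checks it when a share link is created. The names `blocked_digests`, `record_takedown` and `can_create_share_link` are hypothetical, not anything Dropbox has documented.

```python
import hashlib

# Hypothetical blocklist: SHA-256 digests of files named in takedown requests.
blocked_digests = set()


def digest_of(data):
    """Hex SHA-256 digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def record_takedown(data):
    """Remember the digest of a file that received a takedown request."""
    blocked_digests.add(digest_of(data))


def can_create_share_link(data):
    """Allow sharing unless the contents match a digest on the blocklist."""
    return digest_of(data) not in blocked_digests
```

Nothing here reads file contents after upload beyond computing the digest, and the file itself is never deleted; only the sharing step is refused.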

Dropbox is nothing more than a gussied-up repackaging of SFTP or FTPS with a nice fancy ol' GUI.

The post office is nothing but a gussied up repackaging of walking to your friend's house and giving him the letter yourself. The fax machine is nothing but a waffle iron with a phone attached!

No, it's slightly more than that.

You set up a server for SFTP or FTPS and download a nice, friendly little program called FileZilla.

...and then? Will FileZilla run on startup, settle itself inconspicuously in the systray without a running window you could accidentally close, connect to the SFTP server, download files automatically to local directories so they're instantly accessible, then monitor, sync and notify you of any changes? Will it allow you to dish out invitations to share directories and files direct from your desktop, and manage those permissions for an unlimited number of users and directories?

Anyone who uploads copyright-infringing content to a cloud server and entrusts it to the care of a company is an idiot. Files could be scanned in various ways, from simply looking at the filename or hash all the way through to analysis of the tags, contents, or watermarks.

And Dropbox is probably the most benign of mainstream cloud hosts. Google, Amazon, Apple and Microsoft all sell content and sign voluminous contracts for the sale of said content. It's not hard to imagine that they would or could be obliged to scan for infringing content and notify the content providers when they find any.

He wasn't making an analogy between how you find a hash collision and how you win the lottery -- only comparing the odds.

Dropbox uses SHA-256 hashes. I'm assuming this is what they use for this feature, since it's what they use internally for file identification and deduplication. They actually hash 2 MB file chunks, which means that any file more than 2 MB produces multiple hashes (one per chunk, naturally).
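To make the per-chunk hashing concrete, here is a small sketch. The 2 MB chunk size is simply the figure quoted above, and `chunk_hashes` is a made-up helper name; treat both as assumptions rather than a description of Dropbox's real client.

```python
import hashlib

# Chunk size as quoted in the comment above (an assumption, not a verified value).
CHUNK_SIZE = 2 * 1024 * 1024


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Return one SHA-256 hex digest per fixed-size chunk of the data."""
    return [
        hashlib.sha256(data[start:start + chunk_size]).hexdigest()
        for start in range(0, len(data), chunk_size)
    ]
```

A 5 MB file yields three digests under this scheme (two full chunks plus a 1 MB remainder), while anything at or under the chunk size yields exactly one.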

The "many chances of winning" you're referring to here is the birthday collision problem. A good, rough approximation is that for an N-bit hash, while the number of different hashes is 2^N, the number you can generate before risking a collision is about 2^(N/2). So, with SHA-256, we run no significant risk of collision until we've generated around 2^128 ~= 10^38 hashes.

The total amount of data stored worldwide is on the order of 1 ZB. That's room enough for about 10^15 2-MB chunks. Of course, some of our files might be smaller than this 2 MB chunk size, enabling us to be more efficient with storage. We might be able to get somewhere around 10^20 different files in there.

That's a strange and untenable use of all of the world's storage, and it still puts us about 18 orders of magnitude short of being able to risk a SHA-256 collision. If you had this giant set of a ton of different files, the probability of a collision existing is about 1 in 10^37.
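If you want to check those numbers yourself, the usual birthday approximation is p ≈ 1 - exp(-n(n-1) / (2·S)) for n items drawn from S equally likely hash values. The sketch below applies it; `collision_probability` is just an illustrative helper, and the approximation assumes the hash output is uniformly distributed.

```python
import math


def collision_probability(num_items, space_size):
    """Birthday approximation: chance that any two of num_items share a value."""
    exponent = -(num_items * (num_items - 1)) / (2.0 * space_size)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is very close to zero.
    return -math.expm1(exponent)


# 10^20 distinct files against the 2^256 possible SHA-256 digests.
p_sha256 = collision_probability(10**20, 2**256)

# Classic sanity check: 23 people and 365 birthdays gives roughly even odds.
p_birthday = collision_probability(23, 365)
```

With these inputs `p_sha256` comes out at roughly 4e-38, which matches the "about 1 in 10^37" figure to within rounding.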

So, short of a flaw in SHA-256, you can assume that a hash collision will never happen. We know of no such flaws. (If one is ever found, it will almost certainly be the case that a collision only occurs because one of the two files was specifically manipulated to produce it.)

On the other hand, the odds of winning the lottery are rarely worse than 1 in 10^9.