Similarity searches accelerate P2P downloads by 30-70 percent

P2P file sharing offers the possibility of dramatically increased download speeds by avoiding the problems associated with a single (overloaded) server. But even P2P downloads can bog down when files are first seeded or if few active users have a copy of the file. A research team with members at Carnegie Mellon, Purdue, and Intel thinks they've found a way around some of these limitations using what they call Similarity-Enhanced Transfer (SET), a technique they claim can speed up P2P downloads by anywhere from 30 to 70 percent. They'll be presenting their technique at the 4th Symposium on Networked Systems Design and Implementation tomorrow.

The "Similarity" portion of SET comes from the realization that many of the files being shared contain pieces of identical data. Examples include music files that differ only in terms of tags, movies or movie trailers that are dubbed in different languages, and updated versions of software. Like other P2P systems, SET divides large files into small segments. Once that process is complete, however, the SET software searches for similar files using a method called "handprinting," which is similar to the pattern matching techniques used to cluster search results or filter spam. Once similar files are identified, they are scanned for any individual chunks that are identical to pieces of the file being downloaded.
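The chunk-and-match idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the researchers' implementation: the fixed chunk size, the use of SHA-1, the rule of taking the k smallest chunk hashes as the "handprint," and all function names here are assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # assumed chunk size, chosen only for illustration
HANDPRINT_SIZE = 4      # assumed number of hashes kept in a handprint


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and hash each one."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def handprint(hashes: list[str], k: int = HANDPRINT_SIZE) -> set[str]:
    """Assumed handprint rule: keep the k smallest distinct chunk hashes."""
    return set(sorted(set(hashes))[:k])


def looks_similar(print_a: set[str], print_b: set[str]) -> bool:
    """Treat two files as candidates if their handprints share any hash."""
    return bool(print_a & print_b)


def shared_chunks(target: list[str], candidate: list[str]) -> set[str]:
    """Chunk hashes of the target file that the candidate can also supply."""
    return set(target) & set(candidate)


# Two made-up "files" with different one-chunk tags and identical bodies,
# loosely modeled on music files that differ only in their tags.
body = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
file_a = b"A" * CHUNK_SIZE + body
file_b = b"B" * CHUNK_SIZE + body

hashes_a = chunk_hashes(file_a)
hashes_b = chunk_hashes(file_b)

similar = looks_similar(handprint(hashes_a), handprint(hashes_b))
reusable = shared_chunks(hashes_a, hashes_b)
print(similar, len(reusable), "of", len(hashes_a), "chunks available elsewhere")
```

Comparing small handprints first keeps the similarity search cheap; the full chunk-by-chunk comparison runs only against files that pass that test. Fixed-size chunks are the simplest choice but only line up when the differing data has the same length in both files, so a real system would need a more robust way of choosing chunk boundaries.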

As a result, SET should greatly expand the available sources of any given file. In practice, it seemed to work pretty well. Using existing P2P networks, the researchers were able to grab a 30MB movie trailer in only a third of the time, since their software was able to find other sources that were about 50 percent similar. The rate of an MP3 download shot up by over 70 percent.

We may see SET appearing in clients and distribution services soon. The presentation will come with actual implementation code, and the team hopes to see others put it to use. "This is a technique that I would like people to steal," said David Andersen of Carnegie Mellon. "Developers should just take the idea and use it in their own systems."