Approximating the Jaccard Metric Efficiently: An introduction to shingleprinting and its applications

by

Mark S. Manasse
Microsoft Research, Silicon Valley, CA

Tuesday, April 29, 2003, 11:15 am

402 Walter Library

Given a large collection of things, it is often of interest to determine quickly whether many of them are effectively the same. Consider, for example, the problem of identifying music from a Napster-like service: the RIAA may want to ascertain which of the songs are copies of songs in its catalog. A web search engine might want to identify near-duplicate pages, to reduce the clutter on a results page or to reduce the number of pages in the full-text index. An intelligence agency scanning a crowd at a public event might want to identify which members of the crowd are likely to be suspected malefactors.
Once features have been extracted and hashed, each thing can be reduced to a set of bounded-range integers. The Jaccard metric defines the similarity of two sets as the cardinality of their intersection divided by the cardinality of their union. In this talk on shingleprinting, Dr. Manasse will present a technique for approximating the Jaccard metric by randomly sampling features from documents, in such a way that all pairs of highly similar things can be found efficiently, even in a collection of billions of things. He will also speak on some recent results of the PageTurner project, which uses shingleprinting to reduce data storage needs.
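
As a rough illustration of the idea described above, the following Python sketch computes the exact Jaccard similarity of two feature sets and then estimates it from a small random sample of features. It uses min-wise hashing, a standard sampling scheme for this purpose; it is not necessarily the exact method presented in the talk. The function names (shingles, jaccard, sketch, estimate_jaccard), the choice of 3-word shingles as features, and the sample size of 128 are all assumptions made for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(feature, seed):
    """64-bit hash of a string feature; each seed stands in for one random ordering."""
    digest = hashlib.blake2b(
        feature.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def sketch(features, num_samples=128):
    """For each seed, keep only the smallest hash value over all features."""
    return [min(_hash(f, seed) for f in features) for seed in range(num_samples)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sample positions where the two sketches agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog near the river bank")
    doc_b = shingles("the quick brown fox jumps over the lazy cat near the river bank")
    print("exact    :", jaccard(doc_a, doc_b))
    print("estimate :", estimate_jaccard(sketch(doc_a), sketch(doc_b)))
```

For any one seed, the two sets share the same minimum hash with probability equal to their Jaccard similarity, so the fraction of agreeing positions is an estimate of it. Each document is reduced to a fixed-size sketch regardless of its length, which is what makes comparison across a very large collection plausible; finding all similar pairs among billions of items would need further indexing that this sketch does not attempt.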

Dr. Mark Manasse has been a senior researcher at Microsoft since 2001. He received his Ph.D. in Mathematical Logic in 1982 and his M.S. degree in Computer Science in 1981 from the University of Wisconsin. He has worked in a wide variety of areas in computer science. His previous successes include multiprocessor memory architectures, which led to work on competitive algorithms, and window systems, which led to work on methodologies for concurrent programming and to the studies behind low-bandwidth X. Dr. Manasse has authored and co-authored several publications in his field. Additional information can be found at:
http://research.microsoft.com/users/manasse/.