Abstract: Large, sparse, high-dimensional datasets pose scaling challenges for machine learning techniques. Several algorithmic tools have been developed to tackle this issue - random projection and locality-sensitive hashing are two such elegant constructs. Random projection is a well-studied algorithmic tool that has found multiple uses as a primitive for controlling both the time and space dependence of algorithms on the input dimension. I will show a machine learning application that motivates, in addition, preserving the sparsity of input vectors, and present a corresponding new construction. Locality-sensitive hashing (LSH) enables efficient near-neighbor search and duplicate detection in large, high-dimensional datasets. I will also discuss improvements to LSH techniques that yield a theoretically justified and empirically observed lift in performance.
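To give a flavor of what a sparsity-preserving random projection looks like (the talk's own construction is not specified here), below is a minimal sketch in the style of feature hashing: each nonzero coordinate is sent, with a random sign, to a single output bucket, so a vector with k nonzeros maps to a vector with at most k nonzeros. The function name and the tuple-hashing scheme are illustrative choices, not the speaker's construction.

```python
import numpy as np

def hashed_projection(indices, values, out_dim, seed=0):
    """Project a sparse vector (parallel lists of nonzero indices and
    values) into out_dim buckets.  Each input coordinate touches exactly
    one output bucket, so the projection preserves sparsity, unlike a
    dense Gaussian random projection.  Illustrative sketch only."""
    out = np.zeros(out_dim)
    for i, v in zip(indices, values):
        # Derive a bucket and a random sign from a hash of the index.
        # (Hashing tuples of ints is deterministic in CPython.)
        h = hash((seed, i))
        bucket = h % out_dim
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0
        out[bucket] += sign * v
    return out
```

The random signs make the projection unbiased for inner products in expectation, which is the property such constructions are typically analyzed for.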
Based on joint work with Ravi Kumar, John Langford, Tamas Sarlos, Alex Smola, and Kilian Weinberger.
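As background for the LSH part of the abstract, here is a minimal sketch of one classical LSH family, random-hyperplane hashing (SimHash) for cosine similarity; the specific improvements discussed in the talk are not detailed here, and this function is an illustration, not the speaker's method.

```python
import numpy as np

def simhash_signature(x, n_bits=16, seed=0):
    """Random-hyperplane LSH: each bit records which side of a random
    hyperplane the vector falls on.  Vectors at a small angle agree on
    most bits, so they tend to land in the same hash bucket, enabling
    sublinear near-neighbor search.  Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, len(x)))
    bits = planes @ x >= 0          # one bit per hyperplane
    return int("".join("1" if b else "0" for b in bits), 2)
```

Identical vectors always collide, while a vector and its negation disagree on essentially every bit; intermediate angles collide with probability decreasing in the angle.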

Speaker Profile: Anirban Dasgupta is currently a Senior Scientist at Yahoo! Labs, where he works on algorithmic problems for massive datasets. He did his undergraduate studies at IIT Kharagpur, and his doctoral thesis at Cornell University was on learning mixtures of distributions using spectral techniques. At Yahoo!, Anirban has worked on efficient crawling techniques, large-scale machine learning, scheduling, analysis of large social networks, and randomized algorithms in general.