Case Studies: Finding Similar Documents
A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover?
In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.
Learning Outcomes: By the end of this course, you will be able to:
- Create a document retrieval system using k-nearest neighbors.
- Identify various similarity metrics for text data.
- Reduce computations in k-nearest neighbor search by using KD-trees.
- Produce approximate nearest neighbors using locality sensitive hashing.
- Compare and contrast supervised and unsupervised learning tasks.
- Cluster documents by topic using k-means.
- Describe how to parallelize k-means using MapReduce.
- Examine probabilistic clustering approaches using mixture models.
- Fit a mixture of Gaussians model using expectation maximization (EM).
- Perform mixed membership modeling using latent Dirichlet allocation (LDA).
- Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
- Compare and contrast initialization techniques for non-convex optimization objectives.
- Implement these techniques in Python.

Reviews

AG

Nice course, with practical material and a good analysis of each topic, but the practical part on LDA was restricted to GraphLab users only, which is a weak point; everything else is fine.

PK

Sep 08, 2017

Rating: 5/5 stars

Great course, all the explanations are so good and well explained in the slides. Programming assignments are pretty challenging, but give really good insight into the algorithms!

Thanks!

From the lesson

Nearest Neighbor Search

We start the course by considering a retrieval task of fetching a document similar to one someone is currently reading. We cast this problem as one of nearest neighbor search, which is a concept we have seen in the Foundations and Regression courses. However, here, you will take a deep dive into two critical components of the algorithms: the data representation and metric for measuring similarity between pairs of datapoints. You will examine the computational burden of the naive nearest neighbor search algorithm, and instead implement scalable alternatives using KD-trees for handling large datasets and locality sensitive hashing (LSH) for providing approximate nearest neighbors, even in high-dimensional spaces. You will explore all of these ideas on a Wikipedia dataset, comparing and contrasting the impact of the various choices you can make on the nearest neighbor results produced.
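Before turning to scalable alternatives, it helps to see what the naive approach looks like. The sketch below is an illustrative toy implementation (the function names, vectors, and tiny vocabulary are mine, not the course's code): it represents documents as word-count vectors and does brute-force nearest neighbor search under cosine distance, scoring the query against every document.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest_neighbors(query, corpus, k=1):
    """Brute-force k-NN: compute the distance to every document, keep the k best."""
    dists = [cosine_distance(query, doc) for doc in corpus]
    return np.argsort(dists)[:k]

# Toy word-count vectors over a 4-term vocabulary.
corpus = np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 2],
    [2, 1, 1, 0],
], dtype=float)
query = np.array([2, 0, 1, 0], dtype=float)

print(nearest_neighbors(query, corpus, k=2))  # -> [0 2]
```

The cost of this loop grows linearly with the corpus size for every single query, which is exactly the burden that KD-trees and LSH are introduced to reduce.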

Taught by:

Emily Fox

Carlos Guestrin

Transcript

So the last way in which we're going to use KD-trees for nearest neighbor search is for approximate nearest neighbor search. And this is something that we're going to spend a lot of time talking about in the next section. But this idea is really important, because so far, everything we've been talking about assumes that, out of all the articles out there, you want to find the exact nearest neighbor. That assumption relies heavily on first assuming that it's important to find the nearest neighbor and not just a good alternative, but also on the fact that you really believe in your representation of your data, as well as in the distance measure that you're using to compute these similarities. Because if you don't have 100% faith in those to begin with, then what's the difference between finding the nearest neighbor versus one that's pretty close to it? So the question is, can we do things even more efficiently if we don't care about finding, according to our setup, the exact nearest neighbor, but just a pretty good approximation? Within the context of KD-trees, the way we can do our approximate nearest neighbor search is as follows. Remember, before, we maintained the distance to the nearest neighbor found so far. I'm going to call this r = distance to the nearest neighbor so far. But now, instead of pruning based on the distance to the nearest neighbor so far, we're going to prune more aggressively. That's what's going to give us the added efficiency, but at the cost of an approximation. And to prune more aggressively, instead of pruning based on distance r, we're going to prune based on a distance r/alpha, where alpha is some number greater than 1. So this is a smaller radius that we're using, which means it's more likely that the distance to the bounding box exceeds r/alpha than that it exceeds r.
Okay, so what are the properties of performing nearest neighbor search with this more aggressive pruning? Well, we know that if we return a neighbor at some distance r, then there's no neighbor within r/alpha. So what we're saying here is, let's say this is our query point xq, this is our distance r, and this is the nearest neighbor that we're going to return. If we look at, let me draw in a different color, I think it'll help make this clear. So if this is r/alpha, then we know that there could be points here. These would all be data points that were actually closer than the nearest neighbor that we returned. But we know that there are no points within this inner region; otherwise we would have found them, and that would have been our distance to the nearest neighbor. And we're pruning all other things that fall farther than that. Okay, so by doing this, we get a fairly good approximation, and actually the bound that we're stating here is rather loose; in practice, we're often much closer to the optimal solution. And importantly, it saves a ton of the searching you have to do, of course, if alpha is reasonably large. If alpha is 1, then it defaults back to our exact nearest neighbor search. So in a lot of cases, doing approximate nearest neighbor search is pretty much just as good as what you would do with brute force, because there's so much noise in how you're computing distances to begin with. And whether that's really relevant to, let's say, me sitting here reading an article, whatever you present me as a pretty good approximation to the nearest neighbor is probably going to satisfy my needs for exploring other literature. Okay, so we're going to talk a lot more about alternative ways to do approximate nearest neighbor search, in ways that also scale better with higher dimensions. But within the context of KD-trees, this is a very effective way of performing approximate nearest neighbor search.
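The pruning rule above can be sketched as a minimal KD-tree in Python. This is an illustrative toy implementation, not the course's code: names like `build` and `ann_search` are mine, and for simplicity it bounds a pruned subtree by the distance to the splitting plane (a looser bound than the full bounding-box distance discussed in the lectures). Setting `alpha = 1` recovers exact search; `alpha > 1` prunes more aggressively.

```python
import numpy as np

class Node:
    def __init__(self, point=None, dim=None, val=None, left=None, right=None):
        self.point, self.dim, self.val = point, dim, val
        self.left, self.right = left, right

def build(points, depth=0):
    """Axis-aligned KD-tree: split on the median along cycling dimensions."""
    if len(points) == 1:
        return Node(point=points[0])            # leaf stores one data point
    dim = depth % points.shape[1]
    points = points[points[:, dim].argsort()]
    mid = len(points) // 2
    return Node(dim=dim, val=points[mid, dim],
                left=build(points[:mid], depth + 1),
                right=build(points[mid:], depth + 1))

def ann_search(node, query, alpha=1.0, best=None):
    """Approximate NN: a subtree is pruned when the distance to its splitting
    plane is at least r/alpha, where r is the best distance found so far."""
    if node.point is not None:                   # leaf: try to update the best
        d = np.linalg.norm(query - node.point)
        if best is None or d < best[0]:
            best = (d, node.point)
        return best
    near, far = ((node.left, node.right) if query[node.dim] <= node.val
                 else (node.right, node.left))
    best = ann_search(near, query, alpha, best)  # descend the query's side first
    # The plane distance lower-bounds every distance in the far subtree;
    # recurse into it only if that bound is below r/alpha.
    if best is None or abs(query[node.dim] - node.val) < best[0] / alpha:
        best = ann_search(far, query, alpha, best)
    return best

rng = np.random.default_rng(0)
pts = rng.standard_normal((500, 2))
tree = build(pts)
q = np.zeros(2)
d_exact, _ = ann_search(tree, q, alpha=1.0)   # exact nearest neighbor
d_approx, _ = ann_search(tree, q, alpha=2.0)  # at most 2x the exact distance
```

With `alpha > 1`, the returned distance is guaranteed to be within a factor of `alpha` of the true nearest neighbor distance, mirroring the "no neighbor within r/alpha" property described above, while visiting fewer subtrees than exact search.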
So to wrap up on KD-trees, first I want to say that they really are a very cool and efficient data structure. But I also want to discuss a few other things. One is the fact that there are lots and lots of variants. We talked about some of these in terms of heuristics for how to choose the splitting dimension and the splitting value, and all sorts of other things. And in addition, instead of just making these axis-aligned cuts, there are other space-partitioning structures, such as ball trees, with different shapes you can plop down. What we're going to talk about next can be thought of as one of these alternatives to axis-aligned cuts. And the other thing that we've seen so far is the fact that, in performing nearest neighbor search, things rely very, very heavily on how you're representing your data and how you're computing the distance between two different data points. And for both of these, KD-trees and nearest neighbor search, things become much, much more challenging when you're in high-dimensional spaces. For KD-trees, we talked about the fact that the number of inspections you have to do can be exponential in the dimensionality. So just as a rule of thumb, typically KD-trees are only useful if the number of observations you have is much, much larger than 2^d, 2 to the number of dimensions you have. So in high-dimensional cases, often you don't have enough observations to really make a KD-tree worthwhile. And in terms of thinking about nearest neighbor search, forgetting the challenges specific to KD-trees, in high-dimensional spaces things are really hard, because in a really high-dimensional feature vector, a lot of those features are going to be basically just noise, and those can swamp your distance calculations. And so what happens in high-dimensional spaces is that basically everything is far away from everything else.
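This "everything is far away from everything else" effect is easy to see empirically. The sketch below is illustrative code with a made-up helper name: it measures how the spread of distances from a random query to random points collapses as the dimension grows, which is one reason both KD-tree pruning and raw distance calculations degrade in high dimensions.

```python
import numpy as np

def relative_spread(n_points, dim, seed=0):
    """(max - min) / min over distances from a random query to random points.
    A large value means the nearest point is much closer than the farthest;
    a value near 0 means all points are roughly equally far away."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # uniform points in the unit cube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The spread shrinks dramatically as the dimension grows.
for d in (2, 10, 100, 1000):
    print(d, relative_spread(1000, d))
```

In low dimensions the nearest point is many times closer than the farthest, so pruning is effective; by d = 1000 the distances nearly coincide, so a bound like r/alpha excludes almost nothing, and the 2^d rule of thumb says you would need astronomically many observations for the tree to pay off.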
And so you really need techniques to learn which features are important, or how to weight the different features. And like we mentioned before, that's a really, really challenging task that requires a lot of domain knowledge or a lot of thought and exploration. And lastly, I really want to thank Andrew Moore for the outline of these slides. The slides on KD-trees were based very heavily on a really nice presentation that he put together on KD-trees. So thanks for the permission to use that. [MUSIC]