Saturday, February 07, 2009

Book review: Introduction to Information Retrieval

My old favorite book on search, "Managing Gigabytes", is getting quite dated at this point. I've been looking for one I liked as much for the last few years, but just hadn't come across anything as practical or useful as that old text.

Now, I think I may have found a new favorite. Three search gurus, Chris Manning, Prabhakar Raghavan (head of Yahoo Research), and Hinrich Schütze, just published a wonderful new book, "Introduction to Information Retrieval".

If you work in search, or if you are just the kind of person who reads textbooks for fun, this is a great one. It not only describes how to build a search engine (including crawling, indexing, ranking, classification, and clustering), but also offers the kind of opinionated wisdom you can only get from people who have had substantial experience using these techniques at large scale.

Let me try to pick out some excerpts to show at least a few examples of why I found this book so valuable. First, on the difference between perceived relevance and measured relevance:

The key utility measure is user happiness. Speed of response and the size of the index are factors in user happiness. It seems reasonable to assume that relevance of the results is the most important factor: blindingly fast, useless answers do not make a user happy. However, user perceptions do not always coincide with system designers' notions of quality. For example, user happiness commonly depends very strongly on user interface design issues, including the layout, clarity, and responsiveness of the user interface, which are independent of the quality of the results returned.

This is a point that is often missed in information retrieval. It is extremely common to measure the quality of search results by having judges sit down and mark how relevant they think each result is. That ignores speed, layout, the quality of nearby results, and the many other factors that go into perceived quality and affect user satisfaction.

On compression:

One benefit of compression is immediately clear. We need less disk space.

There are two more subtle benefits of compression. The first is increased use of caching ... With compression, we can fit a lot more information into main memory. [For example,] instead of having to expend a disk seek when processing a query with t, we instead access its postings list in memory and decompress it ... Increased speed owing to caching -- rather than decreased space requirements -- is often the prime motivator for compression.

The second more subtle advantage of compression is faster transfer of data from disk to memory ... We can reduce input/output (IO) time by loading a much smaller compressed posting list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.

The authors go on to say that simple compression techniques might be better for some applications than more complicated algorithms because small additional reductions in data size may not be worth any substantial increase in decompression time.
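
To make that trade-off a bit more concrete, here is a minimal sketch of one simple scheme, variable byte encoding applied to the gaps between sorted docIDs. The function names and the sample postings list are my own, and setting the high bit on the last byte of each number is just one common convention; treat this as an illustration, not a production codec.

```python
def vb_encode_number(n):
    """Encode one non-negative integer using 7 payload bits per byte."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # high bit marks the final byte of this number
    return bytes(chunks)


def vb_encode(doc_ids):
    """Encode a sorted list of docIDs as variable-byte gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - prev)  # store the gap, not the ID
        prev = doc_id
    return bytes(out)


def vb_decode(data):
    """Decode variable-byte gaps back into the original docIDs."""
    doc_ids = []
    n = 0
    prev = 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte  # more bytes follow for this number
        else:
            n = n * 128 + (byte - 128)  # last byte of this number
            prev += n
            doc_ids.append(prev)
            n = 0
    return doc_ids


postings = [824, 829, 215406]  # hypothetical postings list
encoded = vb_encode(postings)
print(len(encoded), vb_decode(encoded))  # 6 [824, 829, 215406]
```

In this toy case, three docIDs fit in six bytes instead of the twelve that fixed 32-bit integers would need, and decoding is just a few additions and comparisons per byte.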

On supporting phrase searches:

If users commonly query on particular phrases, such as Michael Jackson, it is quite inefficient to keep merging positional posting lists. A combination strategy uses a phrase index, or just a biword index, for certain queries, and uses a positional index for other phrase queries. Good queries to include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare.

Adding Britney Spears as a phrase index may only give a speedup factor to that query of about three, because most documents that mention either word are valid results, whereas adding The Who as a phrase index entry may speed up that query by a factor of 1,000. Hence, having the latter is more desirable, even if it is a relatively less common query.

Another factor that might be important to consider when picking phrases to put in the phrase index is the caching layer. If [Britney Spears] is almost always cached when queried, it will be satisfied without hitting the posting lists. You probably want to use the frequency of the query hitting the posting lists, not the frequency of the query in the logs, when looking at what to put in a phrase index.
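
Here is a small sketch of the two strategies side by side, assuming whitespace tokenization and a made-up three-document collection. The function names and data are mine, purely for illustration. The positional version has to walk the postings of both words, while the biword version is a single lookup.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build a positional index and a biword index from {doc_id: text}."""
    positional = defaultdict(lambda: defaultdict(list))  # term -> doc -> positions
    biword = defaultdict(set)  # (word1, word2) -> docs containing that pair
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pos, tok in enumerate(tokens):
            positional[tok][doc_id].append(pos)
        for w1, w2 in zip(tokens, tokens[1:]):
            biword[(w1, w2)].add(doc_id)
    return positional, biword


def phrase_via_positions(positional, w1, w2):
    """Answer a two-word phrase query by intersecting positional postings."""
    hits = set()
    docs1 = positional.get(w1, {})
    docs2 = positional.get(w2, {})
    for doc_id in docs1.keys() & docs2.keys():
        second = set(docs2[doc_id])
        # The phrase matches if w2 appears immediately after some w1.
        if any(pos + 1 in second for pos in docs1[doc_id]):
            hits.add(doc_id)
    return hits


docs = {
    1: "the who played a show",
    2: "who is the singer",
    3: "the band that the who inspired",
}
positional, biword = build_indexes(docs)
print(sorted(phrase_via_positions(positional, "the", "who")))  # [1, 3]
print(sorted(biword[("the", "who")]))  # [1, 3]
```

Document 2 contains both words but not the phrase, so the positional approach has to examine it and throw it away. That wasted work is what gets expensive when both words are very common.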

On compression, skip lists, and other attempts to speed up access to the index:

Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you're running a search engine with everything in memory, the equation changes again.

A nice, long excerpt on Naive Bayes as a classifier and why it works so well:

Naive Bayes is so called because the independence assumptions ... are indeed very naive for a model of natural language.

The conditional independence assumption states that features are independent of each other given the class. This is hardly ever true for terms in documents. In many cases, the opposite is true. The pairs hong and kong or london and english ... are examples of highly dependent terms.

In addition, the multinomial model makes an assumption of positional independence [and] the Bernoulli model ignores positions in documents altogether because it only cares about absence or presence ... How can NB be a good text classifier when its model of natural language is so oversimplified?

The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good ... It does not matter how accurate the estimates are ... NB classifiers estimate badly, but often classify well.

Even if it is not the method with the highest accuracy for text, NB has many virtues that make it a strong contender for text classification. It excels if there are many equally important features that jointly contribute to the classification decision. It is also somewhat robust to noise features ... and concept drift.

[But,] NB's main strength is its efficiency ... It is often the method of choice if (i) squeezing out a few extra percentage points of accuracy is not worth the trouble in a text classification application, (ii) a very large amount of training data is available and there is more to be gained from training on a lot of data than using a better classifier on a smaller training set, or (iii) if its robustness to concept drift can be exploited.

[Many] have shown that the average effectiveness of NB is uncompetitive with classifiers like SVMs when trained and tested on independent and identically distributed (i.i.d.) data ... However, these differences may be invisible or even reverse themselves when working in the real world where, usually, the training sample is drawn from a subset of the data to which the classifier will be applied, the nature of the data drifts over time ... and there may well be errors in the data (among other problems). Many practitioners have had the experience of being unable to build a fancy classifier for a certain problem that consistently performs better than NB.
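
To show how little machinery is involved, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. The function names and the tiny training set are my own invention for illustration; training really is just counting terms per class.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Count documents per class and term occurrences per class."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, term_counts, vocab


def classify_nb(model, text):
    """Return the class with the highest log prior plus log likelihoods."""
    class_doc_counts, term_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore terms never seen in training
            # Add-one smoothing keeps unseen (term, class) pairs above zero.
            prob = (term_counts[label][tok] + 1) / (total_terms + len(vocab))
            score += math.log(prob)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [  # hypothetical training data
    ("spam", "cheap pills cheap offer"),
    ("spam", "cheap offer now"),
    ("ham", "meeting notes attached"),
    ("ham", "lunch meeting now"),
]
model = train_nb(training)
print(classify_nb(model, "cheap offer"))    # spam
print(classify_nb(model, "meeting notes"))  # ham
```

The scores are poor probability estimates, as the excerpt says, but only their ranking matters for the classification decision.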

On feature selection:

Feature selection [picks] a subset of the terms occurring in the training set and [uses] only this subset as features in text classification. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient ... this is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features ... [that] might produce ... overfitting. It may appear counterintuitive at first that a seemingly weaker classifier is advantageous ... [but] we will see that weaker models are often preferable when limited training data are available.

Of the two NB models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli NB classifier requires some form of feature selection or else its accuracy will be low.

Mutual information and chi-squared represent rather different feature selection methods ... Because its criterion is significance, chi-squared selects more rare terms (which are often less reliable indicators) than mutual information. But the selection criterion of mutual information also does not necessarily select the terms that maximize classification accuracy. Despite the differences, the classification accuracy of feature sets selected with chi-squared and MI does not seem to differ systematically ... As long as all strong indicators and a large number of weak indicators are selected, accuracy is expected to be good. Both methods do this.
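
Both criteria can be computed from the same four document counts for a term and a class. The sketch below uses the standard 2x2 contingency-table formulas; the function names, argument names, and example counts are my own assumptions. Here `n11` is the number of documents that contain the term and are in the class, `n10` contain the term but are not in the class, `n01` are in the class without the term, and `n00` are neither.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class membership."""
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term marginal, class marginal).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # an empty cell contributes nothing
            total += (joint / n) * math.log2(
                n * joint / (term_marginal * class_marginal)
            )
    return total


def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic for independence of term and class."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


# A term that perfectly predicts the class versus one that is independent of it.
print(mutual_information(20, 0, 0, 20), chi_squared(20, 0, 0, 20))      # 1.0 40.0
print(mutual_information(10, 10, 10, 10), chi_squared(10, 10, 10, 10))  # 0.0 0.0
```

To select features, one would score every term against the class with either function and keep the top-scoring terms.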

On the effectiveness of linear classifiers and why using non-linear classifiers is not an obvious win:

For some problems, there exists a nonlinear classifier with zero classification error, but no such linear classifier. Does that mean we should always use a nonlinear classifier?

It is perhaps surprising that so many of the best-known text classification algorithms are linear. Some of these methods, in particular linear SVMs, regularized logistic regression and regularized linear regression, are among the most effective known methods ... Typical classes in text classification are complex and seem unlikely to be modeled well linearly. However, this intuition is misleading for the high dimensional spaces we typically encounter in text applications. With increased dimensionality, the likelihood of linear separability increases rapidly.

Excellent advice on picking which classifier to use:

If you have fairly little [training] data and you are going to train a supervised classifier, then machine-learning theory says you should stick to a classifier with high bias ... There are theoretical and empirical results that Naive Bayes does well in such circumstances ... A very low bias model like nearest neighbor is probably contraindicated. Regardless, the quality of the model will be adversely affected by the limited training data ... Often, the practical answer is to work out how to get more labeled data as quickly as you can.

If there is a reasonable amount of labeled data, then you are in a perfect position to use ... an SVM. However, if you are deploying a linear classifier such as an SVM, you should probably ... [overlay] a Boolean rule-based classifier over the machine-learning classifier ... If management gets on the phone and wants the classification of a particular document fixed right now, then this is much easier to do by hand-writing a rule than by working out how to adjust the weights of an SVM without destroying overall classification accuracy. This is one reason why ... decision trees, which produce user interpretable Boolean-like models, retain considerable popularity.

If a huge amount of data are available, then the choice of classifier probably has little effect on your results and the best choice may be unclear (cf. Banko and Brill 2001). It may be best to choose a classifier based on the scalability of training or even runtime efficiency.

A nice, intuitive explanation of why support vector machines (SVMs) work well:

For two-class, separable training data sets, there are a lot of possible linear separators ... Although some learning methods such as the perceptron algorithm find just any linear separator, others, like Naive Bayes, search for the best linear separator according to some criterion. The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point ... Maximizing the margin seems good because points near the decision surface represent very uncertain classification decisions ... A slight error in measurement or a slight document variation will not cause a misclassification.

[Also,] by construction, an SVM classifier insists on a large margin around the decision boundary. Compared with a decision hyperplane, if you have to place a fat separator between the classes, you have fewer choices of where it can be put. As a result of this, the memory capacity of the model has been decreased, and hence we expect that its ability to correctly generalize to test data has been increased.

The superlinear training time of traditional SVM algorithms makes them difficult or impossible to use on very large training sets. Alternative traditional SVM solution algorithms which are linear in the number of training examples scale badly with a large number of features, which is another standard attribute of text problems ... [and it] remains much slower than simply counting terms as is done in the Naive Bayes model.

Nonlinear SVMs ... [are usually] impractical ... [and] it can often be cheaper to materialize the higher order features and train a linear SVM ... The general idea is to map the original feature space to some higher dimensional feature space where the training set is [linearly] separable ... The easy and efficient way of doing this ... [is] the kernel trick ... 90% of the work with kernels uses ... polynomial kernels and radial basis functions ... Really, we are talking about some quite simple things. A polynomial kernel allows us to model feature conjunctions (up to the order of the polynomial) ... [For example,] pairs of words ... [require] a quadratic kernel. If triples of words give distinctive information, then we need to use a cubic kernel. A radial basis function allows you to have features that pick out circles (hyperspheres), although the decision boundaries become much more complex as multiple such features interact. A string kernel lets you have features that are character subsequences of terms.
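
The claim that a quadratic kernel models pairs of words can be checked directly. The sketch below uses the homogeneous form K(x, y) = (x . y)^2, which is a simplifying assumption on my part (other quadratic kernels add a constant term), and compares it with an ordinary dot product over explicitly materialized pair features. The function names and vectors are hypothetical.

```python
from itertools import product


def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def quadratic_kernel(x, y):
    """Homogeneous quadratic kernel: the squared dot product."""
    return dot(x, y) ** 2


def pair_features(x):
    """Explicitly materialize one feature for every ordered pair of inputs."""
    return [a * b for a, b in product(x, repeat=2)]


# Binary term-presence vectors over a four-word vocabulary.
x = [1, 0, 1, 1]
y = [1, 1, 0, 1]
print(quadratic_kernel(x, y))                  # 4
print(dot(pair_features(x), pair_features(y)))  # 4
```

The kernel gets the same answer while touching 4 numbers per vector instead of 16, which is the whole point of the trick. With a sparse text vocabulary, though, materializing only the pairs that actually occur can still be the cheaper route, as the excerpt notes.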

On clustering:

K-means is the most important flat clustering algorithm ... The ideal cluster in K-means is a sphere with the centroid as its center of gravity ... The first step of K-means is to select as initial cluster centers K randomly selected documents, the seeds. The algorithm then moves the cluster centers around in space to minimize RSS ... the squared distance of each [item] from its [closest] centroid.

Hierarchical clustering outputs ... a [useful tree] structure ... [and] does not require us to prespecify the number of clusters ... [but is at least] O(N^2) and therefore infeasible for large sets of 1,000,000 or more documents. For such large sets, HAC can only be used in combination with a flat clustering algorithm like K-means ... [which] combines the ... higher reliability of HAC with the efficiency of K-means.
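
For reference, the K-means loop described in the first excerpt fits in a few lines. This is a minimal sketch, assuming points are plain numeric tuples and using squared Euclidean distance; the function names, the fixed random seed, and the toy points are my own choices for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iters=100):
    """Return (labels, centers, rss) after running K-means to convergence."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # K randomly selected items as seeds
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the centroid of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    rss = sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))
    return labels, centers, rss


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, rss = kmeans(points, k=2)
print(labels, rss)
```

Each iteration costs time proportional to the number of points times K, which is why it scales to collections where the O(N^2) cost of hierarchical clustering does not.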

And there is much, much more. This is just a small sample of the insights in those pages. Definitely worth a read.

By the way, the references sections at the end of each chapter offer excellent surveys of recent work on each topic. They are worth their weight in gold and should not be missed.

The authors made the full text of their book available online, but I'd recommend picking up a copy for your shelf too. It is too good a book to just skim through online; it deserves a more thorough read.

10 comments:

Thanks for the great writeup with useful pointers from the book. I really liked Manning and Schütze's "Foundations of Statistical Natural Language Processing", and I'm happy I bought this latest book. I have no formal IR training, but I thought this book was very approachable from a general CS and Comp. Ling. background.

Greg - This is a no-brainer to buy. One question before I do: does it address temporal issues of IR (i.e. what happens when static relevance collides with the timely nature of corpora like the blogosphere and other pieces of social content)?

The key utility measure is user happiness...This is a point that is often missed in information retrieval.

In the original Cranfield experiments (1960s), didn't Cleverdon propose 6 different metrics of information retrieval effectiveness? Relevance, and the associated measures of recall and precision, was just one of those 6 metrics. The other 5 metrics included things like speed, coverage, etc. It's been a long time since I read that report, so I don't remember offhand. But all these measures have been part of IR from the beginning.

Historically, however, getting good relevance was relatively much harder than getting a good user interface. So if what you are saying is that most IR research has concentrated on algorithmic relevance, rather than UI, you're probably right. But nowadays the more interesting UI research out there is not just on "layout" and "clarity" of the results. It's on the input side of things. Often a single-line input box, with two buttons (search, and "lucky") does not translate to eliciting the best query from the user. Interactive search, query as a dialogue, etc. are becoming more necessary to ensure user happiness these days.

Stateless (aka typical web search), or state-filled but non-transparent (aka recent history personalization) might yield decent results in terms of relevance. But I often find myself frustrated at my inability to get insight into why the results are being returned and, more importantly, what I can do to overcome/change a poor result. I have no idea how that sense of user frustration is measured, in a web environment, but I'll bet it is there. Interfaces that offer more transparency into the search process are more important than strictly relevance alone.

Take a look at the recent HCIR workshops, which are trying a little harder to explore the overlap between IR and HCI, and get at some of these issues.

@matthewhurst: The book is titled "Introduction" to Information Retrieval. I think that temporal retrieval is a more advanced topic. For that matter, the book probably also does not cover other, more advanced non-textual types of information retrieval, such as image retrieval (whether content-based or textual context-based) and music retrieval.

It's an important thing to keep in mind, and that's something I think most people tend to forget: Information retrieval does not just mean "text" retrieval or "web" retrieval.

But those are, again, more advanced topics, and probably not covered by this book.

Greg: Here is one of the reports that I had mentioned. It's Cranfield III. I know there are other mentions of evaluation measures beyond just recall and precision in the early days of IR, but it's been about a decade since I read those papers. So I haven't been able to re-find the best ones yet. I know there is another paper out there that goes into much more detail than the one I am about to cite.

So check out page 4 of this PDF. Note that in addition to relevance, Cranfield mentions time/speed, presentation/layout, and amount of effort required by the user to construct a useful query.

The factor that I personally have been very interested in for quite some time is the "amount of user effort" metric. There are philosophical approaches that say it is up to the user to repeatedly reformulate their own queries, when things don't work. The user then has to make the additional mental effort to come up with a bunch of various strategies. There are other approaches that try to assist the user in query reformulation, results understanding and summarization, etc. These are attempts to reduce the amount of user effort.

How one measures the effort, and how one compares two different information seeking approaches, is very much an open problem, I think. But from an early day, the recognition was there, that these other factors needed to be considered.

Also check out the 2-3 paragraphs following this long list of user evaluation criteria. I find it particularly interesting that the "ease of making one's needs known" is an important factor. Also interesting is a discussion of the various types of information needs, ranging from known-item and fact-finding information needs (1 and 2) all the way to literature searches and alerting services (3 and 4). In the comprehensive literature search department (which I think also includes information needs as diverse as planning activities for one's vacation and exploring a genre of music) there is an acknowledgment that the user is willing to make a speed vs. quality tradeoff, i.e. that the user will usually be willing to sacrifice a little bit of response time in exchange for results which are of markedly better quality.

One of my ongoing frustrations with web search engines is that they do not allow me to make that tradeoff. Every single search is done with the same amount of focus on speed, with no ability for me to specify the fact that I want "deeper" quality links and am willing to wait 2 seconds, 10 seconds, or even a whole minute, to get them.

Realistically, I find it very hard to believe that every single search has the exact same speed requirements. It's very limiting to only design your system to that one end goal.

I'd like to point out that, on PDF page 3 of that document, "overall efficiency" is one of the proposed metrics, and "number of searches made" is one of the ways of evaluating overall efficiency.

Currently, it seems that most web IR systems have a built-in bias or expectation that the correct way to deal with more comprehensive scale information needs is to have the user manually enter query after query after query.

However, the time spent searching, Karen Sparck-Jones suggests, should be measured not only in terms of how quickly the engine responds to any one query, but also in terms of the total amount of time (and effort) it takes to issue all the multiple queries, until the user is satisfied.

Maybe someone with more firsthand inside knowledge can debunk me, but I see very little being done, comparison-wise, in terms of the tradeoff between single-query, single-iteration response time, and total number of search queries issued. It seems like most search engines model these as separate factors, when in fact they are joint factors.