This is the Terrier Team blog. It is managed by several members of the Terrier Team at the University of Glasgow. It is used to publicise our research projects, to post news about various research work and activities performed by the team, and to share information and thoughts on issues in information retrieval and search engines.

Tuesday, March 26, 2013

This is the second post in a two-part series addressing our recent research in learning to rank. While the previous blog post addressed the role of the sample within learning to rank, and its impact on effectiveness and efficiency, in this blog post, I'll be discussing the role of different features within the learning to rank process.

Features

Over the years, various IR approaches have been proposed, and these have naturally become features within learned approaches. Features can be calculated on the documents in a query independent (e.g. URL length, PageRank) or query dependent (e.g. weighting or proximity models) manner. For instance, the LETOR learning to rank test collections deploy various (query dependent) weighting models calculated on different fields (body, title, anchor text, and URL). We have successfully been using the same four fields for representing Web documents in our participation in the TREC Web tracks.

The role of different weighting models - particularly those calculated on different fields - within learning to rank intrigued us, and formed the basis of an article, About Learning Models with Multiple Query Dependent Features, that we recently published in ACM Transactions on Information Systems (TOIS). In particular, some learned models take the form of linear combinations of feature scores, yet Robertson [CIKM 2004] warned against the linear combination of weighting model scores. Nevertheless, the LETOR test collections deploy weighting models calculated separately on each field (e.g. BM25 body, BM25 title, BM25 anchor text). Hence, among other research questions, we re-examined the role of field-based weighting models in the learning to rank era. Our findings on the TREC ClueWeb09 corpus showed that field-based models such as BM25F and PL2F remain important for effective learned models.
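To illustrate the distinction at stake, here is a minimal sketch (with invented field weights and a deliberately simplified saturation function that omits length normalisation and the full BM25 details): BM25F merges the weighted per-field frequencies into a single pseudo-frequency before applying the non-linear term frequency saturation, whereas a linear combination of per-field scores saturates each field separately - the approach Robertson warned against.

```python
def bm25_saturation(tf, k1=1.2):
    # Simplified BM25 term-frequency saturation (length normalisation omitted)
    return tf * (k1 + 1) / (tf + k1)

def linear_combination(field_tfs, field_weights, idf):
    # Per-field BM25 scores combined linearly: each field's tf is
    # saturated separately, so repeated evidence across fields does not
    # accumulate inside the non-linearity
    return sum(w * bm25_saturation(tf) * idf
               for tf, w in zip(field_tfs, field_weights))

def bm25f(field_tfs, field_weights, idf):
    # BM25F: weighted field frequencies are merged into one pseudo-frequency
    # *before* saturation, preserving the diminishing-returns behaviour
    pseudo_tf = sum(w * tf for tf, w in zip(field_tfs, field_weights))
    return bm25_saturation(pseudo_tf) * idf

# Toy example: a term occurring once in the title and once in the body
tfs, weights, idf = [1, 1], [2.0, 1.0], 1.5
print(linear_combination(tfs, weights, idf))
print(bm25f(tfs, weights, idf))
```

The two scoring schemes produce genuinely different values here, which is why treating per-field BM25 scores as independent features is not equivalent to using BM25F as a single feature.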

Our TOIS paper also shows how to efficiently calculate multiple query dependent features within an IR system. In particular, as the postings of an inverted index are compressed, it is expensive to calculate additional query dependent features once the top K sample has been identified, due to the cost of decompressing the relevant parts of the inverted index posting lists again. Instead, we show how the postings of documents that are inserted into the top K documents within a DAAT retrieval strategy can be "cloned" and retained, decompressed, in memory, such that additional query dependent features can be calculated in the Feature Extraction phase. We call this the fat framework, as it "fattens" the result set with the postings of the query terms for those documents. We have implemented this framework within the Terrier IR platform, and it will be released as part of the next major release of Terrier, as described in our OSIR 2012 paper.
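As a rough sketch of the idea (all names, the index, and the scoring below are invented toys - Terrier's actual implementation operates over compressed posting list cursors), the following keeps a clone of the decompressed postings alongside each document that enters the top K heap, so that feature extraction can read them without touching the inverted index a second time:

```python
import heapq
from dataclasses import dataclass

@dataclass
class Posting:
    docid: int
    tf: int

# Hypothetical tiny index: term -> already-decompressed posting list
INDEX = {
    "learning": [Posting(1, 3), Posting(2, 1), Posting(4, 2)],
    "rank":     [Posting(1, 1), Posting(3, 2), Posting(4, 4)],
}

def fat_daat_retrieval(query_terms, k):
    """Score documents and 'fatten' each top K result with cloned postings,
    so that feature extraction needs no second pass over the inverted index."""
    per_doc = {}
    for term in query_terms:
        for p in INDEX.get(term, []):
            per_doc.setdefault(p.docid, {})[term] = p
    heap = []
    for docid, postings in per_doc.items():
        score = sum(p.tf for p in postings.values())  # toy first-pass score
        entry = (score, docid, dict(postings))        # cloned postings kept in memory
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)

def extract_features(fat_results):
    # Additional query dependent features computed from the retained postings
    return {docid: {"sum_tf": sum(p.tf for p in postings.values()),
                    "matched_terms": len(postings)}
            for _, docid, postings in fat_results}

results = fat_daat_retrieval(["learning", "rank"], k=2)
print(extract_features(results))
```

The key point is that the cloned postings travel with the result set, so any number of further query dependent features can be computed later without decompression costs.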

Finally, query features are an increasingly important aspect within learning to rank. In contrast to (query independent or query dependent) document features, query features have the same value for each document ranked for a query. In this way, query features can be said to be document independent. Our CIKM 2010 and SIGIR 2011 papers on diversification used query features to decide on the ambiguity of a query, or to decide on the likely type of information need underlying different aspects of an ambiguous query. More generally, the role of query features is to allow learners based on regression trees (e.g. GBRT, LambdaMART) to customise branches of the learned model for different types of query. For instance, if query length is a query feature, then the learner might recognise that a query dependent proximity (document) feature is important for two-term queries, but not for one-term queries. In our CIKM 2012 short paper, we recognised the lack of a comprehensive study on the usefulness of query features for learning to rank. Our experiments combined the LambdaMART learning to rank technique with 187 different query features, grouped into four types: pre-retrieval Query Performance Prediction, Query Concept Identification, Query Log Mining, and Query Type Classification. We found that over a quarter of the 187 query features could significantly improve the effectiveness of a learned model. The most promising were the Query Type Classification features, which identified the presence of entities in the query, suggesting that such features are useful for triggering sub-trees that promote entity homepages. Overall, we found that query features could be employed to customise learned ranking models for queries with different popularity, length, difficulty, ambiguity, and related entities.
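As a toy illustration of this intuition (the feature names, weights, and branch threshold are all invented, and a real learner would of course induce such branches from training data), a single hand-written "tree" below branches on the query length feature, applying the proximity document feature only for multi-term queries:

```python
def document_features(doc):
    # Query dependent document features (toy values supplied directly)
    return {"bm25": doc["bm25"], "proximity": doc["proximity"]}

def query_features(query):
    # Same value for every document ranked for this query (document independent)
    return {"query_length": len(query.split())}

def score(doc, query):
    """Hand-written stand-in for one regression tree branch: split on a
    query feature, then weight document features differently per branch."""
    f = {**document_features(doc), **query_features(query)}
    if f["query_length"] >= 2:
        # multi-term query: proximity evidence is meaningful
        return f["bm25"] + 0.5 * f["proximity"]
    # single-term query: proximity carries no signal
    return f["bm25"]

docs = [{"bm25": 2.0, "proximity": 1.0}, {"bm25": 1.8, "proximity": 3.0}]
print([score(d, "terrier retrieval") for d in docs])
print([score(d, "terrier") for d in docs])
```

Note how the second document overtakes the first only for the two-term query, exactly the kind of per-query-type customisation that query features make available to tree-based learners.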

Summary

There remains a great deal of black magic involved in the effective and efficient application of learning to rank within information retrieval. Indeed, my colleagues and I strongly believe that empirically derived best practices are an important part of information retrieval research. This series of blog posts has aimed to address some of the aspects missing in the literature, and to provide insights into our recent research in this area.

Acknowledgements

This body of research would not have been possible without a number of co-authors and contributors: Iadh Ounis, Rodrygo Santos, Nicola Tonellotto (CNR, Italy) and Ben He (University of Chinese Academy of Sciences).

The thesis tackles the problem of ambiguity in web search queries. In particular, given the enormous size of the Web, a misunderstanding of the information need underlying an ambiguous query can misguide the search engine, ultimately leading the user to abandon the originally submitted query. To overcome this problem, a sensible approach is to diversify the documents retrieved for the user's query. As a result, the likelihood that at least one of these documents will satisfy the user's actual information need is increased.

In the thesis, we argue that an ambiguous query should be seen as representing not one, but multiple information needs. Based upon this premise, we propose xQuAD – Explicit Query Aspect Diversification – a novel probabilistic framework for search result diversification. In particular, the xQuAD framework naturally models several dimensions of the search result diversification problem in a principled yet practical manner. To this end, the framework represents the possible information needs underlying a query as a set of keyword-based sub-queries. Moreover, xQuAD accounts for the overall coverage of each retrieved document with respect to the identified sub-queries, so as to rank highly diverse documents first. In addition, it accounts for how well each sub-query is covered by the other retrieved documents, so as to promote novelty – and hence penalise redundancy – in the ranking. The framework also models the importance of each of the identified sub-queries, so as to appropriately cater for the interests of the user population when diversifying the retrieved documents. Finally, since not all queries are equally ambiguous, the xQuAD framework caters for the ambiguity level of different queries, so as to appropriately trade off relevance for diversity on a per-query basis.
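The greedy document selection at the heart of xQuAD can be sketched as follows (a simplified toy: the relevance, coverage, and importance probability estimates below are invented, whereas the real framework derives them from its sub-query generation and coverage estimation mechanisms). At each step, the document maximising a mixture of relevance and diversity is appended to the ranking, and the novelty of each sub-query is then discounted by how well the selected document covers it:

```python
def xquad_rerank(relevance, coverage, importance, lam, k):
    """Greedy xQuAD selection. At each step, pick the document maximising
    (1-lam)*P(d|q) + lam * sum_s P(s|q) * P(d|s) * prod_{d' in S} (1 - P(d'|s)).
    relevance: {doc: P(d|q)}; coverage: {doc: {subquery: P(d|s)}};
    importance: {subquery: P(s|q)}."""
    selected = []
    # novelty[s] = product over already-selected d' of (1 - P(d'|s))
    novelty = {s: 1.0 for s in importance}
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def xquad_score(d):
            div = sum(importance[s] * coverage[d].get(s, 0.0) * novelty[s]
                      for s in importance)
            return (1 - lam) * relevance[d] + lam * div
        best = max(candidates, key=xquad_score)
        selected.append(best)
        candidates.remove(best)
        for s in novelty:  # discount sub-queries the chosen document covers
            novelty[s] *= 1.0 - coverage[best].get(s, 0.0)
    return selected

# Toy example: d1 and d2 cover the same aspect; d3 covers the other
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
coverage = {"d1": {"s1": 0.9}, "d2": {"s1": 0.8}, "d3": {"s2": 0.9}}
importance = {"s1": 0.5, "s2": 0.5}
print(xquad_rerank(relevance, coverage, importance, lam=0.7, k=3))
```

In this toy example, d3 is promoted above the more relevant d2 because d2's sole aspect is already covered by d1 - the novelty component penalising redundancy in action.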

The xQuAD framework is general and can be used to instantiate several diversification models, including the most prominent models described in the literature. In particular, within xQuAD, each of the aforementioned dimensions of the search result diversification problem can be tackled in a variety of ways. In this thesis, as additional contributions besides the xQuAD framework itself, we introduce novel machine learning approaches for addressing each of these dimensions. These include a learning to rank approach for identifying effective sub-queries as query suggestions mined from a query log, an intent-aware approach for choosing the ranking models most likely to be effective for estimating the coverage and novelty of multiple documents with respect to a sub-query, and a selective approach for automatically predicting how much to diversify the documents retrieved for each individual query. In addition, we perform the first empirical analysis of the role of novelty as a diversification strategy for web search.

As demonstrated throughout the thesis, the principles underlying the xQuAD framework are general, sound, and effective. In particular, to validate the contributions of this thesis, we thoroughly assess the effectiveness of xQuAD under the standard experimentation paradigm provided by the diversity task of the TREC 2009, 2010, and 2011 Web tracks. The results of this investigation demonstrate the effectiveness of our proposed framework. Indeed, xQuAD attains consistent and significant improvements in comparison to the most effective diversification approaches in the literature, across a range of experimental conditions comprising multiple input rankings, multiple sub-query generation and coverage estimation mechanisms, as well as queries with multiple levels of ambiguity.

These investigations led to the publication of 12 peer-reviewed research papers and 5 evaluation forum reports directly related to the thesis. Moreover, the thesis opened up directions for other researchers, who deployed and extended the xQuAD framework for different applications, and inspired a series of workshops on Diversity in Document Retrieval, as well as a research track at the internationally renowned NTCIR forum. From a practical perspective, xQuAD has been subjected to scrutiny from the research community as a regular contender in both TREC and NTCIR. Having produced the winning entry in every edition of the diversity task of the TREC Web track (best category B submission in TREC 2009 and TREC 2010; best overall submission in TREC 2011 and TREC 2012), we believe that the xQuAD framework has secured its place in the state-of-the-art.

Thursday, March 21, 2013

This is the first of two blog posts addressing some of our recent research in learning to rank. In particular, in recent years, the information retrieval (IR) field has experienced a paradigm shift in the application of machine learning techniques to achieve effective ranking models. A few years ago, we were using hill-climbing optimisation techniques such as simulated annealing to optimise the parameters in weighting models, such as BM25 or PL2, or latterly BM25F or PL2F. Instead, driven first by commercial search engines, IR is increasingly adopting a feature-based approach, where various mini-hypotheses are represented as numerical features, and learning to rank techniques are deployed to decide their importance in the final ranking formulae.

The typical approach for ranking is described in the following figure from our recently presented WSDM 2013 paper:

Phases of a retrieval system deploying learning to rank, taken from Tonellotto et al, WSDM 2013.

In particular, there are typically three phases:

Top K Retrieval - a set of top-ranked documents, known as the sample, is identified for the query.

Feature Extraction - various features are calculated for each of the sample documents.

Learned Model Application - the model obtained from a learning to rank technique re-ranks the sample documents to better satisfy the user.
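The three phases above can be sketched end-to-end as follows (a toy in Python: the weighting model, features, and learned model are all invented stand-ins, not Terrier's API):

```python
def top_k_retrieval(query, index, k):
    # Phase 1 -- Top K Retrieval: a cheap weighting model identifies the
    # sample (toy model: sum of query term frequencies)
    scores = {doc: sum(text.split().count(t) for t in query.split())
              for doc, text in index.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if scores[d] > 0][:k]

def extract_features(doc, query, index):
    # Phase 2 -- Feature Extraction: query dependent and query independent
    # features per sample document (toy examples)
    words = index[doc].split()
    return {"tf": sum(words.count(t) for t in query.split()),
            "doc_length": len(words)}

def learned_model(features):
    # Stand-in for a model produced by a learning to rank technique
    return 2.0 * features["tf"] - 0.1 * features["doc_length"]

def search(query, index, k=1000):
    # Phase 3 -- Learned Model Application: re-rank the sample
    sample = top_k_retrieval(query, index, k)
    feats = {d: extract_features(d, query, index) for d in sample}
    return sorted(sample, key=lambda d: learned_model(feats[d]), reverse=True)

index = {"d1": "terrier ir platform", "d2": "terrier terrier retrieval platform"}
print(search("terrier platform", index))
```

The structure matters more than the toy scoring: the learned model only ever sees the K documents that survive the first phase, which is why the composition of the sample is so important.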

The Sample

The set of top K documents selected within the first retrieval phase is called the sample by Liu, even though the selected documents are not independent and identically distributed. Indeed, in selecting the sample, Liu suggested that the top K documents ranked by a simple weighting model such as BM25 are not the best possible sample, but are sufficient for effective learning to rank. However, the size of the sample - i.e. the number of documents to be re-ranked by the learned model - is an important parameter: with fewer documents, the first pass retrieval can be made more efficient by the use of dynamic pruning strategies (e.g. WAND); on the other hand, too few documents may result in insufficient relevant documents being retrieved, and hence in degraded effectiveness.

Our article The Whens and Hows of Learning to Rank in the Information Retrieval Journal studied the sample size parameter across many topic sets and learning to rank techniques. For the mixed information needs of the TREC ClueWeb09 collection, we found that while a sample size of 20 documents was sufficient for effective performance according to ERR@20, larger sample sizes of thousands of documents were needed for effective NDCG@20; for navigational information needs, predominantly larger sample sizes (up to 5000 documents) were needed. Moreover, the particular document representation used to identify the sample was shown to have an impact on effectiveness - indeed, navigational queries were found to be considerably easier (requiring smaller samples) when anchor text was used, but for informational queries the opposite was observed. In the article, we examine these issues in detail across a number of test collections and learning to rank techniques, and also investigate the role of the evaluation measure and its rank cutoff for listwise techniques - for in-depth details and conclusions, see the IR Journal article.

Dynamic pruning strategies such as WAND are generally configured to be safe-to-rank-K, which means that the effectiveness of the sample is not degraded. Alternatively, they can be configured to prune in an unsafe, more aggressive manner, which can degrade effectiveness by changing the retrieved documents. While the safety of WAND has previously been shown to have little impact on the effectiveness of the retrieved (sample) documents, in our SIGIR 2012 poster we showed that the impact on the effectiveness of the documents after re-ranking by a learned model can be marked. The poster also investigated biases in the retrieved documents that manifest in WAND when it is configured for unsafe pruning. For further details, please see our SIGIR 2012 poster.

The number of documents necessary in the sample clearly varies from query to query. In our WSDM 2013 paper, we proposed selective pruning, whereby the size of the sample and the aggressiveness of the WAND pruning strategy used to create it are altered on a per-query basis. This permits retrieval that is both effective and efficient. Indeed, by using selective pruning, we showed that mean response time could be improved by 36%, and the response times experienced by the slowest 10% of queries could be reduced by 50%, without significantly degrading effectiveness. The full paper investigates the effect of unsafe pruning on both efficiency and effectiveness, as well as different ways to make the selective pruning decision - see the WSDM 2013 paper for more details.
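As a toy sketch of such a per-query decision (the predictor, thresholds, and configuration values below are all invented - the paper employs trained query efficiency predictors), one might choose the sample size K and a WAND threshold-scaling factor F as follows:

```python
def predict_response_time(query, term_stats):
    # Hypothetical query efficiency predictor: queries with longer posting
    # lists are predicted to be slower (seconds per million postings, a toy)
    return sum(term_stats.get(t, 0) for t in query.split()) / 1e6

def select_pruning_config(query, term_stats, budget_secs=0.5):
    """Selective pruning (toy): pick sample size K and WAND threshold
    scaling F per query. F = 1.0 is safe-to-rank-K; F > 1.0 prunes
    aggressively (unsafe). Thresholds and values here are invented."""
    predicted = predict_response_time(query, term_stats)
    if predicted <= budget_secs:
        return {"K": 5000, "F": 1.0}   # cheap query: large, safe sample
    return {"K": 1000, "F": 1.2}       # expensive query: smaller, unsafe sample

# Toy posting list lengths per term
stats = {"the": 900_000_000, "terrier": 200_000}
print(select_pruning_config("terrier ir", stats))
print(select_pruning_config("the the the", stats))
```

The point of the sketch is the shape of the decision: efficiency is spent where a query can afford it, and sacrificed - via a smaller sample and more aggressive pruning - where it cannot.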

In the next blog post (Part 2), I'll be looking at more details about the choice of features within learning to rank.