Introduction to Information Retrieval

CS276

Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 1: Boolean retrieval

Information Retrieval

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)


Unstructured (text) vs. structured (database) data in 1996

Unstructured (text) vs. structured (database) data in 2009

Unstructured data in 1680

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?

Why is that not the answer?

Slow (for large corpora)

NOT Calpurnia is non-trivial

Other operations (e.g., find the word Romans near countrymen) not feasible

Ranked retrieval (best documents to return)

Later lectures

Term-document incidence

Query: Brutus AND Caesar BUT NOT Calpurnia

Each matrix entry is 1 if the play contains the word, 0 otherwise

Incidence vectors

So we have a 0/1 vector for each term.

To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND.

110100 AND 110111 AND 101111 = 100100.
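A minimal sketch of this bitwise AND in Python, with integers standing in for the 0/1 vectors; the six-play width and the three vectors come from the example, the variable names are mine:

    WIDTH = 6                       # six plays, as in the example
    brutus    = 0b110100            # incidence vector for Brutus
    caesar    = 0b110111            # incidence vector for Caesar
    calpurnia = 0b010000            # incidence vector for Calpurnia

    mask = (1 << WIDTH) - 1         # 0b111111, to complement within 6 bits
    # Brutus AND Caesar AND NOT Calpurnia:
    result = brutus & caesar & (~calpurnia & mask)
    print(format(result, "06b"))    # prints 100100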


Basic assumptions of Information Retrieval

Collection: Fixed set of documents

Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.

The classic search model

How good are the retrieved docs?

Precision: Fraction of retrieved docs that are relevant to user’s information need

Recall: Fraction of relevant docs in collection that are retrieved

More precise definitions and measurements to follow in later lectures

Bigger collections

Consider N = 1 million documents, each with about 1000 words.

Avg 6 bytes/word including spaces/punctuation

6GB of data in the documents.

Say there are M = 500K distinct terms among these.

Can’t build the matrix

500K x 1M matrix has half-a-trillion 0’s and 1’s.

But it has no more than one billion 1’s. ← Why?

The matrix is extremely sparse: each document contributes at most 1000 term occurrences, so at most a billion of the half-trillion entries are 1.

What’s a better representation?

We only record the 1 positions.

Inverted index

For each term t, we must store a list of all documents that contain t.

Identify each by a docID, a document serial number

Can we use fixed-size arrays for this?

Brutus → 1, 2, 4, 11, 31, 45, 173, 174

Caesar → 1, 2, 4, 5, 6, 16, 57, 132

Calpurnia → 2, 31, 54, 101

What happens if the word Caesar is added to document 14?

Inverted index

We need variable-size postings lists

On disk, a continuous run of postings is normal and best

In memory, can use linked lists or variable length arrays

Some tradeoffs in size/ease of insertion. Each docID in the lists below is a posting.

Brutus → 1, 2, 4, 11, 31, 45, 173, 174

Caesar → 1, 2, 4, 5, 6, 16, 57, 132

Calpurnia → 2, 31, 54, 101

The terms Brutus, Caesar, and Calpurnia make up the dictionary.

Numbers are postings.

Sorted by docID (more later on why).
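A sketch of the in-memory variable-length-array option: Python lists kept sorted by docID, with bisect doing the sorted insertion. The postings are the ones above; the docID-14 case is the question from the previous slide.

    import bisect

    # In-memory postings as variable-length arrays (lists), sorted by docID.
    postings = {
        "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
        "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
        "Calpurnia": [2, 31, 54, 101],
    }

    def add_posting(term, doc_id):
        """Insert doc_id into term's postings list, preserving docID order."""
        plist = postings.setdefault(term, [])
        pos = bisect.bisect_left(plist, doc_id)
        if pos == len(plist) or plist[pos] != doc_id:   # skip duplicates
            plist.insert(pos, doc_id)

    # The word Caesar is added to document 14:
    add_posting("Caesar", 14)
    print(postings["Caesar"])   # [1, 2, 4, 5, 6, 14, 16, 57, 132]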

Inverted index construction

Indexer steps: Token sequence

Indexer steps: Sort

Indexer steps: Dictionary & Postings
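The three steps, sketched in Python on two toy documents (abbreviated from the lectures’ running example); whitespace tokenization is a simplification:

    from collections import defaultdict

    # Two example documents (abbreviated).
    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
        2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
    }

    # Step 1: token sequence -- a stream of (term, docID) pairs.
    pairs = [(token.lower(), doc_id)
             for doc_id, text in docs.items()
             for token in text.split()]

    # Step 2: sort by term, then by docID.
    pairs.sort()

    # Step 3: dictionary & postings -- collapse duplicate (term, docID) pairs;
    # the length of each postings list is the term's document frequency (df).
    index = defaultdict(list)
    for term, doc_id in pairs:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)

    for term in ("brutus", "caesar"):
        print(term, "df =", len(index[term]), "postings:", index[term])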

Where do we pay in storage?

The index we just built

How do we process a query? ← Today’s focus

Later: what kinds of queries can we process?

Query processing: AND

Consider processing the query: Brutus AND Caesar

Locate Brutus in the Dictionary;

Retrieve its postings.

Locate Caesar in the Dictionary;

Retrieve its postings.

“Merge” the two postings:

The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries

Intersecting two postings lists (a "merge algorithm")
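A Python rendering of the linear merge; the postings lists are the ones from the Brutus AND Caesar example above:

    def intersect(p1, p2):
        """Linear-time merge: walk both docID-sorted lists simultaneously."""
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [1, 2, 4, 11, 31, 45, 173, 174]
    caesar = [1, 2, 4, 5, 6, 16, 57, 132]
    print(intersect(brutus, caesar))   # [1, 2, 4]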

Query optimization

What is the best order for query processing?

Consider a query that is an AND of n terms.

For each of the n terms, get its postings, then AND them together.

Brutus → 2, 4, 8, 16, 32, 64, 128

Caesar → 1, 2, 3, 5, 8, 16, 21, 34

Calpurnia → 13, 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example

Process in order of increasing freq: start with the smallest set, then keep cutting further.

(This is why we keep document freq. in the dictionary.)

Brutus → 2, 4, 8, 16, 32, 64, 128

Caesar → 1, 2, 3, 5, 8, 16, 21, 34

Calpurnia → 13, 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.
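A sketch of this ordering heuristic in Python, repeating the linear merge from before so the snippet runs standalone; sorting by list length stands in for the document frequencies kept in the dictionary:

    def intersect(p1, p2):
        """Linear merge of two docID-sorted postings lists (as above)."""
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    def and_query(terms, index):
        """Intersect in order of increasing df: smallest postings list first."""
        plists = sorted((index[t] for t in terms), key=len)
        result = plists[0]
        for plist in plists[1:]:
            if not result:          # empty intermediate result: stop early
                break
            result = intersect(result, plist)
        return result

    index = {
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
        "Calpurnia": [13, 16],
    }
    # Processes as (Calpurnia AND Brutus) AND Caesar.
    print(and_query(["Brutus", "Calpurnia", "Caesar"], index))   # [16]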

Boolean queries: Exact match

The Boolean retrieval model lets us pose any query that is a Boolean expression: terms combined with the operators AND, OR, and NOT.

F1 and other averages

Evaluating ranked results

By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve

A precision-recall curve

Averaging over queries

A precision-recall graph for one query isn’t a very sensible thing to look at

You need to average performance over a whole bunch of queries.

But there’s a technical issue:

Precision-recall calculations place some points on the graph

How do you determine a value (interpolate) between the points?

Interpolated precision

Idea: If locally precision increases with increasing recall, then you should get to count that…

So you take the max of precisions to right of value
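A small Python sketch of this interpolation rule; the (recall, precision) points are made up for illustration:

    def interpolated_precision(points, r):
        """Interpolated precision at recall r: the maximum precision attained
        at any measured recall level >= r."""
        candidates = [p for rec, p in points if rec >= r]
        return max(candidates) if candidates else 0.0

    # Hypothetical measured points for one query's ranking.
    points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
    print(interpolated_precision(points, 0.75))   # 0.5, not 0.44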

Evaluation

Graphs are good, but people want summary measures!

Precision at fixed retrieval level

Precision-at-k: Precision of top k results

Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages

But: averages badly and has an arbitrary parameter of k

11-point interpolated average precision

The standard measure in the early TREC competitions: take the precision at 11 recall levels varying from 0 to 1 by tenths (0.0, 0.1, …, 1.0), using interpolation (the value at recall 0 is always interpolated!), and average them

Evaluates performance at all recall levels

Typical (good) 11 point precisions

SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)

Yet more evaluation measures…

Mean average precision (MAP)

Average of the precision value obtained for the top k documents, each time a relevant doc is retrieved

Avoids interpolation, use of fixed recall levels

MAP for a query collection is the arithmetic average of per-query average precision.

Macro-averaging: each query counts equally
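A sketch of precision-at-k and (mean) average precision in Python; the relevant sets and rankings in the examples are hypothetical:

    def precision_at_k(relevant, ranking, k):
        """Fraction of the top k ranked documents that are relevant."""
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def average_precision(relevant, ranking):
        """Average of precision@k at each rank k where a relevant doc appears;
        relevant docs never retrieved contribute zero."""
        precisions = [precision_at_k(relevant, ranking, k + 1)
                      for k, d in enumerate(ranking) if d in relevant]
        return sum(precisions) / len(relevant) if relevant else 0.0

    def mean_average_precision(queries):
        """Macro-average: arithmetic mean of AP over the query collection."""
        return sum(average_precision(rel, rank) for rel, rank in queries) / len(queries)

    # Query 1: relevant docs {1, 3}, ranking [3, 2, 1, 4]; query 2: {2}, [1, 2].
    print(average_precision({1, 3}, [3, 2, 1, 4]))    # (1 + 2/3) / 2 ≈ 0.833
    print(mean_average_precision([({1, 3}, [3, 2, 1, 4]), ({2}, [1, 2])]))  # ≈ 0.667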

R-precision

If we have a known (though perhaps incomplete) set of relevant documents of size Rel, then calculate precision of the top Rel docs returned

Perfect system could score 1.0.

Variance

For a test collection, it is usual that a system does crummily on some information needs (e.g., MAP = 0.1) and excellently on others (e.g., MAP = 0.7)

Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query.

That is, there are easy information needs and hard ones!

Creating Test Collections for IR Evaluation

Test Collections

From document collections to test collections

Still need

Test queries

Relevance assessments

Test queries

Must be germane to docs available

Best designed by domain experts

Random query terms generally not a good idea

Relevance assessments

Human judges, time-consuming

Are human panels perfect?

Kappa measure for inter-judge (dis)agreement

Kappa measure

Agreement measure among judges

Designed for categorical judgments

Corrects for chance agreement

Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]

P(A) – proportion of time judges agree

P(E) – what agreement would be by chance

Kappa = 0 for chance agreement, 1 for total agreement.

Kappa Measure: Example

P(A)? P(E)?

Number of docs | Judge 1     | Judge 2
---------------|-------------|------------
300            | Relevant    | Relevant
70             | Nonrelevant | Nonrelevant
20             | Relevant    | Nonrelevant
10             | Nonrelevant | Relevant

Kappa Example

P(A) = 370/400 = 0.925

P(nonrelevant) = (10+20+70+70)/800 = 0.2125

P(relevant) = (10+20+300+300)/800 = 0.7875

P(E) = 0.2125^2 + 0.7875^2 = 0.665

Kappa = (0.925 – 0.665)/(1-0.665) = 0.776

Kappa > 0.8 = good agreement

0.67 < Kappa < 0.8 -> “tentative conclusions” (Carletta ’96)

Depends on purpose of study

For >2 judges: average pairwise kappas
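A sketch of the two-judge kappa computation in Python, reproducing the worked example above (chance agreement estimated from the marginals pooled over both judges, as in the example):

    def kappa_two_judges(rel_rel, non_non, rel_non, non_rel):
        """Kappa = [P(A) - P(E)] / [1 - P(E)] for two judges making
        Relevant/Nonrelevant calls; P(E) uses the pooled marginals."""
        total = rel_rel + non_non + rel_non + non_rel
        p_agree = (rel_rel + non_non) / total
        p_relevant = (2 * rel_rel + rel_non + non_rel) / (2 * total)
        p_nonrelevant = 1 - p_relevant
        p_chance = p_relevant ** 2 + p_nonrelevant ** 2
        return (p_agree - p_chance) / (1 - p_chance)

    # The worked example: 300 R/R, 70 N/N, 20 R/N, 10 N/R.
    print(round(kappa_two_judges(300, 70, 20, 10), 3))   # 0.776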

TREC

TREC Ad Hoc task from first 8 TRECs is standard IR task

50 detailed information needs a year

Human evaluation of pooled results returned

More recently other related things: Web track, HARD

A TREC query (TREC 5)

Number: 225

Description:

What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?

Standard relevance benchmarks: Others

GOV2

Another TREC/NIST collection

25 million web pages

Largest collection that is easily available

But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index

NTCIR

East Asian language and cross-language information retrieval

Cross Language Evaluation Forum (CLEF)

This evaluation series has concentrated on European languages and cross-language information retrieval.


Probability Ranking Principle (PRP)

Notation: let x be a document. p(x|R), p(x|NR) – probability that if a relevant (non-relevant) document is retrieved, it is x.

Simple case: no selection costs or other utility concerns that would differentially weight errors

Bayes’ Optimal Decision Rule

x is relevant iff p(R|x) > p(NR|x)

PRP in action: Rank all documents by p(R|x)

Theorem:

Using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss

Provable if all probabilities correct, etc. [e.g., Ripley 1996]

Probability Ranking Principle

More complex case: retrieval costs.

Let d be a document

C - cost of retrieval of relevant document

C’ - cost of retrieval of non-relevant document

Probability Ranking Principle: if

C · p(R|d) + C′ · (1 − p(R|d)) ≤ C · p(R|d′) + C′ · (1 − p(R|d′))

for all d′ not yet retrieved, then d is the next document to be retrieved

We won’t further consider loss/utility from now on

Probability Ranking Principle

How do we compute all those probabilities?

Do not know exact probabilities, have to use estimates

Binary Independence Retrieval (BIR) – which we discuss later today – is the simplest model

Questionable assumptions

“Relevance” of each document is independent of relevance of other documents.

Really, it’s bad to keep on returning duplicates

Boolean model of relevance

That one has a single step information need

Seeing a range of results might let user refine query

Probabilistic Retrieval Strategy

Estimate how terms contribute to relevance

How do things like tf, df, and length influence your judgments about document relevance?

One answer is the Okapi formulae (S. Robertson)

Combine to find document relevance probability

Order documents by decreasing probability

The Probability Ranking Principle

“If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”


CS276A

Text Retrieval and Mining

Lecture 12

[Borrows slides from Viktor Lavrenko and Chengxiang Zhai]

Recap

Probabilistic models: Naïve Bayes Text Classification

Introduction to Text Classification

Probabilistic Language Models

Naïve Bayes text categorization

Today

The Language Model Approach to IR

Basic query generation model

Alternative models

Standard Probabilistic IR

IR based on Language Model (LM)

A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!

The LM approach directly exploits that idea!

Formal Language (Model)

Traditional generative model: generates strings

Finite state machines or regular grammars, etc.

Example:

I wish

I wish I wish

I wish I wish I wish

I wish I wish I wish I wish

…

*wish I wish (the * marks a string this model cannot generate, since every generated string starts with I)

Stochastic Language Models

Models probability of generating strings in the language (commonly all strings over alphabet Σ)

Stochastic Language Models

Model probability of generating any string

Stochastic Language Models

A statistical model for generating text

Probability distribution over strings in a given language

Unigram and higher-order models

Unigram Language Models (Easy, effective!)

Bigram (generally, n-gram) Language Models

Other Language Models

Grammar-based models (PCFGs), etc.

Probably not the first thing to try in IR
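As a concrete illustration of the unigram case: the probability of a string is just the product of its word probabilities. The numbers below are illustrative, not estimated from any real corpus.

    # Toy unigram model: a probability distribution over the vocabulary.
    unigram = {"the": 0.2, "man": 0.01, "likes": 0.02, "woman": 0.01}

    def string_probability(words, model):
        """P(string) = product of the individual word probabilities."""
        prob = 1.0
        for w in words:
            prob *= model.get(w, 0.0)   # unseen word -> 0 (hence smoothing, later)
        return prob

    print(string_probability("the man likes the woman".split(), unigram))
    # 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-08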

Using Language Models in IR

Treat each document as the basis for a model (e.g., unigram sufficient statistics)

Rank document d based on P(d | q)

P(d | q) = P(q | d) x P(d) / P(q)

P(q) is the same for all documents, so ignore

P(d) [the prior] is often treated as the same for all d

But we could use criteria like authority, length, genre

P(q | d) is the probability of q given d’s model

Very general formal approach

The fundamental problem of LMs

Usually we don’t know the model M

But have a sample of text representative of that model

Estimate a language model from a sample

Then compute the observation probability

Language Models for IR

Language Modeling Approaches

Attempt to model query generation process

Documents are ranked by the probability that a query would be observed as a random sample from the respective document model

Multinomial approach

Retrieval based on probabilistic LM

Treat the generation of queries as a random process.

Approach

Infer a language model for each document.

Estimate the probability of generating the query according to each of these models.

Rank the documents according to these probabilities.

Usually a unigram estimate of words is used

Some work on bigrams, paralleling van Rijsbergen

Retrieval based on probabilistic LM

Intuition

Users …

Have a reasonable idea of terms that are likely to occur in documents of interest.

They will choose query terms that distinguish these documents from others in the collection.

Collection statistics …

Are integral parts of the language model.

Are not used heuristically as in many other approaches.

In theory. In practice, there’s usually some wiggle room for empirically set parameters

Query generation probability (1)

Ranking formula: p(d, q) = p(d) · p(q | d) ≈ p(d) · p(q | Md)

The probability of producing the query given the language model of document d using MLE is:

p(q | Md) = ∏_{t in q} p_mle(t | Md) = ∏_{t in q} tf_t,d / dl_d

(tf_t,d is the term frequency of t in d; dl_d is the total number of tokens in d)

Insufficient data

Zero probability p(t | Md) = 0

May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics]

General approach

A non-occurring term is possible, but no more likely than would be expected by chance in the collection.

If tf_t,d = 0, then: p(t | Md) ≤ cf_t / cs, where cf_t is the number of occurrences of t in the collection and cs is the total number of tokens in the collection.

Insufficient data

Zero probabilities spell disaster

We need to smooth probabilities

Discount nonzero probabilities

Give some probability mass to unseen things

There’s a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½, or ε to counts, Dirichlet priors, discounting, and interpolation

[See FSNLP ch. 6 or CS224N if you want more]

A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution

Mixture model

P(w | d) = λ · Pmle(w | Md) + (1 − λ) · Pmle(w | Mc)

Mixes the probability from the document with the general collection frequency of the word.

Correctly setting λ is very important

A high value of λ makes the search “conjunctive-like” – suitable for short queries

A low value is more suitable for long queries

Can tune λ to optimize performance

Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
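A sketch of scoring with this mixture model in Python (log space to avoid underflow); λ = 0.5 and the toy counts are arbitrary choices, and the sketch assumes every query term occurs somewhere in the collection:

    from math import log

    def query_likelihood(query_terms, tf_doc, doc_len, cf_coll, coll_len, lam=0.5):
        """log P(q|d) = sum over t of log(lam*Pmle(t|Md) + (1-lam)*Pmle(t|Mc))."""
        score = 0.0
        for t in query_terms:
            p_doc = tf_doc.get(t, 0) / doc_len      # Pmle(t | Md)
            p_coll = cf_coll.get(t, 0) / coll_len   # Pmle(t | Mc)
            score += log(lam * p_doc + (1 - lam) * p_coll)
        return score

    # Toy data: a 100-token document inside a 10,000-token collection.
    tf_doc = {"revenue": 5, "forecast": 2}
    cf_coll = {"revenue": 50, "forecast": 20, "the": 600}
    print(query_likelihood(["revenue", "forecast"], tf_doc, 100, cf_coll, 10_000))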

Basic mixture model summary

General formulation of the LM for IR

The user has a document in mind, and generates the query from this document.

The equation represents the probability that the document that the user had in mind was in fact this one.