Processing of large document collections

Course material

Large document collections

“a document records a message from people to people” (Wilkinson et al., 1998)

each document has content, structure, and metadata (context)

in this course, we concentrate on content

particularly: textual content

Large document collections

large?

each document may have been written by a person, but the collection is too large to process manually afterwards -> automatic processing is needed

large w.r.t. the capacity of a device (e.g. a mobile phone)

collection?

documents somehow similar -> automatic processing is possible

Applications

text categorization

text summarization

information extraction

question answering

text compression

text indexing and retrieval

machine translation

…

Text categorization

given a predefined set of categories and a set of documents

label each document with one or more categories
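The task definition above can be sketched as a function from a document to a set of category labels. This is only an illustration of the input and output shape: the categories, the keyword lists and the overlap rule are hypothetical and are not taken from the course material.

```python
# Hypothetical categories and keyword lists, invented for illustration only.
CATEGORY_KEYWORDS = {
    "sports": {"match", "goal", "team"},
    "politics": {"election", "minister", "parliament"},
}


def categorize(document):
    """Return every category whose keyword set overlaps the document's words."""
    words = set(document.lower().split())
    return {cat for cat, keywords in CATEGORY_KEYWORDS.items() if words & keywords}


# A document may receive more than one label, or none at all.
labels = categorize("The minister watched the team score a late goal")
print(sorted(labels))
```

A real categorizer would normally learn its decision rule from labelled examples; the fixed keyword sets here only stand in for that step.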

Text summarization

“Process of distilling the most important information from a source to produce an abridged version for a particular user or task” (Mani & Maybury, 1999)

Example

A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night.

According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope ‘looked furious’ on hearing the priest’s criticism of his handling of the church’s affairs. If found guilty, the Spaniard faces a prison sentence of 15-20 years.

Example

summary could be, e.g.

“A Spanish priest is charged after an unsuccessful murder attempt on the Pope”

or a set of phrases:

a Spanish priest was charged

attempting to murder the Pope

he trained for the assault

Pope furious on hearing priest’s criticisms

Information extraction

“Information extraction involves the creation of a structured representation (such as a database) of selected information drawn from the text” (Grishman, 1997)

Example: terrorist events

19 March - A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported.

According to unofficial sources, the bomb - allegedly detonated by urban guerrilla commandos - blew up a power tower in the northwestern part of San Salvador at 0650 (1250 GMT).

Example: terrorist events

Incident type: bombing

Date: March 19

Location: El Salvador: San Salvador (city)

Perpetrator: urban guerrilla commandos

Physical target: power tower

Human target: -

Effect on physical target: destroyed

Effect on human target: no injury or death

Instrument: bomb

Example: terrorist events

a document collection is given

for each document, decide if the document is about a terrorist event

for each terrorist event, determine

type of attack

date

location, etc.

= fill in a template (~database record)
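The template idea can be sketched as a record type with one slot per field. The field names below follow the slots of the terrorist event example; the class name and the use of a Python dataclass are assumptions made for illustration, not part of the course material.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class IncidentTemplate:
    """One database-like record per terrorist event; unfilled slots stay None."""
    incident_type: Optional[str] = None
    date: Optional[str] = None
    location: Optional[str] = None
    perpetrator: Optional[str] = None
    physical_target: Optional[str] = None
    human_target: Optional[str] = None
    effect_on_physical_target: Optional[str] = None
    effect_on_human_target: Optional[str] = None
    instrument: Optional[str] = None


# Slot values copied by hand from the San Salvador example; an extraction
# system would have to find them in the text automatically.
record = IncidentTemplate(
    incident_type="bombing",
    date="March 19",
    location="El Salvador: San Salvador (city)",
    perpetrator="urban guerrilla commandos",
    physical_target="power tower",
    effect_on_physical_target="destroyed",
    effect_on_human_target="no injury or death",
    instrument="bomb",
)
print(asdict(record))
```

The human target slot is left empty (None), which mirrors the "-" entry in the example.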

Question answering systems

the user asks a question in a natural language

the question answering system finds answers in a document collection, e.g. in a collection of newspaper stories

Example

question:

When did Chuck Yeager break the sonic barrier?

a text fragment in the collection:

“For many, seeing Chuck Yeager – who made his historic supersonic flight Oct. 14, 1947 – was the highlight of this year’s show, in which…”

answer: Oct. 14, 1947
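For a "when" question like this one, one very simple strategy is to look for a date-like string in a fragment that mentions the question's key terms. The sketch below shows only that last step. The regular expression and the function name are assumptions chosen to fit this single example, and the fragment is abridged.

```python
import re

# Assumed date format: abbreviated month, day, four-digit year ("Oct. 14, 1947").
DATE_PATTERN = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)\.? \d{1,2}, \d{4}\b"
)


def answer_when(fragment):
    """Return the first date-like string in the fragment, or None if there is none."""
    match = DATE_PATTERN.search(fragment)
    return match.group(0) if match else None


fragment = (
    "For many, seeing Chuck Yeager - who made his historic supersonic "
    "flight Oct. 14, 1947 - was the highlight of this year's show"
)
print(answer_when(fragment))  # Oct. 14, 1947
```

A full system would also have to retrieve the fragment in the first place and check that the date really belongs to the event asked about.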

Methods

typically several methods (from several research fields) are combined in each application

statistics (or simply counting frequencies…)

machine learning

knowledge-based methods

linguistic methods

algorithmics

Learning goals

learn to recognize components of applications/processes

learn to recognize which (kind of) methods could be used in each component

learn to implement some methods

(meta)learn to control learning processes (What do I know? What should I know to solve this problem?)

Mapping to the information retrieval process

(figure: information need -> query; documents -> document representations; query and document representations -> matching -> result -> query reformulation)

Schedule

15.-22.3.

text representation, text categorization, term selection

31.3.-7.4.

text summarization

12.4.-19.4.

information extraction

21.-26.4.

question answering systems,…

28.4.

closing

2. Text representation

selection of terms

vector model

weighting (TF*IDF)
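As a preview of the weighting topic, here is a minimal sketch of TF*IDF in the vector model. The slides do not give a formula at this point, so the variant used here, weight = tf * log(N / df), is an assumption (one common textbook form), and the three toy documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = ["the pope was attacked", "the bomb destroyed the tower", "the tower fell"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that occurs in every document (here "the") gets weight 0, while a term that occurs in only one document gets the highest weight per occurrence.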

Text representation

text cannot be directly interpreted by many document processing applications

we need a compact representation of the content

which are the meaningful units of text?

Terms

words

typical choice

set of words, bag of words

phrases

syntactical phrases (e.g. noun phrases)

statistical phrases (e.g. frequent pairs of words)

usefulness not yet known?
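The term choices listed above can be sketched on a toy sentence: a set of words records only which words occur, a bag of words also records how often, and statistical phrases are taken here to be adjacent word pairs that occur at least twice. The sentence, the whitespace tokenization and the threshold of 2 are assumptions made for illustration.

```python
from collections import Counter

text = "the pope met the priest and the priest met the pope"
tokens = text.split()

# Set of words: presence only.
word_set = set(tokens)

# Bag of words: each word with its number of occurrences.
bag_of_words = Counter(tokens)

# Statistical phrases: adjacent word pairs seen at least MIN_PAIR_COUNT times.
MIN_PAIR_COUNT = 2  # hypothetical threshold
pair_counts = Counter(zip(tokens, tokens[1:]))
statistical_phrases = {pair for pair, count in pair_counts.items()
                       if count >= MIN_PAIR_COUNT}

print(sorted(word_set))
print(bag_of_words)
print(sorted(statistical_phrases))
```

Syntactical phrases such as noun phrases would need a parser or chunker and are not covered by this sketch.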

Terms

part of the text is not considered as terms: these words can be removed