The QA System
James Driscoll, Jennifer Lautenschlager, Mimi Zhao
Department of Computer Science
University of Central Florida
Orlando, Florida 32816 USA
Abstract
In the QA system, semantic information is combined with
keywords to measure similarity between natural language
queries and documents. Acombination of keyword relevance
and semantic relevance is achieved by treating keyword
weights and semantic weights alike using the vector pro-
cessing model as a basis for the combination. The approach
is based on (1) the database concept of semantic modeling
and (2) the linguistic concept of thematic roles. Semantic
information is stored in a lexicon built manually using
information found in Roget's Thesaurus.
Keywords: vector processing model, semantic data model,
semantic lexicon, thematic roles, entity attributes.
1. Introduction
The QA system is based on the semantic approach to text
search reported in [9]. The QA system accepts natural
language queries against collections of documents. The
system uses keywords as document identifiers in order to sort
retrieved documents based on their similarity to a query. The
system also imposes a semantic data model upon the "surface
level" knowledge found in text (unstructured information
from a database point of view).
The intent of the QA System has been to provide conve-
nient access to information contained in the numerous and
large public information documents maintained by Public
Affairs at NASA Kennedy Space Center (KSC). During a
launch at KSC, about a dozen NASA employees access these
printed documents to answer media questions. The planned
document storage for NASA KSC Public Affairs is around
300,000 pages (approximately 900 megabytes of disk stor-
age).
Because of our environment, the performance of our
system is measured by a count of the number of documents
one must read in order to find an answer to a natural language
question. Consequently, the traditional precision and recall
measures for IR have not been used to measure the per-
formance of the QA System.
We have had success using semantics to improve the
ranking of documents when searching for an "answer" to a
query in a document collection of size less than 50 megabytes.
However, it is important to note that our success has been
demonstrated only in a real-world situation where queries are
the length of a sentence and documents are either a sentence
or, at most, a paragraph [8,9].
Our reasons for participating in TREC have been to (1)
learn how our semantic approach fares when traditional IR
measures of performance are used, and (2) test our system on
larger collections of documents. In this paper, we describe
our system, the experiments we performed, our results, and
failure analysis.
2. Overview of the QA System Modified for TREC
The QA System has been restricted to an IBM compatible
PC platform running under the DOS 5.0 operating system and
without the use of any other licensed commercial software
such as a DOS extender. The DOS version of the QA System
is available at nominal cost from [3]. About 2,000 hours of
programming have been used to develop the current software
which includes a pleasant user interface; just as many hours
have been used testing the basic keyword operation of the
system. In addition, approximately 1,000 hours have been
used performing experiments involving the semantic aspect
of the QA System. The SQL/DS relational database system
has been used to carry out some of these semantic experi-
ments.
The QA System is implemented in C and uses B+ tree
structures for the inverted files. We felt the speed of the
system and its storage overhead were not adequate for
TREC, so we designed a separate
system without a pleasant user interface which uses a hashing
scheme to establish codes for strings. This was done to cut
down on storage space and eliminate the use of B+ trees.
Approximately 400 hours of programming and debugging
effort was used to modify the system for the TREC experi-
ments. We kept the DOS environment.
This work has been supported in part by NASA KSC Cooperative Agreement NCC 10-003 Project 2, Florida High Technol-
ogy and Industry Council Grants 4940-11-28-721 and 4940-11-728, and DARPA Grant 4940-11-28-808.
For the TREC experiments, we used nine IBM PS/2 Model
95 computers. These were 50MHz 486 computers, each with
8 megabytes of RAM and one 400 megabyte hard drive. Two
of the machines had 16 megabytes of RAM and another one
had two 400 megabyte hard drives. A 33 MHz 486 PC was
used to distribute text to the nine IBM machines, and then
collect information for merging and redistribution.
Our plan was to put each of the nine Vol.1 and Vol.2
document collections on a separate machine, determine the
document frequency for each term tj on each machine, and
then merge the results and determine the inverse document
frequency for each term in the entire collection. Next, we
would index and query each document collection separately,
and finally merge the retrieval results.
The vector processing model is the basis for our approach.
Terms used as document identifiers are keywords modified
by various techniques such as stop lists, stemming, synonyms,
and query reformulation.
The calculation of the weighting factor (w) for a term in a
document is a combination of term frequency (tf), document
frequency (df), and inverse document frequency (idf). The
basic definitions are as follows:

$tf_{ij}$ = number of occurrences of term $t_j$ in document $D_i$
$df_j$ = number of documents in a collection which contain $t_j$
$idf_j = \log(N/df_j)$, where $N$ = total number of documents
$w_{ij} = tf_{ij} \cdot idf_j$
When the system is used to query a collection of documents
with $t$ terms, the system computes a vector $Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$
representing the weights for each term in the
query. The retrieval of a document with vector $D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$
representing the weights of each term in the
document is based on the value of a similarity measure
between the query vector and the document vector. For
the NIST experiments, we used the Jaccard similarity
coefficient [7] to retrieve documents for the September
deadline.
Jaccard Coefficient

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} d_{ij}}{\sum_{j=1}^{t} (w_{qj})^2 + \sum_{j=1}^{t} (d_{ij})^2 - \sum_{j=1}^{t} w_{qj} d_{ij}}$$
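For illustration, the following C sketch (not taken from the QA System source) computes the $w_{ij} = tf_{ij} \cdot idf_j$ weights for a tiny hypothetical term-frequency matrix and then scores each document against a query with the Jaccard coefficient; the collection size, the counts, and the query are all invented for the example.

    #include <stdio.h>
    #include <math.h>

    #define NDOCS 3   /* hypothetical collection size */
    #define NTERMS 4  /* hypothetical number of index terms */

    /* tf[i][j] = occurrences of term j in document i (sample data) */
    static const double tf[NDOCS][NTERMS] = {
        {2, 0, 1, 0},
        {0, 3, 1, 1},
        {1, 1, 0, 2}
    };

    /* Jaccard similarity between a query vector and a document vector */
    static double jaccard(const double *q, const double *d, int n)
    {
        double dot = 0.0, q2 = 0.0, d2 = 0.0;
        for (int j = 0; j < n; j++) {
            dot += q[j] * d[j];
            q2  += q[j] * q[j];
            d2  += d[j] * d[j];
        }
        return dot / (q2 + d2 - dot);
    }

    int main(void)
    {
        double idf[NTERMS], w[NDOCS][NTERMS];

        /* idf_j = log(N / df_j), where df_j = documents containing term j */
        for (int j = 0; j < NTERMS; j++) {
            int df = 0;
            for (int i = 0; i < NDOCS; i++)
                if (tf[i][j] > 0.0) df++;
            idf[j] = log((double)NDOCS / (double)df);
        }

        /* w_ij = tf_ij * idf_j */
        for (int i = 0; i < NDOCS; i++)
            for (int j = 0; j < NTERMS; j++)
                w[i][j] = tf[i][j] * idf[j];

        /* hypothetical query containing terms 1 and 3 once each */
        double q[NTERMS] = {0.0, idf[1], 0.0, idf[3]};

        for (int i = 0; i < NDOCS; i++)
            printf("sim(Q, D%d) = %f\n", i, jaccard(q, w[i], NTERMS));
        return 0;
    }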
3. Semantic Approach
Although the basic IR approach has shown some success
in regard to natural language queries, it ignores some valuable
information. For instance, consider the following query:
How long does the payload crew go through
training before a launch?
The typical IR system dismisses the following words in the
query as useless or empty: "how", "does", "the", "through",
"before", and "a". Some of these words contain valuable
semantic information. The following list indicates some of
the semantic information triggered by a few of these words:
how long    Duration, Time
through     Location/Space, Motion with Reference
            to Direction, Time
before      Location/Space, Time
The database concept of semantic modeling and the linguistic
concept of thematic roles can help in this regard.
3.1 Semantic Modeling
Semantic modeling was an object of considerable database
research in the late 1970's and early 1980's. Abriefoverview
can be found in [2]. Essentially, the semantic modeling
approach identified concepts useful in talking informally
about the real world. These concepts included the two notions
of entities (objects in the real world) and relationships among
entities (actions in the real world). Both entities and rela-
tionships have properties.
The properties of entities are often called attributes. There
are basic or surface level attributes for entities in the real
world. Examples of surface level entity attributes are General
Dimensions, Color, and Position. These properties are
prevalent in natural language. For example, consider the
phrase "large, black book on the table" which indicates the
General Dimensions, Color, and Position of the book.
In linguistic research, the basic properties of relationships
are discussed and called thematic roles. Thematic roles are
also referred to in the literature as participant roles, semantic
roles, and case roles. Examples of thematic roles are Bene-
ficiary and Time. Thematic roles are prevalent in natural
language; they reveal how sentence phrases and clauses are
semantically related to the verbs in a sentence. For example,
consider the phrase "purchase for Mary on Wednesday"
which indicates who benefited from a purchase (Beneficiary)
and when a purchase occurred (Time).
The goal of our approach is to detect thematic information
along with attribute information contained in natural lan-
guage queries and documents. When the information is
present, our system uses it to help find the most relevant
document. In order to use this additional information, the
basic underlying concept of text relevance as presented earlier
needs to be modified. The major modifications include the
addition of a lexicon with thematic and attribute information,
and a modified computation of the similarity coefficient. We
now discuss our semantic lexicon.
3.2 The Semantic Lexicon
Our system uses a thesaurus as a source of semantic
categories (thematic and attribute information). For example,
Roget's Thesaurus contains a hierarchy of word classes to
relate word senses [6]. For our research, we have selected
several classes from this hierarchy to be used for semantic
categories. We have defined thirty-six semantic categories
as shown in Figure 1.
Thematic Role Categories: Accompaniment, Amount, Beneficiary, Cause, Comparison, Condition, Conveyance, Degree, Destination, Duration, Goal, Instrument, Location/Space, Manner, Means, Motion with Reference to Direction, Purpose, Range, Result, Source, Time

Attribute Categories: Color, External and Internal Dimensions, Form, Gender, General Dimensions, Linear Dimensions, Motion Conjoined with Force, Order, Physical Properties, Position, State, Temperature, Use, Variation

Figure 1. Thirty-Six Semantic Categories.
Prior to TREC, there were 3,000 entries in the lexicon
established by manual examination of roughly 6,000 of the
most frequent words occurring in NASA KSC text. For
TREC, we made 1,000 new entries by examination of 1,700
frequent words occurring in the training text and the 52
training topics. Since the 1911 edition of Roget's Thesaurus
has become public domain, we also created software which
automatically extracted approximately 20,000 lexicon
entries. However, we did not have enough time to explore
the use of these entries.
In order to explain the assignment of semantic categories
to a given term using Roget's Thesaurus, consider the brief
index quotation for the term "vapor":
vapor
n. fog 404.2
   fume 401
   illusion 519.1
   4.3
   steam 328.10
   thing imagined 535
v. be bombastic 601.6
   bluster 911.3
   boast 910.6
   exhale 310.23
   talk nonsense 547.5
The eleven different meanings of the term "vapor" are given
in terms of a numerical category. We have developed a
mapping of the numerical categories in Roget's Thesaurus to
the thematic role and attribute categories given in Figure 1.
In this example, "fog" and "fume" correspond to the attribute
State; "steam" maps to the attribute Temperature; and "ex-
hale" is a trigger for the attribute Motion with Reference to
Direction. The remaining seven meanings associated with
"vapor" do not trigger any thematic roles or attributes. Since
there are eleven meanings associated with "vapor," we
indicate in the lexicon a probability of 1/11 each time a
category is triggered. Hence, a probability of 2/11 is assigned
to State, 1/11 to Temperature, and 1/11 to Motion with
Reference to Direction. This technique of calculating prob-
abilities is being used as a simple alternative to a corpus
analysis. It should be pointed out that we are still
experimenting with other ways of calculating probabilities.
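The 1/11 bookkeeping for "vapor" can be written out directly. The short C sketch below is a simplified stand-in for the lexicon-building step: it counts how many thesaurus senses of the word trigger each category and divides by the total number of senses. The category names and the sense-to-category mapping come from the "vapor" example above; the order of the eleven senses in the array is arbitrary.

    #include <stdio.h>

    /* Categories triggered by senses of "vapor" in the example above;
     * CAT_NONE marks a sense that triggers no thematic role or attribute. */
    enum { CAT_NONE = -1, CAT_STATE = 0, CAT_TEMPERATURE, CAT_MOTION_REF_DIR, NCATS };

    static const char *cat_name[NCATS] = {
        "State", "Temperature", "Motion with Reference to Direction"
    };

    int main(void)
    {
        /* One entry per thesaurus sense of "vapor" (eleven senses in all):
         * fog, fume, steam, and exhale trigger categories; the rest do not. */
        int sense_cat[11] = {
            CAT_STATE, CAT_STATE, CAT_NONE, CAT_NONE, CAT_TEMPERATURE,
            CAT_NONE, CAT_NONE, CAT_NONE, CAT_NONE, CAT_MOTION_REF_DIR, CAT_NONE
        };
        int nsenses = 11, count[NCATS] = {0};

        for (int i = 0; i < nsenses; i++)
            if (sense_cat[i] != CAT_NONE)
                count[sense_cat[i]]++;

        /* Lexicon entry for "vapor": probability = count / number of senses */
        for (int c = 0; c < NCATS; c++)
            printf("vapor: %-36s %d/%d\n", cat_name[c], count[c], nsenses);
        return 0;
    }

Running this prints the 2/11, 1/11, and 1/11 values given in the text.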
Figure 2 shows lexicon entries for prepositions as a sample
of the lexicon used in our experiments. These entries are
somewhat misleading. Most prepositions trigger too many
semantic categories to be of real use. The prepositions
"during" and "until" are examples of useful prepositions.
3.3 Extended Computation of the Similarity Measure
The probabilistic details of a semantic lexicon and the
computation of semantic weights can be found in [9]. A
detailed explanation of the manner in which we combine
semantic weights and keyword weights can be found in [8].
Essentially we treat semantic categories like indexing
terms, and the probabilities introduced by a semantic lexicon
mean that the frequency of a category in a document becomes
an expected frequency and the presence of a category in a
document becomes a probability for the category being
present. This means that the document frequency for a
category becomes an expected document frequency, and this
enables an inverse document frequency to be calculated for
a category.
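As a minimal sketch of this bookkeeping (assuming, for the sketch only, that the words of a document trigger a category independently; the exact formulas are given in [9]), the expected frequency of a category in a document is the sum of the trigger probabilities of the document's words, and the probability that the category is present follows from the complements. The probability values below are illustrative, not taken from the actual lexicon.

    #include <stdio.h>

    /* Trigger probabilities of one semantic category (say, Time) for the
     * successive lexicon words found in a short document. */
    static const double p[] = {0.25, 1.0 / 11.0, 0.5, 0.0, 0.2};
    #define NWORDS (sizeof p / sizeof p[0])

    int main(void)
    {
        double expected_freq = 0.0;   /* expected number of occurrences */
        double prob_absent = 1.0;     /* probability that no word triggers it */

        for (unsigned i = 0; i < NWORDS; i++) {
            expected_freq += p[i];
            prob_absent *= 1.0 - p[i];
        }

        /* "Probability present" for the category in this document; summing
         * these values over the collection gives an expected document
         * frequency, from which an inverse document frequency follows. */
        printf("expected frequency:  %f\n", expected_freq);
        printf("probability present: %f\n", 1.0 - prob_absent);
        return 0;
    }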
after: Time(3), None(3), External and Internal Dimensions(2), Motion with Respect to Direction, Order, Location/Space
above: Amount(2), Location/Space, None(2), Linear Dimensions(2), Order, External and Internal Dimensions, Position
as: Condition, Comparison, Manner, None(2), Amount
at: Condition, Location/Space, Manner, Time, Position, External and Internal Dimensions, Duration
atop: Location/Space, Position
before: Location/Space, Time(4), External and Internal Dimensions, Motion with Respect to Direction, Order, None(2), Position
below: Motion with Respect to Direction, None(2), Amount, State, Linear Dimensions, Location/Space, Position
between: Comparison, Duration, External and Internal Dimensions, Location/Space, Position, Amount
by: Amount, Conveyance, Location/Space, Time, Position, External and Internal Dimensions, Means, Order, Motion with Respect to Direction, Duration, None
during: Duration, Time
except: Condition, None(2), Order, Amount(2)
for: Duration, Goal, Purpose, Variation, Destination, Beneficiary, Amount, Range
from: Cause, Source, Time, Motion with Respect to Direction(2), Amount, Location/Space, Position, Instrument
in: Instrument, Location/Space, Purpose, Time, Motion with Respect to Direction(5), External and Internal Dimensions(2), Position, Condition, Goal, Means
into: Condition, Location/Space, Time, Motion with Respect to Direction, External and Internal Dimensions, Position, None
like: Comparison(3), Amount(2), Condition, Manner, None(7)
of: Cause, Location/Space, Source, None, Time, Duration, Position
on: Condition, Conveyance, Location/Space, Source, Time, Linear Dimensions(2), Motion with Respect to Direction, General Dimensions, Position, Means, External and Internal Dimensions, Purpose
over: Duration, Location/Space, Time(2), Order, Linear Dimensions(3), Amount(4), General Dimensions, Motion with Respect to Direction, External and Internal Dimensions, Position, Degree, Condition
per: Means, Order(2)
through: None, Order, Goal, Linear Dimensions, Means, Time, Location/Space, Position, Motion with Respect to Direction
to: Accompaniment, Beneficiary, Condition, Degree, Location/Space, Purpose, Result, Time, General Dimensions, Position, Motion with Respect to Direction(2), Comparison
under: Condition, Location/Space, Position, Order, Degree, None, Amount, Linear Dimensions(3)
until: Condition, Time
upon: Condition, Conveyance, Source, Time, None, General Dimensions, Linear Dimensions, Means, External and Internal Dimensions, Motion with Respect to Direction, Purpose
with: Accompaniment(2), Comparison, Manner, Result, Amount(3), Means, None, Order, Location/Space, Position, Time
within: Range, External and Internal Dimensions(2), Time, Degree, Position, Location/Space
without: External and Internal Dimensions, Order, Amount, None(2)
Figure 2. Prepositions and Their Semantic Categories
(Reprinted by permission of [3].)
For TREC semantic experiments performed after the
September deadline (refer to Section 5.2), we treated
semantic categories like keywords and used the following
normalized similarity coefficient:

$$sim(Q, D_i) = \frac{\sum_{j=1}^{n+s} w_{qj} d_{ij}}{n + s}$$

where $n$ is the number of keyword terms and $s = 36$ is the
number of semantic categories. It should be pointed out that
we are still performing experiments to determine a proper
blend of keyword and semantic information.
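Treating the s = 36 categories as extra vector positions appended to the keyword positions, the normalized coefficient above can be sketched as follows in C. The number of keyword dimensions and the individual weights are illustrative, not values produced by the system.

    #include <stdio.h>

    #define NKEYS 4          /* n: keyword dimensions (illustrative) */
    #define NSEMS 36         /* s: semantic category dimensions */
    #define NDIMS (NKEYS + NSEMS)

    /* Normalized similarity over the combined keyword + semantic vector */
    static double combined_sim(const double *q, const double *d, int n)
    {
        double dot = 0.0;
        for (int j = 0; j < n; j++)
            dot += q[j] * d[j];
        return dot / n;
    }

    int main(void)
    {
        double q[NDIMS] = {0}, d[NDIMS] = {0};

        /* keyword weights (tf * idf) occupy the first n positions ... */
        q[0] = 1.2; q[2] = 0.7;
        d[0] = 0.9; d[1] = 0.4; d[2] = 1.1;

        /* ... and semantic category weights (expected tf * idf) follow */
        q[NKEYS + 5] = 0.3;   /* e.g. the Duration category */
        d[NKEYS + 5] = 0.2;

        printf("sim = %f\n", combined_sim(q, d, NDIMS));
        return 0;
    }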
4. System Details
The following is an overview of the system we used to
perform the TREC experiments. As pointed out earlier, many
hours of programming and debugging effort were used to
create a system for the TREC experiments.
4.1 The Scanning Process
Under the QA scanning procedure, scanning a document
for indexing terms is a three-step process:
A. A token is scanned.
B. The token is analyzed. It is compared to a list of 166
stopwords, and these words (along with any numbers) are
later discarded. Dates are transformed into a generic format,
to allow for the matching of dates in a variety of different
formats. Non-valid hyphenated words are separated into
multiple tokens. Words that can be abbreviated (as deter-
mined by a list of 122 abbreviations) are replaced by their
abbreviated form. Multiword phrases with known acronyms
are detected and replaced by their respective acronyms.
C. The remainder of the words are stemmed according
to a modified version of the J. B. Lovins stemming
algorithm [5]. In some cases, prefixes are also removed.
The scanning procedure for the TREC experiments is a
modification of the QA scanning procedure, in which only
the words found in the text fields of the documents are
tokenized. In order to speed up text processing, the amount
of analyzing in step B is reduced. Dates are no longer
transformed, and abbreviations and acronyms are not created.
The stemming algorithm and removal of prefixes in step C
remains unchanged.
For the TREC scanning procedure, an additional step is
added. During this extra step, a hashing algorithm is used to
assign each indexing term a unique 32-bit integer value,
which we call a stem code.
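A minimal sketch of this extra step: a simple multiplicative string hash (FNV-1a here, chosen only for illustration; the hash actually used by the system is not specified above) maps each stem to a 32-bit stem code. Collision handling, which a real implementation would need to keep the codes unique, is omitted. The sample stems are taken from Figure 4.

    #include <stdint.h>
    #include <stdio.h>

    /* Map a stemmed token to a 32-bit stem code (FNV-1a style hash). */
    static uint32_t stem_code(const char *stem)
    {
        uint32_t h = 2166136261u;                 /* offset basis */
        for (const unsigned char *p = (const unsigned char *)stem; *p; p++) {
            h ^= *p;
            h *= 16777619u;                       /* FNV prime */
        }
        return h;
    }

    int main(void)
    {
        const char *stems[] = {"schedl", "debt", "credit", "countr"};
        for (int i = 0; i < 4; i++)
            printf("%-8s -> %08x\n", stems[i], stem_code(stems[i]));
        return 0;
    }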
4.2 Data Files
After a document is processed by the TREC version of the
QA System, four primary data files are created. These are
the document weight file, the inverted index file, the inverted
data file, and the document name file.
The document weight file is a binary file containing a list
of floating point numbers. These numbers are ordered
sequentially by document number, and represent the sum-
mation of the weight in the query squared for a particular
document's keywords.
The inverted index file is ordered by stem code. Each code
has two file pointers, one pointing to the first block of data
in the data file, and the second pointing to the last block of
data. The inverted data file then consists of blocks of data,
containing pairs of document numbers that the code is found
in, and the code's frequency within that document. The
blocks are linked together to form a list.
The document name file consists of a list of pointers into
the original text file.
For Vol.1, both the document weight file and the inverted
index file were two megabytes. The inverted data file was
approximately 385 megabytes, and the document name file
was two megabytes.
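Assuming the record layouts just described, the four files might be declared along the following lines in C; the field names, block size, and pointer representation are illustrative rather than taken from the actual implementation.

    #include <stdint.h>

    /* Document weight file: one floating point number per document,
     * ordered by document number (the per-document weight sum
     * described above). */
    typedef float doc_weight_t;

    /* Inverted index file: one record per stem code, with pointers to
     * the first and last blocks of its list in the inverted data file. */
    typedef struct {
        uint32_t stem_code;     /* 32-bit code from the hashing step */
        long     first_block;   /* file offset of first data block */
        long     last_block;    /* file offset of last data block */
    } index_entry_t;

    /* Inverted data file: blocks of (document number, frequency) pairs,
     * linked together to form a list for each stem code. */
    #define PAIRS_PER_BLOCK 64  /* illustrative block size */
    typedef struct {
        uint32_t doc_num[PAIRS_PER_BLOCK];
        uint32_t freq[PAIRS_PER_BLOCK];
        int      used;          /* pairs actually stored in this block */
        long     next_block;    /* offset of next block, or -1 at end */
    } data_block_t;

    /* Document name file: one pointer (offset) into the original text
     * for each document. */
    typedef long doc_name_entry_t;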
4.3 Basic Procedure
For the TREC experiments, we did the following:
Step 1: This step involves matching every legitimate stem
in the document collection with a unique integer value. This
is done with a linear hashing function. A table containing
this mapping, along with the number of documents each code
is found in, is temporarily saved for use in Step 2.
Step 2: This step creates the four data files described
above. The entire database is scanned, with the four files
being created on the fly. Once this is accomplished, the table
from Step 1 is no longer necessary, and is discarded.
Step 3: Relevant documents for each query are selected
using the Jaccard similarity coefficient. The top 200 docu-
ments for each query are then determined.
The above three steps were followed to create the results
for the September deadline. In the next section we present
our "official" experiments and results, and some "unofficial"
experiments and results.
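As a sketch of Step 3 above, the following C fragment accumulates a similarity score per document and keeps the 200 best with a simple sort; the scores here are synthetic stand-ins for the Jaccard coefficients computed from the inverted files.

    #include <stdio.h>
    #include <stdlib.h>

    #define TOP_K 200

    typedef struct {
        int    doc_num;
        double score;           /* similarity coefficient for this document */
    } ranked_doc_t;

    /* qsort comparator: highest score first */
    static int by_score_desc(const void *a, const void *b)
    {
        double sa = ((const ranked_doc_t *)a)->score;
        double sb = ((const ranked_doc_t *)b)->score;
        return (sb > sa) - (sb < sa);
    }

    int main(void)
    {
        int ndocs = 1000;       /* hypothetical collection size */
        ranked_doc_t *docs = malloc(ndocs * sizeof *docs);
        if (!docs) return 1;

        /* Stand-in for the real scoring pass over the inverted data file. */
        for (int i = 0; i < ndocs; i++) {
            docs[i].doc_num = i;
            docs[i].score = (double)rand() / RAND_MAX;
        }

        /* Rank and keep the top 200 documents for the query. */
        qsort(docs, ndocs, sizeof *docs, by_score_desc);
        int keep = ndocs < TOP_K ? ndocs : TOP_K;
        for (int i = 0; i < keep; i++)
            printf("%4d  doc %d  %.4f\n", i + 1, docs[i].doc_num, docs[i].score);

        free(docs);
        return 0;
    }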
5. Experiments and Results
Our experiments were intended to be Category A
experiments with two results submitted for each ranking task.
One ranking result would be for just keywords, the other
ranking result would be for keywords combined with
semantics. All query construction was automatic, and the
treatment of ad-hoc routing queries was identical.
We also performed experiments concerning the expected
behavior of string hashing functions and the use of part-of-
speech tagging to improve retrieval performance. These
experiments are not reported here.
As an example of our automatically built ad-hoc and
routing queries, consider Topic 004 reproduced in Figure 3.
Figure 4 indicates the keyword and semantic information
generated by the QA System for this topic. The first part of
Figure 4 indicates the stems along with their frequencies
found in the query. The second part of Figure 4 indicates the
semantic categories also found in the query along with their
expected frequencies and probability present.
It is important to note that the topic represented in Figure
3 and Figure 4 has generated many semantic categories and
the probability present for most of them is close to, or at,
100%. This is mainly due to the length of the text involved.
We discovered that, for the TREC document collection, each
document generated many semantic categories with high
probability present. Because we treated semantic categories
like keywords, this caused semantic weights to be essentially
useless.
For the TREC September deadline, we were only able to
submit a routing experiment using a keywording approach.
The results of this experiment, computed with the aid of Chris
Buckley's SMART evaluation program [1], are shown in
Figure 5. The results are not good. In Section 6, we discuss
what impeded our experiments.
Further "unofficial" experiments were designed to test the
use of semantics. The main goal of our experiments was to
demonstrate that our original routing results could be
improved through the use of semantic analysis. In order to
do this, we made two modifications to our approach. The
first change involved dividing the original TREC documents
into paragraphs. The second change involved a semantic
analysis when calculating the list of relevant documents.
Our experiments involved the use of only six routing
queries (for topics 001, 002, 003, 007, 017, and 022). These
topics were selected because our original results for them
were poor. Through the use of semantic analysis, we hoped
to significantly improve our results. Figure 6 shows the
precision-recall statistics for the six "poor" queries using the
retrieval results which created the statistics in Figure 5.
When analyzing our results, we computed all precision-
recall tables through the use of Chris Buckley's SMART
evaluation program [1]. The relevancy lists used were those
produced before November 1 (the original qrels for the
routing queries). We did not use the modified results that
were distributed later for Query 017. This should not affect
our results, though, because our experiments were aimed only
at improving our precision-recall averages, and the relevancy
results used were consistent from one experiment to the other.
5.1 Re-ranking of Documents
In an effort to demonstrate that semantics could affect
retrieval in the TREC environment, we used the original QA
System with a semantic lexicon containing TREC words as
described in Section 3.2. We created a separate database for
each query we considered, for a total of six databases. Each
database contained only the documents that we originally
judged in the top 250 for each query. Because of this, when
we computed new relevancy lists we were simply rearranging
the order of the same 250 documents, not bringing in new
documents.
Figure 7 reveals the precision-recall statistics when orig-
inally retrieved documents for a query are used as a document
collection and re-ranked by imposing the query again. There
is a 25.8% increase when comparing the 11-pt average here
to the 11-pt average of the originally retrieved text (Figure
6).
To determine the ranking of a particular document with
paragraph divisions, we defined the similarity coefficient of
a document to be equal to the highest coefficient associated
with one of its corresponding paragraphs. The paragraph
divisions were automatically constructed from the original
text. The precision-recall statistics for paragraphs being used
as documents are shown in Figure 8. There is an 18.8%
Tipster Topic Description
Number: 004
Domain: International Finance
Topic: Debt Rescheduling
Description: Document will discuss a current debt
rescheduling agreement between a developing country and
one or more of its creditor(s).
Narrative:
A relevant document will discuss a current debt rescheduling
agreement reached, proposed, or being negotiated between a
debtor developing country and one or more of its creditors,
commercial and/or official. It will identify the debtor country
and the creditor(s), the repayment time period requested or
granted, the monetary amount requested or covered by the
accord, and the interest rate, proposed or set.
Concept(s):
1. rescheduling agreement, accord, settlement, pact
2. bank debt, commercial debt, foreign debt, trade debt,
medium-term debt, long-term debt
3. negotiations, debt talks
4. creditor banks, creditor countries/governments, Paris Club
5. debtor countries, developing countries
6. debt package
7. debt repayments
8. restructuring, rescheduling existing loans
9. lower interest-rate margin, easier terms, more lenient terms
Factor(s):
Nationality: Developing country
Time: Current
Definition(s):
Debt Rescheduling - agreement between creditors and debtor
to provide debt relief by altering the original payment terms
of an existing debt. This is most often accomplished by
lengthening the original schedule for principal and interest
payments, and deferring interest payments. Done most
publicly by developing countries and their bankers, but often
less publicly by other willing creditors and debtors, e.g.,
governments, banks and companies. Much in vogue in the
early 1980s, the road to rescheduling for countries in crisis
runs as follows: when a country borrows so much that its
lenders grow nervous, the banks start lending for shorter and
shorter maturities. Eventually the country, though still paying
interest on its debt, is unable to make payments on the
principal. The country is then forced to request a rescheduling,
which means that it is able to escape its immediate repayment
commitments by converting short-term loans into longer-term
ones. A country wishing to reschedule its official debt tal
tal 1 wish 1
on 1 longer 1
short 1 convers 1
commitm 1 immedi 1
escap 1 abl 1
mean 1 forc 1
un 1 pay 1
though 1 eventu 1
matur 1 shorte 2
lend 1 start 1
nerv 1 row 1
lender 1 orrow 1
follow 1 run 1
cris 1 road 1
ear 1 vogu 1
compan 1 e.g. 1
wil 1 less 1
banker 1 public 2
defer 1 princip 2
length 1 accompl 1
orisin 2 alter 1
relief 1 prov 1
leni 1 easxxx 1
margin 1 lower 1
loan 2 exist 2
structur 1 pack 1
club 1 par 1
2 talk 1
~ernm 1 term 6
m~4erm 1 trad 1
foreign 1 bank 4
pact 1 settl 1
rat 2 interest 5
accord 2 cover 1
amount 1 monet 1
grant 1 request 3
period 1 paf~m 7
identif 1 2
commerc 2 negoti 2
propos 2 reach 1
relev 1 credit 7
countr 13 develop 5
4 cur 3
discus 2 docum 2
schedl 10 debt 22
fin 1 intern 1
                                        Expected     Probability
                                        Frequency    Present
Linear Dimensions 0.300000 0.300000
Motion with Reference to Direction 2.000000 1.000000
Accompaniment 0.333333 0.333333
Condition 2.000000 0.937500
External and Internal Dimensions 0.333333 0.333333
Time 5.744589 1.000000
Duration 1.291667 0.819444
Purpose 2.000000 1.000000
Color 2.333333 0.941472
Position 6.071428 0.999654
Location/Space 6.071428 0.999654
Variation 5.933333 1.0000
Amount 53.412807 1.0000
Order 6.833333 1.000000
Figure 3. Topic 004.    Figure 4. Automatically Generated Query for Topic 004.
Top ranked evaluation
Run number: UCFQA1 (category B)
Num_queries: 19
Total number of documents over all queries
    Retrieved: 3800
    Relevant: 2056
    Rel_ret: 517
Recall - Precision Averages:
    at 0.00    0.5146
    at 0.10    0.2645
    at 0.20    0.1735
    at 0.30    0.1006
    at 0.40    0.0693
    at 0.50    0.0107
    at 0.60    0.0069
    at 0.70    0.0000
    at 0.80    0.0000
    at 0.90    0.0000
    at 1.00    0.0000
Average precision for all points
    11-pt Avg: 0.1037
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg: 0.0614
Recall:
    at 5 docs:    0.0162
    at 15 docs:   0.0362
    at 30 docs:   0.0692
    at 100 docs:  0.2156
    at 200 docs:  0.2899
Precision:
    at 5 docs:    0.2737
    at 15 docs:   0.2211
    at 30 docs:   0.2088
    at 100 docs:  0.1932
    at 200 docs:  0.1361
Figure 5. Results for Experiment Completed by
September Deadline.
Top ranked evaluation
Run number: 1
Num_queries: 6
Total number of documents over all queries
    Retrieved: 1200
    Relevant: 863
    Rel_ret: 147
Recall - Precision Averages:
    at 0.00    0.3486
    at 0.10    0.1676
    at 0.20    0.0749
    at 0.30    0.0279
    at 0.40    0.0000
    at 0.50    0.0000
    at 0.60    0.0000
    at 0.70    0.0000
    at 0.80    0.0000
    at 0.90    0.0000
    at 1.00    0.0000
Average precision for all points
    11-pt Avg: 0.0563
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg: 0.0250
Recall:
    Exact:        0.1997
    at 5 docs:    0.0033
    at 10 docs:   0.0062
    at 30 docs:   0.0382
    at 100 docs:  0.1212
    at 200 docs:  0.1997
Precision:
    Exact:        0.1225
    at 5 docs:    0.1000
    at 10 docs:   0.0833
    at 30 docs:   0.1556
    at 100 docs:  0.1550
    at 200 docs:  0.1225
Figure 6. Results for Six Poorly Performing Queries using
Experiment Completed by September Deadline.
Top ranked evaluation
Run number: 1
Num_queries: 6
Total number of documents over all queries
    Retrieved: 1200
    Relevant: 863
    Rel_ret: 152
Recall - Precision Averages:
    at 0.00    0.4585
    at 0.10    0.2113
    at 0.20    0.0775
    at 0.30    0.0315
    at 0.40    0.0000
    at 0.50    0.0000
    at 0.60    0.0000
    at 0.70    0.0000
    at 0.80    0.0000
    at 0.90    0.0000
    at 1.00    0.0000
Average precision for all points
    11-pt Avg: 0.0708
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg: 0.0258
Recall:
    Exact:        0.2109
    at 5 docs:    0.0072
    at 10 docs:   0.0128
    at 30 docs:   0.0469
    at 100 docs:  0.1374
    at 200 docs:  0.2109
Precision:
    Exact:        0.1267
    at 5 docs:    0.2000
    at 10 docs:   0.1833
    at 30 docs:   0.1944
    at 100 docs:  0.1683
    at 200 docs:  0.1267
Figure 7. Results of Keywording with Document Divisions.
Top ranked evaluation
Run number: 1
Num_queries: 6
Total number of documents over all queries
    Retrieved: 1200
    Relevant: 863
    Rel_ret: 157
Recall - Precision Averages:
    at 0.00    0.4096
    at 0.10    0.2089
    at 0.20    0.0832
    at 0.30    0.0340
    at 0.40    0.0000
    at 0.50    0.0000
    at 0.60    0.0000
    at 0.70    0.0000
    at 0.80    0.0000
    at 0.90    0.0000
    at 1.00    0.0000
Average precision for all points
    11-pt Avg: 0.0669
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg: 0.0277
Recall:
    Exact:        0.2167
    at 5 docs:    0.0121
    at 10 docs:   0.0167
    at 30 docs:   0.0462
    at 100 docs:  0.1392
    at 200 docs:  0.2167
Precision:
    Exact:        0.1308
    at 5 docs:    0.2667
    at 10 docs:   0.1833
    at 30 docs:   0.1833
    at 100 docs:  0.1650
    at 200 docs:  0.1308
Figure 8. Results of Keywording with Paragraph Divisions.
increase when comparing the 11-pt average here to the 11-pt
average of the originally retrieved text (Figure 6).
The re-ranking of documents using keywords was done in
order to establish a baseline for the semantic experiments
reported in the next subsection. The increase in retrieval
performance because of the re-ranking was a surprise.
According to Sparck Jones' criteria, the increases of 25.8%
and 18.8% would each be categorized as "significant" (greater
than 10.0%) [4].
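To make the paragraph-division ranking described earlier concrete: a document's similarity coefficient is simply the maximum over its paragraphs, as in the following sketch (the paragraph scores are synthetic).

    #include <stdio.h>

    /* Similarity coefficients of the paragraphs making up one document
     * (synthetic values; real values come from the retrieval run). */
    static const double para_sim[] = {0.12, 0.37, 0.09, 0.21};
    #define NPARAS (sizeof para_sim / sizeof para_sim[0])

    int main(void)
    {
        /* Document score = highest coefficient among its paragraphs */
        double doc_sim = para_sim[0];
        for (unsigned i = 1; i < NPARAS; i++)
            if (para_sim[i] > doc_sim)
                doc_sim = para_sim[i];

        printf("document similarity = %f\n", doc_sim);
        return 0;
    }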
5.2 Semantic Experiments
Essentially, what we are trying to show in this section is
that semantics can be useful if documents do not get larger
than a paragraph.
Figure 9 displays results when semantic similarity is
combined with keyword similarity for the case that originally
retrieved TREC documents are used. These results can be
compared to those in Figure 7 (the same documents with a
strictly keywording approach). Comparing the 11-pt average
for the two methods shows a 0.7% decrease when going from
the strictly keywording results to the combined semantic with
keywording results. According to Sparck Jones' criteria [4],
this is not a noticeable decrease. But, certainly, semantics
did not help.
Top ranked evaluation
Run number: 1
Num_queries: 6
Total number of documents over all queries
    Retrieved: 1200
    Relevant: 863
    Rel_ret: 155
Recall - Precision Averages:
    at 0.00    0.4370
    at 0.10    0.2250
    at 0.20    0.0789
    at 0.30    0.0329
    at 0.40    0.0000
    at 0.50    0.0000
    at 0.60    0.0000
    at 0.70    0.0000
    at 0.80    0.0000
    at 0.90    0.0000
    at 1.00    0.0000
Average precision for all points
    11-pt Avg: 0.0703
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg: 0.0263
Recall:
    Exact:        0.2133
    at 5 docs:    0.0072
    at 10 docs:   0.0128
    at 30 docs:   0.0508
    at 100 docs:  0.1378
    at 200 docs:  0.2133
Precision:
    Exact:        0.1292
    at 5 docs:    0.2000
    at 10 docs:   0.1833
    at 30 docs:   0.2000
    at 100 docs:  0.1700
    at 200 docs:  0.1292
Figure 9. Results of Semantics with Document Divisions.
The explanation for this is that when the size of a document
is large, a greater number of semantic categories are triggered
in the document. Also, the probability present for each
category in a document is often very close to 100%. Con-
sequently, almost every semantic category becomes present
in every document causing the semantic category weights to
become very low and useless.
To remedy this problem, we used the database that was
constructed with paragraph divisions, and computed rele-
vancy lists using the combined semantic and keywording
approach explained in Section 3.3. The results are shown in
Figure 10. The statistics there can be compared to those in
Figure 8 (the same documents with a strictly keywording
approach). When going from a keywording approach to a
combined semantic and keywording approach, the 11-pt
average increased by 7.9%. According to Sparck Jones'
criteria, this change would be classified as "noticeable"
(greater than 5.0%) [4].
The two semantic experiments reported here demonstrate
the main thing that we have learned. Our semantic approach
to text retrieval is only useful when documents are no larger
than a paragraph.
Top ranked evaluation
Run number: 1
Num_queries: 6
Total number of documents over all queries
    Retrieved: 1176
    Relevant: 863
    Rel_ret: 158
Recall - Precision Averages:
    at 0.00    0.4556
    at 0.10    0.2231
    at 0.20    0.0845
    at 0.30    0.0294
    at 0.40    0.0000
    at 0.50    0.0000
    at 0.60    0.0000
    at 0.70    0.0000
    at 0.80    0.0000
    at 0.90    0.0000
    at 1.00    0.0000
Average precision for all points
    11-pt Avg: 0.0720
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg: 0.0282
Recall:
    Exact:        0.2159
    at 5 docs:    0.0096
    at 10 docs:   0.0171
    at 30 docs:   0.0513
    at 100 docs:  0.1456
    at 200 docs:  0.2159
Precision:
    Exact:        0.1332
    at 5 docs:    0.2333
    at 10 docs:   0.2000
    at 30 docs:   0.2167
    at 100 docs:  0.1767
    at 200 docs:  0.1317
Figure 10. Results of Semantics with Paragraph Divisions.
It should be noted that the semantic experiments reported
here were crude. Our lexicon did not have enough TREC
words, and we used a blend of keyword and semantic weights
that was not the best. So we expect that better semantic results
for paragraphs can be achieved (refer to Section 6).
6. Failure Analysis
In general our participation in the TREC experiments was
impeded by the following:
PC DOS Platform. This platform has a serious memory
addressing restriction which results in memory page swap-
ping and this seriously affects the speed of processing,
especially during creation of inverted files and index
structures. We can solve this problem by moving to an OS/2
or UNIX platform.
Extra Semantic Processing Time. Our semantic proba-
bilistic and statistical calculations more than double the
processing time for indexing and statistical ranking of
retrieved documents. Again, we can solve this problem by
moving to an OS/2 or UNIX platform.
Time to Build Semantic Lexicon. We were only able to
incorporate 1000 frequently occurring words in the training
text within our semantic lexicon. We did not have enough
time to process the test text for the ad-hoc queries. This
problem can be solved by having archival data distributed
earlier. We suspect that by having more TREC words in our
semantic lexicon, better results could have been achieved in
Section 5.2 when paragraphs are used as the basis for retrieval.
Unknown Blend for Semantic and Keyword Weights.
There are three main aspects to our blend of semantic and
keyword weights within the vector processing model:
(i) The Proper Probabilities to Use for the Semantics
Triggered by a Word. For example, we let the word
"vapor" trigger State with 18% probability, Tem-
perature with 9% probability, and Motion with
Reference to Direction with 9% probability. We
have several techniques for determining probabili-
ties such as these.
(ii) The Scaling of Keyword Weights and Semantic
Weights. For example, in a Question/Answer
environment where queries are the length of a
sentence and documents are either a sentence or at
most a paragraph, we have been successful by
forcing semantic similarity to be approximately 1/3
of keyword similarity when the two are combined
in processing small document collections (less than
1000 documents). There was no scaling for the
experiments reported in Section 5.2; we suspect
better results could have been achieved.
(iii) Independent Semantic Weights and Keyword
Weights. A valid criticism of our research has been
that the semantic contribution from a word in a
document should be kept independent of the word's
own similarity contribution if the word is a keyword
in common with the query.
The overall problem of proper blend can be solved by
spending more time using TREC test documents, test topics,
and good test relevance judgments to run many retrieval
experiments to establish the correct blend.
Number of Semantic Categories. Another way to solve
the problem of long documents causing semantic weights to
be of little value is to have more semantic categories. A large
number of "semantic" categories could be obtained (for
example) by using all the categories and/or subcategories
found in Roget's Thesaurus, instead of the 36 semantic
categories we use. This would be a deviation from database
semantic modeling but it probably should be examined.
Block-split Tree Structured Files. The QA System used
B+ tree structures for implementing inverted files and this
actually slowed the system in our DOS environment. The
QA System also had severe storage overhead due to storage
of character strings in the B+ trees. We have solved these
two problems by implementing a separate system using a
hashing function to establish codes for strings.
TREC Document Length. Semantic experiments like
those reported in Section 5.2 have shown that documents
larger than a paragraph cause our semantic approach to be of
little value. This problem can be corrected by considering
paragraphs as a basis for document retrieval.
Finally, we spent too much time on work that was never
incorporated in our experiments. We originally designed an
efficient method of inverting data files, but it could not be
used for routing queries. Also, trying to do semantic
part-of-speech tagging experiments using SQL/DS slowed us
down.
References
[1] C. Buckley, SMART Evaluation Program (for TREC),
Cornell SMART Group, Cornell University.
[2] C. Date, An Introduction to Database Systems, Vol.1,
Addison Wesley, 1990.
[3] Hello Software, P.O. Box 494, Goldenrod, FL 32733.
[4] K. Sparck Jones and R. Bates, "Research on Automatic
Indexing 1974-1976," Technical Report, Computer
Laboratory, University of Cambridge, 1977.
[5] J. B. Lovins, "Development of a Stemming Algorithm,"
Mechanical Translation and ComputationalLinguistics,
Vol. 11, No. 1-2, pp. 11-31, March and June 1968.
[6] Roget's International Thesaurus, Harper & Row, New
York, Fourth Edition, 1977.
[7] G. Salton, Automatic Text Processing, Addison-Wesley,
1989.
[8] D. Voss and J. Driscoll, "Text Retrieval Using a Com-
prehensive Semantic Lexicon," Proceedings of ISMM
First International Conference on Information and
Knowledge Management, Baltimore, Maryland,
November 1992.
[9] E. Wendlandt and J. Driscoll, "Incorporating a Semantic
Analysis into a Document Retrieval Strategy," Pro-
ceedings of the Fourteenth Annual International
ACM/SIGIR Conference on Research and Development
in Information Retrieval, Chicago, Illinois, pp. 270-279,
October 1991.