Text Retrieval with the TRW Fast Data Finder
Matt Mettler
TRW Systems Development Division
One Space Park R212194
Redondo Beach, CA 90278
(310) 814-4925
matt@wilbur.coyote.trw.com
TRW has been building high performance text processing and retrieval systems for a
number of years. Most of these systems have involved the application of the TRW Fast
Data Finder (FDF) text search hardware and have been designed to meet the requirements
of specific government customers. Our goal for the TREC conference has been to consider
and experiment with the FDF as a tool for more general purpose information retrieval, and
to determine the FDF's strengths and weaknesses compared to conventional information
retrieval techniques.
1.0 Introduction
Our experience with the TREC conference has left us encouraged about the ability of a text
scanning approach to be competitive with the more involved information retrieval
techniques. The inherent limitations of the FDF hardware do not prevent competitive
precision and recall for general information retrieval applications when the user topics are
properly understood and the topic queries are properly tuned to the dataset.
2.0 FDF Text Retrieval Approach
The Fast Data Finder is a hardware device that performs high-speed pattern matching on a
stream of 8-bit data. It consists of an array of identical programmable text processing cells
connected in series to form a pipeline processor. The cells are implemented using a custom
VLSI chip designed and patented by TRW. In the latest implementation, each chip contains
24 processor cells, and a typical system will have 3,600 cells. Each cell can match a
single character of the query or perform all or part of a logical operation. The processors are
interconnected with an 8-bit data path and an approximately 20-bit control path. To perform
a search, a microcode program is first downloaded into the pipeline to direct each
processor. The database is then streamed through the pipeline. The data bytes clock
through each processor in turn until the whole database has passed through all processors.
As the data is clocking through, the processors alter the state of the control lines depending
on their program and the data stream values.
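The cell-pipeline idea can be visualized in software. The sketch below is a toy model of the cell-per-character design in plain Python, not TRW's actual microcode, and it omits the FDF's logical operators: each "cell" holds one pattern character and raises its control line when its character matches the current byte and the upstream cell's line was raised on the previous byte.

```python
def match_stream(pattern: str, stream: str):
    """Toy model of a systolic character-matching pipeline.

    Cell i raises its control line when its pattern character matches the
    current byte AND cell i-1's line was raised on the previous byte; a hit
    on the last cell means the whole pattern just streamed past."""
    n = len(pattern)
    lines = [False] * n                 # one control line per cell
    hits = []
    for pos, byte in enumerate(stream):
        # Update cells from the end of the pipe backwards so each cell
        # still sees its upstream neighbour's value from the previous tick.
        for i in range(n - 1, -1, -1):
            upstream = True if i == 0 else lines[i - 1]
            lines[i] = upstream and (byte == pattern[i])
        if lines and lines[-1]:
            hits.append(pos - n + 1)    # start offset of the match
    return hits
```

Every byte is examined by every cell in lock step, which is why the scan rate is independent of query complexity.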
When the pipeline's processor cells detect that a series of database characters matching
the desired pattern has passed by, a hit is indicated and passed by external circuitry back to
the memory of the host processor and to the user. The FDF pipeline runs at a constant speed
as it performs character comparisons and logical operations, regardless of query
complexity. The system we used for the TREC conference searched at 10 MB/sec.
The queries or patterns are specified in the FDF's Pattern Specification Language (PSL).
The hardware directly supports all the features in the PSL query language without the need
for software post-processing. The processors in the pipeline may all be used to evaluate a
single large query or may be assigned to evaluate numerous smaller queries. The number of
pipeline cells a query needs is proportional to the size of the query. PSL provides numerous
search functions, which may be nested in any combination, including:
* Boolean logic including negative conditions
* Proximity on any arbitrary pattern
* Wildcards and "don't cares"
* Character alternation
* Term counting, thresholds, and sets
* Error tolerance (fuzzy matching)
* Term weighting
* Numeric ranges
2.1 Advantages and Disadvantages of Hardware Scanning
There are four principal advantages in using a hardware scanning approach for information
retrieval. First, the FDF can perform pattern matching functions much faster and more
cost-effectively than a general-purpose CPU. This benefit comes in part because of the
parallelism of the FDF architecture. Second, a hardware scanner like the FDF can begin
processing the data immediately upon its receipt. There is no need to wait for the data to
be preprocessed or indexed before it can be searched. This is especially important for
dissemination (routing) applications. Third, no extra disk space is needed to store inverted
index data or other vector data beyond the text itself. Finally, the system's response time
in evaluating a query is independent of the query's complexity and thus easily predictable.
The FDF can perform fuzzy pattern matching on a term like "krasnoyarsk" with three
missing, incorrect, or extra characters, as easily as performing an exact string match.
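The error tolerance described here corresponds to classic edit distance: a word matches a term with k missing, incorrect, or extra characters if its Levenshtein distance from the term is at most k. A minimal software equivalent for illustration (this is not the FDF's hardware algorithm):

```python
def within_errors(term: str, word: str, k: int) -> bool:
    """True if `word` matches `term` with at most `k` single-character
    errors (missing, incorrect, or extra), i.e. edit distance <= k."""
    m, n = len(term), len(word)
    if abs(m - n) > k:          # lengths differ by more than k: impossible
        return False
    prev = list(range(n + 1))   # one row of the dynamic-programming table
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if term[i - 1] == word[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion (missing character)
                         cur[j - 1] + 1,      # insertion (extra character)
                         prev[j - 1] + cost)  # substitution (incorrect character)
        prev = cur
    return prev[n] <= k
```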
There are two disadvantages to the FDF scanning approach. First, it is moderately
expensive to buy the hardware and adapt application programs to work with it. The
approach is therefore not cost effective for low-end applications or systems that don't do
significant amounts of text processing. Second, since the search is not complete until the
entire database has been scanned, the time to complete a single simple query will be greater
than with indexing methods.
3.0 Query Generation Approach
Our objective for the TREC conference was to see if we could utilize the pattern matching
power of the FDF to achieve superior recall and precision. This in turn revolved around
how well we could construct the queries for the FDF to execute. We were able to identify
at least three possible approaches that could be used in preparing queries for execution by
the Fast Data Finder.
* Parse the topic narratives, extract key terms and phrases, expand the
terms where possible, and generate queries to find documents with the
same combinations of terms.
* Take a sample of relevant documents, extract common keywords and
phrases, especially those that occur multiple times, and generate queries
to find documents with at least some of the same phrases and keywords
within a sliding window of text about the size of a paragraph.
* Construct the initial queries manually and refine them iteratively.
We elected to try the latter two approaches. To supply the relevant documents for the
statistical trials, we used the sample relevance judgments supplied by NIST in late May and
early June.
3.1 Automatic Query Generation
Our plan was to take sample documents for a particular topic, merge them together, and
build a PSL query that would find similar documents. Using the single document
WSJ870320-0062 as a seed, the query would be something like:
{30 words ->
  5+ {'cola'; 'coca'; 'coca cola'; 'bottling';
      'enterprises'; 'cola bottling'; 'cola enterprises';
      'coca cola enterprises'; 'coca cola bottling'; 'atlanta'}}
This query finds a document which contains a 30 word sliding window with 5 or more of
the specified terms or phrases. The term list is determined by removing stopwords and
counting the number of occurrences of each term, 2-word phrase, and 3-word phrase in the
seed document. The top 10 terms/phrases with the highest counts are selected. The "30
words" and "5 or more" values were selected arbitrarily, and we'd planned to run a series of
trials to determine the optimal values.
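The term-selection step can be sketched as follows. The stopword list and tokenization here are illustrative stand-ins; the paper does not specify the ones actually used.

```python
from collections import Counter

# A small illustrative stopword list; the actual list used is not given here.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "for",
             "on", "with", "at", "by"}

def top_terms(seed_text: str, n: int = 10):
    """Count each term, 2-word phrase, and 3-word phrase in a seed document
    (after stopword removal) and return the n most frequent."""
    # Crude tokenizer: keep purely alphabetic, non-stopword tokens.
    words = [w for w in seed_text.lower().split()
             if w.isalpha() and w not in STOPWORDS]
    counts = Counter()
    for size in (1, 2, 3):
        for i in range(len(words) - size + 1):
            counts[" ".join(words[i:i + size])] += 1
    return [term for term, _ in counts.most_common(n)]
```

Ties in the counts are broken arbitrarily, which is one reason the selected term list can be noisy.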
The initial experiments with this method of query construction were not encouraging. We
ran into three difficulties.
* The May/June NIST sample relevance judgments seemed incomplete
and inaccurate and were not giving us the statistical base we'd hoped for.
* This method assumes that the whole document is all on one subject.
Longer seed documents were contributing terms that had little to do with
the topic. Some method to segment the documents and indicate the
interesting section is required.
* This method wasn't capturing the subtlety of the topics. The query
shown above does an excellent job of finding documents about Coca
Cola Enterprises or bottling units in Atlanta but completely misses the
part about antitrust violations because it is only mentioned once in the
article.
3.2 Manual Query Generation
The poor results from our initial trials with statistical query generation led us to fall back
on a purely manual (with feedback) approach. We extracted key concepts from the topic
description and added terms from outside knowledge or by observing them in
database documents. In building our multiple queries to provide a coarse-grain ranking, we
favored documents where the subqueries matched in lead sentences or paragraphs. We
mostly ignored the May/June NIST sample judgements.
Refinement of the queries was done manually by executing them, reviewing the results, and
modifying the queries. The easier topics required only a few iterations, while on some of
the more difficult topics we iterated several dozen times. We stopped working on a topic
when it seemed that the results were converging to a practical limit for our approach, i.e.,
when adding synonym keywords or altering the query structure wasn't producing
more reasonable results.
4.0 Results and Analysis
Table I shows our results for the TREC routing queries. Since our system doesn't really
rank the retrieved documents, we think the Table I presentation is more representative of
our performance than the 11-point averages. The first and last columns are the topic number
and description. The second column, "# Rel", is the number of relevant documents as
judged by NIST in the Volume II Corpus. The next three columns give an indication of
how the field did on the topic. "TRW Rel" is the number of relevant documents we
submitted out of the "TRW Submit" we sent in for each topic. Our scores are summarized
as follows:
High Above Med Median Below Med Low
8 15 4 19 2
Unlike most of the TREC participants, we did not submit the full 200 allowed documents
for each topic. This turned out to be a major blunder because the TREC scoring procedure
did not reward this self-restraint. Many of our queries were too restrictive, achieving high
precision at the expense of recall and of a good score for the conference.
This problem comes about because of the binary nature of the FDF's evaluation of a query
against a document. To operate properly against a routing data stream, it is necessary to
execute several queries for each topic, with each successive query aiming for higher recall.
When our queries were "tuned" properly, the results were quite good. Considering only
those queries where we made the full submission, the distribution is well above the median.
High Above Med Median Below Med Low
6 8 3 5 0
Our analysis shows that we did well on topics where the ability to find phrases, acronyms,
numbers, and alphanumerics was important. We had the high score on topics 28 and 29,
both of which involved finding references to AT&T. Since we retain and scan the full data
stream, we didn't have to worry about an indexing parser splitting "AT&T" into "AT" and
"T" and then throwing them both away. Our PSL subquery to find AT&T was
TABLE I - TRW TREC Results for the Routing Topics

              Relevant Retr. @ 200
Qry  # Rel   Best  Median  Worst   TRW Rel  TRW Submit  Description
01     216     62      30      0        49         200  Pending Antitrust
02     384     72      43      1        72         200  Foreign Acquisitions
03     431    167      84      8       161         200  US-Japan Joint Vent.
04      48     33      18      2        10          49  Debt Rescheduling
05     150    116      38     10        80         141  Japanese Dumping
06     137     78      45     15        44         200  Debt Relief
07     169     87      63      1        63         200  US Budget Deficit
08     159     43      18      3        28          72  Economic Projection
09     638    117      87      8        91         200  Candidate Sightings
10     233    153     110     15       153         188  AIDS Treatments
11     196     89      52      7        35         128  Space Program
12     262    103      54      4        75         200  Water Pollution
13     112    111      46      5       111         113  Mitsubishi Heavy
14     203     85      48      0        46          56  Drug Approval
15     624    114      80     17        89         200  CEOs
16      88     44      24      1         1          13  Mkt Agrochemicals
17     303    154      81      0        33          58  Agrochemical Cntls
18     147     61      31      2        41         147  Japan Stock Trends
19     985    161     102     74        85         140  Global Stock Trends
20     403    178     124      5       161         178  Patent Infringement
21      47     44      35      0        12          29  Superconductors
22     466    162     120     14        32          45  Counternarcotics
23     100     74      41      5        37          54  Legal Problems
24     345    113      59     11        11          21  Medical Technology
25      71     34      14      0        15          36  Chernobyl Effects
26     313    122      49      1        47         122  Multimedia Stds
27     232    109      91     10        80         200  AI in Business
28     332     89      47     14        89         200  ATT in Comp/Comm
29     142     79      13      0        79         200  Foreign Acq ATT Tech
30     269     92      48     17        57          88  OS/2 Problems
31     156     66      31      0        36         200  OS/2 Advantages
32     119     52      15      0         6          12  Outsourcing
33     462    147      71     17        71         200  Doc Mngt Capable
34     303    129     104      6       107         200  ISDN Entities
35     270    139     113      0        98         200  Postscript Alts
36     158    110      50      0        10          11  Optical Disk Tech
37     409    189     158     19       158         200  SAA Components
38     810    169     120     37        98         156  Mini/Main Roles
39     501    184     117     24       142         156  Client-Server Plans
40     800    150     121     16        87         200  IS Impact on Orgs
41     144     34      11      0        31         200  Comp/Comm Upgrade
42     696    131      92     10        92         112  End User Computing
43     125                                              AI Conferences
44     241    105      35      1       105         200  Layoffs at Companies
45     304    103      71      0       103         200  CASE Succeed/Fail
46      51     40      31      9        30         200  Virus Outbreaks
47     237     80      35      2        80         200  Contract > $1 mil
48     189     48      28      2        17          40  Purch Comm Equip
49     139     65      56      4        44         131  Who's in Supercomp
50      26     12       1      0         4           5  Virtual Reality
define ATT "{AT\&{|amp\;}T|AT and T|\
American Telephone {and|\&} Telegraph}" end
The "{|amp\;}" notation means to allow an optional "amp;", as was present in the Ziff
database. We had the high score on Topic 10, "AIDS Treatments". This may be due to our
ability to easily find phrases like "acquired immune deficiency syndrome" or "AIDS
related complex" in close proximity to drug names like "ThA", "5-fluorouracil", or "AZT".
We had the high score on topic 13, finding documents about Mitsubishi Heavy Industries.
Our query, which found 111 of the 112 documents NIST judged relevant, was simply to
find the two-word phrase "Mitsubishi Heavy". Apparently the other TREC participants had
trouble either finding phrases or determining the need to find phrases during query
generation. The following sections discuss in detail topic 47, where we achieved the high
score, and topic 36, where we achieved a low score.
4.1 Example of Good Performance - Topic 47
Topic 47 was to find documents discussing new contracts for computer systems in excess
of $1 million. We found 80 good documents out of 200 submitted, the high score for this
topic. We believe we did well on this topic because we were able to look for various
numeric representations of $1 million in close proximity to keywords for new contracts and
computer systems.
To be relevant, a document must identify the selection of a
source for the development or delivery of information systems
products or services valued at more than $1 million dollars.
The PSL query for this topic used three subqueries: one each for the "selection of a source",
"information systems products or services", and "more than $1 million dollars". The PSL
definitions were:
define award {3 words -> "{sign|award}*" and "contract"} end

define computer
  "{computer|communic|network|phone|telecomm|mainframe|\
   Starlan|PBX|Cyber|IBM 3090|X\-MP|Y\-MP|SCS\-40|information}" end

define million {1 word -> "{million|billion} dollar" or
  "[[0-9]][[0-9]][0-9][ |\.][[0-9]][[0-9]][[0-9]] \
  {million|billion}" or
  "[[0-9]][[0-9]][0-9][[ |\,]][0-9][0-9][0-9][[ |\,]]\
  [0-9][0-9][0-9]"} end
The "award" definition requires the root words "sign" or "award" to be within 3 words of
"contract" in the text. This word count includes stop words, acronyms, or any other
alphanumerics that were in the original text. This definition will find phrases like:
a contract was awarded
AT&T signed a new contract
Bellcore was awarded three new contracts.
The "computer" definition looks for any of the root terms shown. Note that looking for
alphanumerics such as "X-MP" or "IBM 3090", which may include multiple-character white
space, is no problem. The "million" subquery uses proximity and an alphanumeric
sequence pattern and will find items like the following:
a million dollar contract
a $2.3 million system
a $ 12 billion program
a $2000000 machine
a $ 2,000,000 machine
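For readers without PSL, the spirit of the "million" subquery can be approximated with an ordinary regular expression. This is an illustrative stand-in, not the query we ran, and the digit-run alternative is deliberately loose:

```python
import re

# Illustrative regex analogue of the "million" subquery (the real query is
# PSL executed in hardware, not a regex).  Alternatives, in order:
#   - "million dollar" / "billion dollar"
#   - "2.3 million", "12 billion"
#   - "2,000,000", "2000000" (also matches other long digit runs)
MILLION = re.compile(
    r"(?:million|billion)\s+dollar"
    r"|\d{1,3}(?:\.\d{1,3})?\s*(?:million|billion)"
    r"|\d{1,3}(?:[ ,]?\d{3}){2,}",
    re.IGNORECASE)

def mentions_large_sum(text: str) -> bool:
    """True if the text contains a million-dollar-scale amount."""
    return MILLION.search(text) is not None
```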
Note that the phrase "a 2,000,000 dollar award" would not be found by this definition. This
was an oversight. The winning query was then simply
{50 words -> award and computer and million}
This finds documents which contain a 50-word sliding window in which all three
subqueries match. Note how the "award" subquery, which uses a 3-word sliding window,
can be nested inside a query using a 50-word sliding window.
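In software terms, the sliding-window semantics can be emulated like this: a window query succeeds if some span of the given width satisfies all of its subqueries, and because each subquery is itself a function over a span, windows nest just as they do in PSL. The subquery stand-ins below are simplified hypothetical sketches of "award", "computer", and "million", not the full definitions above:

```python
def window_match(tokens, window, predicates):
    """Emulate a PSL-style sliding window: succeed if some `window`-token
    span of `tokens` satisfies every predicate.  Each predicate is itself
    a function over a token span, so windows nest naturally."""
    w = min(window, len(tokens))
    for start in range(len(tokens) - w + 1):
        span = tokens[start:start + w]
        if all(pred(span) for pred in predicates):
            return True
    return False

# Hypothetical, simplified stand-ins for the three subqueries:
def award(span):
    # roots "sign" or "award" within a 3-word window that also holds "contract"
    return window_match(span, 3, [
        lambda s: any(t.startswith(("sign", "award")) for t in s),
        lambda s: any(t.startswith("contract") for t in s),
    ])

def computer(span):
    return any(t in {"computer", "mainframe", "network"} for t in span)

def million(span):
    return "million" in span or "billion" in span
```

The hardware evaluates all windows in a single streaming pass; this software version rescans, which is exactly the cost the FDF avoids.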
4.2 Example of Bad Performance - Topic 36
Topic 36 was to find documents discussing how rewritable optical disks work.
To be relevant, a document must describe how rewritable
optical disk technology works at length and in significant and
comprehensive technical detail.
This topic was particularly challenging because the topic narrative describes attributes the
documents must have rather than specific concepts or keywords. We started by defining a
subquery to find documents mentioning rewritable optical disks.
define optical_disk
  {10 words -> "rewrit" and "optical {disk|drive|technolog}"} end
To find documents that describe the technology "at length", we wrote a subquery to find
places where there were at least 5,000 characters between the definition and the end of
the text:

define LONG {5000 characters -> optical_disk and
  no TEXTEND} end
To find documents that contained "significant and comprehensive technical detail", we
manually extracted a list of keywords (Table II) and required the documents to have
at least 10 of these terms present.
The tightest query (intended for high precision) was
{1 document -> optical_disk and LONG and 30+}
The loosest (intended for high recall) was
{1 document -> optical_disk and 10+}
TABLE II - Subterms used for Topic 36

"amorphous"; "bias"; "binary"; "birefringence"; "capacity";
"chemical"; "crystalline"; "Curie temperature"; "cycles"; "density";
"dye[ |\-]polymer"; "emerging technolog"; "erasable"; "fatigue";
"field"; "{frequency|MHz}"; "gadolinium"; "{giga[ ]byte|GB}";
"High Sierra"; "inductance"; "{ISO|CCITT}"; "{jukebox|autochanger}";
"Kerr effect"; "lanthanide"; "laser"; "magnet"; "magnetization";
"magneto[ |\-]optical"; "media"; "{mega[ ]byte|MB}"; "oersted";
"operation"; "phase[ |\-]change"; "phenomenon"; "polarit";
"polarized"; "principle"; "reflect"; "SCSI"; "{sector|track|cylinder}";
"{silver|gold}"; "spatial"; "speed"; "states"; "substrate";
"surface reflectance"; "terbium"; "thin film"; "transfer";
"transluscent"; "Winchester"

4.3 Failure Analysis - Topic 36
Unfortunately, even our high recall query retrieved only 11 documents in the Volume II
Corpus of which 10 were judged relevant. (The 11th was discussing WORM technology
and only mentioned "rewritable optical drives" in passing.)
Upon examination of the NIST judgements, we made several observations about the
relevant documents. First, we missed the keywords "erasable" as a synonym for
"rewritable" and "video" as a synonym for "optical". Second, the assessor accepted articles
about "optical recorders" and "optical image processing" systems. To pick up these
corrections we would change the optical_disk subquery to read as follows:

define optical_disk {10 words ->
  "{rewrit|erasable}" and
  "{video|optical}" and
  "{disk|drive|technolog|recorder|image processing}"} end
We then threw out the length restriction and reran the query, requiring differing numbers of
the technical terms to be present. The results from these runs are shown in Table III. This
table shows two things. First, for this topic, the number of technical terms is an excellent
"knob" to adjust precision and recall. Second, the assessor was making a loose
interpretation of "comprehensive technical detail". If we'd completely ignored this part of
the query, we would have had 135 good documents out of 262. Turning in only 200, we'd
expect to have around 103 relevant, which would have been near the high score.
Table III - Topic 36 Results as a Function of the
Number of Technical Terms
Num Tech   Rel   Docs
  Terms    Ret    Ret   Prec  Recall
    10      26     31   0.84   0.17
     9      34     42   0.81   0.22
     8      41     59   0.69   0.26
     7      50     77   0.65   0.32
     6      58    100   0.58   0.37
     5      77    139   0.55   0.55
     4      96    176   0.55   0.62
     3     111    210   0.53   0.71
     2     120    222   0.54   0.77
     1     131    240   0.55   0.84
     0     135    262   0.52   0.87
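The precision and recall columns in Table III follow the standard definitions: precision is relevant retrieved over documents retrieved, and recall is relevant retrieved over the total number of relevant documents in the corpus. A minimal helper reproducing the arithmetic (the corpus totals used in the test are assumptions consistent with the table):

```python
def precision_recall(relevant_retrieved: int, retrieved: int, total_relevant: int):
    """precision = relevant retrieved / retrieved,
    recall = relevant retrieved / total relevant in the corpus,
    both rounded to two places as in Table III."""
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / total_relevant if total_relevant else 0.0
    return round(precision, 2), round(recall, 2)
```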
5.0 Future Plans
During 1993 we hope to continue researching and evaluating better methods for query
construction. Our objectives will be:
* Design and test a method of sequencing the execution of FDF queries to
ensure that 200 documents will be retrieved for each topic,
* Develop methods and algorithms to semi-automate manual query
construction,
* Use the extensive relevance judgements from TREC-I to test techniques
to generate FDF queries from statistical analysis of the relevant
documents for each topic, and
* Examine the feasibility of using the FDF's term weighting capability to
allow it to act as a back-end processor for other text retrieval techniques.
6.0 Acknowledgments
The FDF system is the result of extensive development by many people over the last 8
years. My role has been that of a reporter on the basic system's capabilities and the manner
in which they might be applied to a TREC-like problem.