Information extraction is the process of extracting relations out of natural language text. For example, from the sentence, "The Palo Alto headquartered Redwood Software, Inc. has sold off its internet branch BuyStuff.com to industry giant Huge Software Company for 30 million dollars.", the relation (BuyStuff.com, Huge Software Company, Redwood Software, 30 million) could be extracted into a database with fields (Acquired Company, Buying Company, Selling Company, Dollar Amount).

A large majority of the information on the internet and even on computers in general, whether itís in the form of web pages, emails, newsgroups, papers, or company documents, is just plain natural language text. The problem of extracting more structured information out of this data, therefore, is an important one. Creating information extraction tools would be very useful.

The goal of this project was to create the best information extraction tool I could. As a starting point, I used the three papers that were handed out in class. The most appealing approach to me was to build a maximum entropy model. These models allow the use of arbitrary features. For example, part of speech information can be combined with extraction patterns. For the sake of simplicity and for ease of implementation, I decided to build a Hidden Markov Model instead. Compared to Maximum Entropy Models, Hidden Markov Models are not quite as versatile but easier to implement.

The basis of the Hidden Markov Model I created was Dayne Freitag and Andrew McCallumís paper, "Information Extraction with HMM Structures Learned by Stochastic Optimization". I originally intended to automatically learn HMM structures the techniques they describe in the paper, but I found that just getting a hand-created structure to run was a lot of work for one person.

Hidden Markov Model Overview

A Hidden Markov Model is a finite-state automaton that has probabilities associated with each state transition and with what symbol is emitted at each state. A Hidden Markov Model, then, consists of three parts:

-a set S of n states

-an n by n matrix of transition probabilities A from each state to each state

-an n by v matrix B from of emission probabilities that express the probability of emitting each word in the vocabulary (of v words) from each state

Generally, the emission probabilities can depend on both the from state and the to state of a transition. I decided to have my emission probabilities depend only on the from state because it made intuitive sense and made learning parameters easier.

Specific to the information extraction problem, I used two ideas from the Freitag and McCallum paper. First, for each field that I am trying to extract from a document, I built a separate HMM. This is done so that each field is treated separately and can have parameters learned specific to that field. I think this would perform better than having only one HMM that had to try to encode the structure of all of the fields. Second, there are two kinds of states: target states and background states. Only target states emit the "target" information, or the field we are trying to extract.

The way a HMM can be used to solve the information extraction problem is as follows. The input document is a sequence of words. It is then assumed that some HMM was used to generate this sequence. The Viterbi algorithm, explained later, is used to find the most likely sequence of states that the HMM went through to output that document. Then, the words determined to be most likely emitted by the target states are extracted as the field we are looking for.

This describes how a HMM is used to do extraction. Before we can use a HMM, we have to build it. For this project, I started with this hand-built structure from the Freitag and McCallum paper.

The circles represent background states and the octagons represent target states. The target states are connected to themselves and each of the other target states. Only one background state goes to any target and it connects to all of them. The targets only connect to one background state and all of them connect to it. Intuitively, this structure seems like it would be good for extraction. The top node that connects to itself is a garbage node that the HMM stays at for most of the document. It transitions to itself with high probability. The sequence of four states before the target represents a prefix before the target. The four prefix nodes have high emission probabilities for words that are likely to precede the target. The four words after the target work similarly for common words that commonly follow the target.

To build a working model, I start this structure with reasonable transition and emission probabilities. Then I use the Baum-Welch parameter estimation algorithm and a large set of tagged training data to find better probabilities. This algorithm uses the current parameters and an output sequence to estimate new parameters. This algorithm converges to a local maximum in the space of parameters so I was careful to select reasonable parameters to start with, rather than random ones.

Algorithms

The two main algorithms are the Viterbi algorithm for finding the most likely state sequence and the Baum-Welch algorithm for parameter estimation. I implemented these algorithms exactly they are stated in Manning and Schutze p. 331-335 with the modification for target states described in Freitag and McCallum. In parameter estimation, we make the restriction that only target states can emit target words and only background states can emit non-target words. To do this, at all steps we let the probability of being in state s at time t be zero if the type of s (target or background) does not match the type of the output at the t position of the training sequence. For the Viterbi algorithm, there is no modification.

Implementation

I implemented the HMM using the Java programming language. To run my extraction tool, use

java HMM <training file> <testing file> <field name>

where training file and testing file contained tagged data (one of whose tagged fields is field name). The data files contain documents separated by the token "ENDOFDOC". The HMM will train itself using the tags in the training file and will then attempt to extract the given field from the documents in the testing file, and evaluate its performance against the tags in the testing file.

One decision I had to make was regarding how to do smoothing for the emission probabilities. If we let the probability of emission of unseen words be zero, then the HMM will not work as intended because all state sequences will have a probability of zero for that input sequence. When reading the training document, I kept a count for each word in the vocabulary. I started the emission probabilities for each word at each state as the count over the sample size with add-one smoothing. So unseen words are given a small probability. After training the emission probabilities change depending on the state, but I let the probability of unseen words remain the same at its small value. This avoids the problem of zero probabilities.

The most significant problem I ran into was floating point underflow. Because the HMM algorithms multiply large amounts of probabilities the numbers being manipulated are extremely small. I tried solving this problem using scaling factors. I would multiply the probabilities at each time by a constant depending exponentially on t. When computing probabilities across times the scaling factors cancel. I was able to get this to work, but not satisfactorily. I found that the scaling factor had to be carefully chosen in a range that depended heavily on the size of the document. If the chosen factor was too large, there would be floating point overflow. If it was too small, there would be underflow. Because the factor was dependent on the size of the document and the documents varied widely in size, I couldnít just find one scaling factor that worked. So I decided to work strictly with the logs of probabilities. It is easy to do this when multiplying probabilities; one must simply add the logs. Taking sums of probabilities is more difficult. To do this, I used the log_add algorithm given in Manning and Schutze p. 337. This worked very well, although it did slow the parameter estimation algorithm significantly.

I also ran into a problem with the structure of the HMM I was using. Extracting the acquired field from the sentence, "Company A acquired <acquired>B</acquired> and <acquired>C</acquired>.", my model would give a probability of zero. This was because the minimum prefix and suffix before a target in my model was length 9 words. To fix this, I added low probability transition to and from transitions between the garbage state and one of the targets. This fixed the problem.

Testing and Performance

I used the corporate acquisitions data set that consists of Reuters articles hand tagged by Dayne Freitag. I chose this data set because it was available and already tagged. This is a set of about 400 documents that each describe a corporate acquisition. They are tagged with a set of pertinent fields like seller, acquired, purchaser, dlramt (price of purchase), status (status of acquisition), acqloc (location of acquired company), and acqbus (business that the acquired company is in).

I divided the data into two sets. I used a random selection of about two thirds of the documents for training and the one remaining one third for testing. I tested the performance in two ways. I counted the number of documents for which the HMM perfectly extracted the exact set of correct words. For each document, I also calculated a "score" which for each document was equal to (true positives / (size of correct set + false positives)). In the case that the HMM correctly extracted an empty set, the score is 1. This score does a pretty good job of evaluating performance because the higher the percentage of extracted words correct, the closer the score to 1. Also, guessing at more words doesnít help because it penalizes for false positives.

Here are the results:

Field

% Perfect

% Score

acquired

16.4

28.1

seller

25.4

53.7

purchaser

23.2

36.6

status

39

44.2

dlramt

61.6

62.6

acqbus

55.4

58.6

acqloc

69.5

71

Average

41.5

50.7

The results for the fields that did well are a little misleading. All of the documents do not contain all of the fields. For the fields that did the best, a lot of the documents donít contain that field. So all the HMM had to do to get them right was correctly identify them as empty. At the same time, these documents were tagged with even more tags than is shown here. In addition to and acquired and seller field, for instance, there were acqcode, acqabr sellercode, sellerabr fields that contained codes and abbreviations for these companies. This made it more difficult for the HMM because if it identifies what was really the acqabr as the acquirer, this is counted as wrong.

These results might seem somewhat discouraging. On average, only about 40% of fields are extracted correctly. But information extraction is a hard problem. For each field, out of a document of about 80-250 words, around only three words must be identified. In the Freitag McCallum paper, using the same data, they report that their "grown" HMM (which has a learned structure) gets only 41.3% on the acquired field and only 54.4% on the dlramt field, with average performance of only 57.2% over a variety of tasks.