What is Sentence Detection?

This tutorial shows how to segment a text into its constituent sentences using
a LingPipe SentenceModel, and how to evaluate and tune sentence models.

It uses MEDLINE data as its example data.
MEDLINE is a collection of more than 13 million citations to the
biomedical literature, maintained by the United States National
Library of Medicine (NLM) and distributed in XML format.
The MEDLINE Parsing and Indexing Demo
covers how to parse this data from XML into a structured Java object.

The first part of this tutorial shows how to segment a text into its
constituent sentences using a LingPipe SentenceModel.
The second part shows how to use the LingPipe SentenceEvaluator
together with a corpus of correctly annotated data (a gold standard)
to determine the accuracy of a model.
Finally, we discuss the existing sentence models in the API,
and ways to tune them.

Using Sentence Models

The SentenceModel Interface

The LingPipe
com.aliasi.sentences.SentenceModel interface
specifies a means of doing sentence segmentation from arrays of
tokens and whitespaces, namely the boundaryIndices method,
which takes an array of tokens, and an array of whitespaces, and returns
an array of indices of sentence-final tokens.
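
In outline, the interface looks like this (a sketch showing just the segmentation method; see the Javadoc for the full interface):

public interface SentenceModel {
    // returns the indices of sentence-final tokens in the
    // parallel token and whitespace arrays
    int[] boundaryIndices(String[] tokens, String[] whitespaces);
}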

The SentenceBoundaryDemo.java
program shows how to use a sentence model to find sentence boundaries in a text.
It takes an input file of plain text.
It first processes the file into lists of tokens and whitespace,
and then uses the MEDLINE sentence model to find the sentence boundaries.
To run this from the command line, type the following on one line (if using Windows, replace the colon ":" with a semicolon ";"):
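
For example (a sketch; the jar name, relative path, and input file are illustrative and should be adjusted to your installation):

> java -cp "sentenceDemo.jar:../../../lingpipe.jar" SentenceBoundaryDemo data/abstract.txt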

This tutorial also comes with an Ant
build.xml file which defines
targets used to run all of the demo programs.
To run the SentenceBoundaryDemo program
execute the Ant target findbounds:

> ant findbounds

which produces the following output (with the [java]
tags inserted by Ant removed for clarity):

findbounds:
INPUT TEXT:
The induction of immediate-early (IE) response genes, such as egr-1,
c-fos, and c-jun, occurs rapidly after the activation of T
lymphocytes. The process of activation involves calcium mobilization,
activation of protein kinase C (PKC), and phosphorylation of tyrosine
kinases. p21(ras), a guanine nucleotide binding factor, mediates
T-cell signal transduction through PKC-dependent and PKC-independent
pathways. The involvement of p21(ras) in the regulation of
calcium-dependent signals has been suggested through analysis of its
role in the activation of NF-AT. We have investigated the inductions
of the IE genes in response to calcium signals in Jurkat cells (in
the presence of activated p21(ras)) and their correlated
consequences.
150 TOKENS
151 WHITESPACES
5 SENTENCE END TOKEN OFFSETS
SENTENCE 1:
The induction of immediate-early (IE) response genes, such as egr-1,
c-fos, and c-jun, occurs rapidly after the activation of T
lymphocytes.
SENTENCE 2:
The process of activation involves calcium mobilization,
activation of protein kinase C (PKC), and phosphorylation of tyrosine
kinases.
SENTENCE 3:
p21(ras), a guanine nucleotide binding factor, mediates
T-cell signal transduction through PKC-dependent and PKC-independent
pathways.
SENTENCE 4:
The involvement of p21(ras) in the regulation of
calcium-dependent signals has been suggested through analysis of its
role in the activation of NF-AT.
SENTENCE 5:
We have investigated the inductions
of the IE genes in response to calcium signals in Jurkat cells (in
the presence of activated p21(ras)) and their correlated
consequences.

The tokenList and whiteList objects produced
by the tokenizer are parallel lists. The whitespace at index
[i] is that which precedes the token at index [i].
The tokenizer returns elements for the whitespace preceding the first token and
the whitespace following the last token. Therefore in the above example
the whitespace list contains 151 elements, while the token list contains 150 elements.

We convert the ArrayList objects into their corresponding String
arrays, and then invoke the boundaryIndices method:
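
A minimal sketch of this step, assuming tokenList and whiteList are the token and whitespace lists produced above and SENTENCE_MODEL is the MEDLINE sentence model:

String[] tokens = tokenList.toArray(new String[0]);
String[] whites = whiteList.toArray(new String[0]);
int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens, whites);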

The boundaryIndices method returns an array whose values are the indices of the
elements in the tokens array which are sentence final tokens.
To extract the sentences we iterate through the sentence boundaries array,
keeping track of the indices of the sentence start and end tokens, and printing
out the correct elements from the tokens and whitespaces arrays.
Here is the code to print out the sentences found in the abstract, one per line:
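
A sketch of that loop, continuing from the arrays computed above:

int sentStartTok = 0;
for (int i = 0; i < sentenceBoundaries.length; ++i) {
    int sentEndTok = sentenceBoundaries[i];
    System.out.println("SENTENCE " + (i + 1) + ":");
    for (int j = sentStartTok; j <= sentEndTok; j++) {
        // whites[j + 1] is the whitespace following tokens[j]
        System.out.print(tokens[j] + whites[j + 1]);
    }
    System.out.println();
    sentStartTok = sentEndTok + 1;
}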

The above code block prints every token in the tokens array,
and the whitespace following that token.
Because line breaks count as whitespace, the individual sentences show the same
pattern of spacing and linebreaks as in the input text.

Chunkings and Chunkers

In this section we show how to simplify the task of dealing with
sentences and sentence boundaries, by rewriting the
SentenceBoundaryDemo to use a
com.aliasi.sentences.SentenceChunker.

The rewritten program is SentenceChunkerDemo.java.
To run this program execute the Ant target findchunks as before,
which produces:

> ant findchunks
findchunks:
INPUT TEXT:
The induction of immediate-early (IE) response genes, such as egr-1,
c-fos, and c-jun, occurs rapidly after the activation of T
lymphocytes. The process of activation involves calcium mobilization,
activation of protein kinase C (PKC), and phosphorylation of tyrosine
kinases. p21(ras), a guanine nucleotide binding factor, mediates
T-cell signal transduction through PKC-dependent and PKC-independent
pathways. The involvement of p21(ras) in the regulation of
calcium-dependent signals has been suggested through analysis of its
role in the activation of NF-AT. We have investigated the inductions
of the IE genes in response to calcium signals in Jurkat cells (in
the presence of activated p21(ras)) and their correlated
consequences.
SENTENCE 1:
The induction of immediate-early (IE) response genes, such as egr-1,
c-fos, and c-jun, occurs rapidly after the activation of T
lymphocytes.
SENTENCE 2:
The process of activation involves calcium mobilization,
activation of protein kinase C (PKC), and phosphorylation of tyrosine
kinases.
SENTENCE 3:
p21(ras), a guanine nucleotide binding factor, mediates
T-cell signal transduction through PKC-dependent and PKC-independent
pathways.
SENTENCE 4:
The involvement of p21(ras) in the regulation of
calcium-dependent signals has been suggested through analysis of its
role in the activation of NF-AT.
SENTENCE 5:
We have investigated the inductions
of the IE genes in response to calcium signals in Jurkat cells (in
the presence of activated p21(ras)) and their correlated
consequences.

The above output is almost identical to that of SentenceBoundaryDemo except that
there is no tokenization information.
This is because the SentenceChunker handles tokenization.

A SentenceChunker is constructed from a
TokenizerFactory and a SentenceModel:
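
For example, to build a chunker from the Indo-European tokenizer factory and the MEDLINE sentence model (a sketch; depending on your LingPipe version the factory may be the INSTANCE singleton or a zero-argument constructor):

TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel = new MedlineSentenceModel();
SentenceChunker sentenceChunker
    = new SentenceChunker(tokenizerFactory, sentenceModel);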

The SentenceChunker method chunk produces a
com.aliasi.chunk.Chunking over the text.
A Chunking is a set of
com.aliasi.chunk.Chunk objects
over a shared CharSequence.
The chunkSet method returns the set of (sentence) chunks,
and the charSequence method returns the underlying
character sequence.
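
Putting these together, the heart of the demo reduces to a few lines (a sketch, assuming the sentenceChunker built above and the abstract's text in a String named text):

Chunking chunking = sentenceChunker.chunk(text);
String seq = chunking.charSequence().toString();
int n = 1;
for (Chunk sentence : chunking.chunkSet()) {
    System.out.println("SENTENCE " + (n++) + ":");
    System.out.println(seq.substring(sentence.start(), sentence.end()));
}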

Evaluating Sentence Models

In this section we show how to evaluate a sentence model.

To evaluate a sentence model, we need a reference corpus of text which
has a set of sentence boundary markers. For MEDLINE data, we can use
the GENIA
XML corpus as the gold standard. The GENIA XML corpus is a set of
2000 MEDLINE abstracts which have been annotated for sentence
boundaries and biomedical terms ("cons" elements). Here is
a sample abstract from this corpus (prettified with whitespace):
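
The excerpt below is schematic and abridged (the element layout follows the GENIA GPML markup; the citation identifier and sentence text are illustrative):

<article>
  <articleinfo>
    <bibliomisc>MEDLINE:95280913</bibliomisc>
  </articleinfo>
  <title>
    <sentence>Activation of the interleukin-2 gene ...</sentence>
  </title>
  <abstract>
    <sentence>We examined the role of
      <cons sem="G#protein_molecule">NF-kappa B</cons> in
      <cons sem="G#cell_type">T lymphocytes</cons>.</sentence>
    <sentence>...</sentence>
  </abstract>
</article>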

We use this as the reference chunking against which to evaluate the performance
of a sentence model by using a SentenceChunker, as in the
SentenceChunkerDemo, above:
first we create a SentenceChunker for the sentence model
we wish to evaluate, then invoke its chunk method on the
text of the abstract (the charSequence of the reference chunking).
This gives us a response chunking.

To evaluate the response chunking against the reference chunking
we compare the members of the respective chunkSet objects,
that is, we compare the set of sentences that we know to be in the abstract
with the set of sentences found by the sentence model, using a 4-way classification:

True Positives (TP): sentences in the reference chunking and in the response chunking.

False Positives (FP): sentences in the response chunking which are not in the reference chunking.

False Negatives (FN): sentences in the reference chunking which are not in the response chunking.

True Negatives (TN): this number is always zero. It is the number of items which are neither
in the reference chunking nor in the response chunking.
Since we only collect the sentences from the GENIA corpus and the response chunking,
we have no true negatives.

A SentenceEvaluator handles reference chunkings by
constructing a response chunking for each and adding the pair to a sentence
evaluation. The resulting evaluation may be retrieved at any time through the
evaluation() method.

This evaluator class implements the ObjectHandler<Chunking>
interface.
The chunkings passed to the handle(Chunking)
method are treated as reference chunkings.
Their character sequence is extracted using Chunking#charSequence()
and the contained sentence chunker is used to produce a
response chunking over the character sequence.
The resulting pair of chunkings is passed to the contained sentence evaluation.

Running the evaluation is straightforward:
we create a GeniaSentenceParser instance
and a SentenceEvaluator instance, and
then set the SentenceEvaluator as the default handler
for the GeniaSentenceParser.
The GeniaSentenceParser parses each abstract into a reference chunking,
and then invokes the handle(Chunking) method of
the SentenceEvaluator.
The SentenceEvaluator creates the response chunking from the
reference chunking, and adds the pair of reference, response chunkings to the
evaluation, so that parsing and evaluation are carried out in tandem.
The SentenceEvaluator object contains a
com.aliasi.sentences.SentenceEvaluation object,
which contains all of the evaluation cases (the pairs of reference and response chunkings)
and the evaluation metrics, which are updated as each new case is added to the evaluation.

The SentenceEvaluation contains a
com.aliasi.chunk.ChunkingEvaluation object,
which evaluates the sentences qua chunkings.
The SentenceEvaluation also evaluates the sentence model solely
in terms of the sentence end boundaries.
As we saw in the first part of this tutorial,
the SentenceModel doesn't identify sentence initial tokens,
only the sentence-final tokens.
Implicit in this model is the assumption that all tokens belong to a sentence,
therefore once we have found the end token in a sentence, we know that the start
token of the next sentence must be the following token.
Evaluations which score chunking errors and evaluations which score sentence end boundary errors
yield different counts of the errors made by the sentence model.
Consider the case where the sentence model fails to identify a sentence boundary in a sentence:
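
For concreteness, take an illustrative two-sentence text, chosen here to match the offsets discussed below:

She ran home. He walked in.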

The reference chunking will contain two Chunk objects,
with start and end values of (0,13) and (14,27) respectively. The
response chunking will contain one Chunk object, with
start and end values (0,27). The ChunkingEvaluation will
add the two reference chunking chunks to the set of false negatives,
and one response chunking chunk to the set of false positives. The
SentenceEvaluation will compare sets of end boundaries.
The reference chunking end boundaries set contains the values 13 and
27, while the response chunking contains only 27, therefore the
SentenceEvaluation counts the missed sentence boundary at
position 13 as a single false negative. This approach to counting
errors has two advantages: the statistics returned by counting only
end boundary errors are better, since the overall number of false
positives and false negatives is lower; and the sets of false
positives and negatives contain only examples where the sentence-final
boundary was incorrect. This latter point is relevant for the
developer who is building or tuning the sentence model and will be
covered in detail in the third part of this tutorial.

The SentenceModelEvaluator.java
program shows how to construct and run an evaluator, and report the
results of the evaluation. This program runs the
GeniaSentenceParser over the GENIA XML corpus, and prints
out the result of the evaluation.

Accuracy is just (TP+TN)/(TP+FP+FN+TN).
Because there are no TNs, accuracy reduces to the
Jaccard measure TP/(TP+FP+FN).
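
For example, a model that produces 90 true positives, 5 false positives, and 5 false negatives has an accuracy of 90/(90+5+5) = 0.90.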

Note: The Ant target evaluate assumes that the GENIA corpus
has been downloaded per the instructions above, and that the files
"GENIAcorpus3.02.xml" and "gpml.dtd" are in the
lingpipe/demos/data directory.
If either of these files is missing, the task will fail with a java.io.FileNotFoundException.

The SentenceModelEvaluator program is straightforward.
First we create a SentenceChunker (as we did in the
SentenceChunkerDemo.java program in section 1.2, above),
and pass it in to the SentenceEvaluator constructor:
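
A sketch of this setup (GeniaSentenceParser is the demo's own parser class; the chunker construction repeats the code shown earlier):

SentenceChunker sentenceChunker
    = new SentenceChunker(tokenizerFactory, sentenceModel);
SentenceEvaluator evaluator = new SentenceEvaluator(sentenceChunker);
GeniaSentenceParser parser = new GeniaSentenceParser();
parser.setHandler(evaluator);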

The name of the GENIA XML corpus file is passed to
the program as a command-line argument.
As the parser parses the corpus, the SentenceEvaluator
adds pairs of reference and response chunkings to the evaluation,
so the only call needed to carry out the evaluation
is the call to the parser's parse method:

File inFile = new File(args[0]);
parser.parse(inFile);

Once the file has been parsed, we obtain the results of the evaluation
from the
com.aliasi.sentences.SentenceEvaluation
object that the SentenceEvaluator contains.
Both the chunking evaluation and the sentence end boundary evaluation
use a
com.aliasi.classify.PrecisionRecallEvaluation object
to tally their results.
This class
contains a suite of descriptive statistics for binary classification
tasks.
The toString method returns a formatted representation
of these statistics.

The errors made by the sentence model are written to two files:
EvaluatorFalseNegatives.txt
and
EvaluatorFalsePositives.txt.

EvaluatorFalseNegatives.txt contains
a listing of sentences in the reference set (GENIA corpus) which
are not in the response set (the sentence chunking returned
by the MEDLINE sentence model), i.e. these are the sentences where
the sentence model missed an end boundary.
Here is an excerpt from this output file:

EvaluatorFalsePositives.txt contains sentences in the
response chunking which are not in the reference chunking, i.e.
these are chunks where the sentence model incorrectly identified
a token as a sentence-final token.
Here is an excerpt from this output file:

This output is generated by iterating over the sets of false negatives and false positives
returned by the SentenceEvaluation object.
The members of these sets are
com.aliasi.chunk.ChunkAndCharSeq objects.
A ChunkAndCharSeq object is a composite, containing
a Chunk and the character sequence that contains it.
This allows us to examine the start and end points of the sentence in context,
using the spanStartContext and spanEndContext methods:
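
A sketch of that iteration, assuming the evaluator from the setup above (the accessor names chunkingEvaluation() and falseNegativeSet() are assumptions about the evaluation API; the 30-character context width is arbitrary):

ChunkingEvaluation chunkEval
    = evaluator.evaluation().chunkingEvaluation();  // assumed accessor
for (ChunkAndCharSeq cacs : chunkEval.falseNegativeSet()) {
    // show the start and end of each missed sentence in context
    System.out.println(cacs.spanStartContext(30)
                       + " ... "
                       + cacs.spanEndContext(30));
}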

These files are mainly of interest to the model developer who wishes to
identify the kinds of errors made by the sentence model.

Developing and Tuning Sentence Models

In this section we show how to develop and tune a sentence model,
again using the GENIA corpus as a gold standard. The source code for
this demo contains a class DemoSentenceModel.java.
The reader is encouraged to try successive modifications to the
DemoSentenceModel program, and to use the
SentenceModelEvaluator.java program to assess the impact
of these changes on the model's performance.

Like the MedlineSentenceModel, the DemoSentenceModel extends the
com.aliasi.sentences.HeuristicSentenceModel class.
A HeuristicSentenceModel determines sentence
boundaries based on sets of tokens, a pair of flags, and an
overridable method describing boundary conditions, the
boundaryIndices method. The gist of the
HeuristicSentenceModel.boundaryIndices algorithm is
that sentence boundaries are identified by looking at a token together
with the tokens which precede and follow it. If a token is a
sentence-final token, then the sentence boundary is the index of the
character one past the last character in that token. In order for a
token to be a sentence-final token, it must be a member of the set of
sentence-final punctuation tokens, such as periods (.)
and question marks (?). Furthermore, it must be followed
by whitespace, and the following token (if any) must be a legal start
token for a sentence. Sentences containing abbreviations such as
"Mr. Smith" are problematic because a simplistic sentence
model will treat the period following "Mr." as a
sentence-final token. Therefore it is necessary to check the
penultimate token in the sentence, and disallow common abbreviations.

The heuristic sentence model uses three sets of tokens:

Possible Stops: These are tokens that are allowed
to be the final token in a sentence.

Impossible Penultimates: These are tokens that may
not be the penultimate (second-to-last) token in a sentence.
This set is typically made up of abbreviations or acronyms such as
"Mr".

Impossible Starts: These are tokens that may not
be the first token in a sentence. This set typically includes
punctuation characters that should be attached to the previous
sentence such as end quotes ('').

A further condition is imposed on sentence-initial tokens by the method
possibleStart(String[],String[],int,int). This method
checks a given token in a sequence of tokens and whitespaces to
determine if it is a possible sentence start.

There are also two flags in the constructor that determine aspects of sentence boundary detection:

Force Final Boundary: If this flag is set to
true, the final token in any input is taken to be a
sentence terminator, whether or not it is a possible stop token. This
is useful for dealing with truncated inputs, such as those in
MEDLINE abstracts.

Balance Parentheses: If parentheses are being balanced,
then as long as there are open parentheses that have not been
closed, the current sentence may not end.
Square brackets
("[", "]") and round brackets ("(", ")")
are balanced separately, so that a close square bracket doesn't close an open paren,
and vice versa.
The heuristic sentence model doesn't keep track of nested parentheses: the first
close paren following any number of open parens closes all of them, and
any extra close parentheses (")") and brackets
("]") are ignored.
This approach avoids the pitfall of missing every sentence boundary following a
missing close paren, as would otherwise happen when a single close paren closes multiple open parens.

The initial version of the DemoSentenceModel
defines minimal sets of possible stops, impossible penultimates,
and impossible starts, and
doesn't override any methods in HeuristicSentenceModel.
Here is its constructor:
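
A sketch of such a constructor (the particular tokens in the three sets are illustrative of minimal sets, not a copy of the demo source; the five-argument superclass constructor also takes the two flags described above):

import java.util.*;

static final Set<String> POSSIBLE_STOPS
    = new HashSet<String>(Arrays.asList(".", "?", "!"));
static final Set<String> IMPOSSIBLE_PENULTIMATES
    = new HashSet<String>(Arrays.asList("Mr", "Dr", "vs"));
static final Set<String> IMPOSSIBLE_STARTS
    = new HashSet<String>(Arrays.asList(")", "]", ",", "."));

public DemoSentenceModel() {
    super(POSSIBLE_STOPS, IMPOSSIBLE_PENULTIMATES, IMPOSSIBLE_STARTS,
          false,    // forceFinalStop
          false);   // balanceParens
}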

This model performs quite well, with overall accuracy and F-measures above 99%.
The number of false positives and false negatives is markedly higher than the
corresponding numbers for the MEDLINE sentence model, therefore we examine
the EvaluatorFalseNegatives.txt
and EvaluatorFalsePositives.txt output files.
Here are the first 20 false negatives (sentence boundaries that the
DemoSentenceModel failed to identify):

Roughly half of the above entries arise because MEDLINE abstracts
are sometimes truncated, and these truncated abstracts don't end with
proper punctuation. In the GENIA corpus, these are labeled as
sentences. To handle this, we change the constructor, setting the
forceFinalStop argument in the superclass's constructor
to true:
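
Only the first flag changes (same illustrative sets as before):

public DemoSentenceModel() {
    super(POSSIBLE_STOPS, IMPOSSIBLE_PENULTIMATES, IMPOSSIBLE_STARTS,
          true,     // forceFinalStop: treat the final token as a stop
          false);   // balanceParens
}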

This change cuts the number of false negatives from 140 to 75.
The number of false positives remains unchanged.
Now we look at the entries in the file
EvaluatorFalsePositives.txt.
These are places where the DemoSentenceModel
mistakenly identified punctuation as a sentence boundary.
Here are the first 20 entries:

Entry #15 is typical of many of these errors. MEDLINE abstracts frequently contain
citations to other journal articles.
These citations contain many abbreviations,
both of names and journal titles, and the periods are mistakenly identified as
end of sentence markers.
Since these citations are almost always offset by parentheses or brackets,
using the parenthesis balancing feature of the HeuristicSentenceModel
will eliminate this error.
Therefore we change the DemoSentenceModel
constructor again, this time to:
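
Now both flags are set (again with the same illustrative sets):

public DemoSentenceModel() {
    super(POSSIBLE_STOPS, IMPOSSIBLE_PENULTIMATES, IMPOSSIBLE_STARTS,
          true,    // forceFinalStop
          true);   // balanceParens: no sentence ends inside parens
}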

Entry #3 shows a remaining problem for this model: there are
biological names which are never capitalized, such as "p65",
"mRNA", "alpha-IFN", or "beta-Catenin",
so determining a possible sentence start cannot be done on the
basis of initial capitalization alone. Examination of these names shows
that most of them contain digits or uppercase letters. Many names
contain hyphens, such as "alpha-IFN" and "c-FOS".
These names are problematic because the Indo-European tokenizer will
break them into a sequence of three tokens: "c",
"-", "FOS"; it is therefore necessary to look
through the next several tokens following the possible sentence
boundary token to determine whether or not what follows is a good
sentence start.

The MEDLINE sentence model class overrides the method
possibleStart to allow for names like these. The
MedlineSentenceModel.possibleStart
method allows any sequence of contiguous tokens
containing a non-lowercase character to be a good sentence start.
The arguments to this method are the arrays of tokens and whitespace
that the tokenizer produces from the text of the abstract, along with
indices into these arrays that give the region of the tokenization that
needs to be checked for a possible start.
Here is a (slightly simplified) version of this method:
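
Our paraphrase of that method (a sketch, not the exact library source): scan the contiguous run of tokens beginning at the candidate start, where the run ends at the first token followed by non-empty whitespace, and accept the run as a sentence start if any of its characters is not a lowercase letter:

@Override
protected boolean possibleStart(String[] tokens, String[] whitespaces,
                                int start, int end) {
    for (int i = start; i < end; ++i) {
        String token = tokens[i];
        for (int j = 0; j < token.length(); ++j)
            if (!Character.isLowerCase(token.charAt(j)))
                return true;  // digit, uppercase, or punctuation: good start
        if (whitespaces[i + 1].length() > 0)
            return false;     // run ended with only lowercase characters
    }
    return false;
}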

Once again we have reduced the number of false negatives by half.
This performance is almost as good as that of the LingPipe MEDLINE sentence model
(reported in section 2 of this tutorial).
The interested reader is encouraged to examine the code of the MedlineSentenceModel class
to see further possible refinements.

At this point we have achieved very high accuracy against the GENIA corpus.
It is not clear how much further tuning of the model will be useful for the general
task of processing the MEDLINE citation index.
The GENIA corpus contains only 2000 MEDLINE abstracts, while the number of abstracts
in the MEDLINE citation index stands at around 10 million. Continuing to tune and
evaluate the DemoSentenceModel model against the GENIA corpus runs the
risk of overfitting the model to the data, and might actually detract from overall
accuracy when processing new data.
Therefore we conclude this tutorial here.