Description

The models trained with the UIMA component give wrong begin/end offsets, although they do manage to split the text into sentences.
I observed that a sentence's begin offset includes, as its first token, the end-of-sentence punctuation character of the
previous sentence, while the previous sentence does not include that character as its last token.
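For concreteness, the reported behavior can be illustrated with plain Java string offsets (the text and span values below are an invented example to show the symptom, not actual OpenNLP output):

```java
// Illustration of the reported offset bug (invented example, not OpenNLP
// code): the second sentence's begin offset is one character too early,
// so it picks up the previous sentence's '.' as its first token.
public class OffsetBugDemo {
    public static void main(String[] args) {
        String text = "First sentence. Second sentence.";

        // Correct spans: [0,15) and [16,32)
        String correctSecond = text.substring(16, 32);

        // Buggy spans as described: first sentence ends at 14 (excluding
        // its '.'), second sentence begins at 14 (including that '.').
        String buggySecond = text.substring(14, 32);

        System.out.println("correct: '" + correctSecond + "'"); // Second sentence.
        System.out.println("buggy:   '" + buggySecond + "'");   // . Second sentence.
    }
}
```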

Joern Kottmann
added a comment - 22/Jun/11 10:19 Can you confirm that this issue only occurs when you use a model trained by the UIMA Sentence Detector Trainer?
So you do the following:
1. Train a model with the UIMA Sentence Detector Trainer
2. Load the model from 1. and run it over text
Then you observe the wrong offsets, right?
But when you use a pre-built sentence model the offsets are correct, right?

Nicolas Hernandez
added a comment - 22/Jun/11 10:45 Yes I confirm.
Following the steps you described I observed the wrong offsets.
The offsets were correct when I used the pre-built sentence model (like the English one coming from [1]) or when I built a model using the command-line instruction.
For all of these building configurations I used the UIMA OpenNLP SentenceDetector.
[1] http://opennlp.sourceforge.net/models-1.5/

Joern Kottmann
added a comment - 22/Jun/11 12:13 This issue is linked to the use of the "useTokenEnd" option; when it is false, the code which computes the span makes the off-by-one error described above.
For now I suggest the UIMA Sentence Detector Trainer use the same default as the command-line version. Besides that, we should fix the issue in the Sentence Detector ME code.
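A rough sketch of where such an off-by-one can creep in (illustrative logic only, not the actual OpenNLP source): with useTokenEnd the span end lies after the punctuation character; without it, the end stops before the punctuation, and if the next sentence's begin is derived from that same position, the previous '.' becomes its first token.

```java
// Sketch (assumed logic, not the OpenNLP implementation) of how an
// off-by-one appears when useTokenEnd is false.
public class UseTokenEndSketch {
    // eosPos is the index of the '.' ending the current sentence.
    static int sentenceEnd(int eosPos, boolean useTokenEnd) {
        // With useTokenEnd the span ends after the punctuation character.
        return useTokenEnd ? eosPos + 1 : eosPos;
    }

    static int nextSentenceBegin(int eosPos, String text) {
        // The next sentence must always start AFTER the punctuation,
        // skipping whitespace, regardless of useTokenEnd. Starting it at
        // sentenceEnd(eosPos, false) instead reproduces the reported bug.
        int begin = eosPos + 1;
        while (begin < text.length() && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        return begin;
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence.";
        int eosPos = text.indexOf('.');                      // 14
        System.out.println(sentenceEnd(eosPos, true));       // 15
        System.out.println(sentenceEnd(eosPos, false));      // 14
        System.out.println(nextSentenceBegin(eosPos, text)); // 16
    }
}
```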

Joern Kottmann
added a comment - 22/Jun/11 12:39 Check out and build the current head trunk. The parameter is not configurable. I will hopefully soon have time to work on a little sentence detector refactoring.

Nicolas Hernandez
added a comment - 24/Jun/11 09:15 All right.
I confirm it works now.
I tested by using a sample of the europarl-v6 corpus [1].
cat europarl-v6.fr-en.fr | perl -ne "if (/[\.\?\!\:\;\'\"»…]$/g) { print; }" | head -n 1000 > europarl-v6.fr-en.fr.1KSent
I used the Apache whitespace tokenizer then the OpenNLP UIMA SentenceDetectorTrainer to build a model.
And I tested the model with the OpenNLP UIMA SentenceDetector.
[1] http://www.statmt.org/europarl/
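For reference, the perl filter above corresponds to roughly the following Java (a sketch; file handling is omitted and the set of end-of-sentence characters is taken from the regex):

```java
import java.util.ArrayList;
import java.util.List;

// Java sketch of the perl/head pipeline above: keep only lines that end
// with an assumed end-of-sentence character, up to a maximum line count.
public class EosLineFilter {
    // Candidate end-of-sentence characters, mirroring the perl regex.
    static final String EOS_CHARS = ".?!:;'\"»…";

    static List<String> filter(List<String> lines, int max) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (kept.size() >= max) break;
            if (!line.isEmpty()
                    && EOS_CHARS.indexOf(line.charAt(line.length() - 1)) >= 0) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Une phrase complète.",  // kept: ends with '.'
                "Une ligne tronquée",    // dropped: no EOS character
                "Une question ?"         // kept: ends with '?'
        );
        System.out.println(filter(lines, 1000));
    }
}
```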

Joern Kottmann
added a comment - 24/Jun/11 09:28 You might encounter one more issue. The sentence detector labels each potential end-of-sentence character as either a sentence-end or a non-sentence-end. Such samples are generated for training from your input file. In the input file each sentence is written on its own line, and the sample generation code assumes that the last end-of-sentence character in the line is the true sentence end.
In your europarl file there are lines which do not end with an end-of-sentence character but might contain tokens with end-of-sentence characters.
For example:
Dr. Smith said: <- In this sample the dot in "Dr." would be mistaken for a sentence end.
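A simplified sketch of that sample-generation assumption (the candidate character set and label names here are illustrative, not OpenNLP's actual feature generation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (simplified, assumed) of deriving training labels from a
// one-sentence-per-line file: every candidate end-of-sentence character
// is labeled, and only the LAST one in the line is treated as a split.
public class SampleLabelSketch {
    static boolean isEosCandidate(char c) {
        // Assumed candidate set for illustration.
        return c == '.' || c == '?' || c == '!';
    }

    static List<String> labelCandidates(String line) {
        int last = -1;
        for (int i = 0; i < line.length(); i++) {
            if (isEosCandidate(line.charAt(i))) last = i;
        }
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            if (isEosCandidate(line.charAt(i))) {
                labels.add(i + ":" + (i == last ? "split" : "no-split"));
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // The line does not end with a candidate character, so the dot in
        // "Dr." is the last candidate and is wrongly labeled as a split.
        System.out.println(labelCandidates("Dr. Smith said:")); // [2:split]
    }
}
```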

Nicolas Hernandez
added a comment - 24/Jun/11 11:15 As shown in my previous comment, I filtered out the lines which do not end with an assumed end-of-sentence character.
Anyway, I tested a model trained on 1 million europarl sentences.
In my test text, 'M.' ('Mr.' in French) occurs once in the middle of a sentence and twice as the first token of a sentence.
The two occurrences beginning a sentence are wrongly split, but not the one in the middle.
I do not infer anything from this, but I note that it partially works.
'M.' occurs 58 times in the training corpus.

Joern Kottmann
added a comment - 24/Jun/11 11:24 I usually debug such issues by closely inspecting the training data. Is there a case in the training data where a split occurs after 'M.'? Are there samples in the training data where 'M.' occurs at the beginning of a sentence?
This could also be caused by encoding issues. Of course, the model could also simply make a classification mistake.
I suggest using the integrated evaluation and more samples to get meaningful results; for English we end up somewhere in the 99% accuracy range.
If you do not want to prepare a test file you could use cross-validation.
We should also consider adding direct support to OpenNLP for training on europarl files.
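The cross-validation suggestion can be sketched as plain fold partitioning over the training lines (illustrative only; OpenNLP's integrated cross-validation handles the partitioning, training, and scoring internally):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of k-fold partitioning: fold i holds out every k-th
// line (starting at index i) for evaluation and trains on the rest.
public class FoldSketch {
    // Returns [heldOutLines, trainingLines] for the given fold.
    static List<List<String>> split(List<String> lines, int k, int fold) {
        List<String> heldOut = new ArrayList<>();
        List<String> train = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (i % k == fold) heldOut.add(lines.get(i));
            else train.add(lines.get(i));
        }
        return List.of(heldOut, train);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("s0", "s1", "s2", "s3", "s4");
        List<List<String>> parts = split(lines, 2, 0);
        System.out.println("held-out: " + parts.get(0)); // [s0, s2, s4]
        System.out.println("training: " + parts.get(1)); // [s1, s3]
    }
}
```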

Joern Kottmann
added a comment - 06/Jul/11 09:20 It was confirmed that the workaround fixed this issue; a proper fix still needs to be done as part of a refactoring of the Sentence Detector. See OPENNLP-205 for details.