Benajiba’s ANER Corpus

It’s 150K tokens in CoNLL format (easy to parse, but lossy for whitespace), using person, location, organization, and miscellaneous types, like CoNLL’s English corpus. Here’s a sample (rendered with the CSS style direction:rtl to get the ordering right; the actual file’s in the usual character order):
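Since the format is token-per-line with blank lines between sentences, parsing it takes only a few lines. Here’s a minimal sketch in Python (the two-column token/tag layout is assumed from the standard CoNLL NER format):

```python
def parse_conll(lines):
    """Parse CoNLL-style token/tag lines into sentences.

    Each non-blank line holds a token and its BIO tag separated by
    whitespace; blank lines separate sentences.  Note the lossiness:
    the original inter-token whitespace is not recoverable.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                 # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            token, tag = line.split()[:2]
            current.append((token, tag))
    if current:                      # flush a trailing sentence
        sentences.append(current)
    return sentences
```

For example, `parse_conll(["Codexis B-ORG", "rocks O", ""])` yields a single sentence of two (token, tag) pairs.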

Benajiba also supplies small dictionaries for locations, organizations and persons, which we explain how to use in the demo.

LingPipe Model and Evaluation

I applied a simple sentence-chunking heuristic and then built a character 8-gram-based rescoring chunker using otherwise default parameters and training on all the data (but not the dictionaries).
There’s absolutely nothing Arabic-specific about the model.
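The sentence-chunking heuristic isn’t spelled out above; as a rough illustration (my own sketch, not LingPipe’s actual heuristic), splitting on whitespace after sentence-final punctuation looks like:

```python
import re

# Sentence-final punctuation: Latin stops plus the Arabic question
# mark (U+061F).  Splitting after these is a crude but serviceable
# heuristic for chunking raw text into sentences before training.
_SENT_END = re.compile(r'(?<=[.!?\u061F])\s+')

def split_sentences(text):
    """Split text into sentences at whitespace following end punctuation."""
    return [s for s in _SENT_END.split(text) if s]
```

A real splitter would also handle abbreviations and numbers, but nothing here needs to be Arabic-specific beyond the extra punctuation character.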

Overall performance is in line with Benajiba’s, though it differs substantially across the four entity types. Here are the LingPipe 6-fold (125K-token train, 25K-token test) cross-validated results:

Type       Precision   Recall   F1
LOC        0.782       0.788    0.785
PERS       0.634       0.657    0.645
ORG        0.609       0.527    0.565
MISC       0.553       0.421    0.478
COMBINED   0.685       0.661    0.673
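The F1 column is just the harmonic mean of precision and recall; a quick check of the LOC row:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The LOC row: precision 0.782 and recall 0.788 give F1 of 0.785.
print(round(f1(0.782, 0.788), 3))  # prints 0.785
```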

Dictionaries improved recall a bit, but hurt precision even more. Bigger dictionaries and more training data would likely help here.

References

For a description of the corpus, and a description and evaluation of Benajiba’s own Arabic NER system, see:

12 Responses to “Arabic Named Entity Recognition with the ANER Corpus”

I am using LingPipe for named entity recognition in my research. I want to create my own model. For that I am using the same code as shown in the named-entity tutorial and used for the CoNLL 2002 dataset, but I am getting errors saying illegal line=Codexis B_ORG. The first line in my training file is Codexis B_ORG. The same code works for the CoNLL 2002 training files. I created the training file manually; is there anything special I need to consider? The error comes from the LineTaggingParser.java class, in parseString(), where it checks whether the line matches the pattern. I don’t see any mismatch with the pattern. Please help me out.

You may have just gotten bitten by my updating the tutorials to reflect the actual tags used in CoNLL: B-ORG, not B_ORG. The last release of LingPipe (3.9.1) wouldn’t parse CoNLL, but would’ve parsed data with tags like B_ORG.

All you need to do is convert your tags to CoNLL format, B-ORG, not B_ORG, and you should be good to go.
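Converting underscore-style tags to CoNLL’s hyphenated form is a one-line transformation per line of the training file. A minimal sketch (assuming token and tag are whitespace-separated, one pair per line):

```python
def fix_tags(line):
    """Rewrite underscore-style tags (B_ORG, I_PER) to CoNLL's
    hyphenated B-/I- form, leaving the token itself untouched."""
    parts = line.rsplit(None, 1)     # split off the trailing tag
    if len(parts) == 2 and parts[1][:2] in ("B_", "I_"):
        parts[1] = parts[1][:1] + "-" + parts[1][2:]
    return " ".join(parts)
```

So `fix_tags("Codexis B_ORG")` returns `"Codexis B-ORG"`, while lines with O tags pass through unchanged.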

This really isn’t the best place for a discussion of bugs in LingPipe because I don’t check it as often as mail and it’s not where others go to find bug reports and/or patches.

We have a mailing list linked from the web page and a direct mail address, bugs@alias-i.com.

To answer the question, you need to upgrade to LingPipe 3.9.2. I fixed the bug I introduced in 3.9.1 with respect to NE parsing in CoNLL format. I tested the new parser configurations with all the data we talk about, and it now works with B- and I- tags.

I’m not sure what you mean — there are a bunch of things people often call “efficiency”. The big ones are memory and time. Our three built-in statistical NER systems vary on both of these. If you go to the last section of our NER tutorial, you’ll see a small comparison. The absolute amounts will depend on the number of categories, amount of pruning in the parameters, and the amount of training data and number of features.

The best thing to do is find the NER you want and then time it on your hardware. For absolute throughput, keep in mind all our NER systems can be multithreaded with a single model in memory.

Hey LingPipe, you have made my life easier with your NLP tutorials and resources. I have successfully built an NER model for Tigrigna (an under-resourced language, like Arabic) using the LingPipe resources (ANERXval, an HMM model). But now I need to build a model using a CRF. My experimental setup: partition the entire corpus into a training set (2/3 or 80%) and a test set (1/3 or 20%), and also run cross-validation (the same fold cross-validation as ANERXval) using a CRF to build the NER system (especially for scarce corpora), reading Unicode text files.
What do you recommend? Please point me to resources and links for performing such NLP tasks. Thank you a lot for your excellent tutorials and support.
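The partitioning described in the comment above is library-agnostic; here is a plain-Python sketch of a deterministic 80/20 split plus k-fold cross-validation indices (LingPipe’s own ANERXval does its folds in Java, so this is illustrative only):

```python
def train_test_split(items, train_frac=0.8):
    """Deterministically split items into train/test by position."""
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

def k_folds(items, k):
    """Yield (train, test) pairs for k-fold cross-validation."""
    for i in range(k):
        test = items[i::k]                       # every k-th item
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test
```

For scarce corpora, cross-validation has the advantage that every sentence is used for testing exactly once, so the evaluation covers the whole corpus.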