Croatian and Slovene NERC models for Stanford NERC

hrStanfordNERC, slStanfordNERC

ID: 324

The Stanford NER model for named entity recognition and classification (NERC) in Croatian texts was built with the Stanford Named Entity Recognizer tool (http://nlp.stanford.edu/software/CRF-NER.shtml), which is based on Conditional Random Fields (CRF). The model was trained on a portion of texts crawled from the Vjesnik news portal and manually annotated for ENAMEX TYPE={LOCATION, ORGANIZATION, PERSON}. The manually tagged portion consists of 200,006 tokens in 7,358 sentences, containing 5,966 person tokens, 6,897 organization tokens and 4,784 location tokens. Tokens and named entity tags were used as features in the training procedure.
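
Corpus figures like those reported for the Croatian data can be recomputed from a column-format training file. The sketch below is only an illustration: it assumes one token per line, tab-separated columns with the entity label last, "O" for tokens outside any entity, and a blank line between sentences. The function name `corpus_stats` and the sample sentences are invented for this example and are not taken from the Vjesnik data.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (token count, sentence count, Counter of entity-labelled tokens).

    Assumes tab-separated columns with the label in the last column,
    "O" for non-entity tokens, and blank lines between sentences.
    """
    tokens = 0
    sentences = 0
    per_label = Counter()
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        label = line.split("\t")[-1]
        tokens += 1
        if label != "O":
            per_label[label] += 1
    return tokens, sentences, per_label


# Invented sample in the assumed two-column layout (token, label).
SAMPLE = [
    "Ana\tPERSON",
    "Horvat\tPERSON",
    "posjetila\tO",
    "Zagreb\tLOCATION",
    ".\tO",
    "",
    "Vjesnik\tORGANIZATION",
    "izvještava\tO",
    ".\tO",
]

if __name__ == "__main__":
    total, sents, counts = corpus_stats(SAMPLE)
    print(total, sents, dict(counts))
```

Passing an open file handle instead of `SAMPLE` works the same way, since the function only iterates over lines.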
The Stanford NER model for named entity recognition and classification (NERC) in Slovene texts was built with the Stanford Named Entity Recognizer tool (http://nlp.stanford.edu/software/CRF-NER.shtml), which is based on Conditional Random Fields (CRF). The model was trained on a portion of texts selected from the SSJ-500k corpus of Slovene (http://www.slovenscina.eu/tehnologije/ucni-korpus) and manually annotated for ENAMEX TYPE={LOCATION, ORGANIZATION, PERSON, MISC}. The manually tagged portion consists of 216,011 tokens in 9,663 sentences, containing 4,204 person tokens, 2,526 organization tokens, 2,421 location tokens and 1,143 miscellaneous tokens. The manually assigned POS/MSD tags and lemmas for these 216,011 tokens were extracted from the corpus and used as features in the training procedure.
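
Stanford NER is usually trained from a properties file that maps the columns of the training data to feature sources. The sketch below generates such a file for data carrying a POS/MSD tag and a lemma next to each token. Everything specific in it is an assumption made for illustration: the column order, the file names `sl-train.tsv` and `sl-ner-model.ser.gz`, and the choice of flags. It is not the configuration used to build the published model, and the key names (`map`, `useTags`, `useLemmas` and the others) should be checked against the Stanford NER documentation before use.

```python
def build_properties(train_file, model_file):
    """Return the text of a Stanford NER properties file.

    Assumed column layout (hypothetical): token, POS/MSD tag, lemma, label.
    """
    props = [
        ("trainFile", train_file),       # column-format training data
        ("serializeTo", model_file),     # where the trained model is written
        ("map", "word=0,tag=1,lemma=2,answer=3"),
        ("useClassFeature", "true"),
        ("useWord", "true"),
        ("useTags", "true"),             # use the POS/MSD column as a feature
        ("useLemmas", "true"),           # use the lemma column as a feature
    ]
    return "\n".join(f"{key} = {value}" for key, value in props) + "\n"


if __name__ == "__main__":
    # Hypothetical file names, chosen only for this example.
    print(build_properties("sl-train.tsv", "sl-ner-model.ser.gz"))
```

The generated text would be saved to a file and passed to the trainer with its `-prop` option, for example `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop sl.prop`, where `sl.prop` is again a hypothetical name.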