JRC-Names

JRC-Names is a highly multilingual named entity resource for person and organisation names (called 'entities'). It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.).

The named entity resource file with the list of spelling variants is accompanied by demonstrator software, implemented in Java, that (a) produces a list of known spelling variants for any input name, and (b) analyses UTF-8-encoded text files to find known entity mentions, returning the name variant found, the preferred display name for that entity, the unique identifier for that name, the position of the entity name in the text, and its length in characters.
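The two functions of the demonstrator can be illustrated with a minimal sketch. This is not the Java demonstrator's code or API: the toy dictionaries, the `Mention` record and the function names `known_variants` and `find_mentions` are all hypothetical, and the sketch assumes simple exact matching on word boundaries.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    variant: str        # spelling found in the text
    display_name: str   # preferred display name of the entity
    entity_id: int      # unique identifier of the entity
    offset: int         # character position of the mention in the text
    length: int         # length of the mention in characters


# Hypothetical toy data, not taken from the real resource file.
VARIANT_TO_ID = {
    "Muammar Gaddafi": 1,
    "Moammar Gadhafi": 1,
    "Муаммар Каддафи": 1,
    "United Nations": 2,
    "Nations Unies": 2,
}
DISPLAY_NAME = {1: "Muammar Gaddafi", 2: "United Nations"}

# Longest variants first, so a longer name wins over a shorter one it contains.
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(v) for v in sorted(VARIANT_TO_ID, key=len, reverse=True))
    + r")(?!\w)"
)


def known_variants(name: str) -> list[str]:
    """Return all known spellings of the entity that `name` belongs to."""
    entity_id = VARIANT_TO_ID.get(name)
    if entity_id is None:
        return []
    return sorted(v for v, i in VARIANT_TO_ID.items() if i == entity_id)


def find_mentions(text: str) -> list[Mention]:
    """Find known entity mentions in `text`, in order of appearance."""
    mentions = []
    for match in _PATTERN.finditer(text):
        variant = match.group()
        entity_id = VARIANT_TO_ID[variant]
        mentions.append(
            Mention(variant, DISPLAY_NAME[entity_id], entity_id,
                    match.start(), len(variant))
        )
    return mentions
```

Calling `find_mentions("Moammar Gadhafi addressed the United Nations.")` on this toy data yields two mentions, the first mapped to the display name "Muammar Gaddafi". A real system would also need to handle inflected forms and case variation, which this sketch ignores.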

For examples, visit any of the more than one million entity pages on EMM-NewsExplorer (e.g. that for the United Nations), each of which lists the spelling variants automatically collected for that entity.

[Figure: known spelling variants for the person name Muammar Gaddafi]

The data release by the Joint Research Centre (JRC) is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.

JRC-Names is a technical resource that can be used to find names even if they are spelled differently, but it is also a useful ingredient for IT systems that process text, e.g. for text mining.

The tool serves many purposes and addresses various problems, including the following:

Proper names are a problem when searching databases, the internet and other repositories, because variants of searched names are often not found. This results in suboptimal use and exploitation of repositories for documents, images and audio-visual content. JRC-Names makes it possible to standardise names and thus improve retrieval;

Names are a known problem for machine translation, as they should not be translated like other words; names can be extracted before the translation process and the target-language variant re-inserted into the translated text to solve this problem;

Lists of names in two different scripts are often used to learn transliteration rules;

Names can be recognised and marked up in text to use as seeds when training a machine learning named entity recognition system;

Social networks are less biased by national viewpoints if produced using multi-national sources and entity lists;

Recognition of names is useful as input to the computational linguistics tasks of opinion mining, co-reference resolution, summarisation, topic detection and tracking, cross-lingual linking of related documents, and more.
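As an illustration of the machine translation use case in the list above, the following sketch masks known names with placeholders before translation and re-inserts the target-language spelling afterwards. The toy lexicon `NAMES`, the placeholder format and the function names are hypothetical, and the translation step itself is left to whatever system is in use.

```python
# Hypothetical toy lexicon: entity id -> {language code: spelling}.
NAMES = {
    1: {"en": "Muammar Gaddafi", "de": "Muammar al-Gaddafi", "ru": "Муаммар Каддафи"},
}


def mask_names(text: str, source_lang: str) -> tuple[str, dict[str, int]]:
    """Replace known names with placeholders; return the text and a slot map."""
    slots = {}
    for entity_id, spellings in NAMES.items():
        name = spellings.get(source_lang)
        if name and name in text:
            placeholder = f"__ENT{entity_id}__"
            text = text.replace(name, placeholder)
            slots[placeholder] = entity_id
    return text, slots


def unmask_names(text: str, slots: dict[str, int],
                 target_lang: str, source_lang: str) -> str:
    """Re-insert names, preferring the target-language spelling if known."""
    for placeholder, entity_id in slots.items():
        spellings = NAMES[entity_id]
        # Fall back to the source spelling when no target spelling is listed.
        replacement = spellings.get(target_lang, spellings[source_lang])
        text = text.replace(placeholder, replacement)
    return text
```

The masked text is passed through the translation system, which is assumed to leave the placeholders untouched; `unmask_names` is then applied to its output.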

JRC-Names is a by-product of the analysis of about 100,000 news reports per day by the Europe Media Monitor (EMM) family of applications.

It was mostly compiled automatically, by analysing hundreds of millions of news articles since the year 2004 in up to twenty languages, identifying names of entities (mostly persons, but also organisations, event names, and more), and detecting which of these newly found names are variant spellings of each other. Most name variants in JRC-Names are thus spellings that were found in real-life text (including frequent spelling mistakes). Additionally, for a subset of the collection of entities, software automatically extracted spelling variants in many further languages (e.g. Chinese, Thai, Japanese, ...) from the cross-lingual links in Wikipedia. For highly frequent or otherwise important names, the named entity resource was additionally manually verified. As JRC-Names was mostly produced automatically, it will contain some errors.

JRC-Names contains the most important names of the EMM name database, i.e. those names that were found frequently or that were verified manually or found on Wikipedia.

The first release of JRC-Names (September 2011) contains the names of about 205,000 distinct known entities, plus about the same number of variant spellings for these entities. Additionally, it contains a number of morphologically inflected variants of these names. The resource grows by about 230 new entities and an additional 430 new name variants per week (status July 2011).

EMM identifies new names every day, and a file that also includes the most recently found names and name spellings is available for daily download from the JRC's web pages.

As of July 2011, the database included names spelt in 27 different scripts. The most frequently used scripts are Latin (including English and most other European languages), Cyrillic (e.g. Russian and Bulgarian), Arabic (including Farsi), Japanese (Han, Hiragana and Katakana) and Chinese Han (simplified variant).

64% of the names in JRC-Names do not have additional spelling variants. For 28% of the names, JRC-Names knows two or three spellings. There are 3,760 entities with ten spellings or more, and 37 entities with over 100 spelling variants. The names with the most spelling variants are Muammar Gaddafi (413 spellings), Mikhail Saakashvili (256) and Mahmoud Ahmadinejad (246) (status July 2011).
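Figures of this kind can be recomputed from any mapping of entities to their spellings. The sketch below is only an illustration: the function name `spelling_stats` and the shape of the input mapping are assumptions, not part of the JRC-Names distribution.

```python
def spelling_stats(variants_by_entity: dict[str, list[str]]) -> dict:
    """Summarise how many distinct spellings each entity has."""
    # Count distinct spellings per entity (duplicates are ignored).
    sizes = {entity: len(set(names)) for entity, names in variants_by_entity.items()}
    total = len(sizes)
    return {
        # Entities with exactly one spelling, i.e. no additional variants.
        "single_spelling_share": sum(s == 1 for s in sizes.values()) / total,
        "two_or_three_share": sum(2 <= s <= 3 for s in sizes.values()) / total,
        "ten_or_more": sum(s >= 10 for s in sizes.values()),
        "over_100": sum(s > 100 for s in sizes.values()),
        "most_variants": max(sizes, key=sizes.get),
    }
```

The function expects a non-empty mapping; an empty one would cause a division by zero.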

Depending on your needs, you may want to download part or all of the following components:

JRC-Names Java demonstrator code: This .jar file analyses UTF-8-encoded text files to recognise known named entities. It can also generate a list of all known variants for any input name. It needs to be used in combination with the entity resource file.

JRC-Names named entity resource file: This file contains the list of names and their variants. It is planned that this file will be updated daily to include the most recently added entity names (filename: entities.gzip; zipped size: ca. 4 MB; unzipped: ca. 13 MB).
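For readers who want to load the resource file outside the Java demonstrator, a minimal sketch follows. The internal layout of the file is not described here, so the sketch assumes a hypothetical tab-separated layout with four columns (entity identifier, entity type, language code, name); this assumption, like the function name `load_entities`, should be checked against the documentation shipped with the download.

```python
import gzip
from collections import defaultdict


def load_entities(path: str) -> dict[int, list[str]]:
    """Group name spellings by entity identifier.

    ASSUMPTION: each data line has four tab-separated fields:
    entity id, entity type, language code, name.
    """
    variants = defaultdict(list)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip blank lines, comments and anything not matching the layout.
            if len(fields) != 4 or not fields[0].isdigit():
                continue
            entity_id, _entity_type, _language, name = fields
            variants[int(entity_id)].append(name)
    return dict(variants)
```

Reading the file line by line keeps memory use modest, although at roughly 13 MB unzipped the whole file would also fit comfortably in memory.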
