The Identrics Case – Automatically Extract Context from News Articles

Identrics gives you an opportunity to build an NLP algorithm that extracts key words from news about companies.

Identrics is a dedicated research and development unit focused on AI and automation, and part of A Data Pro group and its 400-strong team.

Through Identrics’ technological advancements, clients get more out of data, do more with it, and access it more easily. By training the technologies with clients’ own datasets, our team can reach an unrivalled precision and quality rate of up to 95%, usually only achievable through manual work.

The Business Case

The task is to extract words representative of an entity’s activities and properties, in a specified context of interest, from news articles.

Identrics extracts knowledge from unstructured text. They apply classifiers to sort content into a set of predefined classes. A common client request is not just to classify the full text, but also to classify the named entities mentioned within the documents.

This problem spans further than classifying entities to basic types, such as person, organisation or geographical location. Entities can be classified with custom categories like positive/negative sentiment for persons/products, financial growth detection for companies, or simply identification of the entities with a central role in the described events.

To achieve this kind of analysis, one sophisticated preprocessing step is needed – to recognise and collect a set of words (tags), which directly describe entity behavior in the context of a document. This set of words actually represents the context of mentions of a particular entity in the document. Then, these contextually related word sets could be used as training instances for the machine learning process if people manually annotate them to certain classes.

Right now Identrics are looking for a solution that can efficiently extract the entities’ contexts in order to apply it to their machine learning process. Context extraction requires several concrete steps: tokenization, lexical and syntactic analysis, coreference chain resolution, and dependency parsing. Identrics challenge you to dive deeper into semantic analysis and the complexity of language structure.
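To make the overall shape of such a pipeline concrete, here is a minimal sketch in plain Python. It deliberately replaces the real linguistic components with naive stand-ins – a regex tokenizer and a fixed context window instead of coreference resolution and dependency parsing – so every function name and parameter here is illustrative, not part of any actual Identrics tooling.

```python
import re

def tokenize(text):
    # Naive tokenizer: lowercased alphabetic tokens only
    # (a stand-in for real lexical analysis).
    return re.findall(r"[A-Za-z]+", text.lower())

def context_window(tokens, entity_tokens, width=3):
    # Collect words within `width` tokens of any entity mention --
    # a crude placeholder for coreference- and dependency-based
    # context extraction.
    entity = set(entity_tokens)
    context = set()
    for i, tok in enumerate(tokens):
        if tok in entity:
            lo, hi = max(0, i - width), min(len(tokens), i + width + 1)
            context.update(t for t in tokens[lo:hi] if t not in entity)
    return sorted(context)

tokens = tokenize("Acme announced record profits. Later, Acme appointed a new CEO.")
print(context_window(tokens, ["acme"]))
# → ['a', 'announced', 'appointed', 'later', 'new', 'profits', 'record']
```

The later subtasks replace the fixed window with linguistically informed selection: coreference chains decide *which sentences* belong to the entity, and dependency parsing decides *which words* within them.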

In this Datathon, Identrics are providing you with a database containing news in English, as well as the extracted entities of an “Organisation” type. The desired output is a dataset with contextually relevant words (tags), referring to the individual entities.

Your goal will be to create instances for machine learning with words related to the entity. The type of entities is “Organisation” and the context is stock exchange movements, structural and political changes, redundancy or appointment of employees, and new service announcements.

This context would give us the ability to make future predictions about a company’s development. Every instance in this dataset should be built from the best coreference-chained sentences for the organisation in the document, keeping just the context words that depend on the entity. Most documents mention two or more companies, and there are even review articles containing dozens of entities of the “Organisation” type, each with a different context.

Effective and precise NLP is essential to the task. The English Universal Dependency TreeBank is a good starting point. Dependency parsing and understanding of the relations between the words in the sentence will give us greater knowledge about which of them are valuable for machine learning.

The Research Problem

Usually the clients provide Identrics with the documents and their IDs. Identrics then extracts the entities, keeping just those of the searched class. Next, coreference chains are built for the entities, and the longest chain of mentions is chosen to create a short extractive summary for each entity, which is used as an instance for further machine learning purposes. These summaries consist of whole sentences, which often mention more than one entity comparatively or in the same context. They may express similarities or differences between two or more entities, often using comparative or superlative forms of an adjective or adverb.

So when parsing the text, we should at least know which of the organisations are in the subject or object position, and decide whether they are mentioned in the same context. The final goal is to classify organisations’ activities and properties, found in unstructured news articles, with the client’s custom taxonomy. Choosing the right content from the text is crucial to the accuracy of the classifier’s predictions.

Recommended data science approach, methods and/or techniques. Useful insights, cross-sections, distributions, etc
Dependency parsing and understanding the relations between the words in a sentence is a good approach to this preprocessing step. We think that a good understanding of Universal Dependency parsing, and semantic analysis on the English Universal Dependency TreeBank, is a good starting point.

We encourage the participants to try different approaches to choosing the words that form the entity’s relevant context, such as TF-IDF scoring, word embeddings, or other state-of-the-art methodologies.
The case is divided into two subtasks.
The first subtask is to find the best coreference chains of mentions for the organisations. For this one we have tested the Stanford CoreNLP library for coreference resolution, which is available here: https://stanfordnlp.github.io/CoreNLP/coref.html
Their online demo provides a good visualisation of the referent mentions: http://nlp.stanford.edu:8080/corenlp/
But different approaches to this task are available, and we are interested in whether they can give better coreference resolution in terms of easy-to-use output, language independence, easy training for new languages, or even better chains of mentions. Some possible solutions to this task are:
http://www.bart-coref.org/
https://github.com/huggingface/neuralcoref
http://nlp.cs.berkeley.edu/projects/coref.shtml
In any case, we are open to discussions on these and other alternatives for the task of finding all the mentions of one entity in a document.
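As a baseline for intuition only, a chain of *non-pronominal* mentions can be approximated by matching the full entity name and its individual name words against the text. The sketch below does exactly that in plain Python; it is not real coreference resolution (it cannot link pronouns or nominal paraphrases, which is what CoreNLP, BART, or neuralcoref provide), and the function name is hypothetical.

```python
def mention_chain(text, full_name):
    # Collect character-offset spans of the full entity name and of each
    # individual name word (e.g. "Exchange" on its own), approximating a
    # chain of non-pronominal mentions. Spans already covered by a longer
    # mention are skipped.
    spans = []
    seen = set()
    for surface in [full_name] + full_name.split():
        start = 0
        while True:
            idx = text.find(surface, start)
            if idx == -1:
                break
            span = (idx, idx + len(surface))
            if not any(s <= span[0] and span[1] <= e for s, e in seen):
                spans.append(span)
                seen.add(span)
            start = idx + 1
    return sorted(spans)

text = "The Bucharest Stock Exchange rose today. The Exchange closed higher."
print(mention_chain(text, "Bucharest Stock Exchange"))
# → [(4, 28), (45, 53)]
```

A real resolver would additionally link mentions like “it” or “the bourse” into the same chain, which is precisely why the first subtask matters.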

The second subtask in this case is to extract the words relevant to the entity and the given context. The desired output is just the relevant words from the news, excluding context related to some other event or subject mentioned. One sophisticated solution in this field is the use of the Universal Dependencies (UD) treebanks, available in more than 70 languages, together with Universal Dependencies parsers. We have tested the Stanford UD parser, available in Stanford CoreNLP (https://nlp.stanford.edu/software/stanford-dependencies.shtml), but we are open to trying other approaches for the same task. The UD tools page at http://universaldependencies.org/tools.html#ud-maintained-tools shows just a part of them.
More about the Universal Dependencies project is available on the project’s website.
We are looking for a data-driven, cross-lingual dependency parser. MaltParser is well published (http://www.maltparser.org/publications.html) and available for download here: http://www.maltparser.org/download.html
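To illustrate what a dependency parse buys you for this subtask, the sketch below reads a tiny hand-written parse in the CoNLL-U format used by the UD treebanks and keeps the tokens attached to the entity’s governor (usually its verb). The parse is hardcoded for illustration; in practice it would come from a UD parser such as the Stanford parser or MaltParser, and the selection rule here is a deliberately simple one-hop heuristic, not a prescribed method.

```python
# A hand-written CoNLL-U-style parse (ID, FORM, HEAD, DEPREL columns only)
# of: "Acme appointed a new director yesterday."
CONLLU = """\
1\tAcme\t2\tnsubj
2\tappointed\t0\troot
3\ta\t5\tdet
4\tnew\t5\tamod
5\tdirector\t2\tobj
6\tyesterday\t2\tadvmod
"""

def entity_context(conllu, entity_form):
    rows = [line.split("\t") for line in conllu.strip().splitlines()]
    tokens = {int(r[0]): {"form": r[1], "head": int(r[2]), "rel": r[3]}
              for r in rows}
    entity_ids = {i for i, t in tokens.items() if t["form"] == entity_form}
    heads_of_entity = {tokens[i]["head"] for i in entity_ids}
    context = []
    for i, t in sorted(tokens.items()):
        if i in entity_ids:
            continue
        # Keep the entity's governor (e.g. its verb) and that governor's
        # other direct dependents.
        if i in heads_of_entity or t["head"] in heads_of_entity:
            context.append(t["form"])
    return context

print(entity_context(CONLLU, "Acme"))
# → ['appointed', 'director', 'yesterday']
```

Extending the traversal one more hop down the tree would also pull in modifiers such as “new”, which may or may not be desirable depending on the context of interest.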

Other solutions to the problem are welcome and we are going to discuss them too.

Data description
The data will be provided to you as a MySQL dump containing business news in English and the extracted organisations of interest, mapped by doc_id.
The MySQL dump contains the SQL schema and data for two tables – one for documents and one for the extracted entities. Generated document identifiers give the one-to-many relationship between the two tables.
Documents table structure:
doc_id – column containing generated integer identifiers for documents
body – column containing longtext content of the documents
Column doc_id has a primary key constraint for the table

Entities table structure:
entity_id – column containing generated integer identifiers for entities
doc_id – column containing the generated doc_id that references the same doc_id in the `documents` table
word – string representation of mentioned entity
type – mentioned entity type. In this dump all entities are of type “ORGANIZATION”
start_offset – the position number of first entity character in document text
end_offset – the position number of last entity character in document text
Column entity_id has a primary key constraint for the table. Column doc_id does not have a foreign key constraint to doc_id in the `documents` table.

Please note that the content of the `entities` table is stored in sequential denormalised form: each word of every individual organisation name is stored in a separate row. This is convenient for working with coreference chains, where only part of the entity name occurs in a sentence. Entities can easily be normalised to their full literals using start_offset and end_offset: if the start_offset of an entity exceeds the end_offset of the previous entity in the table by exactly 1, then both records are parts of one bigger entity mention.
For example:
id word start end
1 Bucharest 140 149
2 Stock 150 155
3 Exchange 156 164
the actual full entity name is “Bucharest Stock Exchange”. This kind of normalisation is needed for the output format of the solution, where the extracted context words should be assigned to the entire entity record – with the example above, it would be something like:
Bucharest Stock Exchange – stock indices mostly green higher turnover indicated

The MySQL dump needs to be imported into a database named `datathon`. If you are not familiar with MySQL, or a database is not a convenient way for you to use the data, we also provide the same database exported as CSV files with exactly the schema described above.
The archive named “business-news-2015-mysql-dump.zip” contains both MySQL dump and CSV files.
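The offset-based normalisation rule above can be implemented directly; a minimal sketch, assuming rows arrive in table order as (entity_id, word, start_offset, end_offset) tuples:

```python
def normalize_entities(rows):
    # Merge consecutive rows where the next start_offset is exactly one past
    # the previous end_offset, reassembling split names like
    # "Bucharest" + "Stock" + "Exchange" -> "Bucharest Stock Exchange".
    merged = []
    for eid, word, start, end in rows:
        if merged and start == merged[-1][3] + 1:
            prev = merged[-1]
            merged[-1] = (prev[0], prev[1] + " " + word, prev[2], end)
        else:
            merged.append((eid, word, start, end))
    return merged

rows = [(1, "Bucharest", 140, 149), (2, "Stock", 150, 155), (3, "Exchange", 156, 164)]
print(normalize_entities(rows))
# → [(1, 'Bucharest Stock Exchange', 140, 164)]
```

The merged record keeps the entity_id of its first fragment and spans the full character range of the original mention.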

Expected results
What we expect as a result is a CSV file containing the following fields:
doc_id – identifier of the document for which named entities have extracted context words
entity – normalised form of the entity with its full name
context – list of all words describing the context of the entity’s mentions in the document, separated by spaces

The result must cover all documents in the database. (There is no limitation about number of words in the context.)
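Writing that output with the standard csv module might look like the sketch below; the result rows here are invented placeholders, and the exact field formatting (space-separated context words in one CSV field) follows the description above.

```python
import csv
import io

# Hypothetical extracted results: (doc_id, normalised entity, context words).
results = [
    (1, "Bucharest Stock Exchange", ["stock", "indices", "mostly", "green"]),
    (2, "Acme Corp", ["appointed", "director"]),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["doc_id", "entity", "context"])
for doc_id, entity, words in results:
    # Context words are space-separated inside a single CSV field.
    writer.writerow([doc_id, entity, " ".join(words)])

print(buf.getvalue())
```

In a real submission the buffer would be an open file covering every document in the database.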

Acceptable accuracy measure, KPI-s
● The case is going to be divided into two subtasks, assessed separately.
● The quality of the chosen coreference-chained entities and their mentions in the text is crucial for the first subtask.
● The second subtask is to create training examples containing just the relevant words from the text.
● For the second subtask we are looking for parsers or other solutions with good language coverage, or ones that are easy to implement for different languages.
● We will appreciate and evaluate alternative approaches to those proposed above.
● The created instances per entity are going to be reviewed by us for relevance to the context of interest and to the entity of interest.

Iva has been with Identrics since its inception in 2015, initially as a DevOps Engineer, and currently as Data Science Researcher. As an integral part of the A Data Pro Group’s innovation team, she investigates and evaluates new approaches to Natural Language Processing problems, building and improving AI solutions from development to production.

With a decade in the field, Deyan is a veteran in the development of semantic technologies and methods for knowledge extraction from unstructured text. He joined the A Data Pro Group in 2015 as Chief Technology Officer of the innovation hub Identrics, and has since been the go-to man for AI and automation-led process optimisation across our services.

Deyan’s specialties include software integration in business environments, the creation of semantic data models and services, building software solutions for machine learning and – not least – the flute, which he plays in the office after hours. Beyond his vast know-how and passion for his craft, what makes Deyan such an asset to the Identrics team is his congeniality and ability to relate complex machine learning concepts in lay terms.

Article instructions

The main focal point for presenting each team’s results from the Datathon is the written article. It will be considered by the jury and will show how well the team has done the job.

Considering the short amount of time and resources, in the world of Big Data analysis it is essential to follow a time-tested and many-project-tested methodology: CRISP-DM. You can read more at http://www.sv-europe.com/crisp-dm-methodology/
The organising team has tried to do most of the work on phases “1. Business Understanding” and “2. Data Understanding”, while the teams are expected to focus more on phases 3, 4 and 5 (“Data Preparation”, “Modeling” and “Evaluation”), so that the best solutions should have the best results in phase 5, Evaluation.
Phase “6. Deployment” mostly stays in the hands of the companies providing the case studies, as we aim at continuation of the process after the event. So stay tuned and follow the updates on the website of the event.

1. Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard can be used.

2. Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

3. Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

4. Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

5. Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

6. Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.