Project links

This section provides links to the source code, the documentation and the project timeline.
Its main purpose is to list all the places where you can find the work implemented during Google Summer of Code.
The results of the work are analysed extensively in the Results and Deliverables sections.
However, if you need a direct tour of the whole project, this section is for you.

Note: There are two repositories and two wiki pages. The first repository and its wiki page include everything
related to the addition of the Greek language to spaCy. The second repository and its wiki page include everything related to the
implementation of a demo app on top of spaCy that demonstrates its capabilities and supports various features such as sentiment analysis, topic classification, etc.

Project repository and Wiki page

The project proposal was mainly about adding Greek language support to the spaCy platform.

This goal has been accomplished and the source code
is provided in the following repository:

There is extensive documentation of every aspect of the process of adding the Greek language to spaCy on the following wiki page:

Demo repository and Wiki page

NLPBuddy is a demo produced during Google Summer of Code. It is built on top of spaCy and implements various interesting
tasks, all of which are supported for Greek too.

It makes use of the first part of the Google Summer of Code project, the addition of the Greek language to spaCy, and it has some quite interesting
features such as syntax analysis, emotion analysis, topic classification and a lot more.

DISCLAIMER: Due to the complexity of the project, it is almost impossible to list everything that was implemented
during Google Summer of Code 2018. There are over 50 completed tasks in the timeline, and the list may be enriched in the near future.

Problem Statement - Project Goals

We live in the era of data. Every minute, 3.8 billion internet users produce content: more than 120 million emails, 500,000 Facebook comments and 3 million Google searches. If we want to handle that amount of data efficiently, we need to
process natural language. Open-source projects such as spaCy, TextBlob or NLTK contribute significantly in that direction, and thus they need to be reinforced.

This project is about improving the quality of Natural Language Processing of Greek Language.

The project goals can be categorized as follows:

Addition of the Greek language to the spaCy platform. Status: Complete

Production of models for Part-of-Speech (POS) tagging, dependency analysis (DEP) and Named Entity Recognition (NER), with and without word vectors. Status: Complete

An open-source text analysis tool (demo) with which everyone can perform common NLP tasks in 7 languages. Status: Complete

Bonus goal: usage of the Greek language addition for sentiment analysis and other challenging NLP tasks. Status: Complete

Note: All the project goals have been achieved. In addition, many side results were produced during Google Summer of Code 2018.
An analysis of the achievements (with pull requests, links to production-ready modules, etc.) follows in the next two sections.

Results - Production ready tools

Addition of Greek language support to spaCy.

The Greek language has been successfully added to spaCy, which was actually the most important goal of the project.

Two pull requests have been made; the first pull request is about the initial addition of the language, and the second contains important optimizations and additions that enrich the features the Greek language class supports.

Addition of the language: You can see the first pull request here (Status: Merged)

Optimizations to the Greek language class: You can see the second pull request here (Status: Merged)

Each part of the process of integrating the Greek language into spaCy is discussed in detail on the wiki page of the project.

Greek language models

Two models for the Greek language have been produced.

There is an ongoing process of uploading them to spaCy. After that, you will be able to install them with the following commands:
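The exact commands were not final at the time of writing; assuming the models keep the names used in this report, installation should follow spaCy's standard download pattern (el_core_web_lg is the large-model name mentioned below; the small-model name here is an assumed variant):

```shell
# Hypothetical install commands, following spaCy's standard
# "python -m spacy download" pattern. el_core_web_lg is the large model
# named in this report; el_core_web_sm is an assumed small variant.
python -m spacy download el_core_web_sm
python -m spacy download el_core_web_lg
```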

The Greek language models support most of the capabilities that you will find in the Deliverables section. Sentence splitting, tokenization, Part-of-Speech tagging, syntax analysis using DEP tags, Named Entity Recognition,
lexical attribute extraction, norm exceptions and stop-word lists are all included in the Greek language models. The big Greek model (el_core_web_lg) includes word vectors, so it supports features such as similarity detection between texts.

Text categorization among the following categories: Sports, Science, World News, Greek News, Environment, Politics, Art and Health. The Greek classifier is built with fastText and is trained on 20,000 articles labeled with
these categories. Accuracy reaches 90%.

Text subjectivity analysis.

Emotion analysis. It detects the main text emotion among the following emotions: Anger, Disgust, Fear, Happiness, Sadness, Surprise.

Lexical attributes. Get numerals, URLs and emails from the text.

Noun chunks. Get noun phrases from your text, such as "the red bicycle".
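The demo's Greek classifier itself is a fastText model trained on labeled articles; as a rough, dependency-free sketch of the categorization idea, here is a toy keyword-count classifier. The category names come from the list above, while the keyword sets are invented for illustration:

```python
# Toy stand-in for the demo's fastText categorizer: score a text by
# counting hits from per-category keyword lists and return the best
# category. The real classifier learns these associations from 20,000
# labeled articles; the keyword sets below are invented for illustration.
KEYWORDS = {
    "Sports": {"match", "team", "goal", "league"},
    "Science": {"research", "experiment", "theory"},
    "Politics": {"election", "parliament", "minister"},
    "Health": {"doctor", "vaccine", "hospital"},
}

def categorize(text):
    tokens = text.lower().split()
    scores = {cat: sum(t in kws for t in tokens) for cat, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(categorize("The team scored a late goal to win the league match"))  # Sports
```

A learned model replaces the hand-written keyword sets with weights estimated from data, which is what makes the 90% accuracy figure possible.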

The supported languages at the moment are the following: Greek, English, German, Spanish, Portuguese, French, Italian and Dutch.

Text can either be provided directly or imported from a URL. For preprocessing text imported from a URL, the following libraries are used: python-readability and BeautifulSoup4.

Note: All the functionalities that the demo supports (and some more) are implemented as modules, so anybody can use them independently.
Those modules are extensively discussed in the Deliverables section. The central idea is that this Google Summer of Code project should produce results that will later be used by people all around the world.
For that reason, together with my mentor, Markos Gogoulos, we have implemented an API for the demo so anybody can access the results it provides (see more here).

Improvements in spaCy

A side goal of the project was to empower spaCy itself. There is an open dialogue with the creators of spaCy, whom we would like to thank for their continuous support and enthusiasm.

Documentation Improvements

A pull request for documentation improvements was successfully merged.

The pull request fixed a small error found in the spaCy documentation, in the pseudocode provided for overriding the spaCy tokenizer.

Sharing awareness

I have been invited to write an article for the Explosion AI blog about the integration of the Greek language into spaCy, due to the innovative approaches followed during Google Summer of Code 2018.
The article is currently being written and evaluated, and its publication may come after the end of Google Summer of Code.

A link to the post will be published here when it's ready.

Innovative approaches

In the process of integrating the Greek language into spaCy, some new approaches were followed. Hopefully, these approaches will inspire other languages too.

The Greek language is the second language in spaCy that follows a rule-based lemmatization procedure.

There were no available data for training a NER classifier, so the data had to be created. A fast procedure for annotating data using the Prodigy annotation tool is proposed for future reference. Learn more about it on the corresponding wiki page.

Deliverables

Deliverables are independent functionality submodules and/or useful resources that were produced either during the process of integrating the Greek language into spaCy or while experimenting with the functionalities of spaCy and the demo implementation.

A list of the deliverables, with a short description of each, follows. You can find the functionality submodules in the res/modules folder of the project repo (here), serving as usage examples.

Each of the deliverables is labelled with one of the following tags:
greek-spacy-support , nlp-task, resource.

The greek-spacy-support tag stands for modules that were required for the integration of the Greek language into spaCy.

The nlp-task tag stands for submodules that provide useful functionality for some NLP task. Those modules may be implemented for more than one language or only for Greek.

The resource tag stands for useful resources for the Greek language, such as datasets created during the process of integrating Greek into spaCy.

If you want to learn more, there is an individual page for each of them in the project wiki or the demo wiki.

Deliverables list:

Tokenizer. greek-spacy-support

You can use this submodule, having one of the produced Greek models, to split your sentence(s) into tokens, independently of the other
spaCy modules.

Stop-words list. resource

In computing, stop words are words which are filtered out before or after processing of natural language data. Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used
by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

The stop-words wiki page is available here. The final list with the stop-words of Greek language can be found here.
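Stop-word filtering itself is a simple set lookup. A minimal sketch, using a handful of common Greek function words (the full list produced for spaCy is much longer and lives in the repository):

```python
# Filter tokens against a (tiny, illustrative) Greek stop-word set.
# The full list produced for spaCy contains many more entries.
GREEK_STOP_WORDS = {"και", "το", "η", "ο", "να", "με", "σε", "από"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in GREEK_STOP_WORDS]

tokens = ["η", "γλώσσα", "και", "το", "κείμενο"]
print(remove_stop_words(tokens))  # ['γλώσσα', 'κείμενο']
```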

Norm exceptions list. resource

spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations.
This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words – for example, "realize" and "realise", or "thx" and "thanks".

The norm-exceptions wiki page is available here. The final list with the norm exceptions of the Greek language can be found here.
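Mechanically, norm exceptions are a mapping from variant spellings to a canonical form, consulted when a token's norm is computed. A minimal sketch, using the English examples from the description above (the Greek table works the same way):

```python
# Map variant spellings to a single canonical norm; anything not in the
# table keeps its lowercased form. The two entries come from the
# examples above ("realise" -> "realize", "thx" -> "thanks").
NORM_EXCEPTIONS = {
    "realise": "realize",
    "thx": "thanks",
}

def norm(token_text):
    lowered = token_text.lower()
    return NORM_EXCEPTIONS.get(lowered, lowered)

print(norm("Realise"))  # realize
print(norm("thx"))      # thanks
print(norm("Greek"))    # greek
```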

Named Entities annotated dataset. resource

For the Greek language, there was no available dataset for Named Entities, so we had to create our own annotated dataset using Prodigy.
The annotated dataset is available here.
You can learn more about NER and Prodigy in the following links: Link 1, Link 2.

Lexical Attributes Functions. greek-spacy-support

Each token of a spaCy doc is checked against some potential attributes. In this way, URLs, numbers and other types of special tokens can be
separated from the normal tokens.
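The idea can be sketched with plain regular expressions. The patterns below are simplified stand-ins for the attribute functions (spaCy exposes the results as attributes such as `like_url`, `like_email` and `like_num`); real implementations handle many more cases:

```python
import re

# Simplified stand-ins for lexical attribute functions: each takes a
# token's text and returns a boolean, so special tokens (URLs, emails,
# numbers) can be separated from normal ones. The patterns are
# deliberately minimal, for illustration only.
URL_RE = re.compile(r"^(https?://|www\.)\S+$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def like_url(text):
    return bool(URL_RE.match(text))

def like_email(text):
    return bool(EMAIL_RE.match(text))

def like_num(text):
    return text.replace(",", "").replace(".", "").isdigit()

print(like_url("https://spacy.io"))    # True
print(like_email("user@example.com"))  # True
print(like_num("1,200"))               # True
```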

Named Entity Recognition (NER) tagger. nlp-task

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as
the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

The Greek language models support the following NER tags: ORG, PERSON, LOC, GPE, EVENT, PRODUCT. Having one of the Greek models, you can use the NER tagger.
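The models' statistical tagger is trained on the Prodigy-annotated dataset; as a stdlib-only illustration of the kind of output it produces, here is a toy gazetteer-based matcher over the same tag set (the entries are invented for the example, and a lookup table is of course far weaker than a learned model):

```python
# Toy gazetteer NER: look up known names and return (text, label) pairs,
# mimicking the (ent.text, ent.label_) output of a statistical tagger.
# The entries are invented; the real models learn entities from data.
GAZETTEER = {
    "Athens": "GPE",
    "Google": "ORG",
    "Alexander": "PERSON",
}

def tag_entities(tokens):
    return [(t, GAZETTEER[t]) for t in tokens if t in GAZETTEER]

print(tag_entities(["Alexander", "visited", "Google", "in", "Athens"]))
# [('Alexander', 'PERSON'), ('Google', 'ORG'), ('Athens', 'GPE')]
```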

Noun chunks. greek-spacy-support

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world's largest tech fund".

In the latest pull request, noun chunks for the Greek language are supported.
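A flat noun-chunk pass can be sketched over POS-tagged tokens: collect determiner/adjective runs that end in a noun head. This is a simplified version of the idea (the real implementation walks spaCy's dependency parse rather than raw POS tags):

```python
# Collect "base noun phrases": maximal runs of DET/ADJ tokens followed
# by a NOUN head, over a list of (word, pos) pairs. Simplified: the
# real noun-chunk iterator uses the dependency parse.
def noun_chunks(tagged):
    chunks, current = [], []
    for word, pos in tagged:
        if pos in ("DET", "ADJ"):
            current.append(word)
        elif pos == "NOUN":
            current.append(word)
            chunks.append(" ".join(current))
            current = []
        else:
            current = []
    return chunks

tagged = [("the", "DET"), ("lavish", "ADJ"), ("green", "ADJ"),
          ("grass", "NOUN"), ("grows", "VERB")]
print(noun_chunks(tagged))  # ['the lavish green grass']
```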

Text classification. nlp-task

This submodule is for text classification. It can categorize text into the following categories: Sports, Science, World News, Greek News, Environment, Politics,
Art and Health.
Currently available only for the Greek language.

Future Work

In this section, some suggestions for future work are listed. A difficulty label is assigned to each task, along with some guidelines to get started. There are also labels indicating whether each task refers to the improvement of Greek language support or
to the addition/improvement of a general NLP task. For more info on contributing, you can always have a look at the contribute page of the project
wiki.

Add more rules to the lemmatizer.
greek-spacy-support Difficulty: easy

The Greek language follows a rule-based lemmatization technique. It is highly suggested to have a look at the lemmatizer wiki page to understand more
about the approach followed. If you do, you will find out how scalable Greek lemmatization is. Adding rules should be as easy as completing some lines in this file. For more info, check the contribute wiki page.
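Rule-based lemmatization boils down to ordered (suffix, replacement) rules per part of speech: the first rule whose suffix matches rewrites the word. A minimal sketch with two simplified rules (the real Greek rule tables live in the file linked above and are considerably richer):

```python
# Rule-based lemmatizer sketch: try (suffix, replacement) rules for the
# token's POS in order, and return the first rewrite that applies.
# The two rules below are simplified examples, not the real rule tables.
RULES = {
    "NOUN": [("ες", "α"),    # e.g. γλώσσες -> γλώσσα
             ("οι", "ος")],  # e.g. άνθρωποι -> άνθρωπος
}

def lemmatize(word, pos):
    for suffix, repl in RULES.get(pos, []):
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word  # no rule matched: the word is its own lemma

print(lemmatize("γλώσσες", "NOUN"))   # γλώσσα
print(lemmatize("άνθρωποι", "NOUN"))  # άνθρωπος
```

Adding a rule is a one-line change to the table, which is what makes the approach scale.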

Overwrite the spaCy tokenizer.
greek-spacy-support Difficulty: hard

Each language modifies the spaCy tokenization procedure by adding tokenizer exceptions. The tokenizer-exceptions approach is not scalable for languages such as Greek, for pretty much the same reasons as with the lemmatizer. A new approach,
rule-based tokenization, is proposed. The suggested steps are the following:

Rewrite the spaCy tokenizer in pure Python, following the pseudocode provided here. This is already done; you can find the code here.
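In outline, that tokenizer loop splits on whitespace, then for each chunk consults an exceptions table and otherwise peels off prefix and suffix punctuation one character at a time. A compressed pure-Python sketch of the loop (the punctuation classes and the single exception entry are illustrative, not the real tables):

```python
import re

# Compressed sketch of the whitespace -> exceptions -> prefix/suffix
# tokenization loop. The exception entry and the punctuation character
# classes are illustrative; real tables are much larger.
EXCEPTIONS = {"don't": ["do", "n't"]}
PREFIXES = re.compile(r"^[\(\"']")
SUFFIXES = re.compile(r"[\)\"'.,:;!?]$")

def tokenize(text):
    tokens = []
    for chunk in text.split():
        suffixes = []
        while chunk:
            if chunk in EXCEPTIONS:          # special case wins outright
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif PREFIXES.match(chunk):      # peel one prefix character
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif SUFFIXES.search(chunk):     # peel one suffix character
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:                            # plain token remains
                tokens.append(chunk)
                chunk = ""
        tokens.extend(suffixes)
    return tokens

print(tokenize('(He said: "don\'t go!")'))
# ['(', 'He', 'said', ':', '"', 'do', "n't", 'go', '!', '"', ')']
```

Because all language-specific behaviour sits in the tables, supporting a new language means editing data, not the loop itself.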