Whiteboard with Yellow Stickers to Manage Consolidated and New Ideas Respectively

How have you set up your project, and what work has been completed so far?

Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.

As a further outreach point for research communities, we have submitted a full article to the Semantic Web Journal [6], among the top journals worldwide in the Information Systems field [7].
The whole process is known to be time-consuming: we have so far uploaded a first version [8], focusing on past efforts carried out with DBpedia. It has passed the first round of reviews.
We are currently working on a major revision that will include more details concerning StrepHit.

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc.) as a result of your project to date.

From a technical perspective, the project has so far delivered software and content.
Specifically, with respect to software, the following modules have reached a mature state:

Web Sources Corpus[9], i.e., a set of Web spiders that harvest data from the selected biographical authoritative sources [10];

Corpus Analysis[11], i.e., a set of scripts to process the corpus and to generate a ranking of the Candidate Relations;

Commons[12], i.e., several facilities to ensure a scalable and reusable codebase. On the general-purpose side, these include parallel processing, fine-grained logging, and caching. On the Natural Language Processing (NLP) side, special attention is paid to fostering future multilingual implementations, thanks to the modularity of the NLP components, such as tokenization [13], sentence splitting [14], and part-of-speech tagging [15].
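The modularity described above can be sketched as a pipeline that chains language-specific components behind a common interface, so that swapping in a tokenizer or sentence splitter for another language does not touch the rest of the code. This is an illustrative sketch, not StrepHit's actual API: the class names, the whitespace tokenizer, and the period-based splitter are all simplifying assumptions.

```python
class EnglishTokenizer:
    """Naive whitespace tokenizer; a real one would handle punctuation."""
    def tokenize(self, sentence):
        return sentence.split()


class SentenceSplitter:
    """Naive period-based splitter, for illustration only."""
    def split(self, text):
        return [s.strip() for s in text.split('.') if s.strip()]


class Pipeline:
    """Chains interchangeable, language-specific NLP components."""
    def __init__(self, splitter, tokenizer):
        self.splitter = splitter
        self.tokenizer = tokenizer

    def run(self, text):
        # Split the raw text into sentences, then tokenize each one
        return [self.tokenizer.tokenize(s) for s in self.splitter.split(text)]


pipeline = Pipeline(SentenceSplitter(), EnglishTokenizer())
print(pipeline.run("Ada was born in London. She wrote programs"))
# → [['Ada', 'was', 'born', 'in', 'London'], ['She', 'wrote', 'programs']]
```

Supporting a new language would then amount to providing alternative splitter and tokenizer implementations with the same interface.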

The following modules have been started and are in active development:

Extraction[16], i.e., the logic needed to extract different sets of sentences, to be used for training and testing the classifier, as well as for the actual production of Wikidata content;

Annotation[17], i.e., a set of scripts to interact with the CrowdFlower crowdsourcing platform APIs, in order to create and post annotation jobs and to pull results.

The ranking is composed of verbs discovered via the corpus analysis module.
Each of them will trigger a set of Wikidata properties, depending on the number of FEs (cf. the set above).
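The verb-to-property triggering described above can be sketched as a two-step lookup: each ranked verb carries the Frame Elements (FEs) observed in the corpus, and each FE suggests candidate Wikidata properties. The verbs, FE labels, and FE-to-property assignments below are made-up examples, not the project's actual inventory; the property IDs (P19, P569, P26) are real Wikidata identifiers used for illustration.

```python
# Hypothetical FE inventory per ranked verb
VERB_FES = {
    'born': ['Place', 'Time'],
    'married': ['Partner', 'Time'],
}

# Hypothetical FE-to-property suggestions
FE_TO_PROPERTIES = {
    'Place': ['P19'],    # place of birth
    'Time': ['P569'],    # date of birth
    'Partner': ['P26'],  # spouse
}


def triggered_properties(verb):
    """Collect the candidate Wikidata properties triggered by a verb's FEs."""
    properties = []
    for fe in VERB_FES.get(verb, []):
        properties.extend(FE_TO_PROPERTIES.get(fe, []))
    return properties


print(triggered_properties('born'))  # → ['P19', 'P569']
```

The number of candidate properties per verb thus grows with the number of FEs it carries, which is why the FE count matters for the mapping.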

Currently, a total of 173 distinct FEs has been extracted. The final number of Wikidata properties will depend on a mapping, planned as per Work Package T8.1.
We have already implemented a straightforward automatic mapping facility, based on string matching.
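A string-matching mapping of this kind can be sketched as follows: compare each FE label against the labels of candidate Wikidata properties and keep the most similar one above a similarity threshold. This is a minimal sketch under stated assumptions, not the project's actual facility: the property label inventory, the `difflib` similarity measure, and the 0.5 threshold are all illustrative choices.

```python
from difflib import SequenceMatcher

# Hypothetical inventory of Wikidata property labels
PROPERTY_LABELS = {
    'P19': 'place of birth',
    'P569': 'date of birth',
    'P106': 'occupation',
}


def best_match(fe_label, threshold=0.5):
    """Map an FE label to the most similar property label, if any."""
    scored = [
        (SequenceMatcher(None, fe_label.lower(), label).ratio(), pid)
        for pid, label in PROPERTY_LABELS.items()
    ]
    score, pid = max(scored)
    return pid if score >= threshold else None


print(best_match('Place_of_birth'))  # → 'P19'
```

Such a baseline is straightforward but brittle for labels that are semantically related yet lexically distant, which is one reason a curated mapping (as planned in Work Package T8.1) remains necessary.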

Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.

Then, answer the following question here: Have you spent your funds according to plan so far?

Yes.
Compared to the initial plan, the only variance (below 10% of the overall budget) is the NLP developer's starting date, which was expected to be at the beginning of the project (11th January 2016), but was actually 1st February 2016.
Consequently, the expense difference moves to the project leader budget item.
The Finances page has been updated accordingly: items are converted to USD and rounded to fit the total budget.

Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

The dissemination budget item may be lower than the planned one, due to the relatively low cost of the scheduled events.
We expect the variance to be negligible: we will eventually adjust the item and reallocate the surplus to the training set creation or project leader items, as needed.

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your findings in the form of a link to a learning pattern.

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.

Presentations and networking at the scheduled events are crucial for engaging data donors from third-party Open Data organizations;

during the development of StrepHit, the team contributes to external software via standard social coding practices. These are tremendous opportunities that may have a great impact: for instance, we have submitted a pull request to the popular Python documentation generator Sphinx, in order to support the MediaWiki syntax.

We are definitely considering a renewal of this IEG to extend StrepHit's capabilities to widespread languages other than English (the language of the current implementation).
The project leader has the linguistic and NLP skills needed to plan the implementation for Spanish, French, and Italian.