Description

Many tasks in NLP are computationally intensive, and there is no
"one size fits all" NLP approach to analysing text. Therefore, we wanted an NLP
infrastructure that can be configured and wired together as needed for the
specific use case, with several specialised modules that can build upon each
other, many of which are optional.

2. provide a unified data model for representing NLP text annotations

In many scenarios, it will be necessary to implement custom engines building on
the results of a previous "generic" analysis of the text (e.g. POS tagging and
chunking). For example, in one project we identify so-called "noun
phrases", use a lemmatizer to build the base form, then convert this to the
singular nominative form to get a grammatically correct label for use in a tag
cloud. Most of this builds on generic NLP functionality, but the last step is
very specific to the use case.

Therefore, we also wanted to implement a generic NLP data model that allows
representing text annotations attached to individual words or to spans of
words.
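
As an illustration of what such a data model could look like, here is a minimal Java sketch. It is not the Stanbol API: the names AnnotatedText, Span, addSpan, coveredText and enclosed are hypothetical and chosen only for this example. It assumes that annotations are simple key/value pairs attached to half-open character ranges, so that a single word and a multi-word chunk are represented in the same way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; class and method names are illustrative, not the Stanbol API.
final class AnnotatedText {

    /** A half-open character range [start, end) carrying key/value annotations. */
    static final class Span {
        final int start;
        final int end;
        final Map<String, String> annotations = new LinkedHashMap<>();

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    private final String text;
    private final List<Span> spans = new ArrayList<>();

    AnnotatedText(String text) {
        this.text = text;
    }

    /** Registers a span covering a single word or a run of words. */
    Span addSpan(int start, int end) {
        if (start < 0 || end > text.length() || start >= end) {
            throw new IllegalArgumentException("invalid span: " + start + "-" + end);
        }
        Span span = new Span(start, end);
        spans.add(span);
        return span;
    }

    /** Returns the text covered by the given span. */
    String coveredText(Span span) {
        return text.substring(span.start, span.end);
    }

    /** Returns all registered spans fully enclosed by the given span, excluding itself. */
    List<Span> enclosed(Span outer) {
        List<Span> result = new ArrayList<>();
        for (Span s : spans) {
            if (s != outer && s.start >= outer.start && s.end <= outer.end) {
                result.add(s);
            }
        }
        return result;
    }
}
```

With this sketch, a generic engine could register a chunk span for "the red houses" annotated as a noun phrase, and a token span for "houses" carrying "pos" and "lemma" annotations. A use-case-specific engine could then call enclosed(chunk) to read the token-level results and derive its label without re-analysing the text.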

Rupert Westenthaler
added a comment - 14/Dec/12 07:14 The remaining sub-tasks were converted into separate issues (all of type "new feature"). All core functionalities are resolved. Therefore this issue can be resolved as well.
Documentation is available at
http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/

Rupert Westenthaler
added a comment - 20/Sep/12 04:43 Status update:
The patch provided by Sebastian Schaffert was applied with revision 1387488 [1]. I also added the data files used by the contributed Engines to {stanbol}/data. The German noun phrase chunker was added to the o.a.s.data.opennlp.lang.de module. For the sentiment-related data files, new modules and a sentiment bundlelist were created. I also added a special launcher (nlp-launcher) intended for testing developments in the nlp-processing branch.
In a second commit [2] I slightly changed the default configuration of the Engines so that they can use ConfigurationPolicy.OPTIONAL - meaning that an instance of those Engines is active by default. A "nlp-processing" chain configuration was also added to the default launcher.
The nlp-processing branch is now in a state where early adopters can start testing it. I will continue to work on the adaptation of the CELI Lemmatizer Engine (STANBOL-739) and the usage of the nlp-processing results by the KeywordLinkingEngine (STANBOL-740).
[1] http://svn.apache.org/viewvc?rev=1387488&view=rev
[2] http://svn.apache.org/viewvc?rev=1387596&view=rev

Sebastian Schaffert
added a comment - 17/Sep/12 14:37 A patch containing NLP enhancement engines for Apache Stanbol addressing the goals mentioned in the issue. The patch excludes all data files; they can be found at https://www.dropbox.com/home/Public/stanbol