From general-return-27109-apmail-incubator-general-archive=incubator.apache.org@incubator.apache.org Fri Nov 19 09:48:42 2010
Mailing-List: contact general-help@incubator.apache.org; run by ezmlm
Reply-To: general@incubator.apache.org
Message-ID: <4CE647F7.7000603@gmail.com>
Date: Fri, 19 Nov 2010 10:48:39 +0100
From: Jörn Kottmann
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.12) Gecko/20101027 Thunderbird/3.1.6
MIME-Version: 1.0
To: general@incubator.apache.org
Subject: [VOTE] Accept OpenNLP for incubation
Content-Type: multipart/alternative;
boundary="------------070303050202090803040109"
--------------070303050202090803040109
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: quoted-printable
Hi,
let's vote on the acceptance of the OpenNLP project for incubation
at the Apache Incubator.
The proposal is on the wiki
http://wiki.apache.org/incubator/OpenNLPProposal
and a copy is included below.
The discussion thread can be found here:
http://mail-archives.apache.org/mod_mbox/incubator-general/201011.mbox/%3C4CE4F1F4.3010909@gmail.com%3E
Please cast your votes:
[ ] +1 Accept OpenNLP for incubation
[ ] +0 Don't care
[ ] -1 Reject for the following reason:
The vote is open for at least 72 hours.
Thanks!
Jörn
= OpenNLP Proposal =
The following is a proposal for a new top-level project within the ASF.
== Abstract ==
OpenNLP is a Java machine learning toolkit for natural language processing (NLP).
== Proposal ==
OpenNLP is a machine-learning-based toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.
The goal of the OpenNLP project will be to create a mature toolkit for the above-mentioned tasks. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.
== Background ==
OpenNLP was started in 2000 by Jason Baldridge and Gann Bierner while they were graduate students in the Division of Informatics at the University of Edinburgh. OpenNLP, broadly speaking, was meant to be a high-level organizational unit for various open source software packages for natural language processing; more practically, it provided a high-level package name for various Java packages of the form opennlp.*. The first OpenNLP software package was the Grok natural language parsing toolkit, which was also the genesis of what is now called the OpenNLP Toolkit. The software released on the OpenNLP SourceForge site (started in 2000, along with Grok) was simply a set of interfaces defined in the package opennlp.common and referred to as the OpenNLP Java API. The actual implementations of natural language processing components were provided in Grok, along with code for sentence parsing with Combinatory Categorial Grammar. This code was used heavily in both Baldridge's and Bierner's dissertations. The first paper that used Grok, and especially the components that would become the OpenNLP Toolkit, is [[http://comp.ling.utexas.edu/jbaldrid/papers/hockenmaier_etal_ESSLLI2000.pdf|Hockenmaier, Bierner and Baldridge (2000)]] (later updated as the journal article [[http://comp.ling.utexas.edu/jbaldrid/papers/HockenmaierEtal2004.pdf|Hockenmaier, Bierner, and Baldridge (2004)]]).
In 2003, it was decided to remove the NLP infrastructure from Grok, as there was a clear separation between the basic text processing components and the syntactic and semantic analysis components. At the same time, Grok was rebranded as OpenCCG (openccg.sf.net). The final release of the OpenNLP Java API was made in March 2003; the new OpenNLP Toolkit was created from the API and the Grok text processing components, with version 1.0 being released in April 2004. The OpenNLP Toolkit and OpenCCG have evolved independently since then and have mostly independent and active developer and user communities. OpenCCG is primarily used in the academic community, while OpenNLP has considerable use in both academia and industry. As an indication of the academic impact of OpenNLP, a search on Google Scholar (done in March 2010) returned about 650 publications citing the package. Some of these results are the OpenNLP website itself, a few non-publications, and some self-citations. Based on a scan of these results, we estimate that about 500 actual publications have used OpenNLP in their work, and there are an additional 50 or so quasi-publications such as surveys and instruction manuals.
The activity level of the OpenNLP project has fluctuated over the past 10+ years, with a large uptick in the last two years especially. Most recently, due both to the availability of new documentation and the release of version 1.5, there have been many more downloads and page views for the OpenNLP project. In fact, September 2010 had the most downloads (1,561) and project web hits (226,391) of any month since the project's beginning in 2000, and October is keeping pace with that figure so far. As a result, OpenNLP has gone from being ranked between 2000th and 4000th (between January and May 2010) to being ranked 570, 314, 181 and 439 for July, August, September, and October respectively. Full details are available on the SourceForge statistics page for OpenNLP. (There are 240,000 projects hosted on SourceForge, though this figure includes many, many projects that never actually get started; based on a review done in 2007, it seems that about 7-10% of these are stable, active projects.)
== Rationale ==
OpenNLP fills a significant gap at the ASF with regard to human language processing tools. While Lucene/Solr, UIMA and Mahout all have some tools in this area, none of them is focused solely on working with natural language, as OpenNLP is.
== Initial Goals ==
The initial goals of the proposed project are:
* Bring the community together at the ASF and make the development process transparent for them
* Write user documentation about all major components
* Set up an automated build that includes training and evaluation regression tests
* Produce an Incubating release
== Current Status ==
=== Meritocracy ===
Some of the initial committers are familiar with Apache's idea of meritocracy; others are not. We will get everybody on the same level as part of the incubation process.
=== Community ===
OpenNLP already has a considerable user base, both in industry and academia.
=== Core Developers ===
See the initial committer list.
=== Alignment ===
OpenNLP has tie-ins with several existing Apache projects. We have been distributing wrappers for UIMA for some time now (two UIMA committers also contribute to OpenNLP). We expect this collaboration to strengthen further after our move to Apache.
Another obvious connection exists to some of the projects under the Lucene umbrella. On the one hand, projects like Solr may benefit from the OpenNLP analysis capabilities to create specialized search for particular domains. On the other, OpenNLP may benefit from the machine learning code that is being developed in Mahout, and maybe get some people from that community to lend a hand.
== Known Risks ==
=== Orphaned products ===
The project has been around for quite a number of years already; it has a well-established user community and a diverse set of committers.
=== Inexperience with Open Source ===
OpenNLP has been an open source project for quite some time. Many of the developers are already familiar with both open source in general and the ASF in particular.
=== Homogenous Developers ===
The current group of developers is very diverse; no two developers work for the same organization.
=== Reliance on Salaried Developers ===
Most of the developers are not paid to work on OpenNLP, so there is little reliance on salaried developers.
=== Relationships with Other Apache Products ===
NLP is often used in search and other algorithms that work with unstructured data, thus OpenNLP is likely to be useful to the Lucene and Solr communities. It also aligns nicely with both Mahout and UIMA.
=== An Excessive Fascination with the Apache Brand ===
We think the project aligns nicely with the goals of the ASF to disseminate source code to the public free of charge. NLP has long been the subject of cutting-edge research, but is often lacking in community and shared knowledge. We believe that by bringing OpenNLP to the ASF, the Apache brand will help deliver NLP capabilities to a much larger audience and, likewise, a cutting-edge project like OpenNLP can further the ASF brand by providing users with tried and true, as well as new, natural language processing capabilities.
== Documentation ==
* http://opennlp.sourceforge.net/README.html
* http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
== Initial Source ==
The source code is maintained in two CVS repositories on SourceForge.
OpenNLP Maxent: http://maxent.cvs.sourceforge.net/viewvc/maxent/
OpenNLP Tools and OpenNLP UIMA: http://opennlp.cvs.sourceforge.net/viewvc/opennlp/
== Source and Intellectual Property Submission Plan ==
The OpenNLP source code is already open source under the Apache License 2.0 (AL 2.0).
== External Dependencies ==
||'''Library''' ||||