Friday, August 26, 2016

[ Java Package ] CoreNLP - Simple CoreNLP API

Source From Here

Simple CoreNLP
In addition to the fully-featured annotator pipeline interface to CoreNLP, Stanford provides a simple API for users who do not need a lot of customization. The intended audience of this package is users of CoreNLP who want “import nlp” to work as fast and easily as possible, and do not care about the details of the behaviors of the algorithms. An example usage is given below:

The API is included in the CoreNLP release from 3.6.0 onwards. Visit the download page to download CoreNLP; make sure to include both the code jar and the models jar in your classpath!

Advantages and Disadvantages
This interface offers a number of advantages (and a few disadvantages – see below) over the default annotator pipeline:

* Intuitive Syntax: Conceptually, documents and sentences are stored as objects, and have functions corresponding to the annotations you would like to retrieve from them.
* Lazy Computation: Annotations are run as needed, only when requested. This allows you to “change your mind” later in a program and request new annotations.
* No NullPointerExceptions: Lazy computation allows us to ensure that no function will ever return null. Items which may not exist are wrapped inside an Optional to clearly mark that they may be empty.
* Fast, Robust Serialization: All objects are backed by protocol buffers, meaning that serialization and deserialization are both very easy and very fast. In addition to being easily readable from other languages, our experiments show this to be over an order of magnitude faster than the default Java serialization.
* Maintains Thread Safety: Like the CoreNLP pipeline, this wrapper is threadsafe.
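The lazy-computation and Optional ideas from the list above can be illustrated with a small stand-alone sketch. This is not CoreNLP code: `Lazy`, `LazySentence` and `wordAt` are hypothetical names made up for the illustration, and whitespace splitting stands in for a real annotator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

public class LazyAnnotationSketch {

    /** Computes a value on first access and caches it afterwards. */
    static final class Lazy<T> {
        private final Supplier<T> compute;
        private T value;
        private boolean done;
        int runs = 0; // how many times the supplier actually ran

        Lazy(Supplier<T> compute) { this.compute = compute; }

        synchronized T get() {
            if (!done) {
                value = compute.get();
                done = true;
                runs++;
            }
            return value;
        }
    }

    /** Hypothetical sentence wrapper; NOT the CoreNLP Sentence class. */
    static final class LazySentence {
        final Lazy<List<String>> wordsCache;

        LazySentence(String text) {
            // Nothing is tokenized here; the work is deferred until words() is called.
            this.wordsCache = new Lazy<>(() -> Arrays.asList(text.trim().split("\\s+")));
        }

        List<String> words() { return wordsCache.get(); }

        /** Returns the word at index i, or an empty Optional instead of null. */
        Optional<String> wordAt(int i) {
            List<String> w = words();
            return (i >= 0 && i < w.size()) ? Optional.of(w.get(i)) : Optional.empty();
        }
    }

    public static void main(String[] args) {
        LazySentence s = new LazySentence("John sees Bill");
        System.out.println(s.wordAt(1).orElse("<none>")); // prints sees
        System.out.println(s.wordAt(9).orElse("<none>")); // prints <none>
    }
}
```

The caller never has to check for null: a missing item shows up as an empty Optional, and the tokenization runs at most once no matter how many times it is requested.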

In exchange for these advantages, users should be aware of a few disadvantages:

* Less Customizability: Although the ability to pass properties to annotators is supported, it is significantly more clunky than the annotation pipeline interface, and is generally discouraged.
* Possible Nondeterminism: There is no guarantee that the same algorithm will be used to compute the requested function on each invocation. For example, if a dependency parse is requested, followed by a constituency parse, we will compute the dependency parse with the Neural Dependency Parser, and then use the Stanford Parser for the constituency parse. If, however, you request the constituency parse before the dependency parse, we will use the Stanford Parser for both.
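The order dependence described in the second point can be modeled with a toy class. This is only a sketch of the behavior as described above, not of how CoreNLP is implemented; `ParserChoiceSketch` and its methods are hypothetical, and each method simply returns the name of the algorithm that would have been used.

```java
public class ParserChoiceSketch {
    private String constituencyBy; // null until a constituency parse was requested
    private String dependencyBy;   // null until a dependency parse was requested

    /** Returns the name of the algorithm used for the constituency parse. */
    String constituencyParse() {
        if (constituencyBy == null) {
            constituencyBy = "Stanford Parser";
        }
        return constituencyBy;
    }

    /** Returns the name of the algorithm used for the dependency parse. */
    String dependencyParse() {
        if (dependencyBy == null) {
            // If a constituency parse already exists, reuse its parser;
            // otherwise fall back to the dedicated dependency parser.
            dependencyBy = (constituencyBy != null)
                    ? "Stanford Parser"
                    : "Neural Dependency Parser";
        }
        return dependencyBy;
    }

    public static void main(String[] args) {
        ParserChoiceSketch depFirst = new ParserChoiceSketch();
        System.out.println(depFirst.dependencyParse());   // prints Neural Dependency Parser
        System.out.println(depFirst.constituencyParse()); // prints Stanford Parser

        ParserChoiceSketch constFirst = new ParserChoiceSketch();
        System.out.println(constFirst.constituencyParse()); // prints Stanford Parser
        System.out.println(constFirst.dependencyParse());   // prints Stanford Parser
    }
}
```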

Usage
There are two main classes in the interface: Document and Sentence. Tokens are represented as array elements in a sentence; e.g., to get the lemma of a token, get the lemmas array from the sentence and index it at the appropriate index. A constructor is provided for both the Document and Sentence class. For the former, the text is treated as an entire document containing potentially multiple sentences. For the latter, the text is forced to be interpreted as a single sentence. An example program using the interface is given below:
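To make the "tokens are array elements" idea concrete without needing the CoreNLP jars, here is a self-contained stand-in. `MiniDocument`, `MiniSentence` and the tiny lemma table are assumptions made up for this sketch; they only mimic the shape of the Document/Sentence split described above and do not reproduce the real tokenizer, sentence splitter or lemmatizer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenArraySketch {

    // Toy lemma table (illustration only); unknown words are their own lemma.
    private static final Map<String, String> TOY_LEMMAS = new HashMap<>();
    static {
        TOY_LEMMAS.put("sees", "see");
        TOY_LEMMAS.put("runs", "run");
    }

    /** The whole text is forced to be one sentence, whatever punctuation it contains. */
    static final class MiniSentence {
        private final List<String> words = new ArrayList<>();
        private final List<String> lemmas = new ArrayList<>();

        MiniSentence(String text) {
            for (String tok : text.replaceAll("[.!?]", " ").trim().split("\\s+")) {
                if (tok.isEmpty()) continue;
                words.add(tok);
                lemmas.add(TOY_LEMMAS.getOrDefault(tok, tok));
            }
        }

        List<String> words()  { return words; }
        List<String> lemmas() { return lemmas; }
        String lemma(int i)   { return lemmas.get(i); } // same index as the token
    }

    /** The text is treated as a document and split into sentences. */
    static final class MiniDocument {
        private final List<MiniSentence> sentences = new ArrayList<>();

        MiniDocument(String text) {
            // Naive split: break after '.', '!' or '?' followed by whitespace.
            for (String s : text.trim().split("(?<=[.!?])\\s+")) {
                if (!s.isEmpty()) sentences.add(new MiniSentence(s));
            }
        }

        List<MiniSentence> sentences() { return sentences; }
    }

    public static void main(String[] args) {
        MiniDocument doc = new MiniDocument("John sees Bill. Mary runs.");
        for (MiniSentence sent : doc.sentences()) {
            System.out.println(sent.words() + " -> lemma of token 1: " + sent.lemma(1));
        }
        // The same text as a single sentence yields one flat token list.
        System.out.println(new MiniSentence("John sees Bill. Mary runs.").words().size()); // prints 5
    }
}
```

The lemma of a token is found by taking the lemmas list and indexing it with the token's position, which is the access pattern the paragraph above describes.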

Supported Annotators
The interface is not guaranteed to support all of the annotators in the CoreNLP pipeline. However, most common annotators are supported. A list of these, and their invocation, is given below. Functionality is the plain-English description of the task to be performed. The second column lists the analogous CoreNLP annotator for that task. The implementing class and function describe the class and function used in this wrapper to perform the same tasks.

Patches for incorporating additional annotators are of course always welcome!

Miscellaneous Extras
Some potentially useful utility functions are implemented in the SentenceAlgorithms class. These can be called from a Sentence object with, e.g.:

* headOfSpan(Span): Finds the index of the head word of the given span. So, for example, for United States president Barack Obama it would return the index of Obama.
* dependencyPathBetween(int, int): Returns the dependency path between the words at the given two indices. This is returned as a list of String objects, meant primarily as an input to a featurizer.
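What these two utilities compute can be sketched over a plain array of governor indices. The code below is a simplified stand-in, not the SentenceAlgorithms implementation: the `heads` arrays are hand-written assumptions about the parses, the span is taken as a half-open [start, end) range, and the path is returned as bare words without the edge labels a real featurizer input would likely carry.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanAlgorithmsSketch {

    /** Head of [start, end): the first token whose governor lies outside the span (-1 = root). */
    static int headOfSpan(int[] heads, int start, int end) {
        for (int i = start; i < end; i++) {
            int h = heads[i];
            if (h < start || h >= end) return i;
        }
        return -1; // cannot happen for a non-empty span of a well-formed tree
    }

    /** Token indices from i up to the root, inclusive. */
    static List<Integer> pathToRoot(int[] heads, int i) {
        List<Integer> path = new ArrayList<>();
        for (int cur = i; cur != -1; cur = heads[cur]) path.add(cur);
        return path;
    }

    /** Words on the path a -> lowest common ancestor -> b. */
    static List<String> dependencyPathBetween(String[] words, int[] heads, int a, int b) {
        List<Integer> up = pathToRoot(heads, a);
        List<Integer> down = pathToRoot(heads, b);
        List<String> path = new ArrayList<>();
        for (int node : up) {
            path.add(words[node]);
            int j = down.indexOf(node);
            if (j >= 0) { // reached the common ancestor: walk down towards b
                for (int m = j - 1; m >= 0; m--) path.add(words[down.get(m)]);
                return path;
            }
        }
        return path;
    }

    public static void main(String[] args) {
        // Assumed parse: United -> States -> president -> Obama <- Barack, Obama is the root.
        int[] obamaHeads = {1, 2, 4, 4, -1};
        System.out.println(headOfSpan(obamaHeads, 0, 5)); // prints 4 (the index of Obama)

        // Assumed parse: "sees" is the root and governs both John and Bill.
        String[] words = {"John", "sees", "Bill"};
        int[] heads = {1, -1, 1};
        System.out.println(dependencyPathBetween(words, heads, 0, 2)); // prints [John, sees, Bill]
    }
}
```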

A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence "John sees Bill", a constituency parse would be:
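As a rough illustration, the tree for "John sees Bill" can be built by hand and printed in bracketed form. The `Node` class and the simplified labels S, NP, VP and V are assumptions for this sketch; an actual parser would emit its own tag set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConstituencySketch {

    /** A tree node: phrase types are inner nodes, words are leaves, edges carry no label. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() { return children.isEmpty(); }

        /** The words of the sentence, read off the leaves from left to right. */
        List<String> leaves() {
            List<String> out = new ArrayList<>();
            if (isLeaf()) {
                out.add(label);
            } else {
                for (Node c : children) out.addAll(c.leaves());
            }
            return out;
        }

        @Override
        public String toString() {
            if (isLeaf()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    /** Hand-built parse of "John sees Bill". */
    static Node johnSeesBill() {
        return new Node("S",
                new Node("NP", new Node("John")),
                new Node("VP",
                        new Node("V", new Node("sees")),
                        new Node("NP", new Node("Bill"))));
    }

    public static void main(String[] args) {
        Node tree = johnSeesBill();
        System.out.println(tree);          // prints (S (NP John) (VP (V sees) (NP Bill)))
        System.out.println(tree.leaves()); // prints [John, sees, Bill]
    }
}
```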

A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that are dependent on the parent, and edges are labeled by the relationship. A dependency parse of "John sees Bill" would be:
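In code form, such a parse is just a set of labeled edges between words. The sketch below hand-builds one for "John sees Bill"; the `Edge` class is hypothetical, and the relation names `nsubj` and `dobj` are assumed labels for subject and object, which may differ from what a given parser version produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DependencySketch {

    /** A labeled edge from a governor word to one of its dependents. */
    static final class Edge {
        final String governor;
        final String relation;
        final String dependent;

        Edge(String governor, String relation, String dependent) {
            this.governor = governor;
            this.relation = relation;
            this.dependent = dependent;
        }

        @Override
        public String toString() {
            return relation + "(" + governor + ", " + dependent + ")";
        }
    }

    /** Hand-built parse: "sees" is the root, John is its subject, Bill its object. */
    static List<Edge> johnSeesBill() {
        return Arrays.asList(
                new Edge("sees", "nsubj", "John"),
                new Edge("sees", "dobj", "Bill"));
    }

    /** All words that depend directly on the given governor. */
    static List<String> dependentsOf(List<Edge> parse, String governor) {
        List<String> out = new ArrayList<>();
        for (Edge e : parse) {
            if (e.governor.equals(governor)) out.add(e.dependent);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Edge> parse = johnSeesBill();
        System.out.println(parse);                       // prints [nsubj(sees, John), dobj(sees, Bill)]
        System.out.println(dependentsOf(parse, "sees")); // prints [John, Bill]
    }
}
```

Unlike the constituency tree, there are no phrase nodes here: every vertex is a word, and the information sits on the edge labels.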