Dependency parsing is a lightweight syntactic formalism that relies on lexical relationships between words. Nonprojective dependency grammars may generate languages that are not context-free, offering a formalism that is arguably more adequate for some natural languages. Statistical parsers, learned from treebanks, have achieved the best performance on this task. While only local (arc-factored) models allow for exact inference, including non-local features and performing approximate inference has been shown to greatly increase performance.
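To make the term "arc-factored" concrete: in such a model the score of a dependency tree is the sum of independent scores, one per head-modifier arc. The sketch below is illustrative only and is not taken from this package; the function name ArcFactoredScore, the artificial root at position 0, and the dense score matrix are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of this package.
// heads[m] is the index of the head of token m. Token 0 is an artificial
// root, so heads[0] is ignored. arc_scores[h][m] is the score of arc h -> m.
double ArcFactoredScore(const std::vector<int>& heads,
                        const std::vector<std::vector<double>>& arc_scores) {
  assert(heads.size() == arc_scores.size());
  double total = 0.0;
  // The tree score decomposes into one term per arc, with no interaction
  // between arcs; this decomposition is what "arc-factored" refers to.
  for (std::size_t m = 1; m < heads.size(); ++m) {
    total += arc_scores[heads[m]][m];
  }
  return total;
}
```

Higher-order models add terms that look at pairs of arcs (for example siblings or grandparents), so their score no longer decomposes in this way.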

This package contains a C++ implementation of a dependency parser based on the papers [1,2,3,4] below.

This package allows:

learning a parser/tagger from a treebank,

running a parser/tagger on new data,

evaluating the results against a gold standard.
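As an illustration of the evaluation step, dependency parsers are commonly scored by unlabeled and labeled attachment score (UAS and LAS): the fraction of tokens with the correct head, and with the correct head and label. The sketch below assumes these metrics; the types Arc and AttachmentScores and the function Evaluate are hypothetical names, and the evaluation procedure shipped with this package may differ (for instance in how punctuation is treated).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; not part of this package.
struct Arc {
  int head;           // index of the head token (0 for the root)
  std::string label;  // dependency label
};

struct AttachmentScores {
  double uas;  // fraction of tokens with the correct head
  double las;  // fraction of tokens with the correct head and label
};

// Compares predicted arcs against gold arcs, token by token.
AttachmentScores Evaluate(const std::vector<Arc>& gold,
                          const std::vector<Arc>& predicted) {
  assert(gold.size() == predicted.size());
  if (gold.empty()) return {0.0, 0.0};
  std::size_t correct_heads = 0;
  std::size_t correct_labeled = 0;
  for (std::size_t i = 0; i < gold.size(); ++i) {
    if (gold[i].head == predicted[i].head) {
      ++correct_heads;
      if (gold[i].label == predicted[i].label) ++correct_labeled;
    }
  }
  const double n = static_cast<double>(gold.size());
  return {correct_heads / n, correct_labeled / n};
}
```

In this sketch a label only counts as correct when the head is also correct, which is the usual definition of LAS.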

Changes to previous version:

This version introduces a number of new features:

The parser no longer depends on CPLEX (or any other non-free LP solver). Instead, the decoder is now based on AD3, our free library for approximate MAP inference.

The parser now outputs dependency labels along with the backbone structure.

As a bonus, we now provide a trainable part-of-speech tagger, called TurboTagger, which can be used in standalone mode, or to provide part-of-speech tags as input for the parser. TurboTagger has state-of-the-art accuracy for English (97.3% on section 23 of the Penn Treebank) and is fast (~40,000 tokens per second).

The parser is much faster than in previous versions. You may choose among a basic arc-factored parser (~4,300 tokens per second), a standard second-order model with consecutive sibling and grandparent features (the default; ~1,200 tokens per second), and a full model with head bigram and arbitrary sibling features (~900 tokens per second).

Note: The runtimes above are approximate, and are based on experiments on a desktop machine with an Intel Core i7 CPU at 3.4 GHz and 8 GB of RAM.

To run this software, you need a standard C++ compiler. This software has the following external dependencies: AD3, a library for approximate MAP inference; Eigen, a template library for linear algebra; google-glog, a library for logging; gflags, a library for command-line flag processing. All these libraries are free software and are provided as tarballs in this package.

This software has been tested on Linux, but it should run on other platforms with minor adaptations.
