Optimising the speed and accuracy of a Statistical GLR Parser

This technical report is based on a dissertation submitted September
2007 by the author for the degree of Doctor of Philosophy to the
University of Cambridge, Darwin College.

Abstract

This thesis develops techniques that optimise both the speed and
accuracy of a unification-based statistical GLR parser. The methods
are, however, applicable within a broad range of parsing
frameworks. We first aim to optimise the level of tag ambiguity resolved
during parsing, given that we employ a front-end PoS tagger. This work
provides the first broad comparison of tag models, considering both
tagging and parsing performance. A dynamic model achieves the best
accuracy and provides a means to overcome the trade-off between tag
error rates in single-tag-per-word input and the increase in parse
ambiguity over multiple-tag-per-word input. The second line of research
describes a novel modification to the inside-outside algorithm, whereby
multiple inside and outside probabilities are assigned for elements
within the packed parse forest data structure. This algorithm enables us
to compute a set of ‘weighted GRs’ directly from this structure. Our
experiments demonstrate substantial increases in parser accuracy and
throughput for weighted GR output.
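
As a rough illustration of what a set of weighted GRs represents, the sketch below assumes that a GR's weight is the total probability of the parses containing it, normalised by the total probability of all parses. It enumerates parses explicitly, which is the naive counterpart of the computation; the thesis instead derives the weights directly from the packed parse forest using the modified inside-outside algorithm. The function name, the tuple encoding of GRs and the example probabilities are all hypothetical.

```python
from collections import defaultdict


def weighted_grs(parses):
    """Return a dict mapping each GR to its normalised weight in [0, 1].

    `parses` is a list of (probability, set_of_GRs) pairs, one per parse.
    This explicit enumeration is an illustrative assumption; it does not
    operate on a packed parse forest.
    """
    total = sum(prob for prob, _ in parses)
    if total <= 0:
        return {}
    weights = defaultdict(float)
    for prob, grs in parses:
        for gr in grs:
            # A GR accumulates the probability of every parse it occurs in.
            weights[gr] += prob
    return {gr: w / total for gr, w in weights.items()}


# Two hypothetical parses of "I saw the man with the telescope",
# differing only in where the prepositional phrase attaches.
example_parses = [
    (0.75, {("ncsubj", "saw", "I"), ("dobj", "saw", "man"),
            ("ncmod", "man", "telescope")}),
    (0.25, {("ncsubj", "saw", "I"), ("dobj", "saw", "man"),
            ("ncmod", "saw", "telescope")}),
]
example_weights = weighted_grs(example_parses)
```

GRs shared by every parse receive weight 1, while GRs on which the parses disagree receive a fractional weight, so a threshold on the weight can be used to trade recall for precision.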

Finally, we describe a novel confidence-based training framework that
can, in principle, be applied to any statistical parser whose output is
defined in terms of its consistency with a given level and type of
annotation. We demonstrate that a semi-supervised variant of this
framework outperforms both Expectation-Maximisation (when both are
constrained by unlabelled partial-bracketing) and the extant (fully
supervised) method. These novel training methods utilise data
automatically extracted from existing corpora. Consequently, they
require no manual effort on the part of the grammar writer, which
facilitates grammar development.