Our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form–function relationship in language, our “composed” word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages.

It is manifestly clear that similarity in form is neither a necessary nor a sufficient condition for similarity in function: small orthographic differences may correspond to large semantic or syntactic differences (butter vs. batter), and large orthographic differences may obscure nearly perfect functional correspondence (rich vs. affluent). Thus, any orthographically aware model must be able to capture non-compositional effects in addition to more regular effects due to, e.g., morphological processes. To model this complex form–function relationship, we turn to long short-term memories (LSTMs), which are designed to capture complex non-linear and non-local dynamics in sequences.

Our character-based model is able to generate similar representations for words that are semantically and syntactically similar, even for words that are orthographically distant (e.g., October and January).
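
To make the idea concrete, here is a minimal sketch of composing a word vector from character embeddings with a bidirectional LSTM, written with PyTorch purely for illustration. The layer sizes, the final projection, and the use of PyTorch are my assumptions, not details taken from the text above.

    import torch
    import torch.nn as nn

    class CharToWord(nn.Module):
        """Compose a word representation from its characters (illustrative only)."""
        def __init__(self, n_chars, char_dim=50, word_dim=150):
            super().__init__()
            # Only one vector per character type is stored.
            self.embed = nn.Embedding(n_chars, char_dim)
            self.lstm = nn.LSTM(char_dim, word_dim, bidirectional=True, batch_first=True)
            # Combine the final forward and backward states into one word vector.
            self.proj = nn.Linear(2 * word_dim, word_dim)

        def forward(self, char_ids):                # char_ids: (batch, word_length)
            chars = self.embed(char_ids)            # (batch, word_length, char_dim)
            _, (h_n, _) = self.lstm(chars)          # h_n: (2, batch, word_dim)
            h_fwd, h_bwd = h_n[0], h_n[1]           # last forward / backward states
            return self.proj(torch.cat([h_fwd, h_bwd], dim=-1))

    # The same fixed parameters produce a vector for any character sequence,
    # so no per-word lookup table is needed.
    model = CharToWord(n_chars=100)
    word = torch.tensor([[5, 12, 7, 33]])  # hypothetical character ids for one word
    vec = model(word)                      # shape: (1, 150)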

The goal of our work is not to surpass existing benchmarks, but to show that much of the feature engineering behind them can be learned automatically from task-specific data. More importantly, we wish to show that high-dimensional word lookup tables can be compacted into a character lookup table plus a compositional model, allowing the model to scale better with the size of the training data. This is a desirable property, as data becomes more abundant in many NLP tasks.

New Classifiers

The SklearnClassifier provides a general interface to text classification with scikit-learn. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular machine learning toolkits, and provides more advanced feature extraction methods for classification.
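
As a rough illustration of how the wrapper is used (the choice of LinearSVC and the toy feature sets below are mine, not part of NLTK's documentation):

    from nltk.classify.scikitlearn import SklearnClassifier
    from sklearn.svm import LinearSVC

    # Each training instance is a (feature dict, label) pair, just like any
    # other NLTK classifier.
    train_feats = [
        ({'contains(great)': True}, 'pos'),
        ({'contains(awful)': True}, 'neg'),
    ]
    classifier = SklearnClassifier(LinearSVC()).train(train_feats)
    print(classifier.classify({'contains(great)': True}))  # -> 'pos'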

GitHub

NLTK has moved development and hosting to GitHub, replacing Google Code and SVN. The primary motivation is to make new development easier, and already a Python 3 branch is under active development. I think this is great, since GitHub makes forking & pull requests quite easy, and it’s become the de facto “social coding” site.

Sphinx

Coinciding with the GitHub move, the documentation was updated to use Sphinx, the same documentation generator used by Python and many other projects. While I personally like Sphinx and reStructuredText (which I used to write this post), I’m not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you’re looking for, I worry that new/interested users will have a harder time getting started.

New Corpora

Since the 0.9.9 release, a number of new corpora and corpus readers have been added:

The Future

I think NLTK’s ideal role is to be a standard interface between corpora and NLP algorithms. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.

If you want to analyze the coverage of your own pickled tagger, use --tagger PATH/TO/TAGGER.pickle. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag by using the --metrics argument with a corpus that provides a tagged_sents method, like treebank:
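
For example, to get per-tag metrics for the default tagger on treebank:

    python analyze_tagger_coverage.py treebank --metrics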

These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question “for each word that was given this tag, was it correct?”, while Recall answers the question “for all words that should have gotten this tag, did they get it?”. If you look at PDT, you can see that Precision is 100% but Recall is only 66.67%: every word that was given the PDT tag was correct, but 6 of the 27 words that should have gotten PDT were mistakenly given a different tag. Or if you look at JJS, Precision is 96.67% because JJS was given to 3 words that should have gotten a different tag, while Recall is 100% because every word that should have gotten JJS got it.
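
In case it helps to see where these numbers come from, here is a rough sketch of computing per-tag precision and recall with nltk.metrics by collecting (sentence, word) positions for each tag. This is my reconstruction of the general approach, not the script's actual code, and it uses nltk.pos_tag as the tagger being measured.

    import nltk
    from nltk.corpus import treebank
    from nltk.metrics import precision, recall

    refsets, testsets = {}, {}
    for i, sent in enumerate(treebank.tagged_sents()):
        guessed = nltk.pos_tag([word for word, tag in sent])
        for j, ((word, ref_tag), (_w, test_tag)) in enumerate(zip(sent, guessed)):
            refsets.setdefault(ref_tag, set()).add((i, j))
            testsets.setdefault(test_tag, set()).add((i, j))

    for tag in sorted(set(refsets) | set(testsets)):
        ref, test = refsets.get(tag, set()), testsets.get(tag, set())
        # precision() is None when a tag is never predicted, and recall() is None
        # when a tag never occurs in the reference -- hence the "None" cells below.
        print(tag, precision(ref, test), recall(ref, test))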

You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).

Training Affix Taggers

The --sequential argument also recognizes the letter a, which will insert an AffixTagger into the backoff chain. If you do not specify the --affix argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more --affix N options, where N should be a positive number for prefixes and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3-character suffix and another that uses a 2-character prefix, specify the --affix argument twice:
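
Something like the following (the train_tagger.py script name is assumed from nltk-trainer; the flags are the ones described above):

    # -3 means a 3-character suffix, 2 means a 2-character prefix
    python train_tagger.py treebank --sequential aubt --affix -3 --affix 2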

The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the --max_rules and --min_score arguments. You can also change the rule template bounds, which defaults to 1, using the --template_bounds argument.
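A hypothetical invocation changing those defaults might look like this; the --brill flag name and the train_tagger.py script name are my assumptions, while --max_rules, --min_score, and --template_bounds are the options described above:

    # flag names other than the three option arguments are assumed
    python train_tagger.py treebank --sequential aubt --brill --max_rules 300 --min_score 3 --template_bounds 2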

Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.
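For example, to train a classifier-based tagger with a maximum entropy classifier (again assuming the train_tagger.py script; Maxent is one of the algorithm names mentioned above):

    python train_tagger.py treebank --classifier Maxent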

These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).

I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.

A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.
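
Once nltk_trainer is importable, loading and using the pickled tagger is ordinary pickle usage; the file path and example words below are placeholders:

    import pickle

    # nltk_trainer must be on PYTHONPATH so the PhoneticClassifierBasedPOSTagger
    # class can be found during unpickling; the path below is a placeholder.
    with open('PATH/TO/TAGGER.pickle', 'rb') as f:
        tagger = pickle.load(f)

    print(tagger.tag(['hello', 'world']))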

NLTK Default Tagger Performance on CoNLL2000

The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.

Tag      Found    Actual   Precision   Recall
#           46       47    1           1
$         2122     2134    1           0.6
‘         1811     1809    1           1
(            0      351    None        0
)            0      358    None        0
,        13160    13160    1           1
-LRB-      351        0    0           None
-NONE-      59        0    0           None
-RRB-      358        0    0           None
.        10800    10802    1           1
:         1288     1285    0.7143      1
CC        6589     6586    0.6875      0.7333
CD       10325    10233    0.972       0.9919
DT       22301    22355    0.7826      1
EX         229      254    1           1
FW           1       42    1           0.0455
IN       27798    27835    0.7315      0.7899
JJ       15370    16049    0.7372      0.7303
JJR       1114     1055    0.5412      0.575
JJS        611      451    0.6912      0.7966
LS          13        0    0           None
MD        2616     2637    0.7143      0.75
NN       38023    36789    0.7345      0.8441
NNP      24967    24690    0.8752      0.9421
NNPS       589      550    0.4553      0.3684
NNS      17068    16653    0.8572      0.9527
PDT         24       65    0.6667      1
POS       2224     2203    0.6667      1
PRP       4620     4634    0.8438      0.7941
PRP$      2292     2302    0.6364      1
RB        7681     7961    0.8076      0.8582
RBR        288      392    0.5         0.3684
RBS         90      240    0.5         0.1667
RP         634       95    0.1176      1
SYM          0        6    None        0
TO        6257     6259    1           0.75
UH           2       17    1           0.1111
VB        6681     7286    0.9042      0.8313
VBD       8501     8424    0.7521      0.8605
VBG       3730     4000    0.8493      0.8603
VBN       5763     5867    0.8164      0.8721
VBP       3232     3407    0.6754      0.6638
VBZ       5224     5561    0.7273      0.6906
WDT       1156     1157    0.6         0.5
WP         637      639    1           1
WP$         38       39    1           1
WRB        566      571    0.9         0.75
“         1855     1854    0.6667      1

Unknown Words in CoNLL2000

The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger falls back to -NONE- for 50 unique words it is unable to identify. Here’s a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words joined with a “-“. You might think this could be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the “-” would be incorrect.

Missing Symbols and Rare Tags

The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the more rare tags, such as FW, LS, RBS, and UH. These failures highlight the need for training a part-of-speech tagger (or any NLP object) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, and in similar proportion. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.


For some research I’m doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I’ve added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger’s results against the actual tags.

NLTK Default Tagger Performance on Treebank

Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I’ve explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn’t have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).

Tag      Found   Actual   Precision   Recall
#           16      16    1           1
$          724     724    1           1
‘          694     694    1           1
,         4887    4886    1           1
-LRB-      120     120    1           1
-NONE-    6591    6592    1           1
-RRB-      126     126    1           1
.         3874    3874    1           1
:          563     563    1           1
CC        2271    2265    1           1
CD        3547    3546    0.999       0.999
DT        8170    8165    1           1
EX          88      88    1           1
FW           4       4    1           1
IN        9880    9857    0.9913      0.958
JJ        5803    5834    0.9913      0.9789
JJR        386     381    1           0.9149
JJS        185     182    0.9667      1
LS          12      13    1           0.8571
MD         927     927    1           1
NN       13166   13166    0.9917      0.9879
NNP       9427    9410    0.9948      0.994
NNPS       246     244    0.9903      0.9533
NNS       6055    6047    0.9952      0.9972
PDT         21      27    1           0.6667
POS        824     824    1           1
PRP       1716    1716    1           1
PRP$       766     766    1           1
RB        2800    2822    0.9931      0.975
RBR        130     136    1           0.875
RBS         33      35    1           0.5
RP         213     216    1           1
SYM          1       1    1           1
TO        2180    2179    1           1
UH           3       3    1           1
VB        2562    2554    0.9914      1
VBD       3035    3043    0.9902      0.9807
VBG       1458    1460    0.9965      0.9982
VBN       2145    2134    0.9885      0.9957
VBP       1318    1321    0.9931      0.9828
VBZ       2124    2125    0.9937      0.9906
WDT        440     445    1           0.8333
WP         241     241    1           1
WP$         14      14    1           1
WRB        178     178    1           1
“          712     712    1           1

Unknown Words in Treebank

Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it’s not that bad, since there are only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar-looking tokens.


If you liked the NLTK demos, then you’ll love the text processing APIs. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage & demand. If you’d like to do more, please fill out this survey to let me know what your needs are.

If you like it, please share it. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this NLTK services survey.

NLTK Training Sets

For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
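
For reference, the brown split can be reproduced roughly like this; the exact cutoff arithmetic is my assumption:

    from nltk.corpus import brown

    sents = brown.tagged_sents(categories=['reviews', 'lore', 'romance'])
    cutoff = len(sents) * 2 // 3
    train_sents, test_sents = sents[:cutoff], sents[cutoff:]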

The ClassifierBasedPOSTagger is not necessarily more accurate than the bcraubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below), so the loss in efficiency may not be worth it.
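
For concreteness, training a ClassifierBasedPOSTagger looks roughly like this; the 2/3 treebank split and the Maxent note are illustrative, not necessarily the exact setup used for the numbers above:

    from nltk.corpus import treebank
    from nltk.tag.sequential import ClassifierBasedPOSTagger

    tagged_sents = treebank.tagged_sents()
    cutoff = len(tagged_sents) * 2 // 3
    train_sents, test_sents = tagged_sents[:cutoff], tagged_sents[cutoff:]

    # Uses a NaiveBayes classifier by default; pass
    # classifier_builder=MaxentClassifier.train to use a maxent classifier instead.
    cpos = ClassifierBasedPOSTagger(train=train_sents)
    print(cpos.evaluate(test_sents))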

Using a Brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.

I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:

The result was 98.08% accuracy. So the remaining ~2% difference must be due to the MaxentClassifier being more accurate than the NaiveBayes classifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy, so I can only conclude that a different feature detector is used. Hopefully the NLTK developers will publish the training method so we can all know for sure.

This was run with Python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3GB of RAM, but the important thing is the relative times: braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66,666 words/sec, while cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier-based tagger if speed is an issue.

Here’s the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with an nltk.data-accessible path ending in .pickle for the load method.
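
A sketch of that timing code (a reconstruction along the lines described, not necessarily the original snippet):

    import time

    import nltk
    from nltk.corpus import treebank

    # nltk.tag._POS_TAGGER is the NLTK 2.x constant mentioned above; replace it
    # with any nltk.data-accessible .pickle path to time a different tagger.
    postag = nltk.data.load(nltk.tag._POS_TAGGER)

    start = time.time()
    for sent in treebank.sents():
        postag.tag(sent)
    print('%.1f seconds' % (time.time() - start))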

File Size

There’s also a significant difference in the file size of the pickled taggers (trained on treebank):

Tagger    Size
raubt     272K
braubt    273K
cpos      3.8M
bcpos     3.8M
postag    8.2M

Fin

I think there’s a lot of room for experimentation with classifier-based taggers and their feature detectors. But if speed is an issue for you, don’t even bother. In that case, stick with a simpler tagger that’s nearly as accurate and orders of magnitude faster.