In a previous article, we studied training a NER (Named Entity Recognition) system from the ground up, using the Groningen Meaning Bank corpus. This article is a continuation of that tutorial. The main purposes of this extension are to:

Replace the classifier with a Scikit-Learn Classifier

Train a NER on a larger subset of the training data

Increase accuracy

Understand Out-Of-Core Learning

What was wrong with the initial system, you might ask? There wasn’t anything fundamentally wrong with the process. In fact, it’s a great didactic example, and we can build upon it. Here’s where it was lacking:
1. If you did the training yourself, you probably realized we can’t train the system on the whole dataset (I chose to train it on the first 2000 sentences).
2. The dataset is huge – it can’t all be loaded into memory.
3. We achieved around 93% accuracy. That might sound good, but we may be deceived: named entities make up only around 10% of the tags, so predicting the O tag for every word (remember, O stands for outside any entity) already yields roughly 90% accuracy. We can probably do better.
4. We can come up with a feature set that better describes the data and is more relevant to our task.

Out-Of-Core Learning

We are used to presenting all the data we have to the classifier at once. This means keeping the entire dataset in memory, which gets in our way when we want to train on a larger dataset. Training on a dataset that doesn’t fit in RAM is called Out-Of-Core Learning.

Certain types of classifiers accept data presented in batches. Scikit-Learn includes a few such classifiers; here’s the list: Scikit-Learn Incremental Classifiers. The process of learning from batches is called Incremental Learning.

The classifiers that support Incremental Learning implement the partial_fit method.
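Here’s a minimal sketch of the protocol on toy data, just to make it concrete (the feature vectors and labels are made up for illustration):

from sklearn.linear_model import Perceptron

clf = Perceptron()

# The first call must receive the complete list of classes, because no
# single batch is guaranteed to contain every label.
clf.partial_fit([[1, 0], [0, 1]], ['O', 'B-per'], classes=['O', 'B-per', 'I-per'])

# Later calls simply feed the next batch; the model is updated in place.
clf.partial_fit([[1, 1]], ['I-per'])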

Using generators

In the previous tutorial, we created a way of reading the corpus that didn’t keep the whole dataset in memory, using a Python generator.

Unfortunately, because we had to present all the data at once, we transformed the generator into a list, losing the advantage of working with generators. Since we don’t need all the data at once this time, we’ll slice a batch from the generator every time we call the partial_fit method. Let’s include the corpus-reading routine from the previous article here:

import os

from nltk import conlltags2tree

def to_conll_iob(annotated_sentence):
    """
    `annotated_sentence` = list of triplets [(w1, t1, iob1), ...]

    Transform a pseudo-IOB notation: O, PERSON, PERSON, O, O, LOCATION, O
    to proper IOB notation: O, B-PERSON, I-PERSON, O, O, B-LOCATION, O
    """
    proper_iob_tokens = []
    for idx, annotated_token in enumerate(annotated_sentence):
        word, tag, ner = annotated_token

        if ner != 'O':
            if idx == 0:
                # The first entity token in a sentence always begins a chunk
                ner = "B-" + ner
            elif annotated_sentence[idx - 1][2] == ner:
                # Same entity type as the previous token: inside the chunk
                ner = "I-" + ner
            else:
                # Different entity type: a new chunk begins
                ner = "B-" + ner
        proper_iob_tokens.append((word, tag, ner))
    return proper_iob_tokens

def read_gmb_ner(corpus_root):
    for root, dirs, files in os.walk(corpus_root):
        for filename in files:
            if filename.endswith(".tags"):
                with open(os.path.join(root, filename), 'rb') as file_handle:
                    file_content = file_handle.read().decode('utf-8').strip()
                    annotated_sentences = file_content.split('\n\n')
                    for annotated_sentence in annotated_sentences:
                        annotated_tokens = [seq for seq in annotated_sentence.split('\n') if seq]

                        standard_form_tokens = []
                        for idx, annotated_token in enumerate(annotated_tokens):
                            annotations = annotated_token.split('\t')
                            word, tag, ner = annotations[0], annotations[1], annotations[3]

                            if ner != 'O':
                                # Keep only the main entity type, dropping the
                                # subcategory (e.g. "geo-nam" -> "geo")
                                ner = ner.split('-')[0]

                            standard_form_tokens.append((word, tag, ner))

                        conll_tokens = to_conll_iob(standard_form_tokens)

                        # Yield one parsed sentence (an nltk Tree) at a time
                        yield conlltags2tree(conll_tokens)
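For illustration, here’s how batches can be sliced from this generator with itertools.islice; the corpus path and batch size are assumptions for the example:

import itertools

reader = read_gmb_ner('gmb-2.2.0/data')  # hypothetical corpus location

# Each islice call consumes the next 500 sentences; the generator keeps
# its position, so nothing is loaded twice and the corpus never sits in RAM.
first_batch = list(itertools.islice(reader, 500))
second_batch = list(itertools.islice(reader, 500))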

Better features

The feature detector created in the previous article wasn’t bad at all. In fact, it includes the most popular features, adapted to achieve better performance, and we’re only going to make a few adjustments. One of the most important features for Named Entity Recognition is the shape of the word. We’re going to create a function that describes particular word forms. You should experiment with this function and see if you get better results. Here’s my function:

import re

def shape(word):
    # Map a word to one of a handful of coarse shape classes; order matters,
    # since the first matching pattern wins
    word_shape = 'other'
    if re.match(r'[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$', word):
        word_shape = 'number'
    elif re.match(r'\W+$', word):
        word_shape = 'punct'
    elif re.match(r'[A-Z][a-z]+$', word):
        word_shape = 'capitalized'
    elif re.match(r'[A-Z]+$', word):
        word_shape = 'uppercase'
    elif re.match(r'[a-z]+$', word):
        word_shape = 'lowercase'
    elif re.match(r'[A-Z][a-z]+[A-Z][a-z]+[A-Za-z]*$', word):
        word_shape = 'camelcase'
    elif re.match(r'[A-Za-z]+$', word):
        word_shape = 'mixedcase'
    elif re.match(r'__.+__$', word):
        word_shape = 'wildcard'
    elif re.match(r'[A-Za-z0-9]+\.$', word):
        word_shape = 'ending-dot'
    elif re.match(r'[A-Za-z0-9]+\.[A-Za-z0-9\.]+\.$', word):
        word_shape = 'abbreviation'
    elif re.match(r'[A-Za-z0-9]+\-[A-Za-z0-9\-]+.*$', word):
        word_shape = 'contains-hyphen'

    return word_shape
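To get a feel for the buckets, here are a few illustrative inputs and the shapes the function assigns them:

print(shape('2.5'))               # number
print(shape('Germany'))           # capitalized
print(shape('NATO'))              # uppercase
print(shape('U.S.A.'))            # abbreviation
print(shape('state-of-the-art'))  # contains-hyphen
print(shape('__START1__'))        # wildcard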

Here’s the final feature-extraction function (I also added one more IOB tag from the history).
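What follows is a sketch of such a detector, assuming the shape function above, NLTK’s Snowball stemmer, and the same two-token context window as the previous tutorial; the padding markers and exact feature names are illustrative choices:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def ner_features(tokens, index, history):
    """
    `tokens`  = a POS-tagged sentence [(w1, t1), ...]
    `index`   = the index of the token we want to extract features for
    `history` = the IOB tags predicted so far
    """
    # Pad the sequence so context features exist even at sentence boundaries
    tokens = [('__START2__', '__START2__'), ('__START1__', '__START1__')] + \
             list(tokens) + [('__END1__', '__END1__'), ('__END2__', '__END2__')]
    history = ['__START2__', '__START1__'] + list(history)
    index += 2  # adjust for the two padding tokens on the left

    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]

    return {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'shape': shape(word),

        'next-word': nextword,
        'next-pos': nextpos,
        'next-lemma': stemmer.stem(nextword),
        'next-shape': shape(nextword),

        'next-next-word': nextnextword,
        'next-next-pos': nextnextpos,
        'next-next-shape': shape(nextnextword),

        'prev-word': prevword,
        'prev-pos': prevpos,
        'prev-lemma': stemmer.stem(prevword),
        'prev-shape': shape(prevword),

        'prev-prev-word': prevprevword,
        'prev-prev-pos': prevprevpos,
        'prev-prev-shape': shape(prevprevword),

        # the previously predicted IOB tags; prev-prev-iob is the extra
        # history feature mentioned above
        'prev-iob': history[-1],
        'prev-prev-iob': history[-2],
    }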

Learning in batches

After getting the corpus reading and the feature extraction out of the way, we can focus on the cool stuff: training the NE-chunker. The code is fairly simple, but let’s first state what we want to achieve:

The training method should receive a generator. It should only slice batches from the generator, never load all the data into memory.

We’re going to train a Perceptron. It trains fast and gives good results in this case.

Keep in mind that we will use the partial_fit method.

Because we don’t show all the data at once, we have to give a list of all the classes up front.

Let’s build our NE-chunker:

import itertools

from nltk import tree2conlltags
from nltk.chunk import ChunkParserI

from sklearn.linear_model import Perceptron
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

class ScikitLearnChunker(ChunkParserI):

    @classmethod
    def to_dataset(cls, parsed_sentences, feature_detector):
        """
        Transform a list of parsed sentences into a scikit-learn compatible dataset
        """
        X, y = [], []
        for parsed in parsed_sentences:
            iob_tagged = tree2conlltags(parsed)
            words, tags, iob_tags = zip(*iob_tagged)
            tagged = list(zip(words, tags))

            for index in range(len(iob_tagged)):
                # One sample per token: a feature dict and its IOB label
                X.append(feature_detector(tagged, index, history=iob_tags[:index]))
                y.append(iob_tags[index])
        return X, y
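Putting the pieces together, a training loop and a parsing helper might look like the following sketch. One helper fits a DictVectorizer and a Perceptron batch-by-batch via partial_fit, the other chunks a new POS-tagged sentence by feeding each prediction back into the history. The helper names, batch size, and max_iter value are illustrative assumptions, not the article’s exact code:

import itertools

from nltk import conlltags2tree

def train_ner(reader, feature_detector, all_classes, batch_size=500, n_batches=50):
    """Incrementally fit a DictVectorizer + Perceptron on batches of parsed sentences."""
    vectorizer = DictVectorizer(sparse=False)
    clf = Perceptron(max_iter=5)

    for batch_no in range(n_batches):
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break  # the generator is exhausted

        X, y = ScikitLearnChunker.to_dataset(batch, feature_detector)
        if batch_no == 0:
            # The feature space is frozen by the first batch; features first
            # seen later are ignored (see the comment discussion below). The
            # complete class list is required on the first partial_fit call.
            vectorizer.fit(X)
            clf.partial_fit(vectorizer.transform(X), y, all_classes)
        else:
            clf.partial_fit(vectorizer.transform(X), y)

    # Chain the fitted steps so we can predict straight from feature dicts
    return Pipeline([('vectorizer', vectorizer), ('classifier', clf)])

def parse_sentence(pipeline, feature_detector, tagged_sentence):
    """Chunk a POS-tagged sentence [(w1, t1), ...] into an nltk Tree."""
    history = []
    iob_tagged = []
    for index, (word, tag) in enumerate(tagged_sentence):
        features = feature_detector(tagged_sentence, index, history)
        iob_tag = pipeline.predict([features])[0]
        history.append(iob_tag)  # predictions become the history for the next token
        iob_tagged.append((word, tag, iob_tag))
    return conlltags2tree(iob_tagged)

# Illustrative usage (the path and class list are assumptions):
# reader = read_gmb_ner('gmb-2.2.0/data')
# classes = ['O', 'B-per', 'I-per', 'B-geo', 'I-geo']
# pipeline = train_ner(reader, ner_features, classes)
# tree = parse_sentence(pipeline, ner_features,
#                       pos_tag(word_tokenize("I'm going to Germany this Monday.")))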


26 Comments

Hello,
Thanks for this nice tutorial. I have been trying out some examples with your code. I run into problems when running the train method. How do you call the method to train on the corpus and test it against new data?

In your previous tutorial you call it like this:
print chunker.parse(pos_tag(word_tokenize(“I’m going to Germany this Monday.”)))

This is not working any more with sklearn. I keep getting the error:
sklearn.exceptions.NotFittedError: This Perceptron instance is not fitted yet

Nice tutorial, I enjoyed it! The only question I have is with regard to the dictionary vectorizer. It is only being fitted on the first iteration, so your vector space is surely defined by the first batch of data. What if you have words, features, word shapes etc. that haven’t been seen in this first iteration? Looking at the sklearn docs, it seems their feature values will always be 0. Therefore I’m struggling to see how this can ever outperform the in-core approach to fitting the classifier. Do you know of a way to account for this?

Indeed, that’s the case. The vectorizer, as well as the classifier, is fitted only at the training phase. Here’s a scenario:

– We fit the vectorizer and classifier on some labelled data.
– We get new unlabelled data. We “re-fit” the vectorizer with the new data. What should the classifier do with the new “words” provided by the vectorizer, since it doesn’t have labels for the new samples?

Thanks for the response! But I mean that even during training the vectorizer is only fitted on the first batch, so any new words, parts of speech, stems etc. in the subsequent partial_fit calls will all be given a value of 0, whereas with the in-core approach they won’t, as they will all have gone through the fitting and transforming of the vectorizer.

So I tried the same code with Python 2 and Python 3. When run on Python 2 it does not produce any error; however, on Python 3 it produces the error. I think the cause of the error is the difference in behavior of zip() between Python 2 and Python 3 (zip() returns a list in Python 2 but a lazy iterator in Python 3).

Indeed, I’ve noticed that myself, and I started inspecting the corpus. You’ll notice that very few “artefacts” are tagged, and even the annotated ones are very noisy. That’s the reason the NER doesn’t pick up ART tags. Sorry to deliver the bad news 😛

I haven’t used next-iob and next-next-iob simply because when you tag an unknown sentence (not from the corpus), there’s no way of knowing what those are.
You only know what has already been tagged (prev-iob, prev-prev-iob).