Problem with Brill Tagger - Using NLTK library

Because the library I'm having trouble with has a small user community, I'm opening this problem up to a wider audience in the hope that someone can offer some advice.

I'm using a library called nltk to perform some Natural Language Processing on text files. Before I give the specifics of the problem, anyone who wishes to help out will need to do the following first:

1. Go to: http://nltk.sourceforge.net/install.html and install the appropriate version of the library.

2. Check out: http://nltk.sourceforge.net/api-1.4/index.html for documentation on all classes within this library.

I will now explain the problem before supplying the code, which you can copy into a Python file and run to see the error message for yourself.

The problem is with the Brill tagger supplied with the nltk library. I have run into trouble invoking the tag method of the 'BrillTagger' class.

I've managed to train the Brill tagger on the 'treebank' corpus, but when I invoke the 'tag' method I receive the error KeyError: 'SUBTOKENS'. I understand what the error refers to, but I can't find the reason it is raised. It indicates that the method 'tag(self, token)' requires a token instance in order to assign POS tags, and for some reason it doesn't like the variable I'm passing.

Below is the code I'm using, which you can copy and run directly to see the error message I receive.

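from nltk.corpus import treebank

# Collect the word tokens from the first 100 tagged treebank items.
train_tokens = []
items = treebank.items('tagged')
for item in items[:100]:
    item = treebank.read(item)
    for sent in item['SENTS']:
        train_tokens += sent['WORDS']

# Drop tokens whose text starts with one of the characters "[", "]" or "=".
train_tokens = [tok for tok in train_tokens
                if tok['TEXT'][0] not in "[]="]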

As for the contents of corpusTStoken, it is a single token that encapsulates a set of tokens, i.e. a token with many subtokens, as below:

<[<some>, <text>, <to>, <be>, <assigned>, ... , <text>]>

The outer angle brackets are the token, and each inner pair of angle brackets is a subtoken. At this point each subtoken has one attribute, 'TEXT'. The idea of the code above is to add another attribute, 'POS' (part of speech), to each subtoken.
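To make the structure described above concrete, here is a minimal sketch that uses plain Python dicts as stand-ins for nltk Token objects, so it runs without the library. The helper names make_token and add_pos, the 'SUBTOKENS' container key, and the tiny lookup table are all assumptions for illustration only; this is not how BrillTagger assigns tags.

```python
# Plain dicts stand in for nltk Token objects (an assumption, so that the
# sketch runs without the library installed).

def make_token(words):
    """Build a container token holding one subtoken per word."""
    return {'SUBTOKENS': [{'TEXT': w} for w in words]}


def add_pos(token, lookup, default='NN'):
    """Hypothetical stand-in for a tagger: add a 'POS' attribute to each subtoken."""
    for sub in token['SUBTOKENS']:
        sub['POS'] = lookup.get(sub['TEXT'], default)
    return token


token = make_token(['some', 'text', 'to', 'be', 'assigned'])
add_pos(token, {'some': 'DT', 'to': 'TO', 'be': 'VB'})

# Each subtoken now carries both 'TEXT' and 'POS'.
for sub in token['SUBTOKENS']:
    print(sub['TEXT'], sub['POS'])
```

Before add_pos runs, every subtoken has only 'TEXT'; afterwards each one has 'TEXT' and 'POS', which is the shape the tagging step is meant to produce.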

Hope you can try running it for yourselves now. I'm going to run a few more tests on corpusTStoken to see if I can gain a better understanding of the structure/format of this variable, as you asked.

I'd be obliged for any advice anyone can offer after running the above code.

Although I can't manually construct a token in its raw form, i.e. <'value'>, I've assigned corpusTStoken to Token(TEXT='this') and, separately, to Token(Token(TEXT='this')), but I receive the same error, i.e. KeyError: 'SUBTOKENS'.

From my limited understanding, the documentation states that the tag method requires, as its argument, a token which itself contains a set of tokens (sub-tokens of the token). As far as I'm aware, that's exactly what I'm passing, but in spite of this the error still persists.

After I removed the property argument SUBTOKENS='WORDS' from the constructor for the WhitespaceTokenizer, the tokenize method did create a 'SUBTOKENS' property for corpusTStoken. I call this progress!
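The effect of that property argument can be sketched with plain dicts standing in for tokens. The tokenize and tag functions below are hypothetical stand-ins, not the nltk implementations; they only assume that the tokenizer stores its subtokens under a configurable key while the tagger reads a fixed one.

```python
def tokenize(text, subtokens_key='SUBTOKENS'):
    """Hypothetical whitespace tokenizer: stores subtokens under subtokens_key."""
    return {'TEXT': text, subtokens_key: [{'TEXT': w} for w in text.split()]}


def tag(token):
    """Hypothetical tagger that always reads the key 'SUBTOKENS'."""
    return [sub['TEXT'] for sub in token['SUBTOKENS']]


# Subtokens stored under 'WORDS': the tagger cannot find 'SUBTOKENS'.
mismatched = tokenize('some text to tag', subtokens_key='WORDS')
try:
    tag(mismatched)
except KeyError as err:
    print('KeyError:', err)

# Subtokens stored under the default key: the lookup succeeds.
matched = tokenize('some text to tag')
print(tag(matched))
```

If the real classes behave like this, the two sides have to agree on the property name: either both use the default, or both are given the same mapping.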

Anyhow, I'm now receiving another error, and I'm not entirely sure what it is about. The traceback you'll get if you invoke the above code is:

Well, the error simply says that the key 'TAG' was expected but not found. This is much the same as the previous error. It suggests that while you have made progress, what is being passed to tag is still not structured as tag expects.
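One way to narrow this down is to inspect the token before passing it to tag and list which expected keys are absent. The sketch below uses plain dicts in place of nltk tokens; the helper missing_keys and the assumption that each subtoken should carry 'TEXT' and 'TAG' are mine, based only on the two error messages, not on the BrillTagger source.

```python
def missing_keys(token, container_key='SUBTOKENS', required=('TEXT', 'TAG')):
    """Return (index, key) pairs for every expected key that is absent.

    An index of None means the container key itself is missing.
    """
    if container_key not in token:
        return [(None, container_key)]
    problems = []
    for i, sub in enumerate(token[container_key]):
        for key in required:
            if key not in sub:
                problems.append((i, key))
    return problems


# Example: the second subtoken has no 'TAG' entry.
example = {'SUBTOKENS': [{'TEXT': 'this', 'TAG': 'DT'}, {'TEXT': 'works'}]}
print(missing_keys(example))
```

An empty list means the structure matches what this check assumes; anything else shows which subtoken and which key to look at first.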

I suggest reviewing the BrillTagger class; it might help you better understand what you need to provide. I'm afraid I don't know NL processing myself.