This notebook follows section 6.1 of the NLTK Book on Supervised Classification, specifically the examples on part-of-speech tagging (pp. 229-230 in the print edition). The example in the NLTK Book uses the tagged Brown corpus. To build a similar collection, I retrieve token and POS information from G. Celano's Lemmatized Ancient Greek XML repository, specifically the 'Unique Values' dataset, a collection of all of the unique tokens in the Morpheus and PerseusUnderPhilologic databases. The requests_html package is used to parse the XML and return a list of tuples of this form: [(token1, pos1), (token2, pos2), ...].

With this tagged collection, we get the top 100 endings of 1, 2, or 3 letters. We then use this featureset to train the DecisionTreeClassifier. (Not the fastest process, by the way!) The results are honestly not impressive: consistently around 59-60% accuracy. Adding more features of the same kind, e.g. 4- and 5-letter terminations, does not improve accuracy.
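To make the featureset concrete, here is a sketch of a suffix-based feature extractor of the kind described above: one boolean feature per common ending. The `common_suffixes` list here is a small stand-in for the top-100 list built from the corpus, and `pos_features` is an illustrative name, not code from the notebook.

```python
# A small stand-in for the top-100 suffix list built from the corpus
common_suffixes = ['ς', 'ν', 'ος', 'αι', 'ειν', 'σθαι']

def pos_features(word):
    """Map a token to one boolean 'endswith(X)' feature per common suffix."""
    word = word.lower()
    return {'endswith({})'.format(suffix): word.endswith(suffix)
            for suffix in common_suffixes}

print(pos_features('λεγεσθαι'))
```

Each token becomes a dictionary of endswith flags, which is the shape of input that NLTK's classifiers expect.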

One nice feature of the DecisionTreeClassifier is that you can generate pseudocode to trace the decision-making process. Here is a sample decision tree for this test:
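The tree printout from the original run is not reproduced in this export; as a stand-in, this toy classifier (with made-up training data) shows the `pseudocode()` call that produces such a trace:

```python
import nltk

# Toy training data: a single endswith feature, two labels
train = [({'endswith(ς)': True}, 'n'),
         ({'endswith(ς)': True}, 'n'),
         ({'endswith(ς)': False}, 'v'),
         ({'endswith(ς)': False}, 'v')]

classifier = nltk.DecisionTreeClassifier.train(train)
# Print the learned tree as if/return pseudocode, up to 4 levels deep
print(classifier.pseudocode(depth=4))
```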

Not much more to add here. The results (i.e. 59-60%) strike me as not unlike being in first-year Greek, where endings are helpful, and some endings are much, much more helpful than others (e.g. -εσθαι), but not always. The amount of ambiguity in the endings remains a challenge that only becomes easier to overcome with a better handle on working with words (and word endings) in context. And context is where the next tutorial will take us. [PJB 3.10.18]

# Get stats on dataset
print('This dataset consists of {} tokens from the Lemmatized Ancient Greek XML Unique Tokens corpus'.format(len(records)))
print('Here is a sample of the dataset: {}'.format(records[:10]))

# Get a list of the top word terminations of 1, 2, and 3 characters
import nltk

suffix_fdist = nltk.FreqDist()
greek_words = [word for word, _ in records]
for word in greek_words:
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)
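The remaining step described in the intro, building featuresets and training the DecisionTreeClassifier, might look like the sketch below. It uses a handful of toy `(token, pos)` records and a trimmed-down feature extractor so the cell is self-contained; the real notebook would use `records` and the top-100 `common_suffixes` from the cells above.

```python
import nltk

# Toy stand-in for the full (token, pos) dataset
records = [('λογος', 'n'), ('ανθρωπος', 'n'), ('ποταμος', 'n'),
           ('δωρον', 'n'), ('λυειν', 'v'), ('λεγειν', 'v'),
           ('λεγεσθαι', 'v'), ('λυεσθαι', 'v')]

def pos_features(word):
    """Boolean endswith features over a trimmed suffix list."""
    suffixes = ['ς', 'ν', 'ος', 'ειν', 'σθαι']
    return {'endswith({})'.format(s): word.endswith(s) for s in suffixes}

featuresets = [(pos_features(token), pos) for (token, pos) in records]

# Hold out the first quarter of the data for testing
cutoff = len(featuresets) // 4
train_set, test_set = featuresets[cutoff:], featuresets[:cutoff]

classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```

On the real corpus this is the slow, roughly 59-60%-accurate step reported above; the toy numbers here are not meaningful beyond showing the API.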