Stemming early English

The question occasionally arises of how far the English (or earlier Porter)
stemming algorithm can be adapted to handle older forms of the English
language.

Historically, English is usually divided into three periods of development:

1) Old English (or Anglo-Saxon), the language of Beowulf,
2) Middle English, the language of Chaucer,
3) Modern English, the language of Shakespeare, Dickens, and people today.

Old English is so different from Modern English that it may be regarded as a
distinct language.

Middle English is problematic for a number of reasons. There is no standard
spelling in the original texts, and the grammatical differences between Middle
and Modern English prevent the spelling from being simply ‘modernised’. It is
possible to normalise the spelling according to some modern scheme, but there
is no standard scheme for doing so either. Middle English itself had great
regional variations, so that, for example, the English of Chaucer and that of
his contemporary the Gawain poet (both late 14th century) are strikingly
different. Finally, grammar was fluid even for one writer, so Chaucer might use
'they love' or 'they loven', 'he sitteth' or 'he sit'.

We may take Modern English to mean English which can be cast into a modern
spelling form without too much damage being done to the original. From this
point of view Shakespeare and the Authorised Version of the Bible are in Modern
English. The ending structure of words in early Modern English differs from
that of contemporary English in the 'est' and 'eth' endings of verbs in the
present indicative:

I bring
thou bringest
he bringeth
we bring
you bring
they bring

Both of these endings underwent rapid decline. The 'eth' form occurs in
Shakespeare, but is much rarer than the modern 's' form. The language of the
Authorised Version, in which both forms abound, seemed archaic even on its
first publication. Consequently the 'eth' form survives now only in the
language of the traditional Bible and Book of Common Prayer. The 'est' form
disappeared more slowly, as the use of 'thou' became displaced by 'you' in
conversation.

As far as the Snowball scripts are concerned, the endings 'est' and 'eth' must
be added alongside the ending 'ing', so that they are removed under the same
conditions.
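
As a rough illustration of what treating 'est' and 'eth' like 'ing' amounts to,
the following Python sketch mimics only the suffix-removal test (strip the
ending when the remaining stem contains a vowel). It is not the Snowball script
itself: the names ENDINGS and strip_ending are invented for this example, 'y' is
always counted as a vowel, and the tidying that the real algorithms perform
after removal is omitted.

```python
# Simplified sketch, NOT the Snowball script: names and rules are assumptions.
VOWELS = set("aeiouy")  # assumption: 'y' always counts as a vowel here

# 'est' and 'eth' sit next to 'ing' and share its removal condition.
ENDINGS = ("ingly", "edly", "ing", "est", "eth", "ed")


def strip_ending(word: str) -> str:
    """Remove the longest matching ending if the stem still contains a vowel."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if any(ch in VOWELS for ch in stem):
                return stem
            return word  # e.g. 'bring': the stem 'br' has no vowel, so keep it
    return word


for form in ("bring", "bringest", "bringeth", "bringing"):
    print(form, "->", strip_ending(form))
# bring -> bring
# bringest -> bring
# bringeth -> bring
# bringing -> bring
```

Under these assumptions the whole present indicative paradigm shown earlier
conflates to the single stem 'bring'.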

The inclusion of these endings does produce certain ‘side effects’. 'est' is
also the ending of adjectival superlatives ('greatest', 'unkindest'), where it
will likewise be removed. Words like 'brandreth' and 'deforest' will be
mis-stemmed. Nevertheless, for the vocabulary of the Bible, the inclusion of
these extra endings is not harmful (see this demonstration: for example, search
for the text 'love' in 1000 verses).
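
The side effects can be reproduced with a deliberately minimal rule. The sketch
below is an illustration under stated assumptions only: the function name
strip_est_eth is hypothetical, 'y' is always counted as a vowel, and the later
steps of the real stemmers are ignored, so the results are intermediate forms
rather than final stems.

```python
# Minimal sketch of the 'est'/'eth' rule alone; not the full stemmer.
def strip_est_eth(word: str) -> str:
    """Drop a final 'est' or 'eth' when the remaining stem contains a vowel."""
    if word.endswith(("est", "eth")):
        stem = word[:-3]
        if any(ch in "aeiouy" for ch in stem):
            return stem
    return word


for word in ("greatest", "unkindest", "brandreth", "deforest"):
    print(word, "->", strip_est_eth(word))
# greatest -> great
# unkindest -> unkind
# brandreth -> brandr
# deforest -> defor
```

The superlatives lose their ending just as the verb forms do, while
'brandreth' and 'deforest' are cut even though 'eth' and 'est' are not endings
in those words.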