Glossary

a-suffix

An a-suffix, or attached suffix, is a particle word attached to another
word. (In the stemming literature these are sometimes referred to as
‘enclitics’.) In Italian, for example, personal pronouns attach to
certain verb forms:

mandargli = mandare + gli = to send + to him

mandarglielo = mandare + gli + lo = to send + to him + it

a-suffixes appear in Italian and Spanish, and also in Portuguese, although
in Portuguese they are separated from the preceding word by a hyphen, which
makes them easy to remove.
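
As a rough illustration of the two cases above, the following Python sketch
strips an attached pronoun from an Italian infinitive and from a hyphenated
Portuguese form. The function names and the short pronoun lists are
assumptions made for this example only. A real stemmer would use fuller
lists and stricter conditions on the verb form.

```python
# Hypothetical, hand-picked list of Italian attached pronouns (a-suffixes),
# longest first so that "glielo" is tried before "lo".
ITALIAN_A_SUFFIXES = ("glielo", "gli", "lo")


def strip_italian_a_suffix(word):
    """Remove one attached pronoun from a truncated infinitive, if present."""
    for suffix in ITALIAN_A_SUFFIXES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            # Only strip after a truncated infinitive (mandar-, not caval-),
            # then restore the final e that was dropped before the pronoun.
            if base.endswith(("ar", "er", "ir")):
                return base + "e"
    return word


# Hypothetical list of Portuguese pronouns that follow a hyphen.
PORTUGUESE_A_SUFFIXES = {"me", "te", "se", "lhe", "lhes", "nos", "vos"}


def strip_portuguese_a_suffix(word):
    """Drop a hyphenated pronoun; leave other hyphenated words alone."""
    base, hyphen, tail = word.rpartition("-")
    if hyphen and tail in PORTUGUESE_A_SUFFIXES:
        return base
    return word


print(strip_italian_a_suffix("mandargli"))     # mandare
print(strip_italian_a_suffix("mandarglielo"))  # mandare
print(strip_portuguese_a_suffix("dar-lhe"))    # dar
```

The Italian case needs a guard on the remaining stem, because many ordinary
words (cavallo, for instance) happen to end in the same letters as a
pronoun. The Portuguese case only needs to look at what follows the hyphen.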

i-suffix

An i-suffix, or inflectional suffix, forms part of the basic grammar of a
language, and is applicable to all words of a certain grammatical type,
with perhaps a small number of exceptions. In English, for example, the past
tense of a verb is formed by adding ed. Certain modifications to the stem
may be required:

fit + ed -> fitted (double t)

love + ed -> loved (drop the final e of love)
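
The two stem modifications above can be sketched in Python as follows. The
function name add_ed and the consonant-vowel-consonant test are assumptions
made for this example. The rule is deliberately simplified: it ignores
stress and irregular verbs, so it will double consonants in some words
where English spelling does not.

```python
VOWELS = "aeiou"


def add_ed(verb):
    """Attach the i-suffix ed to a regular verb (simplified spelling rules)."""
    # A final e is dropped, so only d is effectively added: love -> loved.
    if verb.endswith("e"):
        return verb + "d"
    # A consonant-vowel-consonant ending doubles the last letter: fit -> fitted.
    # The letters w, x and y are never doubled.
    if (
        len(verb) >= 3
        and verb[-1] not in VOWELS + "wxy"
        and verb[-2] in VOWELS
        and verb[-3] not in VOWELS
    ):
        return verb + verb[-1] + "ed"
    # Otherwise ed is attached unchanged: walk -> walked.
    return verb + "ed"


print(add_ed("fit"))   # fitted
print(add_ed("love"))  # loved
print(add_ed("walk"))  # walked
```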

d-suffix

A d-suffix, or derivational suffix, enables a new word, often with a
different grammatical category, or with a different sense, to be built from
another word. Whether a d-suffix can be attached is discovered not from
the rules of grammar, but by referring to a dictionary. So in English,
ness can be added to certain adjectives to form corresponding nouns
(littleness, kindness, foolishness ...) but not to all adjectives (not, for
example, to big, cruel, wise ...). d-suffixes can be used to change
meaning, often in rather exotic ways. So in Italian the suffix astro denotes
a sham form of something else:

medico + astro = medicastro = quack doctor

poeta + astro = poetastro = poetaster
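
The point that a d-suffix is licensed by a dictionary rather than by a rule
of grammar can be sketched in Python as below. The names LEXICON, attach and
derive are hypothetical, and the lexicon is a toy set containing only the
words used as examples in this entry. The vowel-dropping step in attach is
an assumption that happens to fit these examples and is not a general rule.

```python
VOWELS = "aeiou"

# Toy dictionary of attested derived words (illustrative only).
LEXICON = {"littleness", "kindness", "foolishness", "medicastro", "poetastro"}


def attach(base, suffix):
    """Join base and suffix, dropping a final vowel before a vowel-initial suffix."""
    if suffix[0] in VOWELS and base[-1] in VOWELS:
        base = base[:-1]
    return base + suffix


def derive(base, suffix, lexicon=LEXICON):
    """Return the derived word if the dictionary attests it, otherwise None."""
    candidate = attach(base, suffix)
    return candidate if candidate in lexicon else None


print(derive("kind", "ness"))     # kindness
print(derive("big", "ness"))      # None
print(derive("medico", "astro"))  # medicastro
```

The lookup is what separates this from the i-suffix case: attach always
produces a string, but only the dictionary decides whether that string is a
word.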

Indo-European languages

Most European and many Asian languages belong to the Indo-European language
group. Historically, the group includes the Latin, Greek, Persian and
Sanskrit of the ancient world. With the rise of the European empires,
languages of this group have also become dominant in the Americas, Australia
and large parts of Africa. Indo-European languages are therefore the main
languages of modern Western culture, and they are all similarly amenable to
stemming.

The Indo-European group has many recognisable sub-groups, for example
Romance (Italian, French, Spanish ...), Slavonic (Russian, Polish,
Czech ...), Celtic (Irish Gaelic, Scottish Gaelic, Welsh ...). The
Germanic sub-group includes German and Dutch. The Scandinavian languages
are also usually classed as Germanic, although for convenience we have
grouped them separately on the Snowball site. English is also classed as
Germanic, although we have classed it separately. This is not for reasons
of narrow chauvinism, but because the suffix structure of English clearly
lies mid-way between the Germanic and Romance groups, and it therefore
requires separate treatment.

Uralic languages

The Uralic languages are spoken mainly in Northern Russia and Europe. They
are divided into Samoyed, spoken mainly in the Siberian region, and
Finno-Ugric, spoken mainly in Europe. Although the number of languages in
the group is substantial, the total number of speakers is relatively small.
The best known Uralic languages are perhaps Hungarian, Finnish and
Estonian. Finnish and Estonian are in fact fairly similar. On the other
hand, Hungarian and Finnish are as different from each other as, say,
French and Persian are in the Indo-European group.

Like the Indo-European languages, the Uralic languages are amenable to
stemming.