Post navigation

Pluralising isiZulu nouns, automatically

Generating the plural of a noun in the singular looked so trivial… one set of ‘boring’ rules from the prefixes listed in the standard table of the noun class system for isiZulu—neatly paired by singular and plural—and you’re done. That is also why we had in the original verbalisation algorithm in the RuleML14 paper just one method [1]. And then came the testing with a set of nouns, where some nouns did not quite stick to that neat table; e.g., indoda ‘man’ is in noun class 9, so ought to take the prefix for noun class 10 as plural, but it is amadoda ‘men’ (noun class 6), or take noun class pair 7 and 8, with prefixes isi- and izi-, for singular and plural, respectively: that generally holds, just not in those cases where the stem commences with a vowel. We bumped into a few of those, requiring us to take a step back and investigate pluralising isiZulu nouns systematically, how well that ‘standard table’ actually works, what can be said about those nouns that don’t quite adhere to it, and whether their underlying causes occur also in other Bantu languages.

Pluralising nouns is nothing novel per se, and the general approach is to take some grammar book and write a bunch of regular expressions or rules, or use a data-oriented approach and find all the rules that way. That doesn’t quite work for isiZulu due to it being an underresourced language, so we ended up designing a set of basic rules manually, and then find other rules through experimentation, i.e., a combined knowledge and data approach.

The knowledge-based part was concerned, first, with the choice whether regular expressions would suffice (syntax-only based approach), designing a few automata by availing of the prefixes in that standard table. This made evident that that was unlikely to work well: some noun classes, e.g., 1 and 3, take the same prefix (um or umu) yet have a different prefix for their plural counterpart (aba for noun class 2 and imi for noun class 4), plus it does not say when it is um and when umu. The latter prefix is used with monosyllabic stems, but we have no way of identifying a monosyllabic stem. Thus, we’d need the semantics (noun class) to go with the regular expression as input, which is unlike any pluralisation algorithm for other languages (that we know of). For the size of the automaton, it doesn’t matter if one checks first the prefix and then the noun’s noun class or vv.

For the experiment, we compiled two set of nouns: one was constructed in English by taking class names for multiple ontologies (set 1), the other was compiled by picking each first-listed noun on the left-hand page of an isiZulu dictionary (set 2). To test for generalizability, we took Runyankore, a language spoken in Uganda, which is in a different subfamily from isiZulu.

So, just how bad is nouns-only, i.e., just some regular expression based on the standard table? The accuracy was about 50%, which is pretty dismal. Just adding a noun’s noun class made the accuracy jump up to about 80-90%! The accuracy of each iteration is shown in the table below.

Accuracy of pluralisation with the incremental versions of the pluraliser (source: [2])

So, what were the other culprits? First, compound nouns, such as indawo yokubhukuda ‘swimming pool’, for it is not only the main noun that has to be pluralised (indawo to izindawo), but the other word still has to be in agreement with the first, so it is not yokubhukuda but zokubhukuda for the plural. We managed to devise 4 additional rules for those cases. Another issue were the mass nouns (amount of matter): they can exist in the singular and stay in the singular (iwayini ‘wine’), or are already in a plural class (amanzi ‘water’). There is no way to recognise those, so this has to be annotated. The plural exceptions (true exceptions) and prefix exceptions (regular ‘exception’) were mentioned earlier in this post, and some new rules were added for that. Adding all that generated 92% accuracy for the second set, and 100% for the first set of nouns. For the remaining errors, we have some ideas for how to resolve it, but linguists first have to check (see paper for details).

The initial results for Runyankore were better than for isiZulu, thanks to fewer recurring prefixes, but, interestingly, Runyankore had similar issues overall, notably with the compound nouns, mass nouns, and true exceptions, and some issues with loan words. Tone, not indicated in the orthography, popped up as well, which also holds for some isiZulu nouns (but they didn’t happen to have been in the test sets).

The data, analysis, and the software (including the source code) are available from the GeNI project page. The isiZulu pluraliser was coded up in Python and the Runyankore one in Java. Note that they are at the proof-of-concept level, not industry-grade tools, but you’re free to take it and make industry-grade level software out of it :-).