NetHack in other languages

NetHack's text output is in English. Although the program's structure does not easily lend itself to localization because English morphology and syntax are hard-wired into the source code on all levels, several localization projects currently exist.

German

Tony Crawford and Karl Breuer have completed a German-localized version called NetzHack (note the 'z'), which runs on Linux, *BSD, and OS X (console and X11), and on Win32 (console and Windows graphics). Source code and binaries are available.

A different German translation attempt by Patric Mueller called NetHack-De was released as a playable, although incomplete, alpha release on 11 October 2007. The latest release includes source code, a Debian package and a graphical Windows binary.

Japanese

The Japanese version JNetHack by Issei Numata has been in existence for several years. For those who don't read Japanese, there's some outdated information in English at jnethack.org.

Spanish

Incomplete or stalled translations

On January 28th 2009, a Chinese translation called nethack-cn was begun on Google Code, but the last update was on June 25th 2009.

A SourceForge project for a French translation called nethack-fr was registered on August 6th 2009. The last update was on October 29th 2009. The project is flagged as no longer under active development; the last commit was on January 20th 2011. There is a French translation of the guidebook and some spoilers.

The first commit of the GitHub project for an Italian translation called nethack-it was on December 4th 2009. The last commit so far was on January 27th 2010.

Internationalization

Ray Chason has launched the NetHack-i18n project, also called Internationalized NetHack, which is aimed at adapting NetHack for easier translation to other languages. The project was active as of April 2013.

Current NetHack localization strategies

The problem

Because NetHack has output text in the form of string literals scattered throughout the code, the customary approach is for the translator to go through the source code and substitute translations for the string literals. What complicates this process is the fact that many messages are composed of elements that can vary with the runtime context. For example, in an output statement like this:

pline("%s hits %s.", objectname, monstername);

the variables "objectname" and "monstername" may be singular or plural, masculine or feminine, and may be introduced by "a" or "the". The words to be inserted must be formed appropriately before the output function call.

At various points in the program, NetHack's output messages vary with second and third person verb forms, singular and plural verb forms, and noun inflections by case, gender, and number.

In English, this is easy: word forms do not change with grammatical gender or case, and most nouns simply change from singular to plural by the addition of a trailing 's'. There is only one form of the definite article ("the"), and two forms of the indefinite article ("a" and "an") which are grammatically equivalent. In other languages, morphology can be much more complex: Spanish has four forms of the definite article, depending on whether a noun is singular or plural, masculine or feminine; German has six.

(As it happens, monsters in NetHack always act or are acted upon singly, not collectively, which simplifies matters sometimes. On the other hand, objects are named differently at different times – by name, by description, or by class – and so an object name or a pronoun that replaces "it" can vary even for the same object.)

Word order can also change depending on certain conditions, such as whether the subject is a common noun, a proper noun or a pronoun.

Furthermore, some languages have mandatory contractions (Spanish contracts the preposition and article "a"+"el" into "al"; French contracts the preposition and article "de"+"le" into "du", etc.).
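As a rough illustration of the article forms and contractions just described, here is a minimal C sketch. It is not taken from any of the translation projects; the names es_article and es_prep_the and the two enums are made up for this example, and exceptions to the usual article rules are ignored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical types for this sketch only. */
enum es_gender { ES_MASC, ES_FEM };
enum es_number { ES_SING, ES_PLUR };

/* The four Spanish definite articles, indexed by [number][gender]. */
static const char *const es_article[2][2] = {
    { "el", "la" },   /* singular */
    { "los", "las" }  /* plural */
};

/* Write preposition + definite article into buf, applying the mandatory
   contractions "a" + "el" -> "al" and "de" + "el" -> "del". */
char *es_prep_the(const char *prep, enum es_gender g, enum es_number n,
                  char *buf, size_t buflen)
{
    assert(prep != NULL && buf != NULL);
    const char *art = es_article[n][g];

    if (strcmp(art, "el") == 0 && strcmp(prep, "a") == 0)
        snprintf(buf, buflen, "al");
    else if (strcmp(art, "el") == 0 && strcmp(prep, "de") == 0)
        snprintf(buf, buflen, "del");
    else
        snprintf(buf, buflen, "%s %s", prep, art);
    return buf;
}
```

With a masculine singular noun, es_prep_the("a", ...) yields "al"; with a feminine singular noun it yields "a la". Even this small case needs the noun's gender and number, which English message code never has to track.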

Some examples of word and sentence morphology in Spanish:

"¡Idefix golpea al orco!" (subject and object are both nouns)

"¡Idefix lo golpea!" (object is a pronoun, and goes before the verb)

"¡Golpeas al orco!" (subject is a pronoun ("tú") and is omitted; verb changes to second person singular)

"¡Lo golpeas!" (both modifications apply)

The message generation must also correctly capitalize after such rules are applied.

Original NetHack contains a few functions to modify linguistic elements for output, such as vtense and makeplural in objnam.c, and s_suffix in hacklib.c. But since English is not a highly inflected language, even these do not actually operate on grammatical categories, but tend to manipulate words by superficial characteristics: an(), for example, chooses between the indefinite article forms "a" and "an" merely on the basis of the following word's first letter, and has no concept even of subject or object case. NetHack's the() function prefixes a definite article to any noun, but it becomes useless in German, for example, because the form of the definite article depends on the noun's gender and number, and on the grammatical case in which it is used.
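The first-letter heuristic described above can be sketched in a few lines of C. This is a simplified stand-in, not NetHack's actual code: the name an_simple is hypothetical, and the sketch omits any special handling of individual words.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Simplified sketch: pick "a" or "an" by looking only at the first
   letter of the following word. No grammatical information is used. */
char *an_simple(const char *word, char *buf, size_t buflen)
{
    assert(word != NULL && buf != NULL);
    int c = tolower((unsigned char) word[0]);
    /* The c != '\0' test matters: strchr would otherwise match the
       terminating NUL of "aeiou" for an empty word. */
    const char *article =
        (c != '\0' && strchr("aeiou", c) != NULL) ? "an" : "a";

    snprintf(buf, buflen, "%s %s", article, word);
    return buf;
}
```

The sketch gives "an orc" and "a kobold", but also "an unicorn", which shows how thin the heuristic is even for English. For a language where the article depends on gender, number, and case, there is nothing in the word's spelling for such a function to work with.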

These technical and grammatical problems are all in addition to the fundamental problems inherent in any translation. NetHack in particular is famous for the humor it incorporates, much of which depends on English wordplay (jokes about pit vipers in pits, for example), idiomatic expressions ("everything but the kitchen sink"), and American cultural references ("core dumped", Keystone Kops, ...). The stock in trade of a translator is to achieve an equivalent tone and mood in the target language. For NetHack, that means translating wordplay where possible, replacing untranslatable puns with others as the opportunity arises, and generally choosing similarly humorous wording in the target language in keeping with the spirit of the original game.

Localization approaches

NetHack-i18n

Internationalized NetHack aims to systematize the process of string replacement using Gettext together with a scriptable printf-like system to handle the grammar bits.

Gettext's grammar support is minimal: it handles plural forms. NetHack-i18n also needs support for such things as changing word order and noun cases, and encodes them in two ways:

by extending the printf-like syntax to include formatters such as %3${g/handsome/beautiful}, where the number after the % is a parameter number (this is a POSIX extension to printf) and the part between the braces is interpreted by a Ruby script; and

by defining "joining rules" at the start and end of each substitution, to handle mandatory contractions and such rules as "a/an".

This is C++ rather than C, and the NHFormat class overloads the << operator and the cast to std::string to make this work; it's rather similar to Boost Format. "%1${Nt$}" means substitute the first parameter, and use a locale-specific formatting with "Nt$" to indicate the specific formatting.

The code for the English locale interprets "Nt$" with a monster parameter as follows:

Initial "n" means the name of the monster;

The "n" is made capital, to indicate the output should be capitalized;

"t" means prefix "the" if appropriate; and

"$" means show the saddle if the monster does not have a name. ("s" would mean "always show the saddle.")

With an object parameter, initial "n" means show the name, and "t" again means use "the" if appropriate.
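The parameter numbers used in these format strings build on the POSIX numbered-argument extension to printf. The plain C sketch below shows only that underlying mechanism, not the NHFormat class or its brace formatters; the function name format_hit and both templates are made up for this example, and the code assumes a C library that supports the %n$ syntax.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a two-argument message from a template that a translator
   could supply. With numbered arguments (%1$s, %2$s) the template
   decides the order in which the arguments appear. */
int format_hit(char *buf, size_t buflen, const char *tmpl,
               const char *subject, const char *object)
{
    assert(buf != NULL && tmpl != NULL);
    return snprintf(buf, buflen, tmpl, subject, object);
}
```

Calling format_hit with "%1$s hits %2$s." and with "%2$s is hit by %1$s." produces the two word orders from the same argument list, and a template may refer to the same numbered argument more than once. Numbered and unnumbered conversions should not be mixed in one template.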

T_() consults the message catalog, which uses the gettext syntax, but does not support plurals. The message catalog for the Spanish locale has this entry:

Note that the first parameter is substituted twice. This is permitted, and indeed very frequent. The substitutions are as follows:

%1${:es_intrans,Nl$,es}: Both the English and the Spanish locales adopt the convention that a format string beginning with a colon names a method in the Ruby code. Thus ":es_intrans,Nl$,es" invokes a method called es_intrans. (The name is a misnomer: you use :es_trans if the direct object is a monster, and :es_intrans otherwise.) The commas (any non-alphanumeric character may be used) delimit parameters to es_intrans. "Nl$" is the formatter for the monster, with "l" indicating the definite article, and "es" is the verb. If the monster cannot be seen, the format routine returns "él" or "Él", and es_intrans omits it and capitalizes the verb if appropriate. (This pattern is overkill for the particular case, as the message does not appear if the monster isn't visible, but it frequently appears elsewhere.)

%1${oa}: "oa" means substitute "o" if the parameter is a masculine noun, or "a" if feminine. There are several other such substitutions, and they may be used with strings or objects – or the hero ("¡Destruid a %0${el} ladr%0${ón}, mi%1${p} mascota%1${p}!").

%2${nl}: Show the name of the object with the definite article.

Spanish NetHack

Spanish NetHack handles grammar rules by coding special routines to handle them, much as the unpatched NetHack does. For example, the output statement in mthrowu.c#line227,

Monnam, the, and xname retain their names from the original code, though "the" in fact uses the appropriate Spanish article. mon_gender returns nonzero if the monster's name is a feminine noun.

NetzHack

NetzHack began with the idea that the developers just wanted to translate, not to rewrite the program. Or, in other words: NetHack is a prime example of how you don't code for localization, and trying to fix that was pretty near hopeless. So the localization strategy was as follows:

Translate string literals in the source code

Create a new data type, usage_t, to contain the usage information of each context in which a noun, adjective or pronoun might appear: number (singular or plural), case (nominative, genitive, dative or accusative), gender (masculine, feminine or neuter), and determiner (the, a/an, this, your, or none).

Write a new module, german.c, with the functions necessary to inflect German nouns and adjectives for a specified usage, and add a dictionary, nouns_de.h, which associates each German noun with a reference to its declension paradigm.

Replace functions that produce an object or monster name, such as doname or mon_nam, with expanded versions that take a usage_t argument.

Write human-readable macros in a new header, german.h, to call those functions with specific values of the usage parameters, then apply the macros as drop-in replacements for the original functions to provide German grammar throughout the code. For example, the output statement in mthrowu.c#line227,

Monnam_nomsing and the_xname_dat are macros that call German grammar-sensitive versions of mon_nam and xname, passing them the appropriate usage parameters for this message. The macro definitions (in german.h) look like this:

The replacement functions, with names ending in '-g' for German, take the same arguments as the original naming functions (in this case, a pointer to a monster or object structure), plus a usage argument that specifies number, gender, case and determiner. In our example, the noun phrase that designates the monster must be in the nominative case, singular, and capitalized; the noun phrase for the thrown object must be in the dative case and have a definite article. The grammatical gender depends on the exact word that ends up being used to designate the monster or object, so it is specified as "unknown" in these function calls. (Actually, NetzHack hijacks the names of the original functions in extern.h to make them point to the nominative-singular macros, so that the original Monnam(mtmp) call above doesn't really need to be edited at all.) Since the determiner is a necessary part of the usage parameter – that is, it influences the form of the noun – the nested call the(xname(...)), a frequent occurrence in NetHack, is always replaced (as in the example) with a single function call via one of the macros the_xname_{nom, gen, dat, acc} (for nominative, genitive, dative or accusative case).
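To make the idea of a usage parameter concrete, here is a minimal C sketch of a usage_t-like structure and a lookup of the German definite article. The field names, enum names, and the function de_definite_article are assumptions made for this example; NetzHack's real definitions may differ, and unlike NetzHack this sketch takes the gender as an explicit input instead of looking it up in a dictionary.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enums and layout; only the four categories themselves
   (number, case, gender, determiner) come from the description above. */
enum g_number { NUM_SING, NUM_PLUR };
enum g_case   { CASE_NOM, CASE_GEN, CASE_DAT, CASE_ACC };
enum g_gender { GEN_MASC, GEN_FEM, GEN_NEUT };
enum g_det    { DET_NONE, DET_THE, DET_A, DET_THIS, DET_YOUR };

typedef struct {
    enum g_number number;
    enum g_case   gcase;
    enum g_gender gender;
    enum g_det    det;
} usage_t;

/* German definite articles in the singular, indexed by [case][gender]. */
static const char *const de_the_sing[4][3] = {
    { "der", "die", "das" },  /* nominative */
    { "des", "der", "des" },  /* genitive   */
    { "dem", "der", "dem" },  /* dative     */
    { "den", "die", "das" }   /* accusative */
};

/* In the plural the article does not depend on gender. */
static const char *const de_the_plur[4] = { "die", "der", "den", "die" };

const char *de_definite_article(usage_t u)
{
    assert(u.det == DET_THE);
    return u.number == NUM_PLUR ? de_the_plur[u.gcase]
                                : de_the_sing[u.gcase][u.gender];
}
```

The tables contain six distinct forms (der, die, das, des, dem, den), and choosing among them requires every field of the usage value. This is why a nested call such as the(xname(...)) cannot simply be translated piece by piece.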

The frequent dictionary look-ups to determine the necessary declension pattern for each monster or object noun used might be a drawback if computing power had not grown tremendously since NetHack was young. Recent look-ups are cached, though, which is especially helpful since nouns are often repeated in output in a given game context. There are 1622 nouns in the dictionary.
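One simple way to cache recent look-ups is a small ring of recently found entries in front of the full dictionary search. The C sketch below illustrates that general technique only; it is not NetzHack's code, and the dictionary contents, paradigm numbers, cache size, and all identifiers are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct noun_entry { const char *noun; int paradigm; };

/* Tiny stand-in dictionary; the paradigm numbers are made up. */
static const struct noun_entry dict[] = {
    { "Ork", 1 }, { "Schwert", 2 }, { "Katze", 3 }
};

unsigned long slow_lookups; /* counts full dictionary scans */

static const struct noun_entry *dict_scan(const char *noun)
{
    slow_lookups++;
    for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(dict[i].noun, noun) == 0)
            return &dict[i];
    return NULL;
}

#define CACHE_SIZE 4
static const struct noun_entry *cache[CACHE_SIZE];
static size_t cache_next; /* slot to overwrite next */

/* Look in the cache first; fall back to the full scan and remember
   the result so that repeated nouns are found quickly. */
const struct noun_entry *noun_lookup(const char *noun)
{
    assert(noun != NULL);
    for (size_t i = 0; i < CACHE_SIZE; i++)
        if (cache[i] != NULL && strcmp(cache[i]->noun, noun) == 0)
            return cache[i];

    const struct noun_entry *e = dict_scan(noun);
    if (e != NULL) {
        cache[cache_next] = e;
        cache_next = (cache_next + 1) % CACHE_SIZE;
    }
    return e;
}
```

Looking up "Ork" twice in a row performs only one full scan; the second call is answered from the cache.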

The minimal-effort strategy does not bring the game any closer to UTF-8 compatibility; however, since the changes from the original program structure are limited, there might be hope of patching in a future UTF-8 port of NetHack without too much adaptation.

Monster and object names

The English names of monsters and objects are string literals in monst.c and objects.c. The NetHack build process compiles and invokes the utility makedefs to convert these names into preprocessor symbols, contained in the files include/pm.h and include/onames.h. The program then identifies objects and monsters by the numeric constants associated with those preprocessor symbols. The problem for translation is therefore that changing the names in monst.c and objects.c would change the preprocessor symbols, and almost every other part of NetHack would then have to be edited accordingly.

Spanish NetHack and NetHack-de solve this problem by replacing each string in monst.c and objects.c with a preprocessor symbol, and providing new headers to substitute either the original English or translated names for these symbols. In this way, distinct versions of objects.o and monst.o are built with the names in English and in the target language.
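The symbol-substitution idea can be sketched as follows. This is a simplified illustration, not code from either project: the symbols NAM_ORC, NAM_KITTEN, and LANG_ES, the struct monster_kind, and the table mons_sketch are hypothetical, and the small enum merely stands in for the constants that makedefs would generate.

```c
#include <assert.h>
#include <string.h>

/* In a real build each branch would live in its own header, and the
   build would select one of them. LANG_ES is a made-up switch. */
#ifdef LANG_ES
#define NAM_KITTEN "gatito"
#define NAM_ORC    "orco"
#else
#define NAM_KITTEN "kitten"
#define NAM_ORC    "orc"
#endif

struct monster_kind { const char *name; char symbol; };

/* The table refers to names only through the symbols, so the same
   source builds with either language. */
static const struct monster_kind mons_sketch[] = {
    { NAM_KITTEN, 'f' },
    { NAM_ORC,    'o' }
};

/* Stand-in for generated constants: these stay the same in every
   language because they no longer derive from the displayed name. */
enum { PM_KITTEN, PM_ORC };
```

Compiled without LANG_ES, mons_sketch[PM_ORC].name is "orc"; compiled with it, the same index yields "orco", and the rest of the code that uses PM_ORC is unaffected.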

NetzHack, on the other hand, adds an element to the object and monster data types, struct obj and struct mon, so that each kind of monster and object has both its translated German name and, invisibly to the user, its original English name too. Thus pm.h and onames.h are generated using the original names as before.

NetHack-i18n, because it has Gettext available, leaves the monster and object tables in English and converts them at run time. Another approach might be to bite the bullet and replace the preprocessor symbols in pm.h and onames.h with their translated versions. No known translation takes this approach.

Input parsing

The largest problem here is support for wishes. Every translation must rewrite the readobjnam function to parse an object name according to the rules of the target language.

NetHack-i18n first removes the dungeon feature wishes, replacing them with a new extended command, called "dfeature" in the English locale; and then splits the rest into a parser, which is placed in the Ruby script, and a rule-enforcer, which remains in the core code.

Character sets

ASCII is inadequate for most languages other than English. All translations use a larger character set for messages. Case mappings and fuzzy matches for wishes and other inputs must take the character set into account; if the user wishes for "cota de escamas de dragon gris", he should get a gray dragon scale mail, even though the correct spelling is "dragón".

JNetHack uses EUC-JP, with tests in the code to detect if the source has been converted to Shift-JIS; EUC-JP is adapted for Unix-like environments, and Shift-JIS for Microsoft Windows.

Spanish NetHack encodes all messages in ISO-8859-1, while leaving the map symbols in code page 437. Reduced IBMgraphics modes are available for users who do not have code page 437 configured. Slight hackery is needed to support the different character sets, because map symbols can appear outside the map in three places:

As NetHack-i18n is meant to be language-neutral, it uses Unicode throughout. Any user input is encoded in Unicode, and user interfaces are expected to support it. The TTY interface is abandoned in favor of a modified Curses interface, and the Curses library must support wide characters.

NetHack-De encodes all messages in ISO-8859-1. As a result, IBMgraphics doesn't work (because it uses a different character set), although DECgraphics does. User wishes are normalized before being parsed so that the user can enter wishes in any charset: to wish for "Rüstung" ("armor"), for example, the user may type "ruestung" in ASCII (the German letter ü originated as a combination of 'u' and 'e', hence "ue" is a conventional alternative where ü is not available), or "Rüstung" in ISO-8859-1, or "RÃ¼stung" in UTF-8 (shown here as the UTF-8 bytes appear when read as ISO-8859-1). (This feature is part of a preliminary UTF-8 support: a UTF-8 capable terminal would show "Rüstung", but be unable to display umlauts in the rest of NetHack-De's ISO-8859-1-encoded messages.)
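The normalization step can be sketched in C as follows. This is an illustration under stated assumptions, not NetHack-De's implementation: the function names are hypothetical, the chosen canonical form (lowercase ASCII with "ae", "oe", "ue", "ss") is one possible choice, and a byte 0xC3 followed by a byte in the range 0x80 to 0xBF is always treated as UTF-8, which would misread the rare ISO-8859-1 input that contains that byte pair.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Map the Latin-1 code point of a German special letter to its ASCII
   transliteration, or return NULL for any other value. */
static const char *de_translit(unsigned cp)
{
    switch (cp) {
    case 0xC4: case 0xE4: return "ae";
    case 0xD6: case 0xF6: return "oe";
    case 0xDC: case 0xFC: return "ue";
    case 0xDF:            return "ss";
    default:              return NULL;
    }
}

/* Normalize ASCII, ISO-8859-1 or UTF-8 input to lowercase ASCII. */
char *normalize_wish(const char *in, char *out, size_t outlen)
{
    assert(in != NULL && out != NULL && outlen > 0);
    const unsigned char *p = (const unsigned char *) in;
    size_t n = 0;

    /* Keep room for up to two output bytes plus the terminator. */
    while (*p != '\0' && n + 2 < outlen) {
        unsigned cp = *p++;
        /* UTF-8 bytes 0xC3 0x80..0xBF encode U+00C0..U+00FF. */
        if (cp == 0xC3 && *p >= 0x80 && *p <= 0xBF)
            cp = 0xC0u + (*p++ - 0x80u);

        const char *t = de_translit(cp);
        if (t != NULL) {
            out[n++] = t[0];
            out[n++] = t[1];
        } else {
            out[n++] = (char) tolower((int) cp);
        }
    }
    out[n] = '\0';
    return out;
}
```

All three spellings of the example wish ("ruestung", the ISO-8859-1 bytes, and the UTF-8 bytes) normalize to the same string, "ruestung", so the parser only has to deal with one form.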

NetzHack is also in ISO-8859-x. The MS Windows console version actually uses two charsets (or "code pages" in Microspeak): the dungeon map is drawn in the system's default code page, while the Windows 1252 code page, containing the German characters ÄÖÜäöüß, is used for text messages.