Why Language Tags?

In the online archive world, there are two primary reasons for associating documents with specific languages: to facilitate global technology and to facilitate metadata search in archives. Although both reasons are valid, they are by no means identical. In some situations, one goal may be more important than the other.

Facilitate Global Technology

How do you select the right spell checker to use (French vs. English), the right font (Arabic vs. Urdu), the right way to pronounce c'est la vie (French "say la vee" vs. English "sest la v-eye"), or the right set of "Quote Marks" (English) vs. «Quote Marks» (Spanish)?

You tag documents with a language and write utilities that behave differently depending on the language identified. This allows the same product (e.g. Microsoft Word) to be used with plug-in spell checkers for different languages.
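As a rough illustration of language-dependent behavior, the sketch below picks quotation marks from a document's language tag. The table QUOTE_MARKS and the function quote are hypothetical names invented for this example, and falling back to English marks for an unknown language is an assumption, not a rule from any standard.

```python
# Hypothetical lookup table: quotation marks keyed by language code.
QUOTE_MARKS = {
    "en": ("\u201c", "\u201d"),  # English curly double quotes
    "es": ("\u00ab", "\u00bb"),  # Spanish angle quotes
}


def quote(text: str, lang: str) -> str:
    """Wrap text in the quotation marks associated with a language tag."""
    # Use only the first subtag, so "es-MX" behaves like "es".
    primary = lang.split("-")[0].lower()
    # Assumption: unknown languages fall back to the English marks.
    opening, closing = QUOTE_MARKS.get(primary, QUOTE_MARKS["en"])
    return f"{opening}{text}{closing}"


print(quote("Quote Marks", "en"))
print(quote("Quote Marks", "es-MX"))
```

A spell checker or speech synthesizer could be selected the same way, with the language tag acting as the key into a table of language-specific resources.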

The caveat is that usually only written languages are targeted for these kinds of utilities. For instance, Microsoft has utilities for standard American English and standard British English, but not for spoken varieties such as Brooklyn English. Although a "Brooklyn" spellchecker and a "Brooklyn" speech synthesizer could be programmed, many native "Brooklyn" speakers would probably find them condescending and not use them.

Facilitate Metadata Search in Archives

Aside from spellcheckers and speech synthesizers, researchers into specific dialects or historical forms need a way to tag their material into very narrow categories that would be irrelevant to most software vendors.

The caveat here is that a tag may be registered, but only supported by a very narrow range of specialized applications. An example of this would be the need for a Celtic database to distinguish Gaulish (xcg) vs. Celtiberian (xce) - two distinct ancient Celtic languages. On the other hand, it is unlikely that any speech synthesizer will pronounce words from these languages correctly.

When deciding how to tag documents, it may be important to consider whether you are tagging for general usage or for a narrow research purpose.

Some Common Language Codes

Language codes are primarily taken from the list of ISO-639 language codes. Some common codes, including those for all the languages taught at Penn State, are listed below. For the most part, they are based on the native name of the language (e.g. es for Spanish, from Español).

This language code list has recently been expanded from an older two-letter set to a three-letter set (e.g. "eng" for English). Therefore, some languages (particularly ancient languages) may be listed with only a three-letter code.

The By Language pages list the codes for each language, but common codes are listed below.

Commonly Taught Languages

European Languages

en: English

es: Spanish

fr: French

it: Italian

pt: Portuguese

de: German

ru: Russian

Non-European Languages

ar: Arabic

zh: Chinese (Mandarin)

he: Hebrew

ja: Japanese

ko: Korean

sw: Swahili

Ancient Languages

grc: Ancient Greek (vs. el: Modern Greek)

la: Latin

he: Hebrew

ang: Old English (Anglo-Saxon)

enm: Middle English

Other Codes

These are codes where the language name diverges significantly from English.

sq: Albanian

hy: Armenian

eu: Basque

nl: Dutch

ka: Georgian

gd: Scottish Gaelic

ga: Modern Irish

fa: Persian (Farsi)

bo: Tibetan

cy: Welsh

Note on Screen Reader Support: Only the most recent versions of JAWS and Home Page Reader support the LANG attribute for French, Spanish, Portuguese, German and Finnish. To support other languages, it is recommended that users install plug-ins or screen reader software designed for those languages.

Specifying Language Dialects and Varieties

Language codes can be followed by an optional variety code, but note that not all codes are recognized by all vendors and that the line between "language" and "dialect" can be very fuzzy in some situations.

By Country

Until recently, the only way most vendors (e.g. Microsoft or Apple) distinguished varieties of a language was by attaching an ISO-3166 country code after the language code. Although some "country codes" can be linguistically inaccurate, they may be the most standardized.

en-US: American English

en-GB: British English

es-ES: Castilian Spanish (Spain)

es-MX: Mexican Spanish (often standing in for Latin American Spanish)
See also es-419 for Latin American Spanish
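The language-plus-country pattern above can be sketched as a small parser. This is a minimal sketch that assumes a tag consists of a two- or three-letter language code optionally followed by a region that is either two letters (a country code such as MX) or three digits (as in es-419). The name split_language_region is hypothetical, and the function deliberately rejects anything more complex.

```python
import re

# Assumed shapes: a 2-3 letter language code, optionally followed by a region
# that is either two letters (country code) or three digits (as in es-419).
LANG_REGION = re.compile(r"^([A-Za-z]{2,3})(?:-([A-Za-z]{2}|[0-9]{3}))?$")


def split_language_region(tag: str):
    """Return (language, region); region is None when the tag has none."""
    match = LANG_REGION.match(tag)
    if match is None:
        raise ValueError(f"not a simple language-region tag: {tag!r}")
    language, region = match.groups()
    # Conventional casing: lowercase language, uppercase region.
    return language.lower(), (region.upper() if region else None)


print(split_language_region("es-MX"))   # ('es', 'MX')
print(split_language_region("es-419"))  # ('es', '419')
print(split_language_region("en"))      # ('en', None)
```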

RFC 4646 Tag Syntax

Recently, there has been an attempt to codify other types of regional varieties as part of the RFC 4646 project, but it is still a work in progress. Below are some guidelines for forming different types of varieties, but note that not all of them may be registered.

Check Registry First: Before using any subtag, confirm that it has been registered in the IANA Language Subtag Registry. Otherwise, assume that you may be the only one using it.
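One way to automate this check is to load the registry's records and test membership. The sketch below is illustrative only: SAMPLE_REGISTRY is a hand-made three-record excerpt written in the registry's record layout ("%%" separators with Type, Subtag and Description fields), not a copy of the real file, and load_subtags and is_registered are hypothetical helper names. The real registry contains more fields and wrapped lines, which this sketch ignores.

```python
# Hand-made sample in the registry's record layout; NOT the real registry.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: en
Description: English
%%
Type: script
Subtag: Latn
Description: Latin
%%
Type: region
Subtag: US
Description: United States
"""


def load_subtags(registry_text: str) -> set:
    """Collect (type, lowercase subtag) pairs from registry-style text."""
    known = set()
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Type" in fields and "Subtag" in fields:
            known.add((fields["Type"], fields["Subtag"].lower()))
    return known


def is_registered(kind: str, subtag: str, known: set) -> bool:
    """Case-insensitive membership test against the loaded subtags."""
    return (kind, subtag.lower()) in known


KNOWN = load_subtags(SAMPLE_REGISTRY)
print(is_registered("region", "US", KNOWN))          # True
print(is_registered("variant", "baltimore", KNOWN))  # False
```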

By Script

If a language can be written in more than one script, you may need to specify which script is in use. Some of these script tags are implemented in modern software systems such as Windows Vista. Common examples (all of which are registered) include:

az-Arab - Azerbaijani, Arabic script

az-Cyrl - Azerbaijani, Cyrillic script

az-Latn - Azerbaijani, Latin script

bs-Cyrl - Bosnian, Cyrillic script

bs-Latn - Bosnian, Latin script

zh-Hans - Simplified Chinese script

zh-Hant - Traditional Chinese script
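The tags above follow a casing convention: lowercase language code, script code with an initial capital, and uppercase country code. The sketch below applies that convention by looking only at the length of each subtag. The name normalize_case is hypothetical, and the length-based guess is a simplification that ignores private-use and extension subtags.

```python
def normalize_case(tag: str) -> str:
    """Apply conventional casing: language lower, Script title, REGION upper."""
    subtags = tag.split("-")
    result = [subtags[0].lower()]  # the first subtag is the language code
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())  # four letters: script, e.g. Hant
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())  # two letters: country, e.g. TW
        else:
            result.append(subtag.lower())  # anything else stays lowercase
    return "-".join(result)


print(normalize_case("ZH-hant-tw"))  # zh-Hant-TW
print(normalize_case("az-latn"))     # az-Latn
```

Casing carries no meaning in a language tag, so this is only a readability convention; comparisons between tags are better done case-insensitively.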

Many languages written in multiple scripts have IANA-registered script variants, but not all of them do. If the script variant for your language does not exist, then the following script subtags can be used.

Another theoretical example could be en-021 (American and Canadian English), although this variant is NOT registered.

Unregistered Codes

If you need a code not registered with the IANA, you can create new codes following suggested guidelines, but you may need to add an x- prefix to indicate that it is unregistered.

By the way, anyone can request a new variant code, but the process is lengthy.

Dialects within a Country

RFC 4646 permits codes to be combined. So if you need to specify the Baltimore dialect of English, you could create a code such as

en-US-Baltimore (theoretical)

Please note that no regional varieties from the United States are registered with the IANA (and only three from Britain).

Thus you can either use the code x-en-US-Baltimore to indicate it is not registered or just en-US-Baltimore depending on your needs. It is very likely most software packages would interpret the string as just en-US.
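The fallback behavior described here can be sketched as a simple truncation strategy: drop subtags from the right until a supported tag is found. This is an assumption about how a typical package might behave, not a description of any specific product. The function lookup and the set SUPPORTED are hypothetical.

```python
from typing import Optional


def lookup(tag: str, supported: set, default: Optional[str] = None) -> Optional[str]:
    """Drop trailing subtags until a supported tag is found."""
    subtags = tag.split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate.lower() in supported:
            return candidate
        subtags.pop()  # remove the last subtag and try again
    return default


# Hypothetical set of tags a package supports, stored in lowercase.
SUPPORTED = {"en-us", "en-gb", "es"}

print(lookup("en-US-Baltimore", SUPPORTED))    # en-US
print(lookup("es-MX", SUPPORTED))              # es
print(lookup("x-en-US-Baltimore", SUPPORTED))  # None
```

Under this strategy the prefixed form never reaches en-US, because the x subtag comes first and every truncation still begins with it. That may be a reason to prefer the unprefixed form when fallback to en-US matters.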

By Era

RFC 4646 does not specify how to indicate a time period within a particular language, but some registered codes indicate the dates when spelling changes in a language were enacted. Some examples include: