In this article we refer to the value of a language attribute such as fr-CA as a language tag. The fr and CA parts are referred to as subtags when described as parts of a tag. When described as members of an ISO list of languages or countries, fr and CA are referred to as codes.

Language tags are used to indicate the language of text or other items in HTML and XML documents. Use the lang attribute to specify language tags in HTML, and the xml:lang attribute for XML.

In both cases, language information is inherited by elements inside the one where the declaration was made, unless one of those elements declares a different language (in the same way).
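The inheritance rule can be illustrated with a short Python sketch that walks an XML tree and reports the language in effect for each element. It uses the standard library's xml.etree.ElementTree; the sample document and the function name effective_languages are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (element, language) pairs, applying xml:lang inheritance."""
    # An element's own declaration wins; otherwise it inherits from its parent.
    lang = element.get(XML_LANG, inherited)
    yield element, lang
    for child in element:
        yield from effective_languages(child, lang)


# Hypothetical sample document, used only for this illustration.
doc = ET.fromstring(
    '<doc xml:lang="en">'
    "<p>Hello</p>"
    '<p xml:lang="fr-CA">Bonjour <em>tout le monde</em></p>'
    "</doc>"
)
languages = [(el.tag, lang) for el, lang in effective_languages(doc)]
print(languages)
```

In this sample the first p element inherits en from the root, while the em element inherits fr-CA from the paragraph that redeclares the language.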

RFCs are what the IETF calls its specifications. Each RFC has a unique number. Unfortunately, it is not possible to tell, when reading RFC 1766 or RFC 3066, that these specifications have been obsoleted and replaced by other specifications.

You used to find subtags by consulting the lists of codes in various ISO standards, but
now you can find all subtags in the IANA Language Subtag Registry. We will describe the new registry below.

Note! If you want step-by-step guidance for choosing a language tag, you should read Choosing a language tag. What follows here provides more of a high-level overview of the syntax and concepts involved in language tags, as described by BCP 47.

Most language tags consist of a two- or three-letter language subtag. Often this is followed by a two-letter or three-digit region
subtag. RFC 5646 also allows for a number of additional subtags, where needed. These will be explained briefly in the next section, and include
extended language, script, variant, extension and private-use subtags.

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other
subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless
there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.

Examples:

Code    | Language                               | Subtags
en      | English                                | language
mas     | Masai                                  | language
fr-CA   | French as used in Canada               | language+region
es-419  | Spanish as used in Latin America       | language+region
zh-Hans | Chinese written with Simplified script | language+script

HTML and XML also provide a means to prevent inheritance of language using the empty string, i.e. xml:lang="". Essentially, this says: I do not want to associate any language with this information.

The remainder of this article provides additional detail on how to construct language tags.

Some of the key differences between RFC 5646 and earlier specifications such as RFC 3066 are:

there is just one place to look for valid subtags, the new IANA registry

subtags have fixed positions and lengths, which makes for easier matching of language tags

there is more flexibility around the potential components of a language tag.

RFC 3066 essentially allowed you to compose language tags that were either a language
code on its own, a language code plus a country code, or one of a small number of specially registered values in the IANA language tag registry.

RFC 5646 caters for more types of subtag, and allows you to combine them in
various ways. While this may appear to make life much more complicated, choosing language tags will generally continue to be a simple matter;
where you need additional power, it will be available to you. In fact, for most people, RFC 5646 should make life simpler in
a number of ways: for one thing, there is now only one place you need to look for valid subtags.

Although it provides some additional options for identifying common language variations, RFC 5646 includes all of
the tags that were previously valid. If you have been using RFC 1766, RFC 3066, or RFC 4646 you do not need to make any changes to your tags.

The list below shows the various types of subtag that are available. We will work our way through these and how they are used in the
sections that follow.

language-extlang-script-region-variant-extension-privateuse
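The ordering above, together with the subtag lengths given in the sections that follow, can be turned into a rough parser. The Python sketch below is a simplification, not the full RFC 5646 grammar: it accepts at most one extlang subtag, ignores grandfathered tags, and checks only the shape of a tag, not whether its subtags exist in the registry. The variant lengths (five to eight characters, or four starting with a digit) are taken from RFC 5646. The names LANGTAG and parse_language_tag are invented for illustration.

```python
import re

# Simplified shape of a language tag; positions and lengths identify each part.
LANGTAG = re.compile(
    r"""^
    (?P<language>[a-z]{2,3})
    (?:-(?P<extlang>[a-z]{3}))?
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*)
    (?:-x(?P<privateuse>(?:-[a-z0-9]{1,8})+))?
    $""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_language_tag(tag):
    """Split a tag into named parts, or return None if the shape is wrong."""
    match = LANGTAG.match(tag)
    if match is None:
        return None
    # Keep only the parts that are actually present in this tag.
    parts = {name: value for name, value in match.groupdict().items() if value}
    if "variants" in parts:
        parts["variants"] = parts["variants"].strip("-").split("-")
    for name in ("extensions", "privateuse"):
        if name in parts:
            parts[name] = parts[name].lstrip("-")
    return parts


print(parse_language_tag("zh-Hans"))
print(parse_language_tag("sl-IT-nedis"))
print(parse_language_tag("de-DE-u-co-phonebk"))
```

Because each kind of subtag has a distinctive length and position, the parser never needs to consult the registry to tell a script subtag from a region subtag.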

The entries in the registry follow certain conventions with regard to upper and lower letter-casing. For example, language subtags are lower case,
alphabetic region subtags are upper case, and script subtags begin with an initial capital. This is only a convention! When you use these subtags you
are free to do as you like, unless you are constrained by the rules of the system you are working with. For HTML and XML language markup, the case should not matter.
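If you want tags to look consistent, or want to compare them as plain strings, you can normalize them to these casing conventions. The following Python sketch assumes well-formed input. The function name format_case is hypothetical, and the position-based rule it applies (two-letter subtags in upper case and four-letter subtags in title case, unless they come first or follow a singleton) reflects the formatting recommendation in RFC 5646.

```python
def format_case(tag):
    """Return the tag with the registry's casing conventions applied."""
    subtags = tag.lower().split("-")
    result = [subtags[0]]  # the primary language subtag stays lower case
    after_singleton = False
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # Everything after a singleton (extension or x) stays lower case.
            after_singleton = True
        elif not after_singleton:
            if len(subtag) == 2:
                subtag = subtag.upper()  # alphabetic region subtag
            elif len(subtag) == 4 and subtag.isalpha():
                subtag = subtag.capitalize()  # script subtag
        result.append(subtag)
    return "-".join(result)


print(format_case("FR-ca"))
print(format_case("zh-HANS"))
```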

As mentioned above, you used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all
subtags in one place. The IANA registry looks a little complicated at first,
compared to the ISO code lists, but it is easy enough to use once you understand its structure.

The registry is a long text file. To find a language subtag, search the page for the name of that language, in English. If we search
for 'French', we find a record that looks like this:

Note that the type of this record is language. What you are looking for is the code labeled Subtag, which indicates a value of fr.
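The same lookup can be scripted. The Python sketch below assumes the registry's record layout: records separated by lines containing %%, one "Field: value" pair per line, and continuation lines that begin with whitespace. The two records in SAMPLE were typed in by hand for illustration rather than fetched from IANA, so check them against the live registry before relying on them. The function names are invented.

```python
import re

# Hand-typed illustrative excerpt in the registry's record format.
SAMPLE = """\
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: region
Subtag: CA
Description: Canada
Added: 2005-10-16
"""


def parse_records(text):
    """Parse registry text into a list of {field: [values]} dictionaries."""
    records = []
    for block in re.split(r"^%%[ \t]*$", text, flags=re.MULTILINE):
        record, field = {}, None
        for line in block.splitlines():
            if not line.strip():
                continue
            if line[0].isspace() and field:
                # Continuation of the previous field's value.
                record[field][-1] += " " + line.strip()
                continue
            field, _, value = line.partition(":")
            field = field.strip()
            # Fields such as Description may repeat, so values are lists.
            record.setdefault(field, []).append(value.strip())
        if record:
            records.append(record)
    return records


def find_subtag(records, record_type, description):
    """Return the subtag of the given type whose description matches."""
    for record in records:
        if record.get("Type") == [record_type] and description in record.get("Description", []):
            return record["Subtag"][0]
    return None


records = parse_records(SAMPLE)
print(find_subtag(records, "language", "French"))
print(find_subtag(records, "region", "Canada"))
```

Checking the Type field in find_subtag mirrors the advice to confirm that the record you found is of the type you were looking for.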

You can find other subtags in the same way. For example, to create a tag fr-CA (French as used in Canada), you would next search for Canada, and check that you had found a subtag of type region.

There are, however, some additional things you need to bear in mind when choosing subtags. For example, you should avoid subtags that are described in the registry as redundant or deprecated, and you need to use variant subtags in combination with certain other prescribed subtags. For more information about choosing subtags, read Choosing a language tag.

ast (Asturian - no two-letter code exists for Asturian in the ISO lists)

These codes come from, and are kept up to date with, ISO 639 language codes.

Because RFC 3066 didn't provide a list of valid subtags
and just referred users to ISO 639, there was sometimes confusion about how to tag languages when the ISO code lists contained both two-letter and
three-letter codes (and sometimes more than one three-letter code). Now all valid subtags are listed in a single IANA registry, which adopts only one value from the ISO lists per language. If
a two-letter ISO code is available, this will be the one in the registry. Otherwise the registry will contain one three-letter code. This should make
things simpler.

When RFC 5646 was published, over 7,000 new ISO 639-3 three-letter codes were added to the Subtag Registry.

This is an example of the primary language subtag for Spanish, es, in the registry:

We will refer to extended language subtags as extlang subtags. An extlang subtag must always be preceded by a specific primary language subtag, there can only be one in a language tag, and it comes before any other subtags.

Examples of language tags including extlang subtags are:

zh-yue (Cantonese Chinese)

ar-afb (Gulf Arabic)

Language+extlang combinations are provided to accommodate legacy language tag forms. However, there is a single language subtag available for every language+extlang combination. That language subtag should be used rather than the language+extlang combination, where possible. For example, use yue rather than zh-yue for Cantonese, and afb rather than ar-afb for Gulf Arabic, if you can.

Extlang subtags are always three letters long. Each extlang entry in the registry contains a Prefix field that specifies the language that must precede the extlang subtag. Entries also include a Preferred-Value field that indicates the equivalent language tag.
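As a rough illustration of how the Prefix and Preferred-Value fields can be used, the Python sketch below rewrites a language+extlang pair as the single preferred language subtag. The EXTLANG table holds only the two examples mentioned in this article; a real implementation would load every extlang record from the registry. The names EXTLANG and replace_extlang are hypothetical.

```python
# Two extlang entries from this article; a full table would come from the registry.
EXTLANG = {
    "yue": {"prefix": "zh", "preferred": "yue"},
    "afb": {"prefix": "ar", "preferred": "afb"},
}


def replace_extlang(tag):
    """Replace a language+extlang pair with its preferred language subtag."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        entry = EXTLANG.get(subtags[1].lower())
        # The extlang only applies when it follows its registered prefix.
        if entry and subtags[0].lower() == entry["prefix"]:
            subtags[0:2] = [entry["preferred"]]
    return "-".join(subtags)


print(replace_extlang("zh-yue"))
print(replace_extlang("zh-yue-HK"))
```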

This is an example of the extlang code for Gulf Arabic, afb, in the registry:

Macrolanguages

The primary language subtags used with an extlang subtag are known as macrolanguages, and encompass a number of languages with more specific primary language subtags. The macrolanguage subtag can be used on its own, but unless there is some convention about its meaning in the context where it is used, it is not necessarily precise enough.

For example, zh means Chinese, but it covers many Chinese dialects, often mutually incomprehensible. When zh is used on its own, it is usually used to mean the predominant language in the encompassed range, although this is not explicitly specified in BCP 47. For example, conventionally zh is considered to represent the predominant, Mandarin form of Chinese. Where absolute clarity is needed you can use cmn instead, as long as that doesn't break interoperability. However, if you are using zh to represent a language which is not Mandarin, such as Hakka Chinese, you are better off using the explicit code (in that case, hak).

On the other hand, zh-Hans uses zh in its generic sense. This is a useful way to describe writing in Simplified Chinese, since Chinese tends to be written in the same way, regardless of the dialect of the reader.

az-Latn (Azerbaijani, written in Latin script - since Azerbaijani can also be written using the Arabic script)

The script subtag was first introduced in RFC 4646. The subtags come from, and are kept up to date with, the list of ISO 15924 script codes.

Only one script subtag can appear in a language tag, and it must immediately follow the language or any extlang subtag. It is always four letters
long.

You should only use script subtags if they are necessary to make a distinction you need. As RFC 4646 co-author Addison
Phillips writes, "For virtually any content that does not use a script tag today, it remains the best practice not to use one in the future".

If you specifically want to indicate that content is not written, there is a subtag for that. For example, you could use en-Zxxx to make it clear that an audio recording in English is not written content.

Actually, many language subtag entries in the registry strongly discourage the use of script subtags by including a Suppress-Script field. There is such a field in the Spanish example above, which indicates that Spanish is normally written using Latin script, and so the Latn subtag should normally not be used with es.
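A tool that generates or cleans up tags could apply the Suppress-Script information automatically. The Python sketch below is one possible approach, assuming tags without extlang subtags. The SUPPRESS_SCRIPT table is a tiny hand-picked sample (es is the case discussed here; en and ru are added for illustration), and a real table would be built from the registry. The function name is invented.

```python
# Illustrative sample of Suppress-Script values; load the full set from the registry.
SUPPRESS_SCRIPT = {"es": "Latn", "en": "Latn", "ru": "Cyrl"}


def drop_suppressed_script(tag):
    """Remove a script subtag that the registry says is redundant."""
    subtags = tag.split("-")
    # A script subtag is four letters long and directly follows the language.
    if len(subtags) >= 2 and len(subtags[1]) == 4 and subtags[1].isalpha():
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower(), "")
        if suppressed.lower() == subtags[1].lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("es-Latn-MX"))
print(drop_suppressed_script("zh-Hans"))
```

Tags such as zh-Hans pass through unchanged, because the script subtag carries information there that the language subtag does not imply.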

This example shows the registry entry for Cyrillic script, Cyrl, used for languages such as Russian:

Although for common uses of language tags it is not likely that you will need to specify the script, there are one or two situations
that have been crying out for it for some time. One such example is Chinese. There are many Chinese dialects, often mutually unintelligible, but
these dialects are all written using either Simplified or Traditional Chinese script. People typically want to label Chinese text as either
Simplified or Traditional, but until recently there was no way to do so. People had to bend something like zh-CN (meaning Chinese as spoken in China)
to mean Simplified Chinese, even in Singapore, and zh-TW (meaning Chinese as spoken in Taiwan) for Traditional Chinese. (Other people, however, use zh-HK for Traditional Chinese.) The availability of zh-Hans and zh-Hant for Chinese written in Simplified and Traditional scripts should improve
consistency and accuracy, and these tags are already becoming widely used, although of course you may need to continue to use the old language tags in some cases to remain consistent with tags already in use.

The region subtag in RFC 3066 took its values from the ISO 3166 country codes. These two-letter codes are still available from the new
registry, but the registry also lists 3-digit UN M.49 region codes. The advantage of these codes is that they can represent more than just countries.
For example, localization groups have for some time wanted to label their carefully crafted translations as Latin-American Spanish, rather than the
Spanish of any particular country. With RFC 5646 this is possible; the appropriate language tag is es-419.

Only one region subtag can appear in a language tag, and it must appear after the language subtag and any extlang and script subtags. It is a two-letter alphabetic or 3-digit numeric code. You can have a language code immediately followed by a region code, just as you are
used to for language tags such as en-US.

Once again, you should only use region subtags if they are necessary to make a distinction you need. Unless you specifically need to
highlight that you are talking about Italian as spoken in Italy, you should use the tag it for Italian, and not it-IT. The
same goes for any other possible combination.

These examples from the registry show the codes for Austria, AT, and Northern Africa, 015:

Variant subtags are values used to indicate dialects or script variations not already covered by combinations of language, script and
region subtags. The variant subtags must appear after any language, script or region subtags, but script and region subtags do not need to precede them.

It is unlikely that you will need to use variant subtags unless you are working in a specialized area.

The following examples may help you understand what these subtags do.

sl-nedis (the Nadiza dialect of Slovenian)

sl-rozaj (the Rezijan dialect of Slovenian)

sl-IT-nedis (the specific variant of the Nadiza dialect of Slovenian that is spoken in Italy)

de-CH-1901 (the variant of German orthography dating from the 1901 reforms, as seen in Switzerland)

This example from the registry shows the code for the Nadiza dialect of Slovenian, nedis:

In the registry these subtags are tied to a specific language (and possibly additional subtags between this subtag and the primary language subtag) by the 'Prefix' field. The nedis example shown above should
only be used with Slovenian.

If you need to express a particular dialectal or script nuance that is not currently available, you should propose a variant subtag or subtags for inclusion in the
registry using the registration procedure outlined in RFC 5646.

If you feel you really need to use these subtags, you should read the specification, rather than this article.

Extension and private use subtags are introduced by a single-letter subtag, or 'singleton'. An organization can propose a singleton for an extension. Its intended use must be described by an RFC (IETF specification). The singleton will be added to the registry if it successfully passes a review. The singleton x is reserved for private use. Multiple subtags are allowed after the singleton; however, as for all subtags, they must each be 8 or fewer characters in length.

Extension subtags allow for extensions to the language tag. For example, the extension subtag u has been registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.

For example, the following indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.

de-DE-u-co-phonebk

The u- extension is defined in RFC 6067, which points to the Unicode Consortium's Common Locale Data Repository (CLDR) for details on the subtags that follow it. It is not defined by BCP 47.
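To show how an application might read such a tag, the Python sketch below pulls the keywords out of a u- extension. It assumes the key/type structure used by CLDR, in which each two-character key (such as co) is followed by one or more longer type subtags (such as phonebk), and it skips any attributes that precede the first key. The function name is hypothetical.

```python
def unicode_extension_keywords(tag):
    """Return {key: type} pairs found in the tag's u- extension, if any."""
    subtags = tag.lower().split("-")
    start = None
    for index, subtag in enumerate(subtags[1:], start=1):
        if subtag == "x":
            break  # everything after x is private use, not an extension
        if subtag == "u":
            start = index
            break
    if start is None:
        return {}
    keywords, key = {}, None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # another singleton ends the u- extension
        if len(subtag) == 2:
            key = subtag  # two-character subtags are keys
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)  # longer subtags belong to the key
    return {name: "-".join(types) for name, types in keywords.items()}


print(unicode_extension_keywords("de-DE-u-co-phonebk"))
```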

Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement amongst parties.

Because these subtags are only meaningful within private agreements and cannot be used interoperably across the Web, they should be used with great care, and avoided whenever possible.

The following example of a private use subtag may identify a specific type of US English, but only within a closed community. Outside of
that private agreement, its meaning cannot be relied upon.

Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.

Redundant tags are language tags composed of a sequence of subtags and registered before RFC 4646 that can now be formed by combining separate subtags from the current registry. The original registrations remain in the registry mostly 'as a matter of historical curiosity'.

Many grandfathered tags have been superseded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the following example of a grandfathered tag indicates that you should use the jbo language subtag instead of art-lojban.
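In code, honoring Preferred-Value for grandfathered tags amounts to a whole-tag lookup, since these tags cannot be taken apart subtag by subtag. A minimal Python sketch follows; the table contains art-lojban from this article plus i-klingon as a second illustration, and a complete table would be read from the registry. The names are invented.

```python
# Sample grandfathered tags and their Preferred-Value; not a complete list.
GRANDFATHERED_PREFERRED = {
    "art-lojban": "jbo",
    "i-klingon": "tlh",
}


def replace_grandfathered(tag):
    """Return the preferred replacement for a deprecated grandfathered tag."""
    # Grandfathered tags are matched as whole tags, ignoring case.
    return GRANDFATHERED_PREFERRED.get(tag.lower(), tag)


print(replace_grandfathered("art-lojban"))
```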

Matching different language tags is important for a number of applications. According to BCP 47, en can be said to match en-GB. For
example, the following CSS code colors all English text red in browsers that support the pseudo-class :lang.

:lang(en) { color: red; }

In the following code, the text described as lang="en-GB" will be red.

With the availability of additional tags in RFC 5646, matching is a little more complicated. In addition, its companion, RFC 4647 Matching of Language Tags, describes more than one possible approach to matching.
Matching will be described in another article.
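As a taste of what matching involves, the Python sketch below implements the simplest of the approaches in RFC 4647, basic filtering: a language range matches a tag if the two are equal, or if the range is a prefix of the tag that ends at a hyphen. This is the comparison behind the statement that en matches en-GB. The function name is hypothetical, and the other matching schemes in RFC 4647 are not covered.

```python
def basic_filter_match(language_range, tag):
    """Return True if the tag matches the range under basic filtering."""
    language_range = language_range.lower()
    tag = tag.lower()
    if language_range == "*":
        return True  # the wildcard range matches every tag
    # Match on equality, or on a prefix that ends at a subtag boundary.
    return tag == language_range or tag.startswith(language_range + "-")


print(basic_filter_match("en", "en-GB"))
print(basic_filter_match("en-GB", "en"))
```

Requiring the prefix to end at a hyphen is what stops a range such as en from matching an unrelated tag such as eng.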

Note that there have been changes to ISO language codes.
In 1989 iw, in, and ji were withdrawn and replaced by he, id, and yi. More recently, the ISO country code cs, which used to represent Czechoslovakia,
was changed to represent Serbia and Montenegro. Such changes can lead to confusion when comparing codes that were assigned to text over a long
period. The new IANA subtag registry allows for tags to be deprecated and superseded by new tags, but will never remove or change the meaning of a
subtag. It is expected that ISO will also follow a similar policy for the future.

Many other W3C and Web-related specifications use language tags:

XHTML 1.0 uses language tags in the HTML lang attribute and the XML xml:lang attribute, as well as the hreflang attribute.

HTTP uses language tags in the Accept-Language and Content-Language headers.

SMIL and SVG can use language tags in the switch element.

CSS and XSL use language tags for detailed style control.

Note also that language information can be attached to objects such as images and included audio files.