Preferred Languages: Research

Following the first meeting of the Preferred Languages project, some time was spent on researching popular platforms and other systems to see how they handle the issue of setting multiple preferred languages. To recap, this is needed to show things in the most suitable language for users in case their requested language is not installed.

Browsers & Operating Systems

All major browsers and operating systems have some UI where the user can set their preferred languages. Mostly used for the Accept-Language HTTP header, one can set multiple languages and order them by preference. These systems usually know multiple variants like German and German (Switzerland), but not formal / informal variants.

Firefox

Chrome

macOS (overview)

macOS (selection)

Worth noting that on these systems you can often choose more settings that are influenced by the language, like temperature units and date formats. Related: #18146.

The Accept-Language HTTP Header

RFC 7231 about the HTTP/1.1 standard says the following about this header:

The “Accept-Language” header field can be used by user agents to indicate the set of natural languages that are preferred in the response. Each language-range can be given an associated quality value representing an estimate of the user’s preference for the languages specified by that range.

This header is quite powerful. For example, Accept-Language: da, en-gb;q=0.8, en;q=0.7 would mean: “I prefer Danish, but will accept British English and other types of English”. I can only recommend reading more about it in the RFC, as it helps getting a better understanding of the problems it tries to solve.

This header is usually used by websites to redirect users to the correct version or display a hint like “This content is also available in XY”. While each language in the header has a specific numeric priority, there is usually only a drag & drop interface to determine the order.

Unicode Common Locale Data Repository

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. It contains an interesting chart about Language Matching, data that is used to match the user’s desired language/locales against an application’s supported languages/locales.

There’s also a technical standard about Unicode Locale Data Markup Language (LDML), an XML format for the exchange of structured locale data. The section about Locale Inheritance and Language Matching is particularly interesting. For instance, it describes finding the most well suited language using a weighted graph and gives a better picture of dealing with more complex language fallbacks.

It also takes geographic “closeness” into account, arguing that English (Slovakia) should fall back to something within Europe (e.g. British English) in preference to something far away and unrelated like English (Singapore). It’s clearly stated that these fallbacks aren’t as simple as just saying Spanish (Mexico) -> Spanish (Spain) -> English (US).

Note that this technical standard is about finding the best supported locale based on the requested list of languages. The requested list could come from different sources, such as such as the user’s list of preferred languages in the OS Settings, or from a browser’s Accept-Language list.

Wikipedia

Wikipedia is built on the MediaWiki software, which has a hardcoded list of fallback chains for some locales. This is where MediaWiki will fall back on a different language if it cannot find what it needs. For example, French (Cajun) automatically falls back to French (France) when it doesn’t have all messages defined in it. Unfortunately, there’s only little documentation about this.

Apart from that, MediaWiki distinguishes between various kinds of languages: the site content language, the user interface language and the page content language. The latter can differ from the first two and it influences the language the user views the page in, which depends on the user’s preferences, the available languages, and the defined fallbacks.

It’s worth noting that on Wikipedia, you can not only select your preferred language, but also your preferred gender. In addition to that, you can select multiple different locales for both display and input.

Wikipedia Internationalization Settings

Wikipedia Language Chooser

MediaWiki Fallback Chain Graphic

Joomla

I tried the official Joomla Demo to test its language management which is a bit overwhelming at first. After installing all the available languages the user is eventually able to select their preferred language for the front end and the back end. There are a few other settings for language switching on multilingual sites, but that’s pretty much it. So basically the same settings as currently in WordPress 4.7.

Joomla User Settings

Drupal

Drupal 8 has a rather powerful user interface text language detection mechanism. There is a per session, per user and per browser option in the detection settings. However, users can only choose one language, so they cannot say (in core at least) that they want German primarily and Spanish if German is not available. But the language selected by the user is part of the larger fallback system, so it may fall back further down to other options.

The Language fallback module allows defining one fallback for a language, while the Language Hierarchy module provides a GUI to change the language fallback system. It allows setting up language hierarchies where translations of a site’s content, settings and interface can fall back to parent language translations, without ever falling back to English. This module might be the most interesting one for our research.

Drupal Language Hierarchy

TYPO3

TYPO3 4.6 comes with a clever fallback mechanism when a label is not found in the requested language; instead of returning the default (English) version, it allows you to define your own hierarchy of locales.

By default, French (Canada) will first use French before falling back to English. Similarly, missing labels in Brazilian Portuguese will first try to return Portuguese labels before the English ones. This feature also accommodates completely custom fallbacks.

The same goes for the underlying Flow Framework and the Neos CMS. This is only a front end thing though. On the back end, a user can only configure one language. If there’s no translation in that language, it will fall back to the site’s default and eventually to English.

TYPO3 Back end Language Setting

WordPress.com

On WordPress.com there’s a user interface language selection in the account settings. One cannot select multiple preferred languages though, but only one.

WordPress.com User Interface Language Setting

Facebook

Facebook only has a really simple language settings. Although a user can select multiple languages they understand for use in the News Feed, they can only choose one specific locale for the overall UI.The locales that Facebook supports are referenced in the Facebook Locales XML file. That file includes multiple variants for various languages, but only one variant for others. For example, there’s only de_DE, but no de_CH. Plus, you can’t choose between formal and informal variants.

Facebook User Language Setting

Summary

We’ve now covered a bunch of well-known web platforms and content management systems to see how they are handling this problem. Various techniques exist to assist with finding the right translation. Sometimes they are automated, but most of the time the choice is left up to the individual user.

This research should help with the next steps of the Preferred Languages project as these observations need to be adapted for WordPress so that we can learn from them. Feel free to leave any comments about this research in the comments. After that, I try to schedule a new meeting in December where the next steps can be discussed.

> I thinks it is better to consider text direction (LTR vs RTL: Right To Left), like Drupal.

So you want to use English, but from right to left? That’s the first time I’m hearing this suggestion personally. Right now, the text direction is defined by the locale you are using, e.g. English is LTR, Arabic is RTL.

Currently site’s direction is only controlled by site’s language (1) not user’s language (2):

1. Site frontend & backend language & direction

Domain/site-name/wp-admin/options-general.php

2. User backend language to override site language

domain/user-name/wp-admin/profile.php

It means that admin can set the first one in Persian (RTL site), but a user may set his profile language to English: in this case his site backend direction is still RTL. I tested it in this site (a site in Persian) and a user profile in English. So, I think it is good to have a backend direction option in user profile, too.

Also, in some cases, we need to switch temporarily backend of RTL sites from RTL to LTR. Because some plugin / themes backend .css files do not support RTL and show awkwardly in RTL backends. As I said, currently I use Parsi Date plugin to solve this problem.

Jalali calendar in core

Jalali calendar (Official calendar in Iran) is not supported in core, and I use the above mentioned plugin. Yes, it is good to consider it in core.

Lunar calendar

Sometimes, I prepare Arabic language websites. But, interestingly I did not find any plugin to change the site dates to lunar calendar (official calendar of many Arab countries). I searched the for it in repository.

Greg Ichneumon Brown
10:20 pm on November 28, 2016

I was recently looking at how to do better language guessing on WP.com so let me highlight a few other non-obvious pieces:

On signup WP.com tries to guess the user’s language based on the following lookups: HTTP accept-lang headers are used if they are set, otherwise the user’s IP address is used to do a country lookup.

When selecting a language with this lookup the language is chosen based on how well translated it is. There is a hard coded list of well translated languages that we try to steer the user towards.

There are a lot of cases where the locale is ignored for various features. Really its a pretty simplistic way to do the fallback, the language matching chart you found is really interesting. For instance, when searching/indexing content we mostly use just the language portion of the

We’ve discussed supporting multiple languages for things like the Reader, but don’t love the idea of adding more options to the admin to support it.