Problem: Serving different renderings for the same URL messes with web caches.

Solution: Force uselang based on some part of the URL path, similar to how language variants are handled

For multilingual wikis such as Wikidata, but also Commons and perhaps mediawiki.org and meta.wikimedia.org, it would be useful if anonymous visitors could browse the wiki using their preferred user language. We currently do not allow this, since serving pages localized for different languages from the same URL would poison the web caches. There is at the moment no way to bookmark or link to a specific language version of a page, and search engines will only index one language version.

Simply disabling the web caches for such wikis, or at least bypassing them if a setlang or uselang cookie is set, might be feasible if anonymous traffic on the relevant wiki is low enough. Another option would be to vary (split) the cache based on a language cookie. However, neither option allows linking to a specific language version, or indexing by search engines.

Proposed solution:

encode the user's preferred language in the URL path, and use it to set the value of the uselang URL parameter via some kind of rewrite magic. A similar approach is already used for wikis that support language variants.

$wgArticlePath needs to be automatically adjusted based on uselang, so that all generated links point to pages under the current per-language URL path. (We may run into trouble with the message cache here.)

the (old) language neutral path should be rewritten to some special page which redirects to the user's preferred version of the page, similar to how Special:InMyLanguage works. The user's preferred language could be determined by ULS via a hook.

Logged in users would also be using the per-language paths for consistency, but would bypass the web caches as before. When viewing a page in a path that disagrees with the user language from their preferences, some kind of notification bar should be shown, with easy access to the language rendering in the user's normal (as per preferences) language.
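To make the moving parts above concrete, here is a minimal sketch of the two directions: mapping an external per-language path to the internal uselang form, and generating links that stay under the current language path. The /wiki-<lang>/ prefix is only one of the candidate schemes and is an assumption here, as are the function names; none of this is existing MediaWiki code, and in practice the inbound rewrite would probably live in the web server or cache layer.

```python
import re
from typing import Optional
from urllib.parse import quote, urlencode

# Hypothetical path scheme: /wiki-<lang>/<Title>. The RFC leaves the actual
# scheme open, so this prefix is an assumption for illustration only.
LANG_PATH = re.compile(
    r"^/wiki-(?P<lang>[a-z]{2,3}(?:-[a-z0-9]+)*)/(?P<title>.+)$"
)


def rewrite_to_uselang(path: str) -> str:
    """Map an external per-language path to the internal uselang form."""
    match = LANG_PATH.match(path)
    if match is None:
        # Language-neutral path: left untouched in this sketch.
        return path
    query = urlencode({"uselang": match.group("lang")})
    return f"/wiki/{match.group('title')}?{query}"


def article_path(uselang: Optional[str]) -> str:
    """Per-request stand-in for $wgArticlePath; $1 is the title placeholder."""
    return f"/wiki-{uselang}/$1" if uselang else "/wiki/$1"


def link_to(title: str, uselang: Optional[str]) -> str:
    """Build a link that stays under the current per-language path."""
    return article_path(uselang).replace("$1", quote(title, safe="/:"))
```

With these definitions, rewrite_to_uselang("/wiki-fr/Foo") yields "/wiki/Foo?uselang=fr", and link_to("Foo", "fr") yields "/wiki-fr/Foo", so following a link keeps the reader in the same language.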

Note: variants apply to the page's content language, while this RFC is concerned with the user language. How user language and content language relate, in particular for localizable page content and variants, is not in scope of this RFC. This should rather be discussed in the context of T114640: RFC: make Parser::getTargetLanguage aware of multilingual wikis.

If we do this, what should the path scheme look like? /wiki-fr/Foo or just /fr/Foo or something else? Should the path pattern be the same as for variants, or should it be different, so both can be used at once?

Can we first try this without the automatic rewrite of the classic /wiki/ path? On which wiki shall we try this first?

How do we make a wiki-link to a specific language version of a page? Do we need a {{#link:Foo}} function?

Totally agree with @Purodha and @coren, the use of the /xx suffix is preferable, not only because other wikis use it, but also because it's easier to link to random languages from non-Lua templates without using string parsing templates.

It's not currently clear what purpose this new URL format would serve, so
it's hard to judge whether the proposal is sufficient or overkill for the
goal. The title is misleading, in that "Per-language URLs for pages of
multilingual wikis" already exist in practice (see below). This is, in fact,
the main characteristic of multilingual wikis in MediaWiki as opposed to
other wikis or CMSs: to change language, you nearly always have to change
URL.

This is about multilingual pages, where there is only one source, but what you see depends on your language. This is the case for example for wikidata, and for (some) file descriptions on commons.

Having separate pages for each language already works, of course. But the wiki won't automatically show you the one you understand.

page content depending on the user's interface language (for instance
wikidata.org and commons.wikimedia.org, or other wikis using the Translate
extension). All language versions (renderings) of a page are served from
the same URL

This is not about translation. It's about showing the *same* information (e.g. the license of an image) in the user's language.

Commons uses AnonymousI18n.js and the uselang URL parameter, by making
content language rely on interface language.

That's more along the lines I was thinking of. It would be nice if we could actually serve content in the user's desired language in the first place, though.

So you suggest to force the content language based on the interface language?
I hope not; that's something Commons (and the old Meta-Wiki
LanguageSelect/LangSwitch) do out of necessity, not because it's a good
thing.

Not "force", but "default", yes. It works perfectly fine for wikidata. What problems do you see?

warning bar is shown at the top

This is the AnonymousI18n.js approach. FWIW, we have developed better ways in
the meanwhile (except that WMF ops don't support them).

My take-away from the comments so far is that I need to clarify that this is not about manually translated text. It's about displaying structured data in the user's preferred language. For translations, I also prefer suffixes.

It should be about both IMHO (e.g. s/nice/compulsory in the last challenge). It doesn't make sense for the linkers to know whether the namespace you want to link to has structured data or is manually translated - the link to Qxxxxx and File:Yyyy.jpg should have the same structure.

I do understand now that using that suffix might create a confusion, but the new system should be aware of the old one.

I think we need very clear rules about how we structure our URLs. If we already have en.wikipedia.org, Page/en, and /zh-tw/ (variants?), it's time to figure out and explicitly define our long-term URL scheme. This process would include clearly defining which language codes are used, where they'll appear in the resource locator string, and what the expected behavior is when they're requested in specific ways (anonymous HTTP GET, authenticated HTTP POST, etc.).

In other words, we need a spec.

(I intentionally include en.wikipedia.org as that is really a choice we make. We could move everything to www or no-www wikipedia.org.)

This is not about translation. It's about showing the *same* information (e.g. the license of an image) in the user's language.

I don't understand this distinction. This seems the definition of translation to me.

The difference is that you can view a wikidata item in different languages, without having to define the item in different languages. The content of the page is not translated, it's language-neutral (except for the label and description, which are multilingual).

file descriptions on Commons and other autotranslate-like pages, which already have per-language URLs (Page?uselang=de, Page?uselang=it etc.).

Indeed, this RFC is about the second use case. uselang=it etc causes the web cache (varnish) to be bypassed, and it's not persistent - when you click a link, you are back to the standard language. I would like to change that. The preferred way to change that is to put the language into the path instead of a parameter, and make the Linker aware of uselang, so it becomes "sticky".

So this RFC is just about finding a replacement for the current usage of Q123456?uselang=de and similar? It would be useful to state so in the summary, as that's much easier to understand and address.

What I propose is technically very similar to what the uselang and variant URL parameters do. And we may well keep using these internally. There are a few more aspects to it, like making the Linker aware of this, coupling the parser target language to the user language (for some pages/namespaces), and allowing such multi-lingual pages to make proper use of web caches, so hacks like AnonymousI18n.js are no longer needed.

We could start by making the *de facto* uselang standard explicitly supported by MediaWiki's linking functions, and then maybe think of a different URL format if there is ever a need.

@Purodha setlang works OK for logged in users, but it still means we serve different content for the same URL. Which is Not Nice (tm), and screws with web caches. For anons, setlang sets a cookie, which would either be ignored by the web caches, causing random language versions to be cached and served, or it would cause the cache to be bypassed, causing performance issues. That's why it's soft-deprecated, and not enabled for anons. I wrote this RFC in order to fix that.
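The cache problem described above can be shown with a toy model. This is only a sketch: ToyCache, render and the key functions are hypothetical stand-ins, not Varnish or MediaWiki behaviour. It compares a cache keyed on the URL alone, which ignores the language cookie, with one that splits on the cookie.

```python
from typing import Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, Optional[str]]


def render(url: str, lang: str) -> str:
    """Stand-in for the application server rendering a page."""
    return f"{url} rendered in {lang}"


class ToyCache:
    """Toy shared cache; key_fn decides what distinguishes two entries."""

    def __init__(self, key_fn: Callable[[str, Optional[str]], CacheKey]):
        self.key_fn = key_fn
        self.store: Dict[CacheKey, str] = {}

    def get(self, url: str, cookie_lang: Optional[str]) -> str:
        key = self.key_fn(url, cookie_lang)
        if key not in self.store:
            # Cache miss: render in the requester's language and store it.
            self.store[key] = render(url, cookie_lang or "en")
        return self.store[key]


# Cache that ignores the language cookie: whoever comes first decides
# which language everybody else gets for that URL.
url_only = ToyCache(lambda url, lang: (url, None))

# Cache that is split on the language cookie: correct, but the language
# is still invisible in the URL, so it cannot be linked or indexed.
split_on_cookie = ToyCache(lambda url, lang: (url, lang))
```

For example, after url_only.get("/wiki/Q42", "de"), a later url_only.get("/wiki/Q42", "fr") still returns the German rendering, while split_on_cookie returns a separate rendering per language. With per-language paths the URL itself carries the language, so a URL-only key would be enough.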

I agree by the way that we should avoid having a confusing set of language settings for various aspects of our content. I propose having two, basically: the "current" language (uselang), and the "preferred" language (from user preferences). Anons only have a "current" language. This language would be applied to the UI, and to the content of multilingual pages. We may also use it to pick translations, though I'm not sure about the UI for that.

My take-away from the comments so far is that I need to clarify that this is not about manually translated text. It's about displaying structured data in the user's preferred language. For translations, I also prefer suffixes.

Rather than add "yet another" way to specify a language in the URL, I'd prefer if we try to reuse the mechanisms we already have. When the user selects a language+variant in which to view a page, it seems natural that the UX text would also change to that same language+variant. They are specified in the same way: base language code, hyphen, variant string.

My strawman example is a user on zhwiki who has a target variant set to zh-hant but has the UX language (image metadata labels, {{int}} output, page UI) set to, say, de. Is this something we ought to account for? If you specify de alone, is language converter just turned off? (The result is an incomprehensible mix of character sets and variant terms.) Or do we fall back to some default (politics alert) and acknowledge this is nonideal but it's a corner case and unusual in practice?

There are many websites which do not have per-language URLs and that seems fine. There is always a trade-off: whether making it harder to share a URL with explicit interface language or to share a URL without an explicit interface language. To me the case of not having explicit interface language in the URL feels as the more common use case, but I don't have data to back this up. But it can be compared to the permanent links where oldid is present, but not there by default. I would argue that interface language should also be optional, but possible to add when wanted (which is right now possible with uselang).

Sure, if there is no sane way to do either manual or automatic interface language selection (see T149419), then having explicit interface language in the URL can be an acceptable trade-off, but it would not be optimal for user experience in my opinion. It is not clear from the proposal whether this URL scheme would only be used internally (say, rewriting non explicit language URL based on a cookie in the frontend), or also externally. Considering that other MediaWiki installation would need this kind of URL scheme, it would be better if Wikimedia did not deviate from this practice externally.

And if we still want to use per-language URLs, I would require very good justification for not using the existing uselang parameter.

In relation to Translate, I would be happy to see a solution for redirecting readers to content in the correct language (interface language) that does not depend on Special:MyLanguage, which breaks link tables.

I do recommend using ULS's logic to determine default language if possible, there is no reason to build new, likely diverging, solutions.

There are many websites which do not have per-language URLs and that seems fine. There is always a trade-off: whether making it harder to share a URL with explicit interface language or to share a URL without an explicit interface language.

I think you are right if we are really talking about the UI language. And in MediaWiki we technically are talking about the UI language, not the content language.

But the main use case, Wikidata, generates content in the UI language. So uselang=fr will not just cause your navigation to be in French, it will cause the entire page to be in French. It seems quite useful to e.g. have Google index these different versions separately, and to be able to bookmark them, and link to them.

I am not proposing to do this for all wikis. I'm proposing to do it for Wikidata, Commons, and perhaps a handful others.

To me the case of not having explicit interface language in the URL feels as the more common use case, but I don't have data to back this up. But it can be compared to the permanent links where oldid is present, but not there by default. I would argue that interface language should also be optional, but possible to add when wanted (which is right now possible with uselang).

The uselang argument currently breaks web caches, and it's arguably ugly. Basically, my proposal is a prettified, "sticky" uselang. "Sticky" because links would be generated so that they would again point to the same language version.

It would still be possible to link to a language neutral path. My thinking is that the neutral path should trigger an HTML redirect based on the user language (or a good guess), but it could also serve content directly (though it would have to bypass web caches then). Or always use user language = content language for anons, as we do now.
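A rough sketch of what the redirect from the neutral path could look like. Everything here is an assumption for illustration: the /wiki-<lang>/ scheme, the SUPPORTED list, and the use of the Accept-Language header as the "good guess". In a real deployment the guess would come from ULS rather than from new code like this.

```python
from typing import Optional, Sequence

# Hypothetical set of languages the wiki offers per-language paths for.
SUPPORTED = ("en", "de", "fr", "zh-tw")


def guess_language(accept_language: str,
                   supported: Sequence[str] = SUPPORTED,
                   default: str = "en") -> str:
    """Pick the highest-q supported language from an Accept-Language value."""
    candidates = []
    for position, part in enumerate(accept_language.split(",")):
        fields = part.strip().split(";")
        tag = fields[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        candidates.append((-q, position, tag))
    # Highest q first; header order breaks ties.
    for neg_q, _, tag in sorted(candidates):
        if neg_q == 0:
            continue  # q=0 marks a language as not acceptable
        if tag in supported:
            return tag
        base = tag.split("-")[0]
        if base in supported:
            return base
    return default


def redirect_target(title: str,
                    preferred: Optional[str] = None,
                    accept_language: str = "") -> str:
    """Path that the language-neutral URL would redirect to."""
    lang = preferred or guess_language(accept_language)
    return f"/wiki-{lang}/{title}"
```

A logged-in user with preferred="de" would be sent to /wiki-de/Q42. An anon sending "fr-CH, fr;q=0.9, en;q=0.8" would land on /wiki-fr/Q42, because fr-CH falls back to its base language fr.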

Sure, if there is no sane way to do either manual or automatic interface language selection (see T149419), then having explicit interface language in the URL can be an acceptable trade-off, but it would not be optimal for user experience in my opinion.

To me it seems to be exactly the other way around. Having separate URLs serves the user better - the different renderings of the page get indexed separately by Google, you can bookmark a specific language, and you can link others to a specific language. Some guessing heuristic should be used as a fallback.

It is not clear from the proposal whether this URL scheme would only be used internally (say, rewriting non explicit language URL based on a cookie in the frontend), or also externally.

[...]

And if we still want to use per-language URLs, I would require very good justification for not using the existing uselang parameter.

The URL path encoding the desired language will be used *only* externally. The idea is to use uselang internally. The language encoded in the path will be rewritten to uselang before it hits MediaWiki code.

In relation to Translate, I would be happy to see a solution for redirecting readers to content in the correct language (interface language) that does not depend on Special:MyLanguage, which breaks link tables.

Yes, but I would like to keep that discussion separate. The relationship between user language and content language (and variant) is quite complex, and differs across use cases. My proposal should already work out of the box with the way file description pages are localized on commons.

I do recommend using ULS's logic to determine default language if possible, there is no reason to build new, likely diverging, solutions.

Yes, absolutely. There is no intention to build another language selector or guessing heuristics.

To me it seems to be exactly the other way around. Having separate URLs serves the user better - the different renderings of the page get indexed separately by Google, you can bookmark a specific language, and you can link others to a specific language.

Most of the time, when someone shares a uselang URL with me (e.g. those which many Wikimedia wikis generate to link Commons files), the URL is wrong. People don't quite pay attention to the meaning of the URLs they share.

What do you mean by "wrong"? You mean you get a language-specific URL (for some random language), while you would prefer a language-neutral one?

My idea to resolve this is to show a navigation bar when the current uselang disagrees with your preferences. So going from the "wrong" to "your" language would be a single click. This was part of the original RFC, but I decided to cut it out, to keep the scope narrow.

Anyway, would that address your problem with the "wrong" URL being shared?

<DanielK_WMDE> subbu: re default path for a default language: i think that when we first try this, the default path should stay as it is now. eventually, the default path can trigger a redirect to the appropriate language path <subbu> wfm. (DanielK_WMDE, 22:13:33)

we can use subdomains instead of paths, but it's probably harder to get right (DanielK_WMDE, 22:14:13)

competing RFC T149419 proposes to split the cache on a cookie, instead of the url. (DanielK_WMDE, 22:14:42)

"We currently do not support setting the same XKey on very large numbers of objects. In practice something on the order of 1-100 objects attached to a given XKey is reasonable." (DanielK_WMDE, 22:26:40)

<TimStarling> action item to corner bblack and make him say that? (DanielK_WMDE, 22:27:02)

<TimStarling> the patch would be to CdnCacheUpdate (DanielK_WMDE, 22:34:32)

@kchapman no resources. unlikely to move any time soon. the discussion and conclusions are still relevant. I suppose that means it can either sit in the backlog, or drop off the board. I'm fine with either.