User:Runab WMF/Language goals and wishlist

This page is currently a draft. More information and discussion about changes to this draft may be found on the discussion page.

This document is based on previous internationalisation wishlists (2013, 2014, 2017). This page is a living document used to collect ideas that the Language team can use for planning purposes. The purpose is to have a ready list of idea candidates for annual planning and for mentoring programs. Each idea should have a title and a description covering the issue being solved, the possible impact, possible approaches, community demand and the expected effort needed to implement the idea. Optionally, each idea can include a line about its current status and a link to a relevant Phabricator task.

Restart work as per the roadmap to complete the second version of CX on the new codebase (including VE integration for the translation interface), and initiate appropriate community interactions for the release of CX as non-beta on selected wikis.

Various i18n libraries use different ways to mark variables. Some examples:

$1 // MediaWiki
$var
%1$s, %2 // Many C/Gettext projects
${var}

With insertables (buttons that can be activated to insert these variables), we have made it easier to add them and to avoid spelling mistakes. However, some of these formats, those containing Latin letters, are difficult and confusing to use in right-to-left language translations. One possible approach is to unify all these formats, so that translators only see one of them, even though the underlying code keeps whatever syntax it uses. We could also make sure that in right-to-left languages we use syntax that does not cause issues.

Another aspect of unification: for translation memory, we should replace all variables with a uniform placeholder, so that translation memory matching is more accurate.
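As a rough sketch of this normalization step (the function name and the universal placeholder are made up for illustration; the patterns are the example syntaxes listed above):

```python
import re

# Example placeholder syntaxes from different i18n systems (see above).
# Order matters: longer, more specific patterns come first.
PLACEHOLDER_PATTERNS = [
    r"\$\{\w+\}",    # ${var}
    r"\$\d+",        # $1 (MediaWiki)
    r"\$\w+",        # $var
    r"%\d+\$[sd]",   # %1$s (many C/Gettext projects)
    r"%\d+",         # %2
]
PLACEHOLDER_RE = re.compile("|".join(PLACEHOLDER_PATTERNS))

def normalize_for_tm(text):
    """Replace every variable with one uniform placeholder so that
    translation memory matching is not confused by placeholder syntax."""
    return PLACEHOLDER_RE.sub("⟦VAR⟧", text)
```

Two messages that differ only in placeholder syntax then normalize to the same string, improving translation memory recall.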

If we want to take this even further, insertables should perhaps be easier to control, so that project contacts have more visibility into them without having to write PHP code. We can do this by 1) supporting the most common formats out of the box and 2) allowing regular expressions to be specified directly in the YAML configuration.
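A configuration entry for point 2 might look something like this (the key names below are illustrative, not the Translate extension's actual schema):

```yaml
# Hypothetical message group configuration fragment:
# declare insertables via a regular expression, no PHP needed.
INSERTABLES:
  - class: RegexInsertablesSuggester
    params:
      regexp: /\$[a-z0-9_]+/i
```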

The Translate extension supports multiple file formats. The formats have been developed on an "as needed" basis, and many formats are not yet supported or their support is incomplete. In this project the aim would be to make existing file format support (for example Android XML) more robust, so that it meets the following properties:

the code does not crash on unexpected input,

there is a validator for the file format,

the code can handle the full file format specification,

the code is secure (does not execute any code in the files nor have known exploits).
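To illustrate the validator idea (a sketch only, not the Translate extension's actual code), a minimal check for an Android strings.xml file could report problems instead of crashing on unexpected input:

```python
import xml.etree.ElementTree as ET

def validate_android_strings(xml_text):
    """Return a list of problems found in an Android strings.xml file.
    Parse errors are collected and reported rather than raised."""
    problems = []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as e:
        return ["not well-formed XML: %s" % e]
    if root.tag != "resources":
        problems.append("root element is <%s>, expected <resources>" % root.tag)
    for string in root.iter("string"):
        if "name" not in string.attrib:
            problems.append("<string> element without a name attribute")
    return problems
```

A real validator would additionally check escaping rules, `string-array` items and the full format specification.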

In addition, new file formats can be implemented: in particular Apache Cocoon (bug 56276) and Android XML string arrays have attracted interest and have patches to work on, but we'd also like TMX, for example. Adding new formats is a good chance to learn how to write parsers and generators for simple data in complicated file formats. For some formats, it might be possible to take advantage of existing PHP libraries for parsing and file generation. (More example formats other platforms support: OpenOffice.org SDF/GSI, Desktop, Joomla INI, Magento CSV, Maker Interchange Format (MIF), .plist, Qt Linguist (TS), subtitle formats, Windows .rc, Windows resource (.resx), HTML/XHTML, Mac OS X strings, WordFast TXT, iCal.)

This project paves the way for future improvements, such as automatic file format detection, support for more software projects and extending the ability of normal users to add files for translation via a web interface.

Adding a new language on translatewiki.net (translatewiki:Translatewiki.net languages) requires many decisions and checks (e.g. ISO status; names in Wikipedia, CLDR and the request; jquery.uls) and changes in various repositories. It's also not clear to translators what the status of their request is, and sometimes data is forgotten. Only core staff can help (in practice just a single person), since full access to configuration and repositories is needed.

The suggestion is to build good documentation for the process, with clear criteria, so that it can be executed by anyone, leaving only +2 and oversight to admins. Thanks to more active code review tracking, patches there are slightly less likely to get stuck.

Adding a new project on translatewiki.net requires many decisions and checks (e.g. file format, access rights, license, string quality) and changes in various places. People are asked to join an IRC channel to discuss this. Progress is loosely tracked (with support requests and sometimes Phabricator tasks).

As with adding languages, the suggestion is to build good documentation for the process, with clear criteria, so that it can be executed by anyone, leaving only +2 and oversight to admins.

It would be helpful to alert users when translations are not being exported from translatewiki.net because they do not meet the export threshold. This information should be accessible to the Translate extension. Currently the threshold is specified in the repository management configuration. If this information were moved to the message group configuration, we would avoid duplication and simplify repository management for exports.

It would also be good to reconsider whether the current export levels make sense. One should check how many translations we currently have that are not being exported because they do not meet the export level. We should also consider lowering our thresholds, hopefully resulting in increased translator motivation, faster deployment of translations and less wasted work.

Currently translation administrators of translatewiki.net need to check all changes manually using Special:ManageMessageGroups. One of the reasons is that sometimes messages are renamed. If this happens, the admin must use Special:ReplaceText or an equivalent to move the translations in the wiki so that histories are retained. It would be better if this could be done directly from the review interface. This would save a considerable amount of time and bring us closer to fully automated imports as well.

Automatic detection of renames (the same content deleted in one message and added in another) would be best, with a manual override for cases where a rename is not detected automatically or is detected incorrectly.
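The content-based matching could be sketched like this (a hypothetical helper, not existing Translate code; ambiguous matches are deliberately left for the manual override):

```python
def detect_renames(deleted, added):
    """Given {key: content} maps of deleted and added messages,
    pair up keys whose content is identical: likely renames.
    Ambiguous or unmatched cases are left for manual review."""
    by_content = {}
    for key, content in deleted.items():
        by_content.setdefault(content, []).append(key)
    renames = {}
    for new_key, content in added.items():
        candidates = by_content.get(content, [])
        if len(candidates) == 1:  # only accept unambiguous matches
            renames[candidates[0]] = new_key
    return renames
```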

There must be a lot of glossaries and terminologies out there. Some of them must be useful to integrate into Translate.

Provide technical support for building glossaries with the Translate extension. These should integrate directly into the translation editor. The solution can range from a simple glossary defined on a wiki page to much more complex solutions. But currently we have almost nothing, so it would be good to get things rolling.

Let's close a bit of the gap between plain TM and MT and do something more useful than edit distance. Try to get the ElasticSearch-as-TM project to fly and promote it. It could be *the* open source TM solution to plug into your software, given that there doesn't seem to be much competition at the moment. Use sentence alignment to increase the recall of the translation memory. Or at least fix the known issues mentioned in the task that are making it more annoying than useful for e.g. Tech News translators.

Nobody updates MediaWiki regularly. LocalisationUpdate-type functionality should be available and functional on all MediaWiki installations. It should not need extra configuration such as cron jobs. This requires that we can safely and very efficiently serve the updates to the wikis. Naturally, it should be very easy to opt out of this feature for those who don't want it.

Currently the formal and informal variants are a bit hit and miss. The fact that not every message needs a separate translation (compare with variants of English) makes them problematic. For languages which want to take this seriously, we could make formality an inline feature in addition to the existing PLURAL, GENDER and GRAMMAR features.

Formality could be an additional option in the user preferences (like gender) or driven by the language codes directly. Languages could choose their own number of formality levels, not limited to two.

The underlying assumption here is that only MediaWiki uses these formal/informal variants. If other projects use them, they could still keep them as separate languages until a better solution is devised.

The MediaWiki message library is very versatile, but some limitations have become apparent over time. The main one is the inability to embed structures that themselves contain linguistic content in sentences. This is best illustrated with the case of links. None of the current alternatives are nice:

Instead, if we could do embedding, things would be quite simple for translators and developers:

# Suggested solution
msg1: Please see our {{#embed:$1|terms of service}} for more information
call: $this->msg( "msg1" )->rawParam( Html::element( 'a', [ 'href' => '...' ], '$1' ) )->escaped();
// The $1 inside the link gets replaced with "terms of service" from the translation with same escaping as the rest of the message.

It is also possible to devise a custom syntax to make it shorter, but that is probably not necessary, as translators already encounter a lot of this kind of syntax with PLURAL, GRAMMAR, GENDER and some others.

See https://github.com/Nikerabbit/monkey-i18n for a proof of concept of this idea. It also supports typed parameters, so that GENDER, PLURAL etc. can validate that they are really getting a user or a number, and even format it automatically without the need to use numParams().

Impact: A way to replace our home-grown grammar support with better solutions and to collaborate with the projects behind them.

Grammar is complicated. Amir Aharoni started a project to move grammar rules from code into data (regular expressions in this case) that can be reused both in JavaScript and PHP, and more easily tested. However, this has only been done for a few languages.

Regular expressions are likely not flexible enough for all languages. There exist libraries, such as open morphology for Finnish, that do a better job. The task for this project would be to investigate how these kinds of libraries can be integrated into MediaWiki message processing. Speed is a concern here, as is how to manage the dependencies and services, since these libraries for different languages are developed by different people using different programming languages.

MediaWiki is able to support ICU collation, and also has some support for custom-built collations (e.g. in Bashkir). However, it must be enabled manually on each site. It makes sense to have MediaWiki core specify a default collation for languages that are supported in ICU or inside MediaWiki (T164985, T47611). Without this, even languages that could already have good collation may be missing it. This affects all non-Wikimedia users of MediaWiki as well.

We should have a reference library which embeds all our learnings and best practices on i18n handling and l10n formats, to promote and use it widely in PHP and JavaScript projects. The library should also try to unify the custom/diverse formats like those for dates from moment.js or others (compare phabricator:T31235).

Collaborate with Globalize.js, cldr.js and other projects and ensure our jquery.i18n project works well with those, with minimal overlap. A complete solution includes everything: time and number formatting, localisation file formats, message delivery, message formatting, input methods, web fonts, language selection. It also makes no sense for us to maintain two i18n libraries written in JavaScript. Make jquery.i18n good enough and make MediaWiki use it.

Currently, we have a sort of conflict between our own PHP and JavaScript libraries, and even many Wikimedia PHP projects end up using custom solutions. At translatewiki.net we host many PHP projects that would benefit from a high-quality i18n library they could just plug in. Licensing issues might be a problem if we want to reuse code from MediaWiki.

We don't have recommendations for important languages like Python, which are "stuck" with Gettext (or custom formats like pywikibot?).

There are many PHP projects that would benefit from a high-quality i18n library. MediaWiki has many excellent features, such as extensive handling of parameters, parameter types and so on. It has some drawbacks though, such as not being able to support nested constructions.
See also https://github.com/Nikerabbit/monkey-i18n

At translatewiki.net we have multiple PHP projects. The licence (GPL-2.0+) might be a problem if they want to reuse code from MediaWiki.

The number of wikis using the Translate extension has increased significantly. At translatewiki.net, in some rare cases people run out of things to translate. It would be beneficial to have some kind of central place to see translation status across the Translate universe. It would facilitate cross-project collaboration and raise awareness that different wikis have different kinds of content to translate.

Various ideas have been floated for implementation, from one special page just listing overall translation coverage in each wiki for a given language, to a "blog roll" type of links across wikis as well as single sign-on systems to ease moving between wikis.

Providing our translation resources to additional projects, which are not directly or entirely under the Wikimedia umbrella, can often give back a lot: translatewiki.net is an example of a success, where "offering" our community of translators to additional software projects has helped expand the number of translators for MediaWiki as well.

Impact: Increase in the number of software projects that adopt fast translation updates. Increased translator motivation and quicker fixes thanks to seeing translations in action faster.

Provide an efficient service/API for any product to automatically update its translations live. It is not necessary to implement this as part of MediaWiki in PHP. It could also be Node.js, Go or something else.

There are many open source software products out there with translations. It would be great if those translations could be harvested to provide an open source translation corpus and translation memory. See also https://intense.wmflabs.org for an attempt that has since stalled.

Translatewiki.net exports are currently semi-automated to the level that one needs to run one script and watch the output. Ideally, it should be fully automated, run by a cron job.

The issues we currently consider blocking this are:

1. migration away from personal ssh keys to a key used by translatewiki.net

2. secure handling of this service ssh key

3. migration of all projects to our new repository management tool (repong)

4. reliability of exports

5. automating the process of adding new languages

6. defining the automation via puppet

Step 1) can be completed by poking existing products to add access with the new key, preferably using an account for translatewiki.net. See for example https://github.com/translatewiki.

Step 2) would need advice from people with experience on this kind of thing to make sure it is secure. Obviously the automated exports would need access to this key, which is currently password protected.

Step 3) requires adding support for non-git version control systems to repong (written in PHP).

Step 4) would entail adding more checks on our end to verify we are not creating broken files. This rarely happens at the syntactic level, as we use fairly standard libraries and battle-tested code, but at a higher level it can happen (e.g. not outputting authorship info). We should also handle failures better (logging, making sure admins can easily see and act on them). One issue is that the project might commit changes between our last import and the following export. At minimum we should abort the export if this happens, by checking that we are exporting against the same revision as we have imported.
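The abort-on-divergence check could be sketched as follows (function names are made up; it assumes the revision id was recorded at import time; `git rev-parse HEAD` is standard git):

```python
import subprocess

def current_head(repo_path):
    """Return the commit id that HEAD points to in a git checkout."""
    return subprocess.check_output(
        ["git", "-C", repo_path, "rev-parse", "HEAD"]
    ).decode().strip()

def safe_to_export(head, imported_revision):
    """True only if upstream has not committed anything since our last
    import; otherwise the export should be aborted and re-imported first."""
    return head == imported_revision
```

The export script would call `safe_to_export(current_head(path), recorded_revision)` and bail out with a logged error on a mismatch.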

Step 5) needs more thought. Many projects need to add a code map or register the new language in a separate file. Perhaps we can devise a safe way to run scripts that these projects create, or simply not export new languages automatically, falling back to humans to add them manually.

Step 6) is just making sure this all happens automatically by having cron or similar execute exports periodically.

Software translated at translatewiki.net usually uses the master branch for both input and export. This means that once a stable branch is created, it stops receiving translation updates. It should be possible to translate, import and export multiple branches simultaneously. When translating, messages which are the same across branches should only be translated once.

Branch support has two benefits:

software that is branched but not yet released can receive translation updates

software that is already released can ship minor updates with the latest translations
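The translate-once requirement could be met by keying messages on content rather than branch. A sketch with hypothetical data shapes:

```python
def shared_translation_units(branches):
    """branches: {branch_name: {key: source_text}}.
    Messages whose key and source text are identical across branches map
    to a single translation unit, so they only need one translation."""
    units = {}
    for branch, messages in branches.items():
        for key, text in messages.items():
            units.setdefault((key, text), []).append(branch)
    return units
```

An exporter would then write each unit's translation back into every branch listed for it.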

Lately the number of active translators at translatewiki.net has not grown. We should try to bring in big projects like KDE to grow the number of translators.

Alternatively, convince organisations like the Free Software Foundation to adopt MediaWiki+Translate for the translation of their software, with as few quirks as possible.

With an increased number of translators, it is expected that proofreading will increase, providing more consistent translations. It would also justify increasing resources to support the development of the platform, and would help with promoting i18n best practices.

Magic words, special page aliases and namespaces should be translatable with a web interface to:

allow translators to change or update translations easily and quickly, without having to know about order of precedence or allowed characters and so on, while also reporting mistakes;

keep translations in a data format which is resilient to mistakes (no fatals due to data errors) and can be easily exported to the repositories (without worrying about removing translations which should be kept for backwards compatibility), like some JSON format on ContentHandler pages on translatewiki.net;

ideally, export such updates as part of the usual scripts to follow the usual continuous translation model and reduce breakage.

Impact: Many more translated illustrations available for use in Wikipedias and other sites, helping people understand better.

The TranslateSVG extension was developed as part of a Google Summer of Code project. It needs further development to bring it to the level where it can be deployed to the Wikimedia cluster. The community also needs to be involved in testing this tool, to make sure they want it.

Commons supports subtitles on videos. Those are translated by editing a wiki page containing a special syntax. For multiple reasons this is not ideal. We should be looking into integrating this with our existing translation tools or with some other free software tools that already exist (perhaps by integrating those into our tools). The goals should be: discoverability of things to translate, easy translation, modification tracking and no extra steps for translations to become available.

PageForms is an extension previously known as SemanticForms. It allows creating forms for inputting data in a structured manner. It would be great if, when creating a form with a form, one could ask for the form to be made translatable. This can be done manually after the form page has been created, but it is a laborious process: extracting all the strings manually, creating dozens of pages and manual configuration in LocalSettings.php (only available to wiki administrators). All this could be simplified with better integration, down to a single checkbox to tick during form creation. After that one could go to Special:Translate to translate the form.

As seen from many empty or semi-empty pages at Help:Contents, most of the help pages and manuals for MediaWiki core (and certain extensions) still live on Meta-Wiki or rather on local Wikipedias: this leaves most small Wikipedias and other projects in "small languages" with no documentation or outdated documentation. Everything should be central, translatable, easily and equally available from all wikis.

«Jon Liechty [...] indicated that half of Wikimedia [Commons] uses the English language template, but the rest of the languages fall off logarithmically. He is concerned about the "exponential hole" separating the languages on each side of the curve.» [1]

Having a fully multilingual MediaWiki with Translate is possible, but sometimes tedious, especially for large wikis with a lot of history. Certain tasks should require less manual work than they do now.

Wikimedia CAPTCHAs are broken: they don't stop any bot or spammer, but they harm real users and are harder for non-English-speaking users, impairing the growth of non-English wikis. The CAPTCHAs should be localised or made language-agnostic. Research is needed to identify different CAPTCHA options designed for multilingualism. See related bug 32695 (mostly focused on a reCAPTCHA-like solution with Wikisource integration). Some prototypes were designed a while ago.

Impact:

Over three million CAPTCHAs are filled in every month on Wikimedia projects. The risk of failure is high, but when it succeeds, the rewards may be huge.

Possible approach:

Preliminary discussion and general questions to mentors should happen on Talk:CAPTCHA; please create specific proposals/applications as subpages of the page CAPTCHA and discuss them on talk.

Wikimedia now has infrastructure for providing machine translations via a service (partially based on unfree software). These services are now in use by the Content Translation tool and the Translate extension for page translation. We could also use these services in wiki discussions, to request the translation of a comment or a whole thread, to help non-speakers understand what is being discussed without having to copy-paste the text manually into a translation service. This could be integrated into Flow, for example.

Possible approach:

One component that is required for this is to detect the source language. Often we can assume it is the default language of the wiki or page, but in multilingual wikis such as Meta and MediaWiki.org, it is necessary to use a library or service that identifies the language.

Multilingual Wikimedia sites such as Commons, Meta, Wikidata and MediaWiki.org require registering a user account to change the interface language. It should be possible to change the language without registering a user account and logging in, and without relying on the traditional uselang hack and uselang links created via JavaScript.

The portal of each project, like wikipedia.org, should have a multilingual search, automatically returning results in the most relevant language; see also bug 1837. Such multilingual search could then be expanded to Special:Search of each wiki and triggered either automatically, or as fallback (see also above), or as another option/profile/whatever.

Project administrators/coordinators (project contacts on translatewiki.net) should be able not only to have a clear sense of what work is going on (#Relevant statistics), but also of which translators/languages may need an additional effort (or, vice versa, are going especially well), in order to be able to contact translators where needed. Detailed reporting may be needed, if not an interface to semi-automatically send notifications in certain cases (such as translators whose activity has dropped a lot in a language which needs more translations).

Thanking translators is still best done manually, but project contacts need to know whom to thank (knowing about new languages exported may also be helpful to tweak their configuration to actually use them, at times).

Impact:

The ability to easily communicate with "your own" translators can help project administrators build a sense of community and make them feel they're still in charge of the project even though they've merged with a larger wiki/community.

Possible approach:

Translators should be able to stay on top of new translation work easily, e.g. by subscribing to feeds and notifications in the projects of their interest when there are new messages in the source language or requests for translation updates (which no longer trigger edits and hence escape enotifwatchlist). They are also interested in knowing how they rank against others, but our tools for this purpose are limited: currently we have a monthly rank on the main page, a contribution count with a babel template and overall "ranks" with language statistics.

People interested in the localisation of some project often have a time-bound need for a specific subset of its messages to be translated quickly. For instance, developers or project coordinators may want a group of recent messages to be translated into as many languages as possible before a release; or interested users may want to increase translations of a handful of messages which they find especially important.

The typical way to solicit translations is to make a clickable list of messages and advertise it by public or private messages (in Wikimedia, translators-l is often used). Current options for making such a list are:

a wiki page containing wikilinks to each message (tedious, requires handling subpages; translators have to open many tabs and use the bare wiki editor);

using pre-TUX Special:Translate to list all messages in the group (and link an anchor) or list each message individually but in all languages with Special:Translations (tedious, slow to load, unsupported, uses old interface);

making a custom message group (requires intervention of developers and/or sysadmins).

Possible approach:

In an ideal workflow, Translate (or TranslationNotifications) would also be able to handle the delivery of such notifications/requests, with two possible levels of integration:

allow sending a manually-composed message (with said URL) to a list of languages/projects/users,

also allow building such a message and its list of recipients.

The focus should probably be on an interface to query message groups (to identify the needed messages) and make a list of users depending on their language (and its translation level for the needed messages), activity and interest in the project. The actual delivery could be delegated to existing systems such as Extension:MassMessage.

Sometimes projects want to know more about the workload for translators and so on. Translate offers a lot of reporting, but one simple feature we're currently lacking is the ability to count translations by number of words rather than by number of messages. Additionally, our statistics pages are currently missing information about proofreading progress.

Possible approach:

Needed components:

a way to compute the number of words (should work in any language)

storing the number of words somewhere in the database for quick access

updating the statistics pages to use words instead of messages
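For the first component, a naive baseline is whitespace-based counting (a sketch; the function name is made up, and a real implementation needs proper segmentation for languages written without spaces):

```python
import re

def count_words(text):
    """Count words by splitting on runs of whitespace. Only a first
    approximation: languages written without spaces (e.g. Chinese,
    Japanese, Thai) need a dedicated segmentation library instead."""
    return len(re.findall(r"\S+", text))
```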

Special:LanguageStats and Special:MessageGroupStats look outdated compared to the TUX editor and need a face-lift too. In addition we could make them Web 2.0 compliant and faster with AJAX by not loading all information immediately. When the number of message groups grows into the thousands, the page gets slow to load and use.

A reliable way for system administrators or wiki administrators to force hard updates of statistics and all caches would also be welcome, to easily overcome any problem with the cache, job queue or other components (compare phabricator:T145295#2916209).

New translators in particular want easy feedback on their translations and what happened to them. On Translate wikis in general, the watchlist (for the accept log and modifications) and contributions (for a mere list) are not enough, especially for unlogged/separate actions like setting a workflow state, pushing to CentralNotice, copying to another wiki, or exporting to a VCS. Credit: Gloria_S.

It is especially unclear to people when their translations will appear in the software. With some more integration of repository scripts, it should be possible to add metadata to translation revisions recording in which commits or branches they are included. Different kinds of summaries could then be built on this data, such as "these translations of yours are still waiting to be exported".

Possible approach:

The implementation would consist of two mostly independent parts:

a tool that reads git repositories to check, for each translation, in which branches it is included,

an interface that displays relevant information to the translators.
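The first part could ask git which branches already contain the commit that exported a given translation (a sketch; `git branch --contains` and `--format` are standard git options, the function names are made up):

```python
import subprocess

def parse_branch_list(output):
    """Parse `git branch --format=%(refname:short) --contains <commit>`
    output into a plain list of branch names."""
    return [line.strip() for line in output.splitlines() if line.strip()]

def branches_containing(repo_path, commit):
    """List the branches of a git checkout that contain the given commit,
    e.g. to tell a translator whether their work reached a release branch."""
    out = subprocess.check_output(
        ["git", "-C", repo_path, "branch",
         "--format=%(refname:short)", "--contains", commit]
    ).decode()
    return parse_branch_list(out)
```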

The interface could also be RSS, Twitter, IRC, whatever makes sense (perhaps also for imports): the main benefit would be transparency about what and how much we do. Since 2016, rakkaus has sent some messages to #mediawiki-i18n for autosyncs, but these are cryptic. Going further, we could send out notices to translators: "Your translations are now visible to users".

As an extension, we could try to hook up into the Wikimedia LocalisationUpdate process and the release processes of different projects to also record the information when they are deployed. This is likely much more complicated.

When multiple people are translating the same group (say, a translatable page), it would be easier to see the updates they make live, something akin to Etherpad. It doesn't need to support multiple people editing the same message at the same time. Even seeing which messages are open (and their content) would help. Credit: neverendingo.

Wiki editing and discussion don't work well when many users disagree on what's best. A quick list of alternative (past/proposed/used elsewhere) translations with relevant info would help find what's consensual.

See for instance fundraising messages, which in theory should gradually approach perfection but in reality only see things changed back and forth continuously, with the worst translations usually surviving. The situation is made worse by: a) history not being easy to reach from the translation editor, b) discussion even less so, c) activity and discussion around the same translations in other messages being completely lost, d) people never finding where the message is (especially from other wikis, as with banners).