Many flaws in short/narrow unit data

Description

The CLDR 24 short and narrow forms data for units contain many
outright errors, inconsistencies, nonce abbreviations, failures to
compose SI unit designations from prefix+named unit, and other errors
of various kinds. It would be a very bad idea to use that data in any
application.

I would suggest a special round of opening the ST for short/narrow units
only, perhaps limited also to the locales which already have some such data.
The starting point should be the data in the spreadsheet linked above,
not the current erroneous data.

Attachments

Change History

The committee reviewed this bug. It appreciates the work you have done; however, it cannot simply take a raft of data that has not been vetted, for a variety of languages that you are not fluent in, and that do not follow the principles that CLDR is using for units.

If you want to file a new bug with narrower scope, for languages that you are fluent in, following the principles that CLDR is using, the committee would welcome that.

The committee [...] cannot simply take a raft of data that has not been vetted,

That was not the request, even if I would have been happy if you had done so...

The request was, and is:

I would suggest a special round of opening the ST for short/narrow units
only, perhaps limited also to the locales which already have some such data.
The starting point should be the data in the spreadsheet linked above,
not the current erroneous data.

In addition, "bulk upload" should be turned off, to lessen the risk that old bad data is re-uploaded. The data "vetted" in that round will still need to be reviewed after the ST closes, so that erroneous or inconsistent data can be fixed
(or deleted).

Note that the committee has, erroneously, accepted a "raft" of data that contains numerous grave errors, and an even greater number of inconsistencies. I don't see why the committee wants to keep such exceptionally low quality data;
the quality for short/narrow units in CLDR 24 is so low that the data should never have been released.

for a variety of languages that you are not fluent in, and that do not follow the principles that CLDR is using for units.

Not sure what principles those would be, since the released data on units follows no principles at all, and is a hodgepodge of inconsistencies, lazy votes for (bad) root data, nonce abbreviations, lack of appropriate
"translations", and outright errors.

In the data I give in the spreadsheet referred to above, I have fixed numerous problems, including:

outright errors

decimeters referred to when one must refer to inch

tonne referred to when one must refer to hour

second referred to when one must refer to hour

"mile" nearly always needs to be qualified to be of the "English" variety (fixed many such omissions in the spreadsheet after first submitting this ticket)

volt referred to instead of watt

local use area unit referred to instead of acre (a qualification to "English" could be an acceptable fix, but was not present)

others

inconsistencies

use of space or not between number and unit designation (minor, but still annoying)

each named unit needs to always be referred to in the same way within a locale, including:

same script (unless one really needs to keep Latin script)

same letter sequence

each prefix (for SI units) is always referred to in the same way within a locale (ok, there is an issue with hectare and zh that needs to find a solution, which I currently don't have)

for locales using the Latin script, the SI units must use the standard symbols for the units; it is a grave error not to do so

nonce abbreviations for SI units, prefixes and powers

the prefixes need to be "translated" consistently within each locale, and not be an abbreviation; if not in the Latin script, the designation must still be "in the spirit of" how SI prefixes are constructed

the named units need to be "translated" consistently within each locale, and not be an abbreviation; if not in the Latin script, the designation must still be "in the spirit of" how SI named units are constructed

while it is understandable that superscript digits are often substituted with abbreviations for "square"/"cubic", that is more an issue with keyboard layouts (hard to write superscript digits) than anything else; for this data superscript digits should be used in all locales
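The consistency requirements listed above (standard SI symbols, prefix + named unit composition, superscript digits for powers) lend themselves to a mechanical check. The following is only a minimal sketch in Python: the symbol tables are a small illustrative subset of the standard SI symbols, the (power, prefix, unit) key layout is hypothetical and not how CLDR stores its data, and the sample values are made up for illustration.

```python
# Standard SI symbols; these tables are a small illustrative subset.
SI_PREFIX = {"": "", "kilo": "k", "centi": "c", "milli": "m"}
SI_UNIT = {"meter": "m", "gram": "g", "watt": "W"}
SI_POWER = {"": "", "square": "²", "cubic": "³"}


def compose_symbol(unit, prefix="", power=""):
    """Build a symbol such as 'km²' from power, prefix and named unit."""
    return SI_PREFIX[prefix] + SI_UNIT[unit] + SI_POWER[power]


def find_noncompositional(entries):
    """Return {key: (found, expected)} for entries that do not compose.

    `entries` maps a hypothetical (power, prefix, unit) key to the
    designation found in a Latin-script locale's short or narrow data.
    """
    mismatches = {}
    for (power, prefix, unit), found in entries.items():
        expected = compose_symbol(unit, prefix, power)
        if found != expected:
            mismatches[(power, prefix, unit)] = (found, expected)
    return mismatches


# Made-up sample data, not taken from CLDR.
sample = {
    ("", "kilo", "meter"): "km",
    ("square", "kilo", "meter"): "sq km",  # nonce abbreviation for the power
    ("", "kilo", "watt"): "kW",
}
print(find_noncompositional(sample))
```

For locales not using the Latin script the same idea would apply with per-locale prefix and unit tables in place of the Latin symbols; a check like this only flags candidates and does not replace review.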

If you want to file a new bug with narrower scope, for languages that you are fluent in,

Actually, very little of the data for short/narrow units is *linguistic* at all. Only in the cases for imperial and popular (horsepower, light year) units do actual words from any language come into play. And there I most often have picked data from the long form, and where needed tried to find a suitable short/narrow form. For the SI units it is more a case of finding the appropriate letters in the script. Nonce abbreviations are not ok for the SI units.

B.t.w. for the languages where I can submit data via ST, the units section has higher quality (though some entries were later botched) than for most other locales...

And yes, here I can spot the errors without being fluent in the languages in question. As I said, very little of the data for short/narrow units is *linguistic* at all.

Unless something special is done, errors, inconsistencies, and other inappropriate things in the units section will linger for years, even decades, making the units section of the CLDR locales inappropriate, even an error, to use for years or
even decades. I think that would be a very bad idea.

following the principles that CLDR is using, the committee would welcome that.

Again, not sure what principles you are referring to, given that the released units data in CLDR 24 is an unprincipled mess filled with errors (and any application using it is in grave error), whereas the data I propose as a base for an ST round is highly principled...

And yes, there really is a need to be able to say for imperial units: "not used". I don't mind "translations" of the imperial units per se very much (if they are corrected, otherwise I do mind), but they really need to be marked as not used for most locales. Whether that is done by deleting the translation, replacing each of them with "∅∅∅", or by a separate attribute ("common use"/"limited use"/"not used") I leave up to the committee. But if using an attribute, all SI compatible units must be marked "common use", though, regardless of locale. And no, I don't
mind local use units. Indeed, I'd like to see, in future, among others, coverage for Scandinavian mile as well as units popularly used locally in Asia and elsewhere.
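To make the attribute option concrete, here is a rough sketch in Python of how the three values and the "SI compatible units are always common use" rule might be modeled. The names `Usage`, `SI_COMPATIBLE` and `usage_violations` are hypothetical, the unit list is an illustrative subset, and nothing here reflects an existing CLDR attribute.

```python
from enum import Enum


class Usage(Enum):
    """Hypothetical per-locale usage marker for a unit."""
    COMMON = "common use"
    LIMITED = "limited use"
    NOT_USED = "not used"


# Illustrative subset of SI compatible units (assumption, not CLDR data).
SI_COMPATIBLE = {"meter", "kilometer", "gram", "kilogram", "watt"}


def usage_violations(usage_by_unit):
    """Return SI compatible units that are not marked as common use."""
    return sorted(
        unit
        for unit, usage in usage_by_unit.items()
        if unit in SI_COMPATIBLE and usage is not Usage.COMMON
    )


# Made-up markings for one locale.
example = {
    "kilometer": Usage.COMMON,
    "mile": Usage.NOT_USED,
    "watt": Usage.LIMITED,  # would violate the rule proposed above
}
print(usage_violations(example))
```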

I really do think that we (and the world...) need to be saved from the bad data in the units section of CLDR 24. Your approach so far does not achieve that, but my suggestion might.

Kent, the committee would like to make a number of fixes that you outline, but needs to pick among the types. So what would help us is if you could mark them in the spreadsheet according to the following breakdown.

outright errors

decimeters referred to when one must refer to inch

tonne referred to when one must refer to hour

second referred to when one must refer to hour

volt referred to instead of watt

consistency -- if inconsistent within the same unit and width.

use of space or not between number and unit designation (minor, but still annoying) --

each named unit needs to always be referred to in the same way within a locale: same script -- if inconsistent within same unit and width

can't take without justification by native speaker.

"mile" nearly always needs to be qualified to be of the "English" variety (fixed many such omissions in the spreadsheet after first submitting this ticket)

local use area unit referred to instead of acre (a qualification to "English" could be an acceptable fix, but was not present)
*[We've queried many people, and have found that for many languages, the traditional unit known as "mile", even if of a different length, is in modern use unambiguously interpreted as the English unit]

each prefix (for SI units) is always referred to in the same way within a locale (ok, there is an issue with hectare and zh that needs to find a solution, which I currently don't have)

for locales using the Latin script, the SI units must use the standard symbols for the units; it is a grave error not to do so

nonce abbreviations for SI units, prefixes and powers

the prefixes need to be "translated" consistently within each locale, and not be an abbreviation; if not in the Latin script, the designation must still be "in the spirit of" how SI prefixes are constructed

the named units need to be "translated" consistently within each locale, and not be an abbreviation; if not in the Latin script, the designation must still be "in the spirit of" how SI named units are constructed

[We don't have assurance that it is not standard practice in the language.]

any others that aren't outright errors or inconsistencies that span units or widths.

don't quite understand

while it is understandable that superscript digits are often substituted with abbreviations for "square"/"cubic", that is more an issue with keyboard layouts (hard to write superscript digits) than anything else; for this data superscript digits should be used in all locales

I've added comments (column J) regarding what change has been made for each entry (line).

It's not exactly the breakdown above, but many items are changed for multiple reasons, and it is hard to mention them all. I hope the comments are still helpful; they are a bit more detailed than the above breakdown.

I've also gone over the data itself yet again, and done more changes similar to the ones I made in the first round.

Columns F and G are for short, with column G indicating which values should be marked "unused". Columns H and I are for narrow, with column I (redundantly) indicating which values should be marked "unused". (I'll file a separate ticket on that.)

Apart from spacing and outright errors (referencing another unit, missing {0}, or even a missing unit designation), the proposed values should of course be reviewed by translators (I already suggested a separate opening of the ST just for that, with the proposed changed values as primary). The suggested fixes to the fixes should then be reviewed as well (by the committee), so that inconsistencies or errors are not reintroduced.
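The purely mechanical problems mentioned above (a missing {0}, a missing unit designation, inconsistent spacing between number and unit) need no knowledge of the language and could be caught by a script. A minimal sketch in Python follows; it assumes a hypothetical mapping from a unit id to its pattern string for one locale and one width, and the sample patterns are made up.

```python
import re


def check_patterns(patterns):
    """Return (unit, problem) pairs for one locale and one width.

    `patterns` is a hypothetical mapping such as {"kilometer": "{0} km"}.
    """
    problems = []
    spaced, unspaced = [], []
    for unit, pattern in patterns.items():
        if "{0}" not in pattern:
            problems.append((unit, "missing {0}"))
            continue
        if not pattern.replace("{0}", "").strip():
            problems.append((unit, "missing unit designation"))
            continue
        # Whitespace directly before or after the number placeholder?
        if re.search(r"\s\{0\}|\{0\}\s", pattern):
            spaced.append(unit)
        else:
            unspaced.append(unit)
    if spaced and unspaced:
        # Report the smaller group as the one deviating from the rest.
        minority = min(spaced, unspaced, key=len)
        problems.extend((unit, "spacing differs from the rest") for unit in minority)
    return problems


# Made-up sample data, not taken from CLDR.
sample_patterns = {
    "kilometer": "{0} km",
    "gram": "{0} g",
    "meter": "{0}m",
    "hour": "h",
    "watt": "{0}",
}
for unit, problem in check_patterns(sample_patterns):
    print(unit, "->", problem)
```

Which spacing convention is right for a locale is not something the script decides; it only reports that the locale is not consistent with itself.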

Wow, this is a lot of data, and a very thorough analysis. But many of the potential issues identified here need some input from language specialists. I do not have time to do this in CLDR 25. We will try early in 26 or perhaps in a dot release.