The Python problem is amusing. Python 3 has three representations of strings internally (1-byte, 2-byte, and 4-byte) and promotes them to a wider form when necessary. This is mostly to support string indexing. It probably would have been better to use UTF-8, and create an index array for the string when necessary.

You rarely need to index a string with an integer in Python. FOR loops don't need to. Regular expressions don't need to. Operations that return a position into the string could return an opaque type which acts as a string index. That type should support adding and subtracting integers (at least +1 and -1) by progressing through the string. That would take care of most of the use cases. Attempts to index a string with an int would generate index arrays internally. (Or, for short strings, just start at the beginning every time and count.)
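A sketch of such an opaque index over a UTF-8 buffer (a hypothetical `Utf8Index` type, shown in Python purely for illustration; only +n/-n stepping by code point is implemented):

```python
class Utf8Index:
    """Opaque position into a UTF-8 byte string.

    Adding or subtracting an integer steps by whole code points,
    so the index can never land inside a multi-byte sequence.
    """
    def __init__(self, data: bytes, pos: int = 0):
        self.data = data
        self.pos = pos  # byte offset, kept on a code point boundary

    def __add__(self, n: int) -> "Utf8Index":
        pos = self.pos
        for _ in range(n):
            pos += 1
            # Skip continuation bytes (0b10xxxxxx).
            while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
                pos += 1
        return Utf8Index(self.data, pos)

    def __sub__(self, n: int) -> "Utf8Index":
        pos = self.pos
        for _ in range(n):
            pos -= 1
            while pos > 0 and self.data[pos] & 0xC0 == 0x80:
                pos -= 1
        return Utf8Index(self.data, pos)

s = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
i = Utf8Index(s) + 2          # step over 'h' and the 2-byte 'é'
assert s[i.pos:].decode("utf-8") == "llo"
```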

Windows and Java have big problems. They really are 16-bit char based. It's not Java's fault; they standardized when Unicode was 16 bits.

I think it's even better to take this one step further and have your default "character" actually be a grapheme[1]. In almost any case where you're dealing with individual character boundaries you want to split things on the grapheme level, not the code-point level.

This doesn't matter much for (normalized) western European text, but if the language in question needs to use separate diacritical code points you'll likely end up with hanging accents and the like. Swift is the only language I know of that has grapheme clusters as the default unit of character; I'd love to see it in more places.

Navigating a UTF-8 string on codepoint level is a fairly simple algorithm, since UTF-8 is self-synchronizing. This means it can easily be done without relying on external libraries or data files. It's also stable with respect to Unicode version--it always produces the same result independent of what version of the Unicode tables you use.

Moving to grapheme cluster boundaries means that the algorithm may work incorrectly if you input a string of Unicode N+1 to an implementation that only supports Unicode N. It also makes the "increment character" function very complicated. In the UTF-8 version, this looks roughly like:
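A minimal sketch of that UTF-8 code-point increment (illustrative Python, relying only on UTF-8's self-synchronizing byte layout):

```python
def next_codepoint(data: bytes, pos: int) -> int:
    """Advance a byte offset to the start of the next code point."""
    pos += 1
    # UTF-8 continuation bytes all match 0b10xxxxxx, so just skip them.
    while pos < len(data) and data[pos] & 0xC0 == 0x80:
        pos += 1
    return pos

s = "héllo".encode("utf-8")
assert next_codepoint(s, 1) == 3   # 'é' occupies bytes 1..2
```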

See the vast difference in the two implementations? It's a lot of complexity, and it's worth asking if that complexity needs to be built into the main library (strings are a fundamental datatype in any language). It's also important to note that it's questionable whether such a feature implemented by default is going to actually fix naive programmers' code--if you read UTR #29 carefully, you'll notice that something like क्ष will consist of two grapheme clusters (क् and ष), which is arguably incorrect. Internationalization is often tied heavily to GUI and, especially for problems like grapheme clusters, it arguably makes more sense for toolkits to implement and deal with the problems themselves and provide things like "text input widget" primitives to programmers rather than encouraging users to try to implement it themselves.

History has shown that, when it comes to strings, developers have a hard time getting even something as simple as null-termination correct. If grapheme handling is complex, that's an argument for having it implemented by a small team of experts exactly once. The resulting abstraction might not be leak-proof, but then no abstraction is.

(Probably a bit "unfair" to "pounce" on an off-hand parenthetical like this, but I'm in a bit of a pedantic mood...)

This is not true for e.g. Haskell. In Haskell it's defined as [Char], i.e. a list of characters. (Of course the Haskell community is suffering from that decision, but that's another story.)

I'm not sure why strings would need to be a fundamental type, though. Sure, they would probably be part of the standard library for almost all languages, but they don't need to be "magical" in the way most fundamental types (int, etc.) are.

I don't think there should even be a "default" character at all. In some cases, codepoints are the right choice; in others, graphemes are. If we make this explicit - e.g. by forbidding direct iteration over strings, but providing functions/methods to extract codepoints and graphemes as iterable sequences - the programmer has to make a conscious choice every time they iterate.
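A sketch of that explicit split in Python: iterating a `str` yields code points, while grapheme extraction is approximated below by attaching combining marks to their base (a crude stand-in; real UAX #29 segmentation needs a proper library such as ICU):

```python
import unicodedata

def codepoints(s: str) -> list:
    return list(s)  # Python iterates str by code point

def graphemes(s: str) -> list:
    """Crude grapheme split: glue combining marks onto their base.

    This handles only combining marks; full UAX #29 segmentation
    (flags, ZWJ emoji sequences, Hangul jamo, ...) needs a real library.
    """
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

s = "e\u0301"                            # 'e' + COMBINING ACUTE ACCENT
assert codepoints(s) == ["e", "\u0301"]  # two code points
assert graphemes(s) == ["e\u0301"]       # one user-perceived character
```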

In principle I'd agree with you, but the response from most people learning a language that treats strings like that is likely to be "You can't iterate over strings? And it doesn't even have a character type!" I'm not saying people are stupid: I had never really thought about the difference between a code point and grapheme until relatively recently when I had to do some low-level text layout stuff and it became important.

My point is that I don't think throwing people into the deep end and expecting them to grok the codepoint/grapheme division before they get to use the language is likely to be productive. Defaulting to graphemes carries the advantage that if a programmer who doesn't know much about languages that require more Unicode finesse uses it purely intuitively, they'll get things right in a lot of cases. Using codepoints, on the other hand, makes it easy to put text through a grinder while doing relatively innocent things.

Unfortunately while we're discussing the finer points of using graphemes and codepoints in APIs a million supplementary characters were brutally eviscerated by code running on languages that haven't quite gotten past UCS-2 XD

I think that throwing people into the deep end is the only way to get them to do it right in this case. Defaulting to graphemes is too often the wrong answer, as well (e.g. you don't want that in a parser).

Really, this is not dissimilar to the "what do you mean, there are more letters than A to Z?" issue that plagued software written in the US back before Unicode became dominant. The way we (my perspective on this is as a native speaker of a language with a non-Latin alphabet) eventually solved it is by basically forcing Unicode onto those people. It broke their simple and convenient picture of the world, and replaced it with something much more complicated. But it was necessary.

My position is that letting programmers get away with a simplistic view of text processing (by allowing defaults that "mostly" work) is what creates those issues. So adjusting the abstractions such that they expose more of the underlying complexity is a good thing. People SHOULD believe that doing text processing the right way is hard, because it is.

Perl has historically had excellent Unicode support. I remember going from Perl to Python (like a decade ago), and being annoyed at how messy Unicode support still was. Ruby, too, lacked good Unicode support, for many years after Perl had it pretty good.

But, Perl 6 definitely gets it more right than other implementations I've seen.

Since all other systems have standardized on code points, this would lead to subtle incompatibilities. For example, checking length prior to inserting into a database must be done in code points.

What I find more frustrating is how the documentation for many systems describes the basic unit of text as a character, without specifying whether a code point or a grapheme is meant, and without pointing people to an explanation of the difference. There is still a lot of software that processes Unicode text incorrectly, not because it is difficult to do so, but because nobody told the developer how things should be done.

1 grapheme = 1 unit of datatype is a good start. There is a lot of complexity in Unicode, but the mental model of a "char" is what most developers are thinking, largely blind to that complexity. Kudos to anyone trying to make that more "common sense" model closer to reality.

However... How does a language or library attempting to abstract this part (like swift might) deal with the other, unrelated annoying aspects of Unicode? Even if 1 glyph is 1 "char", and we normalize all the inputs, there is still, say ... Bi-di text.

I don't think there is an easy answer to this. Even just pure RTL text is hard to support: IIRC Android and iOS only recently made all the built-in views support it. This is complicated by the fact that most developer teams don't have, and can't afford to hire, someone familiar with RTL to implement this kind of stuff.

I don't think the rules in Unicode are simple for anyone. I have seen professional developers who are native speakers of RTL languages screw it up. All this stuff about the directionality of a paragraph and the embedded markers... It boggles the mind. But a truly international product should be getting it right. Sad mismatch there.

PS: a bit of trivia people forget these days: Win32 has supported this for longer than Android/iOS.

There are a couple cases for string indexing, usually involving parsing or regular expressions. You might want to slice the quotes off of a quoted string, or slice from one match of a regular expression to a match of a different regular expression starting at a different index. These come up infrequently enough that it doesn't make sense to make a better API just for these use cases, but frequently enough that it would be a serious impediment if we didn't do some kind of string indexing.
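Both use cases boil down to slicing with ordinary match positions; a quick illustration with Python's `re` (the string and pattern are invented for the example):

```python
import re

s = 'say "hello world" now'
m = re.search(r'"[^"]*"', s)
# Slice the quotes off using the match's start/end positions:
inner = s[m.start() + 1 : m.end() - 1]
assert inner == "hello world"
```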

I agree, however, that it's completely irrelevant whether the indexes correspond to code units (i.e. byte offsets in UTF-8) or whether they correspond to code points (how it works in Python currently), as long as we have some way to store, compare, and otherwise manipulate locations within a string.

Some Rust developers at one point proposed making string indexes their own (opaque) type, as you suggest, so that they couldn't be confused with integers used for other purposes. The extra complexity of such an API meant that this proposal was never really taken seriously, and it only prevents a small category of programming errors.

You might be interested in looking at some string APIs which are mostly without string indexing, like Haskell's Data.Text, which is one of the most well-designed string APIs ever made.

That's why I really like dealing with UTF-8. As you said, you can just index by byte instead of having to worry about code point boundaries. This works because the encoded bytes of one code point can never appear inside the encoding of another.

So if I search for a one-byte quote character it will never match the second, third, or fourth byte of a larger code point. The same goes for other ASCII characters.
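A quick illustration of that self-synchronizing property (Python; the byte offsets are found by searching, not hardcoded):

```python
text = '«quoted» "inner" 最後'.encode("utf-8")
# Every byte of a multi-byte UTF-8 sequence has its high bit set, so an
# ASCII byte like b'"' can never produce a false hit mid-character.
i = text.find(b'"')
j = text.rfind(b'"')
assert text[i + 1 : j].decode("utf-8") == "inner"
```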

Makes a lot of sense. I get frustrated with how explicit I have to be with Rust sometimes (currently I'm writing something with a lot of reusable, immutable structures and am put off by how much Rc.clone() that requires) but making it known to the programmer that every indexing of a String is effectively an iteration is a good call.

The advantage of an opaque type is that you should be able to add one to it and advance one rune. Go copies the string functions from C, but doesn't (at least as of last year) offer "advance one rune" and "back up one rune" functions.

If by rune you mean code point that doesn't really gain you anything. If you are searching for something it's faster to just scan a byte at a time.

If you want to split the string into "characters" you need to do it at the grapheme level (multiple code points), and for that you need to use a Unicode library. But that adds overhead when you're just scanning.

A JSON or XML scanner does not need the added overhead of advancing by code point or grapheme.

I don't understand the argument for "characters = grapheme clusters". From my perspective, there are a lot of different ways you'd want to iterate over a string. Grapheme cluster breaks, tailored grapheme cluster breaks, word breaks, line breaks, code points, code units… all of these make sense in some context. However, there are precious few times that I've wanted to iterate over grapheme clusters, so telling people that they should do that instead of something else doesn't make sense to me. (I mean, what problem is so common that we would want to iterate this way by default?)

For parsing, it often makes sense to iterate over code points or code units, since many languages are defined in terms of code points (and you can translate that to code units, for performance). XML 1.1, JavaScript, Haskell, etc... many languages are defined in terms of the underlying code points and their character classes in the Unicode standard. JSON and XML 1.0 are not everything.

We're pretty much on the same page here. When you want to slice a string (because you can only display or store a certain amount), or you want to do text selection and other cursor operations, you can't do it by code point. That's where you want to break at character boundaries, which are graphemes or grapheme clusters.

For parsing it's easier to just scan for a byte sequence in UTF-8 because you know what you're looking for ahead of time. If you're looking for a matching quote, brace, etc. you just need to scan for a single byte in your text stream. Adding a smart iterator to the process that moves to the start of each code point is not necessary and will slow things way down.

I just gave JSON and XML as examples, not an exhaustive list. If you know the code points you are scanning for, it's way more efficient to scan for their code units. The state machine in a parser will be operating at the byte level anyway.

I have yet to see a good example where processing/iterating by code point is the better choice (other than the grapheme code of a Unicode library).

I'm not convinced that state machines will operate at the byte level. First of all, not all tokenizers are written using state machines. Even if that is the mathematical language we use to talk about parsers, it's still relatively common to make hand-written parsers. Secondly, if you take a Unicode-specified language and convert it to a state machine that operates on UTF-8, you can easily end up with an explosion in the number of possible states. Remember, this trick doesn't really change the size of the transition table, it just spreads it out among more states. On the other hand, you can get a lot more mileage out of using equivalency classes, as long as you're using something sensible like code points to begin with.

State machines would have to operate at the byte level. Otherwise each state would need 65,536 entries. The trick to handle UTF-8 would be to run bytes 0-127 through the state machine and have bytes > 127 break out to functions that handle the various Unicode ranges valid for identifiers.

For languages that only allow non-ASCII in string literals, a pure state machine would suffice.

Not sure why you're mentioning parsers. At that point you're dealing with tokens.

As for UTF-16, it's an ugly hack that never should have existed in the first place. Unfortunately the Unicode people had to fix their UCS-2 mistake.

Since JavaScript is standardised to be either UCS-2 or UTF-16, it probably made sense to make the scanner use UTF-16.

State machines don't have to operate on the byte level because the tables can use equivalency classes. This will often result in smaller and faster state machines than byte-level state machines, if your language uses Unicode character classes here and there.
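A toy illustration of the equivalence-class idea (Python; the two-state "identifier" machine and its three classes are invented for this example):

```python
import unicodedata

def char_class(ch: str) -> int:
    """Collapse the ~1.1M possible code points into 3 equivalence classes."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):  # any Unicode letter
        return 0
    if cat.startswith("N"):  # any Unicode number
        return 1
    return 2                 # everything else

# Transition table: 2 states x 3 classes, instead of 2 x 0x110000.
# State 0 = outside an identifier, state 1 = inside one.
TRANS = [
    [1, 0, 0],  # letters start an identifier; digits/others don't
    [1, 1, 0],  # letters and digits continue it; others end it
]

def count_identifiers(s: str) -> int:
    state, count = 0, 0
    for ch in s:
        nxt = TRANS[state][char_class(ch)]
        if state == 0 and nxt == 1:
            count += 1
        state = nxt
    return count

assert count_identifiers("größe = ширина + 2") == 2
```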

Looks like JavaScript source code is required to be processed as UTF-16:

> ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.

I'm not sure it is so common in UIs. Truncation is done by a single library function, so that's one case where it's used. Another case is for character wrapping, but that's fairly uncommon. I'm having trouble coming up with another case where it's used. Font shaping is done by a font shaping engine, which applies an enormous number of rules specific to the script in use. Text in a text editor isn't deleted according to grapheme cluster boundaries, and the text cursor doesn't fall on grapheme cluster boundaries either. These are all rules that change according to the script in use.

> The Python problem is amusing. Python 3 has three representations of strings internally (1-byte, 2-byte, and 4-byte) and promotes them to a wider form when necessary. This is mostly to support string indexing. It probably would have been better to use UTF-8, and create an index array for the string when necessary.

It's especially amusing because in Python 3, strings internally cache their UTF-8 equivalent once it has been used.

Yeah, I didn't really understand this until I heard the Go/Plan 9 guys ranting about it. In other words, char* IS Unicode if you use UTF-8. Otherwise you need wchar_t and all that junk.

I think Python got unicode in the same era as Java, so it's understandable that Python 2 doesn't work like this. But if they are going to break the whole world for unicode, I also think it would have been better to do something like Go does (e.g. the rune library).

It's very easy to not get corrupted strings when byte indexing into UTF-8. In a while loop, if the index is not at the end of the string and the top two bits of the character at the index are both one, advance the index by one.

Or throw an invalid-index exception if the top two bits are one, if that makes more sense for the language you're using.

> In a while loop, if the index is not at the end of the string and the top two bits of the character at the index are both one, advance the index by one.

So what you're saying is that it's very easy to get corrupted strings from anyone who doesn't have a bit-level understanding of UTF-8 - which in my experience is the majority of programmers.

Except what happens in the real world is that people who are used to indexing and slicing ASCII strings however they please don't think "I should use a library for this", instead they just keep indexing and slicing as per usual and don't think anything of it until their Chinese customers start complaining of random program crashes, or missing text - which the developer then has difficulty trying to reproduce because hey, it works for them.

My only gripe with your argument is that I don't think it's easy to avoid corrupted text in modern text processing - which is precisely why there are libraries for it because it's actually really easy to get it wrong - even if you know what you're doing.

Which is why we have languages like Go where we can put those types of developers. Incidentally, Go uses UTF-8. Higher-level languages like Go, Python, etc. were designed so newbie and/or ignorant programmers could do less damage.

When I was working on a project before Unicode, we would switch our dev PCs to the other languages we supported. What a pain that was. The only issues we had were when a translated string was much longer than the screen space allocated to it; I believe Swedish was the main culprit. No problems with Simplified and Traditional Chinese, as those were more compact. I have no sympathy for dev shops that can't get internationalization right. As with everything else in the corporate dev world, management doesn't seem to want to hire/retain the more experienced programmers.

I think you have a gripe with my argument because you may be missing my point. If a high-level language chooses to let a programmer index into a UTF-8 string at the byte level (for performance and other reasons), it's very easy for it to prevent the programmer from slicing in the middle of a code unit.

The reason is that the language's function to slice a Unicode string would either throw an exception or just advance to the next valid index. There would be no way for the programmer to slice a Unicode string in the middle of a code unit.

> I think you have a gripe with my argument because you may be missing my point

I get your point, it just doesn't apply to many real world situations I've seen where you don't have the luxury of just using a higher level language or a library that takes care of all these things, or keeping programmers who don't understand what they are doing away from that sort of thing.

The most egregious example that I've personally seen was a developer working on a legacy Cobol banking program that needed Chinese support retro-fitted to it.

The app was originally only developed with ASCII in mind and so sliced through strings willy-nilly, which naturally caused problems with Chinese text.

The developer working on the "fix" before me, was calling out to ICU through the C API of the version of Cobol that we used and was still messing things up - he'd actually modified ICU in some custom way to prevent the bug from crashing the program, but was still causing corrupted text.

I basically undid all his changes, and wrapped all COBOL string splicing to call a function that always split a string at a valid position - truncating invalid bytes at the start/end as necessary. Much simpler and resulted in the removal of an unnecessary dependency on ICU.

This bug had been outstanding for several months when I first joined that company, and it was the first one I was assigned to work on - and luckily for them they'd accidentally hired someone who had done lots of multilingual programming before.

> it's very easy for it to prevent the programmer from slicing in the middle of a code unit.

Okay, but even you made a mistake in your first example of what to do, and that's the sort of code that someone who knows what they are doing could write, and will seem to work in the conditions under which it was tested (working on my machine, ship it!), but that will cause seemingly random problems once it hits users.

> I get your point, it just doesn't apply to many real world situations I've seen where you don't have the luxury of just using a higher level language or a library that takes care of all these things

No, I still think you're missing some of it. I am not advocating that what I said is the solution for everything.

Someone said that slicing UTF-8 strings leads to string corruption and endorsed the Python 3 Frankenstein Unicode type as a way to avoid it. I just gave a way of preventing that.

Now you argued that a novice programmer would fail to implement it properly. So you're comparing my method implemented by a novice programmer to a method implemented by professional compiler writers. That hardly seems fair. :)

So my argument is that if my method were to be implemented by professional compiler writers it would prevent corrupted strings while still using UTF-8 as the internal representation.
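For what it's worth, a sketch of such a boundary check as a library writer might implement it (note the test is for continuation bytes, `0b10xxxxxx`):

```python
def snap_to_boundary(data: bytes, i: int) -> int:
    """Move i forward to the next code point boundary (or end of string)."""
    # Continuation bytes match 0b10xxxxxx, i.e. byte & 0xC0 == 0x80.
    while i < len(data) and data[i] & 0xC0 == 0x80:
        i += 1
    return i

def safe_slice(data: bytes, start: int, stop: int) -> bytes:
    return data[snap_to_boundary(data, start):snap_to_boundary(data, stop)]

s = "日本".encode("utf-8")                    # two 3-byte characters
assert safe_slice(s, 1, 6).decode() == "本"   # start snapped from 1 to 3
```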

> I basically undid all his changes, and wrapped all COBOL string splicing to call a function that always split a string at a valid position - truncating invalid bytes at the start/end as necessary.

> luckily for them they'd accidentally hired someone who had done lots of multilingual programming before.

I'm writing this on an iPad while watching TV and playing a game on another Android tablet while looking at the Wikipedia UTF-8 article on a tiny phone screen while a little white dog is trying to bite my fingers (wish I was making this up). Not exactly my usual programming environment. ;)

Which wide characters are you talking about? Because on Windows, where wide characters are 16 bits, it's quite possible to get corrupted strings (and in fact quite a few well-known programs, written by quite well-known software companies, make this exact mistake).

All you need to do is index/slice a string half-way through any character that is outside Unicode's Basic Multilingual Plane.

This drives me crazy. The Win32 API was designed for UCS-2. Then UTF-16 came out and the API was shoehorned to use it but as you said they still haven't caught all the places where it still thinks it's UCS-2.

First of all, Unicode doesn't define characters; it defines code points.

I get that this might seem pedantic, but it's important to be pedantic about this, otherwise misconceptions and ambiguities occur e.g. 'just use wide characters' - the definition of which changes depending on the platform.

Second of all, "wide enough" for all intents and purposes means 32 bits. Technically Unicode only needs 21 bits to cover the currently defined codespace, but computers don't deal well with that, so 32 bits is the minimum "wide enough" character size.

This creates a lot of wasted space and memory, not to mention pushes medium length strings across cache line boundaries for very little benefit - the ability to directly index/slice strings without accidentally corrupting data.

Now obviously you want to avoid accidentally corrupting data; the tradeoff comes down to whether you need direct, arbitrary indexing, or whether it's worth doing some processing to determine the correct place to split in order to save space.

The technical world has come down overwhelmingly in favour of the latter, and that's why you see hardly anyone using utf-32. It's simply not as good a solution for most real world concerns.

Graphemes operate at a higher level than characters. You could construct a grapheme-strings, I suppose, but that has tons of edge cases, and if you don't like character-strings, I doubt you will like grapheme-strings.

That's what they were hoping for. Didn't turn out that way. From icu-project.org:

"As with glyphs, there is no one-to-one relationship between characters and code points. What an end-user thinks of as a single character (grapheme) may in fact be represented by multiple code points; conversely, a single code point may correspond to multiple characters."

I consider changing a glyph to some other glyph(s) as corruption. Take an emoji flag character as an example. Split it between the code points and you end up with two boxed letters.

If you chop in the middle of a code unit then you end up with U+FFFDs. In both cases the visual representation has been altered.
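Both failure modes can be demonstrated in a few lines (Python; `errors="replace"` makes the U+FFFD substitution visible):

```python
flag = "\U0001F1FA\U0001F1F8"    # 🇺🇸 = two regional-indicator code points
assert len(flag) == 2            # Python counts code points
halves = flag[0], flag[1]        # slicing yields the boxed letters 🇺 and 🇸

data = "é".encode("utf-8")       # b'\xc3\xa9'
# Chop in the middle of the byte sequence and decode leniently:
assert data[:1].decode("utf-8", errors="replace") == "\ufffd"
```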

As I wrote elsewhere it is easy for the slice routines of a language to check to see if the programmer tried to slice in the middle of a code unit and either return an error or just advance to the start of a code unit.

A code point string has other niceties, like being indexed by simple integers. If end is the index of the last code point of a grapheme, then the next grapheme starts at end + 1.

If end is instead a byte index pointing at the start of that last code point's UTF-8 encoding, then the next grapheme does not start at end + 1.

We can have it so that it does by making end point to the last byte of the UTF-8 encoding of the code point; but then it doesn't point at the start of the character, recovering which is awkward.
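A small illustration of the end + 1 mismatch (Python, comparing code-point indexing with UTF-8 byte offsets):

```python
s = "café"
# Code point indexing: the last code point 'é' is at index 3,
# and the next position is simply 3 + 1 == len(s).
assert s[3] == "é" and 3 + 1 == len(s)

b = s.encode("utf-8")
# Byte indexing: 'é' starts at byte 3 but occupies bytes 3..4,
# so the next boundary is 5, not 3 + 1.
assert b[3:5].decode("utf-8") == "é" and len(b) == 5
```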

The code uglification can be addressed by piling on abstractions: integer-like iteration gizmos that can be incremented and decremented thanks to function or operator overloading.

I feel that that level of abstraction has no place in character-level data processing (if anywhere); its basic operations should be expressible tersely, in a few machine instructions.

Also, we mustn't lose sight of what the T means in UTF-8: transfer. It's not called UPF-8 (the Unicode processing format in 8 bits).

Working with UTF-8 instead of with the objects that UTF-8 denotes is like working with a textual representation of Lisp s-expressions that still contain the parentheses and whitespace delimitation, and quotes around strings and so on, refusing to parse them to obtain the object which they represent. People who do this should immediately turn in their CS degrees.

All those other issues you refer to are addressed by more parsing. If you want the glyphs, the correct thing is to parse the code-point string and make a list or vector of glyph representations.

With that representation you can still break the text "carpet" into "car" "pet" which destroys semantics; that is dealt with by parsing into words.

Chopping lists of words destroys phrases; so parse phrases, and transform at the phrase level.

His point was that the transfer format should be conceptually independent of the processing. Obviously it has to be encoded in RAM in some way, but the programmer doesn't need to worry about the memory layout.

Why? Why should it be conceptually different if it's easy to work with the encoded form?

Many Unicode-aware languages work with UTF-8 or UTF-16 internally. So working with the "transfer format" is common practice.

While it may not be necessary to know how the languages you program in work under the hood, expert programmers do want/need to know. That way they can write better code, switch to another language, or get the language devs to improve their internal handling.

Bad example. If you want to embed that string in your code you have to type those 6 characters anyways.

A better example is finding a newline in a string. If you do a find in a UTF-16 string it may be position 8, and a find in UTF-8 may be position 12. Does it matter what the actual number is? NO. You just pass it to the next function or whatever.
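A concrete version of that point (Python; the offsets differ between encodings, but each slices its own representation correctly):

```python
s = "café\nau lait"
u8, u16 = s.encode("utf-8"), s.encode("utf-16-le")

i8 = u8.find(b"\n")                            # byte offset: 5 ('é' took 2 bytes)
i16 = u16.find("\n".encode("utf-16-le")) // 2  # code-unit offset: 4
assert (i8, i16) == (5, 4)                     # different numbers...
assert u8[:i8].decode("utf-8") == u16[:i16 * 2].decode("utf-16-le") == "café"
```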

Likewise, I wish JS had just changed the internal representation of strings to UTF-8, and accepted that some older code might break, instead of adding the new string/regex bits. It would have made far more code simply start working with large/multibyte/international characters than it ever would have broken.

I haven't done Python for a while, so maybe it's changed, but I thought the new Python 3 strings were indexed by byte rather than character, and it was 'as designed' despite being very non-intuitive (unlike the rest of Python).

Edit: see pjscott's comment below - it's by code point, not byte, but still not by character.

That is perhaps the most succinct and accurate way I've heard to explain and justify why you're sounding like a wet blanket to people that may not understand, while acknowledging that you know how you sound, but there is a reason for it. I expect to use this in the future.

It's worse. With UTF-8, if you're not processing it properly it becomes obvious very quickly with the first accented character you encounter. With UTF-16 you probably won't notice any bugs until someone throws an emoticon at you.

_Unicode_ adoption increased to 87%.
At the cost of non-Unicode encodings.

UTF-16 isn't good enough for the web: even for content in Ukrainian or Hebrew, UTF-8 saves sizeable bandwidth, because spaces, punctuation marks, newlines, digits, and English-inspired HTML tags all encode to 1 byte per character in UTF-8 - and for the web, bandwidth matters.
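The saving is easy to measure (Python; the sample string is an arbitrary Ukrainian snippet invented for the example):

```python
html = "<p>Привіт, світе!</p>\n"   # Ukrainian text inside HTML tags
u8 = len(html.encode("utf-8"))
u16 = len(html.encode("utf-16-le"))
# The Cyrillic letters cost 2 bytes in both encodings, but the tags,
# spaces, punctuation and newline cost 1 byte in UTF-8 vs 2 in UTF-16.
assert (u8, u16) == (33, 44)
```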

We can't even convince Microsoft, Apple, and everything else Unix based to agree on line endings. How on earth are we going to convince everyone that one character encoding format is the only way they should store their data?

Annoying as it is to deal with, our history as computer scientists demands that we maintain compatibility with older systems and encoding formats that were once used but are now almost forgotten. If we removed all the other encoding formats (code paths that, while underused, still function perfectly fine) we would lose the ability to parse and manipulate a lot of old data.

I have the feeling that back in 1990, ISO 10646 wanted 32-bit characters but had no software folks on the committee, while the Unicode people were basically software folks but thought that 16 bits were enough (this dates back to the original Unicode proposal from 1988). UTF-8 was only created in 1992, after the software folks rejected the original DIS 10646 in mid-1991.

An interesting suggestion they make is to keep UTF-8 also for strings internal to your program. That is, instead of decoding UTF-8 on input and encoding UTF-8 on output, you just keep it encoded the whole time.

What would be a good alternative for strings internal to your program?

I work with multilingual text processing applications, and I strongly support that concept. A guideline of "use UTF-8 or die" works well and avoids lots of headaches - it is the most efficient encoding for in-memory use (unless you work mostly with Asian charsets, where UTF-16 has a size advantage), and it is compatible with all legal data. So it's quite effective to have a policy that 100% of your functions/API/data structures/databases pass only UTF-8 data, and when other encodings are needed (e.g. file import/export), the data is converted at the very edge of your application.

Having a mix of encodings is a time bomb that sooner or later blows up as nasty bugs.

Abstraction is the alternative. Design an API that treats encodings uniformly, and the encoding becomes an internal implementation detail. You can then have a polymorphic representation that avoids unnecessary conversions. NSString and Swift String both work this way.

IMHO, UTF-16 is the worst of both worlds. It breaks backwards compatibility in the simple case and wastes storage, but still needs complex multi-byte decoding because it's not a fixed-length encoding.

UTF-8 is probably the best compromise of the lot, with the advantages of UTF-32 being outweighed by the massive overhead in the most common case.

I'm on Mac and I've had problems with Chrome sending ajax requests or decoding ajax responses in ISO-8859-1, if I remember correctly. I had to add "; charset=utf-8" to my headers. I remember it was a browser problem, and I think it was the same for all browsers.

The older the site, the less likely it is that it will have been updated. Therefore, it's reasonable to assume that newer sites will either declare UTF-8, or can be modified to declare UTF-8, while old sites stay the way they always were, pre-UTF-8.

If your goal is to eliminate ICU, there's no change you can realistically make. Unicode has problems, but the most obvious things to fix (CJK unification, precomposed versus combining characters, semantically different characters with completely identical glyphs (Angstrom sign versus A-with-ring-above, e.g.)) do not eliminate the need for ICU.

Languages are horribly complicated. The Turkish ı/İ issue makes capitalization a locale-dependent thing, and things like German ß/ẞ/ss/SS make case conversion in general mind-boggling. The treatment of diacritics in Latin script for collation purposes differs very heavily between major European languages, so sorting and searching are again locale-dependent. And by the time you're dealing with the locale mess of languages, handling locale-specific number, date, and time representations is pretty much trivial.
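Some of these wrinkles are easy to observe from Python's stdlib (a sketch; note that Python's built-in case mapping is deliberately locale-independent, which is exactly the problem for Turkish):

```python
# German sharp s expands to two letters on uppercasing.
assert "ß".upper() == "SS"

# Dotted capital İ (U+0130) lowercases to 'i' plus a combining dot above:
# the result is two code points long.
assert "İ".lower() == "i\u0307"

# The locale-independent mapping gives the wrong answer for Turkish,
# where 'i' should uppercase to dotted 'İ', not 'I'.
assert "i".upper() == "I"
```

Even these one-character cases show why a locale-aware library like ICU is hard to avoid.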

Giant Unicode character tables and CLDR tables, or tables that capture similar information, are quite frankly necessary to handle internationalization to any substantial degree.

It's worse. Sorting is dependent on the task at hand. http://userguide.icu-project.org/collation: "For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite."

That page has lots more 'interesting' cases, for example:

"Some French dictionary ordering traditions sort accents in backwards order, from the end of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o"."

That means that, given two strings s and t such that s sorts before t, you can append characters to t to get u which sorts before s. EDIT (after reading the reply of kelnage): _for some strings s and t_

No, I don't think that example does imply that. I interpret it as meaning that for the variants of the same "base word" (i.e. all characters are unaccented) the ordering is defined by the positions of the accents rather than their respective orderings. It says nothing about two words that have different lengths or bases.

Given more and more custom fonts in OSes/websites, maybe by using some new APIs we don't need to specify everything in the Unicode standard. We could design a new font format, or just a separate data file, to store that locale-specific information. The Unicode code points then become parking slots for different fonts (with locale info to be registered). And we could use a standard/default data file to keep the old info from the current Unicode standard (say Unicode 8.0).

This is just my first thought. It seems the job of ICU would be transferred to the OS or web browser.

If you want to handle characters by anything much simpler than current Unicode, you need to simplify the reality that Unicode describes, changing or eliminating a bunch of major human languages. Not all of them, and not even most of them, but still hundreds of millions of people would need to change how they use their language.

It could happen in a century or two, actually; we are seeing some language trends that do favor internationalization and simplification over localization and keeping with linguistic tradition.

Simplification (caused by internationalization) and diversification (caused by localization) are two ends of a spectrum, but languages, in both their spoken and written forms, have bounced between those ends throughout history. In a century or two, by the time simplification has succeeded on Earth, the settlers on Titan will rebel with their own graphical symbols for displaying language.

> we are seeing some language trends that do favor internationalization and simplification over localization and keeping with linguistic tradition.

I know you're not necessarily advocating it, but if our cultures change to adapt to our technological limitations, that's the reverse of what I think should be happening - there's a problem with the tech.

> In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes.

Only half right: UTF-16 does take up to 4 bytes, but so does UTF-8 for any valid code point. The original UTF-8 design allowed sequences of up to 6 bytes, but RFC 3629 restricted it to 4, which is enough to reach U+10FFFF.
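Under current Unicode (RFC 3629 caps UTF-8 sequences at four bytes), the extremes are easy to check with a quick Python sketch:

```python
top = chr(0x10FFFF)  # the highest valid Unicode code point

assert len(top.encode("utf-8")) == 4      # RFC 3629 caps UTF-8 at 4 bytes
assert len(top.encode("utf-16-le")) == 4  # one surrogate pair = two 16-bit units
assert len("A".encode("utf-8")) == 1      # ASCII stays 1 byte in UTF-8
assert len("A".encode("utf-16-le")) == 2  # but always 2 bytes in UTF-16
```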

> Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.

Cyrillic, Hebrew and several other languages still have spaces and punctuation, which take a single byte in UTF-8. It's 2016 now: RAM and storage are cheap and getting cheaper, but the cost of a CPU branch misprediction is still around 20 cycles and is not going to decline.

> plain Windows edit control (until Vista)

Windows XP is 14 years old, and now in 2016 its market share is less than 3%. Who cares what was there before Vista?

> In C++, there is no way to return Unicode from std::exception::what() other than using UTF-8.

The exceptions that are part of the STL don't return Unicode at all; they are in English.

> Do you mean they return the text as bytes using some (likely ASCII) character encoding and all the text characters are in ASCII range?

If you rely on std::exception::what() while building localizable software, you'll end up with an inconsistent GUI language, because some exceptions (those that are part of the STL) will return English messages while other exceptions (those that aren't) will return non-English messages.

This means if you’re developing anything localizable, you can’t rely on std::exception::what().

The standard does not specify what the standard exceptions return from what(). It does not have to be in English.

Why care about its prototype? You may want to embed into what() Unicode strings that describe the error and come from elsewhere, e.g. a path, a URL, an XML element id, etc., from the context where the exception originated. It may be shown to the user or written to a log. Localization is irrelevant here.

Places and situations where you can't accommodate variable-length encodings. As far as future-proofing, UTF-8 is essentially the new ASCII, in that UTF-8 will remain a backward-compatibility goal for any other format that will succeed it.

After considering this problem at length in the past, I too favoured UTF-8 at the time.

I remember a project (circa 1999) I worked on: a feature-phone HTML 3.4 browser and email client (one of the first). The browser/IP stack handled only ASCII/code-page characters to begin with. To my surprise it was decided to encode text on the platform using UTF-16, so the entire code base was converted to use 16-bit code units (UCS-2). On a resource-constrained platform (~300 KB of RAM, IIRC), it would have been better, I think, to update the renderer and email client to understand UTF-8.

Nice as it might be to imagine that a UTF-16 or UTF-32 code unit is a "character", it is, as has been pointed out, not the case, and when you look into language you can see how it never could be that simple.

I quite like Swift's approach: Characters, where a character is "an extended grapheme cluster ... a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character." In practice this seems to mean that things like multi-byte entries and modified entries end up as a single entry.

As the trade-off, directly indexing into strings is either not possible or discouraged, and often relies on an opaque indexing type.
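For contrast, a quick Python sketch shows what code-point indexing does to a decomposed accent (the combining sequence here is just an illustrative example):

```python
import unicodedata

decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: one grapheme, two code points
assert len(decomposed) == 2       # code-point indexing sees two "characters"
assert decomposed[1] == "\u0301"  # slicing can strand the bare accent

composed = unicodedata.normalize("NFC", decomposed)
assert composed == "é" and len(composed) == 1  # NFC happens to rescue this case
```

Note that many grapheme clusters (emoji with modifiers, many Indic sequences) have no precomposed form, so normalization doesn't help in general; that is where Swift's Character abstraction pays off.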

The main weirdness I have encountered so far is that the Regex functions operate only on the old, Objective-C style of indexing, so a little swizzling is required to handle things properly.

What's the use case? Making sure you don't introduce them as a desktop user? As an app developer (and what does the app do)? As a sysadmin with unknown third-party apps?

You can't really "disable utf-8" on Linux. You can change how things are encoded when displaying or saving. (via locale/lang variables) But if the app wants to create a file named "0xE2 0x98 0x83" (binary version of course), it's still free to do that.
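A sketch of that point (intended to run on Linux; the temp directory and the snowman bytes are just for illustration):

```python
import os
import tempfile

d = tempfile.mkdtemp().encode()  # work with the path as raw bytes
snowman = b"\xe2\x98\x83"        # the UTF-8 bytes of U+2603 SNOWMAN

# The kernel treats file names as opaque bytes; no locale setting prevents this.
open(os.path.join(d, snowman), "wb").close()
print(os.listdir(d))             # the raw bytes come back unchanged
```

The locale variables only affect how user-space programs interpret those bytes for display, not what the filesystem will store.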

I just don't want garbage file names when sharing a file system between systems that don't agree on the encoding. I was thinking maybe some mount option; I can use ISO-8859-1 and skip UTF-8, but I haven't found a mount option for ext4 or xfs yet.

This militancy to force everyone to use UTF-8 is bad engineering. I'm thinking of GNOME 3, where you aren't even allowed the option of choosing ASCII as a default setting, only UTF-8 or ISO-8859-x. A default setting is just as important for what it filters out as for what it passes through. I use a lot of older tools on *nix that are ASCII-only, in tool chains that slurp and munge text. If the chain includes any of these UTF-8-only apps, I'm constantly dealing with the problem of invalid ASCII passing through.

Without. UTF-8 is such a distinctive pattern that if text with high bits set matches UTF-8, it's almost certainly UTF-8. There's no need for a BOM to tell you it's UTF-8 (looking at you, Windows), and it can easily confuse software instead.
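That detection heuristic is trivial to sketch in Python (the function name is mine):

```python
def looks_like_utf8(data: bytes) -> bool:
    """If bytes with the high bit set still decode as UTF-8,
    the text is almost certainly UTF-8 -- no BOM required."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8("naïve".encode("utf-8"))
# The same word in Latin-1 contains a lone 0xEF byte, which is invalid UTF-8.
assert not looks_like_utf8("naïve".encode("latin-1"))
```

The strict continuation-byte structure of UTF-8 is what makes random non-UTF-8 byte sequences so unlikely to validate by accident.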

MS popularized the idea of adding the UTF-16 BOM into UTF-8 to distinguish between UTF-8 text files and Windows code page files, or what they called "Unicode" and "ANSI." There's (nearly?) unanimous agreement among everyone else that BOMs in UTF-8 text are really stupid.

Note that the "BOM" in this case means storing the U+FEFF character in UTF-8 form (just as UTF-16 stores it in the appropriate endianness). This means that the result would be EF BB BF.
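A quick Python check of those byte sequences:

```python
bom = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, used as the BOM

assert bom.encode("utf-8") == b"\xef\xbb\xbf"     # endianness-free
assert bom.encode("utf-16-be") == b"\xfe\xff"     # big-endian UTF-16
assert bom.encode("utf-16-le") == b"\xff\xfe"     # little-endian UTF-16
```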

From section 2.6 in the standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

Yes, it can be used to distinguish a UTF-8 stream but it's not recommended. One issue is you can't tell if the BOM is not valid text in some other non-unicode encoding.

I'm curious where you've encountered missing content-encoding headers or other OOB indicators where it wasn't because of programmer error or laziness.

There's a difference between what a program should accept as input and what it should generate as output. The standard just says to expect a BOM on input and suggests not generating one on output. In other words, "a UTF-8 BOM is a bad idea, but some yutz out there started doing it, so we should ignore it on input." Someone else mentioned that the yutz was Microsoft.
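Python models that accept-but-don't-generate advice directly with its "utf-8-sig" codec, which strips a leading BOM on input; a sketch:

```python
data = b"\xef\xbb\xbfhello"

# 'utf-8-sig' tolerates and strips a leading BOM...
assert data.decode("utf-8-sig") == "hello"
# ...and also works when there is no BOM at all.
assert b"hello".decode("utf-8-sig") == "hello"
# Plain 'utf-8' keeps the BOM as a visible U+FEFF character.
assert data.decode("utf-8") == "\ufeffhello"
```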

I misread what you wrote about where you saw no indication that it was UTF-8; you were talking about places other than the web.

A BOM for UTF-8 text files seems to be a Microsoft thing; everyone else just defaults to UTF-8. But you can't be sure whether it's a UTF-8 BOM or some other encoding. Most editors let the user override what it is.

Why would you store text in a blob column? If a database can't handle UTF-8 in its text column, it needs to be fixed (or taken out back and shot).

> There's a difference between what a program should accept as input and what it should generate as output.

I’m a Windows developer. In my world, a program should generate its output in whatever format user wants it to be.

When I press "File/Save As" in Visual Studio and click on the down-arrow icon, I see a choice of more than 100 different encodings (including all flavors of Unicode with and without the BOM), plus an independent choice of 3 line endings (Windows, Mac, Unix).

> BOM for UTF-8 text files seems to be a Microsoft thing

Practically, maybe: most Microsoft apps tend to understand those BOMs, and most *nix tools don't, even on input.

> Could you please name a Microsoft program that you think requires a BOM?

Visual C++ off the top of my head. It mangles UTF-8 string literals without the BOM in the source code.

> For me, Microsoft programs open text files just fine, with or without the BOM. But most *nix and osx programs show me garbage instead of BOM.

That's what I was trying to say about the BOM being prevalent on the Windows side of the fence. Some programs require it, and some always generate it, so most programs now accept it.

On the Unix/OSX side everyone switched to UTF-8, so the BOM is redundant. Everything is UTF-8, so the silliness of needing a BOM to say "this doesn't need a BOM" doesn't exist. Good example of what the "UTF-8 Everywhere" site is trying to promote.

Personally I really wish Microsoft would eventually fix their UTF-8 codepage. Would be so nice not having to convert to/from UTF-16 at the Win32 API boundary.

Kind of got off track here. You can process a lot of formats as text (HTML, CSS, XML, etc.), so a BOM there is unnecessary and sometimes detrimental. On the Unix side there are a lot of text utilities that do useful things with these formats; that's probably why BOMs are nonexistent there.

> MS can’t change the compiler because backward compatibility.

You care to tell MS that? Every single time I've done a major VS upgrade my code had to be changed because something that was valid before stopped being valid.

> And there aren’t any.

If you can't see any benefit of using UTF-8 then I'm done debating with you.

The Unicode BOM is code point U+FEFF; the process of encoding it determines the byte order.

Encoded to UTF-8 it becomes EF BB BF. Encoded to UTF-16 big-endian it becomes FE FF; encoded to UTF-16 little-endian it becomes FF FE.

Converting it back from UTF-8 always gives you U+FEFF, since UTF-8 has no endianness. Converting it back from UTF-16 with the correct endianness gives you U+FEFF; converting it with the wrong endianness gives you U+FFFE, which Unicode defines as a "noncharacter" that should never appear in text.
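The round trip, including the wrong-endianness case, can be checked in a few lines of Python:

```python
assert b"\xef\xbb\xbf".decode("utf-8") == "\ufeff"  # no endianness involved
assert b"\xfe\xff".decode("utf-16-be") == "\ufeff"  # correct endianness
assert b"\xfe\xff".decode("utf-16-le") == "\ufffe"  # wrong endianness: noncharacter
```

Seeing U+FFFE at the start of decoded text is therefore a reliable signal that the decoder picked the wrong byte order.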