I think the most important thing Spolsky is trying to do is get programmers out of the ASCII mindset – one byte, one character. Once you’ve made sure your app can handle Unicode, emoji support just comes along as a bonus.

As an aside… the Unicode Consortium has expressed some dismay about all the attention and money lavished on the “trivial” emoji space, but I think that, in the bigger picture, leaving people with the impression that “Unicode is great, it gives us emoji!” is a net positive for the organisation.

Agreed – if emoji can get Amerocentric¹ programmers to care at all about non-ASCII support it’s a win for everyone. It doesn’t solve things like right-to-left problems, but it goes a long way toward making software accessible. Unfortunately, proper Unicode still seems like a chore instead of a basic feature. Even Rust, which goes as far as discouraging you from iterating over “chars” in the standard library, admits that “[g]etting grapheme clusters from strings is complex, so this functionality is not provided by the standard library.”
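Rust’s caveat is easy to demonstrate with a combining character. A minimal sketch (standard library only, no crates) showing that `.chars()` counts Unicode scalar values and `.len()` counts UTF-8 bytes – neither of which matches what a user perceives as one “character”:

```rust
fn main() {
    // "é" as a single precomposed code point vs. "e" + combining acute accent.
    let precomposed = "\u{00E9}";  // é (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
    let decomposed = "e\u{0301}";  // e + U+0301 COMBINING ACUTE ACCENT

    // Both render as a single grapheme cluster, but .chars() yields
    // one code point for the first and two for the second:
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);

    // .len() is the UTF-8 byte length, different again:
    assert_eq!(precomposed.len(), 2);
    assert_eq!(decomposed.len(), 3);

    // Splitting on grapheme-cluster boundaries needs an external crate
    // such as unicode-segmentation; the standard library deliberately
    // doesn't provide it.
}
```

The two strings compare unequal with `==` as well, which is why normalization (NFC/NFD) is yet another thing you have to think about before comparing user input.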

I look forward to a time when handling these things is the default and “iterating over chars” is difficult. Maybe it’s not possible given the varied features of human language and orthography, but I think there is still a long way we can go with the technology we currently have.

¹ I know “Amerocentric” technically applies to the entire continents of North and South America, but I couldn’t find a better word to capture the sense of “programmers who only consider en_US when designing and testing software.”

I’ve recently been doing some Unicode work for command-line tools I’ve been writing, and I found the Unicode specs fairly hard to read; their being spread out over multiple documents doesn’t help either. You also need some background knowledge about the world’s different writing systems.

None of it is insurmountably hard as such – k8s is probably more complex – but it takes some time to grok and quite some effort to get right. Perhaps we should treat Unicode like cryptography: “don’t implement it yourself when it can be avoided”. I could add RTL support, but without actual knowledge of how an Arabic speaker uses a computer I’ll probably make some stupid mistake; for example, as I understand it you write from right-to-left in Arabic, except for numbers, which are written left-to-right.
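Part of what makes this tricky is that a mixed-direction string is stored in logical order (the order you read it), not visual order; the Unicode Bidirectional Algorithm (UAX #9) reorders runs only at display time. A minimal sketch of that storage fact – the actual reordering needs a full UAX #9 implementation (e.g. the unicode-bidi crate), which is exactly the “don’t implement it yourself” territory:

```rust
fn main() {
    // Latin text + an Arabic word (U+0645 U+0631 U+062D) + digits.
    // In memory the Arabic letters appear in reading order, even though
    // a bidi-aware renderer will draw that run right-to-left.
    let s = "abc \u{0645}\u{0631}\u{062D} 123";
    let logical: Vec<char> = s.chars().collect();

    // The first Arabic letter read is also the first one stored:
    assert_eq!(logical[4], '\u{0645}');
    assert_eq!(logical.len(), 11); // 3 Latin + space + 3 Arabic + space + 3 digits
    // Nothing in the string itself says where the RTL runs begin and end;
    // deciding that (and how the digits embed) is UAX #9's job.
}
```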

I haven’t even gotten to vertical text yet. I have no idea how to deal with that (yet).

I know “Amerocentric” technically applies to the entire continents of North and South America, but I couldn’t find a better word to capture the sense of “programmers who only consider en_US when designing and testing software.”

Anglocentric? The problem extends beyond just the United States (CA, UK, AU, NZ, many African countries), and many non-English programmers do a lot of their work in English and have similar biases – especially in Europe, where most scripts are covered by extended ASCII/ISO-8859/cp1252.

When Arabic numerals were imported into Europe, they were physically written left-to-right, but to this day every schoolchild does calculations from right to left (addition, long multiplication) because that’s just how Arabic numerals work: you start from the least significant digit so the carries propagate.

The story continues with computers, too: many computers were designed in European-culture countries that were comfortable with numbers working in the opposite direction from everything else, and so they used big-endian byte ordering. Some smaller, cheaper computers couldn’t justify the cost of making the machine follow the designers’ conventions, so they went with the simpler, more straightforward implementation and came up with little-endian byte ordering, taking Arabic numerals back to their roots.
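The two byte orderings are easy to see directly. A small sketch using Rust’s standard `to_be_bytes`/`to_le_bytes`:

```rust
fn main() {
    let n: u32 = 0x1234_5678;

    // Big-endian: most significant byte first, matching how we write
    // the number in Latin-script text.
    assert_eq!(n.to_be_bytes(), [0x12, 0x34, 0x56, 0x78]);

    // Little-endian: least significant byte first - the digit order
    // arithmetic actually proceeds in, and the in-memory layout on
    // x86 and most other common hardware today.
    assert_eq!(n.to_le_bytes(), [0x78, 0x56, 0x34, 0x12]);
}
```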

I quite like Eevee’s post on Unicode, not because it’s particularly technically detailed or actionable, but because it gives examples – uncontrived ones, even – of why, as people constantly say, you can’t actually correctly do whatever string operation you’re trying to do, and it’s got nothing to do with Unicode.