Tag Archives: Unicode

Posted on August 3, 2017 | A world of emoji misinformation

July 17 was World Emoji Day. Anyone can declare a World Anything Day, but my local library thought it was important enough to give it part of a sign, along with Cell Phone Courtesy Month. They didn’t think it was important enough to give accurate information, though. It does tell us something about how non-tech people think of emoji. Here’s the content of the sign, with commentary.

In 2001, the Unicode Consortium rejected a proposal to include the Klingon encoding. The reasons it gave were:

Lack of evidence of usage in published literature, lack of organized community interest in its standardization, no resolution of potential trademark and copyright issues, question about its status as a cipher rather than a script, and so on.

In Orwell’s 1984, the Newspeak language followed the principle that if you can abolish certain words, you can abolish the thoughts that go with them.

It was intended that when Newspeak had been adopted once and for all and Oldspeak forgotten, a heretical thought — that is, a thought diverging from the principles of Ingsoc — should be literally unthinkable, at least so far as thought is dependent on words. … This was done partly by the invention of new words, but chiefly by eliminating undesirable words and by stripping such words as remained of unorthodox meanings, and so far as possible of all secondary meanings whatever.

The Unicode Consortium has announced the release of Unicode 9.0. It adds character sets for some little-known languages, including Osage, Nepal Bhasa, Fulani, the Bravanese dialect of Swahili, the Warsh orthography for Arabic, and Tangut. It updates the collation specification and security recommendations.

Most Unicode implementations will require just font upgrades, but full support of some of the more unusual scripts will require attention to the migration notes.

“Asymmetric case mapping” sounds interesting. I believe this means that the conversion between upper case and lower case isn’t one-to-one and reversible. The notes give the example of “the asymmetric case mapping of Greek final sigma to capital sigma.” Lowercase sigma has two forms; it’s σ except at the end of a word, where it’s ς. Both turn into Σ in uppercase.
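Here's a quick way to see the asymmetry for yourself. This is just a small sketch using Python's built-in string methods; the variable names are mine, and other languages or libraries may handle final sigma differently.

```python
# Greek sigma: two lowercase forms, one uppercase form.
medial = "\u03c3"   # σ GREEK SMALL LETTER SIGMA
final = "\u03c2"    # ς GREEK SMALL LETTER FINAL SIGMA
capital = "\u03a3"  # Σ GREEK CAPITAL LETTER SIGMA

# Both lowercase forms map to the same capital letter.
upper_of_medial = medial.upper()
upper_of_final = final.upper()

# Going up and back down does not restore the final form:
# a lone capital sigma lowercases to the medial form.
round_trip = final.upper().lower()
is_reversible = round_trip == final

print(upper_of_medial == capital, upper_of_final == capital, is_reversible)
```

Because two different inputs produce the same output, no lowercase function can recover the original from the capital letter alone; it has to look at the surrounding word.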

What really has people excited about Unicode 9, if a Startpage search is any indication, isn’t any of these things, but that about 1% of the new characters are emoji and that Apple and Microsoft lobbied against one candidate emoji. I wonder if the Unicode Consortium regrets having gotten involved in that mess in the first place. There are no possible criteria except whims for what the set should include. There’s no limit on how many could be added. OK, having a universal set of encodings promotes information interchange, but the tail is wagging the 🐕.

By the way, what’s the plural of “emoji”? I use “emoji” as both singular and plural, but I’m seeing “emojis” with increasing frequency. It just looks wrong to me. Does anyone say “kanjis” or “romajis” for the other Japanese character sets? I had to argue with the editor to keep the title of my article “The War on Emoji” that way.

This post may be illegal in Indonesia. It includes the code point sequence U+1F468‍ U+200D U+2764️ U+FE0F U+200D U+1F48B‍ U+200D U+1F468, which renders as the emoji 👨‍❤️‍💋‍👨 or “man kissing man.” According to a Time article, the Indonesian Ministry of Communication and Informatics is “asking” Facebook to block the use of “gay” emoji. Failure to comply could mean the Negative Content Management Panel (George Orwell would have been impressed!) will block Facebook in Indonesia.
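If you want to see how that sequence is put together, here's a minimal sketch in Python that builds the string from the code points listed above and looks up each character's name. The variable names are my own; whether the result displays as a single glyph depends on your fonts and platform.

```python
import unicodedata

# The code points from the post, in order. U+200D is the zero width
# joiner that glues the pieces into one emoji; U+FE0F asks for emoji
# presentation of the heart.
code_points = [0x1F468, 0x200D, 0x2764, 0xFE0F,
               0x200D, 0x1F48B, 0x200D, 0x1F468]

kiss = "".join(chr(cp) for cp in code_points)

# One visible emoji, but eight code points underneath.
names = [unicodedata.name(ch) for ch in kiss]
utf8_length = len(kiss.encode("utf-8"))

for cp, name in zip(code_points, names):
    print(f"U+{cp:04X}  {name}")
print(len(kiss), "code points,", utf8_length, "bytes in UTF-8")
```

A filter that wanted to block this emoji would have to match the whole joined sequence, since every individual code point in it also appears in perfectly ordinary text.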

Unicode is a great thing, but sometimes its thoroughness poses problems. Different character sets often include characters that look exactly like common ASCII characters in most fonts, and these can be used to spoof domain names. Sometimes this is called a homograph attack or script spoofing. For instance, someone might register the domain gοοgle.com, which looks a lot like “google.com,” but actually uses the Greek letter omicron instead of the Roman letter o. (Search this page in your browser for “google” if you don’t believe me.) Such tricks could lure unwary users into a phishing site. A real-life example, which didn’t even require more than ASCII, was a site called paypaI.com (that’s a capital I instead of a lower-case L, and they look the same in some fonts). That was way back in 2000.
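A rough way to catch the omicron trick is to flag any domain label that mixes letters from more than one script. The sketch below is only an approximation: Python's standard library has no script-property lookup, so I'm assuming the first word of each character's Unicode name is a good enough stand-in, and the function names are hypothetical. Real browsers use considerably more elaborate rules.

```python
import unicodedata

def scripts_used(label):
    """Return rough script names for the letters in one domain label.

    Assumption: the first word of the Unicode character name
    (LATIN, GREEK, CYRILLIC, ...) approximates the script.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed_script(domain):
    """True if any dot-separated label mixes letters from several scripts."""
    return any(len(scripts_used(label)) > 1 for label in domain.split("."))

real = "google.com"
fake = "g\u03bf\u03bfgle.com"   # two Greek omicrons in place of Latin o
ascii_fake = "paypaI.com"       # capital I standing in for lower-case l

print(looks_mixed_script(real))        # all Latin
print(looks_mixed_script(fake))        # Latin plus Greek
print(looks_mixed_script(ascii_fake))  # all Latin, so not caught
```

Note that the paypaI.com case slips through, since every character in it is plain Latin. A script check only helps with the cross-script kind of spoofing.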

Oh, sorry, we’re not talking about that kind of character. We’re talking about characters like the Hungarian double-acute u (ű), the four-leaf clover emoji (🍀), or the Katakana “ka” (カ). The Unicode Consortium is looking for people to “adopt” their favorite characters with a tax-deductible donation. Each character can have one Gold ($5000) sponsor, five Silver ($1000) sponsors, and any number of Bronze ($100) sponsors. As I read the rules, only recognized Unicode characters are eligible, so you probably can’t support Klingon characters.

Encoding all the characters of all the world’s languages is an endless task. Unicode 8.0 improves the treatment of Cherokee, Tai Lue, Devanagari, and more. For a lot of people, the most interesting part will be the implementation of “diverse” emoji in a variety of colors. A Unicode Consortium report explains:

People all over the world want to have emoji that reflect more human diversity, especially for skin tone. The Unicode emoji characters for people and body parts are meant to be generic, yet following the precedents set by the original Japanese carrier images, they are often shown with a light skin tone instead of a more generic (nonhuman) appearance, such as a yellow/orange color or a silhouette.

Five symbol modifier characters that provide for a range of skin tones for human emoji are planned for Unicode Version 8.0 (scheduled for mid-2015). These characters are based on the six tones of the Fitzpatrick scale, a recognized standard for dermatology (there are many examples of this scale online, such as FitzpatrickSkinType.pdf). The exact shades may vary between implementations.

… When a human emoji is not immediately followed by a emoji modifier character, it should use a generic, non-realistic skin tone.
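In practice the modifier is just a second code point placed right after the base emoji. Here's a small sketch of that in Python; the dictionary keys and the helper name are my own labels, not anything from the report. The five modifiers occupy U+1F3FB through U+1F3FF, and there are five of them for six Fitzpatrick types because types 1 and 2 share one modifier.

```python
# The five emoji modifiers, U+1F3FB..U+1F3FF.
# Keys are hypothetical labels chosen for this example.
FITZPATRICK_MODIFIERS = {
    "type-1-2": "\U0001F3FB",
    "type-3": "\U0001F3FC",
    "type-4": "\U0001F3FD",
    "type-5": "\U0001F3FE",
    "type-6": "\U0001F3FF",
}

WAVING_HAND = "\U0001F44B"

def with_skin_tone(base, tone=None):
    """Append a skin tone modifier to a base emoji.

    With no tone given, the base is returned unchanged, which should
    display with the generic, non-realistic color.
    """
    if tone is None:
        return base
    return base + FITZPATRICK_MODIFIERS[tone]

generic = with_skin_tone(WAVING_HAND)
medium = with_skin_tone(WAVING_HAND, "type-4")

print(generic, medium)
print([f"U+{ord(ch):04X}" for ch in medium])
```

On a system without support for the modifiers, the second string will probably show as the hand followed by a separate color swatch.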

I’ve updated the UTF-8 module in the JHOVE source on GitHub to include the new code blocks for Unicode 7.0.0. Also, I’ve recently fixed the pom.xml file so it will put both the command line and the GUI JAR files into the local repository.
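For anyone wondering what "adding code blocks" amounts to, the idea is a table of code point ranges that gets searched for each character. The following is not JHOVE's code (JHOVE is written in Java); it's a minimal Python sketch of the general technique, with only three of the Unicode 7.0 blocks filled in and names I made up for the example.

```python
import bisect

# (first code point, last code point, block name).
# A tiny sample of blocks added in Unicode 7.0; a real table
# would list every block in the standard.
BLOCKS = sorted([
    (0x10350, 0x1037F, "Old Permic"),
    (0x10600, 0x1077F, "Linear A"),
    (0x16AD0, 0x16AFF, "Bassa Vah"),
])
BLOCK_STARTS = [start for start, _end, _name in BLOCKS]

def block_of(code_point):
    """Return the block name for a code point, or None if not in the table."""
    # Find the last block that starts at or before the code point.
    index = bisect.bisect_right(BLOCK_STARTS, code_point) - 1
    if index >= 0:
        _start, end, name = BLOCKS[index]
        if code_point <= end:
            return name
    return None

print(block_of(0x10600))  # first code point of Linear A
print(block_of(0x41))     # ASCII "A" is not in this sample table
```

Supporting a new Unicode version then mostly means adding rows to the table.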

I need more input before I’m comfortable with creating a release 1.12 of JHOVE. I don’t have any prior experience with creating a public, open-source project that’s built with Maven, and I don’t know how much of the baggage of the SourceForge project really needs to be kept. There are some specialty JARs in the old project, but I don’t know if anyone uses them. Most importantly, there still needs to be a distribution in Zip and Tar formats. New features would be interesting, but the first thing is to make a JHOVE that is as useful as it was before.

Unicode 7.0.0 has been released, with 2,834 new character codes. It’s been fascinating looking into some of the blocks that have been added; here’s a sampling.

Bassa Vah is a really obscure script from what is now Liberia, possibly predating the country. Old Permic is supposed to be a close relative of Cyrillic, but any visual resemblance is lost on me.

Some of the writing systems came from a religious impulse. Mende Kikakui was devised by an Islamic scholar and was once widely used for the Mende language in Africa. It’s been mostly displaced by the Latin alphabet. Shong Lue Yang introduced the Pahawh Hmong writing system for the Hmong language in southeast Asia, claiming to have received it from God. Pau Cin Hau, named after its creator, was a 20th century system used for religious writings in Burma. Its original version had over a thousand characters, but the Unicode block is based on the 57-character alphabetic system. The Manichaean alphabet is fascinating just because of its name, recalling the conflicts in early Christianity. According to tradition, Mani, the founder of Manichaeanism, created the alphabet.

Finally, one of the oldest writing systems in the world, Linear A, is new in Unicode 7. It’s from ancient Crete, and no one knows how to read its texts. Now you can create computer documents in it, if you’re a scholar of old languages or just like confusing people.