UTF-8 with HTML5

I didn't find anything about this elsewhere, so I thought I'd post it here.

When encoding an HTML5 (possibly earlier versions, as well) document as UTF-8, ensure that it does not include a byte order mark. (In Notepad++, this is the difference on the "Encoding" menu between "Encode in UTF-8" and "Encode in UTF-8 without BOM.")
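If you've already got a file with a BOM, it's easy to strip programmatically. Here's a minimal Python sketch (EF BB BF is the UTF-8 BOM; the function name is just my own):

```python
UTF8_BOM = b"\xef\xbb\xbf"

def strip_bom(path):
    """Remove a leading UTF-8 BOM from the file at `path`, if present."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(UTF8_BOM):
        with open(path, "wb") as f:
            f.write(data[len(UTF8_BOM):])
        return True   # BOM found and removed
    return False      # file was already BOM-free
```

Run it over your HTML files before validating and the BOM problem goes away without touching the editor settings.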

For some reason, BOMs upset the W3C validator and web browsers (this may be server-specific); for example, the validator would not run on a document containing the raw byte \xA9 (the copyright sign in Latin-1, which is not a valid standalone byte in UTF-8) together with a BOM, as it had trouble mapping that byte to Unicode properly.
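For what it's worth, a lone \xA9 byte genuinely isn't valid UTF-8: in UTF-8 the copyright sign is the two-byte sequence C2 A9, while the single byte A9 only means © in Latin-1. A quick Python check:

```python
print("©".encode("utf-8"))        # b'\xc2\xa9' - two bytes in UTF-8
print(b"\xa9".decode("latin-1"))  # © - one byte in Latin-1

try:
    b"\xa9".decode("utf-8")
except UnicodeDecodeError as e:
    print("lone \\xa9 is not valid UTF-8:", e.reason)
```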

Also remember to include the <meta charset="UTF-8"> tag, which makes the validator happy.

As far as I can tell, at least in files that will be interpreted by the browser as HTML, CSS, XML, or JavaScript (possibly others), the BOM is an outdated, unnecessary prefix that tells the file interpreter that what follows is in UTF-8 or whatever other encoding (there are other BOMs for other encodings).

Using the BOM in those situations can cause problems and is, as far as I know, never required or desired, at least not with UTF-8. If using UTF-16, a different BOM for that encoding might be required; I'm not sure.

There may be other situations - say, when the file is going to be read by something other than a browser - where a BOM is required even in UTF-8, but I'm not aware of any. Since I deal mostly with browsers and what they need and do, that's not saying a lot about other applications.

I'm pretty sure NotePad++ gives you that option in case you need the BOM for some reason.

And you're right, you should never use it for HTML files.

Something else to be aware of in this regard is how other editors deal with it. Some just slap on the BOM, or do so by default unless configured otherwise. So, when troubleshooting others' work, always keep it in the back of your mind that an undesired BOM may be present.

If you view the file in an editor in ISO-8859-1 (windows-1252) encoding, you can see and delete the BOM. It will look like so (enlarged and indented here for easier recognition):

ï»¿

It will almost always be the very first thing in the file, and often shows up in some browsers if they're served such a page in ISO-8859-1 (windows-1252) or another single-byte encoding.
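You can see where those three characters come from: the UTF-8 BOM is the byte sequence EF BB BF, and decoding those same bytes as Latin-1 yields exactly ï»¿. In Python:

```python
import codecs

print(codecs.BOM_UTF8)                    # b'\xef\xbb\xbf'
print(codecs.BOM_UTF8.decode("latin-1"))  # ï»¿ - what a Latin-1 editor shows
```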

The W3C does currently recommend the use of a BOM with UTF-16, but only for HTML5 (not for prior versions). I don't pretend to understand why.

Off the top of my head, I think that the BOM might have something to do with East Asian texts...?

I thought it would be good to note that the lack of a BOM matters for web publishing. When I was looking for UTF-8 in the encodings list, I just picked the entry that said "UTF-8" without taking note of the other ones, which seemed irrelevant once I had found it.

Edit: Given my last point, it might be more useful for applications such as Notepad++ to label the options "UTF-8" and "UTF-8 with BOM", or "UTF-8 without BOM" and "UTF-8 with BOM", instead of "UTF-8 without BOM" and "UTF-8". Of course, knowing nothing about Unicode outside of web publishing, this could have disastrous effects on files being used in other fields.

Update: I created this for myself and thought I should post it here, in case anyone else keeps making the same mistake of clicking "UTF-8" when they want BOM-free UTF-8 encoding. It is a localization file for Notepad++ (tested in Notepad++ 6.1.3) which uses the first "more useful" labeling from the edit above. It should be placed in the localization directory of your Notepad++ installation folder, and can be used by choosing "English (customizable)" in the "Localization" section of the Preferences dialog.

Neat. UTF-16 is for Asian languages and can also be used for Arabic, Cyrillic, Hebrew, and probably others. In many cases though, depending upon the dialect/character set, UTF-8 is sufficient and should therefore be employed.

When UTF-16 is required, it comes in two flavors, little-endian and big-endian: they store the two bytes of each 16-bit unit in opposite orders. The BOMs differ accordingly: FE FF signals big-endian, FF FE little-endian.
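A quick sketch of the difference in Python (the utf-16-be/utf-16-le codecs are the BOM-less big- and little-endian variants):

```python
import codecs

# The same character, with its two bytes in opposite orders:
print("A".encode("utf-16-be"))  # b'\x00A' - big-endian: high byte first
print("A".encode("utf-16-le"))  # b'A\x00' - little-endian: low byte first

# The plain "utf-16" codec prepends a BOM so decoders can tell which
# order was used; the BOM itself differs between the two.
print(codecs.BOM_UTF16_BE.hex())  # 'feff'
print(codecs.BOM_UTF16_LE.hex())  # 'fffe'
```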

The requirement for the BOM in HTML 5 for UTF-16 might just be wishful thinking on the part of the standards people. In my experience the standards generally fall into one of three categories:

What both works and is commonly accepted.

What 'should' be the standard but most browsers are just fine without and it doesn't hurt anything.

What most browsers do, so is being incorporated into the standard, even if there's an odd browser that doesn't follow it.

UTF-16 is for Asian languages and can also be used for Arabic, Cyrillic, Hebrew, and probably others.

UTF-8 can do Japanese, Chinese, Korean, plus others like Thai and Vietnamese.
There *might* be some limits to the extent of some of the characters, such as the many thousands of Chinese characters, but in general, UTF-8 is sufficient.

As far as I can tell, UTF-16 doesn't really expand UTF-8; both encode the same Unicode character set, and both can represent every character in it. The difference is the code unit: UTF-16 uses 16-bit units (with surrogate pairs for characters above U+FFFF), while UTF-8 uses sequences of 8-bit bytes.

In looking up more info (which didn't prove very helpful), there's also UTF-32. As needed, I guess...
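To see that the three encodings cover the same characters and differ only in how many bytes each takes, here's a small Python comparison (using the BOM-less codec variants so the counts are pure character data; the sample characters are arbitrary):

```python
for ch in ["a", "é", "中", "\U0002FA1D"]:
    sizes = {enc: len(ch.encode(enc))
             for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(f"U+{ord(ch):04X}: {sizes}")
# utf-32 is always 4 bytes per character; utf-16 is 2 or 4; utf-8 is 1 to 4.
```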

UTF-16 is for Asian languages and can also be used for Arabic, Cyrillic, Hebrew, and probably others.

UTF-8 can do Japanese, Chinese, Korean, plus others like Thai and Vietnamese.
There *might* be some limits to the extent of some of the characters, such as the many thousands of Chinese characters, but in general, UTF-8 is sufficient.

That's what I said more or less:

Originally Posted by jscheuer1

UTF-16 is for Asian languages and can also be used for Arabic, Cyrillic, Hebrew, and probably others. In many cases though, depending upon the dialect/character set, UTF-8 is sufficient and should therefore be employed.

And there's no *might* about it. I have worked in this forum with certain Chinese and Hebrew dialects or versions that do require more character space. There probably are other languages that fall into this category - basically those that cannot be truly rendered with only 8 bits per character.

I haven't encountered any human language that required UTF-32, but there certainly could be some.

It's my understanding that this isn't necessarily about the number of characters, though that could perhaps be a factor. In my limited experience it is required when an individual character needs more bytes in order to be represented. That's what I understand character space to mean - space for a given character.

So if 8 bits aren't enough for a specific character, you need UTF-16; and if you need more than 16 bits for a character, you would need a higher encoding.

UTF-8 can do Japanese, Chinese, Korean, plus others like Thai and Vietnamese.
There *might* be some limits to the extent of some of the characters, such as the many thousands of Chinese characters, but in general, UTF-8 is sufficient.

From my experience (I haven't taken time to formally study character encodings or anything), modern editors set to UTF-8 will express characters above U+7F (the last code point that fits in a single byte) as two 8-bit bytes (up to U+07FF), three (up to U+FFFF), or four (up to U+10FFFF).

As to how the file makes it clear that a character is two or three bytes, as opposed to one: as I understand it, the lead byte's high bits encode the sequence length (110xxxxx for two bytes, 1110xxxx for three, 11110xxx for four), and every continuation byte begins with 10.

Note: U+FF is ÿ, which is at the end of the Latin-1 Supplement block, so any characters past there (i.e., most of them) require more than eight bits to express; thus, given that encoding a character like ł (U+0142) in UTF-8 hasn't caused problems for me in the past, I can conclude that UTF-8 editors do somehow manage to express characters beyond eight bits.
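The multi-byte mechanism is visible if you print the bits, assuming a Python environment (the sample characters are arbitrary):

```python
for ch in ["y", "ł", "€", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {bits}")
# Lead bytes begin 0... (1 byte), 110... (2 bytes), 1110... (3), 11110... (4);
# continuation bytes all begin with 10.
```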

From this, I conclude that UTF-16 and above are mostly pointless for web use (they would be less efficient than UTF-8 in most situations), unless there were some upper limit to which characters can be expressed in UTF-8.

An upper limit seems not to apply, however; Notepad++ was able to convert U+2FA1D (a Chinese character, reportedly pronounced pián and meaning "tooth painting") as well as U+E0039 to ANSI (shown as ó*€¹ð¯¨) and back without problem. The file with just these two characters occupied eight 8-bit bytes, or 64 bits - four bytes per character. These characters' decimal values (195101 and 917561) require 18 and 20 bits to express, respectively, which would entail using UTF-32 to express each as a single 32-bit unit. Hence, UTF-8 proved more efficient, even toward the upper limit of non-PUA characters.

The highest character encoded in Unicode 6.2 is the private-use character U+10FFFD, which was expressed as four eight-bit bytes (32 bits) and converted to ANSI (ô¿½) and back without issue.
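That round trip is easy to reproduce; a minimal sketch in Python:

```python
ch = "\U0010FFFD"                     # highest private-use code point
encoded = ch.encode("utf-8")
print(len(encoded), encoded.hex())    # 4 f48fbfbd - four bytes, 32 bits
print(encoded.decode("utf-8") == ch)  # True - lossless round trip
```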

And there's no *might* about it. I have worked in this forum with certain Chinese and Hebrew dialects or versions that do require more character space. There probably are other languages that fall into this category - basically those that cannot be truly rendered with only 8 bits per character.

I've never had any trouble with it, and I've used most of the languages you mention. Specific diacritics might be missing, but in general everything should be available. I can certainly imagine that many Chinese characters are missing (more obscure ones), but Japanese has no trouble, and all of the other languages you mentioned (e.g., Arabic, Russian, etc.) have no trouble at all in UTF-8.
I'm not doubting UTF-16 has some applications, but I haven't run into them. I'll look into this. It's relevant for me as a linguist.

It's the difference between a generic or limited charset and a fuller one, more true to the actual written language.

You said it yourself:

Specific diacritics might be missing, but in general everything should be available. I can certainly imagine that many Chinese characters are missing (more obscure ones)

I think in some cases, or to any purist in a specific language, it would seem, or actually be, much more serious than that; even that much alone would seem very serious to them.

Sort of an analogy in English might be if you suddenly couldn't print soft c's and silent p's. Those who know the language well would know what you meant and ignore it. Especially if they're aware of the printing limitations you're laboring under. But if not, some of them might think you're stupid or uncultured. Others, less familiar with the language or the printing limitations might be left scratching their heads as to what you really meant.

After looking back at what techno_race added, which I missed when first responding: some characters might be expressible in UTF-8 but, as a matter of practical usage, are more easily rendered in UTF-16. That would depend upon the charset and/or font. Or maybe they really do need UTF-16.

I stand by what I said though; I've worked with folks on two occasions in these forums where UTF-16 was required to render their pages correctly. One was in Chinese, the other in Hebrew. And yes djr33, at least in Hebrew these were characters with special marks, otherwise 'ordinary' Hebrew characters that take on a special meaning with an added mark. In Chinese though, I think the characters were simply more complex than UTF-8 could support. My impression being that there are fewer limits on characters in Chinese, in that Chinese characters are more analogous to other languages' words than to their letters.

I think in some cases, or to any purist in a specific language, it would seem, or actually be, much more serious than that; even that much alone would seem very serious to them.

So let's take the example of a Chinese newspaper. Would they really require UTF-16 for daily usage? I can't answer that. There are a much smaller set of daily characters than all 50,000+ technically in the language, in the most comprehensive dictionaries. Most readers/writers don't know all of them. (Just like most speakers of English don't know the exact spelling for every single word in the language.)

Hebrew certainly works in UTF-8. But if you need to add special diacritics, that's fine-- that's what I'd imagine UTF-16 is for. Exactly that-- adding extra information.

Basically the metaphor I'd choose for this is having accented characters unavailable for English. So you might need an extended character set for é in fiancée if you choose to write it that way. (Of course this is actually a great parallel example from ASCII to UTF-8 and from UTF-8 to UTF-16.)