Character encoding, entity references and UTF-8

A short introduction

encyclo

1:41 am on Oct 11, 2005 (gmt 0)

There have been a lot of questions in the forums recently, all of which touch on a very important but often misunderstood part of building websites - character encoding, or how a document stores and displays different characters on a page.

The basics of character encoding - US-ASCII

In the beginning there was binary - all information is stored as a series of ones and zeros, or "on" and "off" - the heart of computing and electronics. In order to display alphanumeric characters, a standard was created which defined which binary sequence represented which character. This was the American Standard Code for Information Interchange, or ASCII. There were a few variants, the most well-known by far being US-ASCII, still in widespread use today.

With ASCII, each character is represented by a single-octet sequence. One byte, one letter. The biggest weakness with US-ASCII is that it only includes characters used in English, excluding any accented letters or regional variations such as the German sharp S (ß).
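The one-byte-one-letter rule is easy to see in practice. Here is a quick sketch (Python used purely for illustration):

```python
# Each ASCII character occupies exactly one byte: one byte, one letter.
text = "Hello"
encoded = text.encode("ascii")
print(len(encoded))   # 5 characters -> 5 bytes
print(encoded[0])     # 72, the code for "H"

# Characters outside the basic range simply cannot be encoded:
try:
    "é".encode("ascii")
except UnicodeEncodeError:
    print("not representable in US-ASCII")
```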

Stage two - the ISO standards

To fulfil the demands of users who required more than the basic a-z / A-Z sequence, extensions to ASCII were developed and approved by the ISO. The best known are the ISO-8859 series, which used the same sequences as ASCII but added extra characters for accented letters and regional variations. ISO-8859-1 is for most western European languages such as English, French, Italian...

ISO-8859-1 versus windows-1252

ISO-8859-1 became the standard encoding for most Unix and Unix-like systems. However, when Microsoft developed Windows, it used a slight variation on ISO-8859-1 commonly known as windows-1252. The differences between the two boil down to 27 characters (including the Euro symbol, certain angled quote marks, the ellipsis and the conjoined oe or "oe ligature") which windows-1252 uses in the place of 27 control characters in ISO-8859-1. Within Windows, ISO-8859-1 is silently replaced by windows-1252, which often means that copy/pasting content from, say, a Word document leaves the web page with validation errors. Many web authors incorrectly assume that the fault is with the characters themselves and that using entity references is the only way to get accented characters. In fact, if you are using a western European language and ISO-8859-1, most accented characters such as é è û î etc. can be used without resorting to entities such as

&eacute;

or similar. (ISO-8859-1 does not include an oe ligature for a very bizarre reason, but that's another story! You must therefore use

&oelig;

instead.)
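You can see those disputed characters directly: the same bytes decode to printable characters under windows-1252 but to invisible control characters under ISO-8859-1. A quick Python illustration:

```python
raw = bytes([0x80, 0x85, 0x9C])  # bytes in the disputed 0x80-0x9F range

# windows-1252 maps them to printable characters...
print(raw.decode("windows-1252"))  # Euro, ellipsis, oe ligature

# ...while ISO-8859-1 treats the very same bytes as control characters.
decoded = raw.decode("iso-8859-1")
print([hex(ord(c)) for c in decoded])  # ['0x80', '0x85', '0x9c']
```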

Character encoding on the web - HTML entity references

In order to get around character encoding problems on the web, a method was introduced to "encode" non-ASCII characters in HTML without having to change charsets away from the widely-supported US-ASCII. Accented characters such as é (e acute) can be encoded as

&eacute;

and the user agent would "translate" that into the appropriate character. These entity references or character entities are defined within the HTML document type definition (DTD) - in HTML 4.0, for example, there are over two hundred different entity references defined.

There are several weaknesses with the entity references approach. Firstly, they are excessively verbose - in ISO-8859-1 an e acute takes up one byte of space, whereas the entity reference takes up 8 bytes. The second problem is that they are only useful in the context of a parsed HTML document - read the source code as plain text and the result can end up verging on gibberish, especially if you are using a language which relies heavily on accents, such as Polish. Even in French, if you want to write the phrase à côté it ends up as &agrave; c&ocirc;t&eacute;.
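That verbosity is easy to measure. The sketch below (the to_named_entities helper is just an illustrative name, built on Python's standard entity tables) reproduces the à côté example:

```python
from html.entities import codepoint2name

def to_named_entities(s):
    # Replace any non-ASCII character with its HTML named entity, if one exists.
    return "".join(
        f"&{codepoint2name[ord(c)]};"
        if ord(c) > 127 and ord(c) in codepoint2name else c
        for c in s
    )

print(to_named_entities("à côté"))    # &agrave; c&ocirc;t&eacute;
print(len("é".encode("iso-8859-1")))  # 1 byte as raw ISO-8859-1...
print(len("&eacute;"))                # ...8 bytes as an entity reference
```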

HTML entities are a tag soup solution to a tag soup problem, and this is seen clearest with the third problem with entity references - XML.

HTML entity references, RSS and XML

The entity references "solution" falls down once you start working with XML. Unlike HTML 4.0 or XHTML 1.0, which have DTDs which define the entity references, most XML does not have a doctype declaration, so none of those entity references are valid. What's worse, as XML doesn't share HTML's liberal error-handling, using undefined entities will break the document.

XML actually has only five defined entity references, the bare minimum required for functionality. They are:

&amp; &apos; &quot; &lt;

and

&gt;

. There are various hacks and methods to add extra entity references to your XML, but the only real solution is to avoid their use entirely.
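A quick demonstration using the Python standard library's XML parser (any conforming XML parser behaves the same way):

```python
import xml.etree.ElementTree as ET

# The five predefined XML entities parse fine...
ok = ET.fromstring("<item>&amp; &lt; &gt; &quot; &apos;</item>")
print(ok.text)  # & < > " '

# ...but an HTML-only entity such as &eacute; breaks the parse outright.
try:
    ET.fromstring("<item>caf&eacute;</item>")
except ET.ParseError:
    print("undefined entity: parse error")
```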

The most popular use of XML on the web at the moment is RSS and syndication. RSS is an XML format, so if you are using entity references, for example held in a database, then you will have difficulties producing a valid RSS feed. What's more, encoding directly in, say, ISO-8859-1 doesn't completely solve your problem as you are limited in the characters you can use. Want to add a copyright notice in your feed? In HTML you can use

, but in RSS you just get a parsing error, and ISO-8859-1 does not offer an alternative.

One encoding for every language - Unicode and UTF-8

In order to overcome the hodge-podge of incomplete, conflicting and aging standards (the ISO-8859 series date from the early 1980s), the notion of Unicode was developed. The differing versions of the ISO-10646 standard (Unicode has been approved by the ISO) are beyond the scope of this very brief introduction, but the important difference with Unicode is that it offers one single character encoding for all of the world's languages. The second difference is that it is a multi-byte implementation rather than a simple one-byte-per-character representation.

By far the most important Unicode version on the web is UTF-8. This standard has numerous advantages, the most important of which is that it remains compatible with the much earlier US-ASCII standard. In fact, all of the single-byte ASCII characters are represented in exactly the same way in UTF-8. Only extended characters are different, made from multi-byte sequences defined for each character, whether an e acute, an oe ligature, or characters from Arabic, Russian, Urdu or Japanese.
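That back-compatibility is easy to verify - a quick Python sketch:

```python
# Plain ASCII text is byte-for-byte identical in US-ASCII and UTF-8...
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True

# ...while extended characters become multi-byte sequences.
print(len("é".encode("utf-8")))   # 2 bytes for e acute
print(len("œ".encode("utf-8")))   # 2 bytes for the oe ligature
print(len("日".encode("utf-8")))  # 3 bytes for a Japanese character
```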

UTF-8 is especially important for XML as it is the default encoding for all XML documents. And as you can't use HTML entity references and earlier ISO-8859 standards are incomplete, UTF-8 is the only logical choice when dealing with XML formats such as RSS or Atom which, even if you are only using English, are more than likely to eventually need more than the basic ASCII charset can offer.

UTF-8 is incredibly useful in HTML/XHTML too - no more entity references, the possibility to use extended characters such as curly quotes or long dashes, the possibility of using one charset across a multi-lingual site.

The downsides to UTF-8

There remain a few hurdles to UTF-8 acceptance, most of which can be minimised or overcome.

- Browser support is excellent, with IE 5.x and up supporting UTF-8 fully, as do Mozilla/Firefox, Opera, Safari, Konqueror, etc. However earlier browsers such as IE4 and NN4 have problems, and IE3/NN3 and earlier lack support. Bear in mind that documents using markup older than HTML 4.0 cannot use UTF-8.

- The scripting language PHP (and some others) can have problems with multi-byte strings. See an excellent earlier WebmasterWorld thread by ergophobe: UTF-8, ISO-8859-1, PHP and XHTML [webmasterworld.com]. However if you check out how beautifully the PHP-driven WordPress handles UTF-8 content, it is clear that UTF-8 and PHP can successfully mix.

- Just because you can add content in, say, traditional Chinese to your site doesn't mean that the end-user has an appropriate font to display it - you still need to test and ensure compatibility when it comes to defining font families and such for your target audience.

How to implement UTF-8 on your site

If your site's language is English, simply swapping your ISO-8859-1 meta tags to UTF-8 gives the impression that you have succeeded. However, there is a little more to it than that. You still need to ensure that any non-ASCII content is correctly encoded. Users of other languages will almost certainly need to convert their files to UTF-8.

Most modern text and WYSIWYG editors handle UTF-8 perfectly - in most cases, it is simply a case of going to "Save As" and choosing "UTF-8" or "Unicode" from the options. From then on, you can tidy up any entity references and start using the true characters. One useful tip: you can copy/paste from a word-processing program such as Word, which automagically replaces, for example, straight quotes with the appropriate "curly" opening and closing quotes.

If you are using a Linux or similar Unix-like server or desktop, you can use

iconv

to batch-convert many files at once.
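A typical invocation is along the lines of iconv -f ISO-8859-1 -t UTF-8 old.html > new.html. If you prefer scripting the conversion, here is a minimal sketch in Python (convert_to_utf8 is a hypothetical helper; it assumes every matching file really is in the source encoding, so back up first):

```python
from pathlib import Path

def convert_to_utf8(directory, source_encoding="iso-8859-1", pattern="*.html"):
    """Re-encode every matching file in place, the way you might
    batch-convert a directory with iconv on a Unix system."""
    for path in Path(directory).glob(pattern):
        text = path.read_text(encoding=source_encoding)
        path.write_text(text, encoding="utf-8")
```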

Conclusion

If you are serious about standards, character encoding matters - even if you are just producing content in English. UTF-8 offers huge advantages, and you have everything to gain by moving to UTF-8 for new content.

Further reading

If you want a better or more detailed introduction to Unicode and character encoding in general, try some of these links:

SEOtop10

3:13 am on Oct 11, 2005 (gmt 0)

Very nice explanation. I have been fiddling with this as I suggest all my clients go for validation of their source files for greater spider and cross-browser compatibility, but was stuck on how to eliminate the issues.

I am better equipped now - thanks Encyclo.

asp4bunnies

3:56 am on Oct 11, 2005 (gmt 0)

Wonderful, wonderful post. I've been struggling with creating a site that works in all languages and recently discovered the awesomeness that is UTF-8 for myself.

You had a good point about problems with dynamic languages. I've noticed that MSSQL will convert UTF-8 characters to ASCII when executed as part of a vanilla SQL statement (i.e. written in ASP). The only workaround is to avoid vanilla SQL statements and instead use stored procedures, passing parameters that way (better coding practice anyway).

moishier

4:58 am on Oct 11, 2005 (gmt 0)

If only the email protocol and email clients would support plain-text UTF-8 (unicode)...

larryhatch

5:05 am on Oct 11, 2005 (gmt 0)

I've been using ISO-8859-1 for years now. Why? Because of French letters like the c with the snoodle at bottom in francaise. I was able to cut and paste them right into my HTML text without a problem. I tried UTF-8 and it worked fine there too, BUT then pages would not validate at W3C! I switched back to ISO-8859-1 and everything validated.

Is there a workaround, or am I stuck with something like &c-snoodle in my source code? For now I am declaring W3C//DTD HTML 4.01 Transitional - Larry

Fischerlaender

7:42 am on Oct 11, 2005 (gmt 0)

Character encoding - one of my favourite topics recently.

While developing our German-focused search engine, we came across a lot of pages with wrong encoding. 'Wrong' means that there indeed was an encoding stated in a meta tag, but the pages were not encoded that way. We even came across one site which sent an encoding HTTP header and additionally an encoding meta tag - with different encodings!

I think one reason for this is that IE does a lot of guessing internally what the correct encoding may be - and thus IE shows pages with incorrect encodings in a correct way. Good for the users, but very bad for developers.

Mr Bo Jangles

7:52 am on Oct 11, 2005 (gmt 0)

very welcome post thanks 'encyclo'.

AlexK

In simple terms, the browser sends a Request header listing the charset-encodings that it can handle (Accept-Charset). If utf-8 is one of these charsets, then it is safe to send a utf-8-encoded response. If the program and the browser cannot agree on an acceptable charset, a 406 Not Acceptable should be sent.
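In code terms, the check might look something like this sketch (can_send_utf8 is a hypothetical helper, shown in Python for illustration; a real implementation would also honour q=0 rejections properly):

```python
def can_send_utf8(accept_charset_header):
    """Crude check of a browser's Accept-Charset request header:
    is utf-8 (or the wildcard) among the listed charsets?"""
    accepted = [part.split(";")[0].strip().lower()
                for part in accept_charset_header.split(",")]
    return "utf-8" in accepted or "*" in accepted

print(can_send_utf8("ISO-8859-1,utf-8;q=0.7,*;q=0.7"))  # True
print(can_send_utf8("iso-8859-1"))                      # False
```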

surfgatinho

10:51 am on Oct 11, 2005 (gmt 0)

I run a site that uses 2 databases, one I believe is in UTF-8 and the other ISO. Does anybody know of a tool to convert an entire (MySQL) database, or any way of handling this in PHP?

Another quick question - why am I getting words like KitzbÃ¼hel? This is a Germanic alphabet word. Thanks

adb64

10:58 am on Oct 11, 2005 (gmt 0)

Surfgatinho,

Answer to your quick question: This is probably due to the fact that your page is encoded as UTF-8 but your meta-tag states something different, probably iso-8859-1.

Arjan

surfgatinho

11:16 am on Oct 11, 2005 (gmt 0)

Arjan,

Thanks, you were right, but fixing this hasn't fixed the problem! I think it's probably because the db mixes UTF-8 and ISO.

oddsod

1:05 pm on Oct 11, 2005 (gmt 0)

Flagged! Thanks, encyclo.

dcrombie

2:11 pm on Oct 11, 2005 (gmt 0)

Great article! I've been waiting for a decent explanation of all the different character codes.

I had AddDefaultCharset on in httpd.conf but now have commented that out and added UTF-8 meta tags instead. The problem is that - after restarting apache - something is still sending an HTTP-Header with iso-8859-1?!?!?

According to the W3C validator:

The character encoding specified in the HTTP header (iso-8859-1) is different from the value in the <meta> element (utf-8). I will use the value from the HTTP header (iso-8859-1) for this validation.

Thus, as just one example, USER32.DLL contains two "FindWindow" functions:

FindWindowA (8-bit ANSI character strings)

FindWindowW (16-bit Unicode strings)

Files are stored on disk as 8-bit, but (transparently) converted to 16-bit for the Windows API (and back again) as necessary. This is all bad enough, but the Windows 3.1 API involves "Windows character-sets" and "OEM character-sets" and refers back to the MS-DOS code pages that the whole affair came from in the first place. I will not go any further with that, since it all needs a very strong stomach indeed, and I only half-understand a third of it myself.

The issues for Webmasters to get hold of are that:

Windows western charsets are 8-bit (256 chars).

Only the first 128 chars are standard, and directly map to the equivalent ISO-8859-x chars.

Only some of the second (so-called hi-bit) 128 chars in the various Windows code-pages will directly map to any of the ISO-8859-x charsets.

If the document does not contain a Content-Type charset-encoding Response header, then Windows has every right to encode it with the user's local Windows code-page, and to POST with the same code-page encoding.

encyclo

4:03 pm on Oct 11, 2005 (gmt 0)

larryhatch:

Is there a workaround, or am I stuck with something like &c-snoodle in my source code?

I build sites in French myself - you can use the

&ccedil;

entity reference for the ç but you are still faced with the disadvantages listed above. The problem you are experiencing is that you are changing the defined character encoding without actually converting the file to UTF-8. The exact method depends on your editor - most modern editors such as Dreamweaver and many text editors will correctly save your document as UTF-8 if you specify it. If you are on Linux or have a Linux/Unix server, you can use the iconv utility to convert the file.

surfgatinho:

Does anybody know of a tool to convert an entire (mySQL) database or any way of handling this in PHP.

Quick 'n' dirty way? Dump the database into a .sql file and run iconv against it to get your new UTF-8 encoded file, then re-import it, replacing your existing database.

I'm a Linux user, so unfortunately I don't know about batch conversion tools on Windows. Does anyone know if any exist? I suspect that Dreamweaver may be able to do it, but I don't have any real experience with that program.

dcrombie: regarding your issue (glad you found the problem!) it is important to remember that a HTTP header always overrides a charset defined within a document - this can cause problems with misconfigured servers with default charsets defined - you should only use

AddDefaultCharset

if all the files on the server have the same encoding.

AlexK: thanks for the more detailed explanation! I actually use Ubuntu Linux, where UTF-8 is the default encoding across the OS. Much easier!

encyclo

4:12 pm on Oct 11, 2005 (gmt 0)

Another quick question - why am I getting words like KitzbÃ¼hel? This is a Germanic alphabet word. Thanks

Answer to your quick question: This is probably due to the fact that your page is encoded as UTF-8 but your meta-tag states something different, probably iso-8859-1.

I think that it will be the other way round - if you look you have two characters "Ã¼" in the place of one. That would indicate that the letter is encoded as a multi-byte character (ie. UTF-8) but the page is being served as ISO-8859-1. Check your server headers to see if there is an HTTP header overriding your stated encoding.
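The round trip is easy to reproduce (Python shown purely as an illustration of the mistake):

```python
# Encode ü as UTF-8 (two bytes), then mis-decode those bytes as
# ISO-8859-1: each byte becomes its own character, giving "Ã¼".
mangled = "Kitzbühel".encode("utf-8").decode("iso-8859-1")
print(mangled)  # KitzbÃ¼hel

# Reversing the mistake recovers the original text.
print(mangled.encode("iso-8859-1").decode("utf-8"))  # Kitzbühel
```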

surfgatinho

4:25 pm on Oct 11, 2005 (gmt 0)

Thanks for the help. I think the problem needs dealing with at the sql file level although I did have some success with using one of the iconv functions in PHP

Trisha

6:54 pm on Oct 11, 2005 (gmt 0)

Thanks! That explains a lot that I didn't understand!

But - I still don't know how to, for example, write proper spanish characters on a web site. I can make sure the html file is UTF-8, but how do I get the special characters without using HTML entities if I don't have keys for those characters on my keyboard?

AlexK

7:21 pm on Oct 11, 2005 (gmt 0)

encyclo:

I actually use Ubuntu Linux, where UTF-8 is the default encoding across the OS. Much easier!

Hmm. My server uses CentOS (RHEL) which, like all RedHat since RH8, uses UTF-8 as the default encoding. That screws SSH terminal access, and Perl (reputedly). I've had to reset to ISO-8859-1 + en-GB to get a terminal without multiple "--" prefixes etc.

Trisha:

how do I get the special characters without using HTML entities if I don't have keys for those characters on my keyboard?

Little personal experience, but if you're on Windows, look under the Help & Support Center: "Customizing your computer"-"Keyboard and mouse"-"To switch languages or keyboards from the taskbar". You will probably have to get some sticky-paper to put on the keys of your keyboard!

gaouzief

-I've had great results with MySQL 4.1 (UTF-8 native) and PHP 5 with mbstring, works smoothly

-Beware that PHP itself can only be written in ISO-8859 (variables, functions...), so if you need to write some strings (error messages for example) in your scripts, better to store them in separate text files or in the database

-Beware that some editors add a header called the "UTF-8 signature" (the byte-order mark, or BOM) to text files, which might cause you some headaches with PHP and other parsers

-URL-encoding and decoding data is a good way to manipulate UTF-8 data in a non-UTF-8 environment and to exchange data between PHP and JavaScript

-Always add accept-charset="UTF-8" to your <form> tags

-Always use Binary transfer in FTP, ALWAYS :)

hope it'll help some

r12a

9:51 am on Oct 12, 2005 (gmt 0)

If you are switching to UTF-8 you may also find useful information here:

Step 1: Save the data as UTF-8
Step 2: Declare the encoding in your page
Step 3: Ensure that your server does the right thing

It's a short overview article, but follow the links to other useful articles at appropriate places to complete the picture.

Those of you wondering about use of entities and NCRs could look at: [w3.org...]

larryhatch

11:00 am on Oct 12, 2005 (gmt 0)

Thanks much Encyclo!

" larryhatch: am I stuck with something like &c-snoodle in my source code?

" - you can use the &ccedil; entity reference for the ç but you are still faced with the disadvantages listed above. The problem you are experiencing is that you are changing the defined character encoding without actually converting the file to UTF-8. The exact method depends on your editor - most modern editors such as Dreamweaver and many text editors will correctly save your document as UTF-8 if you specify it. If you are on Linux or have a Linux/Unix server, you can use the iconv utility to convert the file."

- -

I changed c-snoodle to &ccedil;, and the Spanish 'enya' to &ntilde;. Both worked fine without changing the meta tag to content="text/html; charset=UTF-8".

That rendered OK on Firefox. Then I went ahead and changed to UTF-8 and it validated fine.

Not sure if I understand what you mean about the text editor. I use plain old Wordpad, and code everything by hand. I could not find any way to alter the character encoding.

WP shows &ccedil; and &ntilde; in full, exactly as I typed them in. Are you saying that newer editors will show the proper snoodle-c and enya-n with the tilde while I'm editing? If so, it's not such a big deal. I just want my pages to render properly.

Thanks again -Larry

Fischerlaender

11:15 am on Oct 12, 2005 (gmt 0)

Larry wrote:

I tried UTF-8 and it worked fine there too, BUT then pages would not validate in W3C!

I use plain old Wordpad, and code everything by hand. I could not find any way to alter the character encoding.

How did you create UTF-8 encoded pages with Wordpad? You need an editor which is able to save text encoded as UTF-8. If you were using Wordpad then your text was saved as ISO-8859-1 or even Windows-1252, but not as UTF-8.

I'm a big fan of UltraEdit, which handles UTF-8 correctly. Alternatively I think that NoteTab Light should also be able to encode text as UTF-8.

AlexK

11:45 am on Oct 12, 2005 (gmt 0)

larryhatch:

I use plain old Wordpad, and code everything by hand. I could not find any way to alter the character encoding.

WordPad does not have a "Save as UTF-8" option. It does have a "Save as Unicode" option, but this is 16-bit (see msg#15). The character encoding is determined by your Windows settings. See Control Panel-Regional and Language Options, and also msg#20.

Try TextPad [textpad.com] (a multi-file Notepad/wordpad replacement which can also write the file in Windows, DOS, Unix or Mac encoding). That can write the file directly into UTF-8 encoding, and even add the BOM if you wish.

Are you saying that newer editors will show the proper snoodle-c and enya-n with the tilde while I'm editing?

Only if you enter them in hi-bit (ç = alt+135, ñ = alt+164). Re-read encyclo's original post in this thread (msg#1). The HTML entities are a lo-bit way (chars 20-126) of avoiding the hi-bit incompatibilities between different ISO-8859-x encodings.

AlexK

12:04 pm on Oct 12, 2005 (gmt 0)

Fischerlaender:

If you were using Wordpad then your text was saved as ISO-8859-1 or even Windows-1252, but not as UTF-8.

WordPad is capable of saving text files in RTF, ANSI, DOS or Unicode. If "ANSI", the encoding will be 8-bit plain text, encoded within the local Windows code-page. If localised to US-ASCII, that will be Windows-1252 (other localisations use other code-pages).

The different Windows code-pages differ only in the hi-bit chars (128-255) amongst the Western scripts (aka ISO-8859-x). Thus, a text file consisting of 8-bit chars which are all within the lo-bit range (20-126) of the original US-ASCII encoding can legitimately be declared as UTF-8, Windows-1252, or any of the ISO-8859 series of encodings.
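This is easy to confirm - a Python sketch decoding the same pure-ASCII bytes under several declared encodings:

```python
# A file containing only chars in the lo-bit ASCII range decodes
# identically under every one of these declared encodings.
data = b"<p>Hello, world</p>"
decodings = {data.decode(enc) for enc in
             ("ascii", "utf-8", "windows-1252", "iso-8859-1", "iso-8859-15")}
print(len(decodings))  # 1 - all the encodings agree on pure-ASCII bytes
```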

sunnylyon

12:06 pm on Oct 12, 2005 (gmt 0)

What goes on with forms? My pages are OK, but when people fill out my PHP form and put in special characters, they come out as garbage. Given the previous discussion about PHP, would a CGI form solve the problem?

r12a

to get a better understanding of how to and when not to use HTML entities.

Btw, I don't think anyone has mentioned numeric character references - not even encyclo in msg #1 - which are a safer way of doing similar things to entities, and are much more scalable. Again, see the above article for an explanation.

AlexK

3:57 pm on Oct 12, 2005 (gmt 0)

sunnylyon:

when people fill out my php form and put in special characters they come out as garbage.

Read msg#21:

Always add accept-charset="UTF-8" to your <form> tags

Also, make sure that the Response header is correct (see msg#8) and that any <meta> headers do not conflict (see msg#10).

Trisha

5:03 pm on Oct 12, 2005 (gmt 0)

I find this topic fascinating, but very confusing!

I'm trying to get this right:

I'm using gentoo linux (barely know how to use it though, someone else installed it) - and I see in bluefish in preferences, under encoding, the default character set is set to UTF-8 - so I guess I'm ok there. And I've got charset=UTF-8 in my meta tags.

In bluefish (gedit too) when I hold down the ctrl and shift at the same time, then type 'f' and '1' - then let the keys up - I see an n with a tilde above it.

If I do that in an .html file and view it in Firefox, I see the n with the tilde just fine - but if I do it in a .php file, I get a box with 4 letters in it.

So, I guess I have a php problem like gaouzief was talking about. I have no idea what version of php my host is using or what 'mbstring' is. In this example, I am only using php for includes for the header, navigation, etc.