UTF-8 vs. ISO-8859-1

I have been led to believe that UTF-8 is a larger character set than ISO-8859-1 and that it is increasing in popularity and use throughout the internet.

However, I have noticed that if the page is served as UTF-8, the pound sign shows up with the error character �. Now, I know that to display it the HTML entity &pound; ought to be used. However, if UTF-8 can't display it when not using &pound; and ISO-8859-1 can, why is everyone moving towards UTF-8? What's wrong with good old ISO-8859-1?

The character encoding that you specify for a web page must match the encoding you used when saving your file. If you save your file as ISO-8859-1 and declare the encoding as UTF-8 (or vice versa) there'll be problems if you use characters outside the ASCII range.

ISO-8859-1 is both a character repertoire ('character set') and an encoding. It's a straight single-byte, one-to-one encoding, which means it contains 256 positions (0x00-0xFF). Quite a few of those are reserved for control characters (C0 in 0x00-0x1F and C1 in 0x80-0x9F). That leaves 192 positions for printable characters (191 once you discount DEL at 0x7F, which is also a control character), which is enough for simple texts in most Western European languages. Unfortunately, ISO-8859-1 doesn't include some very useful and common characters, like proper quotation marks and dashes. It also doesn't contain the euro currency character (€). (ISO-8859-15 is meant to replace ISO-8859-1, and contains the euro sign.)
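To make the one-to-one mapping concrete, here is a small sketch in Python (my choice of language; `latin-1` is Python's codec name for ISO-8859-1). It decodes all 256 byte values and counts the control characters. Note that it also counts DEL (0x7F) as a control character, so it arrives at 191 non-control characters rather than the round figure of 192.

```python
import unicodedata

# Every byte value 0x00-0xFF decodes to the code point with the same number.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
one_to_one = all(ord(ch) == i for i, ch in enumerate(decoded))

# Control characters (C0, DEL and C1) have the Unicode general category "Cc".
controls = [ch for ch in decoded if unicodedata.category(ch) == "Cc"]
non_controls = len(decoded) - len(controls)

# The euro sign has no position in ISO-8859-1, so encoding it fails.
try:
    "€".encode("latin-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(one_to_one, len(controls), non_controls, euro_fits)
```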

UTF-8 is an encoding for the Unicode character repertoire. It uses between one and four bytes to encode each character and can thus represent any Unicode character. (The original design allowed sequences of up to six bytes, but RFC 3629 restricts UTF-8 to four.) The first 128 characters (0x00-0x7F) are encoded identically to ISO-8859-1 and ASCII.
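As a rough illustration of the variable length, the Python sketch below encodes a few sample characters (my own picks, not from the discussion) and counts the bytes. For any character in today's Unicode the longest sequence is four bytes. The helper name `utf8_length` is just for this example.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# ASCII letter, pound sign, euro sign, and an emoji outside the BMP.
lengths = {ch: utf8_length(ch) for ch in ("A", "£", "€", "\U0001F600")}

# For the ASCII range the bytes are the same in UTF-8 and ISO-8859-1.
ascii_bytes = bytes(range(128))
ascii_identical = ascii_bytes.decode("utf-8") == ascii_bytes.decode("latin-1")

print(lengths, ascii_identical)
```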

The character repertoire used in HTML is ISO 10646, which is virtually the same as Unicode. Both UTF-8 and ISO-8859-1 (and many others) can be used as the encoding, but ISO-8859-1 is much more limited since it can only represent the first 256 characters (and roughly a quarter of those are control characters).

If you want to include a character that cannot be represented in your chosen encoding, you can use character entities (e.g., &pound;) or numeric character references (&#163; or &#xa3;).
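If you generate pages from a script, you don't have to write the references by hand. As a sketch in Python (the sample text is made up), the standard `xmlcharrefreplace` error handler turns anything the target encoding can't represent into a numeric character reference, and `html.unescape` goes the other way:

```python
import html

text = "Price: £5"

# Characters that don't fit in ASCII become numeric character references.
ascii_html = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_html)  # b'Price: &#163;5'

# All three spellings refer to the same character.
decoded = [html.unescape(ref) for ref in ("&pound;", "&#163;", "&#xa3;")]
print(decoded)
```

The output of the first step is plain ASCII, so it displays the same whether the page is declared as ISO-8859-1 or UTF-8.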

The pound sign (£) is encoded differently in ISO-8859-1 (a single byte, A3) and UTF-8 (two bytes, C2 A3). If you include a literal £ sign, and your declared encoding doesn't match the encoding in which you saved your file, the pound sign will not display correctly.
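You can reproduce both kinds of mismatch in a few lines of Python. This is only a sketch of what a browser ends up doing when the declared encoding is wrong; the variable names are mine.

```python
pound = "£"

latin1_bytes = pound.encode("iso-8859-1")  # b'\xa3'
utf8_bytes = pound.encode("utf-8")         # b'\xc2\xa3'

# Saved as ISO-8859-1 but declared as UTF-8: the lone byte A3 is not a
# valid UTF-8 sequence, so it shows up as the replacement character U+FFFD.
saved_latin1_read_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Saved as UTF-8 but declared as ISO-8859-1: each byte becomes its own
# character, so an extra "Â" appears in front of the pound sign.
saved_utf8_read_latin1 = utf8_bytes.decode("iso-8859-1")

print(repr(saved_latin1_read_utf8), repr(saved_utf8_read_latin1))
```

The first case is the � from the original question; the second is the equally familiar "Â£".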

If you want to use UTF-8, you must save your file with an encoding of UTF-8 and declare the encoding to be UTF-8. The encoding can be declared using the charset parameter in the Content-Type HTTP header, e.g.:

Code:

Content-Type: text/html; charset=utf-8

If the encoding is not specified in the HTTP header, you can specify it using a META element:

HTML Code:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Such a META element will be ignored if the information is sent in the real HTTP headers, but it can be useful when the document is saved to disk and viewed locally.
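The precedence can be sketched roughly like this in Python. It is a deliberate simplification (real browsers follow a more involved detection algorithm), and the function names and the regular expression are my own, purely for illustration.

```python
import re
from email.message import Message

# Simplified pattern: finds charset=... inside a META tag.
META_RE = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # None if there is no charset parameter

def charset_from_meta(html_bytes):
    """Look for a META declaration near the start of the document."""
    match = META_RE.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def effective_charset(content_type, html_bytes):
    """The HTTP header wins; the META element is only a fallback."""
    header = charset_from_header(content_type) if content_type else None
    return header or charset_from_meta(html_bytes)

page = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(effective_charset("text/html; charset=utf-8", page))  # utf-8
print(effective_charset(None, page))                        # iso-8859-1
```

Passing `None` for the header stands in for the "saved to disk and viewed locally" case, where there is no HTTP header at all.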

For (real) XHTML, the encoding should be specified in the XML declaration (and omitted from the HTTP header):

Code:

<?xml version="1.0" encoding="utf-8"?>

This will only be applied if the document is served with an XML MIME type (preferably application/xhtml+xml). In that case, any META equivalent will be ignored.

XML parsers are only required to support UTF-8 and UTF-16. XML parsers used in web browsers are likely to support the same range of encodings as the accompanying HTML parsers, but if you want to be on the safe side you should only use UTF-8 or UTF-16 for XML (including XHTML).

I haven't used Notepad2, but I'd guess that it uses ISO-8859-1 (or Windows-1252) as the default. Windows-1252 is a Microsoft-specific version of ISO-8859-1 that uses the range reserved for C1 control characters (0x80-0x9F) for some useful characters (nice quotation marks, dashes, etc.).
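To see the difference between the two, here is a short Python sketch that decodes a few bytes from the 0x80-0x9F range both ways (`cp1252` is Python's codec name for Windows-1252; the choice of sample bytes is mine):

```python
import unicodedata

sample_bytes = (0x80, 0x93, 0x94, 0x96)

# In Windows-1252 these are the euro sign, curly double quotes and an en dash.
as_cp1252 = {b: bytes([b]).decode("cp1252") for b in sample_bytes}

# In ISO-8859-1 the same bytes are invisible C1 control characters.
latin1_all_controls = all(
    unicodedata.category(bytes([b]).decode("latin-1")) == "Cc"
    for b in sample_bytes
)

# Outside 0x80-0x9F the two encodings agree byte for byte.
same_elsewhere = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
    for b in list(range(0x00, 0x80)) + list(range(0xA0, 0x100))
)

print(as_cp1252, latin1_all_controls, same_elsewhere)
```

This is also why the pound sign (A3) can't tell you which of the two encodings a file uses: it sits in the range where they are identical.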

You'll need to look at the file using a tool that can show you the exact byte values. I use the Vim editor, which can do this, but you could also use any hex dump utility. If you know any programming language, it would be a trivial exercise to write a program that displays the byte values.

Look at the code for the pound sign. If it's 163 (decimal) or A3 (hex), then you're probably using ISO-8859-1, although it could also be Windows-1252. If it's two bytes (C2 A3), you're using UTF-8.
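Here is roughly what such a program might look like in Python. It's a sketch, not a general encoding detector: `hex_dump` and `guess_pound_encoding` are names I made up, and the guess only works if the data actually contains a pound sign.

```python
def hex_dump(data: bytes) -> str:
    """Show each byte as two hex digits."""
    return " ".join(f"{b:02X}" for b in data)

def guess_pound_encoding(data: bytes) -> str:
    """Guess the encoding from how the pound sign is stored."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; a bare A3 byte points to a single-byte encoding.
        if b"\xa3" in data:
            return "iso-8859-1 or windows-1252"
        return "unknown"
    return "utf-8" if "£" in text else "no pound sign found"

for sample in ("£5".encode("utf-8"), "£5".encode("iso-8859-1")):
    print(hex_dump(sample), "->", guess_pound_encoding(sample))
```

For a real file you would pass in the result of `open(path, "rb").read()` instead of the sample bytes.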

When you open files that contain only ASCII bytes, Notepad2 will assume Windows-1252 (a known bug; there should be a preference, or it should honor the default). To prevent this, include the BOM or a non-ASCII character somewhere.

If you have many files, you may want to automate the conversion (using iconv, for instance).
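If you'd rather script it than use iconv, a conversion can be sketched in Python along these lines. The function name and the assumption that the source files are ISO-8859-1 are mine; check what your files really contain before running anything like this, and keep backups.

```python
import tempfile
from pathlib import Path

def convert_to_utf8(path: Path, source_encoding: str = "iso-8859-1") -> bool:
    """Rewrite a file as UTF-8. Returns False if it was already valid UTF-8."""
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already UTF-8 (or pure ASCII), leave it alone
    except UnicodeDecodeError:
        pass
    path.write_bytes(raw.decode(source_encoding).encode("utf-8"))
    return True

# Demo on a throwaway file so nothing real gets touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "page.html"
    demo.write_bytes("£5".encode("iso-8859-1"))
    first_run = convert_to_utf8(demo)
    converted = demo.read_bytes()
    second_run = convert_to_utf8(demo)

print(first_run, converted, second_run)
```

To process a whole site you could loop over something like `Path("site").rglob("*.html")` (the directory name is hypothetical) and call the function on each file. Remember to change the declared encoding in the HTTP header or META element at the same time.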