Everything You Need to Know About Character Encoding

Alongside DTDs, character encoding is one of the first things you need to understand when making web pages: what it is, why it matters, and how to declare it.

Purpose

Setting the character encoding tells web browsers how to turn the bytes in your file into characters, and therefore which writing systems and symbols will display correctly on the webpage.
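To see why this matters, here is a minimal Python sketch (Python is used for illustration only; it is not part of the class materials). It reads the same two bytes under two different encodings and gets two different results. The variable names are made up for this example.

```python
# The same two bytes, interpreted under two different encodings.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")         # one character: é
as_latin1 = raw.decode("iso-8859-1")  # two characters: Ã©

print(as_utf8, len(as_utf8))
print(as_latin1, len(as_latin1))
```

If a browser guesses the wrong encoding, this is the kind of garbled text your visitors see.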

Some Character Encodings

There are lots of different character encodings you could potentially use on your webpages. In this section, I’ll look at the biggies you should know.

US-ASCII

Around since the 1960s (work on the standard began in 1960, and it was first published in 1963), the American Standard Code for Information Interchange (ASCII, pronounced askee) is based on the English alphabet, along with some other characters, giving a total of 128:

94 printable characters (a, A, 1, +)

33 non-printing control characters (most of which are now obsolete)

1 space
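As a quick check of that breakdown, this short Python sketch counts the three groups among code points 0 through 127. It relies on Python's built-in str.isprintable(), which treats the space as printable, so the space is separated out by hand.

```python
# All 128 ASCII code points as one-character strings.
codes = [chr(i) for i in range(128)]

space = [c for c in codes if c == " "]
printable = [c for c in codes if c.isprintable() and c != " "]
control = [c for c in codes if not c.isprintable()]

print(len(printable), len(control), len(space))  # 94 33 1
```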

The following figure shows the 128 characters found in ASCII:

ASCII doesn’t provide for any special characters—like the Euro (€), anything that’s not English, or any formatting (nothing bold or italic)—so it’s often called plain text.

Needless to say, don’t use ASCII as your character encoding—it’s way too limited!
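The limitation is easy to demonstrate. In this Python sketch, fits_in_ascii is a hypothetical helper written for this example; it simply tries to encode a string as ASCII and reports whether that worked.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text exists in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("Plain text, 100%"))  # True
print(fits_in_ascii("Price: 5 €"))        # False: no Euro sign in ASCII
print(fits_in_ascii("café"))              # False: no accented letters either
```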

ISO-8859-1

ISO-8859-1 is a standardized character encoding.

The ISO part refers to the International Organization for Standardization (often informally called the International Standards Organization), the same group that has determined standards for:

CD-ROMs & DVD-ROMs

Film speed

Paper sizes

Screw threads

Water-resistant watches

Bicycle tires

Shoe sizes

8859-1 is the number of the ISO standard: part 1 of ISO/IEC 8859, which in this case defines a particular character encoding.

ISO-8859-1 is also known as:

Latin alphabet No. 1

ISO Latin 1

ISO-8859-1 was for many years a common character encoding on the Web. It contains:

all the characters found in ASCII,

the various accented characters and letters needed for writing Western European languages (like French & Spanish),

along with some special characters.

You can see those additional characters in the figure below:

ISO-8859-1 used to be the recommended character encoding for webpages, but that time is long gone. Instead, use UTF-8, discussed next.
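One reason ISO-8859-1 fell out of favor is that it stores every character in a single byte, which leaves room for only 256 characters. The Python sketch below illustrates this; latin1_byte is a hypothetical helper name, not a standard function.

```python
def latin1_byte(ch: str):
    """Return the single ISO-8859-1 byte value for ch, or None if it has none."""
    try:
        return ch.encode("iso-8859-1")[0]
    except UnicodeEncodeError:
        return None


print(latin1_byte("A"))  # 65, the same value as in ASCII
print(latin1_byte("é"))  # 233
print(latin1_byte("ñ"))  # 241
print(latin1_byte("€"))  # None: the Euro sign is not in ISO-8859-1
```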

In addition to ISO-8859-1, by the way, there are many other ISO-8859 encodings, including these:

ISO 8859-2: Central & East European

ISO 8859-3: South European, Maltese & Esperanto

ISO 8859-4: North European

ISO 8859-5: Cyrillic

ISO 8859-6: Arabic

ISO 8859-7: Modern Greek

ISO 8859-8: Hebrew & Yiddish

ISO 8859-9: Turkish

ISO 8859-10: Nordic (Lappish, Inuit, Icelandic)

ISO 8859-11: Thai

ISO 8859-13: Baltic Rim

ISO 8859-14: Celtic

ISO 8859-16: South-Eastern Europe

UTF-8

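UTF-8 (Unicode Transformation Format, 8-bit) is a newer standard that dates from 1992-1993. It is a way of storing Unicode characters as bytes, so it can encompass essentially every character in every written language in the world: more than 107,000 characters found in 90 writing systems as of this writing, with more added in each Unicode release.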

Because it is so comprehensive, UTF-8 is now widely recommended & steadily becoming the standard way to represent text in files, email, webpages, & software.
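UTF-8 manages this by using a variable number of bytes per character: one byte for ASCII characters and up to four for everything else. A small Python sketch shows the idea; the sample characters are arbitrary choices for illustration.

```python
# A Latin letter, an accented letter, the Euro sign, and an emoji.
samples = ["A", "é", "€", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the Unicode code point, the byte count, and the bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Sizes of the four samples above: 1, 2, 3, and 4 bytes.
sizes = [len(ch.encode("utf-8")) for ch in samples]
```

Because plain ASCII characters keep their one-byte values, any pure-ASCII file is already valid UTF-8.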

How To Specify the Character Encoding

There are several ways you can tell web browsers what character encoding your webpages are using.

Web Server

If your web server is set up to include the character encoding in the HTTP Content-Type header, then you don’t need to add anything to your web pages. (HTTP headers are hidden information transferred back and forth between a web browser & a web server.) In that case, the web server sends browsers a header like this:

Content-Type: text/html; charset=UTF-8

Keep in mind that this would only work if:

Your webpages are hosted and served via a web server, and

Your web server is configured to send the HTTP Content-Type header with a charset

How do you know if these are true?

Ask your hosting provider.

Use the Live HTTP Headers extension for Firefox, or the Network panel in your browser’s built-in developer tools, to view the hidden information transferred back and forth between web servers and web browsers.
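Once you have a Content-Type header value in hand, the charset can be pulled out of it programmatically. This Python sketch uses the standard library's email.message.Message class, which understands this header format; charset_from_content_type is a hypothetical helper name chosen for this example.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the lowercased charset from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

A result of None means the server is not declaring an encoding, so the page itself has to.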

Since the webpages we’re creating in class are on your local computer and not on a server, you’ll need to use the next method: an HTML META Element.

HTML META Element

In your webpage, you insert a META element like this inside the HEAD element:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

This META element should appear very early in your code, even before the TITLE element, so the browser knows the encoding before it has to render any of the text that your users see.
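Placement matters because browsers look for this declaration only near the start of the file (the HTML Standard requires it to fall within the first 1024 bytes). The Python sketch below imitates that scan in a very simplified way. The function name and the regular expression are assumptions made for illustration, not how any real browser is implemented.

```python
import re

# Simplified pattern: a META tag that contains charset=<name>.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)


def sniff_meta_charset(html_bytes: bytes):
    """Return the charset named in an early META element, or None."""
    match = META_CHARSET.search(html_bytes[:1024])  # only scan the start
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = (
    b"<html><head>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b"<title>Example</title></head><body></body></html>"
)
print(sniff_meta_charset(page))  # utf-8
```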

What You Should Use

In this class, you should use a character encoding META element (again, since your webpages are on your local computer and not hosted on a web server). Which one you use, though, depends upon your DTD.