Unicode

Unicode is necessary for international web development but poses a few pitfalls.

Unicode and HTTP

Now that we have gotten the basics out of the way, let's
consider how Unicode documents are transferred across the Web. The
basic problem is this: when your browser receives a document, how
does it know if it should interpret the bytes as Latin-1, Big-5
Chinese or UTF-8?

The answer lies in the Content-type HTTP header. Every time
an HTTP server sends a document to a browser, it identifies the
type of content it is sending using a MIME-style designation, such
as text/html, image/png or application/msword. If you receive a
JPEG image (image/jpeg), there is only one way to represent the
image. But if you receive an HTML document (text/html), the
Content-type header must indicate the character set and/or encoding
that is being used. We do this by appending a charset= designation
to the header, separated from the content type by a semicolon. For
example:

Content-type: text/html; charset=utf-8
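A browser (or any server-side program) splits such a header into a media type and its parameters. As a sketch, Python's standard library can do the same parsing; this is illustrative and not tied to any particular server:

```python
# Parse a Content-type header into its media type and charset parameter.
from email.message import Message

msg = Message()
msg["Content-Type"] = "text/html; charset=utf-8"

media_type = msg.get_content_type()    # the MIME type itself
charset = msg.get_param("charset")     # the declared character set
print(media_type)  # text/html
print(charset)     # utf-8
```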

Purists rightly say that UTF-8 is an encoding and not a
character set. Unfortunately, it's too late to do anything about
this. This is similar to the fact that the word “referrer” is
misspelled in the HTTP specification as “referer”; everyone knows
that it's wrong but is afraid to break existing software.

If no charset is specified in the Content-type header, Latin-1
(ISO-8859-1) is assumed. In that case, an individual document can
set (or override) the default from within a metatag, such as
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
A metatag cannot override a character set stated explicitly in the
HTTP header, however.
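The lookup a browser performs for such a metatag can be sketched with Python's standard HTML parser; this is an illustration of the idea, not how any particular browser is implemented:

```python
# Scan an HTML fragment for a charset declared in a Content-Type metatag,
# the way a browser might when the HTTP header does not specify one.
from html.parser import HTMLParser

class MetaCharset(HTMLParser):
    charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)  # attribute names arrive lowercased
            if d.get("http-equiv", "").lower() == "content-type":
                content = d.get("content", "")
                if "charset=" in content:
                    self.charset = content.split("charset=")[1].strip()

p = MetaCharset()
p.feed('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">')
print(p.charset)  # utf-8
```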

As you begin to work with different encodings, you will
undoubtedly discover an HTTP server that has not been configured
correctly and that is announcing the wrong character set in the
Content-type header. An easy way to check this is to use Perl's LWP
(library for web programming), which includes a number of useful
command-line programs for web developers, for example:

$ HEAD http://yad2yad.huji.ac.il/

Typing the above on my Linux box returns the HTTP response
headers from the named site. In this case, the Content-type header
declares the document to be in UTF-8.
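If you don't have LWP installed, the same check can be performed from Python's standard library. To keep this sketch self-contained and runnable anywhere, it stands up a throwaway local server that declares charset=utf-8 and issues a HEAD request against it; against a real site you would simply point the request at that site's URL instead:

```python
# Issue a HEAD request and inspect the Content-type header -- the same
# check LWP's HEAD command performs. A tiny local server stands in for
# the remote site so the example runs without network access.
import http.server
import threading
from urllib.request import Request, urlopen

class Handler(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo's output clean

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(f"http://127.0.0.1:{server.server_port}/", method="HEAD")
with urlopen(req) as resp:
    content_type = resp.headers["Content-Type"]

server.shutdown()
print(content_type)  # text/html; charset=utf-8
```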

Mozilla and other modern browsers allow the user to override
the explicitly stated encoding. Although this should not normally
be necessary for end users, I often find this functionality to be
useful when developing a site.

Unicode and HTML

Although it's nice to know we can transfer UTF-8 documents
via HTTP, we first need some UTF-8 documents to send. Given that
ASCII documents are all UTF-8 documents as well, it's easy to
create valid UTF-8 documents, so long as they contain only ASCII
characters. But what happens if you want to create HTML pages that
contain Hebrew or Greek? Then things start to get interesting and
difficult.
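The claim that every ASCII document is already a valid UTF-8 document can be checked directly: UTF-8 was designed so that the 128 ASCII characters keep their single-byte encodings, while everything else uses multi-byte sequences. A quick sketch:

```python
# Any ASCII byte sequence decodes identically as ASCII and as UTF-8,
# because UTF-8 reserves byte values 0-127 for the ASCII characters.
data = b"<p>Hello, world</p>"
as_ascii = data.decode("ascii")
as_utf8 = data.decode("utf-8")
assert as_ascii == as_utf8

# Non-ASCII characters, by contrast, need multi-byte UTF-8 sequences:
aleph_bytes = "\u05d0".encode("utf-8")  # Hebrew aleph
print(aleph_bytes)  # b'\xd7\x90'
```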

There are basically two ways to include Unicode characters in
an HTML document. The first is to type the characters themselves
using an editor that can work with UTF-8. For example, GNU Emacs
allows me to enter text using a variety of keyboard options and
then save my document in the encoding of my choice, including
UTF-8. If I try to save a Chinese document in the Latin-1 encoding,
Emacs will refuse to comply, warning me that the document contains
characters that do not exist in Latin-1. Unfortunately, for people
like me who want to use Hebrew, Emacs doesn't yet handle
right-to-left input.
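The refusal Emacs makes at save time is easy to mimic; any encoding-aware tool performs the same check. A sketch in Python (not Emacs itself):

```python
# Latin-1 covers only 256 characters, so Chinese text cannot be encoded
# in it; an encoding-aware editor must refuse to save, as Emacs does.
text = "\u4e2d\u6587"          # two Chinese characters
utf8_ok = text.encode("utf-8")  # succeeds: UTF-8 covers all of Unicode

refused = False
try:
    text.encode("latin-1")
except UnicodeEncodeError:
    refused = True  # Latin-1 has no Chinese characters

print("refused:", refused)  # refused: True
```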

A better option, and one that grows more impressive all the
time, is Yudit, an open-source, UTF-8-compliant editor that
handles many different languages and writing directions. It can take a
while to learn to use Yudit, but it does work. Yudit, like Emacs,
allows you to enter any character you want, even if your operating
system or keyboard does not directly support all of the desired
languages.

Both Emacs and Yudit are good options if you are working on
Linux, if you are willing to tinker a bit, and if you don't mind
writing your HTML by hand. But nearly all of the graphic designers
I know work on other platforms, and getting them to work with HTML
editors that use UTF-8 has been rather difficult.

Luckily, Mozilla comes with not only a web browser but a
full-fledged HTML editor as well. As you might expect, Mozilla's
Composer module is a bit rough around the edges but handles most
tasks just fine.

Another option is to use HTML entities. The best-known
entities are &lt;, &gt; and &amp;, which make it possible to
insert the <, > and & symbols into an HTML document without having
to worry that they will be interpreted as markup. HTML also
supports numeric character references, such as &#1488; (or the
hexadecimal &#x5D0;) for the Hebrew letter aleph, which let you
express any Unicode character using only ASCII.
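Numeric entities can be generated mechanically. This sketch replaces every non-ASCII character in a string with its decimal character reference, roughly what an entity-emitting save-as-HTML feature does:

```python
def entify(text):
    # Keep ASCII characters as-is; replace everything else with &#NNNN;
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

result = entify("shalom: \u05e9\u05dc\u05d5\u05dd")
print(result)  # shalom: &#1513;&#1500;&#1493;&#1501;
```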

Creating a document this way does not require a
Unicode-compliant editor, and it will render fine in any modern
browser, regardless of the Content-type that was declared in the
HTTP response headers. However, editing a file full of numeric
entities is tedious and difficult at best. Unfortunately, the
save-as-HTML feature in the international editions of Microsoft
Word uses this extensively, which makes it easy for Word users to
create Unicode-compliant documents but difficult for people to edit
them later.
