Unicode

The simplest character set in common use on computers is ASCII, the American Standard Code for Information Interchange. It uses a 7-bit numerical value, normally stored in one 8-bit byte, to represent the letters of the English alphabet, digits, punctuation and control codes. Accented characters (as seen in French writing) are not part of ASCII itself; they come from 8-bit extensions such as ISO 8859-1. Almost all modern operating systems use ASCII, as do many older computer systems.

Unicode takes this concept of a numerical value representing a character and extends it to cover the scripts of (virtually) all the known languages in the world. That is well over 100,000 characters, so UTF-8, the most widely used encoding of Unicode, uses sequences of 1, 2, 3 or even 4 bytes to represent them:

the 1-byte sequences cover the ASCII characters, that is, the basic English alphabet, digits and punctuation;

the 2-byte sequences cover the more common alphabets, including Arabic, Armenian, Cyrillic, Greek, Hebrew, accented Latin letters, and Syriac;

the 3-byte sequences cover the remaining scripts of the Basic Multilingual Plane, such as the Indic scripts, Thai, and most Chinese, Japanese and Korean characters;

the 4-byte sequences cover the supplementary planes: rarer and historic scripts, as well as symbols such as emoji.
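The four length classes above can be checked from the shell. This is a minimal sketch, not part of the original page: utf8_len is a made-up helper name, and the sample characters are written as octal byte escapes so the script itself stays pure ASCII.

```sh
# Print how many bytes UTF-8 needs for one character.
# $1 is a printf format string holding the raw UTF-8 bytes of the character.
utf8_len() {
    printf "$1" | wc -c | tr -d ' '
}

utf8_len 'A'                  # U+0041 LATIN CAPITAL LETTER A: prints 1
utf8_len '\303\251'           # U+00E9 e with acute accent:    prints 2
utf8_len '\342\202\254'       # U+20AC euro sign:              prints 3
utf8_len '\360\235\204\236'   # U+1D11E musical G clef:        prints 4
```

The tr call only strips the padding that some wc implementations put in front of the count.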

In addition to the characters themselves, the standard also defines directionality ("handedness"), that is, which way the text flows. Western languages are typically written left-to-right (like the text on this page), while others, such as the Middle Eastern languages Arabic and Hebrew, are written right-to-left.

With ASCII, one character takes one byte, so a 100-letter document would (theoretically) be 100 bytes on disk. The same number of characters in a Unicode document could take 2, 3 or 4 times that, depending on the encoding used and the characters involved. UTF-8 is backwards compatible with ASCII: the ASCII characters are encoded as the same single bytes, so a pure-ASCII file is already valid UTF-8.
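The size difference is easy to see with iconv and wc. This is a rough sketch, assuming an iconv that accepts the encoding names ISO-8859-1 and UTF-8; latin1_size and utf8_size are hypothetical helper names, and the Swedish word smörgås is written with octal escapes for its two ISO 8859-1 bytes.

```sh
# Byte count of a text given as ISO 8859-1 bytes (printf format string in $1).
latin1_size() {
    printf "$1" | wc -c | tr -d ' '
}

# Byte count of the same text after conversion to UTF-8.
utf8_size() {
    printf "$1" | iconv -f ISO-8859-1 -t UTF-8 | wc -c | tr -d ' '
}

latin1_size 'hello'           # prints 5
utf8_size   'hello'           # prints 5: pure ASCII does not grow
latin1_size 'sm\366rg\345s'   # prints 7
utf8_size   'sm\366rg\345s'   # prints 9: the two accented letters take 2 bytes each
```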

There is another character set typically found on older mainframes, most notably from IBM, called EBCDIC, the Extended Binary-Coded Decimal Interchange Code. There is a variation called UTF-EBCDIC to enable legacy applications running on these systems to utilise Unicode.

Using UTF-8 in FreeBSD

First we need to set the LC_ALL and LANG variables. To do that, find out which locales support UTF-8:

$ cd /usr/share/locale/; ls -d *UTF-8
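An alternative to listing the directory is to ask locale(1) for the installed locales and keep only the UTF-8 ones. This is a small sketch; filter_utf8 is a made-up helper name, and the pattern assumes the codeset is spelled UTF-8 or utf8 at the end of the locale name.

```sh
# Keep only the locale names that end in a UTF-8 codeset.
# Reads locale names on stdin, one per line.
filter_utf8() {
    grep -i 'utf-*8$'
}

# locale -a prints every locale installed on the system.
locale -a | filter_utf8 || echo "no UTF-8 locales found"
```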

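Add the following environment variables to the appropriate file for your shell: ~/.profile for sh, or ~/.bashrc for bash. The csh and tcsh shells read ~/.login and use setenv instead (setenv LC_ALL sv_SE.UTF-8).

export LC_ALL=sv_SE.UTF-8
export LANG=sv_SE.UTF-8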

Now log out and log back in for the change to take effect.
After that you should enable UTF-8 support in your terminal; see the Applications section for this.
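To check that the new locale is in effect after logging back in, the character map reported by locale(1) can be inspected. This is a minimal sketch; is_utf8 is a hypothetical helper, and the list of spellings it accepts is an assumption about how systems name the codeset.

```sh
# Succeed if the given charmap name means UTF-8.
is_utf8() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# locale charmap prints the codeset of the current locale.
if is_utf8 "$(locale charmap 2>/dev/null)"; then
    echo "UTF-8 locale is active"
else
    echo "not a UTF-8 locale; check LC_ALL and LANG"
fi
```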

Converting files

Now you're ready to convert some files. This is done with the iconv command; install it if you don't already have it (recent FreeBSD releases ship iconv in the base system).

# pkg_add -r libiconv

Then use the following to convert a file.

$ iconv -f iso8859-1 -t utf-8 file > file.new

This is a small script that converts a bunch of files and creates a backup of them in another directory.
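A minimal sketch of what such a script might look like, assuming the files are in ISO 8859-1 and that iconv accepts the names ISO-8859-1 and UTF-8. The function name convert_to_utf8 and the backup directory in the usage line are placeholders.

```sh
#!/bin/sh
# Convert files from ISO 8859-1 to UTF-8 in place, keeping the originals
# in a backup directory.
# Usage: convert_to_utf8 BACKUP_DIR FILE...
convert_to_utf8() {
    backup_dir=$1
    shift
    mkdir -p "$backup_dir" || return 1
    for f in "$@"; do
        # skip anything that is not a regular file
        [ -f "$f" ] || continue
        # keep an untouched copy before converting
        cp -p "$f" "$backup_dir/" || return 1
        # convert into a temporary file so a failed run cannot clobber the original
        if iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.utf8.tmp"; then
            mv "$f.utf8.tmp" "$f"
        else
            rm -f "$f.utf8.tmp"
            echo "conversion failed: $f" >&2
        fi
    done
}

# Example: convert_to_utf8 "$HOME/latin1-backup" *.txt
```

Two caveats: files with the same name from different directories overwrite each other in the backup directory, and running the function twice on the same file converts it twice and garbles the accented characters.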

Applications

xterm

To make xterm play nice I added the following resource to ~/.Xdefaults:

$ echo "xterm*locale: UTF-8" >> ~/.Xdefaults

It may also be necessary to change the font; see Unicode support on FreeBSD.

irssi + screen

Unfortunately I haven't found any way to get irssi+screen+FiSH to work without a restart of irssi.
So restart screen with the new locale. This config will enable you to send ISO8859-1 by default in irssi.