Mostly programming and video games.

Category Archives: Character sets

Character encoding is a lot like regular expressions. Every time I need to know how it works, I have to learn it all over again. The real reason for this post is to provide a cheat sheet for those times, but first I’ll give you the skinny on character encoding.

Consider the quotation mark character (“). There are many ways this character can be represented on a computer. Ultimately, like any character, the quotation mark boils down to a series of zeros and ones that can be understood by the various electronic doodads connected to your motherboard. To those doodads, the quotation mark is simply 100010.

How would you like to type that every time you needed a quotation mark? Just think, if you wanted to quote Monty Python and the Holy Grail, you’d have to type:

Every binary number has an equivalent decimal number as well as a hexadecimal number and an octal number. The binary number 100010 is equal to the decimal number 34. It’s also equal to the hexadecimal number 22 and the octal number 42. Don’t think about it too hard, just take my word for it:

100010 = 34 decimal = 22 hexadecimal = 42 octal

This makes it much easier to quote Monty Python, but it’s still kinda cryptic and painful to type it in decimal:

Fortunately for you (and especially for me, since I quote Monty Python a lot), there are layers and layers of software that run on top of the hardware in your computer that can translate for you. So you almost never have to type anything in binary or decimal anymore. These days, you will almost always be working with a character set, which maps each of those zeros and ones to their respective glyphs to make them easier to use. You can think of a character set like this:

" -> 100010
A -> 101001
n -> 1101110
o -> 1101111
...

A Unicode character set will map those characters to hexadecimal numbers instead of binary numbers:

t -> 74
h -> 68
e -> 65
r -> 72
...

Except instead of hexadecimal numbers they are called Unicode points, and they look like this:

s -> U+0073
h -> U+0068
r -> U+0072
u -> U+0075
...

Then some other layer will take care of mapping those hexadecimal numbers to zeros and ones for the sake of the doodads we mentioned earlier.

In normal fonts like Arial or Times, the decimal number 169 corresponds to the glyph for the copyright symbol. You can also use the hexadecimal number A9 pretty much the same way but with the letter x in front of it:

&#xA9;

Or you can use the hexadecimal number in your stylesheet instead. This is the approach taken by Dave Gandy when he created Font Awesome:

The reason I needed to do this (and you might find it useful too) was so I could see all the characters a font contained. I couldn’t use the Character Map utility in Windows because it didn’t allow me to adjust the font size, and I didn’t want to spend all day looking for a replacement. I found it much easier to simply print all the characters to a page using some JavaScript: