4.2 Character sets and encodings in tags

This section describes some intricacies of character sets and
encodings in tags.

Text in ID3v2 tags can be encoded in a variety of ways.
ID3v2.3 and earlier standards support only text encoded in ISO 8859-1 and UCS-2.
ID3v2.4 added support for UTF-8 and UTF-16BE[1], and replaced UCS-2 with UTF-16.
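
To make the version differences concrete, here is a minimal Python sketch of
how a reader might decode the body of an ID3v2 text frame. It relies on the
ID3v2 frame format, where a text frame body starts with a one-byte encoding
marker (0 = ISO 8859-1, 1 = UCS-2/UTF-16 with BOM, 2 = UTF-16BE, 3 = UTF-8).
The names ID3V2_TEXT_CODECS and decode_text_frame are hypothetical and do not
come from id3lib, TagLib or GMediaServer.

```python
# Encoding marker values used by ID3v2 text frames, mapped to Python codecs.
# Markers 0x02 and 0x03 are only valid in ID3v2.4.
ID3V2_TEXT_CODECS = {
    0x00: "latin-1",    # ISO 8859-1
    0x01: "utf-16",     # UCS-2 (v2.3) or UTF-16 (v2.4); the BOM selects byte order
    0x02: "utf-16-be",  # UTF-16BE without BOM (v2.4 only)
    0x03: "utf-8",      # UTF-8 (v2.4 only)
}


def decode_text_frame(payload: bytes, minor_version: int = 4) -> str:
    """Decode a text frame body (hypothetical helper, not a library API).

    payload is the frame body: one encoding marker byte followed by the text.
    minor_version is 3 for ID3v2.3 and 4 for ID3v2.4.
    """
    if not payload:
        return ""
    marker, data = payload[0], payload[1:]
    if marker not in ID3V2_TEXT_CODECS:
        raise ValueError(f"unknown text encoding marker: {marker:#04x}")
    if marker >= 0x02 and minor_version < 4:
        raise ValueError("UTF-16BE and UTF-8 require ID3v2.4")
    text = data.decode(ID3V2_TEXT_CODECS[marker])
    # Drop any trailing string terminators.
    return text.rstrip("\x00")
```

For example, decode_text_frame(b"\x03Bj\xc3\xb6rk") decodes a UTF-8 frame,
while the same payload with minor_version=3 is rejected, because ID3v2.3 does
not define marker 3.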

If you are using id3lib, only the ISO 8859-1 and UCS-2/UTF-16 encodings are
supported for ID3v2.4 tags. The current C API of id3lib would have to be
extended to support UTF-8 and UTF-16BE for ID3v2.4. (In particular, an
ID3Field_GetEncoding function is missing.)

TagLib seems to support all encodings used in ID3v2.4 tags.

Unfortunately, many applications still put UTF-8 encoded text in ID3v2.3 and earlier
tags. This is incorrect according to the standard[2]:
single-byte text should be encoded in ISO 8859-1 and nothing else. TagLib handles
all single-byte text in ID3v1 and ID3v2.3 tags as ISO 8859-1, while id3lib gives
you the option to treat the data as you like. At the moment, GMediaServer assumes
single-byte text is encoded in ISO 8859-1 when using id3lib.
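
The effect of this choice can be illustrated with a short Python sketch. It
shows what happens when UTF-8 bytes are read as ISO 8859-1, and one possible
workaround: try UTF-8 first and fall back to ISO 8859-1. This fallback is only
an assumption about how an application might treat mislabelled tags; it is not
what the standard prescribes and not what GMediaServer does. The function name
decode_single_byte_text and its assume_utf8 parameter are hypothetical.

```python
def decode_single_byte_text(data: bytes, assume_utf8: bool = False) -> str:
    """Decode single-byte tag text (hypothetical helper).

    By default the data is treated as ISO 8859-1, as the standard requires.
    With assume_utf8=True, UTF-8 is tried first and ISO 8859-1 is used only
    when the bytes are not valid UTF-8.
    """
    if assume_utf8:
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            pass  # Not valid UTF-8, so fall back to the standard encoding.
    # Every byte value maps to a character in ISO 8859-1, so this never fails.
    return data.decode("latin-1")
```

With the default setting, the UTF-8 bytes b"Bj\xc3\xb6rk" come out as the
mis-decoded string "BjÃ¶rk"; with assume_utf8=True they decode to "Björk".
Genuine ISO 8859-1 data such as b"Bj\xf6rk" is not valid UTF-8, so it still
decodes to "Björk" through the fallback. The heuristic can misfire on ISO
8859-1 text that happens to form valid UTF-8 sequences, which is one reason to
keep it optional.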

Footnotes

[1] The difference between UTF-16,
UTF-16BE and UTF-16LE is that strings encoded with UTF-16 must start with a byte
order mark, a so-called BOM. For UTF-16 the BOM is either 0xFF 0xFE (denoting
little endian) or 0xFE 0xFF (denoting big endian). UTF-16BE and UTF-16LE strings
carry no BOM, because their byte order is fixed by the encoding itself.