The Unicode Standard includes characters from the Basic
Multilingual Plane (BMP) and supplementary characters that lie
outside the BMP. This section describes support for Unicode in
MySQL. For information about the Unicode Standard itself, visit
the Unicode Consortium website.

BMP characters have these characteristics:

Their code point values are between 0 and 65535 (or
U+0000 and U+FFFF).

They can be encoded in a variable-length encoding using 8,
16, or 24 bits (1 to 3 bytes).

They can be encoded in a fixed-length encoding using 16 bits
(2 bytes).

They are sufficient for almost all characters in major
languages.
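
The byte counts above can be checked outside MySQL. The following is a minimal sketch using Python's built-in codecs as stand-ins for the variable-length (UTF-8) and fixed-length (big-endian UTF-16) encodings. The helper name `bmp_sizes` and the sample characters are illustrative assumptions, not part of MySQL.

```python
def bmp_sizes(ch):
    """Return (UTF-8 size, UTF-16 size) in bytes for a single BMP character."""
    if ord(ch) > 0xFFFF:
        raise ValueError("not a BMP character")
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# Arbitrary sample characters: U+0041, U+00E9, U+20AC.
for ch in ["A", "\u00e9", "\u20ac"]:
    utf8_len, utf16_len = bmp_sizes(ch)
    print(f"U+{ord(ch):04X}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
# U+0041: utf-8=1 bytes, utf-16=2 bytes
# U+00E9: utf-8=2 bytes, utf-16=2 bytes
# U+20AC: utf-8=3 bytes, utf-16=2 bytes
```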

Supplementary characters lie outside the BMP. Their code point
values are between U+10000 and
U+10FFFF. Unicode support for supplementary
characters requires character sets that have a range beyond the
BMP and that therefore take more space per character than
BMP-only character sets.
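
To make the extra space concrete, here is a small sketch that compares one BMP character with one supplementary character. It relies on Python's own codecs rather than on MySQL. The function name `sizes` and the two sample characters are assumptions chosen for illustration.

```python
def sizes(ch):
    """Return the encoded size in bytes of one character in three encodings."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


bmp_char = "\u20ac"           # U+20AC EURO SIGN, inside the BMP
supplementary = "\U0001F600"  # U+1F600 GRINNING FACE, outside the BMP

print(sizes(bmp_char))       # {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
print(sizes(supplementary))  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}

# In UTF-16 the supplementary character is stored as a surrogate pair.
print(supplementary.encode("utf-16-be").hex())  # d83dde00
```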

MySQL supports these Unicode character sets:

utf8, a UTF-8 encoding of the Unicode
character set using one to three bytes per character.

ucs2, the UCS-2 encoding of the Unicode
character set using two bytes per character.

utf8mb4, a UTF-8 encoding of the Unicode
character set using one to four bytes per character.

utf16, the UTF-16 encoding for the
Unicode character set using two or four bytes per character.
Like ucs2 but with an extension for
supplementary characters.

utf16le, the UTF-16LE encoding for the
Unicode character set. Like utf16 but
little-endian rather than big-endian.

utf32, the UTF-32 encoding for the
Unicode character set using four bytes per character.

Characters outside the BMP compare as REPLACEMENT CHARACTER and
convert to '?' when converted to a Unicode
character set that supports only BMP characters
(utf8 or ucs2).
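
An application may want to detect such characters before they reach a BMP-only column. The sketch below only imitates the outcome described above (each character outside the BMP becomes '?'); it is not MySQL's conversion code, and the helper names `has_supplementary` and `to_bmp_only` are hypothetical.

```python
def has_supplementary(text):
    """Return True if any character in text lies outside the BMP."""
    return any(ord(ch) > 0xFFFF for ch in text)


def to_bmp_only(text):
    """Replace every character outside the BMP with '?'."""
    return "".join(ch if ord(ch) <= 0xFFFF else "?" for ch in text)


value = "caf\u00e9 \U0001F600"
if has_supplementary(value):
    # Shows what a BMP-only character set would be expected to keep.
    print(to_bmp_only(value))
```

A check like this lets an application reject or log affected values instead of silently losing them.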

If you use character sets that support supplementary characters
and thus are “wider” than the BMP-only
utf8 and ucs2 character
sets, there are potential incompatibility issues for your
applications; see Section 11.1.9.8, “Converting Between 3-Byte and 4-Byte Unicode Character Sets”.
That section also describes how to convert tables from
utf8 to the (4-byte)
utf8mb4 character set, and what constraints
may apply in doing so.
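
One consequence of the wider character set is simple arithmetic: the worst-case size of a column's character data grows from three to four bytes per character. The sketch below assumes a column length of 255 characters purely as an example, and `worst_case_bytes` is a hypothetical helper. It counts character data only, not any storage overhead.

```python
def worst_case_bytes(max_chars, max_bytes_per_char):
    """Upper bound on character data size for a column of max_chars characters."""
    return max_chars * max_bytes_per_char


column_chars = 255  # example column length, chosen arbitrarily

print(worst_case_bytes(column_chars, 3))  # utf8:    765
print(worst_case_bytes(column_chars, 4))  # utf8mb4: 1020
```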

A similar set of collations is available for most Unicode
character sets. For example, each has a Danish collation, the
names of which are ucs2_danish_ci,
utf16_danish_ci,
utf32_danish_ci,
utf8_danish_ci, and
utf8mb4_danish_ci. The exception is
utf16le, which has only two collations. For
information about Unicode collations and their differentiating
properties, including collation properties for supplementary
characters, see Section 11.1.10.1, “Unicode Character Sets”.

The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores
characters in big-endian byte order and does not use a byte
order mark (BOM) at the beginning of values. Other database
systems might use little-endian byte order or a BOM. In such
cases, conversion of values will need to be performed when
transferring data between those systems and MySQL. The
implementation of UTF-16LE is little-endian.
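
As a sketch of the kind of conversion meant here, the following Python function turns UTF-16 bytes that may carry a BOM or be little-endian into big-endian bytes without a BOM. The function name `to_big_endian_no_bom` and the `default` parameter are assumptions for illustration. In particular, treating input without a BOM as little-endian is a guess that must match the actual source system.

```python
import codecs


def to_big_endian_no_bom(data, default="utf-16-le"):
    """Re-encode UTF-16 bytes as big-endian UTF-16 without a BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")
    elif data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le")
    else:
        # No BOM: fall back to the assumed byte order of the source system.
        text = data.decode(default)
    return text.encode("utf-16-be")


# Little-endian "A" with a BOM becomes big-endian "A" without one.
print(to_big_endian_no_bom(b"\xff\xfeA\x00").hex())  # 0041
```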

If you run into trouble with a PHP-based web application, check the character set configuration of these components:

1) the MySQL database
2) php.ini
3) httpd.conf
4) your server

Posted by lorenz pressler on May 2, 2006

If you get data via PHP from your MySQL database (everything UTF-8) but still get '?' for some special characters in your browser (<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />), try this: