The Unicode Standard includes characters from the Basic
Multilingual Plane (BMP) and supplementary characters that lie
outside the BMP. This section describes support for Unicode in
MySQL. For information about the Unicode Standard itself, visit
the Unicode Consortium
website.

BMP characters have these characteristics:

Their code point values are between 0 and 65535 (or
U+0000 and U+FFFF).

They can be encoded in a variable-length encoding using 8, 16,
or 24 bits (1 to 3 bytes).

They can be encoded in a fixed-length encoding using 16 bits
(2 bytes).

They are sufficient for almost all characters in major
languages.

Supplementary characters lie outside the BMP:

Their code point values are between U+10000
and U+10FFFF.

Unicode support for supplementary characters requires
character sets whose range extends beyond the BMP. Such
character sets therefore take more space than BMP-only
character sets (up to 4 bytes per character).
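The code point ranges and storage sizes listed above can be checked outside MySQL. The following is a minimal Python sketch, not MySQL code; the helper names `is_bmp`, `is_supplementary`, and `encoded_sizes` are hypothetical and chosen only for illustration. It uses Python's built-in codecs, which are assumed here to follow the same Unicode encoding forms.

```python
def is_bmp(ch: str) -> bool:
    """True if the single character ch has a code point in U+0000..U+FFFF."""
    return ord(ch) <= 0xFFFF


def is_supplementary(ch: str) -> bool:
    """True if the single character ch has a code point in U+10000..U+10FFFF."""
    return 0x10000 <= ord(ch) <= 0x10FFFF


def encoded_sizes(ch: str) -> dict:
    """Bytes needed for ch in UTF-8, UTF-16, and UTF-32 (big-endian, no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # Illustrative samples: Latin letter, accented letter, CJK ideograph,
    # and one supplementary character (U+1F600).
    for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
        kind = "BMP" if is_bmp(ch) else "supplementary"
        print(f"U+{ord(ch):04X} {kind} {encoded_sizes(ch)}")
```

In this sketch, every BMP sample needs at most 3 bytes in UTF-8 and exactly 2 bytes in UTF-16, whereas the supplementary sample needs 4 bytes in both.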

The UTF-8 (Unicode Transformation Format with 8-bit units) method
for encoding Unicode data is implemented according to RFC 3629,
which describes encoding sequences that take from one to four
bytes. The idea of UTF-8 is that various Unicode characters are
encoded using byte sequences of different lengths, from one byte
for ASCII characters up to four bytes for supplementary characters.
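The sequence length in RFC 3629 depends only on the code point value. The sketch below is a hedged Python illustration of that rule, not MySQL's implementation; the function name `utf8_length` is hypothetical. It cross-checks the computed length against Python's own UTF-8 encoder.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes RFC 3629 uses to encode the given code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # RFC 3629 prohibits encoding the UTF-16 surrogate code points.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1  # ASCII: Basic Latin letters, digits, punctuation
    if code_point <= 0x7FF:
        return 2  # for example accented Latin, Greek, Cyrillic, Hebrew, Arabic
    if code_point <= 0xFFFF:
        return 3  # the rest of the BMP, including most CJK ideographs
    return 4      # supplementary characters


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x42F, 0x4E2D, 0x1F600):
        # The rule should agree with an actual UTF-8 encoder.
        assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```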

Characters outside the BMP compare as REPLACEMENT CHARACTER and
convert to '?' when converted to a Unicode
character set that supports only BMP characters
(utf8mb3 or ucs2).
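An application can detect such characters before sending data to a BMP-only column. The following Python sketch only mimics the '?' substitution described above; it is not how MySQL performs the conversion, and the helper names `find_supplementary` and `to_bmp_only` are hypothetical.

```python
def find_supplementary(text: str) -> list:
    """Return the characters in text whose code points lie outside the BMP."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def to_bmp_only(text: str, replacement: str = "?") -> str:
    """Replace every supplementary character with the replacement string.

    This imitates the result of converting to a BMP-only character set
    such as utf8mb3 or ucs2; it is an approximation for illustration.
    """
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


if __name__ == "__main__":
    sample = "a\U0001F600b"          # 'a', U+1F600, 'b'
    print(len(find_supplementary(sample)))
    print(to_bmp_only(sample))
```

A check like `find_supplementary` lets an application reject or log affected values instead of silently losing data.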

If you use character sets that support supplementary characters
and thus are “wider” than the BMP-only
utf8mb3 and ucs2 character
sets, there are potential incompatibility issues for your
applications; see Section 1.9.8, “Converting Between 3-Byte and 4-Byte Unicode Character Sets”.
That section also describes how to convert tables from the
(3-byte) utf8mb3 to the (4-byte)
utf8mb4, and what constraints may apply in
doing so.

A similar set of collations is available for most Unicode
character sets. For example, each has a Danish collation, the
names of which are utf8mb4_danish_ci,
utf8mb3_danish_ci,
utf8_danish_ci,
ucs2_danish_ci,
utf16_danish_ci, and
utf32_danish_ci. The exception is
utf16le, which has only two collations. For
information about Unicode collations and their differentiating
properties, including collation properties for supplementary
characters, see Section 1.10.1, “Unicode Character Sets”.

The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores
characters in big-endian byte order and does not use a byte order
mark (BOM) at the beginning of values. Other database systems
might use little-endian byte order or a BOM. In such cases,
values must be converted when transferring data between those
systems and MySQL. The implementation of
UTF-16LE is little-endian.
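The conversion mentioned above amounts to removing any BOM and re-encoding in big-endian order. The following is a minimal Python sketch of that step for UTF-16 data, under the assumption that the conversion happens in application code before the data reaches MySQL; the function name `normalize_utf16` and its `default_order` parameter are hypothetical.

```python
import codecs


def normalize_utf16(data: bytes, default_order: str = "le") -> bytes:
    """Return data re-encoded as big-endian UTF-16 without a BOM.

    If data starts with a BOM, the BOM decides the byte order and is
    removed. Otherwise default_order ("le" or "be") is assumed.
    """
    if default_order not in ("le", "be"):
        raise ValueError("default_order must be 'le' or 'be'")
    if data.startswith(codecs.BOM_UTF16_BE):
        text = data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    elif data.startswith(codecs.BOM_UTF16_LE):
        text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    else:
        text = data.decode("utf-16-" + default_order)
    return text.encode("utf-16-be")


if __name__ == "__main__":
    little_endian_with_bom = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
    print(normalize_utf16(little_endian_with_bom).hex())
```

The same approach applies to UTF-32 with the corresponding `utf-32-be` and `utf-32-le` codecs.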

MySQL uses no BOM for UTF-8 values.
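Some editors and export tools write a BOM at the start of UTF-8 files. Because MySQL does not use one, a leading BOM would presumably be treated as ordinary character data, so it may be worth removing before loading. The Python sketch below shows one way to do that in application code; the function name `strip_utf8_bom` is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) from data, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    raw = codecs.BOM_UTF8 + "abc".encode("utf-8")
    print(strip_utf8_bom(raw))
    # Python's "utf-8-sig" codec performs the same removal while decoding.
    print(raw.decode("utf-8-sig"))
```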

Client applications that communicate with the server using Unicode
should set the client character set accordingly; for example, by
issuing a SET NAMES 'utf8mb4' statement. Some
character sets cannot be used as the client character set.
Attempting to use them with SET NAMES or
SET CHARACTER SET produces an error. See
Impermissible Client Character Sets.

The following sections provide additional detail on the Unicode
character sets in MySQL.