For many years Americans have exchanged text using the ASCII character set;
since essentially all U.S. systems support ASCII,
this permits easy exchange of English text.
Unfortunately, ASCII is completely inadequate for handling the characters
of nearly all other languages.
For many years different countries have adopted different techniques for
exchanging text in different languages, making it difficult to exchange
data in an increasingly interconnected world.

More recently, ISO has developed ISO 10646,
the “Universal Multiple-Octet Coded Character Set (UCS)”.
UCS is a coded character set which
defines a single 31-bit value for each of the world’s characters.
The first 65536 characters of the UCS (which thus fit into 16 bits)
are termed the “Basic Multilingual Plane” (BMP),
and the BMP is intended to cover nearly all of today’s spoken languages.
The Unicode Consortium develops the Unicode standard, which builds on
the UCS and adds some additional conventions to aid interoperability.
Historically, Unicode and ISO 10646 were developed by competing groups,
but thankfully they realized that they needed to work together and they now
coordinate with each other.

If you’re writing new software that handles internationalized characters,
you should be using ISO 10646/Unicode as your basis for handling
international characters.
However, you may need to process older documents in various older
(language-specific) character sets, in which case, you need to ensure that
an untrusted user cannot control the setting of another document’s
character set (since this would significantly affect the document’s
interpretation).

Most software is not designed to handle 16-bit or 32-bit characters,
yet more than 8 bits were required to create a universal character set.
Therefore, a special format called UTF-8
was developed to encode these
potentially international
characters in a format more easily handled by existing programs and libraries.
UTF-8 is defined, among other places, in IETF RFC 3629 (updating RFC 2279),
so it’s a
well-defined standard that can be freely read and used.
UTF-8 is a variable-width encoding; characters numbered 0 to 0x7f (127)
encode to themselves as a single byte,
while characters with larger values are encoded into 2 to 4 (originally 6)
bytes of information (depending on their value).
The encoding has been specially designed to have the following
nice properties (this information is from the RFC and Linux utf-8 man page):

The classical US ASCII characters (0 to 0x7f) encode as themselves,
so files and strings which contain only 7-bit ASCII characters
have the same encoding under both ASCII and UTF-8.
This is fabulous for backward compatibility with the many existing
U.S. programs and data files.

All UCS characters beyond 0x7f are encoded as a multibyte
sequence consisting only of bytes in the range 0x80 to 0xf4
(0x80 to 0xfd under the original definition).
This means that no ASCII byte can appear as part of another
character. Many other encodings permit bytes such as an
embedded NIL inside a multibyte character, causing programs to fail.

It’s easy to convert between UTF-8 and the 2-byte or 4-byte
fixed-width representations of characters (called
UCS-2 and UCS-4 respectively).

The lexicographic sorting order of UCS-4 strings is preserved,
and the Boyer-Moore fast search algorithm can be used directly
with UTF-8 data.

All 2^31 possible UCS codes could be encoded under the original
UTF-8 definition (RFC 3629 later restricted the range to U+10FFFF).

The first byte of a multibyte sequence which represents
a single non-ASCII UCS character is always in the range
0xc2 to 0xf4 (0xc0 to 0xfd under the original definition)
and indicates how long this multibyte
sequence is. All further bytes in a multibyte sequence
are in the range 0x80 to 0xbf. This allows easy resynchronization:
if a byte is missing or corrupted, it’s always easy to skip forward or
back to the start of the “next” or “preceding” character.

In short, the UTF-8 transformation format is becoming a dominant method
for exchanging international text information because it can support all of the
world’s languages, yet it is backward compatible with U.S. ASCII files
as well as having other nice properties.
For many purposes I recommend its use, particularly when storing data
in a “text” file.

The reason to mention UTF-8 is that
some byte sequences are not legal UTF-8, and
this might be an exploitable security hole.
UTF-8 encoders are supposed to use the “shortest possible”
encoding, but naive decoders may accept encodings that are longer than
necessary.
Indeed, earlier standards permitted decoders to accept
“non-shortest form” encodings.
The problem here is that this means that potentially dangerous
input could be represented multiple ways, and thus might
defeat the security routines checking for dangerous inputs.
The RFC describes the problem this way:

Implementers of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences. It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.

A particularly subtle form of this attack could be carried out
against a parser which performs security-critical validity checks
against the UTF-8 encoded form of its input, but interprets certain
illegal octet sequences as characters. For example, a parser might
prohibit the NUL character when encoded as the single-octet sequence
00, but allow the illegal two-octet sequence C0 80 (illegal because
it’s longer than necessary) and interpret it
as a NUL character (00). Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.

Thus, when accepting UTF-8 input, you need to check if the input is
valid UTF-8.
Here is a list of all legal UTF-8 sequences; any character
sequence not matching this table is not a legal UTF-8 sequence.
This list is from
The Unicode Standard Version 7.0 - Core Specification (2014).
In the following table, the first column shows the various character
values being encoded into UTF-8.
The second column shows how those characters are encoded as binary values;
an “x” indicates where the data bits are placed (each either a 0 or 1), though
some bit patterns are not allowed because they’re not the shortest possible
encoding.
The last column shows the valid values each byte can have
(in hexadecimal); a “-” indicates a range of legal values (inclusive).
Thus, a program should check that every character matches one of the patterns
in the right-hand column.
Of course, just because a sequence is a legal UTF-8 sequence doesn’t
mean that you should accept it (you still need to do all your other
checking), but generally you should check any UTF-8 data for UTF-8 legality
before performing other checks.

Table 5-1. Legal UTF-8 Sequences

  UCS Code (Hex)    Binary UTF-8 Format                     Legal UTF-8 Values (Hex)
  00-7F             0xxxxxxx                                00-7F
  80-7FF            110xxxxx 10xxxxxx                       C2-DF 80-BF
  800-FFF           1110xxxx 10xxxxxx 10xxxxxx              E0 A0-BF 80-BF
  1000-CFFF         1110xxxx 10xxxxxx 10xxxxxx              E1-EC 80-BF 80-BF
  D000-D7FF         1110xxxx 10xxxxxx 10xxxxxx              ED 80-9F 80-BF
  E000-FFFF         1110xxxx 10xxxxxx 10xxxxxx              EE-EF 80-BF 80-BF
  10000-3FFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx     F0 90-BF 80-BF 80-BF
  40000-FFFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx     F1-F3 80-BF 80-BF 80-BF
  100000-10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx     F4 80-8F 80-BF 80-BF

As I noted earlier, there are two standards for character sets,
ISO 10646 and Unicode, which have agreed to synchronize their
character assignments.
The earlier definitions of UTF-8 in ISO/IEC 10646-1:2000 and IETF RFC 2279
also supported
five- and six-byte sequences to encode characters beyond U+10FFFF,
but such values can’t be used to represent Unicode characters.
IETF RFC 3629 modified the UTF-8 definition, and one of the changes
was to make any encoding longer than 4 bytes illegal
(i.e., characters must be between U+0000 and U+10FFFF inclusive).
Thus, the five- and six-byte UTF-8 encodings for characters
beyond U+10FFFF aren’t legal any more,
and you should normally reject them (unless you have a special purpose
for them).

This set of valid values is tricky to determine, and in fact
earlier versions of this document got some entries
wrong (in some cases they permitted overlong encodings).
Language developers should include a function in their libraries
to check for valid UTF-8 values, simply because it’s so hard to get right.

I should note that in some cases, you might want to cut some slack for (or
use internally) the two-byte sequence C0 80. This is an overlong sequence
that, if permitted, represents ASCII NUL (NIL). Since C and C++
have trouble including a NIL character in an ordinary string,
some people have taken
to using this sequence when they want to represent NIL as part of the
data stream; Java even enshrines the practice (as “modified UTF-8”).
Feel free to use C0 80 internally while processing data, but technically
you really should translate it back to 00 before saving the data.
Depending on your needs, you might decide to be “sloppy” and accept
C0 80 as input in a UTF-8 data stream.
If it doesn’t harm security, accepting this
sequence is probably a good practice, since doing so aids interoperability.

Handling this can be tricky.
You might want to examine the C routines developed by Unicode to
handle conversions, available at
ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c.
It’s unclear to me if these routines are open source software (the
licenses don’t clearly say whether or not they can be modified), so
beware of that.

This section has discussed UTF-8, because it’s the most popular
multibyte encoding of UCS, simplifying a lot of international text
handling issues.
However, it’s certainly not the only encoding; there are other encodings,
such as UTF-16 and UTF-7, which have the same kinds of issues and
must be validated for the same reasons.

Another issue is that some phrases can be expressed in more than one
way in ISO 10646/Unicode.
For example, some accented characters can be represented as a single
character (with the accent) and also as a set of characters
(e.g., the base character plus a separate composing accent).
These two forms may appear identical.
There’s also a zero-width space that could be inserted, with the
result that apparently-similar items are considered different.
Beware of situations where such hidden text could interfere with the program.

This is an issue that in general is hard to solve; most programs don’t
have such tight control over the clients that they know completely how
a particular sequence will be displayed (since this depends on the
client’s font, display characteristics, locale, and so on).
One approach is to require clients to send data in a normalized form,
and if you don’t trust the clients, force their data into that form.
The W3C recommends
Normalization Form C in their draft document
Character Model for the World Wide Web.
Normalization Form C is a good approach, because it’s what nearly all
programs produce anyway, and it’s slightly more space-efficient.
See the W3C document for more information.