Summary

Because Unicode contains such a large number of characters and
incorporates the varied writing systems of the world, incorrect
usage can expose programs or systems to possible security attacks.
This document specifies mechanisms that can be used to detect
possible security problems.

Status

This is a draft document
which may be updated, replaced, or superseded by other documents at
any time. Publication does not imply endorsement by the Unicode
Consortium. This is not a stable document; it is inappropriate to
cite this document as other than a work in progress.

A Unicode Technical Standard (UTS) is an independent
specification. Conformance to the Unicode Standard does not imply
conformance to any UTS.

Please submit corrigenda and other comments with the online
reporting form [Feedback]. Related
information that is useful in understanding this document is found
in References. For the latest version of
the Unicode Standard see [Unicode]. For a
list of current Unicode Technical Reports see [Reports].
For more information about versions of the Unicode Standard, see [Versions].

Unicode Technical Report #36, "Unicode Security
Considerations" [UTR36] provides guidelines
for detecting and avoiding security problems connected with the use
of Unicode. This document specifies mechanisms that are used in that
document, and can be used elsewhere. Readers should be familiar with
[UTR36] before continuing. See
also the Unicode FAQ on Security Issues [FAQSec].

Identifiers are special-purpose strings used for
identification—strings that are deliberately limited to particular
repertoires for that purpose. Exclusion of characters from
identifiers does not affect the general use of those characters, such
as within documents. Unicode Standard Annex #31,
"Identifier and Pattern Syntax" [UAX31]
provides a recommended method of determining which strings should
qualify as identifiers. The UAX #31 specification extends the common
practice of defining identifiers in terms of letters and numbers to
the Unicode repertoire.

That specification also permits other protocols to use that method as
a base, and to define a profile that adds or removes
characters. For example, identifiers for specific programming
languages typically add some characters like "$", and
remove others like "-" (because of its use as minus),
while IDNA removes "_" (among others)—see Unicode
Technical Standard #46, "Unicode IDNA Compatibility
Processing" [UTS46], as well as [IDNA2003] and [IDNA2008].

This document provides for additional identifier profiles for
environments where security is an issue. These are profiles of the
extended identifiers based on properties and specifications of the
Unicode Standard [Unicode], including:

The XID_Start and XID_Continue properties defined in the
Unicode Character Database (see [DCore])

The toCasefold(X) operation defined in Chapter
3, Conformance of [Unicode]

The NFKC and NFKD normalizations defined in Chapter
3, Conformance of [Unicode]

The data files used in defining these profiles follow the UCD File
Format, which has a semicolon-delimited list of data fields
associated with given characters, with each field referenced by
number. For more details, see [UCDFormat].
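As an illustration, lines in this format can be handled by a short parser like the following sketch; the example line and field values are hypothetical, not taken from the actual data files:

```python
def parse_ucd_line(line):
    """Parse one line of a UCD-format data file into (codepoints, fields).

    Returns None for blank or comment-only lines. The first field may be
    a single code point ("0041") or a range ("0041..005A"), in hex.
    """
    line = line.split('#', 1)[0].strip()     # drop the trailing comment
    if not line:
        return None
    fields = [f.strip() for f in line.split(';')]
    cp = fields[0]
    if '..' in cp:
        start, end = cp.split('..')
        codepoints = range(int(start, 16), int(end, 16) + 1)
    else:
        codepoints = range(int(cp, 16), int(cp, 16) + 1)
    return codepoints, fields[1:]
```

For example, parse_ucd_line("0041..0043 ; Allowed ; Recommended # Latin letters") yields the code points U+0041..U+0043 and the field list ["Allowed", "Recommended"].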

The file [idmod] provides data for a profile of
identifiers in environments where security is at issue. The file
contains a set of characters recommended to be restricted from use.
It also contains a small set of characters that are recommended as
additions to the list of characters defined by the XID_Start and
XID_Continue properties, because they may be used in identifiers in a
broader context than programming identifiers.

The restricted characters are characters not in common use, and
are removed to further reduce the possibilities for visual confusion.
They include the following:

characters not in modern use

characters only used in specialized fields, such as
liturgical characters, phonetic letters, and mathematical
letter-like symbols

characters in limited use by very small communities

The principle has been to be more conservative initially, allowing
for the set to be modified in the future as requirements for
characters are refined. For information on handling modifications
over time, see Section 2.9.1, Backward
Compatibility in Unicode Technical Report #36, "Unicode
Security Considerations" [UTR36].

An implementation following the General Security Profile does not
permit restricted characters, unless it documents the
additional characters that it does allow. Common candidates for such
additions include characters for scripts listed in Table
6, Aspirational Use Scripts and Table
7, Limited Use Scripts of [UAX31]. However,
characters from these scripts have not been examined for confusables
or to determine specialized, non-modern, or limited-use characters.

The distinctions among the Type values are not strict; if there are multiple
Types for restricting a character, only one is given. The important
characteristic is the Status: whether or not the character is
restricted. As more information is gathered about
characters, this data may change in successive versions. That can
cause either the Status or Type to change for a particular character.
Thus users of this data should be prepared for changes in successive
versions, such as by having a grandfathering policy in place for
previously supported characters or registrations.

Restricted characters should be treated with caution in registration,
and disallowed unless there is good reason to allow them in the
environment in question. However, the set of Status=allowed characters
is not typically used as-is by implementations. Instead, it is applied as a filter to the set of characters C that are supported by the identifier syntax, generating a new set C′. Typically there are also particular characters or classes of characters from C that are retained as Exception characters.

C′ = (C ∩ {Status=allowed}) ∪ Exceptions
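In code, C′ is a straightforward set computation. The following Python sketch uses small hypothetical sets; in practice the {Status=allowed} set would be read from the data file:

```python
# Hypothetical character sets, for illustration only.
supported = set("abcdefghijklmnopqrstuvwxyz$-.·")   # C: the identifier syntax
status_allowed = set("abcdefghijklmnopqrstuvwxyz")  # {Status=allowed} from the data
exceptions = set("$-.")                             # implementation-specific

# C' = (C ∩ {Status=allowed}) ∪ Exceptions
c_prime = (supported & status_allowed) | exceptions
```

Here U+00B7 ( · ) is filtered out because it is not in the allowed set, while "$", "-", and "." survive as Exception characters even though they are not Status=allowed.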

The implementation may simply restrict use of new identifiers to C′, or may apply some other strategy. For example, there might be an appeal process for registrations of identifiers that contain characters outside of C′ (but still inside of C), or in user interfaces for lookup of identifiers,
warnings of some kind may be appropriate. For more information, see [UTR36].

The Exception characters would be implementation-specific. For example, a particular implementation might extend the default
Unicode identifier syntax by adding Exception characters with the Unicode
property XID_Continue=False, such as “$”,
“-”, and “.”. Those characters are specific to
that identifier syntax, and would be retained even though
they are not in the Status=allowed set. Some implementations may also wish to add Exception characters from the notes on MidLetter in [UAX29], or may wish to add the [CLDR] exemplar characters for particular supported languages with unusual characters.

The implementation may also apply other restrictions discussed in this document, such as checking for confusable characters or doing mixed-script detection.

This list is also used in deriving the IDN Identifiers list
given below. It is, however, designed to be applied to general identifiers.

Version 1 of this document defined operations and data that apply to
[IDNA2003], which has been superseded by [IDNA2008] and Unicode Technical
Standard #46, "Unicode IDNA Compatibility
Processing" [UTS46]. The identifier
modification data can be applied to whichever specification of IDNA
is being used. For more information, see the [IDN FAQ].

The tables in the data file [confusables]
provide a mechanism for determining when two strings are visually
confusable. The data in these files may be refined and extended over
time. For information on handling modifications over time, see Section 2.9.1, Backward Compatibility in Unicode Technical Report #36,
"Unicode Security Considerations" [UTR36].

The data is organized into four different tables, depending on
the desired parameters. Each table provides a mapping from source
characters to target strings. On the basis of this data, there are
three main classes of confusable strings:

Definitions

X and Y are single-script confusables if they are confusable
according to the Single-Script table, and each of them is a single
script string according to Section 5, Mixed-Script Detection, and it is the same script for each.
Examples: "so̷s" and "søs" in Latin, where the first word
has the character "o" followed by the character U+0337 ( ̷ )
COMBINING SHORT SOLIDUS OVERLAY.

X and Y are mixed-script confusables if they are confusable
according to the Mixed-Script table, and they are not single-script
confusables. Examples: "paypal" and "paypal",
where the second word has the character U+0430 ( а )
CYRILLIC SMALL LETTER A.

X and Y are whole-script confusables if they are mixed-script
confusables, and each of them is a single script string.
Example: "scope" in Latin and "scope" in
Cyrillic.

To see whether two strings X and Y are confusable according to a
given table (abbreviated as X ≅ Y), an implementation
uses a transform of X called skeleton(X), defined by:

Converting X to NFD format.

Successively mapping each source character in X to the
target string according to the specified data table.

Reapplying NFD.

The resulting strings skeleton(X) and skeleton(Y) are
then compared. If they are identical (codepoint-for-codepoint), then
X ≅ Y according to the table.

Note: The strings skeleton(X) and skeleton(Y)
are not intended for display, storage or transmission.
They should be thought of as an intermediate processing form,
similar to a hashcode. The characters in skeleton(X) and skeleton(Y)
are not guaranteed to be identifier characters.

Implementations do not have to recursively apply the mappings,
because the transforms are idempotent. That is,

skeleton(skeleton(X)) = skeleton(X)

This mechanism imposes transitivity on the data, so if X ≅ Y and Y ≅
Z, then X ≅ Z. It is possible to provide a more sophisticated
confusable detection, by providing a metric between given characters,
indicating their "closeness." However, that is
computationally much more expensive, and requires more sophisticated
data, so at this point in time the simpler mechanism has been chosen.
That means that in some cases the test may be overly inclusive.
However, the frequency of such cases in real data should be small.
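The skeleton transform and comparison can be sketched as follows; the mapping table here is a tiny hand-made stand-in for the [confusables] data, not the real table:

```python
import unicodedata

# Tiny stand-in for a confusables data table: source char -> target string.
CONFUSABLE_MAP = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I -> i (via transitivity)
}

def skeleton(s):
    """skeleton(X): convert to NFD, map each character, reapply NFD."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(CONFUSABLE_MAP.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)

def confusable(x, y):
    """X ≅ Y iff skeleton(X) and skeleton(Y) are codepoint-identical."""
    return skeleton(x) == skeleton(y)
```

With this table, confusable("paypal", "pаypаl") is True (the second string uses Cyrillic а), and skeleton is idempotent: skeleton(skeleton(x)) == skeleton(x).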

Each line in the data file has the following format: Field 1 is the
source, Field 2 is the target, and Field 3 is a type identifying the
table. For example,

The types are explained in Table 2, Confusable Data Table Types.
The comments provide the character names. If the data was derived via
transitivity, there is an extra comment at the end. For instance, in
the above example the derivation was:

U+309A ( ゚ ) COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND
MARK →

U+FF9F ( ﾟ ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK →

U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK →

U+030A ( ̊ ) COMBINING RING ABOVE

To reduce security risks, it is advised that identifiers use
casefolded forms, thus eliminating uppercase variants where possible.
Characters with the script values COMMON or INHERITED are ignored
when testing for differences in script.

This table is used to test cases of
single-script confusables, where the output allows for mixed case
(which may be later folded away). For example, this table contains
the following entry not found in SL:

This table is used to test cases of
mixed-script and whole-script confusables, where both the source
character and the target string are case folded. For example, this
table contains the following entry not found in SL or SA:

This table is used to test cases of
mixed-script and whole-script confusables, where the output allows
for mixed case (which may be later folded away). For example, this
table contains the following entry not found in SL, SA, or ML:

Data is also provided for testing whether a string X has
any whole-script confusable, using the file [confusablesWS].
This file consists of a list of lines of the form:

<range>; <sourceScript>; <targetScript>; <type> #comment

The types are either L for lowercase-only, or A for any-case,
where the any-case ranges are broader (including uppercase and
lowercase characters). If the string is only lowercase, use the
lowercase-only table. Otherwise, first test according to the any-case
table, then casefold the string and test according to the
lowercase-only table.
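The choice between the two tables can be sketched as follows; the table objects and the test callable are hypothetical placeholders for data and logic defined elsewhere, and "only lowercase" is approximated by comparing the string with its lowercased form:

```python
def check_with_case_tables(s, lower_table, any_table, test):
    """Apply the lowercase-only (L) and any-case (A) tables as described.

    `test` is a callable test(string, table) -> bool; the tables are
    hypothetical stand-ins for data loaded from [confusablesWS].
    """
    if s == s.lower():                         # string is only lowercase
        return test(s, lower_table)
    if test(s, any_table):                     # first try the any-case table
        return True
    return test(s.casefold(), lower_table)     # then casefold and retest
```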

In using the data, all lines with the same sourceScript and targetScript
are collected together to form a set of Unicode characters, after
filtering to the allowed characters from Section
3.1, General Security
Profile for Identifiers. Logically, the file is a set of tuples of the form <sourceScript,
unicodeSet, targetScript>. For example, the following lines are
present for Latin to Cyrillic:

They logically form a tuple <Latin, [a c-e ... \u0292],
Cyrillic>, which indicates that a Latin string containing
characters only from that Unicode set can have a whole-script
confusable in Cyrillic (lowercase-only). Note that if the
implementation needs a set of allowed characters
that is different from those in Section 3.1, General Security Profile for
Identifiers, this process needs to be used to generate a different
set of data.

To test whether a single-script string givenString has a
whole-script confusable in targetScript, the following process
is used:

Let givenSet be the set of characters in givenString, after
removing all [:script=common:] and [:script=inherited:] characters.

Let givenScript be the script of the characters in givenSet
(if there is more than one script, fail with an error).

See if there is a tuple <sourceScript, unicodeSet,
targetScript> where

sourceScript = givenScript

unicodeSet ⊇ givenSet

If so, then there is
a whole-script confusable in targetScript.

The test is actually slightly broader than a whole-script
confusable test. It tests whether the given string has a whole-script
confusable string in another script, possibly with the addition or
removal of common/inherited characters such as numbers and combining
marks to both strings. In practice, however, this
broadening has no significant impact.
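The lookup over these tuples can be sketched in Python; the tuples and the name-based script classification below are tiny illustrative stand-ins for the real [confusablesWS] data and the UCD Script property:

```python
import unicodedata

# Tiny stand-in for the <sourceScript, unicodeSet, targetScript> tuples.
WS_TUPLES = [
    ("Latin", set("acdeijopsxy"), "Cyrillic"),
    ("Cyrillic", set("\u0430\u0441\u0435\u043e\u0440"), "Latin"),  # а с е о р
]

def script_of(ch):
    """Crude script approximation from the character name; not the UCD property."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC", "GREEK"):
        if name.startswith(script):
            return script.capitalize()
    return "Common"   # treat everything else as ignorable here

def has_whole_script_confusable(given, target_script):
    """Test a single-script string against the tuples, per the steps above."""
    given_set = {ch for ch in given if script_of(ch) not in ("Common", "Inherited")}
    scripts = {script_of(ch) for ch in given_set}
    if len(scripts) != 1:
        raise ValueError("not a single-script string")
    (given_script,) = scripts
    # A per-script fastReject set could be consulted here first (not shown).
    return any(src == given_script and tgt == target_script and unicode_set >= given_set
               for src, unicode_set, tgt in WS_TUPLES)
```

With these stand-in tuples, "scope" in Latin has a whole-script confusable in Cyrillic, but "toys" does not, because "t" is outside the Latin-to-Cyrillic set.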

Implementations would normally read the data into appropriate data
structures in memory for processing. A quick additional optimization
is to keep, for each script, a fastReject set, containing
characters in the script contained in none of the unicodeSet
values.

The following Java sample shows how this can be done (using the Java
version of [ICU]):

The data in [confusablesWS] is built
using the data in [confusables], and
subject to the same caveat: the data in these files may be
refined and extended over time. For information on handling that, see
Section 2.9.1, Backward Compatibility
of [UTR36].

For each script found in the given string, see if all the
characters in the string outside of that script have whole-script
confusables for that script (according to Section 4.1, Whole-Script Confusables).

Example 1: "pаypаl", with Cyrillic "а"s.

There are two scripts, Latin and Cyrillic. The set of
Cyrillic characters {a} has a whole-script confusable in
Latin. Thus the string is a mixed-script confusable.

Example 2: "toys-я-us", with one
Cyrillic character "я".

The set of Cyrillic characters {я} does not
have a whole-script confusable in Latin (there is no Latin character
that looks like "я"), nor does the
set of Latin characters {o s t u y} have a whole-script confusable
in Cyrillic (there is no Cyrillic character that looks like
"t" or "u"). Thus this string is not a
mixed-script confusable.

Example 3: "1iνе", with a Greek "ν" and
Cyrillic "е".

There are three scripts, Latin, Greek, and Cyrillic. The set
of Cyrillic characters {е} and the set of Greek characters {ν} each
have a whole-script confusable in Latin. Thus the string is a
mixed-script confusable.

The Unicode Standard supplies information that can be used for
determining the script of characters and detecting mixed-script text.
The determination of script is according to Unicode Standard
Annex #24, "Unicode Script Property" [UAX24],
using data from the Unicode Character Database [UCD].
For a given input string, the logical process is the
following:

Define a set of sets of scripts SOSS.

For each character in the string:

Use the Script_Extensions property to find the set of
scripts that the character has.

Remove Common and Inherited from that set of scripts.

If the result is not empty, add that set to SOSS.

If no single script is common to all of the sets in SOSS, then
the string contains mixed scripts.

Characters with the script values Common and Inherited
are ignored, because they are used with more than one script. For
example, "abc-def" counts as a single script Latin because
the script of "-" is ignored.
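The steps above can be sketched as follows; because the standard library lacks the Script_Extensions property, this sketch approximates each character's script set as a singleton derived from its character name, a crude stand-in for the real UCD data:

```python
import unicodedata

def script_set(ch):
    """Approximate Script_Extensions as a singleton set from the character name.

    Real implementations should use the UCD Script_Extensions data; this
    name-based heuristic is for illustration only.
    """
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC", "GREEK"):
        if name.startswith(script):
            return {script.capitalize()}
    return set()   # treated as Common/Inherited: ignored

def is_mixed_script(s):
    soss = []                        # the set of sets of scripts, SOSS
    for ch in s:
        scripts = script_set(ch)     # Common and Inherited already removed
        if scripts:
            soss.append(scripts)
    if not soss:
        return False
    common = set.intersection(*soss)
    return not common                # mixed iff no script is common to all sets
```

As in the text, "abc-def" counts as single-script Latin because "-" is ignored, while "pаypal" with a Cyrillic а is detected as mixed script.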

A set of scripts S is said to cover a SOSS if S
intersects each element of SOSS. For example, {Latin, Greek} covers
{{Latin, Georgian}, {Greek, Cyrillic}}, because:

The actual implementation of this algorithm can be optimized; as
usual, the specification only depends on the results. The following
Java sample using [ICU] shows how the above
process can be implemented:

Restriction Levels 1-5 are defined here for use in implementations.
These place restrictions on the use of identifiers according to the
appropriate Identifier Profile as specified in Section 3, Identifier
Characters. The lists of Recommended and Aspirational scripts are
taken from Table
5, Recommended Scripts and Table
6, Aspirational Use Scripts of [UAX31]. For
more information on the use of Restriction Levels, see Section
2.9 Restriction Levels and Alerts in [UTR36].

Whenever scripts are tested for in the following definitions, characters with Script_Extensions=Common and Script_Extensions=Inherited are ignored.

There are three different types of numbers in Unicode. Only numbers
with General_Category = Decimal_Number (Nd) should be allowed in
identifiers. However, characters from different decimal number
systems can be easily confused. For example, U+0660 ( ٠ )
ARABIC-INDIC DIGIT ZERO can be confused with U+06F0 ( ۰ )
EXTENDED ARABIC-INDIC DIGIT ZERO, and U+09EA ( ৪ )
BENGALI DIGIT FOUR can be confused with U+0038 ( 8 )
DIGIT EIGHT.

For a given input string which does not contain non-decimal
numbers, the logical process of detecting mixed numbers is the
following:

For each character in the string:

Find the decimal number value for that character, if any.

Map the value to the unique zero character for that number
system.

If there is more than one such zero character, then the string
contains multiple decimal number systems.
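This process can be sketched directly with Python's unicodedata module: unicodedata.digit returns the decimal value of a character, and because the digits of each decimal number system are encoded contiguously, subtracting that value yields the system's zero character:

```python
import unicodedata

def decimal_zero_chars(s):
    """Return the set of zero characters for the decimal digits in s.

    More than one element means the string mixes decimal number systems.
    Assumes the string contains no non-decimal numbers, per the text above.
    """
    zeros = set()
    for ch in s:
        value = unicodedata.digit(ch, None)    # decimal value, if any
        if value is not None:
            zeros.add(chr(ord(ch) - value))    # map to that system's zero
    return zeros
```

For example, "123" yields {"0"}, while "1٦" (mixing ASCII and Arabic-Indic digits) yields two zero characters, flagging multiple decimal number systems.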

The actual implementation of this algorithm can be optimized; as
usual, the specification only depends on the results. The following
Java sample using [ICU] shows how this can be done:

There are additional enhancements that may be useful in spoof
detection, including the following:

Mark strings as "mixed script," where they contain both
simplified-only (S) and traditional-only (T) Chinese characters, using the
Unihan data in the Unicode Character Database [UCD].

The test can only be applied if the characters are meant to be Chinese. So “写真だけの結婚式” is Japanese, and shouldn't be subject to this test.

The test for S vs T needs to be not whether the character has a T or S variant, but whether the character is an S or T variant.

Forbid sequences of the same nonspacing mark

Check to see that all the characters are in the sets of
exemplar characters for at least one language in the Unicode Common
Locale Data Repository [CLDR].

As discussed in Unicode Technical
Report #36, "Unicode Security Considerations" [UTR36], confusability among characters cannot be
an exact science. There are many factors that make confusability a
matter of degree:

Shapes of characters vary greatly among fonts used to
represent them. The Unicode Standard uses representative glyphs in
the code charts, but font designers are free to create their own
glyphs. Because fonts can easily be created using an arbitrary glyph
to represent any Unicode code point, character confusability with
arbitrary fonts can never be avoided. For example, one could design
a font where the ‘a’ looks like a ‘b’, ‘c’ like a ‘d’, and so on.

Writing systems using contextual shaping (such as Arabic,
and many South Asian systems) introduce even more variation in text
rendering. Characters do not really have an abstract shape in
isolation and are only rendered as part of a cluster of characters
making words, expressions, and sentences. It is a fairly common
occurrence to find the same visual text representation corresponding
to very different logical words that can only be recognized by
context, if at all.

Font style variants such as italics may introduce a
confusability which does not exist in another style. For example, in
the Cyrillic script, U+0442 ( т )
CYRILLIC SMALL LETTER TE looks like a small caps Latin ‘T’ in normal
style, while it looks like a small Latin ‘m’ in italic style.

In-script confusability is extremely user-dependent. For example, in
the Latin script, characters with accents or appendices may look
similar to the unadorned characters for some users, especially if
they are not familiar with their meaning in a particular language.
However, most users will have at least a minimum understanding of the
range of characters in their own script, and there are separate
mechanisms available to deal with other scripts, as discussed in [UTR36].

As described elsewhere, there are cases where the confusable data may be
different than expected. Sometimes this is because two characters or
two strings may only be confusable in some fonts. In other cases, it
is because of transitivity. For example, the dotless and dotted I are
considered equivalent (ı ↔ i), because they look the same when
accents such as an acute are applied to each. However, for
practical implementation usage, transitivity is sufficiently
important that some oddities are accepted.

The data may be enhanced in future versions of this
specification. For information on handling changes in data over
time, see Section 2.9.1, Backward Compatibility of [UTR36].

The confusability tables were created by collecting a number of
prospective confusables, examining those confusables according to a
set of common fonts, and processing the result for transitive
closure.

The prospective confusables were gathered from a number of
sources. Erik van der Poel contributed a list derived from running a
program over a large number of fonts to catch characters that shared
identical glyphs within a font, and Mark Davis did the same more
recently for fonts on Windows and the Macintosh. Volunteers from
Google, IBM, Microsoft and other companies gathered other lists of
characters. These included native speakers for languages with
different writing systems. The Unicode compatibility mappings were
also used as a source. The process of gathering visual confusables is
ongoing: the Unicode Consortium welcomes submission of additional
mappings. The complex scripts of South and Southeast Asia need
special attention. The focus is on characters that can be in the
recommended profile for identifiers, because they are of most
concern.

The fonts used to assess the confusables included those used by
the major operating systems in user interfaces. In addition, the
representative glyphs used in the Unicode Standard were also
considered. Fonts used for the user interface in operating systems
are an important source, because they are the ones that will usually
be seen by users in circumstances where confusability is important,
such as when using IRIs (Internationalized Resource Identifiers)
and their sub-elements (such as domain names). These fonts have a
number of other relevant characteristics:

They rarely change in updates to operating systems and
applications; changes brought by system upgrades tend to be gradual
to avoid usability disruption.

Because user interface elements need to be legible at low
screen resolution (implying a low number of pixels per EM), fonts
used in these contexts tend to be designed in sans-serif style,
which has the tendency to increase the possibility of confusables.
There are, however, some languages such as Chinese where a serif
style is in common use.

Strict bounding box requirements create even more
constraints for scripts which use relatively large ascenders and
descenders. This also limits space allocated for accent or tone
marks, and can also create more opportunities for confusability.

Pairs of prospective confusables were removed if they were always
visually distinct at common sizes, both within and across fonts. The
data was then closed under transitivity, so that if X≅Y and Y≅Z, then
X≅Z. In addition, the data was closed under substring operations, so
that if X≅Y then AXB≅AYB. It was then processed to produce the
in-script and cross-script tables, so that a single table can be used
to map an input string to a resulting skeleton.
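The transitive-closure step can be sketched with a union-find over prospective confusable pairs; each equivalence class then maps to a single representative target, so one table suffices. The pairs below are illustrative, not the actual source data:

```python
def close_confusables(pairs):
    """Union-find closure: from confusable pairs, build char -> canonical target.

    If X≅Y and Y≅Z appear in `pairs`, the result maps X, Y, and Z to one
    representative, making the mapping transitive and idempotent.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two classes

    return {ch: find(ch) for ch in parent}

# Illustrative pairs: dotless i ≅ i, i ≅ Cyrillic і, Cyrillic а ≅ Latin a.
mapping = close_confusables([("\u0131", "i"), ("i", "\u0456"), ("\u0430", "a")])
```

Because the closure is transitive, dotless ı and Cyrillic і end up in the same class even though no direct pair relates them.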

A skeleton is intended only for internal use for testing
confusability of strings; the resulting text is not suitable for
display to users, because it will appear to be a hodgepodge of
different scripts. In particular, the result of mapping an identifier
will not necessarily be an identifier. Thus the confusability mappings
can be used to test whether two identifiers are confusable (if their
skeletons are the same), but should definitely not be used as a
"normalization" of identifiers.

The idmod data is gathered in the following way. The basic assignments are first derived as follows, based on UCD character properties and on information in [UAX31], selecting the first condition that matches.

When the Script_Extensions property for a character contains multiple Script property values, the least-restricted Script from the set is used for the derivation. What that means is that if any one of the Script property values is in Table 5, then idmod is set to recommended. Otherwise, if any one of the Script property values is in Table 7 or Table 6, then idmod is set to limited-use. Otherwise, idmod is set to historic, as per Table 4. The script information in Table 5, Table 7, Table 6, and Table 4 is in machine-readable form in CLDR, as scriptMetadata.txt. Table 4 also has some conditions that are not dependent on script; they are irrelevant for this derivation.

If the resulting assignment for idmod is any of the Type values except for the first four, it may then be overridden if necessary, based on information from various sources, including the core specification of the Unicode Standard, annotations in the code charts, information regarding CLDR exemplar characters, and external feedback.

The following files provide data used to implement the
recommendations in this document. The data may be refined in future
versions of this specification. For more information,
see Section 2.9.1, Backward Compatibility of [UTR36].

The Unicode Consortium welcomes feedback on additional
confusables or identifier restrictions. There are online forms at [Feedback] where you can suggest additional
characters or corrections.

A summary view of the
confusables: groups each set of confusables together, listing them
first on a line starting with #, then individually with names and
code points. See Section 4, Confusable Detection.

Mark Davis and Michel Suignard authored the bulk of the
text, under direction from the Unicode Technical Committee. Steven
Loomis and other people on the ICU team were very helpful in
developing the original proposal for this technical report. Thanks
also to the following people for their feedback or contributions to
this document or earlier versions of it, or to the source data for confusables or idmod: Julie Allen, Andrew Arnold, David Corbett, Douglas Davidson, Chris Fynn, Martin Dürst, Asmus Freytag, Deborah
Goldsmith, Paul Hoffman, Denis Jacquerye, Cibu Johny, Patrick L.
Jones, Peter Karlsson, Mike Kaplinskiy, Gervase Markham, Eric Muller, Erik van der
Poel, Michael van Riper, Marcos Sanz, Alexander Savenkov, Dominikus
Scherkl, Chris Weber, and Kenneth Whistler. Thanks to Peter Peng for
his assistance with font confusables.

Versions of the Unicode Standard
http://www.unicode.org/standard/versions/
For information on version numbering, and citing and referencing the
Unicode Standard, the Unicode Character Database, and Unicode
Technical Reports.

Updated Highly Restrictive to allow non-ASCII Latin in the
combinations with CJK scripts.

Updated Minimally Restrictive to focus on Recommended and
Aspirational scripts, since we have little information about other
scripts. Limited-Use and Exclusion scripts are still permitted at
the Highly Restrictive level (depending on the identifier
profile), but not in combination with Latin.