Abstract

This document specifies rules for deciding whether a code point, considered in isolation or in context, is a candidate for inclusion in an Internationalized Domain Name (IDN).

It is part of the specification of Internationalizing Domain Names in Applications 2008 (IDNA2008).

Status of This Memo

This is an Internet Standards Track document.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

RFC 4690 [RFC4690] suggests an inclusion-based approach for selecting the code points from The Unicode Standard [Unicode52] that should be included in the list of code points that may be used in Internationalized Domain Names.

The IAB has concluded that there is a consensus within the broader community that lists of code points should be specified by the use of an inclusion-based mechanism (i.e., identifying the characters that are permitted), rather than by excluding a small number of characters from the total Unicode set as Stringprep [RFC3454] and Nameprep [RFC3491] do today. That conclusion should be reviewed by the IETF community and action taken as appropriate.

This document reviews and classifies the collections of code points in the Unicode character set by examining various properties of the code points. It then defines an algorithm for determining a derived property value. It specifies a procedure, rather than a table of code points, so that the algorithm can be used to determine code point sets independent of the version of Unicode that is in use.

This document is not intended to specify precisely how these property values are to be applied in IDN labels. That information appears in the Protocol document [RFC5891], but it is important to understand that the assignment of a value of this property to a particular character is not sufficient to determine whether it can be used in a given label. In particular, some combinations of allowed code points are not advisable for use in IDNs due to rules specific to a script or class of characters. The requirement for such rules is linked to the operations in the Protocol document and especially to the characters designated as requiring contextual rules.

The value of the property is to be interpreted as follows.

o PROTOCOL VALID: Those that are allowed to be used in IDNs. Code points with this property value are permitted for general use in IDNs. However, the fact that a label consists only of code points that have this property value does not imply that the label can be used in DNS. See the Protocol document for algorithms to make decisions about labels in domain names. The abbreviated term PVALID is used to refer to this value in the rest of this document.

Faltstrom Standards Track [Page 3]

RFC 5892 IDNA Code Points August 2010

o CONTEXTUAL RULE REQUIRED: Some characteristics of the character, such as it being invisible in certain contexts or problematic in others, require that it not be used in labels unless specific other characters or properties are present. The abbreviated term CONTEXT is used to refer to this value in the rest of this document. There are two subdivisions of CONTEXTUAL RULE REQUIRED, one for Join_controls (called CONTEXTJ) and one for other characters (called CONTEXTO). These are discussed in more detail below and in the Protocol document.

o DISALLOWED: Those that should clearly not be included in IDNs. Code points with this property value are not permitted in IDNs.

o UNASSIGNED: Those code points that are not designated (i.e., are unassigned) in the Unicode Standard.

The mechanisms described here allow determination of the value of the property for future versions of Unicode (including characters added after Unicode 5.2). Changes in Unicode properties that do not affect the outcome of this process do not affect IDN. For example, a character can have its Unicode General_Category value (see [Unicode52]) change from So to Sm or from Lo to Ll, without affecting the algorithm results. Moreover, even if such changes did affect the results, the BackwardCompatible list (Section 2.7) can be adjusted to ensure the stability of the results.

Some code points need to be allowed in exceptional circumstances but should be excluded in all other cases; these rules are also described in other documents. The most notable of these are the Join Control characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON-JOINER. Both of them have the derived property value CONTEXTJ. A character with the derived property value CONTEXTJ or CONTEXTO (CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate rule has been established and the context of the character is consistent with that rule. It is invalid to either register a string containing these characters or even to look one up unless such a contextual rule is found and satisfied. Please see Appendix A, "The Contextual Rules Registry", for more information.

This document is part of a series that, together, constitute a proposal for updating the IDNA standards to resolve issues uncovered in recent years, cover a broader range of scripts, and provide for migration to newer versions of Unicode. See the Rationale document [RFC5894] for a broader discussion.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

The derived property obtains its value based on a two-step procedure. First, characters are placed in one or more character categories, either based on core properties defined by the Unicode Standard or by treating the code point as an exception and addressing the code point by its code point value. These categories are not mutually exclusive.

In the second step, set operations are used with these categories to determine the values for an IDN-specific property. Those operations are specified in Section 3.

Unicode property names and property value names may have short abbreviations, such as gc for the General_Category property, and Ll for the Lowercase_Letter property value of the gc property.

In the following specification of categories, the operation that returns the value of a particular Unicode character property for a code point is designated by using the formal name of that property (from PropertyAliases.txt) followed by '(cp)'. For example, the value of the General_Category property for a code point is indicated by General_Category(cp).
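As a rough illustration of this notation, the General_Category(cp) operation can be approximated with Python's standard unicodedata module. This is a sketch only: the function name general_category is chosen here for illustration, and the module reflects whatever Unicode version ships with the interpreter, which is not necessarily Unicode 5.2.

```python
import unicodedata


def general_category(cp: int) -> str:
    """Return the short General_Category value (for example 'Ll') for cp.

    Illustrative helper, not part of the specification. The result depends
    on the Unicode version bundled with the Python interpreter.
    """
    return unicodedata.category(chr(cp))


if __name__ == "__main__":
    print(general_category(0x0061))  # LATIN SMALL LETTER A
    print(general_category(0x002D))  # HYPHEN-MINUS
```

The returned strings are the short property value names (such as Ll for Lowercase_Letter) described above.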

This category is used in the second step to preserve the traditional "hostname" (LDH -- as described in the Definitions document [RFC5890]) characters ('-', 0-9, and a-z). In general, these code points are suitable for use for IDN. Note that there are other rules regarding the code point U+002D HYPHEN-MINUS that are specified in the IDNA Protocol Specification [RFC5891].
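The LDH category described above can be sketched as a fixed set of code points. The names LDH and is_ldh are illustrative assumptions; the membership ('-', 0-9, and a-z) is taken directly from the text.

```python
# Assumed encoding of the LDH category: HYPHEN-MINUS, digits 0-9, letters a-z.
# Uppercase A-Z is deliberately absent, matching the list given in the text.
LDH = (
    frozenset([0x002D])
    | frozenset(range(0x0030, 0x003A))  # '0'..'9'
    | frozenset(range(0x0061, 0x007B))  # 'a'..'z'
)


def is_ldh(cp: int) -> bool:
    """Return True if cp belongs to the (illustrative) LDH set."""
    return cp in LDH
```

Membership in this set says nothing about the additional rules for U+002D HYPHEN-MINUS, which are specified in the Protocol document.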

This category explicitly lists code points for which the category cannot be assigned using only the core property values that exist in the Unicode standard. The values are according to the table below:

This category includes the code points whose property values have changed in versions of Unicode after 5.2 in such a way that the derived property value would no longer be PVALID or DISALLOWED. If changes in future versions of Unicode would cause a code point's derived property value to change from PVALID or DISALLOWED, this table can be updated to record special exception values so that the derived property values for those code points stay stable.

Elimination of conjoining Hangul Jamo from the set of PVALID characters results in restricting the set of Korean PVALID characters just to preformed, modern Hangul syllable characters. Old Hangul syllables, which must be spelled with sequences of conjoining Hangul Jamo, are not PVALID for IDNs.

J: General_Category(cp) is in {Cn} and Noncharacter_Code_Point(cp) = False

This category consists of code points in the Unicode character set that are not (yet) assigned. It should be noted that Unicode distinguishes between "unassigned code points" and "unassigned characters". The unassigned code points are all but (Cn - Noncharacters), while the unassigned *characters* are all but (Cn + Cs).
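The test for this category can be sketched in Python as follows. The standard library does not expose Noncharacter_Code_Point, so the sketch assumes the fixed set of 66 noncharacters (U+FDD0..U+FDEF and the last two code points of every plane). The function names are illustrative, and General_Category comes from the Unicode version bundled with the interpreter rather than Unicode 5.2.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Assumed stand-in for Noncharacter_Code_Point(cp)."""
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """General_Category(cp) is in {Cn} and Noncharacter_Code_Point(cp) = False."""
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)
```

Because the result depends on the Unicode data in use, a code point reported as unassigned by one interpreter version may be assigned in a later one.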

As described above (Section 1) and in more detail in the IDNA Protocol document [RFC5891], possible values of the IDN property are:

o PVALID

o CONTEXTJ

o CONTEXTO

o DISALLOWED

o UNASSIGNED

The algorithm to calculate the value of the derived property is as follows. If the name of a rule (such as Exception) is used, that implies the set of code points that the rule defines, while the same name as a function call (such as Exception(cp)) implies the value cp has in the Exceptions table.
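The distinction between a rule name used as a set and the same name used as a function call can be sketched with a dictionary. The names Derived, EXCEPTIONS, in_exceptions, and exceptions are assumptions made for illustration, and the four entries are only a small excerpt that should be checked against the normative Exceptions table rather than relied upon.

```python
from enum import Enum


class Derived(Enum):
    """The possible values of the derived property."""
    PVALID = "PVALID"
    CONTEXTJ = "CONTEXTJ"
    CONTEXTO = "CONTEXTO"
    DISALLOWED = "DISALLOWED"
    UNASSIGNED = "UNASSIGNED"


# Illustrative excerpt only; not the complete Exceptions table.
EXCEPTIONS = {
    0x00DF: Derived.PVALID,      # LATIN SMALL LETTER SHARP S
    0x00B7: Derived.CONTEXTO,    # MIDDLE DOT
    0x30FB: Derived.CONTEXTO,    # KATAKANA MIDDLE DOT
    0x0640: Derived.DISALLOWED,  # ARABIC TATWEEL
}


def in_exceptions(cp: int) -> bool:
    """Rule name as a set: 'cp .in. Exceptions'."""
    return cp in EXCEPTIONS


def exceptions(cp: int) -> Derived:
    """Rule name as a function call: 'Exceptions(cp)'."""
    return EXCEPTIONS[cp]
```

A dictionary serves both readings at once: its keys form the set, and a lookup yields the value assigned to the code point.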

The categories and rules defined in Sections 2 and 3 apply to all Unicode code points. The table in Appendix B shows, for illustrative purposes, the consequences of the categories and classification rules, and the resulting property values.

The list of code points that can be found in Appendix B is non-normative. Sections 2 and 3 are normative.

IANA has created a registry with the derived properties for the versions of Unicode released after (and including) version 5.2. The derived property value is to be calculated in cooperation with a designated expert [RFC5226] according to the specifications in Sections 2 and 3 and not by copying the non-normative table found in Appendix B.

If non-backward-compatible changes or other problems arise during the creation or designated expert review of the table of derived property values, they should be flagged for the IESG. Changes to the rules (as specified in Sections 2 and 3), including BackwardCompatible (Section 2.7) (a set that is empty at the release of this document), require IETF Review, as described in RFC 5226 [RFC5226].

For characters that are defined in the IDNA derived property value registry (Section 5.1) as CONTEXTO or CONTEXTJ and that therefore require a contextual rule, IANA has created and now maintains a list of approved contextual rules. Additions or changes to these rules require IETF Review, as described in [RFC5226].

Appendix A contains further discussion and a table from which that registry can be initialized.

Security Considerations for this version of IDNA, except for the special issues associated with right-to-left scripts and characters, are described in the Definitions document [RFC5890]. Specific issues for labels containing characters associated with scripts written right to left appear in the Bidi document [RFC5893].

As discussed in Section 5.2 and in the IANA Considerations section of the Rationale document [RFC5894], IANA maintains a registry of rules that define the contexts in which particular PROTOCOL-VALID characters, characters associated with a requirement for Contextual Information, are permitted. These rules are expressed as tests on the label in which the characters appear (all, or any part of, the label may be tested).

The grammatical rules are expressed in pseudo-code. The conventions used for that pseudo-code are explained here.

Each rule is constructed as a Boolean expression that evaluates to either True or False. A simple "True;" or "False;" rule sets the default result value for the rule set. Subsequent conditional rules that evaluate to True or False may re-set the result value.

A special value "Undefined" is used to deal with any error conditions, such as an attempt to test a character before the start of a label or after the end of a label. If any term of a rule evaluates to Undefined, further evaluation of the rule immediately terminates, as the result value of the rule will itself be Undefined.

cp represents the code point to be tested.

FirstChar is a special term that denotes the first code point in a label.

LastChar is a special term that denotes the last code point in a label.

.eq. represents the equality relation.

A .eq. B evaluates to True if A equals B.

.is. represents checking the position in a label.

A .is. B evaluates to True if A and B have the same position in the same label.

.ne. represents the non-equality relation.

A .ne. B evaluates to True if A is not equal to B.

.in. represents the set inclusion relation.

A .in. B evaluates to True if A is a member of the set B.

A functional notation, Function_Name(cp), is used to express either string positions within a label, Boolean character property tests of a code point, or a regular expression match. When such function names refer to Boolean character property tests, the function names use the exact Unicode character property name for the property in question, and "cp" is evaluated as the Unicode value of the code point to be tested, rather than as its position in the label. When such function names refer to string positions within a label, "cp" is evaluated as its position in the label.

RegExpMatch(X) takes as its parameter X a schematic regular expression consisting of a mix of Unicode character property values and literal Unicode code points.

Script(cp) returns the value of the Unicode Script property, as defined in Scripts.txt in the Unicode Character Database.

Canonical_Combining_Class(cp) returns the value of the Unicode Canonical_Combining_Class property, as defined in UnicodeData.txt in the Unicode Character Database.

Before(cp) returns the code point of the character immediately preceding cp in logical order in the string representing the label. Before(FirstChar) evaluates to Undefined.

After(cp) returns the code point of the character immediately following cp in logical order in the string representing the label. After(LastChar) evaluates to Undefined.

Note that "Before" and "After" do not refer to the visual display order of the character in a label, which may be reversed or otherwise modified by the bidirectional algorithm for labels including characters from scripts written right to left. Instead, "Before" and "After" refer to the network order of the character in the label.

The clauses "Then True" and "Then False" imply exit from the pseudo-code routine with the corresponding result.

Repeated evaluation for all characters in a label makes use of the special construct:

   For All Characters:
      Expression;
   End For;

This construct requires repeated evaluation of "Expression" for each code point in the label, starting from FirstChar and proceeding to LastChar.
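The conventions above (Before, After, the Undefined value, and the For All Characters loop) can be sketched in Python. Everything named here is an assumption for illustration: Undefined is modeled as an exception so that evaluation stops as soon as any term is undefined, None stands in for an Undefined result, and positions are zero-based indexes into the label in logical order.

```python
from typing import Callable, Optional


class Undefined(Exception):
    """Raised when a term refers to a position outside the label."""


def before(label: str, pos: int) -> int:
    """Code point immediately preceding position pos (logical order)."""
    if pos == 0:
        raise Undefined("Before(FirstChar)")
    return ord(label[pos - 1])


def after(label: str, pos: int) -> int:
    """Code point immediately following position pos (logical order)."""
    if pos == len(label) - 1:
        raise Undefined("After(LastChar)")
    return ord(label[pos + 1])


def evaluate(rule: Callable[[str, int], bool], label: str, pos: int) -> Optional[bool]:
    """Evaluate a rule; return None if any term was Undefined."""
    try:
        return rule(label, pos)
    except Undefined:
        return None


def for_all_characters(label: str, expression: Callable[[int], None]) -> None:
    """Apply expression at each position, from FirstChar to LastChar."""
    for pos in range(len(label)):
        expression(pos)


def preceded_by_a(label: str, pos: int) -> bool:
    """Hypothetical example rule: Before(cp) .eq. U+0061."""
    return before(label, pos) == 0x0061
```

Raising an exception mirrors the statement that evaluation terminates immediately once a term is Undefined; a caller then decides how to treat the None result.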

The different fields in the rules are to be interpreted as follows:

Code point: The code point, or code points, to which this rule is to be applied. Normally, this implies that if any of the code points in a label is as defined, then the rules should be applied. If evaluated to True, the code point is OK as used; if evaluated to False, it is not OK.

Overview: A description of the goal with the rule, in plain English.

Lookup: True if application of this rule is recommended at lookup time; False otherwise.

Overview: This may occur in a formally cursive script (such as Arabic) in a context where it breaks a cursive connection as required for orthographic rules, as in the Persian language, for example. It also may occur in Indic scripts in a consonant-conjunct context (immediately following a virama), to control required display of such conjuncts.
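The consonant-conjunct context mentioned in this overview can be sketched with the Canonical_Combining_Class test alone. This is a partial illustration, not the registered rule: it assumes that Virama corresponds to Canonical_Combining_Class value 9, it omits the cursive-script context (which needs joining-type data that the Python standard library does not provide), and the function name follows_virama is hypothetical.

```python
import unicodedata

VIRAMA_CCC = 9  # assumption: Canonical_Combining_Class value named Virama


def follows_virama(label: str, pos: int) -> bool:
    """Return True if the character before position pos is a virama.

    pos is the zero-based index of the join control in the label.
    """
    if pos == 0:
        # Before(FirstChar) is Undefined; this sketch treats it as not satisfied.
        return False
    return unicodedata.combining(label[pos - 1]) == VIRAMA_CCC
```

For example, in the label formed by U+0915, U+094D, U+200C, the join control at index 2 immediately follows DEVANAGARI SIGN VIRAMA, so the test succeeds.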

Overview: Note that the Script of Katakana Middle Dot is not any of "Hiragana", "Katakana", or "Han". The effect of this rule is to require at least one character in the label to be in one of those scripts.
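The effect described in this overview can be sketched as a scan over the label. The Python standard library does not expose the Script property, so the code point ranges below are a rough, assumed stand-in for Script(cp) being Hiragana, Katakana, or Han; a real implementation would consult Scripts.txt. The function names are hypothetical.

```python
# Approximate stand-in for Script(cp) .in. {Hiragana, Katakana, Han}.
# These ranges are assumptions and cover only part of each script.
_SCRIPT_RANGES = (
    (0x3041, 0x3096),  # Hiragana letters
    (0x30A1, 0x30FA),  # Katakana letters (U+30FB itself is outside this range)
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs block (part of Han)
)


def in_kana_or_han(cp: int) -> bool:
    """Return True if cp falls in one of the approximate script ranges."""
    return any(lo <= cp <= hi for lo, hi in _SCRIPT_RANGES)


def katakana_middle_dot_ok(label: str) -> bool:
    """Require at least one character in the label from those scripts."""
    return any(in_kana_or_han(ord(ch)) for ch in label)
```

Because U+30FB does not count as one of the three scripts, a label consisting only of the middle dot, or of the middle dot among Latin letters, fails the test.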

If one applies the rules (Section 3) to the code points 0x0000 to 0x10FFFF in Unicode 5.2, the result is as follows.

This list is non-normative, and only included for illustrative purposes. Specifically, what is displayed in the third column is not the formal name of the code point (as defined in Section 4.8 of The Unicode Standard [Unicode52]). The differences exist, for example, for the code points that have the code point value as part of the name (for example, CJK UNIFIED IDEOGRAPH-4E00) and the naming of Hangul syllables. For many code points, what you see is the official name.

[Unicode] The Unicode Consortium, "The Unicode Standard, Version 5.0", 2007. Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0. This printed reference has now been updated online to reflect additional code points. For code points, the reference at the time this document was published is to Unicode 5.2.