UNICODE CONFORMANCE MODEL

This is the first working draft of a proposed Unicode Conformance Model.

Status

This document is a working draft for a proposed Unicode Technical
Report. Publication does not imply endorsement by the Unicode
Consortium. This is a draft document which may be updated, replaced, or superseded
by other documents at any time. This is not a stable document; it is inappropriate
to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative
material. Conformance to the Unicode Standard does not imply conformance
to any UTR. Other specifications, however, are free to make normative
references to a UTR.

Please submit corrigenda and other comments with the online reporting
form [Feedback]. Related information that is useful
in understanding this document is found in the References.
For the latest version of the Unicode Standard see [Unicode].
For a list of current Unicode Technical Reports see [Reports].
For more information about versions of the Unicode Standard, see [Versions].

[Note to
reviewers: This 'R' version incorporates many edits from the editorial
committee; however, some input from that review has not yet been incorporated
and is recorded only as editorial notes.]

The Unicode Standard [Unicode] is a very large and
complex standard. Because of this, and because of the nature and role of the
standard, it is often rather difficult to determine, in any particular case,
exactly what conformance to the Unicode Standard means.

The Unicode Standard forms the foundation
which supports a large variety of operations on textual data, from data interchange
protocols to complex tasks like sorting, rendering or content analysis. All of these
processes expose
implementations to the complexities of human languages and writing systems.

Earlier character sets were either small, or had a clearly limited field
of application (such as a particular geographical area), or both. By contrast, the Unicode Standard
aims to be universal. A universal character encoding standard cannot rely on implicit agreements about the nature
and behavior of the characters it encodes; it must provide explicit constraints
on their identity and intended use. At the same time, the standard must allow
implementations the necessary flexibility to address the expectations of
its users, while providing enough constraints to guarantee predictable interchange
of data and consistency between implementations.

This Conformance Model explains the issue of conformance relating to the
Unicode Standard so that users better understand the contexts in which products
are making claims for support of the Unicode Standard, and implementers better
understand how to meet the formal conformance requirements while satisfying
the expectations of their users. It does not alter, augment or override the
actual Unicode conformance requirements found in the text of the Unicode Standard.
Rather it attempts to provide a conceptual framework to make it easier for users
and implementers to identify and understand the specific conformance requirements
contained in [Unicode].

This model defines conformance terminology, specifies different areas and
levels of conformance, and describes what it means to make a claim of conformance
or "support" of the standard. This model is not a framework for
conformance verification testing,
although it could be used to develop such a framework, should that prove desirable.
At this time no such framework has been developed by the Unicode Consortium,
nor have any conformance verification tests been required or sanctioned.

Many of the concepts presented here are equally applicable to other standards
developed by the Unicode Consortium, such as the
Unicode Collation Algorithm
[UCA], and the specifications for Unicode support in regular
expressions [RegEx].

This section gives a basic introduction to the terminology that will be discussed
in more detail in sections below.

2.1 Conformance

In the context of formal standards, conformance refers to a set of
rules or criteria whereby a relevant entity such as an element of information
interchange, a device, an application, or a piece of hardware, can be evaluated
as either meeting or not meeting the specification in the standard. In general,
a formal standard will have a conformance clause or clauses, which will be stated
in terms of conditionals, such as "X is in conformance with Y specification
of this standard if Z", or modals, often in uppercase, such as "An X that conforms
with Y specification of this standard SHALL Z". The modal verbs that standards
language commonly associates with such statements are often carefully defined
to avoid any ambiguities in interpretation. In common practice, they involve
specialized usage of "SHALL" and "MUST" for requirements, but also "MAY" for
permitted deviations and "SHOULD" for non-binding recommendations.

If a standard is complex, the conformance clause or clauses themselves may
also be complex. Occasionally, a conformance clause may simply be stated along
the lines of "X is in conformance with this standard if it follows the specification
in section W" where section W may consist of hundreds of pages and constitute
most of the rest of the standard.

Formal standards often distinguish between normative and informative
content. This distinction may be highly conventionalized, or even subject to
rules specified in other standards, such as for ISO standards, or the distinction
may be less formally maintained.

Normative content of a standard is content which is required for all of the
conformance requirements to be meaningful. Typically a standard will have normative
definitions for terms used in the rest of the specification, normative
references to other standards or sources whose content is referred to indirectly,
and normative clauses, specifications, or sections, which actually define the
content of the standard to which the conformance clauses apply.

Informative content of a standard is all material which has been added for
clarification, but which, in the judgment of the standard's maintainers, could
in principle be omitted without materially affecting the specification to which
the conformance clauses refer. If a standard is changed over time, the status
of some particular content could change from informative to normative, or vice
versa, depending on whether it became required for conformance or was no longer
required for conformance.

In the context of the Unicode Conformance Model, conformance verification
means an external (third party) determination that under some specified set
of circumstances an entity meets one or more requirements of the conformance
clauses of the standard. In other words, while conformance clauses are
merely a logical statement of requirements, conformance verification
implies the existence of conformance verification tests that have been applied to entities
in order to make such determinations.

A conformance claim can simply be stated. It is an assertion that entity
X meets a requirement of the standard.

A verification of a conformance claim, on the other hand, is the result
of the specific application of a test designed to determine the validity
of a conformance claim. Such tests are called conformance verification tests.

The Unicode Consortium does not endorse a particular methodology for
conformance verification.

A standard may include tests or "benchmarks" as part of the text of the standard,
or as external documents associated with the standard. While there is some overlap
in general usage of the terms "conformance test" and "conformance verification
tests", a systematic distinction is drawn between the two in the Unicode Conformance
Model.

2.5.1 Conformance Tests

A conformance test for the Unicode Standard is a list of data certified by
the Unicode Technical Committee [UTC] to be "correct" with regard to some particular
requirement for conformance to the standard. In some instances, as for example,
the implementation of the bidirectional algorithm, producing a definitive list
of correct results is difficult or impossible, and in such cases, a conformance
test may consist of an implemented algorithm certified by the UTC to produce
correct results for any pertinent input data. Conformance tests for the Unicode
Standard are essentially benchmarks that someone can use to determine if their
algorithm, API, etc., claiming to conform to some requirement of the standard,
does in fact match the data that the UTC asserts define such conformance.

2.5.2 Conformance Verification Tests

A conformance verification test for the Unicode Standard is a test, usually designed and implemented by a third party not associated
with the Unicode Standard or the UTC, intended to test a product which claims
conformance to one or more aspects of the Unicode Standard, for actual conformance
to the standard. Thus a conformance verification test is a test of a product.
Such a test may, of course, make use of one or more of the Unicode conformance
tests to determine the results of its conformance verification.

In the context of the Unicode Conformance Model, the term support
refers to a more generalized claim of intent to conform to one or another requirement
of the standard. A claim of Unicode support may in fact be difficult to verify,
because it can be vague in detail. However, at least it indicates in principle
that the developer or user of an entity intends conformance. More specifically,
support often refers to a claim of particular repertoire coverage. For example,
an application may claim support for Unicode Greek. That should be interpreted
as meaning that Unicode Greek characters will be handled in conformance with
the standard, and that all other relevant aspects of processing those characters
with which that particular application is concerned will be done in such a
way as not to violate the conformance clauses of the standard.

Some formal standards are developed once and then are essentially frozen
and stable forever. For such standards, stability of content and the corresponding
stability of conformance claims is not an issue.

For a standard aimed at the universal encoding of characters, such stability is not possible. The standard is
necessarily evolving and expanding over time, to extend its coverage to
include all
the writing systems of the world. And as experience in its implementation accumulates,
further aspects of character processing are added to the formal content of the
standard. This fundamentally dynamic quality of the Unicode Standard complicates
issues of conformance, because of the continually expanding content to which
conformance requirements pertain. This expansion is both an expansion in breadth,
by adding more characters and scripts, and in depth, by adding more aspects
of character processing.

Invariance refers to those aspects of the content of the Unicode Standard
that have been formally defined as unchangeable, even as the standard continues
its development. The guarantee of the stability of the formal
Unicode character names is a fairly trivial example. While in principle such
names could be changed, and were changed once between Version 1.0 and Version 1.1, the [UTC]
has determined that such changes are too disruptive and have too little benefit
to be tolerated. Accordingly, the stability of character names has been promoted
to the status of an invariant in the standard.

The Unicode Standard is regularly versioned, as new characters are added.
A formal system of versioning is in place, involving three levels of versions:

major versions

minor versions

update versions

All three levels have carefully controlled rules for the type of documentation
required, handling of the associated data files, and allowable types of change
between versions. For more information about Unicode versioning see [Versions].
Other standards developed by the Unicode Consortium may use a single level versioning
scheme.
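The three-level scheme can be illustrated with a short sketch; the helper name and the sample version string below are purely illustrative and not part of any Unicode specification:

```python
# Sketch: splitting a Unicode version string into its three formal
# levels (major, minor, update). The function name and the example
# version are illustrative assumptions, not standardized artifacts.
def parse_unicode_version(version: str) -> tuple[int, int, int]:
    """Return (major, minor, update) from a string such as '15.1.0'."""
    major, minor, update = (int(part) for part in version.split("."))
    return major, minor, update

major, minor, update = parse_unicode_version("15.1.0")
```

A conformance claim would then cite the full triple, since allowable changes differ at each level.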

Conformance claims clearly must be specific to versions of the Unicode
Standard, but the level of specificity needed for a claim may vary according
to the nature of the particular conformance claim being made. Some
standards developed by the Unicode Consortium require separate conformance
to a specific version (or later), of the Unicode Standard. This version is
sometimes called the base version. In such cases, the version of the standard and
the version of the Unicode Standard to which the conformance claim refers must
be compatible.

If a technical deficiency in the specifications of the Unicode Standard is
identified, it may be corrected by a change in the next version, or, if sufficiently
important, by a formal corrigendum. A corrigendum often applies to several earlier
versions. Implementations can claim conformance to any of these versions with
the given corrigendum applied. For more on corrigenda see [Versions].

Errata are used to describe other known defects in the text. Unlike corrigenda
they cannot be referenced in a conformance claim. For more information on errata
see [Errata].

This section will serve as a guide to the particular way that the Unicode
Standard expresses conformance requirements, both in terms of where they are
located and how they are expressed. It also explores the peculiar aspects of
conformance related to the synchronized status of the Unicode Standard and the
independent but closely aligned International Standard ISO/IEC 10646, which
has its own conformance clauses expressed using ISO conventions.

Chapter 3, "Conformance" of [Unicode] contains formal
definitions of terms referenced in the conformance clauses. While modifications
of these definitions between versions of the Unicode Standard have been, and
will continue to be necessary, every effort is made to keep the numbering of
the definitions stable. This makes it easier to maintain external specifications
that cite a particular definition.

The conformance clauses in Section 3.2, "Conformance Requirements" of [Unicode]
define the requirements for a conformant implementation. They are expressed
in terms of the definitions, but also refer to additional specifications contained
in Unicode Standard Annexes. While modifications of these clauses between versions
of the Unicode Standard have been, and will continue to be necessary, every
effort is made to keep the numbering of the clauses stable. This makes it easier
to maintain external specifications that cite a particular clause.

A Unicode Standard Annex (UAX) contains part of the standard, published as
a standalone document. The relation between conformance to the Unicode Standard
and conformance to each of the Unicode Standard Annexes is spelled out in detail
in Section 3.2, "Conformance Requirements" of [Unicode]. Some of the
conformance clauses refer explicitly to specifications contained in UAXs, such
as the Bidirectional Algorithm
[Bidi] or
Normalization Forms [Normalization]. Normative
material in other UAXs is defined by any of the mechanisms described below.

Unicode algorithms are specified as a series of logical steps. In many cases,
the input to the algorithm is a string of character properties: in other words,
the results of the algorithm are identical for different input strings, as long
as each input string maps to the same string of character property values. Conformance
to a Unicode algorithm does not require repeating the steps as described, but
rather requires achieving the same outputs for the same inputs. This provides
the necessary flexibility for implementations to pursue optimizations. Whether
or not conformance to a given algorithm is required by Unicode conformance,
implementations claiming to implement one of these algorithms must do so in
conformance with its specification.
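The output-based nature of algorithm conformance can be illustrated with normalization. The following sketch uses Python's standard unicodedata module (which implements whatever Unicode version ships with the interpreter); two canonically equivalent inputs must yield identical output regardless of the internal steps an implementation takes:

```python
import unicodedata

# U+00C5 and the sequence U+0041 U+030A are canonically equivalent,
# so any conformant NFC implementation must map both to the same
# result, however it is implemented internally.
precomposed = "\u00c5"           # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "\u0041\u030a"      # A followed by COMBINING RING ABOVE

nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
assert nfc_a == nfc_b == "\u00c5"
```

An optimized implementation (for example, one that fast-paths strings already in NFC) remains conformant so long as it produces these same outputs.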

Some algorithms provide explicit methods for tailoring, or customizing a
general algorithm to the needs of a specific language, locality or application.
Other algorithms simply describe the best default practice, and customization
is assumed for any practical application. An example of this is the line breaking
algorithm in [LineBreak]. Whether or not conformance
to a given algorithm is required by Unicode Conformance, implementations claiming
to implement one of these algorithms must disclose the use of tailoring or
customization.

The Unicode Standard and ISO/IEC 10646 share the same repertoire of coded
characters, including the character code position, character name and
identity. However, the two standards differ in the precise terms of their
conformance specifications. Any conformant Unicode implementation will
conform to ISO/IEC 10646, but because the Unicode Standard imposes
additional constraints on character semantics and transmittability, not all
implementations that are compliant with ISO/IEC 10646 will be compliant with
the Unicode Standard. For a detailed description see Appendix C, "Relationship to ISO/IEC 10646"
of [Unicode].

There are several broad areas of application where Unicode Conformance makes
specific types of requirements. Because not all applications and implementations
cover all these areas, some aspects of Unicode conformance may not be applicable
to them.

Representation covers all aspects of being able to express and transmit Unicode
data. It is a requirement applicable to certain protocols (for example, XML),
but might apply to the storage aspects of databases and other file formats as
well. Conformant representation applies to correct use of encoding forms and
encoding schemes, as well as the ability to represent all Unicode code points.
In addition, issues related to [Normalization]
are important.
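As a minimal sketch of what correct use of encoding forms entails, the following shows a supplementary-plane character (U+10400, DESERET CAPITAL LETTER LONG I, chosen purely as an example) expressed in two encoding forms using Python's built-in codecs:

```python
# A conformant representation must handle all Unicode code points,
# including the supplementary planes: U+10400 requires a four-byte
# sequence in UTF-8 and a surrogate pair in UTF-16.
char = "\U00010400"

utf8_form = char.encode("utf-8")
utf16_form = char.encode("utf-16-be")

assert utf8_form == b"\xf0\x90\x90\x80"    # four-byte UTF-8 sequence
assert utf16_form == b"\xd8\x01\xdc\x00"   # surrogate pair D801 DC00
```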

Conformant transcoding between Unicode and all other, so-called legacy,
character encodings retains the identity of the transcoded characters.
In addition, it may claim to retain a specific normalization form for the
converted data. See [Normalization]. [CharMapML]
defines a format for expressing character mappings. Implementations may
choose to conform to that format in order to be able to interchange mapping
tables.
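The identity-retention requirement can be sketched as a round-trip test; ISO 8859-1 is used here purely as an example legacy encoding:

```python
# Conformant transcoding retains character identity: round-tripping
# through a legacy encoding must return the original characters for
# everything the legacy repertoire covers.
original = "caf\u00e9"                       # U+00E9 exists in ISO 8859-1
legacy_bytes = original.encode("iso-8859-1")
round_tripped = legacy_bytes.decode("iso-8859-1")
assert round_tripped == original
```

A transcoder that additionally claims to preserve a normalization form would extend this check by verifying the form of the converted data.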

String processing covers all operations on Unicode texts that can be carried
out without considering layout and specifically without considering fonts. String
processing encompasses a large variety of operations including, but not limited
to text segmentation, text parsing, handling regular expressions, searching,
and sorting, as well as creating formatted text representation of data types. For
a number of these operations model algorithms and other specifications exist
to which an implementation may claim conformance, such as [UCA].
[RegEx], [Boundaries], [LineBreak].
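As a sketch of one such string-processing operation, the following search treats canonically equivalent strings as equal by normalizing both sides before comparison. The helper name and the choice of NFC as the comparison form are assumptions made for illustration:

```python
import unicodedata

# A substring search that ignores differences between canonically
# equivalent spellings by normalizing both operands to NFC first.
def nfc_contains(haystack: str, needle: str) -> bool:
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)

# A decomposed spelling of 'e with acute' still matches the
# precomposed text.
assert nfc_contains("r\u00e9sum\u00e9", "re\u0301sum")
```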

Layout comprises all operations that go from backing store to displayed text.
The same operations are run in reverse for selection. These operations are dependent on font data,
but are considered separately from fonts because the same implementation typically can
work with a range of different fonts. Some operations, such as suppressing
the display of certain ignorable code points, are typically handled by the
layout system without involving fonts. Conformance issues for layout
processes include reordering from logical to display ordering, as well as
positional shape selection. For bidirectional reordering, conformance to
[Bidi] is required. For positional shaping and script-specific layout, model
algorithms exist, or are being developed for Arabic and Syriac, Devanagari,
Tamil and other Indic Scripts, as well as Mongolian. While the requirements
of high end typography typically exceed these script-specific
specifications, conformance requires a relation between specific constructs
in the writing system and corresponding character code sequences, so that
these constructs can be interchanged reliably.

The Unicode Standard does not standardize the actual appearance of characters,
but instead intends that they should be depicted within a customary range of
design interpretations. Conformance to the Unicode Standard therefore primarily refers
to those tables in the fonts that correlate character codes with the glyphs
in the font, for example 'cmap' tables, and to claims of "coverage" of
the Unicode
repertoire by fonts.

Conformance-related issues for character input consist of coverage of the Unicode repertoire, conversion of input to Unicode
character values for storage, and consistency with the text models required
for particular scripts and text layout. The entities involved here are mostly IMEs and
keyboard drivers.

Unicode Technical Standard #18,
Unicode Regular Expressions [RegEx], is an example of a standard that has well-defined
levels of conformance. Each implementation can claim conformance to a
specific level, and each level makes specific conformance requirements. By
contrast, conformance to the Unicode Standard is not organized into such
discrete levels. However, there are some areas where the standard allows
limited, or partial, support of some requirements.

5.1 Repertoire Coverage

The Unicode Standard explicitly
does not require that all implementations support all Unicode characters.
Any implementation may support an arbitrary subset of Unicode characters,
and in fact, may support different sets of characters for different
operations.

However, if an implementation
claims conformance to an algorithm such as normalization, or implements a
UTF-8 converter, such implementations are expected to support the entire
range of Unicode code points.
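The full-range expectation for a UTF-8 converter can be sketched as a brute-force round-trip check over every scalar value, here using Python's built-in codecs:

```python
# A converter claiming UTF-8 conformance is expected to handle every
# Unicode scalar value. Surrogate code points are skipped: they are
# not scalar values and are not representable in well-formed UTF-8.
def utf8_round_trips_full_range() -> bool:
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        if ch.encode("utf-8").decode("utf-8") != ch:
            return False
    return True

assert utf8_round_trips_full_range()
```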

Note: An
implementation may define an algorithm, such as identifier matching,
that uses normalization as part of the algorithm but also restricts the
allowable set of input characters. In that case, any implementation of
that algorithm is free to use a limited implementation of normalization,
because the limit on the input makes it impossible to distinguish
between a full and limited implementation of normalization.

5.2 Full Conformance

This and the next section consider conformance separately for each of the
major areas of Section 4. Full conformance in a given area is not
necessarily the same as full support for that area, as conformance
requirements in many cases are minimal requirements. Exceptions are certain
well-defined areas such as encoding forms or normalization that have few or no options
and few or no levels.

5.3 Partial Conformance - Levels of Support Defined

This section will provide both a typology for levels of conformance in an
area, presenting an alternative to the notion that all aspects of Unicode conformance
are either/or issues, and specific lists of levels of conformance and support where they
can be drawn from the standard.

[Ed.: For example, the standard explicitly talks
about levels of surrogate support, which is an example that should be abstracted, along with others,
to provide the basis for determining how to make various claims of conformance.]

5.4 Best Practices

[Ed. This section could describe best practices of deciding levels of conformance
or it could describe how conformance requirements relate to best practices in
a given area.]

[Ed. note: The following
content is just sketched out in outline form. It could also cover what
should be tagged with a Unicode version and when.]

6.1 Inter-level Compatibility Issues

Conformant implementations will have to interact with both down-level and
up-level implementations. This creates particular issues. The Unicode conformance
requirements are structured to encourage implementations to passively support
data containing characters assigned in future versions of the standard.

6.1.1 Down-level Compatibility

[Ed. Note:
Describe what strategies the standard follows, and what the corresponding
implementation strategies are.]

6.1.2 Up-level Compatibility

[Ed. Note:
Describe what strategies the standard follows; for example,
assigning some properties to unassigned code points. What are the
implementation strategies?]

6.2 Repertoire Matching

It is generally not helpful to tag data created by an implementation with
the version level of Unicode supported by that implementation. Because
the repertoire of that version of Unicode is far larger than the actual set
of characters used in the data, a large part of text data created and
interchanged worldwide can be represented in all versions of Unicode.
Therefore, the version level of the implementation bears little relation to
the repertoire needed to cover the data.

Most implementations will not equally support the entire repertoire of Unicode
characters for a given version. In fact, there is no conformance requirement
to support any specific part of the repertoire. Therefore, even if the version
level of a receiving implementation is higher than that of the creating implementation,
there is no guarantee that both support the repertoire covered by the data,
or support it equally well.

[Unicode] defines no method for enumerating or identifying common sub-repertoires
of the standard, but ISO/IEC 10646 does so. Implementations can use the Age property [DerivedAge]
of each character code to avoid sending character codes to a down-level
system which lacks a definition for them. Because character coding is strictly additive,
implementations
receiving data can easily identify characters that are not defined in the version
of the standard to which they conform and take appropriate action. In many cases,
appropriate action consists of passing through such data, or treating them as
characters possessing default properties. (See UTR#23: Unicode Character
Property Model [PropertyModel] for more details on default properties).
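A receiving implementation's handling of unknown characters can be sketched with Python's unicodedata module, which reports the general category for the Unicode version bundled with the interpreter; code points it does not know are reported as 'Cn' (unassigned). U+0378, unassigned as of current versions, is used purely as an illustration:

```python
import unicodedata

# Detect code points that are unassigned in the Unicode version this
# implementation knows about (general category 'Cn'), so they can be
# passed through or given default properties rather than rejected.
def is_unassigned(ch: str) -> bool:
    return unicodedata.category(ch) == "Cn"

assert is_unassigned("\u0378")
assert not is_unassigned("A")
```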

[Ed. note:
Eric to give input.]

6.3 Matching Areas and Levels of Conformance between Implementations and
Components

A mere matching of version numbers between an implementation and components
it relies on will not be sufficient, because components may subset the repertoire
they support or choose a different level of conformance, where available.

[Versions] Versions of the Unicode Standard,
http://www.unicode.org/standard/versions
For information on version numbering, and citing and referencing
the Unicode Standard, the Unicode Character Database, and Unicode Technical
Reports.