Summary/Abstract

This document contains guidelines on the use of the Unicode Standard in
conjunction with markup languages such as XML.

Status of this document (Unicode Consortium)

This proposed draft is published for review purposes. This draft has been
considered by the Unicode Technical Committee and approved as proposed draft
for internal review by Unicode Members and members of W3C Internationalization
WG. At its next meeting, the Unicode Technical Committee may approve, reject,
or further amend this document. It is intended that this
document will become a joint Unicode - W3C document.

This document does not, at this time, imply any endorsement by the Consortium's
staff or member organizations. Please mail comments to
unicore@unicode.org.

Status of this document (W3C)

This is a W3C Working Draft worked on jointly by the W3C Internationalization
Working Group/Interest Group (Members only) and the Unicode Technical
Committee. For public discussion of this working
draft, please use the mailing lists www-international@w3.org and
unicode@unicode.org (please crosspost to both lists). For internal discussions,
please use the relevant mailing list (again with crossposting). Please send
editorial comments to the authors.

The material in this draft is still at an early stage. The draft currently
shows the approximate range of intended coverage (e.g. which kinds of
characters will be addressed, and what kind of information is intended to be
provided for each kind), while large parts still need more work and
discussion. It is not yet clear what the exact proposal for each character
will be, or how this document will relate to other W3C specifications. One
potential way to proceed is to work towards publishing this document as a
Note, and to reference it, normatively or otherwise, from the Character Model
[CharMod] document.

Publication as a Working Draft does not imply endorsement by the W3C membership.
This is a draft document and may be updated, replaced or obsoleted by other
documents at any time. It is inappropriate to cite W3C Drafts as other than
"work in progress". A list of current W3C working drafts can be found at
http://www.w3.org/TR.

The Unicode Standard contains a large number of characters in order to cover
the scripts of the world. It also contains characters for compatibility with
older character encodings, and characters with control-like functions included
for various reasons. It also provides specifications for use of these characters.

For document and data interchange, the Internet and the World Wide Web
increasingly make use of marked-up text. In many instances, markup provides
the same features as, or features essentially similar to, those provided by
formatting characters in the Unicode Standard [Unicode] for use
in plain text. While there may be valid reasons to support these characters
and their specifications in plain text, their use in marked-up text can conflict
with the rules of the markup language.

[a more extensive overview of Unicode and markup will be added to level out
the background of various audiences]

1.1 Notation

This report uses XML [XML] as a prominent and general
example of markup. The XML namespace notation
[Namespace] is used to indicate that a certain element
is taken from a specific markup language. As an example, the prefix 'html:'
indicates that this element is taken from [XHTML]. This
means that the examples containing the namespace prefix 'html:' are assumed
to include a namespace declaration of xmlns:html="..." [Ed. note: insert
the appropriate URI for XHTML later].

Characters are denoted using the notation used in the Unicode Standard, i.e.
U+ followed by their hexadecimal number such as "U+1234". In XML or HTML
this would be expressed as "&#x1234;". [Should this be replaced by the
XML convention? Probably not, because we don't want to see these in XML :-)]
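
The two notations above can be related mechanically. The following Python
fragment is a minimal sketch, not part of the proposal; the function names
are hypothetical and chosen only for illustration.

```python
def to_unicode_notation(ch: str) -> str:
    """Return the U+XXXX notation, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def to_numeric_reference(ch: str) -> str:
    """Return the hexadecimal character reference used in XML or HTML."""
    return f"&#x{ord(ch):X};"


if __name__ == "__main__":
    ch = "\u1234"
    print(to_unicode_notation(ch), to_numeric_reference(ch))
```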

A later version of this document will discuss each of the character categories.
For each of the categories/characters, the following points may be discussed:

Short description of semantics

Reason for inclusion in Unicode

Specific problems when used with markup

Other areas where problems may occur (e.g. plain text)

What kind of markup to use in place

What to do if detected (remove/ignore/replace/complain,...)

The following subsection gives an example:

3.1 Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand
in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included
in Unicode only in order to reserve a codepoint for a very frequent
application-internal use. Many text-processing applications store the text
and the associated markup (or in some cases styling information) of a document
in separate structures. The actual text is kept in a single linear structure;
additional information is kept separately with pointers to the appropriate
text positions. The overall implementation makes sure that these two structures
are kept in sync. If the text contains objects such as images, it is extremely
helpful for implementations to have a sentinel in the text itself; any additional
information is kept separately.

Problems when used in markup: Including an object replacement character
in markup text does not work because the additional information (what object
to include,...) is not available.

Problems with other uses: The object replacement character is also
problematic when used in plain text, because there is no way in plain text
to provide the actual object information or a reference to it.

Replacement markup: The markup to be used in place of the Object
Replacement Character depends on the object in question and the markup context
it is used in. Typical cases are <html:img src="..." />, <html:object
...>, or <html:applet ...>. These constructs make it possible to provide all
additional information needed to identify and use the object in question.

What to do if detected: In a proxy context, ignore. In a
browser context, treat as either a missing image or a REPLACEMENT CHARACTER.
When received in an editing context, if the actual object is accessible,
replace the character by the appropriate markup for that object. Otherwise
remove, ideally providing a warning.
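
The handling rules for U+FFFC can be sketched as a small dispatch on the
processing context. This is an illustration only, under several assumptions:
the context labels "proxy", "browser" and "editor" and the function name are
hypothetical, "ignore" is read as "pass through unchanged", the browser case
shows only the REPLACEMENT CHARACTER option, and a single piece of object
markup is applied to every occurrence.

```python
OBJECT_REPLACEMENT = "\uFFFC"
REPLACEMENT_CHARACTER = "\uFFFD"


def handle_object_replacement(text, context, object_markup=None):
    """Return (new_text, warnings) for the given processing context.

    context is "proxy", "browser" or "editor" (hypothetical labels).
    object_markup is the markup for the actual object, if the editor
    can access it; None means the object is not accessible.
    """
    if context not in ("proxy", "browser", "editor"):
        raise ValueError(f"unknown context: {context!r}")
    count = text.count(OBJECT_REPLACEMENT)
    if count == 0 or context == "proxy":
        # Proxy: ignore, read here as "pass the text through unchanged".
        return text, []
    if context == "browser":
        # Browser: treat as a REPLACEMENT CHARACTER.
        return text.replace(OBJECT_REPLACEMENT, REPLACEMENT_CHARACTER), []
    if object_markup is not None:
        # Editor with an accessible object: substitute the markup.
        return text.replace(OBJECT_REPLACEMENT, object_markup), []
    # Editor without the object: remove, and report a warning.
    warning = f"removed {count} U+FFFC character(s) with no accessible object"
    return text.replace(OBJECT_REPLACEMENT, ""), [warning]
```

Returning the warnings instead of printing them leaves the decision on how to
inform the user to the calling application.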

3.2 Interlinear Annotation Characters, U+FFF9-U+FFFB

Short description: The interlinear annotation characters are used
to delimit interlinear annotations in certain circumstances.

Reason for inclusion: The interlinear annotation characters were
included in Unicode only in order to reserve codepoints for very frequent
application-internal use. The interlinear annotation characters are used to
delimit interlinear annotations in contexts where other delimiters are not
available, and where non-textual means exist to carry formatting information.
Many text-processing applications store the text and the associated markup
(or in some cases styling information) of a document in separate structures.
The actual text is kept in a single linear structure; additional information
is kept separately with pointers to the appropriate text positions. This
is called out-of-band information. The overall implementation makes sure
that these two structures are kept in sync. If the text contains interlinear
annotations, it is extremely helpful for implementations to have delimiters
in the text itself, even though delimiters are not otherwise used for style
markup. With this method, and unlike the case of the object replacement character,
all textual information can remain in the standard text stream, but any
additional formatting information is kept separately. In addition, the
Interlinear Annotation Anchor serves as a placeholder for formatting information
for the whole annotation object, the same way a paragraph mark can be a
placeholder to attach paragraph formatting information.

Problems when used in markup: Including interlinear annotation
characters in markup text does not work because the additional formatting
information (how to position the annotation,...) is not available.

Problems with other uses: The interlinear annotation characters
are also problematic when used in plain text, and are not intended for that
purpose. In particular, on older display systems that ignore or replace the
Interlinear Annotation Characters, the meaning of the text may be changed.

Replacement markup: The markup to be used in place of the Interlinear
Annotation Characters depends on the formatting and nature of the interlinear
annotation in question. For ruby, please see [Ruby].

What to do if detected: In a proxy context or browser context, remove
U+FFF9 and remove all characters between U+FFFA and the following U+FFFB.
When received in an editing context, either remove in the same manner, possibly
with a warning to the user, or convert into appropriate ruby markup for further
editing and formatting by the user.
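
Both options can be sketched in a few lines of Python. The fragment below is
an illustration under stated assumptions, not a definitive implementation:
the function names are hypothetical, the delimiters U+FFFA and U+FFFB are
removed together with the annotation text between them, nested annotations
are not handled, and the element names ruby, rb and rt are assumed to be the
appropriate ruby markup (see [Ruby] for the actual definition).

```python
import re
from xml.sax.saxutils import escape

ANCHOR, SEPARATOR, TERMINATOR = "\uFFF9", "\uFFFA", "\uFFFB"

# Annotation text, from the separator up to the following terminator.
ANNOTATION_TAIL = re.compile(f"{SEPARATOR}[^{TERMINATOR}]*{TERMINATOR}")

# A complete, non-nested annotation: anchor, base, separator, note, terminator.
PLAIN = f"[^{ANCHOR}{SEPARATOR}{TERMINATOR}]*"
FULL_ANNOTATION = re.compile(
    f"{ANCHOR}({PLAIN}){SEPARATOR}({PLAIN}){TERMINATOR}"
)


def strip_annotations(text):
    """Proxy/browser handling: keep the base text, drop the annotation."""
    without_annotation = ANNOTATION_TAIL.sub("", text)
    return without_annotation.replace(ANCHOR, "")


def annotations_to_ruby(text):
    """Editor handling: turn plain text into markup with ruby elements."""
    # Escape markup-significant characters first; the annotation
    # characters themselves are not affected by escaping.
    return FULL_ANNOTATION.sub(
        r"<ruby><rb>\1</rb><rt>\2</rt></ruby>", escape(text)
    )
```

For the input "\uFFF9base\uFFFAnote\uFFFB tail", the first function yields
"base tail" and the second yields
"<ruby><rb>base</rb><rt>note</rt></ruby> tail".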

The Unicode Standard provides compatibility mappings for a number of characters.
Compatibility mappings indicate a relationship to another character, but
the exact nature of the relationship varies. In some cases the relationship
means "is based on"; in other cases it denotes a property. When plain
text is marked up, it may make sense to map some of these characters to their
compatibility equivalents and suitable markup. It is important to
understand the nature of the distinctions between characters and their
compatibility equivalents and the contexts in which these distinctions matter.
It is never advisable to apply compatibility mappings indiscriminately. This
section provides guidance on when and how to apply compatibility mappings.
It is organized by the "compatibility tag" associated with each compatibility
mapping.

4.1 Overview

The following table gives an overview of the various compatibility characters,
organized by "compatibility tag". The first column contains the tag value
of the "compatibility tag" from the Unicode database. Although these tags
use "<" and ">", they should not be confused with XML tags. Code
range indicates which codepoints the entry applies to. Substitute
indicates whether the codes can be substituted using the compatibility
equivalent according to Normalization Form KC of [UTR 15].
Markup indicates the available markup. For some cases, instead of
or in addition to markup, style information [CSS2] is
needed. [Discussion about style info to be added in the future.]

Tag value    Code range   Substitute   Markup         Comment
-----------  -----------  -----------  -------------  ------------------------------
<vertical>   all          yes          none           Presentation forms
<initial>    all          yes          none           Presentation forms
<medial>     all          yes          none           Presentation forms
<final>      all          yes          none           Presentation forms
<isolated>   all          yes          none           Presentation forms
<super>      all          yes          <sup>
<sub>        all          yes          <sub>
<small>      all          no           none           Precise usage unknown. Maintain, but don't generate
<no-break>   all          no           none           The compatibility mapping is merely a way to indicate the
                                                      equivalent character that is not non-breaking. The distinction
<font>       all          no           none           Variant forms that are used as symbols
<compat>     2100-2101    no           none           Variant forms that are used as symbols
             2105-2106    no           none           Variant forms that are used as symbols
             2121         yes          ?hiv?          For use as single code point in vertical layout
             2160-2175    yes          ?hiv?          For use as single code point in vertical layout
             3131-318E    no           none           Do not conjoin
             2000-200A    no           none           No equivalent markup exists for spaces
             3200-3243    ?            ?hiv?          String used as symbol in vertical layout
             249C-24B5    ?            ?hiv?          String used as symbol in vertical layout
             2474-249B    yes          bullet style   Number used as symbol in vertical layout
             2155-215F    yes          none           As long as fraction slash is supported!
             00BC-00BE    yes          none           As long as fraction slash is supported!
             all other    no           none           Maintain, semantic distinctions apply
<circled>    all          no           none           Bullets or dingbats analogous to 2776-2793
<squared>    3358-337D    yes?         ?hiv?          For use as single code point in vertical layout
<squared>    33E0-33FE    yes?         ?hiv?          For use as single code point in vertical layout
             32C0-32CB    yes?         ?hiv?          For use as single code point in vertical layout
             33A7         ?                           Variant form used as symbol in vertical layout
             33A8         ?                           Variant form used as symbol in vertical layout
             33AE-33AF    ?                           Variant forms used as symbols in vertical layout
             33C6         ?                           Variant form used as symbol in vertical layout
             3300-3357    yes          ?squared?      Multiline cluster for vertical layout
<narrow>     all          no           none           No equivalent markup exists
<wide>       all          no           none           No equivalent markup exists

Notes

At the time of this writing it was not known what the appropriate markup
would be for squared kana clusters or horizontal in vertical symbols.
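
The compatibility tag of a character can be read from the Unicode database,
and a substitution can then be restricted to the tags for which the table
gives both a substitute and markup. The following Python sketch does this for
<super> and <sub> only. It is an illustration under assumptions: the function
names are hypothetical, the element names sup and sub are taken from the
Markup column without a namespace prefix, and Python's unicodedata module
reflects a later version of the Unicode database than the one this report
covers.

```python
import unicodedata
from xml.sax.saxutils import escape

# Only tags with "yes" under Substitute and an element under Markup.
TAG_TO_ELEMENT = {"<super>": "sup", "<sub>": "sub"}


def compatibility_tag(ch):
    """Return the compatibility tag of ch, or None if it has none."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith("<"):
        return decomposition.split(" ", 1)[0]
    return None


def mark_up_scripts(text):
    """Replace superscript and subscript characters by marked-up equivalents."""
    out = []
    for ch in text:
        element = TAG_TO_ELEMENT.get(compatibility_tag(ch))
        if element is None:
            # All other characters, including other compatibility
            # characters, are kept as they are.
            out.append(escape(ch))
        else:
            plain = unicodedata.normalize("NFKC", ch)
            out.append(f"<{element}>{escape(plain)}</{element}>")
    return "".join(out)
```

With this sketch, mark_up_scripts("H\u2082O") gives "H<sub>2</sub>O".
Applying Normalization Form KC to the whole string instead would also give
"H2O", losing the distinction that the markup preserves.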

4.2 Generating characters

Presentation forms, and characters for which an adequate representation exists
as marked-up text, should never be generated for new data. Many of the characters
with a <font> tag are suitable for new data, as long as they are used
in the manner intended, that is, as symbols with a definite semantic
differentiation between the different forms. They should not be used to create
styled text; conversely, styled text should not be used to carry essential
semantic distinctions, such as those needed in mathematics.
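
A generating application could check its output for presentation forms before
emitting it. The following Python sketch is one possible check, offered as an
assumption rather than a requirement of this report; the function name is
hypothetical, and the set of tags is taken from the rows marked "Presentation
forms" in the overview table.

```python
import unicodedata

PRESENTATION_TAGS = {"<vertical>", "<initial>", "<medial>", "<final>", "<isolated>"}


def find_presentation_forms(text):
    """Return (index, code point, tag) for each presentation form in text."""
    hits = []
    for index, ch in enumerate(text):
        # decomposition() returns "" for characters without a mapping,
        # so the first field is either a tag, a code point, or empty.
        tag = unicodedata.decomposition(ch).split(" ", 1)[0]
        if tag in PRESENTATION_TAGS:
            hits.append((index, f"U+{ord(ch):04X}", tag))
    return hits
```

An empty result means that no character with one of these tags was found; it
does not by itself show that the text is otherwise suitable for markup.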

4.3 Bullets

[Ed. Note: this is an example of a detail section for particular compatibility
characters.]

Short description: Characters with a <circled> tag, or characters
with a <compat> tag and a compatibility mapping to a parenthesized string.

Reason for inclusion: They are most frequently used for enumerated
bullets, but the characters with a <circled> tag often occur as dingbats
or footnote markers in tables.

Problems when used in markup: These characters do not cause undue
interaction with markup.

Problems with other uses: None

Replacement markup: Bullet style. When marked-up text is generated,
these characters occur only internally in the user agent, as bullet styles are
rendered. When plain text data is marked up, they could be converted to suitable
bullet styles, if such use can be properly inferred.

Compatibility mappings of the form (n) or (n.) can be kept
as single characters, or replaced by bullet styles. A conversion to bullet
styles allows a simple extension of the set to arbitrary numbers. This is
in contrast to circled characters: Very few browsers can properly generate
arbitrary circled numbers, therefore conversion to bullet styles does not
easily allow an extension of the set of accessible circled numbers.

What to do if detected: In a proxy context or browser context no
action needs to be taken. When received in an editing context, substitution
of a bullet style may be appropriate. However, the same characters are very
often used as dingbat-like symbols in tables, so the user should have the
choice of whether to replace.

4.4 [Template]

Short description:

Reason for inclusion:

Problems when used in markup:

Problems with other uses:

Replacement markup:

What to do if detected: In a proxy context or browser context......
When received in an editing context,.... .

This technical report covers all relevant characters in the Unicode Standard,
Version 3.0.

As the Unicode Standard is updated and new characters get added, new characters
that are not suitable for markup may also be added. However, the Unicode
Technical Committee only introduces such characters where there is a very
strong industry requirement. As markup becomes more prevalent, the need for
such characters is reduced substantially. This report itself may be updated
periodically to give additional background information.

In the context of the Unicode Standard, the material in this technical report
is informative. However, other documents, particularly markup language
specifications, may specify conformance including normative references to
this document.

Changes from the initial draft: Fixed the header. Fixed the numbering. Fixed
the title. Put references to final version of data files based on naming
conventions. Minor wording changes. Added proposed language on annotation
characters to match example on FFFC. Posted for internal review by UTC and
W3C (AF)

The Unicode Consortium makes no expressed or implied warranty of any kind,
and assumes no liability for errors or omissions. No liability is assumed
for incidental and consequential damages in connection with or arising out
of the use of the information or programs contained in or accompanying this
technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered
in some jurisdictions.