I propose that we make the following addition to UAX #15. (This is a rough
draft, and would be refined in discussion.)

* Add the following text, probably in an appendix.

XX Bounded Reordering Format

There are certain environments where people would like to make use of
normalization, but where there are implementation constraints such as in
buffered serialization. Consider the extreme case of a string containing a 2
followed by 10,000 umlauts followed by one dot-below, then a 3. In the process of
normalization, the dot-below at the end must be reordered to immediately after
the 2, which means that 10,003 characters need to be considered as a whole
before the result can be output.

While such text is not illegal, it is not practically meaningful; yet the
possibility of encountering it forces a conformant, serializing implementation
to provide large buffer capacity, or provide a special exception mechanism just
for such degenerate cases. To address this situation, the following mechanism is
specified:

D5. A string is said to be in Bounded Reordering Format if it does not
contain any sequences of non-starters longer than 30 characters in length.

Such a string can be normalized in buffered serialization with buffer
size of 32 characters, or 128 bytes.

The value of 30 is chosen to be significantly beyond what is required
for any linguistic or technical usage. While it would have been feasible to
chose a smaller number, this value provides a very wide margin, yet is well
within the buffer size limits of practical implementations.

D6. The Bounded Reordering Process is defined as the process of
producing a string in Bounded Reordering Format by processing the text from
start to finish, inserting a CGJ character within long sequences of
non-starters. The exact position of the inserted CGJs are determined according
to the following logical process, which describes the generation of an output
string from an input string.

If the input string is empty, return.

Set nonStarterCount to zero.

For each code point C in the input string:

Produce the NFKD decomposition S.

If nonStarterCount plus the number of initial non-starters in S is
greater than 30, append a CGJ to the output string, and set the counter
to zero.

Append C to the output string.

If there are no starters in S, add the number of code points in S to
nonStarterCount.

Otherwise, set nonStarterCount to the number of trailing
non-starters in S (which may be zero).

Return the output string.

This mechanism ensures that not only is the resulting text in Bounded
Reordering Format, but that any normalization of the result is in Bounded
Reordering Format. This is true for any input string that doesn't contain
unassigned code points.

Implementations can optimize the above specification as long as they
produce the same results. In particular, the information used in Step 3 can
be precomputed: it does not require the actual normalization of the
character. In Unicode 5.0, for example, it consists of the following:

Count

Initial non-starters

Starters

Trailing non-starters

Example

1,112,646

0

1

0

-

all
others

793

0

1

1

00A8

DIAERESIS

248

0

1

2

01D5

LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON

36

0

1

3

1F82

GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI

385

1

0

0

0300

COMBINING GRAVE ACCENT

4

2

0

0

0344

COMBINING GREEK DIALYTIKA TONOS

With any normal text, the Bounded Reordering Process will not modify the text
at all. Where modification does result, it is no longer canonically equivalent
to the original, but the modifications are minor and don't disturb any
meaningful content. The modified text contains all of the content of the
original, with the only difference being that reordering is blocked across long
groups non-starters. Any text in Bounded Reordering Format can then be
normalized with very small buffers using any of the standard Normalization
Forms.

This logical process can be implemented in the same implementation pass as
normalization for efficient processing. In such case, the choice of whether to
do the Bounded Reordering Process or not can be controlled by an input
parameter.

* Add a conformance clause, so that people can refer to it.

UAX15-C4. A process that purports to transform text according to the Bounded
Reordering Process must do so in accordance with the specifications in this
document.