> Every time UAX 14 comes up, some member of the WG notes that taking UAX
> 14 literally doesn't work well. Therefore I've been careful to reference
> it, but leave that reference non-normative so that implementors can apply
> their own judgement to the information it contains.
In Unicode 5.0 we've clearly separated those statements in UAX#14 that
speak about the characters that one could consider "line break
controls", i.e. that were encoded to provide specific interaction with
line breaking, from those statements that speak about all other
characters, the line break behavior of which results from convention.
To enable reliable interchange, the behavior of the control-like
characters should be as uniform as possible; therefore we've made their
identity and behavior normative in Unicode. The behavior of all other
characters is subject to stylistic, orthographic and typographic
conventions, which in many cases require explicit tailoring. The case
made about Korean having two accepted modes of line breaking is
explicitly recognized in the UAX#14 document.
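
To illustrate the split, here is a minimal sketch in Python of how an
implementation might accept tailoring for ordinary characters while
refusing it for the line break controls. The names LINE_BREAK_CONTROLS
and apply_tailoring are hypothetical, and the table is only a small
illustrative subset of the control-like characters, not the full
normative set from UAX#14.

```python
# Illustrative subset only (an assumption of this sketch, not a complete list).
# Values are the UAX#14 line break classes of these characters.
LINE_BREAK_CONTROLS = {
    "\u00A0": "GL",  # NO-BREAK SPACE (NBSP)
    "\u2011": "GL",  # NON-BREAKING HYPHEN (NBHY)
    "\u2060": "WJ",  # WORD JOINER
    "\u200B": "ZW",  # ZERO WIDTH SPACE
}


def apply_tailoring(defaults, tailoring):
    """Merge a tailoring into the default class map (hypothetical helper).

    Ordinary characters may be reassigned freely; any attempt to
    reassign a line break control is rejected.
    """
    blocked = sorted(ch for ch in tailoring if ch in LINE_BREAK_CONTROLS)
    if blocked:
        names = ", ".join(f"U+{ord(ch):04X}" for ch in blocked)
        raise ValueError(f"cannot tailor line break controls: {names}")
    merged = dict(defaults)
    merged.update(tailoring)
    # The controls always keep their fixed classes.
    merged.update(LINE_BREAK_CONTROLS)
    return merged
```

For example, a Korean tailoring that switches U+AC00 from H2 to AL (to
get space-based breaking) would be accepted by this sketch, while a
tailoring that tries to change NBHY would raise an error.
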
I believe it helps to recognize clearly that a NO BREAK character such
as NBHY is encoded only because it allows users to prevent line breaks,
and that allowing it to be tailored defeats its purpose. That
recognition will ultimately help clarify what UAX#14 attempts to do,
which is to give a precise description of how these special characters
are to be treated so that they work as expected, in the context of a
baseline implementation for all characters.
The design point for the latter is that it should be suitable for
mixed-language, mixed-script text and work reasonably well for simple systems
(small devices) or simple text solutions on bigger systems. As a result,
it adopts the treatment of punctuation based on East Asian line breaking
concepts while keeping runs of 'words' and 'numbers' in other scripts
together, unless separated by spaces, hyphens and the like. (The support
of South East Asian scripts requires additional specifications not
provided in UAX#14 - a known limitation).
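
As a rough illustration of the "keep runs together unless separated by
spaces, hyphens and the like" part of that baseline, here is a
deliberately simplified Python sketch. It is not the UAX#14 algorithm:
the function name break_opportunities is hypothetical, only the space
and HYPHEN-MINUS are treated as separators, and the East Asian
punctuation handling is not modeled at all.

```python
# Characters that must never have a break before or after them in this sketch.
NO_BREAK = {
    "\u00A0",  # NO-BREAK SPACE (NBSP)
    "\u2011",  # NON-BREAKING HYPHEN (NBHY)
    "\u2060",  # WORD JOINER
}


def break_opportunities(text):
    """Return the indices i at which a line may break before text[i].

    Simplified assumption: a break is allowed after a space or a
    HYPHEN-MINUS, never before a space, and never next to a
    no-break character.
    """
    result = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev in NO_BREAK or cur in NO_BREAK:
            continue  # the no-break characters glue their neighbors together
        if cur == " ":
            continue  # spaces stay at the end of the line
        if prev == " " or prev == "-":
            result.append(i)
    return result
```

With this sketch, "well-known fact" can break after the hyphen and after
the space, while the same text written with NBHY in place of the hyphen
can break only after the space.
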
CSS is of course free to support many different modes of line breaking
for the regular characters, or even approach this subject differently -
because, again, the conventions for the large majority of characters are
neither universal, nor unique. (We are of course interested in improving
our baseline implementation; if a better default generic baseline
behavior exists, I'd like to find out about it - with rules and
examples, if possible).
However, for the line break controls, CSS should *not* deviate from
UAX#14, because doing so effectively redefines characters that were
encoded for their line break behavior. This does not mean that we think
UAX#14 is infallible: we just found out that our specification of the
way that NBHY, NBSP etc. interact with hyphens and soft hyphens was
inadvertently made too restrictive. The 5.0 formulation is counter to
widespread practice and needs for Polish and Portuguese. That is being
fixed in 5.0.1. Therefore, instead of silently deviating, the CSS
editors should make sure that the normative part of the UAX#14
specification is corrected (if necessary) and then follow it - and
discourage any deviation from that normative part by implementations.
For the non-normative part, as I already pointed out, we are interested
in learning about specific improvements, with the goal of making
something like UAX#14 an attractive baseline implementation in situations
where tailoring is either not possible or not feasible.
A./