Mike wrote:
> We decided to adopt Will's suggestion of adopting the official Unicode
> titlecase algorithm. Unfortunately, I can't find a specification of it
> on the web right now: The Unicode site says PDFs for the standard are
> temporarily offline. Could somebody help me out?

If you're asking for citations, you should cite:

    the main Unicode standard (currently 5.0)
    Unicode Standard Annex #29
    ( http://www.unicode.org/reports/tr29/ )

If you're asking for a simple specification of the
algorithm, there isn't one. About the best you can
do is to change the sentence that, in 5.92, says

    The string-titlecase procedure converts the
    first character to title case in each
    contiguous sequence of cased characters
    within string, and it downcases all other
    cased characters; for the purposes of
    detecting cased-character sequences,
    case-ignorable characters are ignored (i.e.
    they do not interrupt the sequence).

to

    The string-titlecase procedure converts the
    first cased character of each word to title
    case, and downcases all other cased characters.

That sounds simple, but the hair lies in the
definition of a word. The Unicode standard
explicitly defers to UAX 29 on this point, and
the definition of a word in UAX 29 is neither
simple nor categorical. (In particular, word
breaking is allowed to be locale-sensitive.)

The fact that the Unicode committee understands
that categorical specifications are not always
desirable does not bother me, of course; I point
it out only because it might bother some of the
other editors.
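
For illustration only, the revised wording can be sketched in a
few lines once some definition of "word" is fixed. The Python
below is a rough sketch under stated assumptions, not the UAX 29
algorithm: the WORD pattern is a made-up stand-in (alphanumeric
runs, optionally joined by an apostrophe), it is not
locale-sensitive, and is_cased, titlecase_word, and
string_titlecase are hypothetical helper names.

```python
import re

# ASSUMPTION: a crude stand-in for UAX 29 word breaking. A "word" here is a
# run of alphanumerics, optionally joined by ' or U+2019. Combining marks and
# locale-specific rules are deliberately ignored.
WORD = re.compile(r"[^\W_]+(?:['\u2019][^\W_]+)*")


def is_cased(ch):
    """Approximate 'cased': the character has an upper or lower mapping."""
    return ch.lower() != ch or ch.upper() != ch


def titlecase_word(word):
    """Titlecase the first cased character of word, downcase what follows."""
    for i, ch in enumerate(word):
        if is_cased(ch):
            # Lowercase from the first cased character onward so that
            # context-dependent mappings see the whole word, then drop the
            # lowercased first character and put its titlecase form in front.
            lowered = word[i:].lower()
            return word[:i] + ch.title() + lowered[len(ch.lower()):]
    return word  # no cased characters at all, e.g. "123"


def string_titlecase(s):
    """Apply titlecase_word to every match of the assumed WORD pattern."""
    return WORD.sub(lambda m: titlecase_word(m.group(0)), s)
```

With this particular pattern, "it's a dog's life" comes out as
"It's A Dog's Life". Drop the apostrophe from WORD and you get
"It'S" instead, which is the point: the casing rule is trivial,
and all of the behavior is decided by the word definition.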

I guess I should also point out that UAX 29 is
implicitly part of the specification for both
string-downcase and string-foldcase, since the
casing of Greek sigma is defined with respect to
word breaks, which are defined by UAX 29.
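
The sigma dependence is easy to observe in an existing
implementation. The sketch below uses Python's built-in
str.lower(); sigma_forms is a hypothetical helper written for
this example. As far as I know, Python decides "final" by looking
at neighboring cased and case-ignorable characters rather than by
running UAX 29 word breaking, so treat it as an approximation of
the rule and not as a reference implementation.

```python
def sigma_forms(text):
    """Return the lowercase sigma variants that str.lower() chose, in order."""
    return [ch for ch in text.lower() if ch in "\u03c3\u03c2"]  # sigma, final sigma


# The same capital letter downcases differently depending on its position.
word = "\u03a3\u039f\u03a6\u039f\u03a3"  # capital sigma at both ends
assert sigma_forms(word) == ["\u03c3", "\u03c2"]

# A capital sigma standing alone is not treated as word-final.
assert sigma_forms("\u03a3") == ["\u03c3"]
```

So downcasing a capital sigma requires knowing where the word
ends, which is exactly the information a character-by-character
mapping does not have.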

Will