Philip Hazel's PCRE implementation does UTF-8 rather well. If you
are looking for a UTF-8 base, it may be worth a look.

Thx, that's pretty much the right thing, but it's also kind of a biggie.
Their character property table amounts to 88K,
while the Plan 9 one is about 12K
(not 2K; I dropped the 1 in my previous post).

FYI, from the PCRE 6.7 ChangeLog:
Version 6.5 01-Feb-06
[...]
18. Changes to the handling of Unicode character properties:
(a) Updated the table to Unicode 4.1.0.
(b) Recognize characters that are not in the table as "Cn" (undefined).

(c) I revised the way the table is implemented to a much improved format
which includes recognition of ranges. It now supports the ranges that
are defined in UnicodeData.txt, and it also amalgamates other
characters into ranges. This has reduced the number of entries in the
table from around 16,000 to around 3,000, thus reducing its size
considerably. I realized I did not need to use a tree structure after
all - a binary chop search is just as efficient. Having reduced the
number of entries, I extended their size from 6 bytes to 8 bytes to
allow for more data.

(d) Added support for Unicode script names via properties such as \p{Han}.
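
For anyone wondering what item (c) looks like in practice, here is a minimal sketch of a sorted range table searched by binary chop. It is not PCRE's actual code or layout: the struct fields, the enum values, the four-entry table and the function names are all made up for illustration. The only things taken from the ChangeLog are the ideas of 8-byte entries, ranges instead of single characters, and "Cn" for anything not in the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative subsets only; a real table needs every category and script. */
enum { CAT_CN = 0, CAT_LU, CAT_LL, CAT_ND, CAT_LO };
enum { SCR_COMMON = 0, SCR_LATIN, SCR_HAN };

/* Hypothetical 8-byte entry (assumed layout, not PCRE's). */
typedef struct {
    uint32_t first;     /* first code point in the range */
    uint16_t extra;     /* how many more code points follow `first` */
    uint8_t  category;  /* general category, e.g. CAT_LU */
    uint8_t  script;    /* script, e.g. SCR_HAN */
} prop_range;

/* Must be sorted by `first`, with no overlapping ranges. */
static const prop_range table[] = {
    { 0x0030, 9,      CAT_ND, SCR_COMMON },  /* '0'..'9' */
    { 0x0041, 25,     CAT_LU, SCR_LATIN  },  /* 'A'..'Z' */
    { 0x0061, 25,     CAT_LL, SCR_LATIN  },  /* 'a'..'z' */
    { 0x4E00, 0x51A5, CAT_LO, SCR_HAN    },  /* U+4E00..U+9FA5 */
};

/* Binary chop over the ranges; returns NULL if c is in no range. */
const prop_range *find_range(uint32_t c)
{
    size_t lo = 0;
    size_t hi = sizeof table / sizeof table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const prop_range *r = &table[mid];

        if (c < r->first)
            hi = mid;                       /* c lies below this range */
        else if (c > r->first + r->extra)
            lo = mid + 1;                   /* c lies above this range */
        else
            return r;                       /* c is inside this range */
    }
    return NULL;
}

/* Characters missing from the table are reported as Cn, as in item (b). */
int char_category(uint32_t c)
{
    const prop_range *r = find_range(c);
    return r ? r->category : CAT_CN;
}

/* Script lookup in the spirit of \p{Han}; the SCR_COMMON fallback is an
   arbitrary choice made for this sketch. */
int char_script(uint32_t c)
{
    const prop_range *r = find_range(c);
    return r ? r->script : SCR_COMMON;
}
```

With this toy table, char_script(0x4E2D) gives SCR_HAN and char_category(0x0378) falls through to CAT_CN. Going by the ChangeLog's round numbers, 16,000 entries of 6 bytes is about 96,000 bytes and 3,000 entries of 8 bytes is about 24,000 bytes, and a chop over 3,000 entries needs at most 12 probes.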