On Wed, Apr 02, 2003 at 03:07:03PM -0500,
Martin Duerst <duerst@xxxxxx> wrote
a message of 28 lines which said:
> The danger of bundles being too big can easily happen for European
> languages, with a bundle that defines that all accented versions of
> a character are treated as the same as the base character.
Yes, see my previous message, in the thread "New Internet Draft on
registering IDNs". A typical example is the label
"3suisses-assurances" (which actually exists in '.fr') which has a
bundle of 306,250 labels with a table that uses (almost) all the
Latin-1 characters.
Not all Latin-1 characters are used in French, so we could downsize the
table and therefore the bundles. But, on the other hand, for a
registry like '.eu', we will need an even larger table since Europe
requires more than just Latin-1.
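To show why the bundles grow so fast, here is a minimal sketch of the
arithmetic: the size of a bundle is the product, over every character of
the label, of the number of variants the table allows for that character.
The VARIANTS table below is hypothetical and much smaller than a real
Latin-1 table; it is not the table I used for '.fr'.

```python
from math import prod

# Hypothetical variant table (an assumption for illustration only):
# each base letter maps to the characters treated as equivalent to it,
# the base letter included.
VARIANTS = {
    "a": "aàáâãäå",
    "c": "cç",
    "e": "eèéêë",
    "i": "iìíîï",
    "n": "nñ",
    "o": "oòóôõö",
    "u": "uùúûü",
    "y": "yýÿ",
}


def bundle_size(label: str) -> int:
    """Number of labels in the bundle of `label` under VARIANTS."""
    # Characters absent from the table have exactly one form: themselves.
    return prod(len(VARIANTS.get(ch, ch)) for ch in label)
```

With this table, bundle_size("cafe") is 2 * 7 * 1 * 5 = 70, and every
additional accentable letter multiplies the total again.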
> In that case, Paul's approach (also described by Adam) of using
> equivalence classes won't scale.
It doesn't scale if you want to actually generate the bundles and
publish them in a static zone file. I tried for the '.fr' zone which
is quite small - 150,000 domains - and the resulting zone file was
larger than '.com' even before the domains starting with the letter A
were fully processed. But you have other approaches:
* a dynamic DNS server like PowerDNS <http://www.powerdns.com/> with a
back-end that will match a label to its bundle at query-time,
* Option 2 or 3 of Paul's draft, which do not require actually
storing the complete bundle.
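The query-time matching in the first approach could look roughly like
the sketch below. It is not PowerDNS code and does not use its back-end
API; base_form, REGISTERED and resolve are hypothetical names, and the
address is a documentation placeholder. The idea is to store one entry
per bundle and to reduce each queried label to its base form before the
lookup. Note that Unicode decomposition only handles characters built
from a base letter plus combining marks, so letters such as "ø" or "ß"
would need an explicit table.

```python
import unicodedata


def base_form(label: str) -> str:
    """Collapse accented variants of a label to a single lookup key."""
    decomposed = unicodedata.normalize("NFD", label.lower())
    # Drop combining marks (accents) and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical registry: one record per bundle, keyed by the base form.
REGISTERED = {"3suisses-assurances": "192.0.2.1"}


def resolve(label: str):
    """Return the record of the bundle containing `label`, or None."""
    return REGISTERED.get(base_form(label))
```

Here resolve("3suisses-assurancès") finds the same record as the
unaccented label, without any of the other variants being stored.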
> What may work is that an accented character blocks the base
> character, but not characters with a different accent.
Interesting. We could also draw inspiration from most Web search
engines. They work this way: if there is no composed character in the
query, they search "accent-insensitive". If there is at least one,
they switch to "accent-sensitive".
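A minimal sketch of that search-engine rule, assuming that "composed
character" means a letter that decomposes into a base letter plus a
combining mark. The function names are hypothetical, and this is only
one possible reading of the behaviour, not how any particular engine
implements it.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_accent(text: str) -> bool:
    """True if at least one character carries a combining mark."""
    return any(unicodedata.combining(ch)
               for ch in unicodedata.normalize("NFD", text))


def matches(query: str, candidate: str) -> bool:
    """Accent-insensitive unless the query itself contains an accent."""
    query, candidate = query.lower(), candidate.lower()
    if has_accent(query):
        # Accent-sensitive: compare the normalized forms exactly.
        return (unicodedata.normalize("NFC", query)
                == unicodedata.normalize("NFC", candidate))
    # Accent-insensitive: ignore the accents of the candidate.
    return query == strip_accents(candidate)
```

So matches("cafe", "café") is true, while matches("café", "cafe") is
false: the plain query reaches every variant, the accented one only
reaches itself.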