From: Paul Davis
Date: Mon, 27 Sep 2010 03:08:03 -0400
Subject: Re: Locale and rule based view collation
To: user@couchdb.apache.org
Content-Type: text/plain; charset=UTF-8
On Mon, Sep 27, 2010 at 2:59 AM, Noah Diewald wrote:
> On Sun, Sep 26, 2010 at 7:43 PM, Paul Davis wrote:
>> On Sun, Sep 26, 2010 at 8:37 PM, Noah Diewald wrote:
>>> On Sat, Sep 25, 2010 at 6:38 PM, Paul Davis wrote:
>>>> On Sat, Sep 25, 2010 at 7:21 PM, Chris Anderson wrote:
>>>>> On Sat, Sep 18, 2010 at 4:47 PM, Noah Diewald wrote:
>>>>>> I was wondering if there were any plans to make use of more of the ICU
>>>>>> collation API in CouchDB.
>>>>>>
>>>>>> I'm using CouchDB to make natural language documentation software and
>>>>>> it seems like a shame that I might have to use ICU for creating sort
>>>>>> keys to get sort orders right for view keys in certain languages when
>>>>>> ICU is already used internally by CouchDB. It kind of looks like
>>>>>> something could be added in at about the same place as the option for
>>>>>> case or no case collations in couch_icu_driver.c but I feel
>>>>>> underqualified to play around with it. I think that having an option in
>>>>>> the view to specify collation customization would be really great and it
>>>>>> must be something that even people working with less obscure languages
>>>>>> than I am could benefit from.
>>>>>>
>>>>>
>>>>> we definitely plan to make this configurable, just a matter of writing
>>>>> code. for now there might be a way to set it on a per-server-instance
>>>>> basis with environment variables. I am no expert on the topic, but I
>>>>> vaguely recall someone mentioning this possibility.
>>>>>
>>>>> Chris
>>>>>
>>>>>> --
>>>>>> Noah Diewald
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chris Anderson
>>>>> http://jchrisa.net
>>>>> http://couch.io
>>>>>
>>>>
>>>> I'm pretty sure that Chris is right that there's a server wide
>>>> environment setting that affects ICU collation, but I can't say with
>>>> any certainty.
>>>>
>>>> It's always been on the to-do list to provide the ability to have
>>>> language based sorts that are defined at the view or database level,
>>>> but as Chris points out, no one's gotten around to doing that.
>>>> Currently the major issues would revolve around recoding the
>>>> icu_driver to have smarts in how it's created, as well as refactoring
>>>> how we access the driver.
>>>>
>>>> If we bumped our minimum Erlang VM version to R13, writing this as a
>>>> NIF would probably be orders of magnitude easier because of resource
>>>> types and what not.
>>>>
>>>> Once those hard parts are figured out, exposing it to the outside
>>>> world should be as easy as going through the bike shedding motions on
>>>> what the _design/doc syntax would look like.
>>>>
>>>> HTH,
>>>> Paul Davis
>>>>
>>>
>>> It is great to know that this type of thing is on the todo list. If
>>> custom rules were supported and not just predefined locales, some of
>>> the questionable NIFs I'm writing to make sort keys in my application
>>> layer could be removed some day and life would be simpler.
>>>
>>> I don't think that the environment variables help me personally with
>>> supporting multiple languages with different sort orders, especially
>>> since the collation customizations for two of the languages that I'm
>>> focusing on require custom rules. It would be really awesome if
>>> CouchDB supported ICU custom collation rules in views right out of the
>>> box. It might go a long way to making CouchDB a favorite with
>>> linguists. (CouchDB should be a favorite with linguists anyway because
>>> it is such a pleasure to use but this could make it extra favorite.)
>>>
>>> Thank you both for the replies.
>>>
>>> --
>>> Noah Diewald
>>>
>>
>> I'm not sure what you mean by custom rules. I'm not extremely familiar
>> with the collation API, but as I recall it had a thing that allowed a
>> user to pass a string based config to it that it would use to affect
>> the collation algorithm. Are you needing something beyond that?
>>
>> Paul Davis
>>
>
> I don't think I'm needing anything more if we're talking about the
> same thing but maybe we're not.
>
> Sorry about the "customization rule" stuff. Now that I look back, the
> ICU documentation consistently calls them tailoring rules, sorry to be
> unclear. I'm just learning this stuff.
>
> Here is my understanding of instantiating ICU collators just to see if
> we are on the same page.
>
> There are two ways of instantiating collators. The predefined
> collators are instantiated with locale strings like "en_US". Custom
> collators are instantiated using tailoring rules.[1]
>
> The ICU users guide says that a tailoring rule "overrides the default
> order of code points and the values of the ICU Collation Service
> attributes",[2] which seems like a strange definition because
> tailoring allows one to specify complex base letters that consist of
> more than one code point. UTS 10 says "Tailoring is any well-defined
> syntax that takes the Default Unicode Collation Element Table and
> produces another well-formed Unicode Collation Element Table."[3] In
> ICU a tailoring rule is a string that looks like this:
>
> "& C < č <<< Č < ć <<< Ć"
>
> So a string is used for configuration in both cases of collator
> instantiation, but a different API function is used to instantiate the
> collator depending on whether one is using a predefined collator or a
> tailoring rule. Any way of instantiating an ICU collator other than
> passing in an empty string or "root" as the locale may or may not
> result in a custom UCET derived from the DUCET, so it was not a good
> idea to just talk about customization since that is vague.
>
> I'm dealing with languages that require tailoring and it is likely
> that most people wouldn't need tailoring just to be able to use a
> specific language for a specific view and that specifying a locale
> would be just fine. On the other hand, tailoring is very powerful and
> could be used to customize collation for reasons other than matching
> the alphabet of a rare language.
>
> Another aspect of what I need is that I specifically need different
> collation algorithms for different views. In one case I'll want to
> sort by English, in another I'll want to sort by Potawatomi or
> Menominee or something else.
>
> 1. http://userguide.icu-project.org/collation/api
> 2. http://userguide.icu-project.org/collation/customization
> 3. http://www.unicode.org/reports/tr10/
>
> --
> Noah Diewald
>
Cool. I'm not concerned about a small difference in which API function
gets selected. I was just concerned for a bit that you were doing things
like passing function pointers to an API, which would increase the
overhead by a couple of orders of magnitude. My earlier characterization
of the level of difficulty still seems about right.
Paul Davis
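
For anyone following the thread, here is a minimal sketch of the
application-layer workaround mentioned above: building sort keys by hand so
that keys collate the way the tailoring rule "& C < č <<< Č < ć <<< Ć" would
order them. It is a toy in plain Python, not ICU and not CouchDB code. The
names LETTERS, build_weights, and sort_key are hypothetical, the alphabet is
a small Latin subset, and only primary (letter) and tertiary (case)
differences are modeled.

```python
import string
import unicodedata

# Assumed alphabet: basic Latin letters with "č" and "ć" inserted as separate
# primary letters right after "c", mimicking "& C < č <<< Č < ć <<< Ć".
LETTERS = list(string.ascii_lowercase)
_after_c = LETTERS.index("c") + 1
LETTERS[_after_c:_after_c] = ["č", "ć"]


def build_weights(letters):
    """Map each character to a (primary, tertiary) weight pair."""
    weights = {}
    for primary, lower in enumerate(letters, start=1):
        weights[lower] = (primary, 0)          # lowercase sorts first
        weights[lower.upper()] = (primary, 1)  # uppercase differs only in case
    return weights


WEIGHTS = build_weights(LETTERS)
UNKNOWN_BASE = len(LETTERS) + 1  # unknown characters sort after the alphabet


def sort_key(text):
    """Return [primary weights, tertiary weights] for a string.

    All primary weights are compared before any tertiary weight, so case
    only breaks ties between otherwise identical strings.
    """
    # NFC so that "c" + combining caron is treated the same as "č".
    text = unicodedata.normalize("NFC", text)
    pairs = [WEIGHTS.get(ch, (UNKNOWN_BASE + ord(ch), 0)) for ch in text]
    return [[p for p, _ in pairs], [t for _, t in pairs]]


if __name__ == "__main__":
    words = ["ćup", "Cup", "dan", "čaj", "cup", "Čaj"]
    print(sorted(words, key=sort_key))
```

The key is a pair of integer lists, so it could in principle be emitted as
a view key, since CouchDB compares arrays element by element. A real
application would more likely take its sort keys from ICU itself, as
described earlier in the thread.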