(Actually, this seems more like a job for a type class.)
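
A minimal sketch of the type class idea, for illustration only. The class
TextLike, the wrappers Utf8Text and Utf16Text, and the function convert are
hypothetical names that exist in no library; only the Data.Text,
Data.Text.Encoding and Data.ByteString calls are real. Each representation
gets the same API, and a conversion happens only when it is asked for:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical wrappers: each one fixes a single in-memory encoding.
-- The bytes are assumed to be valid because they only come from fromString'.
newtype Utf8Text  = Utf8Text  B.ByteString
newtype Utf16Text = Utf16Text B.ByteString

-- Hypothetical class: one API, implemented once per representation.
-- Only the performance characteristics differ between instances.
class TextLike t where
  fromString' :: String -> t
  toString'   :: t -> String
  byteSize    :: t -> Int

instance TextLike Utf8Text where
  fromString' = Utf8Text . E.encodeUtf8 . T.pack
  toString' (Utf8Text b) = T.unpack (E.decodeUtf8 b)
  byteSize (Utf8Text b) = B.length b

instance TextLike Utf16Text where
  fromString' = Utf16Text . E.encodeUtf16LE . T.pack
  toString' (Utf16Text b) = T.unpack (E.decodeUtf16LE b)
  byteSize (Utf16Text b) = B.length b

-- Conversion is explicit: it shows up in the types, so it cannot be
-- triggered by accident. Going through String is slow; it is only a sketch.
convert :: (TextLike a, TextLike b) => a -> b
convert = fromString' . toString'
```

Compared with an Either-style sum type, the representation is chosen
statically here, so the unintentional conversions worried about below would
have to be written out by hand.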
2010/8/17 Gábor Lehel <illissius at gmail.com>:
> Someone mentioned earlier that IHHO all of this messing around with
> encodings and conversions should be handled transparently, and I guess
> you could do something like have the internal representation be along
> the lines of Either UTF8 UTF16 (or perhaps even more encodings), and
> then implement every function in the API equivalently for each
> representation (with only the performance characteristics differing),
> with input/output functions being specialized for each encoding, and
> then only do a conversion when necessary or explicitly requested. But
> I assume that would have other problems (like the implicit conversions
> causing hard-to-track-down performance bugs when they're triggered
> unintentionally).
>
> On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles <pumpkingod at gmail.com> wrote:
>> Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
>> UTF-16 "segments" in its list of strict text elements :) Then big chunks of
>> western text will be encoded efficiently, and same with CJK! Not sure what
>> to do about strict Data.Text though :)
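
One way the mixed-segment idea could look, as a rough sketch. Chunk,
MixedText, encodeChunk and the other names are hypothetical; Data.Text.Lazy
does not work this way. Each strict segment is stored in whichever of the two
encodings is smaller for it:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical chunk type: a strict segment tagged with its encoding.
data Chunk
  = Utf8Chunk  B.ByteString
  | Utf16Chunk B.ByteString

-- Hypothetical lazy text: an ordinary (lazy) list of strict chunks.
type MixedText = [Chunk]

-- Encode one strict segment with whichever encoding takes fewer bytes.
encodeChunk :: T.Text -> Chunk
encodeChunk t
  | B.length u8 <= B.length u16 = Utf8Chunk u8
  | otherwise                   = Utf16Chunk u16
  where
    u8  = E.encodeUtf8 t
    u16 = E.encodeUtf16LE t

-- Decode a chunk according to its tag.
decodeChunk :: Chunk -> T.Text
decodeChunk (Utf8Chunk b)  = E.decodeUtf8 b
decodeChunk (Utf16Chunk b) = E.decodeUtf16LE b

chunkSize :: Chunk -> Int
chunkSize (Utf8Chunk b)  = B.length b
chunkSize (Utf16Chunk b) = B.length b

fromSegments :: [T.Text] -> MixedText
fromSegments = map encodeChunk

toText :: MixedText -> T.Text
toText = T.concat . map decodeChunk

sizeInBytes :: MixedText -> Int
sizeInBytes = sum . map chunkSize
```

With segments "hello" and three CJK ideograms, the first chunk ends up as
UTF-8 (5 bytes) and the second as UTF-16 (6 bytes). How to pick segment
boundaries is left open here, as it is in the message above.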
>>
>> On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde <ketil at malde.org> wrote:
>>>
>>> Michael Snoyman <michael at snoyman.com> writes:
>>>
>>> > As far as space usage, you are correct that CJK data will take up more
>>> > memory in UTF-8 than UTF-16.
>>>
>>> With the danger of sounding ... alphabetist? as well as belaboring a
>>> point I agree is irrelevant (the storage format):
>>>
>>> I'd point out that it seems at least as unfair to optimize for CJK at
>>> the cost of Western languages. UTF-16 uses two bytes for (most) CJK
>>> ideograms, and (all, I think) characters in Western and other phonetic
>>> scripts. UTF-8 uses one to two bytes for a lot of Western alphabets,
>>> but three for CJK ideograms.
>>>
>>> Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
>>> while an ASCII letter is about six bits. Thus, the information density
>>> of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
>>> 15/16 vs 6/16 for UTF-16. In other words a given document translated
>>> between Chinese and English should occupy roughly the same space in
>>> UTF-8, but be 2.5 times longer in English for UTF-16.
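
The arithmetic above can be checked with a few lines of Haskell. This is only
a sketch under the figures given in the message (about 20K ideograms, about
six bits per ASCII letter, three UTF-8 bytes per ideogram); the names are made
up for the example:

```haskell
import Data.Ratio ((%))

-- Bits of information per symbol, using the estimates from the message.
cjkInfoBits :: Integer
cjkInfoBits = ceiling (logBase 2 20000 :: Double)  -- log2 20000 is about 14.3

asciiInfoBits :: Integer
asciiInfoBits = 6

-- Information density = information bits / storage bits per symbol.
density :: Integer -> Integer -> Rational
density infoBits storageBits = infoBits % storageBits

utf8Cjk, utf8Ascii, utf16Cjk, utf16Ascii :: Rational
utf8Cjk    = density cjkInfoBits 24    -- 5/8
utf8Ascii  = density asciiInfoBits 8   -- 3/4, i.e. 6/8
utf16Cjk   = density cjkInfoBits 16    -- 15/16
utf16Ascii = density asciiInfoBits 16  -- 3/8, i.e. 6/16

-- Size of the English text relative to the Chinese one: size is inversely
-- proportional to density, so the ratio of sizes is cjk / ascii.
englishOverChinese :: Rational -> Rational -> Rational
englishOverChinese cjkDensity asciiDensity = cjkDensity / asciiDensity

utf8Ratio, utf16Ratio :: Rational
utf8Ratio  = englishOverChinese utf8Cjk utf8Ascii    -- 5/6
utf16Ratio = englishOverChinese utf16Cjk utf16Ascii  -- 5/2
```

So under these assumptions the English version is about 0.83 times the size
of the Chinese one in UTF-8, and 2.5 times the size in UTF-16.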
>>>
>>> -k
>>> --
>>> If I haven't seen further, it is by standing in the footprints of giants
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> Haskell-Cafe at haskell.org
>>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
> --
> Work is punishment for failing to procrastinate effectively.
>