Ticket #1120
(new enhancement)

UTF8-Bytestring Pango interface.

Reported by:

guest

Owned by:

axel

Priority:

normal

Milestone:

Component:

Pango bindings

Version:

0.9.12

Keywords:

Cc:

jeanphilippe.bernardy@…

Description

The Pango binding currently jumps through hoops to provide a String-based interface for Pango functions. If the user code uses UTF-8 internally (as Yi does), this is harmful from both a performance and an ease-of-use point of view: offsets have to be corrected twice, in opposite directions.
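
To illustrate the double correction, here is a minimal sketch of the two offset conversions involved. It is not the code used by the binding or by Yi; the names `utf8Width`, `charToByteOffset` and `byteToCharOffset` are hypothetical, and the sketch assumes the text is held as a plain `String`.

```haskell
import Data.Char (ord)

-- Number of bytes a single Char occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Character index -> UTF-8 byte offset (the direction a String-based
-- wrapper needs before handing an index to a byte-oriented C API).
charToByteOffset :: String -> Int -> Int
charToByteOffset s i = sum (map utf8Width (take i s))

-- UTF-8 byte offset -> character index (the direction needed for indices
-- coming back). An offset inside a multi-byte character is rounded down
-- to the character that contains it.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset s b =
  length (takeWhile (<= b) (drop 1 (scanl (+) 0 (map utf8Width s))))
```

For example, `charToByteOffset "h\233llo" 2` is 3, because the second character (é) takes two bytes. A client that already stores UTF-8 has to undo each of these conversions on its own side, which is the double correction described above.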

It would be useful (and perhaps not too difficult) to provide an interface directly based on (UTF8) bytestrings.

Change History

I am surprised by this, actually. I thought you would internally keep Unicode strings rather than raw byte sequences. I thought it would be prudent in every application to avoid mixing multi-byte characters with composite characters. Using Unicode strings gets rid of the first obstacle and leaves only one difficulty to deal with. So Yi does both at once? Or does it have its own abstractions?

In principle, we can provide an interface to Pango using byte arrays but this would be a lot of work for few users and for those who don't need it, it would be confusing. If you want this interface for pure performance reasons, then I think I would politely decline this request unless you have evidence that conversion back and forth is really a bottleneck.

A reason for using UTF-8 internally is that I checked the GTK documentation, and it said it returned UTF-8 offsets. I did not imagine you'd adapt it in gtk2hs. :) It turns out the Yi code is not (much) more complex anyway: position in buffer is used abstractly almost everywhere.
Also, since we used bytestrings for performance reasons anyway, it was rather natural to encode Unicode as UTF-8 in them. An obvious additional benefit is saved memory in the usual ASCII case.
I was guessing this is a common use-case, maybe I'm wrong.

Also, using PangoLayout is quite CPU-intensive. I can't tell the share of the Haskell layer though.

Can I suggest that a better approach is to use an abstract type like ByteString, but one that represents a sequence of Unicode Chars rather than bytes. That's a much nicer interface than a ByteString which is assumed to be valid UTF8.

A student of mine is implementing a Unicode text type with an external api and internal representation and performance that is very similar to that of bytestrings. The work should be completed during this summer.

Then Gtk2Hs could have a nice interface and decent performance for large chunks of text.

It sounds as if you are suggesting an additional interface that re-implements every existing Pango function using a PackedString such that the UTF8 <-> Unicode conversion is still done by our Pango binding. If you'd actually use ByteString where every element presumably occupies 8 bits, how would you store Unicode characters in it?

What JPB suggests is an interface using raw UTF8 strings that avoids the overhead of our Pango interface. I guess he would use some packed representation such as ByteString, but I'm not sure.

I don't have a problem adding both interfaces. However, to ensure that "normal" people are not confused, could we have these additional APIs in Pango.PackedString.* and Pango.ByteString.* or similar?

Duncan,
it sounds as if you are suggesting an additional interface that re-implements every existing Pango function using a PackedString such that the UTF8 <-> Unicode conversion is still done by our Pango binding. If you'd actually use ByteString where every element presumably occupies 8 bits, how would you store Unicode characters in it?

Yes, but using a Haskell type that is specifically designed to represent Unicode text. Internally it'd almost certainly use UTF-8 and provide fast conversion to UTF-8 encoded memory buffers.

It should not be necessary to have two full implementations of all functions since one set should be easy to implement in terms of the other.
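
The layering described above could look roughly like the following sketch. Everything here is hypothetical: `indexOfNewlineUtf8` is only a stand-in for a binding function that takes UTF-8 bytes and returns a byte index (it is not a Pango call), and `encodeUtf8` is a hand-written encoder used to keep the example self-contained. The point is only that the String-based function is a thin wrapper over the byte-based one.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import qualified Data.ByteString as B

-- Encode one Char as its UTF-8 byte sequence.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi k   = fromIntegral (n `shiftR` k)
    cont k = 0x80 .|. fromIntegral ((n `shiftR` k) .&. 0x3F)

encodeUtf8 :: String -> B.ByteString
encodeUtf8 = B.pack . concatMap encodeChar

-- Hypothetical byte-level primitive: takes UTF-8 bytes, returns a byte index.
indexOfNewlineUtf8 :: B.ByteString -> Maybe Int
indexOfNewlineUtf8 = B.elemIndex 0x0A

-- Byte index -> character index: count the bytes before the index that
-- start a character, i.e. that are not continuation bytes (10xxxxxx).
byteToCharIndex :: B.ByteString -> Int -> Int
byteToCharIndex bytes b = B.length (B.filter isStart (B.take b bytes))
  where isStart w = w .&. 0xC0 /= 0x80

-- String-level function written purely in terms of the byte-level one.
indexOfNewline :: String -> Maybe Int
indexOfNewline s = byteToCharIndex bytes <$> indexOfNewlineUtf8 bytes
  where bytes = encodeUtf8 s
```

With this shape, only the byte-level functions would need to touch the C library; the String-level set is conversion code around them.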

What JPB suggests is an interface using raw UTF8 strings that avoids the overhead of our Pango interface. I guess he would use some packed representation such as ByteString, but I'm not sure.

Right, and I'm suggesting something similar but using a type that is designed for the purpose rather than (ab)using ByteString to represent Unicode text.

I don't have a problem adding both interfaces. However, to ensure that "normal" people are not confused, could we have these additional APIs in Pango.PackedString.* and Pango.ByteString.* or similar?