I think it would be useful to have a helper function akin to Microsoft's
ScriptItemizeOpenType()* that breaks a Unicode string into individually
shapeable items (runs) and provides an array of feature tags for each
shapeable item for OpenType processing.

Thanks Adam. So far the focus has been to unify the shaping logic (most
important for Indic). Itemization, while pretty well defined, is something
everyone does slightly differently. It requires:

Except for the first item which is well-defined by Unicode, the other steps
are less well-defined and different use cases require slightly different
solutions. For example, web browsers have very strict font assignment rules
that follow the CSS spec. Other applications, less so. It would be harder to
justify using a unified itemizer. At least initially. But yes, that's one of
the logical next steps.
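
To make the script-segmentation part of itemization concrete, here is a minimal sketch in plain C. Everything in it is hypothetical and not HarfBuzz or Uniscribe API: the `toy_` names, the handful of hard-coded ranges standing in for the Unicode Script property, and the rule that common characters (spaces, digits, punctuation) simply join the current run. It ignores bidi, font assignment and feature tags, which are exactly the parts where use cases diverge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy script classes; a real itemizer would use the Unicode Script property. */
typedef enum {
    SCRIPT_COMMON, SCRIPT_LATIN, SCRIPT_GREEK,
    SCRIPT_CYRILLIC, SCRIPT_HEBREW, SCRIPT_ARABIC
} toy_script_t;

typedef struct {
    size_t       start;   /* index of the first code point in the run */
    size_t       length;  /* number of code points in the run */
    toy_script_t script;
} toy_item_t;

/* Hypothetical, deliberately coarse range lookup (assumption, not real data). */
static toy_script_t toy_script_of(uint32_t cp)
{
    if ((cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A) ||
        (cp >= 0xC0 && cp <= 0x24F)) return SCRIPT_LATIN;
    if (cp >= 0x370 && cp <= 0x3FF) return SCRIPT_GREEK;
    if (cp >= 0x400 && cp <= 0x4FF) return SCRIPT_CYRILLIC;
    if (cp >= 0x590 && cp <= 0x5FF) return SCRIPT_HEBREW;
    if (cp >= 0x600 && cp <= 0x6FF) return SCRIPT_ARABIC;
    return SCRIPT_COMMON;
}

/* Splits text[0..len) into script runs and returns the number of items
 * written. Stops early (truncating the output) once max_items is reached. */
static size_t toy_itemize(const uint32_t *text, size_t len,
                          toy_item_t *items, size_t max_items)
{
    size_t n = 0;
    assert(text != NULL && items != NULL);
    for (size_t i = 0; i < len; i++) {
        toy_script_t s = toy_script_of(text[i]);
        if (n > 0 && (s == SCRIPT_COMMON ||
                      items[n - 1].script == SCRIPT_COMMON ||
                      items[n - 1].script == s)) {
            /* A run that started with common characters adopts the first
             * real script it meets. */
            if (items[n - 1].script == SCRIPT_COMMON)
                items[n - 1].script = s;
            items[n - 1].length++;
            continue;
        }
        if (n == max_items)
            break;
        items[n].start = i;
        items[n].length = 1;
        items[n].script = s;
        n++;
    }
    return n;
}
```

With the code points for "ab αβ" this yields two items: a Latin run covering the first three code points (the space stays with it) and a Greek run covering the last two.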

I may have misinterpreted, but mask, lig_id and probably component feel OT-specific, in that a consumer of the output is unlikely to ever need them.

Yes and no. Mask is used to mark which user features should be applied to
which glyphs, and I think at least AAT can/will use that too. For lig_id and
component, they are not inherently OT-specific. They are implementation
details of how HarfBuzz implements the OT spec. We may decide to hide them
too, and just have another internal member. Individual shapers can use the
internal members as they wish then. That's actually a good idea. Unless I
find a use for the client having access to those values, they'd better be hidden.
I'll make that change now.
I'm thinking about adding some other fields here though (without changing
the size). Things like justification points, etc.
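
As an illustration of what "mask marks which user features apply to which glyphs" can mean, here is a rough sketch in plain C. It is not the actual HarfBuzz layout: the `toy_` structs, the one-bit-per-feature allocation and the [start, end) range convention are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-glyph record; only the mask matters for this sketch. */
typedef struct {
    uint32_t codepoint;
    uint32_t mask;        /* bit i set => user feature i applies here */
} toy_glyph_info_t;

/* Hypothetical user feature: a tag plus the glyph range it covers. */
typedef struct {
    uint32_t tag;         /* OpenType-style four-byte tag */
    size_t   start;       /* first glyph index, inclusive */
    size_t   end;         /* last glyph index, exclusive */
} toy_feature_t;

#define TOY_TAG(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Assigns bit f to features[f] and sets it on every glyph in its range.
 * Ranges that run past the buffer are clamped to count. */
static void toy_set_masks(toy_glyph_info_t *info, size_t count,
                          const toy_feature_t *features, size_t num_features)
{
    assert(info != NULL && features != NULL && num_features <= 32);
    for (size_t f = 0; f < num_features; f++) {
        uint32_t bit = (uint32_t)1 << f;
        size_t end = features[f].end < count ? features[f].end : count;
        for (size_t i = features[f].start; i < end; i++)
            info[i].mask |= bit;
    }
}

/* A shaper would test this before applying the lookups of a feature.
 * feature_index must be below 32. */
static int toy_feature_applies(const toy_glyph_info_t *g, size_t feature_index)
{
    return (int)((g->mask >> feature_index) & 1u);
}
```

Nothing in this is tied to OpenType tables; any shaper that honours per-range user features could read the same bits.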

The disadvantage I see with having a single buffer that changes its contents from chars to glyphs is that then you lose the association map between underlying chars and glyphs. I suppose it can be recreated using the component information, but it's going to be problematic when it comes to cursor hit testing.

The decision is only relevant inside the hb_shape() call. The user has the
original text still. Please see the last part of my reply to Carl Worth.
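
For the hit-testing worry, one possible arrangement is sketched below in plain C. It assumes that each output glyph carries the index of the first source character it came from (called `cluster` here), so the client can map back into the text it still holds. The struct, the field names and the left-to-right-only logic are assumptions for illustration, not a description of the buffer as it stands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output glyph: id, source character index, and advance. */
typedef struct {
    uint32_t glyph_id;
    uint32_t cluster;     /* index of the first source character */
    int32_t  x_advance;
} toy_glyph_t;

/* Returns the source character index under horizontal offset x, or -1 when
 * x falls outside the run. Left-to-right runs only. */
static long toy_hit_test(const toy_glyph_t *glyphs, size_t count, int32_t x)
{
    int32_t pen = 0;
    assert(glyphs != NULL);
    if (x < 0)
        return -1;
    for (size_t i = 0; i < count; i++) {
        if (x < pen + glyphs[i].x_advance)
            return (long)glyphs[i].cluster;
        pen += glyphs[i].x_advance;
    }
    return -1;
}
```

For "office" shaped with an ffi ligature, the ligature glyph would carry cluster 1 and the following "c" cluster 4, so a hit anywhere on the ligature maps to character 1. Placing the cursor inside the ligature needs extra information, which this sketch does not attempt.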

For script and language, it's a bit more delicate. I'm also convinced that
they belong to the buffer. With script it's fine, but with language it
introduces a small implementation hassle: that I would have to deal with
copying/interning language tags, something I was trying to avoid. The other
options are:
- Extra parameters to hb_shape(). I'd rather not do this. Keeping details
like this out of the main API and adding setters where appropriate makes the
API cleaner and more extensible.
- Use the feature dict for them too. I'm strictly against this one. The
feature dict is already too high-level for my taste.
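
To show what the "copying/interning language tags" hassle amounts to if script and language live on the buffer, here is a small sketch in plain C. All names (`toy_buffer_t`, `toy_language_intern`, the setters) and the fixed-size table are hypothetical, and it compares tags case-sensitively, which a real implementation probably should not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_LANGS 64
#define TOY_LANG_LEN  16

/* An interned tag: two equal tags always yield the same pointer, so
 * comparing languages is a pointer comparison. */
typedef const char *toy_language_t;

static char   toy_lang_table[TOY_MAX_LANGS][TOY_LANG_LEN];
static size_t toy_lang_count = 0;

/* Returns a stable pointer for tag, copying it on first use. Returns NULL
 * when the table is full or the tag does not fit. */
static toy_language_t toy_language_intern(const char *tag)
{
    assert(tag != NULL);
    for (size_t i = 0; i < toy_lang_count; i++)
        if (strcmp(toy_lang_table[i], tag) == 0)
            return toy_lang_table[i];
    if (toy_lang_count == TOY_MAX_LANGS || strlen(tag) >= TOY_LANG_LEN)
        return NULL;
    strcpy(toy_lang_table[toy_lang_count], tag);
    return toy_lang_table[toy_lang_count++];
}

/* Hypothetical buffer carrying only the two properties under discussion. */
typedef struct {
    uint32_t       script;
    toy_language_t language;
} toy_buffer_t;

static void toy_buffer_set_script(toy_buffer_t *b, uint32_t script)
{
    assert(b != NULL);
    b->script = script;
}

/* The buffer never keeps the caller's pointer, only the interned copy. */
static void toy_buffer_set_language(toy_buffer_t *b, const char *tag)
{
    assert(b != NULL);
    b->language = toy_language_intern(tag);
}
```

The shaping entry point keeps its signature, and the caller is free to release or reuse its own string as soon as the setter returns.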

Why do you say the feature dict is too high level? It seems just the right place, to me. Or it could be stored in the buffer, since it is buffer specific.

It's just not as efficient and easy to use as I'd like. But it's just fine for
user features, yes.

One question: is a buffer representing a single run for which the language doesn't change or is it potentially multiple runs that are yet to be segmented?

The way I'd recommend using it is for one run. The API already limits it to
one font anyway. Doesn't mean we can't add API to do multiple runs in the
future though.
behdad