How to count composed characters in NSString?

I have been trying to find this in the documentation and list archives
but without success so far. What is the best way to count the number
of characters in an NSString, taking account of the fact that some
characters may take up multiple 16-bit slots? Using
"- (NSUInteger)length" is thus not the right way. Using a series of calls
to "rangeOfComposedCharacterSequenceAtIndex:" seems like a
possibility, but I am not sure this would be the most efficient way.
Is there a simple and straightforward solution? I would like to be
able to display the number of characters in a string and not report
the wrong results for foreign languages (which I would get if I simply
took the length of the string). I need a solution that is not limited
to Leopard (i.e. CFStringTokenizer is not an option) and that
does not require using the lower level UCFindTextBreak.

> Hi,
>
> I have been trying to find this in the documentation and list
> archives but without success so far. What is the best way to count
> the number of characters in an NSString taking account of the fact
> that some characters may take up multiple 16 bit slots. Using "-
> (NSUInteger)length" is thus not the right way.

If I am reading you right, you are saying that -length will give you
the wrong results because some characters in Unicode are represented
by multibyte sequences. This is incorrect: -length will give you the
number of Unicode characters in a string, not the number of bytes.

However, there are characters like "combining grave accent" (U+0300)
that will usually not be displayed as a separate character, so there
is a potential problem if you want to know how many characters will
actually be displayed. The solution is to put the string into one of
the composed Normalization Forms with either
-precomposedStringWithCanonicalMapping (NFC) or
-precomposedStringWithCompatibilityMapping (NFKC), depending on your
needs. Then calling -length should give you the result you are looking
for.

> On Sep 27, 2008, at 2:23 PM, David Niemeijer wrote:
>
>> Hi,
>>
>> I have been trying to find this in the documentation and list
>> archives but without success so far. What is the best way to count
>> the number of characters in an NSString taking account of the fact
>> that some characters may take up multiple 16 bit slots. Using "-
>> (NSUInteger)length" is thus not the right way.
>
> If I am reading you right, you are saying that -length will give you
> the wrong results because some characters in Unicode are represented
> by multibyte sequences. This is incorrect: -length will give you the
> number of Unicode characters in a string [...].

This surprises me. I always thought that "length" gives you the
number of shorts in the UTF-16 encoding of the string, which - as I
used to think - is not the same as the number of Unicode code points
in this string.

> On Sun, 28 Sep 2008 03:27:48 -0500, Michael Gardner <gardnermj...> wrote:
>
>> On Sep 27, 2008, at 2:23 PM, David Niemeijer wrote:
>>
>>> Hi,
>>>
>>> I have been trying to find this in the documentation and list
>>> archives but without success so far. What is the best way to count
>>> the number of characters in an NSString taking account of the fact
>>> that some characters may take up multiple 16 bit slots. Using "-
>>> (NSUInteger)length" is thus not the right way.
>>
>> If I am reading you right, you are saying that -length will give you
>> the wrong results because some characters in Unicode are represented
>> by multibyte sequences. This is incorrect: -length will give you the
>> number of Unicode characters in a string [...].
>
> This surprises me. I always thought that "length" gives you the
> number of shorts in the UTF-16 encoding of the string, which - as I
> used to think - is not the same as the number of Unicode code points
> in this string.
>
> But maybe you are right and I am confused.

Upon further investigation, I may be wrong. I based my assertion upon
Apple's NSString documentation ("Returns the number of Unicode
characters in the receiver"), and upon some quick tests I ran. But
this reply made me look into the issue in greater depth.

I re-did my tests more thoroughly, and it does appear that -length
returns the number of 16-bit words (code units), not the number of
Unicode characters (code points), in the string. If this is true, I
would call it a bug either in the code or in the documentation, which
David should submit to Apple.
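
The behaviour described above can be checked with a small test. The following is a minimal sketch, not code from this thread: the helper names CodePointCount and StringFromUnits, and the sample character U+1D11E (encoded in UTF-16 as the surrogate pair 0xD834 0xDD1E), are assumptions made for illustration. It lets -length be compared with a count in which each surrogate pair counts once.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not an NSString API): counts Unicode code points by
// counting each well-formed surrogate pair once. -length, by contrast,
// counts every 16-bit unichar.
static NSUInteger CodePointCount(NSString *string) {
    NSUInteger count = 0;
    NSUInteger length = [string length];
    for (NSUInteger i = 0; i < length; i++) {
        unichar c = [string characterAtIndex:i];
        count++;
        // A high surrogate (0xD800-0xDBFF) directly followed by a low
        // surrogate (0xDC00-0xDFFF) encodes a single code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < length) {
            unichar next = [string characterAtIndex:i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++;  // skip the low surrogate; it was already counted
            }
        }
    }
    return count;
}

// Hypothetical helper: builds a string from raw UTF-16 code units, so the
// result does not depend on the encoding of the source file.
static NSString *StringFromUnits(const unichar *units, NSUInteger unitCount) {
    return [NSString stringWithCharacters:units length:unitCount];
}
```

For a string built from the two units 0xD834 0xDD1E, -length returns 2 while CodePointCount returns 1. Note that this only accounts for surrogate pairs; combining sequences are still counted unit by unit.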

I apologize for the apparent misinformation in my previous, hasty reply.

In the meantime, David, perhaps you can find a library that can work
with UTF-8 strings. What are you using the length values for?

> Hi,
>
> I have been trying to find this in the documentation and list archives but
> without success so far. What is the best way to count the number of
> characters in an NSString taking account of the fact that some characters
> may take up multiple 16 bit slots. Using "- (NSUInteger)length" is thus not
> the right way. Using a series of calls to
> "rangeOfComposedCharacterSequenceAtIndex:" seems like a possibility, but I
> am not sure this would be the most efficient way. Is there a simple and
> straightforward solution? I would like to be able to display the number of
> characters in a string and not report the wrong results for foreign
> languages (which I would get if I simply took the length of the string). I
> need a solution that does not only work in Leopard (i.e. CFStringTokenizer
> is not an option) and that does not require using the lower level
> UCFindTextBreak.

First I recommend you simply give up on the concept. You've stumbled
into a tough problem, one which is not all that useful, and it may be
better to skip it. Of course I don't know what you're using it for,
but in general counting the number of characters in a string is not a
useful thing to do.

> Upon further investigation, I may be wrong. I based my assertion
> upon Apple's NSString documentation ("Returns the number of Unicode
> characters in the receiver"), and upon some quick tests I ran. But
> this reply made me look into the issue in greater depth.
>
> I re-did my tests more thoroughly, and it does appear that -length
> returns the number of 16-bit words (code units), not the number of
> Unicode characters (code points), in the string. If this is true, I
> would call it a bug either in the code or in the documentation,
> which David should submit to Apple.

I think the docs are clear. In the discussion section for "length" it
says: "The number returned includes the individual characters of
composed character sequences, so you cannot use this method to
determine if a string will be visible when printed or how long it will
appear."

I did file a bug (ID 6253075) as you suggested, because I think there
should be a simple API for this.

> I apologize for the apparent misinformation in my previous, hasty
> reply.

Well, I made an error too. I suggested that on 10.5 the
CFStringTokenizer could be used, but only now noticed that it only
supports larger units (words and up). Thus there is no easy API to
count the number of characters in a way that surrogate pairs or other
"long" Unicode characters are treated as a single character.

> In the meantime, David, perhaps you can find a library that can
> work with UTF-8 strings. What are you using the length values for?

I need to be able to display the number of characters to the user in a
way that makes sense to them. If they see 3 I should report 3. I also
need to cut off certain input at a given number of "real" characters,
and I should not generate results that only make sense for a language
like English where each 16 bits equals a single character.

Using some kind of UTF-8 library may be possible, but that would
require converting all the time between UTF-16 and UTF-8, which is not
efficient for a program that has to do a lot of these kinds of
calculations.
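
The cut-off requirement described above can be sketched without leaving UTF-16. This is a minimal illustration that assumes composed character sequences, as reported by -rangeOfComposedCharacterSequenceAtIndex:, are an acceptable stand-in for "real" characters; the function name PrefixWithComposedCharacterLimit is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the longest prefix of `string` that holds at
// most `limit` composed character sequences. The cut always falls on a
// sequence boundary, so a surrogate pair or a base character plus its
// combining marks is never split.
static NSString *PrefixWithComposedCharacterLimit(NSString *string,
                                                  NSUInteger limit) {
    NSUInteger index = 0;   // position in UTF-16 code units
    NSUInteger seen = 0;    // composed character sequences passed so far
    NSUInteger length = [string length];
    while (index < length && seen < limit) {
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(range);  // jump to the start of the next sequence
        seen++;
    }
    return [string substringToIndex:index];
}
```

With the input "e" + U+0301 + "bc" (four UTF-16 units) and a limit of 2, the result keeps the first three units: the accented "e" counts as one character and "b" as the second.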

> I need to be able to display the number of characters to the user in a way
> that makes sense to them. If they see 3 I should report 3. I also need it to
> cut-off certain input to the number of "real" characters and should not
> generate results that only make sense for a language like English where each
> 16 bits equals a single character.

Perhaps more information on why this is a requirement would be
helpful. Since it's apparent that you're going to be dealing with
languages other than English, there won't be only one set of rules for
you to follow. For example, in Dutch, IJ is one letter. In Spanish,
you might treat ll and ch as one letter or not, depending on which
region you're using and whether you're performing collation or just
counting the number of letters. If you can explain why counting
characters is important to your app, we might be able to help you
better.

Users don't see characters, they see glyphs. If you want your count to
maximally agree with user perception, you need to be counting glyphs,
not characters.

See NSLayoutManager, esp:

- (NSRange)glyphRangeForCharacterRange:(NSRange)charRange actualCharacterRange:(NSRangePointer)actualCharRange

-- and friends.

If you are showing strings to the user, do so in an NSTextView, and
then query the NSLayoutManager associated with that view.

David,
Check out CFStringGetRangeOfComposedCharactersAtIndex. It finds the
kinds of text boundaries that I think you are interested in. You would
just need to iterate over the string calling this for each iteration
to find the next boundary.
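
The iteration suggested above could look roughly like this. It is a sketch under the assumption that one step per composed character range is what is wanted; the function name CountComposedCharacters is hypothetical, while CFStringGetLength and CFStringGetRangeOfComposedCharactersAtIndex are the Core Foundation calls.

```objc
#import <CoreFoundation/CoreFoundation.h>

// Hypothetical helper: walks the string one composed character range at a
// time and returns how many ranges were found.
static CFIndex CountComposedCharacters(CFStringRef string) {
    CFIndex count = 0;
    CFIndex index = 0;
    CFIndex length = CFStringGetLength(string);  // length in UTF-16 units
    while (index < length) {
        CFRange range = CFStringGetRangeOfComposedCharactersAtIndex(string, index);
        index = range.location + range.length;   // start of the next boundary
        count++;
    }
    return count;
}
```

Each call does work proportional to the cluster it finds, so the whole pass stays linear in the length of the string.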

> David,
> Check out CFStringGetRangeOfComposedCharactersAtIndex. It finds the
> kinds of text boundaries that I think you are interested in. You
> would just need to iterate over the string calling this for each
> iteration to find the next boundary.

Apologies, I see now that in your original post you already
mentioned rangeOfComposedCharacterSequenceAtIndex. That would be
preferred :-)
-Peter

> Michael,
>
> On 28 sep 2008, at 14:41, Michael Gardner wrote:
>
>> Upon further investigation, I may be wrong. I based my assertion
>> upon Apple's NSString documentation ("Returns the number of Unicode
>> characters in the receiver"), and upon some quick tests I ran. But
>> this reply made me look into the issue in greater depth.
>>
>> I re-did my tests more thoroughly, and it does appear that -length
>> returns the number of 16-bit words (code units), not the number of
>> Unicode characters (code points), in the string. If this is true, I
>> would call it a bug either in the code or in the documentation,
>> which David should submit to Apple.
>
> I think the docs are clear. In the discussion section for "length"
> it says: "The number returned includes the individual characters of
> composed character sequences, so you cannot use this method to
> determine if a string will be visible when printed or how long it
> will appear."

But composed character sequences aren't the problem; surrogate pairs
are. Composed character sequences can be taken care of by using either
-precomposedStringWithCanonicalMapping or
-precomposedStringWithCompatibilityMapping. In my opinion, -length
should take surrogate pairs into account, which is what the docs seem
to imply.

> On Sep 28, 2008, at 1:17 PM, David Niemeijer wrote:
>
>> Michael,
>>
>> On 28 sep 2008, at 14:41, Michael Gardner wrote:
>>
>>> Upon further investigation, I may be wrong. I based my assertion
>>> upon Apple's NSString documentation ("Returns the number of
>>> Unicode characters in the receiver"), and upon some quick tests I
>>> ran. But this reply made me look into the issue in greater depth.
>>>
>>> I re-did my tests more thoroughly, and it does appear that -length
>>> returns the number of 16-bit words (code units), not the number of
>>> Unicode characters (code points), in the string. If this is true,
>>> I would call it a bug either in the code or in the documentation,
>>> which David should submit to Apple.
>>
>> I think the docs are clear. In the discussion section for "length"
>> it says: "The number returned includes the individual characters of
>> composed character sequences, so you cannot use this method to
>> determine if a string will be visible when printed or how long it
>> will appear."
>
> But composed character sequences aren't the problem; surrogate pairs
> are. Composed character sequences can be taken care of by using
> either -precomposedStringWithCanonicalMapping or
> -precomposedStringWithCompatibilityMapping.

Not true. Not all possible combinations of base characters followed by
combining characters even have a mapping to a single precomposed
character.
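
A small check makes the point concrete. This is a sketch with assumed sample data: "e" followed by U+0301 (COMBINING ACUTE ACCENT) has a precomposed form, U+00E9, whereas "x" followed by U+0301 does not. The helper names LengthAfterNFC and LogNFCLengths are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the -length (in UTF-16 code units) of the
// NFC form of a string built from the given raw UTF-16 units.
static NSUInteger LengthAfterNFC(const unichar *units, NSUInteger unitCount) {
    NSString *string = [NSString stringWithCharacters:units length:unitCount];
    return [[string precomposedStringWithCanonicalMapping] length];
}

// Logs the NFC length of one sequence that composes and one that does not.
static void LogNFCLengths(void) {
    const unichar eAcute[] = { 0x0065, 0x0301 };  // e + COMBINING ACUTE ACCENT
    const unichar xAcute[] = { 0x0078, 0x0301 };  // x + COMBINING ACUTE ACCENT
    NSLog(@"e + U+0301 after NFC: %lu unit(s)",
          (unsigned long)LengthAfterNFC(eAcute, 2));
    NSLog(@"x + U+0301 after NFC: %lu unit(s)",
          (unsigned long)LengthAfterNFC(xAcute, 2));
}
```

The first sequence collapses to one unit and the second stays at two, so normalizing and then calling -length does not give a reliable count of user-visible characters.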

Essentially, what one wants to do is count all of the characters with
a combining class of zero; however, even this isn't without issues.

> In my opinion, -length should take surrogate pairs into account,
> which is what the docs seem to imply.
>
> -Michael

> I need to be able to display the number of characters to the user in
> a way that makes sense to them. If they see 3 I should report 3. I
> also need it to cut-off certain input to the number of "real"
> characters and should not generate results that only make sense for
> a language like English where each 16 bits equals a single character.

What you are describing is the notion that Unicode sometimes refers to
as a "user-perceived character", which in general can be somewhat
ambiguous, since different users may have different perceptions, and
since there are writing systems in which character boundaries are not
at all similar to those in English. To handle this sort of issue
programmatically, Unicode defines what are known as "grapheme
clusters", but there is not a single notion of grapheme cluster; there
are several such notions, depending on precisely what it is you want.

These issues are covered in detail in Unicode Standard Annex #29,
<http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
which gives a number of examples and some algorithms for
determining grapheme cluster boundaries. Grapheme clusters are
similar to but not quite identical to composed character sequences.
For some purposes composed character sequences may be sufficient;
NSString gives prominence to the notion of composed character
sequence, because that is the most important concept for arbitrary
text processing, but if you are really interested in user-perceived
characters you may wish to use something else.

The most problematic scripts for this sort of determination include:
handwriting-based scripts such as Arabic, in which (depending on the
ligatures used in a particular font) character boundaries may not be
readily perceptible; composed scripts such as Hangul, in which the
script elements are in turn composed of smaller, individually
meaningful graphic elements; and scripts involving reordering and
combining, such as Devanagari and other Indic or Indic-influenced
scripts.

There is still another similar but not quite identical notion, which
is used for determining the number and position of insertion points
during editing. In Leopard, NSLayoutManager has API support for
determining insertion point positions within a line of text as it is
laid out. Note that insertion point boundaries are not identical to
glyph boundaries; a ligature glyph in some cases, such as an "fi"
ligature in Latin script, may require an internal insertion point on a
user-perceived character boundary.

> But composed character sequences aren't the problem; surrogate pairs are.
> Composed character sequences can be taken care of by using either
> -precomposedStringWithCanonicalMapping or
> -precomposedStringWithCompatibilityMapping. In my opinion, -length should
> take surrogate pairs into account, which is what the docs seem to imply.

The NSString API is inherently either UCS-2 or UTF-16. As UCS-2
doesn't cover all of Unicode, it ends up being UTF-16.

The API defines NSString as an ordered collection of 16-bit unichars.
The length is necessarily the number of 16-bit unichars in the string;
nothing else would really make sense. Short of creating a new API that
works on pure Unicode code points, the only thing to do is to document
the fact that -length gives you the number of UTF-16 code units, not
the number of Unicode characters.

(As an aside, changing the API to work with Unicode code points is
something I don't think is really worthwhile. Aside from having to
support the old API which would no doubt be a great deal of hassle,
Unicode code points are pretty useless on their own anyway. You always
end up having to convert and deal with precomposed characters and all
the rest of the Unicode mess regardless. Adding surrogate pairs to all
of that really doesn't increase the burden any further.)

> On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:
>
>> I need to be able to display the number of characters to the user
>> in a way that makes sense to them. If they see 3 I should report 3.
>> I also need it to cut-off certain input to the number of "real"
>> characters and should not generate results that only make sense for
>> a language like English where each 16 bits equals a single character.
>
> What you are describing is the notion that Unicode sometimes refers
> to as a "user-perceived character", which in general can be somewhat
> ambiguous, since different users may have different perceptions, and
> since there are writing systems in which character boundaries are
> not at all similar to those in English. To handle this sort of
> issue programmatically, Unicode defines what are known as "grapheme
> clusters", but there is not a single notion of grapheme cluster;
> there are several such notions, depending on precisely what it is
> you want.
>
> These issues are covered in detail in Unicode Standard Annex #29,
> <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
> which gives a number of examples and some algorithms for
> determining grapheme cluster boundaries. Grapheme clusters are
> similar to but not quite identical to composed character sequences.
> For some purposes composed character sequences may be sufficient;
> NSString gives prominence to the notion of composed character
> sequence, because that is the most important concept for arbitrary
> text processing, but if you are really interested in user-perceived
> characters you may wish to use something else.

Thanks for your clarification. It is indeed the "grapheme clusters"
that I am after. I need to be able to do things such as capitalize the
first letter of a string and in doing statistical text analysis
determine the number of "characters" of a text string. This
description from the URL you pointed at fits my use quite well:
"Grapheme cluster boundaries are important for collation, regular
expressions, UI interactions (such as mouse selection, arrow key
movement, backspacing), segmentation for vertical text, identification
of boundaries for first-letter styling, and counting “character”
positions within text." Using glyphs in this case is not appropriate
as in text analysis the text itself is not displayed, nor is using
[aString length] because it just reports the number of UTF-16 code
units. I realize there is no perfect approach, but I am just trying to
do something that brings me closest to what a user would expect.
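
The first-letter case mentioned above can be sketched as follows. It assumes that the first composed character sequence is a reasonable approximation of the first "letter"; the function name StringByUppercasingFirstCluster is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: uppercases the first composed character sequence of
// `string` and leaves the rest untouched. Working on the whole sequence
// keeps a base letter together with its combining marks, and keeps both
// halves of a surrogate pair together.
static NSString *StringByUppercasingFirstCluster(NSString *string) {
    if ([string length] == 0) {
        return string;  // nothing to uppercase
    }
    NSRange first = [string rangeOfComposedCharacterSequenceAtIndex:0];
    NSString *head = [[string substringWithRange:first] uppercaseString];
    NSString *tail = [string substringFromIndex:NSMaxRange(first)];
    return [head stringByAppendingString:tail];
}
```

Taking only the first UTF-16 unit instead would separate "e" from a following U+0301, or cut a surrogate pair in half.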

Peter confirmed earlier that
CFStringGetRangeOfComposedCharactersAtIndex would be the way to go for
me. But, if I read Douglas' comment then I am beginning to wonder
whether this is the equivalent of UCFindTextBreak's
kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past I
used to use UCFindTextBreak with kUCTextBreakClusterMask, but unlike
NSString, UCFindTextBreak is not available on one of the platforms I
need to support, so what would be the right way to get at the cluster
breaks using the NSString API? (Please contact me off list if you need
further clarification.)

> Hi Douglas and Peter,
>
> On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:
>
>> [...]
>
> [...]
>
> Peter confirmed earlier that
> CFStringGetRangeOfComposedCharactersAtIndex would be the way to go
> for me. But, if I read Douglas' comment then I am beginning to
> wonder whether this is the equivalent of UCFindTextBreak's
> kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past
> I used to use UCFindTextBreak with kUCTextBreakClusterMask, but
> unlike NSString, UCFindTextBreak is not available on one of the
> platforms I need to support, so what would be the right way to get
> at the cluster breaks using the NSString API? (Please contact me off
> list if you need further clarification.)

David,
CFStringGetRangeOfComposedCharactersAtIndex and -[NSString
rangeOfComposedCharacterSequenceAtIndex:] are the modern replacements
for UCFindTextBreak with kUCTextBreakClusterMask and indeed they now
are closer to the original intent of kUCTextBreakClusterMask than the
current implementation of kUCTextBreakClusterMask is (since
UCFindTextBreak was converted to follow Unicode/ICU default text
segmentation rules).

The modern functions treat all of the following as a cluster:
- A surrogate pair (of course, since it is a single character);
- A base character followed by a sequence of combining marks (whether
or not this is something that would be composed under NFC);
- A Hangul syllable expressed as a sequence of conjoining jamo;
- An Indic consonant cluster such as consonant + virama + consonant +
vowel matra. It is this latter cluster that is no longer treated as a
single entity by UCFindTextBreak with kUCTextBreakClusterMask.

> CFStringGetRangeOfComposedCharactersAtIndex and -[NSString
> rangeOfComposedCharacterSequenceAtIndex:] are the modern
> replacements for UCFindTextBreak with kUCTextBreakClusterMask and
> indeed they now are closer to the original intent of
> kUCTextBreakClusterMask than the current implementation of
> kUCTextBreakClusterMask is (since UCFindTextBreak was converted to
> follow Unicode/ICU default text segmentation rules).
>
> The modern functions treat all of the following as a cluster:
> - A surrogate pair (of course, since it is a single character);
> - A base character followed by a sequence of combining marks
> (whether or not this is something that would be composed under NFC);
> - A Hangul syllable expressed as a sequence of conjoining jamo;
> - An Indic consonant cluster such as consonant + virama + consonant
> + vowel matra. It is this latter cluster that is no longer treated
> as a single entity by UCFindTextBreak with kUCTextBreakClusterMask.

Ok, understood. This looks good. Based on the discussion I have
updated my bug report 6253075. I think a "convenience" method that
returns the cluster count would be very useful as it is probably
faster than if we manually roll a counter method using repeated calls
to rangeOfComposedCharacterSequenceAtIndex and because it will, by its
simple availability, reduce some of the confusion that I sense on this
list as to what the most appropriate way is to count "characters".
There would be "length" to count the number of UTF-16 units and a
"numberOfCharacters" to count the clusters that are closest to the
human conception of characters.
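
Until such a method exists, a hand-rolled counter along the lines described above might look like this. It is only a sketch: the category name ComposedCharacterCounting and the method name numberOfComposedCharacterSequences are hypothetical, not Apple API, and the loop relies on nothing beyond -length and -rangeOfComposedCharacterSequenceAtIndex:.

```objc
#import <Foundation/Foundation.h>

// Hypothetical category, for illustration only.
@interface NSString (ComposedCharacterCounting)
- (NSUInteger)numberOfComposedCharacterSequences;
@end

@implementation NSString (ComposedCharacterCounting)

// Counts composed character sequences by stepping from one sequence
// boundary to the next. -length is still the number of UTF-16 code units.
- (NSUInteger)numberOfComposedCharacterSequences {
    NSUInteger count = 0;
    NSUInteger index = 0;
    NSUInteger length = [self length];
    while (index < length) {
        NSRange range = [self rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(range);  // first unit after this sequence
        count++;
    }
    return count;
}

@end
```

For a string made of the units 0xD834 0xDD1E, "e", U+0301 and "a", -length returns 5 and the method above returns 3.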