The change from issue1054943 is indeed bogus. As written, the code will happily run over starters, even though a blocked starter means that subsequent characters can't possibly be combinable. That way, the code manages to combine, in 'Li\u030dt-s\u1e73\u0301', the final U+0301 with the i - even though there are several starters in between.
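For the record, this is easy to try interactively (Python 3 syntax); with the buggy code, the acute wrongly lands on the 'i':
>>> import unicodedata
>>> unicodedata.normalize('NFC', 'Li\u030dt-s\u1e73\u0301')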
I think the code should work like this:
if comb != 0 and comb1 == 0:
    # starter after a character with a higher class:
    # not combinable, and all subsequent characters
    # will be blocked as well
    break
if comb != 0 and comb1 == comb:
    # blocked combining character, continue searching
    i1 += 1
    continue
# candidate pair: check whether *i and *i1 are combinable
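Spelled out as a self-contained sketch (pure Python, with the database lookups abstracted into combining() and pair_compose() callables, and Hangul ignored - this illustrates the intended control flow, not the actual C code in unicodedata.c):

def compose_pass(chars, combining, pair_compose):
    # combining(ch) -> canonical combining class (0 for a starter);
    # pair_compose(a, b) -> composed character, or None.
    skipped = []                  # positions already folded into a base
    out = []
    i = 0
    while i < len(chars):
        if i in skipped:          # the C code does this with a linear search
            skipped.remove(i)
            i += 1
            continue
        base = chars[i]
        if combining(base) == 0:  # only a starter can begin a composition
            comb = 0              # class of the last mark that didn't combine
            i1 = i + 1
            while i1 < len(chars):
                comb1 = combining(chars[i1])
                if comb != 0 and comb1 == 0:
                    break         # blocked starter: nothing further combines
                if comb != 0 and comb1 == comb:
                    i1 += 1       # blocked mark: keep searching
                    continue
                composite = pair_compose(base, chars[i1])
                if composite is None:
                    if comb1 == 0:
                        break     # uncombinable starter: stop as well
                    comb = comb1
                    i1 += 1
                    continue
                base = composite  # combined: suppress chars[i1] on output
                skipped.append(i1)
                i1 += 1
        out.append(base)
        i += 1
    return ''.join(out)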
It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore.

>> It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore.
>
> Why not ? It looks a lot like a security fix.
Indeed, you could argue that. It's up to the 2.6 release manager, I guess.

After a bit of debugging, the crash is due to the "skipped" array being overflowed in nfc_nfkc() in unicodedata.c: "cskipped" goes up to 21 while the array only has 20 entries. This happens in all branches (but currently only crashes in 2.7, for reasons that are probably unimportant).
And the problem was indeed introduced by Victor's patch in issue1054943. Just before, "cskipped" would only go up to 1.

"Ooops", sorry. I just applied the patch suggested by Marc-Andre Lemburg in msg22885 (#1054943). As the patch worked for the examples given in Unicode PRI 29 and the test suite passed, it was enough for me. I don't understand the normalization code, so I don't know how to fix it.

The logic suggested by Martin in msg120018 looks right to me, but the whole code seems unnecessarily complex. (And comb1==comb may need to be changed to comb1>=comb.) I don't understand why a linear search through the "skipped" array is needed. At the very least, instead of adding their positions to the "skipped" list, used combining characters could be replaced by a non-character to be skipped later. A better algorithm should be able to avoid the whole issue of "skipping" by properly computing the length of the decomposed character. See internalCompose() at http://www.unicode.org/reports/tr15/Normalizer.java.
I'll try to come up with a patch.
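For reference, here is a rough Python transliteration of that internalCompose() loop (a sketch assuming the input is already canonically decomposed and reordered, with pair_compose() standing in for the pairwise composition table - not a drop-in replacement):

def internal_compose(decomposed, combining, pair_compose):
    # Compose in place while tracking the last starter; no separate
    # "skipped" buffer is needed.
    if not decomposed:
        return decomposed
    target = list(decomposed)
    starter_pos = 0
    starter = target[0]
    last_class = combining(starter)
    if last_class != 0:
        last_class = 256              # never compose with a leading mark
    comp_pos = 1
    for decomp_pos in range(1, len(target)):
        ch = target[decomp_pos]
        ch_class = combining(ch)
        composite = pair_compose(starter, ch)
        if composite is not None and (last_class < ch_class or last_class == 0):
            target[starter_pos] = composite   # fold ch into the starter
            starter = composite
        else:
            if ch_class == 0:                 # a new starter begins here
                starter_pos = comp_pos
                starter = ch
            last_class = ch_class
            target[comp_pos] = ch
            comp_pos += 1
    return ''.join(target[:comp_pos])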

Attached patch, issue10254.diff, is essentially Martin's code from msg120018 plus the Part3 tests from NormalizationTest.txt.
Since this bug exposes a buffer overflow condition, I think it qualifies as a security issue, so I am adding 2.6 to versions.
Passing the Part3 tests and not crashing on crash.py is probably good enough for a commit, but I don't have a proof that the length-20 skipped buffer is always enough. As the next step, I would like to consider an alternative algorithm that would not require a "skipped" buffer.

On 17.12.2010 01:56, STINNER Victor wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> "Ooops", sorry. I just applied the patch suggested by Marc-Andre
> Lemburg in msg22885 (#1054943). As the patch worked for the examples
> given in Unicode PRI 29 and the test suite passed, it was enough for
> me. I don't understand the normalization code, so I don't know how to
> fix it.
So lacking a new patch, I think we should revert the existing change
for now.

> The logic suggested by Martin in msg120018 looks right to me, but the
> whole code seems unnecessarily complex. (And comb1==comb may need to
> be changed to comb1>=comb.) I don't understand why a linear search
> through the "skipped" array is needed. At the very least, instead of
> adding their positions to the "skipped" list, used combining
> characters could be replaced by a non-character to be skipped later.
The skipped array keeps track of what characters have been integrated
into a base character, as they must not appear in the output.
Assume you have a sequence B,C,N,C,N,B (B: base character, C: combined,
N: not combined). You need to remember not to output C, whereas you
still need to output N. I don't think replacing them with a
non-character can work: which one would you choose (that cannot also
appear in the input)?
The worst case (wrt. cskipped) is the maximum number of characters that
can get combined into a single base character. It used to be (and I
hope still is) 20 (decomposition of U+FDFA).
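A concrete instance of the pattern, for illustration - in a,cedilla,acute the acute (class 230) reaches over the cedilla (class 202), combines with the a, and must be suppressed on output, while the cedilla is still emitted:
>>> import unicodedata
>>> unicodedata.normalize('NFC', 'a\u0327\u0301') == '\xe1\u0327'
True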

> Passing the Part3 tests and not crashing on crash.py is probably good
> enough for a commit, but I don't have a proof that the length-20
> skipped buffer is always enough.
I would agree with that. I still haven't had time to fully review the
patch, but assuming it fixes the cases in msg119995, we should proceed
with it.

On Fri, Dec 17, 2010 at 3:47 AM, Martin v. Löwis <report@bugs.python.org> wrote:
..
> The worst case (wrt. cskipped) is the maximum number of characters that
> can get combined into a single base character. It used to be (and I
> hope still is) 20 (decomposition of U+FDFA).
>
The C forms (NFC and NFKC) do canonical composition, and U+FDFA is a
compatibility composite. (BTW, makeunicodedata.py checks that the
maximum decomposed length of a character is < 19, but it would be
better if it computed and defined a named constant, say MAXDLENGTH,
to be used instead of the literal 20.) As far as I (and a two-line
script) can tell, the maximum length of a canonical decomposition of
a character is 4.
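In Python 3 syntax, the checks amount to something like this (slow but simple; U+FDFA is the compatibility worst case):
>>> import unicodedata
>>> max(len(unicodedata.normalize('NFD', chr(c))) for c in range(0x110000))
4
>>> len(unicodedata.normalize('NFKD', '\uFDFA'))
18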

> The C forms (NFC and NFKC) do canonical composition, and U+FDFA is a
> compatibility composite. (BTW, makeunicodedata.py checks that the
> maximum decomposed length of a character is < 19, but it would be
> better if it computed and defined a named constant, say MAXDLENGTH,
> to be used instead of the literal 20.) As far as I (and a two-line
> script) can tell, the maximum length of a canonical decomposition of
> a character is 4.
Even better - so allowing for 20 characters should be safe.

On Fri, Dec 17, 2010 at 2:08 PM, Martin v. Löwis <report@bugs.python.org> wrote:
..
>> As far as I (and a two-line script) can tell,
>> the maximum length of a canonical decomposition of a character is 4.
>
> Even better - so allowing for 20 characters should be safe.
I don't disagree, but the number of "break" and "continue" statements
before cskipped++ makes me nervous. That said, I am going to add the
test cases from the first post to test_unicodedata (I think it is a
better place than test_normalization because the latter is skipped by
default) and commit.
Improving the algorithm is a separate issue.

Attached patch, issue10254a.diff, adds the OP's cases to test_unicodedata and changes the code as I suggested in msg124173, because ISTM that comb >= comb1 matches the PR-29 definition:
"""
D2'. In any character sequence beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C, and either B is a starter or it has the same or higher combining class as C.
""" http://www.unicode.org/review/pr-29.html
Unfortunately, all tests pass with either comb >= comb1 or comb == comb1, so before I commit, I would like to figure out a test case that would properly exercise this code.
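For concreteness, D2' transcribes almost literally into a predicate (illustrative pure Python, with combining() standing for the combining-class lookup):

def blocked(chars, combining, s, c):
    # D2': C at index c is blocked from the starter S at index s iff some
    # B strictly between them is a starter or has the same or higher
    # combining class as C.
    return any(combining(chars[b]) == 0 or
               combining(chars[b]) >= combining(chars[c])
               for b in range(s + 1, c))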

On Mon, Dec 20, 2010 at 2:50 PM, Alexander Belopolsky
<report@bugs.python.org> wrote:
..
> Unfortunately, all tests pass with either comb >= comb1 or comb == comb1, so before
> I commit, I would like to figure out a test case that would properly exercise this code.
>
After some more thought, I've realized that the comb > comb1 case is
impossible if comb1 != 0 (due to the canonical reordering step), and
if comb1 == 0, the comb-to-comb1 comparison is not reached. In other
words, it does not matter whether the comparison is done as Martin
suggested in msg120018 or as it is done in the latest patch. The fact
that the comb > comb1 case is impossible when comb1 != 0 is actually
mentioned in PR 29 itself. See Table 1: Differences at
http://www.unicode.org/review/pr-29.html.
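The reordering guarantee is easy to see interactively - NFD sorts the marks into nondecreasing combining-class order (202 for U+0327, 230 for U+0301), so a later mark never has a lower nonzero class than an earlier one:
>>> import unicodedata
>>> [unicodedata.combining(c) for c in unicodedata.normalize('NFD', 'a\u0301\u0327')]
[0, 202, 230]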

In the new patch, issue10254b.diff, I've added a test that would crash unpatched code:
>>> unicodedata.normalize('NFC', 'C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸Ç')
Segmentation fault
Martin, I still feel uneasy about the fixed size of the skipped buffer. It is not obvious that skipped combining characters always get removed from the buffer before the next starter is processed.
I would really like another pair of eyes to look at this code before it goes in especially to 2.6.
Victor,
IIRC, you did some stress testing on random data. I wonder if you could test this code after tightening the assert to cskipped < 4. (The current theory is that this should be enough.)
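Something along these lines, perhaps (entirely hypothetical - the names and parameters here are mine, not from the original stress test):

import random
import unicodedata

def fuzz_nfc(iterations=100000, maxlen=32):
    # Any hard crash (as opposed to a Python-level exception) while
    # normalizing random material points at a bug of this kind.
    for _ in range(iterations):
        n = random.randrange(1, maxlen)
        s = ''.join(chr(random.randrange(0x110000)) for _ in range(n))
        unicodedata.normalize('NFC', s)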