Excellent, thanks for the pointers.
Running the script below shows that
"0X00AD SOFT HYPHEN" should have zero length (and some others too).
I wonder if that is really the case, and which one of the last 2 lines
in the script is the right one.

What does this mean for us:
"Cf Format a format control character"

It might be worth digging back through the Git logs to check the original
logic, but the comments suggest that "Cf" characters have been treated as
zero-width.
That makes sense - they're usually markers indicating things like
bidirectional text flow, so won't be taking space. (Although they may be
causing even more extreme layout effects...)

Soft-hyphen is noted as an explicit exception to the rule in the utf8.c
comments. As of Unicode 4.0, it's supposed to be a character indicating
a point where a hyphen could be placed if a line-wrap occurs, and if
that wrap happens, then it can actually take up 1 space, otherwise not.
So its width could be either 0 or 1, depending. Or, quite likely, the
terminal doesn't treat it specially, and it always just looks like a
hyphen... Thus we err on the safe side and give it width 1.
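
To make that rule concrete, here is a minimal Python sketch of it (not the
utf8.c code itself). It uses the standard unicodedata module; the name
char_width is made up for this example, and it only covers the Cf and
soft-hyphen cases discussed above:

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def char_width(ch):
    """Hypothetical helper: approximate column width of one character."""
    # Soft hyphen is category Cf, but we deliberately give it width 1.
    if ch == SOFT_HYPHEN:
        return 1
    # All other format control characters are assumed to take no space.
    if unicodedata.category(ch) == "Cf":
        return 0
    # Everything else is treated as single width in this sketch.
    return 1

for ch in (SOFT_HYPHEN, "\u200b", "\u200e", "a"):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} width={char_width(ch)}")
```

The special case has to be checked before the category test, since U+00AD
itself reports as Cf.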

The comments suggest adding "-00AD +1160-11FF" to the uniset command
line for that tweak and for composing Hangul. The +200B tweak isn't
necessary any more - Zero-Width Space U+200B became Cf officially in
Unicode 4.0.1:

All of this is only really an approximation - a best-effort attempt to
figure out the width of a string without any actual communication with
the display device. So it'll never be perfect. The choice between double
and single width in particular will often be unpredictable, unless you
have deeper locale knowledge.
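
A rough Python sketch of that kind of best-effort estimate, again using
only the standard unicodedata module. The function name approx_width and
the ambiguous_wide flag are assumptions made for this example; the flag
stands in for the locale knowledge we don't really have:

```python
import unicodedata

def approx_width(text, ambiguous_wide=False):
    """Hypothetical helper: estimate how many columns text occupies."""
    total = 0
    for ch in text:
        # Soft hyphen: err on the safe side and count one column.
        if ch == "\u00ad":
            total += 1
            continue
        # Combining marks and format controls are assumed to take no space.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        eaw = unicodedata.east_asian_width(ch)
        # Wide and Fullwidth are double width; Ambiguous depends on the
        # caller's guess about the display, which is the unpredictable part.
        if eaw in ("W", "F") or (eaw == "A" and ambiguous_wide):
            total += 2
        else:
            total += 1
    return total

print(approx_width("abc"), approx_width("\u65e5\u672c"))
print(approx_width("\u00b1"), approx_width("\u00b1", ambiguous_wide=True))
```

Control characters and anything the terminal renders in its own way are
not handled here, so this is only as good as the guess about the display.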

Actually, while doing this, I've realised that this was originally
Markus Kuhn's implementation, and that is acknowledged at the top of the
file: