Bugs item #798211, was opened at 2003-08-31 10:55
Message generated for change (Comment added) made by jenglish
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=112997&aid=798211&group_id=12997
Category: 22. Style Engine
Group: 8.4.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Joe English (jenglish)
Assigned to: Frédéric BONNET (fbonnet)
Summary: Double-free in style engine
Initial Comment:
TkStylePkgFree() can end up being called while there
are still extant references to Tk_Style structures
cached in Tcl_Obj's. This causes a reference to
free()d memory, possibly followed by double-free later
on when the Tcl_Obj is deleted.
'valgrind' reports the following:
==10007== Invalid read of size 4
==10007==    at 0x81024B6: Tk_FreeStyle (/usr/local/src/tk/generic/tkStyle.c:1444)
==10007==    by 0x810267A: FreeStyleObjProc (/usr/local/src/tk/generic/tkStyle.c:1628)
==10007==    by 0x8169C85: TclFreeObj (../generic/tclObj.c:749)
==10007==    by 0x80E4E65: Tk_DeleteOptionTable (/usr/local/src/tk/generic/tkConfig.c:360)
==10007== Address 0x4133A01C is 0 bytes inside a block of size 20 free'd
==10007==    at 0x4003D129: free (vg_clientfuncs.c:180)
==10007==    by 0x81A33F4: TclpFree (../generic/tclAlloc.c:702)
==10007==    by 0x8116EC6: Tcl_Free (../generic/tclCkalloc.c:1160)
==10007==    by 0x810156A: TkStylePkgFree (/usr/local/src/tk/generic/tkStyle.c:272)
----------------------------------------------------------------------
>Comment By: Joe English (jenglish)
Date: 2003-08-31 11:14
Message:
Logged In: YES
user_id=68433
As a quick hack, removing the call to TkStylePkgFree() in
generic/tkWindow.c(Tk_DestroyWindow) avoids this problem
(but replaces it with a memory leak).
As a full solution, perhaps it would be better to _not_ keep
reference counts for Style objects, and only free them at
program shutdown. Style objects have a long lifetime: they
are given a refcount of 1 when created by Tk_CreateStyle(),
and this reference (in thread-local storage) isn't released
until TkStylePkgFree is called at shutdown time. So there's
no need to track intermediate reference counts -- the
refcount can never drop to zero as long as the Style engine
is still loaded.
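The "no intermediate refcounts" idea can be sketched outside of Tk. This is only an illustration under stated assumptions, not the tkStyle.c code: Style, StyleRegistry, style_get, style_release and registry_free_all are hypothetical names. The registry owns every style, handles are borrowed pointers, and releasing a handle never touches the style, so there is no count that could be decremented on freed memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a style record. */
typedef struct Style {
    char name[32];
    struct Style *next;
} Style;

/* The registry owns every Style until shutdown. */
typedef struct StyleRegistry {
    Style *head;
    int count;
} StyleRegistry;

/* Look up a style by name, creating it on first use. The returned
 * pointer is borrowed: callers neither free it nor count it. */
Style *style_get(StyleRegistry *reg, const char *name)
{
    Style *s;
    assert(reg != NULL && name != NULL);
    for (s = reg->head; s != NULL; s = s->next) {
        if (strcmp(s->name, name) == 0) {
            return s;
        }
    }
    s = malloc(sizeof *s);
    if (s == NULL) {
        return NULL;
    }
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    s->next = reg->head;
    reg->head = s;
    reg->count++;
    return s;
}

/* Dropping a cached handle is a no-op that never dereferences the
 * Style, so it cannot read memory the registry has already freed. */
void style_release(Style *s)
{
    (void)s;
}

/* Called once at shutdown; frees everything the registry owns. */
void registry_free_all(StyleRegistry *reg)
{
    Style *s, *next;
    assert(reg != NULL);
    for (s = reg->head; s != NULL; s = next) {
        next = s->next;
        free(s);
    }
    reg->head = NULL;
    reg->count = 0;
}
```

The trade-off is the one named above: styles live until shutdown even if nothing uses them any more.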
----------------------------------------------------------------------

Feature Requests item #623787, was opened at 2002-10-15 15:05
Message generated for change (Comment added) made by davidw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=360894&aid=623787&group_id=10894
Category: 34. tcltest Package
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: David N. Welton (davidw)
Assigned to: Don Porter (dgp)
Summary: programmatic access to summary data
Initial Comment:
If one is running *lots* of tests, it might be nice to
have an option to display the results in a different
way, like:
foo: pass
bar: fail
baz: pass
fcopy: pass
and so forth.
Maybe it would be better to give some sort of
programmatic interface to the summary information, such
as test name, result, desired result, test text, etc...
that way it would be possible to write one's own ways of
collecting the information.
----------------------------------------------------------------------
>Comment By: David N. Welton (davidw)
Date: 2003-08-30 07:12
Message:
Logged In: YES
user_id=240
Upon looking through the code, it seems as if tcltest really
was designed as an application rather than a library.
'puts', 'exit' and the like are used throughout.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-06-27 10:17
Message:
Logged In: YES
user_id=80530
Tcl Bugs 761334 and 761344
suggest other needs along these
lines. Currently summary information
about the test suite is printed by
[cleanupTests], but the data that
makes up that report is not available.
Additional interfaces to access the
internal values like $numTests(Failed)
would allow for combining results
from multiple interps (the issue in
those bug reports) and would allow
for reporting the same data in
alternate formats, as requested here.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2002-10-15 22:18
Message:
Logged In: YES
user_id=80530
seems like the core of a good idea.
If you can flesh it out a bit (what
interface are you looking for, etc.)
we can get a TIP going on it.
----------------------------------------------------------------------
Comment By: David N. Welton (davidw)
Date: 2002-10-15 15:19
Message:
Logged In: YES
user_id=240
I guess some of this is available, maybe another option to
verbose is enough to keep it from giving you more
information than just a plain 'fail'?
----------------------------------------------------------------------

Bugs item #411825, was opened at 2001-03-28 01:36
Message generated for change (Comment added) made by dossy
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=411825&group_id=10894
Category: 10. Objects
Group: 8.4.4
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Adrian Robert (arobert3434)
Assigned to: Don Porter (dgp)
Summary: Passing list w/UTF-8 from C can fail
Initial Comment:
On certain installations of Tcl/Tk 8.3.1, the passing of UTF-8
character-triplets ending in octal 240 (decimal 160, hex A0)
interferes with list delimitation when Tcl_AppendElement is used
to return a result from a C function. In particular, if a UTF-8
string ending in octal 240 is appended to the result, and then
another UTF-8 string is appended afterwards, the octal 240 seems
to be interpreted as a "forward delete" character of some kind,
with the result that the separation between the two list elements
is erased and they are interpreted as one.
The following C function, when called from Tcl, illustrates the
problem.
int sendCharList(ClientData clientData, Tcl_Interp *interp,
                 int argc, char **argv)
{
    char s1[5], s2[5], s3[5], s4[5];
    strcpy(s1, "\345\220\240");
    strcpy(s2, "\345\214\240");
    strcpy(s3, "\351\235\240");
    strcpy(s4, "\347\264\240");
    Tcl_ResetResult(interp);
    Tcl_AppendElement(interp, s1);
    Tcl_AppendElement(interp, s2);
    Tcl_AppendElement(interp, s3);
    Tcl_AppendElement(interp, s4);
    return TCL_OK;
}
The Tcl calls:
set s6 [sendCharList]
puts "[llength $s6] , [string length $s6]"
should output "4 , 7" (4 list elements, each a single UTF-8
composite character plus 3 delimiters). On some systems it does.
On others, however, the output is "1 , 4", resulting from
deletion of the list delimiters somewhere during passage from C
to Tcl. A complete test program involving the above (plus some
additional tests and using wish not tclsh) may be accessed at:
ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz (it is also
attached).
A full application that exposes the bug (and led to its
discovery) may be found at:
http://freshmeat.net/projects/hanzim
Unfortunately, I have not been able to isolate why some
installations exhibit the bug and some don't. A default SUSE 7.0
Linux installation of 8.3.1 had the problem, while a default
Slackware 7.1 installation of the same Tcl/Tk version did not.
Maybe it is a compilation flag difference... ?
I'm also not sure whether it persists in 8.3.2 or 8.4.
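The two string lengths quoted above ("4 , 7" expected, "1 , 4" observed) can be checked by counting UTF-8 characters in the two candidate result strings. The following is a small sketch that does not use the Tcl library; utf8_char_count, joined and merged are hypothetical names, and the input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count UTF-8 characters by counting every byte that is not a
 * continuation byte (10xxxxxx). Assumes well-formed input. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Result string as intended: four elements, three separators. */
static const char joined[] =
    "\345\220\240 \345\214\240 \351\235\240 \347\264\240";

/* Result string with the separators missing, as on affected systems. */
static const char merged[] =
    "\345\220\240\345\214\240\351\235\240\347\264\240";
```

With the separators present the string holds 7 characters; without them it holds 4, and the list parser sees a single element.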
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-28 07:58
Message:
Logged In: YES
user_id=21885
Donal -- yes, I see your point and now I agree. The rule is
that list elements ending in a list delimiter get quoted, and
since \302\240 is now no longer considered a list delimiter, it
doesn't cause quoting to happen. Thanks.
Don -- I understand what's supposed to happen (at least, I
thought I did) but then, explain this:
% encoding system identity
% fconfigure stdout -encoding binary -translation binary
% TestCmd
foo bar
% string length [TestCmd]
8
% string bytelength [TestCmd]
9
I would have expected to get "foo\302\240 bar" and not
just "foo\240 bar". It's clear from string bytelength that the
\302 is in there, but when I set stdout encoding to binary, it
should give me the raw UTF-8 (9 bytes) and not the
transcoded ISO-8859-1 representation (8 bytes), right?
Or, am I misunderstanding what "-encoding binary" means and
what "encoding system identity" does? I mean, this actually
does what I expect:
% fconfigure stdout -encoding identity
% TestCmd
fooÂ bar
Now it output "foo\302\240 bar" -- why will it do that on
"-encoding identity" but not "-encoding binary"?
Perhaps we can take this discussion to the wiki or email since
it's not directly related to this particular bug -- let me know
what works best for you.
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2003-08-27 19:40
Message:
Logged In: YES
user_id=79902
Behaviour is correct. UTF-8 sequence \302\240 corresponds to ISO8859-1 character \240 (i.e.
non-breaking space.) Non-breaking space is (now, with DGP's patch) considered to not be a space
character and hence not in need of quoting.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-27 19:39
Message:
Logged In: YES
user_id=80530
Let me advise you to try again tomorrow.
By then the anonymous CVS at SF
will have caught up to all my commits.
The output you describe sounds
correct to me. Before you file
another bug report, be sure you
understand that Tcl uses UTF-8
encoding internally and by default
converts to your system encoding
on output.
The two byte sequence \302\240
is the UTF-8 encoding for the character
known in Tcl-Unicode notation as \u00a0
which is the non-breaking space. When
you write that character to output on
a system with system encoding of
iso8859-1 it gets written as the single
byte \240 which is the same character
in that encoding. Likewise, if you were
to read in the byte \240 on the same
system, Tcl will convert it back to UTF-8
so by the time Tcl sees it again, it will
be the 2-byte sequence \302\240 .
When you work with an interactive tclsh,
the results you see have actually been
written to stdout, and are in the system
encoding.
If you don't completely follow what I just
said, do not file another bug report yet,
but let's find another channel to straighten
out any misunderstandings about how Tcl
encodings are supposed to work.
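The byte values mentioned above can be reproduced with a few lines of C. This is a minimal sketch, not Tcl's encoding machinery: utf8_encode_bmp and latin1_encode are hypothetical helpers, limited to code points below U+10000 and with no special handling of surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a code point below U+10000 as UTF-8.
 * Returns the number of bytes written (1 to 3). */
size_t utf8_encode_bmp(unsigned int cp, unsigned char out[3])
{
    assert(cp < 0x10000u);
    if (cp < 0x80u) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    out[0] = (unsigned char)(0xE0u | (cp >> 12));
    out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 3;
}

/* ISO-8859-1 maps U+0000..U+00FF one-to-one onto single bytes.
 * Returns 1 and stores the byte, or 0 if cp is not representable. */
int latin1_encode(unsigned int cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp > 0xFFu) {
        return 0;
    }
    *out = (unsigned char)cp;
    return 1;
}
```

For U+00A0 the first function yields the two bytes 0xC2 0xA0 (octal \302\240) and the second yields the single byte 0xA0 (octal \240), matching the 9-byte internal and 8-byte iso8859-1 forms of "foo", no-break space, space, "bar".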
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-27 19:12
Message:
Logged In: YES
user_id=21885
Your patch only included tests util-8.5 and util-8.6. I just
checked HEAD and core-8-4-branch and the util.test file
stops at util-8.1.
I'm showing the last checkin for tests/util.test as:
revision 1.11
date: 2003/07/24 16:05:24; author: dgp; state: Exp; lines:
+37 -4
I assume this means you didn't get to check your change in,
yet?
Either way, the C test case I provided on 2003-08-25 20:37
passes after applying the patch, kinda. [llength [TestCmd]]
== 2, but now look at what TestCmd outputs:
% encoding system
iso8859-1
% TestCmd
foo bar
Pushing that through "od -xc", here are the actual bytes that
get output:
666f 6fa0 2062 6172
f o o 240 b a r
Instead of \302\240 coming back out, only \240 came back.
At least this is a *different* problem to solve, now. At least
before it would return "foo\302\240bar" -- now, it's
returning "foo\240 bar" -- I'm not exactly sure which is
worse. :-)
However, the behavior I described on 2003-08-26 13:22
hasn't changed, list elements ending in \302\240 don't get
wrapped with {}. Suppose I should file this as a new bug,
now?
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-27 16:22
Message:
Logged In: YES
user_id=21885
Thank you so much, Don. We're going to apply the patch
and do our tests. I'll let you know how it goes!
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-27 15:59
Message:
Logged In: YES
user_id=80530
Here's a copy of the patch I am
committing to HEAD and to
core-8-4-branch.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-27 13:58
Message:
Logged In: YES
user_id=80530
committed new tests to test suite
util-8.3 shows dossy's reported bug
util-8.4 shows another TclNeedSpace bug
Fix on the way.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-26 17:47
Message:
Logged In: YES
user_id=80530
sorry, but it's gonna be another day.
Just as I was testing the patch, a big
storm came through and knocked off
power. Power's back, but the disk
on which the patch is stored has
not come back online yet. Will
get back to this tomorrow.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-26 15:25
Message:
Logged In: YES
user_id=21885
Sounds good, Don. Thanks for the quick response on this!
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-26 15:21
Message:
Logged In: YES
user_id=80530
Tell you what. Let me commit
a fix to the re-opened bug (should
be able to start work on that shortly;
should not take long). Then
after that fix is in, if you still find
something not meeting your
expectations, you can file a new
bug report on that. Thanks.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-26 13:22
Message:
Logged In: YES
user_id=21885
I don't know if this should be entered as a separate bug, but
it's related to this problem (similar fix should address both):
% set a [list "abc "]
{abc }
This is correct -- since the list element ends in whitespace,
it's wrapped with {} for its string representation.
% encoding system
iso8859-1
% set a [list "abc\240\240"]
abc
Here, the string is "abc\240\240" but it's not being wrapped
by {}. But, if [string is space \240] is 1, shouldn't it be?
% encoding system utf-8
% set a [list [encoding convertfrom utf-8 "abc\302\240\302\240"]]
abc
Here, [string is space [encoding convertfrom utf-8 \302\240]]
= 1. Again, the list element isn't being wrapped with {} --
why?
Parts of Tcl treat \240 or \302\240 as a space (and thus
don't insert a list delimiter character) but others don't treat it
as a space, so stringified list elements don't get {} wrapped
around them.
-- Dossy
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2003-08-26 06:12
Message:
Logged In: YES
user_id=79902
Sorry. I've no time to chase this today. :^/
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-25 23:30
Message:
Logged In: YES
user_id=80530
I see the ChangeLog comment from dkf's patch:
* generic/tclUtil.c (TclNeedSpace): Rewrote to be UTF-8 aware.
[Bug 411825, but not that patch which would have added extra
spaces if there was a real non-ASCII space involved. ]
Trouble here is that Tcl_UniCharIsSpace() is the
wrong test. It is not equivalent to
Tcl_UniCharIsAListElementTerminator()
which is what we really need to test. In particular,
the "non-breaking space" \u00A0 returns true
from Tcl_UniCharIsSpace(), but is not recognized
by the list parser in [llength] as a separator of
list elements.
Looks like the prior fix did correct lots of errors.
Prior to the fix, every UTF-8 sequence ending
in the byte \xA0 (or \240) caused trouble with
TclNeedSpace(). After the fix, only the UTF-8
sequence \xC2\xA0 is a problem.
Here's an interactive sequence in plain Tcl
(no C coding required) that demos the remaining
bug:
% interp create \u00a0
&#65533;
% interp create [list \u00a0 foo]
&#65533; foo
% interp alias {} fooset [list \u00a0 foo] set
fooset
% interp target {} fooset
&#65533;foo
% # Just to be really clear...
% llength [interp target {} fooset]
1
Assigning to dkf. If he doesn't
have it fixed by the time I get
to work tomorrow, we'll get it
done then.
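The difference between "is a space" and "terminates a list element" can be sketched as two predicates. This is an illustration only: is_unicode_space_sim is a deliberately tiny stand-in for a Unicode-aware test such as Tcl_UniCharIsSpace(), is_list_separator assumes that the list parser splits on ASCII whitespace only, and need_space_after ignores any further special cases the real TclNeedSpace() may handle. All three names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Tiny stand-in for a Unicode-aware whitespace test: ASCII
 * whitespace plus NO-BREAK SPACE (U+00A0) only. */
int is_unicode_space_sim(unsigned int cp)
{
    return cp == 0x20u || (cp >= 0x09u && cp <= 0x0Du) || cp == 0x00A0u;
}

/* Assumed list-element separators: ASCII whitespace only. */
int is_list_separator(unsigned int cp)
{
    return cp == 0x20u || (cp >= 0x09u && cp <= 0x0Du);
}

/* A separator must be written before the next element unless the
 * text so far is empty or already ends in a real list separator. */
int need_space_after(const unsigned int *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    if (n == 0) {
        return 0;
    }
    return !is_list_separator(cps[n - 1]);
}
```

A result ending in U+00A0 passes the first predicate but not the second, so a separator is still needed before the next element. Using the first predicate in need_space_after would reproduce the missing-separator symptom shown in the transcript.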
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-25 23:12
Message:
Logged In: YES
user_id=80530
Thank you! A good example clarifies a lot.
Certainly looks like dkf's patch failed
to fix things, doesn't it?
Not clear to me why the patch attached
to this report wasn't accepted instead.
Re-opening.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-25 20:37
Message:
Logged In: YES
user_id=21885
====8<==== utfNbspTest.c ====8<====
/*
 * utfNbspTest.c
 *
 * Appending an element to a previous element that ends with the
 * sequence 0xC2A0 (or \302\240), the UTF code for NO-BREAK SPACE,
 * results in an incorrect list.
 *
 * $ gcc -o utfNbspTest utfNbspTest.c -L/path/to/libtcl8.4.* -ltcl8.4
 *
 */
#include <tcl.h>
int
TestCmd(ClientData clientData, Tcl_Interp *interp, int argc,
        char **argv)
{
    Tcl_AppendElement(interp, "foo\302\240");
    Tcl_AppendElement(interp, "bar");
    return TCL_OK;
}
int
My_AppInit(Tcl_Interp *interp)
{
    Tcl_CreateCommand(interp, "TestCmd", (Tcl_CmdProc *)
            TestCmd, NULL, NULL);
    return TCL_OK;
}
int
main(int argc, char **argv)
{
    Tcl_Main(argc, argv, My_AppInit);
    Tcl_Exit(0);
    /* NOTREACHED */
    return 0;
}
====8<==== utfNbspTest.c ====8<====
Here's the transcript showing the error:
$ ./utfNbspTest
% set tcl_patchLevel
8.4.4
% encoding system utf-8
% encoding system
utf-8
% set x [TestCmd]
fooÂ bar
% llength $x
1
% string length $x
7
% string bytelength $x
8
% exit
Of course, yes I know:
1) I should Tcl_Obj'ify everything.
2) Tcl_AppendElement is deprecated (supposedly!)
However, I'm dealing with a good amount of legacy code that
will eventually get changed/modernized, but for now, it needs
to work. If Tcl >8.1 isn't backward compatible, that's fine.
But, to call Tcl_AppendElement "deprecated" when it isn't
backward compatible, well ... that's just wrong.
-- Dossy
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-25 18:06
Message:
Logged In: YES
user_id=80530
Provide the C code that calls Tcl_AppendElement()
and that gives results that are incorrect in either
Tcl 8.4.4 or the HEAD.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-25 17:42
Message:
Logged In: YES
user_id=21885
I'd really hate to pick at an old scab (this bug was closed
back in 09/2001) but exactly what was "fixed" by dkf's
commit?
Against Tcl 8.4.4, using Tcl_AppendElement() which I know is
deprecated, the problem is still occurring. I guess it has to
do with this behavior:
$ string is space [encoding convertfrom utf-8 \302\240]
1
What's annoying is if you do:
> set a foo\302\240
fooÃÂ
> set a [encoding convertfrom utf-8 foo\302\240]
fooÂ
> lappend a bar
fooÂ bar
> llength $a
2
> string bytelength $a
9
That does the right thing. But if you Tcl_AppendElement(),
you'll get "foo\302\240bar", which is bad.
-- Dossy
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2001-09-19 04:53
Message:
Logged In: YES
user_id=79902
Test and fix committed (SF seems to be working at the mo...)
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 17:23
Message:
Logged In: YES
user_id=80530
Assigning to dkf, since he can't log in and assign it
himself.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 13:10
Message:
Logged In: YES
user_id=80530
The bug is in TclNeedSpace(), in generic/tclUtil.c,
part of the Objects Category.
Is there a reason not to accept the patch already
attached to this report? Will it break
TclNeedSpace for its existing callers?
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 12:47
Message:
Logged In: YES
user_id=80530
Here's a sequence of Tcl commands broken by this bug.
% interp create \u5420
?
% interp create [list \u5420 foo]
? foo
% interp alias {} fooset [list \u5420 foo] set
fooset
% interp target {} fooset
?foo
Re-opening the bug.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 12:17
Message:
Logged In: YES
user_id=80530
Self-explanatory and revealing. I think you're missing
the point, Jeff. Adrian's [sendCharList] command
is trying to return the result
[list \u5420 \u5320 \u9760 \u7d20]
but it's failing because Tcl_AppendElement is
mangling his UTF-8 characters that he has
encoded "by hand".
If I can manage it, I'll post a Tcl script that
demos the bug. I think such a script is possible.
Tcl_AppendElement calls haven't been entirely banished
from the Tcl source code.
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-09-18 11:03
Message:
Logged In: YES
user_id=72656
This should be self-explanatory:
(hobbs) 50 % set var \345\220\240
å
(hobbs) 51 % string length $var
3
(hobbs) 52 % string bytelength $var
6
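One way to read the numbers above: in a Tcl script a backslash-octal escape denotes the character with that code, not a raw byte, so \345\220\240 typed at the prompt is three characters (U+00E5, U+0090, U+00A0), each of which needs two bytes in UTF-8. The sketch below only does that arithmetic; utf8_len_bmp and utf8_total_len are hypothetical helpers limited to code points below U+10000.

```c
#include <assert.h>
#include <stddef.h>

/* Number of UTF-8 bytes needed for a code point below U+10000. */
size_t utf8_len_bmp(unsigned int cp)
{
    assert(cp < 0x10000u);
    if (cp < 0x80u) {
        return 1;
    }
    if (cp < 0x800u) {
        return 2;
    }
    return 3;
}

/* Total UTF-8 byte length of a sequence of code points. */
size_t utf8_total_len(const unsigned int *cps, size_t n)
{
    size_t i, total = 0;
    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        total += utf8_len_bmp(cps[i]);
    }
    return total;
}
```

Three characters at two bytes each give the reported bytelength of 6, whereas the single character U+5420 that the C code encodes by hand occupies three bytes.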
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2001-09-18 10:34
Message:
Logged In: YES
user_id=79902
Jeff just happens to be wrong. :^)
The example code contains valid UTF-8 strings. The problem
is that TclNeedSpace doesn't know anything about UTF-8 and
therefore anything depending on it (Tcl_AppendElement,
Tcl_DStringAppendElement and Tcl_DStringStartSublist says a
search with grep, plus goodness knows how much in extensions
as the code is in the stub table) is *not* UTF-8 safe.
Unfortunately, none of those three public functions (two of
which are not deprecated at all) warns in its documentation
that it is unsafe to pass UTF-8 strings to it. :^(
The problems in TclNeedSpace are really the 'end--' which is
fundamentally wrong on UTF-8 strings, and the way it detects
what character it is looking at which needs to be much more
careful when looking at bytes outside \000-\177. Plus
isspace is not usually Unicode-aware...
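The two problems named above (stepping back one byte, and classifying a byte rather than a character) can be shown in a self-contained sketch. This is not the TclNeedSpace() source. All function names are hypothetical, locale_isspace_sim hard-codes a locale in which byte 0xA0 counts as whitespace so that the result is reproducible, and the decoder assumes well-formed UTF-8 of at most three bytes per character.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for isspace() in a locale that classifies 0xA0 as space. */
static int locale_isspace_sim(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r') || c == 0xA0;
}

/* Byte-wise check: inspects only the final byte of the buffer. */
int ends_in_space_bytewise(const char *s, size_t len)
{
    assert(s != NULL && len > 0);
    return locale_isspace_sim((unsigned char)s[len - 1]);
}

/* Decode the last UTF-8 character: step back over continuation
 * bytes (10xxxxxx) to the lead byte, then assemble the code point. */
unsigned int utf8_last_codepoint(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i;
    unsigned int cp;
    assert(s != NULL && len > 0);
    i = len - 1;
    while (i > 0 && (p[i] & 0xC0) == 0x80) {
        i--;
    }
    if (p[i] < 0x80) {
        return p[i];                 /* plain ASCII */
    }
    if ((p[i] & 0xE0) == 0xC0) {
        cp = p[i] & 0x1Fu;           /* lead byte of a 2-byte sequence */
    } else {
        cp = p[i] & 0x0Fu;           /* lead byte of a 3-byte sequence */
    }
    for (i++; i < len; i++) {
        cp = (cp << 6) | (p[i] & 0x3Fu);
    }
    return cp;
}

/* UTF-8 aware check: classify the decoded character, ASCII only. */
int ends_in_space_utf8(const char *s, size_t len)
{
    unsigned int cp = utf8_last_codepoint(s, len);
    return cp == ' ' || (cp >= '\t' && cp <= '\r');
}
```

For the reporter's first string, "\345\220\240", the byte-wise check reports trailing whitespace while the decoding check sees the single character U+5420 and reports none.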
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 01:46
Message:
Logged In: YES
user_id=80530
Sorry if I'm being dense, but what is it about the
strings in Adrian's example that makes them invalid
UTF-8 strings? Is it the terminating null bytes?
How would Tcl_ExternalToUtf be added to the
reported example code to solve the problem?
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-09-17 20:06
Message:
Logged In: YES
user_id=72656
Ah, but you are making a fatal flaw in your argument - you
are *not* passing UTF-8 strings - you are passing
incorrectly formed strings through Tcl. If you converted
these to UTF-8 first (with Tcl_ExternalToUtf), this would
not have happened. That isn't to say this still doesn't
need fixing - but it is one of those areas in the core
where the distinction between using utf-8 and raw data
became important.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-09-17 19:58
Message:
Logged In: YES
user_id=146959
This is NOT a solution. If you don't want to change any
code, you should at least clarify the documentation so that
people in the future don't waste their time. The
documentation should state at the very least that
List-related methods should NOT be used with UTF-8 strings
for communications between C and Tcl. Please see the
comments submitted earlier for this bug for additional
clarification. Thank you.
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-05-03 17:08
Message:
Logged In: YES
user_id=72656
The basic answer at this point is that if you want space
chars to be thought of as space chars in Tcl, you should
restrict yourself to the ascii 7-bit set, of which \240
isn't part. It works on some systems, where the locale
isspace('\240') is 1, but that's not reliable.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-04-01 21:27
Message:
Logged In: YES
user_id=146959
Yes, OK, the suggestion in Tip #20 just mentioned of adding
a locale-independent
isspace() to Tcl and using that would prevent the problem I
had, which arises
because 0240 is defined as a "no-break" space in a number of
important character
encodings, such as ISO-8859-1. This leads a great many
locales, including
en_US, to define 0240 as being in the whitespace category.
Since many UTF
characters have 0240 inside them, this can lead to
problems...
----------------------------------------------------------------------
Comment By: miguel sofer (msofer)
Date: 2001-03-29 19:07
Message:
Logged In: YES
user_id=148712
This bug is related to bugs #408568 and #227512.
See TIP #20 at
http://www.cs.man.ac.uk/fellowsd-bin/TIP/
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-03-29 03:55
Message:
Logged In: YES
user_id=80530
I was talking about the man page for Tcl_AppendElement():
http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm
Now, reading the I18N HOWTO, it looks like I was reading
"deprecated" too strongly. Tcl_DStringAppendElement() and
Tcl_DStringStartSublist() also rely on TclNeedSpace() and
they have not been deprecated, so TclNeedSpace() needs to
be fixed after all. This bug is re-opened.
Looking at TclNeedSpace() explains the mysterious platform
dependence. The buggy symptoms you report will be present
on those platforms/locales for which isspace(0240) returns
true.
I've attached a patch that I think will correct the problem.
It's possible that it has other undesirable side-effects, so
I've assigned this report to one of the maintainers of
generic/tclUtil.c for review.
Meanwhile you can use the workaround I posted in the first
comment.
Tcl_Merge() is safe for UTF-8 strings.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 01:50
Message:
Logged In: YES
user_id=146959
Also, could you please post a pointer to the documentation
you are referring to? It would help clear up other
questions like whether Tcl_Merge is affected...
For example, the docs at
http://dev.scriptics.com/doc/howto/i18n.html do not so much
as hint at the problem. They merely say that all the Tcl C
APIs expect UTF-8 strings, and that everything should work
perfectly if they get them...
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 00:11
Message:
Logged In: YES
user_id=146959
Thanks very much for a response and proposed solution,
however the documentation in the man page unfortunately
says nothing about this issue. It only says that it is
best to use the object versions of the result-handling
functions because it is "significantly more efficient".
This is hardly incentive to go and learn a framework that
is significantly more complex at first sight when all one
wants to do is pass a string and everything has been
running fast enough as it is. Since string-handling is
said to be fully unicode-based in Tcl/Tk 8.1 and above,
the default assumption on a developer's part is to assume
that "string" means "internationalized, UTF-8, or what have
you string", and that Tcl_AppendElement therefore does not
present a problem.
The real solution it seems to me is to repair the deficiency
in TclNeedSpace(), but there may be other constraints,
performance among them, that argue against this. If this
repair is not made, the documentation for Tcl_AppendElement,
and "routines that call it" (how exactly is the typical
Tcl/Tk end-developer supposed to know which those are)
should be updated to reflect the fact that they should not
be used for anything but ASCII. Maybe there is some other
documentation that says something about these issues, but
it should be in the man page as well.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-03-28 17:31
Message:
Logged In: YES
user_id=80530
TclNeedSpace() is not UTF-8 aware. That's why routines
that call it, like Tcl_AppendElement(), are deprecated.
(See the documentation.)
Rewrite your command procedure like so:
    Tcl_Obj *resultPtr;
    ...
    Tcl_ResetResult(interp);
    resultPtr = Tcl_GetObjResult(interp);
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s1, -1));
    ...
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s4, -1));
    return TCL_OK;
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=411825&group_id=10894

Bugs item #411825, was opened at 2001-03-28 07:36
Message generated for change (Comment added) made by dkf
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=411825&group_id=10894
Category: 10. Objects
Group: 8.4.4
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Adrian Robert (arobert3434)
Assigned to: Don Porter (dgp)
Summary: Passing list w/UTF-8 from C can fail
Initial Comment:
On certain installations of Tcl/Tk 8.3.1, the passing of UTF-8
character-triplets ending in octal 240 (decimal 160, hex A0)
interferes with list delimitation when Tcl_AppendElement is used
to return a result from a C function. In particular, if a UTF-8
string ending in octal 240 is appended to the result, and then
another UTF-8 string is appended afterwards, the octal 240 seems
to be interpreted as a "forward delete" character of some kind,
with the result that the separation between the two list elements
is erased and they are interpreted as one.
The following C function, when called from Tcl, illustrates the
problem.
int sendCharList(ClientData clientData, Tcl_Interp *interp,
        int argc, char **argv)
{
    char s1[5], s2[5], s3[5], s4[5];
    strcpy(s1, "\345\220\240");
    strcpy(s2, "\345\214\240");
    strcpy(s3, "\351\235\240");
    strcpy(s4, "\347\264\240");
    Tcl_ResetResult(interp);
    Tcl_AppendElement(interp, s1);
    Tcl_AppendElement(interp, s2);
    Tcl_AppendElement(interp, s3);
    Tcl_AppendElement(interp, s4);
    return TCL_OK;
}
The Tcl calls:
    set s6 [sendCharList]
    puts "[llength $s6] , [string length $s6]"
should output "4 , 7" (4 list elements, each a single UTF-8
composite character plus 3 delimiters). On some systems it does.
On others, however, the output is "1 , 4", resulting from
deletion of the list delimiters somewhere during passage from C
to Tcl. A complete test program involving the above (plus some
additional tests and using wish not tclsh) may be accessed at:
ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz (it is also
attached).
A full application that exposes the bug (and led to its
discovery) may be found at:
http://freshmeat.net/projects/hanzim
Unfortunately, I have not been able to isolate why some
installations exhibit the bug and some don't. A default SUSE 7.0
Linux installation of 8.3.1 had the problem, while a default
Slackware 7.1 installation of the same Tcl/Tk version did not.
Maybe it is a compilation flag difference... ?
I'm also not sure whether it persists in 8.3.2 or 8.4.
----------------------------------------------------------------------
>Comment By: Donal K. Fellows (dkf)
Date: 2003-08-28 00:40
Message:
Logged In: YES
user_id=79902
Behaviour is correct. UTF-8 sequence \302\240 corresponds to ISO8859-1 character \240 (i.e.
non-breaking space.) Non-breaking space is (now, with DGP's patch) considered to not be a space
character and hence not in need of quoting.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-28 00:39
Message:
Logged In: YES
user_id=80530
Let me advise you try again tomorrow.
By then the anonymous CVS at SF
will have caught up to all my commits.
The output you describe sounds
correct to me. Before you file
another bug report, be sure you
understand that Tcl uses UTF-8
encoding internally and by default
converts to your system encoding
on output.
The two byte sequence \302\240
is the UTF-8 encoding for the character
known in Tcl-Unicode notation as \u00a0
which is the non-breaking space. When
you write that character to output on
a system with system encoding of
iso8859-1 it gets written as the single
byte \240 which is the same character
in that encoding. Likewise, if you were
to read in the byte \240 on the same
system, Tcl will convert it back to UTF-8
so by the time Tcl sees it again, it will
be the 2-byte sequence \302\240 .
When you work with an interactive tclsh,
the results you see have actually been
written to stdout, and are in the system
encoding.
If you don't completely follow what I just
said, do not file another bug report yet,
but let's find another channel to straighten
out any misunderstandings about how Tcl
encodings are supposed to work.
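The byte arithmetic behind the explanation above can be checked with a small sketch. The helper names here are hypothetical and are not part of the Tcl API; the sketch only assumes standard UTF-8 encoding rules and the fact that ISO-8859-1 byte values equal their Unicode code points.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Tcl function): encode a code point below
 * U+0800 as UTF-8 and return the number of bytes written (1 or 2). */
size_t EncodeSmallCodePoint(unsigned int cp, unsigned char out[2])
{
    assert(cp < 0x800);
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;               /* plain ASCII byte */
        return 1;
    }
    out[0] = (unsigned char) (0xC0 | (cp >> 6));   /* lead byte  110xxxxx */
    out[1] = (unsigned char) (0x80 | (cp & 0x3F)); /* trail byte 10xxxxxx */
    return 2;
}

/* ISO-8859-1 assigns each byte value the Unicode code point with the
 * same number, so converting one Latin-1 byte is a plain encode. */
size_t Latin1ByteToUtf8(unsigned char byte, unsigned char out[2])
{
    return EncodeSmallCodePoint(byte, out);
}
```

Feeding the single byte \240 through Latin1ByteToUtf8 yields the two bytes \302 \240, which is the internal form described in the comment above.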
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-28 00:12
Message:
Logged In: YES
user_id=21885
Your patch only included tests util-8.5 and util-8.6. I just
checked HEAD and core-8-4-branch and the util.test file
stops at util-8.1.
I'm showing the last checkin for tests/util.test as:
revision 1.11
date: 2003/07/24 16:05:24; author: dgp; state: Exp; lines: +37 -4
I assume this means you didn't get to check your change in,
yet?
Either way, the C test case I provided on 2003-08-25 20:37
passes after applying the patch, kinda. [llength [TestCmd]]
== 2, but now look at what TestCmd outputs:
% encoding system
iso8859-1
% TestCmd
foo bar
Pushing that through "od -xc", here's the actual bytes that
get output:
666f 6fa0 2062 6172
f o o 240 b a r
Instead of \302\240 coming back out, only \240 came back.
At least this is a *different* problem to solve, now. At least
before it would return "foo\302\240bar" -- now, it's
returning "foo\240 bar" -- I'm not exactly sure which is
worse. :-)
However, the behavior I described on 2003-08-26 13:22
hasn't changed, list elements ending in \302\240 don't get
wrapped with {}. Suppose I should file this as a new bug,
now?
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-27 21:22
Message:
Logged In: YES
user_id=21885
Thank you so much, Don. We're going to apply the patch
and do our tests. I'll let you know how it goes!
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-27 20:59
Message:
Logged In: YES
user_id=80530
Here's a copy of the patch I am
committing to HEAD and to
core-8-4-branch.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-27 18:58
Message:
Logged In: YES
user_id=80530
committed new tests to test suite
util-8.3 shows dossy's reported bug
util-8.4 shows another TclNeedSpace bug
Fix on the way.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-26 22:47
Message:
Logged In: YES
user_id=80530
sorry, but it's gonna be another day.
Just as I was testing the patch, a big
storm came through and knocked off
power. Power's back, but the disk
on which the patch is stored has
not come back online yet. Will
get back to this tomorrow.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-26 20:25
Message:
Logged In: YES
user_id=21885
Sounds good, Don. Thanks for the quick response on this!
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-26 20:21
Message:
Logged In: YES
user_id=80530
Tell you what. Let me commit
a fix to the re-opened bug (should
be able to start work on that shortly;
should not take long). Then
after that fix is in, if you still find
something not meeting your
expectations, you can file a new
bug report on that. Thanks.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-26 18:22
Message:
Logged In: YES
user_id=21885
I don't know if this should be entered as a separate bug, but
it's related to this problem (similar fix should address both):
% set a [list "abc "]
{abc }
This is correct -- since the list element ends in whitespace,
it's wrapped with {} for its string representation.
% encoding system
iso8859-1
% set a [list "abc\240\240"]
abc
Here, the string is "abc\240\240" but it's not being wrapped
by {}. But, if [string is space \240] is 1, shouldn't it be?
% encoding system utf-8
% set a [list [encoding convertfrom utf-8 "abc\302\240\302\240"]]
abc
Here, [string is space [encoding convertfrom utf-8 \302\240]]
= 1. Again, the list element isn't being wrapped with {} --
why?
Parts of Tcl treat \240 or \302\240 as a space (and thus
don't insert a list delimiter character) but others don't treat it
as a space, so list elements ending in it don't get {} wrapped
around them when they are stringified.
-- Dossy
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2003-08-26 11:12
Message:
Logged In: YES
user_id=79902
Sorry. I've no time to chase this today. :^/
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-26 04:30
Message:
Logged In: YES
user_id=80530
I see the ChangeLog comment from dkf's patch:
* generic/tclUtil.c (TclNeedSpace): Rewrote to be UTF-8 aware.
[Bug 411825, but not that patch which would have added extra
spaces if there was a real non-ASCII space involved.]
Trouble here is that Tcl_UniCharIsSpace() is the
wrong test. It is not equivalent to
Tcl_UniCharIsAListElementTerminator()
which is what we really need to test. In particular,
the "non-breaking space" \u00A0 returns true
from Tcl_UniCharIsSpace(), but is not recognized
by the list parser in [llength] as a separator of
list elements.
Looks like the prior fix did correct lots of errors.
Prior to the fix, every UTF-8 sequence ending
in the byte \xA0 (or \240) caused trouble with
TclNeedSpace(). After the fix, only the UTF-8
sequence \xC2\xA0 is a problem.
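The distinction drawn above, between "is a space character" and "terminates a list element", can be sketched in a few lines of C. This is a simplified model, not the TclNeedSpace() source: the function names are hypothetical, the space test accepts only U+00A0 beyond ASCII, and the sketch assumes that the list parser splits elements only on ASCII whitespace.

```c
#include <assert.h>

/* Assumption for this sketch: only these ASCII characters separate
 * list elements. */
int IsListSeparator(unsigned int cp)
{
    return cp == ' ' || cp == '\t' || cp == '\n'
        || cp == '\v' || cp == '\f' || cp == '\r';
}

/* Stand-in for a general Unicode space test: it also accepts the
 * non-breaking space U+00A0 (a real test accepts more characters). */
int IsSpaceLike(unsigned int cp)
{
    return IsListSeparator(cp) || cp == 0xA0;
}

/* Simplified model of the decision made before appending an element:
 * a separator must be added unless the last character already is one
 * according to the supplied predicate. */
int NeedSeparatorAfter(unsigned int lastCp, int (*isSep)(unsigned int))
{
    assert(isSep != 0);
    return !isSep(lastCp);
}
```

With IsSpaceLike as the predicate, an element ending in U+00A0 gets no separator and merges with the next element; with IsListSeparator the separator is kept, which matches what the list parser expects.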
Here's an interactive sequence in plain Tcl
(no C coding required) that demos the remaining
bug:
% interp create \u00a0
&#65533;
% interp create [list \u00a0 foo]
&#65533; foo
% interp alias {} fooset [list \u00a0 foo] set
fooset
% interp target {} fooset
&#65533;foo
% # Just to be really clear...
% llength [interp target {} fooset]
1
Assigning to dkf. If he doesn't
have it fixed by the time I get
to work tomorrow, we'll get it
done then.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-26 04:12
Message:
Logged In: YES
user_id=80530
Thank you! A good example clarifies a lot.
Certainly looks like dkf's patch failed
to fix things, doesn't it?
Not clear to me why the patch attached
to this report wasn't accepted instead.
Re-opening.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-26 01:37
Message:
Logged In: YES
user_id=21885
====8<==== utfNbspTest.c ====8<====
/*
 * utfNbspTest.c
 *
 * Appending an element to a previous element that ends with the
 * sequence 0xC2A0 (or \302\240), the UTF code for NO-BREAK SPACE,
 * results in an incorrect list.
 *
 * $ gcc -o utfNbspTest utfNbspTest.c -L/path/to/libtcl8.4.* -ltcl8.4
 *
 */
#include <tcl.h>
int
TestCmd(ClientData clientData, Tcl_Interp *interp, int argc,
        char **argv)
{
    Tcl_AppendElement(interp, "foo\302\240");
    Tcl_AppendElement(interp, "bar");
    return TCL_OK;
}
int
My_AppInit(Tcl_Interp *interp)
{
    Tcl_CreateCommand(interp, "TestCmd", (Tcl_CmdProc *) TestCmd,
            NULL, NULL);
    return TCL_OK;
}
int
main(int argc, char **argv)
{
    Tcl_Main(argc, argv, My_AppInit);
    Tcl_Exit(0);
    /* NOTREACHED */
    return 0;
}
====8<==== utfNbspTest.c ====8<====
Here's the transcript showing the error:
$ ./utfNbspTest
% set tcl_patchLevel
8.4.4
% encoding system utf-8
% encoding system
utf-8
% set x [TestCmd]
fooÂ bar
% llength $x
1
% string length $x
7
% string bytelength $x
8
% exit
Of course, yes I know:
1) I should Tcl_Obj'ify everything.
2) Tcl_AppendElement is deprecated (supposedly!)
However, I'm dealing with a good amount of legacy code that
will eventually get changed/modernized, but for now, it needs
to work. If Tcl >8.1 isn't backward compatible, that's fine.
But, to call Tcl_AppendElement "deprecated" when it isn't
backward compatible, well ... that's just wrong.
-- Dossy
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-25 23:06
Message:
Logged In: YES
user_id=80530
Provide the C code that calls Tcl_AppendElement()
and that gives results that are incorrect in either
Tcl 8.4.4 or the HEAD.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-25 22:42
Message:
Logged In: YES
user_id=21885
I'd really hate to pick at an old scab (this bug was closed
back in 09/2001) but exactly what was "fixed" by dkf's
commit?
Against Tcl 8.4.4, using Tcl_AppendElement() which I know is
deprecated, the problem is still occurring. I guess it has to
do with this behavior:
$ string is space [encoding convertfrom utf-8 \302\240]
1
What's annoying is if you do:
> set a foo\302\240
fooÃÂ
> set a [encoding convertfrom utf-8 foo\302\240]
fooÂ
> lappend a bar
fooÂ bar
> llength $a
2
> string bytelength $a
9
That does the right thing. But if you Tcl_AppendElement(),
you'll get "foo\302\240bar", which is bad.
-- Dossy
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2001-09-19 09:53
Message:
Logged In: YES
user_id=79902
Test and fix committed (SF seems to be working at the mo...)
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 22:23
Message:
Logged In: YES
user_id=80530
Assigning to dkf, since he can't log in and assign it
himself.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 18:10
Message:
Logged In: YES
user_id=80530
The bug is in TclNeedSpace(), in generic/tclUtil.c,
part of the Objects Category.
Is there a reason not to accept the patch already
attached to this report? Will it break
TclNeedSpace for its existing callers?
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 17:47
Message:
Logged In: YES
user_id=80530
Here's a sequence of Tcl commands broken by this bug.
% interp create \u5420
?
% interp create [list \u5420 foo]
? foo
% interp alias {} fooset [list \u5420 foo] set
fooset
% interp target {} fooset
?foo
Re-opening the bug.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 17:17
Message:
Logged In: YES
user_id=80530
Self-explanatory and revealing. I think you're missing
the point, Jeff. Adrian's [sendCharList] command
is trying to return the result
[list \u5420 \u5320 \u9760 \u7d20]
but it's failing because Tcl_AppendElement is
mangling his UTF-8 characters that he has
encoded "by hand".
If I can manage it, I'll post a Tcl script that
demos the bug. I think such a script is possible.
Tcl_AppendElement calls haven't been entirely banished
from the Tcl source code.
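The correspondence between the hand-encoded byte triplets in [sendCharList] and the \u escapes above can be verified with a short decoder. DecodeUtf8Triplet is a hypothetical helper written for this illustration only; it handles just the three-byte form and does not reject overlong or surrogate encodings.

```c
#include <assert.h>

/* Hypothetical helper: decode one three-byte UTF-8 sequence of the
 * form 1110xxxx 10xxxxxx 10xxxxxx into its code point. */
unsigned int DecodeUtf8Triplet(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;

    assert((p[0] & 0xF0) == 0xE0);   /* lead byte of a 3-byte sequence */
    assert((p[1] & 0xC0) == 0x80);   /* first continuation byte */
    assert((p[2] & 0xC0) == 0x80);   /* second continuation byte */
    return ((unsigned int) (p[0] & 0x0F) << 12)
         | ((unsigned int) (p[1] & 0x3F) << 6)
         |  (unsigned int) (p[2] & 0x3F);
}
```

For example, "\345\220\240" is the bytes E5 90 A0, which decode to 0x5420. All four strings are well-formed UTF-8 whose final byte happens to be \240.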
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-09-18 16:03
Message:
Logged In: YES
user_id=72656
This should be self-explanatory:
(hobbs) 50 % set var \345\220\240
å
(hobbs) 51 % string length $var
3
(hobbs) 52 % string bytelength $var
6
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2001-09-18 15:34
Message:
Logged In: YES
user_id=79902
Jeff just happens to be wrong. :^)
The example code contains valid UTF-8 strings. The problem
is that TclNeedSpace doesn't know anything about UTF-8, and
therefore anything depending on it (Tcl_AppendElement,
Tcl_DStringAppendElement and Tcl_DStringStartSublist, according
to a search with grep, plus goodness knows how much in extensions,
as the code is in the stub table) is *not* UTF-8 safe.
Unfortunately, none of those three public functions (two of
which are not deprecated at all) warns in its documentation
that it is unsafe to pass UTF-8 strings to it. :^(
The problems in TclNeedSpace are really the 'end--' which is
fundamentally wrong on UTF-8 strings, and the way it detects
what character it is looking at which needs to be much more
careful when looking at bytes outside \000-\177. Plus
isspace is not usually Unicode-aware...
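A rough sketch of the two approaches may help here. It is not the TclNeedSpace() source: both function names are hypothetical, the naive version hard-codes 0xA0 as a space to model a locale where isspace(0xA0) is true, and the UTF-8-aware version assumes that only single-byte ASCII whitespace counts as a separator.

```c
#include <assert.h>
#include <stddef.h>

/* Naive check: inspects only the final byte of the buffer. Accepting
 * 0xA0 models a locale in which isspace(0xA0) returns true. */
int LastByteLooksLikeSpace(const char *buf, size_t len)
{
    unsigned char c;

    assert(len > 0);
    c = (unsigned char) buf[len - 1];
    return c == ' ' || c == '\t' || c == '\n' || c == 0xA0;
}

/* UTF-8-aware check: back up over continuation bytes (10xxxxxx) to the
 * lead byte of the last character. A character that occupies more than
 * one byte is never an ASCII separator. */
int LastCharIsAsciiSpace(const char *buf, size_t len)
{
    size_t i;
    unsigned char c;

    assert(len > 0);
    i = len - 1;
    while (i > 0 && ((unsigned char) buf[i] & 0xC0) == 0x80) {
        i--;
    }
    if (i != len - 1) {
        return 0;               /* last character is multi-byte */
    }
    c = (unsigned char) buf[i];
    return c == ' ' || c == '\t' || c == '\n';
}
```

On the buffer "\345\220\240" the naive check reports a trailing space, so no delimiter would be added, while the UTF-8-aware check correctly reports that the buffer ends in a non-space character.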
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 06:46
Message:
Logged In: YES
user_id=80530
Sorry if I'm being dense, but what is it about the
strings in Adrian's example that makes them invalid
UTF-8 strings? Is it the terminating null bytes?
How would Tcl_ExternalToUtf be added to the
reported example code to solve the problem?
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-09-18 01:06
Message:
Logged In: YES
user_id=72656
Ah, but you are making a fatal flaw in your argument - you
are *not* passing UTF-8 strings - you are passing
incorrectly formed strings through Tcl. If you converted
these to UTF-8 first (with Tcl_ExternalToUtf), this would
not have happened. That isn't to say this still doesn't
need fixing - but it is one of those areas in the core
where the distinction between using utf-8 and raw data
became important.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-09-18 00:58
Message:
Logged In: YES
user_id=146959
This is NOT a solution. If you don't want to change any
code, you should at least clarify the documentation so that
people in the future don't waste their time. The
documentation should state at the very least that
List-related methods should NOT be used with UTF-8 strings
for communications between C and Tcl. Please see the
comments submitted earlier for this bug for additional
clarification. Thank you.
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-05-03 22:08
Message:
Logged In: YES
user_id=72656
The basic answer at this point is that if you want space
chars to be thought of as space chars in Tcl, you should
restrict yourself to the ascii 7-bit set, of which \240
isn't part. It works on some systems, where the locale
isspace('\240') is 1, but that's not reliable.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-04-02 02:27
Message:
Logged In: YES
user_id=146959
Yes, OK, the suggestion in Tip #20 just mentioned of adding
a locale-independent isspace() to Tcl and using that would
prevent the problem I had, which arises because 0240 is defined
as a "no-break" space in a number of important character
encodings, such as ISO-8859-1. This leads a great many locales,
including en_US, to define 0240 as being in the whitespace
category. Since many UTF characters have 0240 inside them, this
can lead to problems...
----------------------------------------------------------------------
Comment By: miguel sofer (msofer)
Date: 2001-03-30 01:07
Message:
Logged In: YES
user_id=148712
This bug is related to bugs #408568 and #227512.
See TIP #20 at
http://www.cs.man.ac.uk/fellowsd-bin/TIP/
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-03-29 09:55
Message:
Logged In: YES
user_id=80530
I was talking about the man page for Tcl_AppendElement():
http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm
Now, reading the I18N HOWTO, it looks like I was reading
"deprecated" too strongly. Tcl_DStringAppendElement() and
Tcl_DStringStartSublist() also rely on TclNeedSpace() and
they have not been deprecated, so TclNeedSpace() needs to
be fixed after all. This bug is re-opened.
Looking at TclNeedSpace() explains the mysterious platform
dependence. The buggy symptoms you report will be present
on those platforms/locales for which isspace(0240) returns
true.
I've attached a patch that I think will correct the problem.
It's possible that it has other undesirable side-effects, so
I've assigned this report to one of the maintainers of
generic/tclUtil.c for review.
Meanwhile you can use the workaround I posted in the first
comment.
Tcl_Merge() is safe for UTF-8 strings.
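The platform dependence described in this comment can be illustrated with standard C alone. The sketch below is not the attached patch: AsciiIsSpace and LibcSaysSpace0240 are hypothetical names, and which locale names exist on a given system is an assumption the caller has to check.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical locale-independent test: ASCII whitespace only
 * (space, plus \t \n \v \f \r, which are codes 9 through 13). */
int AsciiIsSpace(int c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Ask the C library whether byte 0240 is a space under the named
 * locale. Returns 1 or 0, or -1 if the locale is not available.
 * The LC_CTYPE locale is reset to "C" afterwards, not to its
 * previous value. */
int LibcSaysSpace0240(const char *localeName)
{
    int result;

    assert(localeName != NULL);
    if (setlocale(LC_CTYPE, localeName) == NULL) {
        return -1;
    }
    result = isspace(0xA0) != 0;
    setlocale(LC_CTYPE, "C");
    return result;
}
```

In the "C" locale the library answer is always 0; under an ISO-8859-1 locale it may be 1, depending on the platform, which is the variation reported between installations. AsciiIsSpace gives the same answer everywhere.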
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 07:50
Message:
Logged In: YES
user_id=146959
Also, could you please post a pointer to the documentation
you are referring to? It would help clear up other
questions like whether Tcl_Merge is affected...
For example, the docs at
http://dev.scriptics.com/doc/howto/i18n.html do not so much
as hint at the problem. They merely say that all the Tcl C
APIs expect UTF-8 strings, and that everything should work
perfectly if they get them...
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 06:11
Message:
Logged In: YES
user_id=146959
Thanks very much for a response and proposed solution,
however the documentation in the man page unfortunately
says nothing about this issue. It only says that it is
best to use the object versions of the result-handling
functions because it is "significantly more efficient".
This is hardly incentive to go and learn a framework that
is significantly more complex at first sight when all one
wants to do is pass a string and everything has been
running fast enough as it is. Since string-handling is
said to be fully unicode-based in Tcl/Tk 8.1 and above,
the default assumption on a developer's part is to assume
that "string" means "internationalized, UTF-8, or what have
you string", and that Tcl_AppendElement therefore does not
present a problem.
The real solution it seems to me is to repair the deficiency
in TclNeedSpace(), but there may be other constraints,
performance among them, that argue against this. If this
repair is not made, the documentation for Tcl_AppendElement,
and "routines that call it" (how exactly is the typical
Tcl/Tk end-developer supposed to know which those are)
should be updated to reflect the fact that they should not
be used for anything but ASCII. Maybe there is some other
documentation that says something about these issues, but
it should be in the man page as well.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-03-28 23:31
Message:
Logged In: YES
user_id=80530
TclNeedSpace() is not UTF-8 aware. That's why routines
that call it, like Tcl_AppendElement(), are deprecated.
(See the documentation.)
Rewrite your command procedure like so:
    Tcl_Obj *resultPtr;
    ...
    Tcl_ResetResult(interp);
    resultPtr = Tcl_GetObjResult(interp);
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s1, -1));
    ...
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s4, -1));
    return TCL_OK;
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=411825&group_id=10894

Bugs item #411825, was opened at 2001-03-28 01:36
Message generated for change (Comment added) made by dgp
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=411825&group_id=10894
Category: 10. Objects
Group: 8.4.4
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Adrian Robert (arobert3434)
Assigned to: Don Porter (dgp)
Summary: Passing list w/UTF-8 from C can fail
Initial Comment:
On certain installations of Tcl/Tk 8.3.1, the passing of UTF-8
character-triplets ending in octal 240 (decimal 160, hex A0)
interferes with list delimitation when Tcl_AppendElement is used
to return a result from a C function. In particular, if a UTF-8
string ending in octal 240 is appended to the result, and then
another UTF-8 string is appended afterwards, the octal 240 seems
to be interpreted as a "forward delete" character of some kind,
with the result that the separation between the two list elements
is erased and they are interpreted as one.
The following C function, when called from Tcl, illustrates the
problem.
int sendCharList(ClientData clientData, Tcl_Interp *interp,
        int argc, char **argv)
{
    char s1[5], s2[5], s3[5], s4[5];
    strcpy(s1, "\345\220\240");
    strcpy(s2, "\345\214\240");
    strcpy(s3, "\351\235\240");
    strcpy(s4, "\347\264\240");
    Tcl_ResetResult(interp);
    Tcl_AppendElement(interp, s1);
    Tcl_AppendElement(interp, s2);
    Tcl_AppendElement(interp, s3);
    Tcl_AppendElement(interp, s4);
    return TCL_OK;
}
The Tcl calls:
    set s6 [sendCharList]
    puts "[llength $s6] , [string length $s6]"
should output "4 , 7" (4 list elements, each a single UTF-8
composite character plus 3 delimiters). On some systems it does.
On others, however, the output is "1 , 4", resulting from
deletion of the list delimiters somewhere during passage from C
to Tcl. A complete test program involving the above (plus some
additional tests and using wish not tclsh) may be accessed at:
ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz (it is also
attached).
A full application that exposes the bug (and led to its
discovery) may be found at:
http://freshmeat.net/projects/hanzim
Unfortunately, I have not been able to isolate why some
installations exhibit the bug and some don't. A default SUSE 7.0
Linux installation of 8.3.1 had the problem, while a default
Slackware 7.1 installation of the same Tcl/Tk version did not.
Maybe it is a compilation flag difference... ?
I'm also not sure whether it persists in 8.3.2 or 8.4.
----------------------------------------------------------------------
>Comment By: Don Porter (dgp)
Date: 2003-08-27 19:39
Message:
Logged In: YES
user_id=80530
Let me advise you try again tomorrow.
By then the anonymous CVS at SF
will have caught up to all my commits.
The output you describe sounds
correct to me. Before you file
another bug report, be sure you
understand that Tcl uses UTF-8
encoding internally and by default
converts to your system encoding
on output.
The two byte sequence \302\240
is the UTF-8 encoding for the character
known in Tcl-Unicode notation as \u00a0
which is the non-breaking space. When
you write that character to output on
a system with system encoding of
iso8859-1 it gets written as the single
byte \240 which is the same character
in that encoding. Likewise, if you were
to read in the byte \240 on the same
system, Tcl will convert it back to UTF-8
so by the time Tcl sees it again, it will
be the 2-byte sequence \302\240 .
When you work with an interactive tclsh,
the results you see have actually been
written to stdout, and are in the system
encoding.
If you don't completely follow what I just
said, do not file another bug report yet,
but let's find another channel to straighten
out any misunderstandings about how Tcl
encodings are supposed to work.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-27 19:12
Message:
Logged In: YES
user_id=21885
Your patch only included tests util-8.5 and util-8.6. I just
checked HEAD and core-8-4-branch and the util.test file
stops at util-8.1.
I'm showing the last checkin for tests/util.test as:
revision 1.11
date: 2003/07/24 16:05:24; author: dgp; state: Exp; lines: +37 -4
I assume this means you didn't get to check your change in,
yet?
Either way, the C test case I provided on 2003-08-25 20:37
passes after applying the patch, kinda. [llength [TestCmd]]
== 2, but now look at what TestCmd outputs:
% encoding system
iso8859-1
% TestCmd
foo bar
Pushing that through "od -xc", here's the actual bytes that
get output:
666f 6fa0 2062 6172
f o o 240 b a r
Instead of \302\240 coming back out, only \240 came back.
At least this is a *different* problem to solve, now. At least
before it would return "foo\302\240bar" -- now, it's
returning "foo\240 bar" -- I'm not exactly sure which is
worse. :-)
However, the behavior I described on 2003-08-26 13:22
hasn't changed, list elements ending in \302\240 don't get
wrapped with {}. Suppose I should file this as a new bug,
now?
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-27 16:22
Message:
Logged In: YES
user_id=21885
Thank you so much, Don. We're going to apply the patch
and do our tests. I'll let you know how it goes!
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-27 15:59
Message:
Logged In: YES
user_id=80530
Here's a copy of the patch I am
committing to HEAD and to
core-8-4-branch.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-27 13:58
Message:
Logged In: YES
user_id=80530
committed new tests to test suite
util-8.3 shows dossy's reported bug
util-8.4 shows another TclNeedSpace bug
Fix on the way.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-26 17:47
Message:
Logged In: YES
user_id=80530
sorry, but it's gonna be another day.
Just as I was testing the patch, a big
storm came through and knocked off
power. Power's back, but the disk
on which the patch is stored has
not come back online yet. Will
get back to this tomorrow.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-26 15:25
Message:
Logged In: YES
user_id=21885
Sounds good, Don. Thanks for the quick response on this!
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-26 15:21
Message:
Logged In: YES
user_id=80530
Tell you what. Let me commit
a fix to the re-opened bug (should
be able to start work on that shortly;
should not take long). Then
after that fix is in, if you still find
something not meeting your
expectations, you can file a new
bug report on that. Thanks.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-26 13:22
Message:
Logged In: YES
user_id=21885
I don't know if this should be entered as a separate bug, but
it's related to this problem (similar fix should address both):
% set a [list "abc "]
{abc }
This is correct -- since the list element ends in whitespace,
it's wrapped with {} for its string representation.
% encoding system
iso8859-1
% set a [list "abc\240\240"]
abc
Here, the string is "abc\240\240" but it's not being wrapped
by {}. But, if [string is space \240] is 1, shouldn't it be?
% encoding system utf-8
% set a [list [encoding convertfrom utf-8 "abc\302\240\302\240"]]
abc
Here, [string is space [encoding convertfrom utf-8 \302\240]]
= 1. Again, the list element isn't being wrapped with {} --
why?
Parts of Tcl treat \240 or \302\240 as a space (and thus
don't insert a list delimiter character) but others don't treat it
as a space, so stringified list elements don't get {} wrapped
around them.
-- Dossy
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2003-08-26 06:12
Message:
Logged In: YES
user_id=79902
Sorry. I've no time to chase this today. :^/
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-25 23:30
Message:
Logged In: YES
user_id=80530
I see the ChangeLog comment from dkf's patch:
* generic/tclUtil.c (TclNeedSpace): Rewrote to be UTF-8 aware.
[Bug 411825, but not that patch which would have added extra
spaces if there was a real non-ASCII space involved. ]
Trouble here is that Tcl_UniCharIsSpace() is the
wrong test. It is not equivalent to
Tcl_UniCharIsAListElementTerminator()
which is what we really need to test. In particular,
the "non-breaking space" \u00A0 returns true
from Tcl_UniCharIsSpace(), but is not recognized
by the list parser in [llength] as a separator of
list elements.
Looks like the prior fix did correct lots of errors.
Prior to the fix, every UTF-8 sequence ending
in the byte \xA0 (or \240) caused trouble with
TclNeedSpace(). After the fix, only the UTF-8
sequence \xC2\xA0 is a problem.
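To make the distinction concrete, here is a rough sketch in plain C. It assumes the list parser splits only on the six ASCII whitespace characters; the function names are hypothetical stand-ins, not real Tcl APIs, and the "need a separator" check is simplified (the real routine also has to deal with an empty buffer, open braces and backslashes).

```c
#include <assert.h>

/*
 * Hypothetical stand-in for a "list element terminator" test:
 * space, \t, \n, \v, \f and \r only (assumed list syntax).
 */
static int
is_list_separator(unsigned int ch)
{
    return ch == ' ' || (ch >= '\t' && ch <= '\r');
}

/*
 * Simplified stand-in for a Unicode-aware space test: ASCII
 * whitespace plus NO-BREAK SPACE (U+00A0). A real Unicode test
 * covers many more code points.
 */
static int
is_unicode_space(unsigned int ch)
{
    return is_list_separator(ch) || ch == 0x00A0;
}

/*
 * Decide whether a separating space must be appended before the
 * next list element, given the last character already in the
 * buffer. Using is_unicode_space() here instead would wrongly
 * skip the separator after U+00A0.
 */
static int
need_separator_after(unsigned int last_ch)
{
    assert(last_ch <= 0x10FFFF);
    return !is_list_separator(last_ch);
}
```

With these definitions, U+00A0 counts as a space but not as a list separator, so a separator is still required after it. That is the case the Tcl_UniCharIsSpace() test gets wrong.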
Here's an interactive sequence in plain Tcl
(no C coding required) that demos the remaining
bug:
% interp create \u00a0
&#65533;
% interp create [list \u00a0 foo]
&#65533; foo
% interp alias {} fooset [list \u00a0 foo] set
fooset
% interp target {} fooset
&#65533;foo
% # Just to be really clear...
% llength [interp target {} fooset]
1
Assigning to dkf. If he doesn't
have it fixed by the time I get
to work tomorrow, we'll get it
done then.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-25 23:12
Message:
Logged In: YES
user_id=80530
Thank you! A good example clarifies a lot.
Certainly looks like dkf's patch failed
to fix things, doesn't it?
Not clear to me why the patch attached
to this report wasn't accepted instead.
Re-opening.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-25 20:37
Message:
Logged In: YES
user_id=21885
====8<==== utfNbspTest.c ====8<====
/*
 * utfNbspTest.c
 *
 * Appending an element to a previous element that ends with the
 * sequence 0xC2A0 (or \302\240), the UTF code for NO-BREAK SPACE,
 * results in an incorrect list.
 *
 * $ gcc -o utfNbspTest utfNbspTest.c -L/path/to/libtcl8.4.* -ltcl8.4
 *
 */
#include <tcl.h>

int
TestCmd(ClientData clientData, Tcl_Interp *interp, int argc,
        char **argv)
{
    Tcl_AppendElement(interp, "foo\302\240");
    Tcl_AppendElement(interp, "bar");
    return TCL_OK;
}

int
My_AppInit(Tcl_Interp *interp)
{
    Tcl_CreateCommand(interp, "TestCmd", (Tcl_CmdProc *) TestCmd,
            NULL, NULL);
    return TCL_OK;
}

int
main(int argc, char **argv)
{
    Tcl_Main(argc, argv, My_AppInit);
    Tcl_Exit(0);
    /* NOTREACHED */
    return 0;
}
====8<==== utfNbspTest.c ====8<====
Here's the transcript showing the error:
$ ./utfNbspTest
% set tcl_patchLevel
8.4.4
% encoding system utf-8
% encoding system
utf-8
% set x [TestCmd]
fooÂ bar
% llength $x
1
% string length $x
7
% string bytelength $x
8
% exit
Of course, yes I know:
1) I should Tcl_Obj'ify everything.
2) Tcl_AppendElement is deprecated (supposedly!)
However, I'm dealing with a good amount of legacy code that
will eventually get changed/modernized, but for now, it needs
to work. If Tcl >8.1 isn't backward compatible, that's fine.
But, to call Tcl_AppendElement "deprecated" when it isn't
backward compatible, well ... that's just wrong.
-- Dossy
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2003-08-25 18:06
Message:
Logged In: YES
user_id=80530
Provide the C code that calls Tcl_AppendElement()
and that gives results that are incorrect in either
Tcl 8.4.4 or the HEAD.
----------------------------------------------------------------------
Comment By: Dossy Shiobara (dossy)
Date: 2003-08-25 17:42
Message:
Logged In: YES
user_id=21885
I'd really hate to pick at an old scab (this bug was closed
back in 09/2001) but exactly what was "fixed" by dkf's
commit?
Against Tcl 8.4.4, using Tcl_AppendElement() which I know is
deprecated, the problem is still occurring. I guess it has to
do with this behavior:
$ string is space [encoding convertfrom utf-8 \302\240]
1
What's annoying is if you do:
> set a foo\302\240
fooÃÂ
> set a [encoding convertfrom utf-8 foo\302\240]
fooÂ
> lappend a bar
fooÂ bar
> llength $a
2
> string bytelength $a
9
That does the right thing. But if you Tcl_AppendElement(),
you'll get "foo\302\240bar", which is bad.
-- Dossy
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2001-09-19 04:53
Message:
Logged In: YES
user_id=79902
Test and fix committed (SF seems to be working at the mo...)
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 17:23
Message:
Logged In: YES
user_id=80530
Assigning to dkf, since he can't log in and assign it
himself.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 13:10
Message:
Logged In: YES
user_id=80530
The bug is in TclNeedSpace(), in generic/tclUtil.c,
part of the Objects Category.
Is there a reason not to accept the patch already
attached to this report? Will it break
TclNeedSpace for its existing callers?
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 12:47
Message:
Logged In: YES
user_id=80530
Here's a sequence of Tcl commands broken by this bug.
% interp create \u5420
?
% interp create [list \u5420 foo]
? foo
% interp alias {} fooset [list \u5420 foo] set
fooset
% interp target {} fooset
?foo
Re-opening the bug.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 12:17
Message:
Logged In: YES
user_id=80530
Self-explanatory and revealing. I think you're missing
the point, Jeff. Adrian's [sendCharList] command
is trying to return the result
[list \u5420 \u5320 \u9760 \u7d20]
but it's failing because Tcl_AppendElement is
mangling his UTF-8 characters that he has
encoded "by hand".
If I can manage it, I'll post a Tcl script that
demos the bug. I think such a script is possible.
Tcl_AppendElement calls haven't been entirely banished
from the Tcl source code.
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-09-18 11:03
Message:
Logged In: YES
user_id=72656
This should be self-explanatory:
(hobbs) 50 % set var \345\220\240
å
(hobbs) 51 % string length $var
3
(hobbs) 52 % string bytelength $var
6
----------------------------------------------------------------------
Comment By: Donal K. Fellows (dkf)
Date: 2001-09-18 10:34
Message:
Logged In: YES
user_id=79902
Jeff just happens to be wrong. :^)
The example code contains valid UTF-8 strings. The problem
is that TclNeedSpace doesn't know anything about UTF-8 and
therefore anything depending on it (Tcl_AppendElement,
Tcl_DStringAppendElement and Tcl_DStringStartSublist says a
search with grep, plus goodness knows how much in extensions
as the code is in the stub table) is *not* UTF-8 safe.
Unfortunately, none of those three public functions (two of
which are not deprecated at all) warns in its documentation
that it is unsafe to pass UTF-8 strings to it. :^(
The problems in TclNeedSpace are really the 'end--' which is
fundamentally wrong on UTF-8 strings, and the way it detects
what character it is looking at which needs to be much more
careful when looking at bytes outside \000-\177. Plus
isspace is not usually Unicode-aware...
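The 'end--' problem can be illustrated with a small sketch in plain C. This is not the code from generic/tclUtil.c; the helper names are hypothetical, and well-formed UTF-8 input is assumed. Instead of stepping back a single byte, it walks back over continuation bytes to the first byte of the last character.

```c
#include <assert.h>
#include <stddef.h>

/* True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int
is_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

/*
 * Return a pointer to the first byte of the last character in the
 * buffer [start, end). Assumes well-formed UTF-8 and end > start.
 * Hypothetical helper for illustration only.
 */
static const char *
last_char_start(const char *start, const char *end)
{
    const char *p;

    assert(start != NULL && end > start);
    p = end - 1;
    /* A plain 'end--' would stop here, possibly mid-character. */
    while (p > start && is_continuation((unsigned char) *p)) {
        p--;
    }
    return p;
}
```

For the string "\345\220\240" (U+5420), a plain 'end--' lands on the final byte \240, which some locales classify as a space. The sketch above instead returns the position of the lead byte \345, so the character can be examined as a whole.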
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-09-18 01:46
Message:
Logged In: YES
user_id=80530
Sorry if I'm being dense, but what is it about the
strings in Adrian's example that makes them invalid
UTF-8 strings? Is it the terminating null bytes?
How would Tcl_ExternalToUtf be added to the
reported example code to solve the problem?
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-09-17 20:06
Message:
Logged In: YES
user_id=72656
Ah, but you are making a fatal flaw in your argument - you
are *not* passing UTF-8 strings - you are passing
incorrectly formed strings through Tcl. If you converted
these to UTF-8 first (with Tcl_ExternalToUtf), this would
not have happened. That isn't to say this still doesn't
need fixing - but it is one of those areas in the core
where the distinction between using utf-8 and raw data
became important.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-09-17 19:58
Message:
Logged In: YES
user_id=146959
This is NOT a solution. If you don't want to change any
code, you should at least clarify the documentation so that
people in the future don't waste their time. The
documentation should state at the very least that
List-related methods should NOT be used with UTF-8 strings
for communications between C and Tcl. Please see the
comments submitted earlier for this bug for additional
clarification. Thank you.
----------------------------------------------------------------------
Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-05-03 17:08
Message:
Logged In: YES
user_id=72656
The basic answer at this point is that if you want space
chars to be thought of as space chars in Tcl, you should
restrict yourself to the ascii 7-bit set, of which \240
isn't part. It works on some systems, where the locale
isspace('\240') is 1, but that's not reliable.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-04-01 21:27
Message:
Logged In: YES
user_id=146959
Yes, OK, the suggestion in Tip #20 just mentioned of adding a
locale-independent isspace() to Tcl and using that would prevent
the problem I had, which arises because 0240 is defined as a
"no-break" space in a number of important character encodings,
such as ISO-8859-1. This leads a great many locales, including
en_US, to define 0240 as being in the whitespace category. Since
many UTF characters have 0240 inside them, this can lead to
problems...
----------------------------------------------------------------------
Comment By: miguel sofer (msofer)
Date: 2001-03-29 19:07
Message:
Logged In: YES
user_id=148712
This bug is related to bugs #408568 and #227512.
See TIP #20 at
http://www.cs.man.ac.uk/fellowsd-bin/TIP/
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-03-29 03:55
Message:
Logged In: YES
user_id=80530
I was talking about the man page for Tcl_AppendElement():
http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm
Now, reading the I18N HOWTO, it looks like I was reading
"deprecated" too strongly. Tcl_DStringAppendElement() and
Tcl_DStringStartSublist() also rely on TclNeedSpace() and
they have not been deprecated, so TclNeedSpace() needs to
be fixed after all. This bug is re-opened.
Looking at TclNeedSpace() explains the mysterious platform
dependence. The buggy symptoms you report will be present
on those platforms/locales for which isspace(0240) returns
true.
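One way to check a given platform is sketched below in plain C, using only the standard library. The helper name isspace_in_locale is made up for illustration, and which locale names are installed varies from system to system.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Report whether isspace() accepts `byte` under the named LC_CTYPE
 * locale. Returns 1 or 0, or -1 if the locale cannot be selected
 * or memory runs out. The previous locale is restored on return.
 */
static int
isspace_in_locale(const char *locale_name, unsigned char byte)
{
    const char *current;
    char *saved;
    int result;

    assert(locale_name != NULL);

    /* Copy the current locale name; later setlocale() calls may
     * overwrite the string it returned. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL) {
        return -1;
    }
    saved = malloc(strlen(current) + 1);
    if (saved == NULL) {
        return -1;
    }
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        free(saved);
        return -1;
    }
    result = isspace(byte) ? 1 : 0;

    setlocale(LC_CTYPE, saved);
    free(saved);
    return result;
}
```

In the "C" locale, isspace_in_locale("C", 0240) returns 0. With an ISO-8859-1 locale (the exact name, for example "en_US.ISO-8859-1", is an assumption and may not exist on a given system) the result may be 1, which would account for the platform dependence.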
I've attached a patch that I think will correct the problem.
It's possible that it has other undesirable side-effects, so
I've assigned this report to one of the maintainers of
generic/tclUtil.c for review.
Meanwhile you can use the workaround I posted in the first
comment.
Tcl_Merge() is safe for UTF-8 strings.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 01:50
Message:
Logged In: YES
user_id=146959
Also, could you please post a pointer to the documentation
you are referring to? It would help clear up other
questions like whether Tcl_Merge is affected...
For example, the docs at
http://dev.scriptics.com/doc/howto/i18n.html do not so much
as hint at the problem. They merely say that all the Tcl C
APIs expect UTF-8 strings, and that everything should work
perfectly if they get them...
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 00:34
Message:
Logged In: YES
user_id=146959
Also, could you please post a pointer to the documentation
you are referring to? It would help clear up other
questions like whether Tcl_Merge is affected...
For example, the docs at
http://dev.scriptics.com/doc/howto/i18n.html do not so much
as hint at the problem. They merely say that all the Tcl C
APIs expect UTF-8 strings, and that everything should work
perfectly if they get them...
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 00:11
Message:
Logged In: YES
user_id=146959
Thanks very much for a response and proposed solution,
however the documentation in the man page unfortunately
says nothing about this issue. It only says that it is
best to use the object versions of the result-handling
functions because it is "significantly more efficient".
This is hardly incentive to go and learn a framework that
is significantly more complex at first sight when all one
wants to do is pass a string and everything has been
running fast enough as it is. Since string-handling is
said to be fully unicode-based in Tcl/Tk 8.1 and above,
the default assumption on a developer's part is to assume
that "string" means "internationalized, UTF-8, or what have
you string", and that Tcl_AppendElement therefore does not
present a problem.
The real solution it seems to me is to repair the deficiency
in TclNeedSpace(), but there may be other constraints,
performance among them, that argue against this. If this
repair is not made, the documentation for Tcl_AppendElement,
and "routines that call it" (how exactly is the typical
Tcl/Tk end-developer supposed to know which those are)
should be updated to reflect the fact that they should not
be used for anything but ASCII. Maybe there is some other
documentation that says something about these issues, but
it should be in the man page as well.
----------------------------------------------------------------------
Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 00:09
Message:
Logged In: YES
user_id=146959
Thanks very much for a response and proposed solution,
however the documentation in the man page unfortunately
says nothing about this issue. It only says that it is
best to use the object versions of the result-handling
functions because it is "significantly more efficient".
This is hardly incentive to go and learn a framework that
is significantly more complex at first sight when all one
wants to do is pass a string and everything has been
running fast enough as it is. Since string-handling is
said to be fully unicode-based in Tcl/Tk 8.1 and above,
the default assumption on a developer's part is to assume
that "string" means "internationalized, UTF-8, or what have
you string", and that Tcl_AppendElement therefore does not
present a problem.
The real solution it seems to me is to repair the deficiency
in TclNeedSpace(), but there may be other constraints,
performance among them, that argue against this. If this
repair is not made, the documentation for Tcl_AppendElement,
and "routines that call it" (how exactly is the typical
Tcl/Tk end-developer supposed to know which those are)
should be updated to reflect the fact that they should not
be used for anything but ASCII. Maybe there is some other
documentation that says something about these issues, but
it should be in the man page as well.
----------------------------------------------------------------------
Comment By: Don Porter (dgp)
Date: 2001-03-28 17:31
Message:
Logged In: YES
user_id=80530
TclNeedSpace() is not UTF-8 aware. That's why routines
that call it, like Tcl_AppendElement(), are deprecated.
(See the documentation.)
Rewrite your command procedure like so:
Tcl_Obj *resultPtr;
...
Tcl_ResetResult(interp);
resultPtr = Tcl_GetObjResult(interp);
Tcl_ListObjAppendElement(interp, resultPtr,
Tcl_NewStringObj(s1, -1));
...
Tcl_ListObjAppendElement(interp, resultPtr,
Tcl_NewStringObj(s4, -1));
return TCL_OK;
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=411825&group_id=10894