Peter Jacobi wrote:
> They are easily built from the Unicode mapping files like the other
> ISO 8859 codecs and it would just be nice, if they were included in
> the standard distribution.

Can you produce a patch? Please upload it to sf.net/projects/python.

ISO-8859-11 is actually very difficult to implement, as it is unclear
whether the characters \x80..\x9F are assigned in this character set
or not. In fact, it is unclear whether the character set contains
even C0.

Martin asked for a patch, which would be nice if you could provide. On
"how": just take any lib/encodings/iso8859_?.py and edit the dict
argument to the decoding_map.update call.
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek

"Martin v. Löwis" <> wrote in message news:...
> ISO-8859-11 is actually very difficult to implement, as it is unclear
> whether the characters \x80..\x9F are assigned in this character set
> or not. In fact, it is unclear whether the character set contains
> even C0.

That seems like a very fine distinction to me; the Unicode mapping tables
are the same for those points as in ISO-8859-1, so what's the difference?

Richard Brodie wrote:
>>ISO-8859-11 is actually very difficult to implement, as it is unclear
>>whether the characters \x80..\x9F are assigned in this character set
>>or not. In fact, it is unclear whether the character set contains
>>even C0.
>
>
> That seems like a very fine distinction to me; the Unicode mapping tables
> are the same for those points as in ISO-8859-1, so what's the difference?

For ISO-8859-1, I believe the standard actually says that those code
points are C1. For ISO-8859-11, you can find various statements in the
net, some claiming that it includes C1, and some claiming that it
doesn't. Somebody would actually have to take a look at ISO-8859-11 to
find out what is the case.

The issue is complicated by two facts:
- many sources indicate that ISO-8859-11 is derived by taking TIS-620,
and adding NBSP into 0xa0. Now, it seems quite clear that TIS-620 does
*not* include C1.
- some sources indicate certain restrictions wrt. control functions,
e.g. in

which says "control functions are not used to create composite graphic
symbols from two or more graphic characters (see 6). "
I don't know what this means, especially as section 6 does not talk
about control functions. Section 7 says that any control functions
are out of scope of ISO 8859, which I believe is factually incorrect.
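To make the practical difference concrete, here is a minimal sketch of how the two readings behave in a charmap decoder. The tables `without_c1` and `with_c1` are hypothetical and deliberately incomplete (ASCII plus one Thai letter); they illustrate the two interpretations discussed above and are not taken from either standard.

```python
import codecs

# Hypothetical, partial tables: byte value -> Unicode code point.
ASCII = {b: b for b in range(0x80)}        # 0x00..0x7F, including C0
C1 = {b: b for b in range(0x80, 0xA0)}     # 0x80..0x9F -> U+0080..U+009F
THAI_KO_KAI = {0xA1: 0x0E01}               # one graphic character for the demo

# Reading 1: the range 0x80..0x9F (and 0xA0) is unassigned.
without_c1 = {**ASCII, **THAI_KO_KAI}
# Reading 2: C1 controls are assigned, and 0xA0 is NBSP.
with_c1 = {**ASCII, **C1, 0xA0: 0x00A0, **THAI_KO_KAI}

def decode(data, table):
    # charmap_decode returns (text, bytes consumed); bytes missing from
    # the table are undefined and raise UnicodeDecodeError under "strict".
    return codecs.charmap_decode(data, "strict", table)[0]

def accepts(data, table):
    """Return True if every byte of `data` is assigned in `table`."""
    try:
        decode(data, table)
        return True
    except UnicodeDecodeError:
        return False

print(accepts(b"\x85", with_c1))      # C1 byte under reading 2
print(accepts(b"\x85", without_c1))   # same byte under reading 1
```

Both tables agree on every graphic character, so the choice only shows up when the input contains bytes in 0x80..0xA0: one codec passes them through, the other rejects them.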

Christos "TZOTZIOY" Georgiou <> wrote in message
> Martin asked for a patch, which would be nice if you could provide. On
> "how": just take any lib/encodings/iso8859_?.py and edit the dict
> argument to the decoding_map.update call.

Thanks for the hint, but I've already succeeded in generating the
necessary files. It's even easier than your solution, as the utility
gencodec.py in Tools/Scripts generates these automatically from (1:1)
Unicode mapping files (ftp://ftp.unicode.org/Public/MAPPINGS/).
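For readers who have not seen these mapping files, the idea can be sketched roughly as follows. This is not gencodec.py itself; `SAMPLE_MAPPING`, `parse_mapping` and `decode` are names made up for this illustration, and the three sample lines merely follow the "byte, code point, # comment" layout used by the Unicode mapping files.

```python
import codecs

# A few sample lines in the mapping-file layout: byte, code point, comment.
SAMPLE_MAPPING = """\
# Name: illustrative excerpt only
0x41    0x0041  # LATIN CAPITAL LETTER A
0xA0    0x00A0  # NO-BREAK SPACE
0xA1    0x0E01  # THAI CHARACTER KO KAI
"""

def parse_mapping(text):
    """Return {byte value: code point} for every data line in `text`."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue   # byte listed without a code point: leave it unmapped
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table

decoding_map = parse_mapping(SAMPLE_MAPPING)

def decode(data, errors="strict"):
    # charmap_decode returns (text, bytes consumed); keep only the text.
    return codecs.charmap_decode(data, errors, decoding_map)[0]

print(repr(decode(b"A\xa0\xa1")))
```

A full codec module would wrap such a table in the usual Codec, StreamReader and StreamWriter classes; the sketch only shows the table-building step.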

I'll add the generated files at the end of this post.

The remaining question, and it seems the more difficult one, is a
question of process. Whether and how to add these to the normal
Python distribution.

"Martin v. Löwis" <> wrote:
> The process is actually very easy. Anybody willing to contribute them
> would have to upload them to SF (sf.net/projects/python).

Perhaps I have just misunderstood your email. I read it this way (in my own words):

Taking into account unanswered questions about ISO 8859-11 and TIS620,
whoever wants to contribute has to do further research,
starting with, but not limited to, buying the ISO standard.

The prospective contributor in addition has to provide support for this
patch and answer all questions about the details involved.

Sorry, this is out of scope for me at the moment. I have a patch, using
information from a source which is reliable enough for my personal
requirements, and now the patch is available on USENET for everyone
who wants to investigate further.

Peter Jacobi wrote:
>>The process is actually very easy. Anybody willing to contribute them
>>would have to upload them to SF (sf.net/projects/python).
>
>
> Perhaps I have just misunderstood your email. I read it this way (in my own words):

[snipped]
No - this is indeed my view on the issue. However, this is a technical
view; the *process* is completely independent, and very
straightforward. Submit the patch to SF, and somebody (probably Marc-Andre
Lemburg) will review it. The reviewer might ask questions or request
further changes (such as adding documentation); then the patch gets
accepted or rejected.

I know that *I* would ask questions as to why the submitter thinks the
patch is correct, and I would request that the submitter commits to
maintaining the patch. If you are unwilling to make such a commitment,
I can understand that - it just means that Python 2.4 might not have
these codecs (and we haven't discussed the 8859-16 at all).

Regarding the correctness doubts, I can provide these three points so far:

a) ISO 8859-n vs ISO-8859-n
If the information at http://en.wikipedia.org/wiki/ISO_8859-1#ISO_8859-1_vs_ISO-8859-1
is correct, Python 8859-n
codecs do implement the ISO standard charsets ISO 8859-n
in the specialized IANA forms ISO-8859-n (and in agreement
with the Unicode mapping files). So any difficult C0/C1
wording in the original ISO standard can be disregarded.

This is a confusing document, as it both refers to ISO/IEC
8859-16:2001 (no control characters), and the Unicode character
map (with control characters). We might interpret this as a
mistake, and assume that it was intended to include control
characters (as all the other ISO-8859-n).

For ISO-8859-11, the situation is even more confusing, as
that is not a registered IANA character set, according to

"Martin v. Löwis" <> wrote in message
> Therefore, it would be a protocol violation (strictly speaking)
> if one would use iso-8859-11 in, say, a MIME charset= header.

Strictly speaking, there are some more dark corners to check.
All ISO charsets should be, strictly speaking, qualified by year. And
in fact there were some prominent changes, e.g. in 8859-7 (Greek).
What to do about them?

Looking around:
- the RFC references a fixed year old version
- Unicode mapping files and libiconv track the newest version
- IBM ICU4C provides all versions
- Python (not by planning, I assume) has a "middle" version with
some features of the old mapping table (no currency signs) and some
features of the new (0xA1=0x2018, 0xA2=0x2019)

Peter Jacobi wrote:
> Looking around:
> - the RFC references a fixed year old version
> - Unicode mapping files and libiconv track the newest version
> - IBM ICU4C provides all versions
> - Python (not by planning, I assume) has a "middle" version with
> some features of the old mapping table (no currency signs) and some
> features of the new (0xA1=0x2018, 0xA2=0x2019)

Indeed. Adding new codecs is not a matter of just compiling a few files
that somebody else has produced, but requires a lot of expertise.
Therefore, I would have preferred that Python not include any
codecs, but rely on the codecs that come with the platform (e.g. iconv
on Unix, IE DLLs on Windows).

Now, things came out differently, and we are now in charge of
maintaining what we got. This requires great care, and expert volunteers
are always welcome. Unfortunately, in the Unicode/character sets/l10n
world, there is no one true way, so experts need to stand up and voice
their opinion, hoping that contributors become at least aware of the
issues.

In the specific case of ISO-8859-7, I was until just now unaware of the
issue - I would not have guessed that ISO dared to ever change a part
of 8859. If this is ever going to be changed, I would suggest the
following approach:
- provide two encodings: ISO-8859-7:1987, and ISO-8859-7:2003. Without
checking, I would hope that the version in RFC 1345 is identical with
8859-7:1987
- Make ISO-8859-7 an alias for ISO-8859-7:1987
Of course, somebody should really talk to IANA and come up with a
preferred MIME name. Apparently, ISO-8859-7-EURO and ISO-8859-7-2003
have been proposed.
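The two-encodings-plus-alias approach might look roughly like this. It is only a sketch: the tables are partial (ASCII, NBSP, and the three positions 0xA4, 0xA5 and 0xAA that the 2003 edition assigns), the names in `TABLES` and `ALIASES` are the hypothetical ones suggested above rather than registered names, and a plain dict lookup stands in for Python's real codec registry.

```python
import codecs

def make_table(extra):
    """Build a partial decoding table: ASCII, NBSP, plus `extra` entries."""
    table = {b: b for b in range(0x80)}
    table[0xA0] = 0x00A0
    table.update(extra)
    return table

# Hypothetical version-qualified names, as proposed in this thread.
TABLES = {
    "iso-8859-7:1987": make_table({}),
    "iso-8859-7:2003": make_table({
        0xA4: 0x20AC,   # EURO SIGN
        0xA5: 0x20AF,   # DRACHMA SIGN
        0xAA: 0x037A,   # GREEK YPOGEGRAMMENI
    }),
}

# The unqualified name stays bound to the older edition.
ALIASES = {"iso-8859-7": "iso-8859-7:1987"}

def decode(data, name):
    name = name.lower()
    name = ALIASES.get(name, name)          # resolve the alias, if any
    # Unassigned bytes raise UnicodeDecodeError under "strict".
    return codecs.charmap_decode(data, "strict", TABLES[name])[0]

print(decode(b"\xa4", "ISO-8859-7:2003"))
```

With this arrangement, data labelled with the unqualified name never silently acquires the 2003 additions; callers who want them have to ask for the newer edition by name.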

I have just asked Markus Kuhn about this, who has registered
ISO-8859-16 with IANA. He believes that his registration does
not include control characters (neither C0 nor C1), just as
the ISO standard does not contain any. Wrt. RFC 1345 he points
out that this is not an Internet Standard, but a private
collection of Keld Simonsen, i.e. it is not binding.
