At 20:10 04/09/24 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >>I was under the impression that we agreed that using Encode and
> >>proper Perl Unicode features were not planned for 0.7.0 which will
> >>be the next version of the Markup Validator.
> >
> >Who agreed? You suggested to use proper Perl Unicode, didn't you?
>
>I also suggested that we release often; the 0.6.7 release is now three
>months old and it does not seem to me that the next version will be
>released in October.
I have never seen anybody on this list suggest that
releasing often is a bad idea. But whenever a release came
close, everybody seemed sceptical to me. In my experience,
the only way to release often is to 'just do it'.
>Currently our main focus is on stabilizing the code
>in HEAD which is the result of merging the improvements in the former
>HEAD and 0.6.7, fixing all the bugs so that it has at least the level of
>quality that 0.6.7 had and then see what comes next, I would expect a
>Beta release to get broader review.
Why not do the beta release before we have 'at least the quality
of 0.6.7'?
>I see switching to Unicode internals
>now as making that more difficult.
In my checkout, I was able to get rid of the counting
problems when indicating where on a line an error
occurred. That's definitely a bug fix, and for some
people (all those working outside ASCII), it may be a real
feature. The actual disadvantage would be losing support
for GB18030. The other things that you have mentioned will
eventually have to be checked very carefully, but should
be okay for most cases (and going through the code and
replacing \s and friends in regular expressions with
precise [] character classes shouldn't be such a big issue).
Also, my code got a lot simpler because Encode is much
better at handling decoding errors in various ways.
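To make the three points above concrete, here is a minimal sketch.
It is not the validator's code; the sample strings and variable
names are made up for illustration. It shows how a column counted
in bytes differs from one counted in characters, how Encode can be
told to die on malformed input, and how \s behaves differently from
an explicit character class on a decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for "cafe <b>" with an e-acute, which takes two bytes.
my $octets = "caf\xC3\xA9 <b>";

# FB_CROAK makes decode() die on malformed input;
# LEAVE_SRC keeps $octets from being modified in place.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $text  = decode('UTF-8', $octets, $check);

my $byte_col = index($octets, '<');   # offset counted in bytes
my $char_col = index($text,   '<');   # offset counted in characters

# A lone 0xE9 is not valid UTF-8, so this decode() dies and $ok stays false.
my $bad = "caf\xE9";
my $ok  = eval { decode('UTF-8', $bad, $check); 1 };

# On a decoded string, \s follows Unicode rules and matches U+2003
# (EM SPACE); an explicit class matches only what it lists.
my $em_space      = "\x{2003}";
my $matches_s     = $em_space =~ /\s/        ? 1 : 0;
my $matches_class = $em_space =~ /[ \t\r\n]/ ? 1 : 0;

printf "byte column %d, character column %d\n", $byte_col, $char_col;
```

Other CHECK values (for example Encode::FB_DEFAULT, which substitutes
a replacement character instead of dying) can be swapped in where a
document should still be processed despite bad octets.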
> >A lot of things would be better with a test suite. But I'm
> >not ready to wait for one.
>
>You don't have to wait! You can contribute to it and make it happen
>sooner! Valuable contributions would be ideas, test documents,
>documentation, source code for a test module and/or script, reports
>on bugs in the current code, etc.
>
> >> % perl -MEncode -e "print decode 'utf-16be', qq(\x00\xf6)"
> >> Unknown encoding 'utf-16be' at -e line 1
> >>
> >>using the Encode.pm that ships with Perl 5.8.2 even though the
> >>encoding would be supported if written as "UTF-16BE".
> >
> >Good to know. Does this apply to all encodings, or only to
> >a few?
>
>Only to a few as far as I can tell. A list of encoding names (including
>different spellings) we currently support and which we would support
>just by using Encode and/or Encode::Alias and/or I18N::Charset would be
>very useful. Maybe that's something you can look into?
I have. I remember that UTF-16 somehow showed up in this list,
as did GB18030. I don't think there were any others. One big
advantage of Encode would be that we no longer depend on the
machine's iconv, which is what Text::Iconv uses and which
varies from system to system. Solaris, for example, has a
rather bad one out of the box.
Also, please note that the encodings we currently support are
not simply those of Iconv; Iconv would support many others.
But we check whether there is actually an IANA registration,
and we use only the MIME preferred name.
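A rough sketch of how that check could look on top of Encode,
assuming a hand-maintained table of accepted names. The table
contents and the function name resolve_charset are hypothetical,
not what the validator actually does. Looking labels up in
lower case and handing Encode the preferred spelling also avoids
depending on how a given Encode version treats case.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical allow-list: lower-cased label => MIME preferred name.
my %preferred = (
    'utf-8'      => 'UTF-8',
    'utf-16be'   => 'UTF-16BE',
    'iso-8859-1' => 'ISO-8859-1',
    'gb18030'    => 'GB18030',
);

# Returns (preferred name, Encode object), or an empty list if the
# label is not on our list or this Encode installation lacks it.
sub resolve_charset {
    my ($label) = @_;
    my $mime = $preferred{ lc $label } or return;
    my $enc  = find_encoding($mime)    or return;
    return ($mime, $enc);
}

my ($name, $enc) = resolve_charset('utf-16be');
my $decoded = $enc->decode("\x00\xf6");   # one character, U+00F6
```

An encoding that is on the list but missing from the installed
Encode (as reported for GB18030) simply resolves to nothing, so
it can be rejected with a clear message instead of failing later.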
> >>and check which behavior we desire, and have tests so
> >>that later changes do not introduce bugs. Iconv and Encode also do
> >>not support the same set of character encodings, GB18030 for example
> >>is supported by the current Markup Validator but not by the Encode
> >>version that ships with Perl 5.8.2, we would first need to figure
> >>out for which encodings we would need to drop support or find other
> >>replacements.
> >
> >Or we would just (temporarily) drop those that are not supported.
>
>That's an option, too. Maybe we should discuss this in one of our
>upcoming meetings?
Please discuss. I think it should be feasible.

Regards, Martin.