(This is an issue because the UTF8 representation of $b is actually
the two bytes with values 196 and 172.) But Gisle said that of course
it should not match, because the target string does not in fact
contain character #300.

This led to a brief discussion of what the regex engine should do with UTF8
strings. The problem here goes back to the roots of the UTF8
implementation.

Larry's original idea was that if use utf8 was in scope, operations
would assume that all data was UTF8 strings, and if not, they would
assume byte strings. This puts a lot of burden on the programmer and
especially on the module writer. For example, suppose you had wanted
to write a function that would return true if its argument were
longer than 6 characters:

    sub is_long {
        my ($s) = @_;
        length($s) > 6;
    }

No, that would not work, because if the caller had passed in a UTF8 string,
then your answer would be whether the string was longer than six
bytes, not six characters. (Remember that characters in a UTF8 string may
be longer than one byte each.) You would have had to write something like this instead:

This approach was abandoned several versions ago, and you can see why.
The current approach is that every scalar carries around a flag that
says whether it is a UTF8 string or a plain byte string, and
operations like length() are overloaded to work on both kinds of
strings; length() returns the number of characters in the string
whether or not the string is UTF8.
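
The difference between counting characters and counting bytes can be seen in a minimal sketch. It assumes a modern Perl, not the development version under discussion, and the variable names are only illustrative. It uses character 300 from the example above and makes a byte-string copy with utf8::encode:

```perl
use strict;
use warnings;

# Character number 300 (U+012C), held as a one-character string.
my $chars = chr(300);

# Make a separate copy and turn it into the UTF8-encoded byte string.
my $bytes = $chars;
utf8::encode($bytes);

my $char_len  = length($chars);                  # counts characters
my $byte_len  = length($bytes);                  # counts encoded bytes
my $byte_list = join ',', unpack('C*', $bytes);  # the individual byte values

print "$char_len character, $byte_len bytes ($byte_list)\n";
```

Here length() reports one character for the first string and two bytes for the encoded copy, whose byte values are the 196 and 172 mentioned earlier.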

Now here's a dirty secret: Overloading the regex engine this way is
difficult, and hasn't been done yet. Regex matching ignores the UTF8
flag in its target. Instead, it uses the old method that was
abandoned: if it was compiled with
use utf8 in scope, it assumes that its argument is in UTF8 format, and if not,
it assumes its argument is a byte string.

The right thing to do here is to fix the regex engine so that its
behavior depends on the UTF8 flag in the target. The hard way
(but the right way) is to really fix the regex engine. The easier way
is to have the regex engine compile everything as if use utf8 were not
in scope, and then later on, if it is called on to match a UTF8
string, recompile the regex as if use utf8 had been enabled, and stash
that new compiled regex alongside the original one for use with UTF8
strings.
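
The "compile once, recompile lazily, and stash both" idea can be sketched at the Perl level. This is only an illustration of the caching scheme, not the regex engine's internals: make_matcher and its compiles counter are hypothetical names, and in a modern Perl the two qr// objects behave identically, so only the bookkeeping is being shown.

```perl
use strict;
use warnings;

# Hypothetical wrapper: one pattern source, up to two compiled forms.
sub make_matcher {
    my ($source) = @_;

    # The byte-string form is compiled up front.
    my %compiled = (bytes => qr/$source/);
    my $compile_count = 1;

    return {
        match => sub {
            my ($target) = @_;
            # Choose the slot according to the UTF8 flag on the target.
            my $kind = utf8::is_utf8($target) ? 'utf8' : 'bytes';
            unless ($compiled{$kind}) {
                # First UTF8 target seen: compile again and stash the result.
                $compiled{$kind} = qr/$source/;
                $compile_count++;
            }
            return $target =~ $compiled{$kind} ? 1 : 0;
        },
        compiles => sub { return $compile_count },
    };
}

my $m = make_matcher('a.c');
my $byte_hit   = $m->{match}->("abc");          # uses the original form
my $after_one  = $m->{compiles}->();
my $utf8_hit   = $m->{match}->("a\x{12C}c");    # triggers the second compile
my $utf8_again = $m->{match}->("a\x{12C}c");    # reuses the stashed form
my $after_all  = $m->{compiles}->();
```

The point of the design is that byte-string users never pay for the second compilation, and UTF8 users pay for it only once.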

This led Simon to ask if Perl should have support for normalization.
What is normalization? Unicode has a character for the letter 'e' (U+0065),
and a character for a combining acute accent (U+0301), which looks
something like ´ and is called a 'combining character' because it
combines with the preceding character to yield an accented character;
when a string containing a combining acute accent is displayed, the
accent should be superimposed on the previous character.
But Unicode also has a character for the letter e
with an acute accent (U+00E9), é. This should be displayed the same way as the two-character sequence U+0065 U+0301.

Perl does not presently do this, and if you have two strings, produced
by pack "U*", 0x65, 0x301 and by pack "U*", 0xE9, it reports them as
different, which they certainly are. But clearly, for some
applications, you would like them to be considered equivalent, and
Perl presently has no built-in function to recognize this.
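
For comparison, here is a minimal sketch of what such an equivalence test looks like with the Unicode::Normalize module. That module shipped with later versions of Perl and was not available when this discussion took place, so this is an illustration rather than anything proposed on the list. The combining acute accent used here is U+0301, placed after the base letter.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed  = pack "U*", 0x65, 0x301;  # 'e' followed by a combining acute accent
my $precomposed = pack "U*", 0xE9;         # the single precomposed character

# A plain string comparison sees two different strings.
my $raw_equal = ($decomposed eq $precomposed) ? 1 : 0;

# After normalizing both to the composed form (NFC), they compare equal.
my $nfc_equal = (NFC($decomposed) eq NFC($precomposed)) ? 1 : 0;

# Normalizing to the decomposed form (NFD) splits U+00E9 into two characters.
my $nfd_length = length(NFD($precomposed));
```

An application that wants the two spellings treated alike would normalize both sides to the same form before comparing or hashing them.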

Fergal Daly pointed out that Doug's patch will break abstract base
classes, because it extends the semantics of my Dog $spot to mean
something new. Formerly, it meant that $spot was guaranteed to be
implemented with a pseudohash, and that the fields in $spot were
guaranteed to be a subset of those specified in %Dog::FIELDS. Doug's
patch now adds the meaning that method calls on $spot will be resolved
at compile time by looking for them in class Dog. This is a change,
because it used to be permissible to assign to $spot an object from
some subclass of Dog, say Schnauzer, as long as its fields were laid
out in a way that was compatible with %Dog::FIELDS. But now you cannot
do that, because when you call $spot->meth you get Dog::meth instead
of Schnauzer::meth.

Oops.
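
The difference is easy to see in a small sketch. It assumes an ordinary modern Perl without the patch, so the compile-time binding is imitated by naming Dog::meth explicitly; the method bodies are made up for illustration.

```perl
use strict;
use warnings;

{
    package Dog;
    sub new  { my ($class) = @_; return bless {}, $class }
    sub meth { return 'Dog::meth' }
}
{
    package Schnauzer;
    our @ISA = ('Dog');
    sub meth { return 'Schnauzer::meth' }   # overrides Dog's version
}

# Declared as a Dog, but actually holding an object of a subclass.
my Dog $spot = Schnauzer->new;

# Ordinary run-time lookup starts from the object's real class.
my $dynamic = $spot->meth;

# Binding to the declared class, imitated by naming the method fully.
my $static = $spot->Dog::meth;

print "run-time lookup: $dynamic\n";
print "bound to declared class: $static\n";
```

The first call reaches the Schnauzer override and the second does not, which is exactly the behavior change Fergal was objecting to.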

Some discussion ensued. Sarathy suggested that the optimization only
be enabled if, at the end of compilation,
Dog has no subclasses. Fergal said it would be a shame to limit it to
such cases, and it would not be much harder to enable the
optimization for any method that was not overridden in any subclass.

Doug MacEachern contributed a patch that allows my __PACKAGE__ $foo,
where __PACKAGE__ represents the current package name. There was some
discussion about whether the benefit was worth the cost of the code
bloat. Doug said that it was useful for the same reasons that
__PACKAGE__ is useful anywhere else. (As a side
note, why is it that the word 'bloat' is never used except in
connection with three-line patches?)

Andreas Koenig said that it would be even better to allow
my CONSTANT $foo where
CONSTANT is any compile-time constant at all, such as one that was created by
use constant. Doug provided an amended patch
to do that also.

Jan Dubois pointed out that this will break existing code that has a
compile-time constant with the same name as an existing class.
Andreas did not care.

Andreas Koenig: Who uses constants that have the same name as existing and actually used classes isn't coding cleanly and should be shot anyway.

More persuasively, he pointed out that under such a circumstance,
my Foo $x = Foo->new would not work either, because the
Foo on the right would be interpreted as a constant instead of as a class name.
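
Andreas's point about the right-hand side can be checked with a short sketch. The class names Foo and Bar and the clashing constant are hypothetical, and the behavior shown is that of a modern Perl, where a bareword before -> is treated as a subroutine call if a subroutine of that name exists in the current package.

```perl
use strict;
use warnings;

{
    package Foo;
    sub new { return bless {}, shift }
}
{
    package Bar;
    sub new { return bless {}, shift }
}

# Hypothetical clash: a constant with the same name as the class Foo.
use constant Foo => 'Bar';

# The bareword is resolved through the constant, so this builds a Bar.
my $via_bareword = ref(Foo->new);

# A trailing :: forces the bareword to be read as a package name.
my $via_explicit = ref(Foo::->new);

print "Foo->new gives a $via_bareword, Foo::->new gives a $via_explicit\n";
```

So code with such a clash is already fragile, whether or not typed lexical declarations accept constants.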

Last week I sent aggrieved email to a number of people asking what
cfgperl was and why there appeared to be a secret source repository on
Jarkko's web site that was more up-to-date than the documented source
repository. I was concerned that there was an inner circle of
development going on with a hidden development branch that was not
accessible to the rest of the world.

Jarkko answered me in some detail in email, and then posted to p5p to
explain the real situation.
cfgperl is simply the name for Jarkko's
private copy of the source, to which he applies patches that he deems worthy.
It got ahead of the main repository because Sarathy was resting last month.

Sarathy said that signals really couldn't be emulated properly under
Windows, but that people keep complaining about it anyway. So he put
in a patch that tries to register the signal handler anyway, I guess
in hopes of stopping them from complaining.

Perl Lindquist reported an example of
s/// that runs much slower in 5.6.0 than in 5.004_03. The regex is bad,
so that you would expect a quadratic search, but Mike Guy reported
that in fact Perl was doing a cubic search.

The sequence
\_ in a regex now elicits a warning where it didn't before.
Dominic Dunlop tracked down the patch that introduced this and pointed
out that it needs to be documented (in
perldelta and possibly
perldiag) and probably also needs a
test case. But nobody stepped up. Here's an easy opportunity for
someone to contribute a doc patch.

Dominic Dunlop reported an interesting bug in the new
printf "%v" specifier. The bug is probably not too difficult to investigate and
fix, because it is probably localized to a small part of Perl that
does not deal too much with Perl's special data structures. So it is
a good thing for a beginner to work on. Drop me a note if you are
interested and if you need help figuring out where to start.

A sidetrack developed out of Nicholas' patch to fix this, discussing
the best way to make sure that tests get the test version of the
library, and not the previously installed version of the library.
Nicholas was using

    unshift @INC, '../lib';

This is a common idiom in the test files.
What's wrong with it? It leaves the standard directories in
@INC, which may not be appropriate, and it assumes that the library is in a
sibling directory, so you cannot run the test without being in the
t/ directory itself.

There was a little discussion of the right thing to do here. Mike Guy
suggested that one solution would be to have the test harness set up
the environment properly in the first place. The problem with that
is that then you can't run the tests without the harness. (For
example, you might want to run a single test file; at present you can
just say
perl t/op/dog.t or whatever.)

Sarathy pointed out that having each test file begin with something
like

    BEGIN { @INC = split(/\|/, $ENV{PERL_TEST_LIB_PATH} || '../lib') }

might solve the problem. Then the harness can set
PERL_TEST_LIB_PATH but you can still run a single test manually if you are in the right place.
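
A minimal sketch of that fallback logic follows. It assumes '|' is the intended separator, and test_lib_dirs is a hypothetical helper name. One detail matters: the first argument to split is a pattern, so the bar has to be escaped. An unescaped '|' is an alternation of two empty patterns and splits the string into single characters.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a '|'-separated list into directories,
# falling back to '../lib' when nothing was supplied.
sub test_lib_dirs {
    my ($path) = @_;
    return split /\|/, ($path || '../lib');
}

# What a harness-supplied value would yield.
my @from_harness = test_lib_dirs('/build/lib|/build/ext/lib');

# What a manual run from the t/ directory would yield.
my @manual = test_lib_dirs(undef);

# The unescaped form matches the empty string and splits every character.
my @naive = split('|', 'a|b');

# In a test file this would be applied to the environment variable.
my @inc_for_tests = test_lib_dirs($ENV{PERL_TEST_LIB_PATH});
```

The directory names passed in above are made up for the example; only PERL_TEST_LIB_PATH and '../lib' come from the suggestion itself.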