note
creamygoodness
<p>I used <c>utf8::upgrade()</c> as a pure Perl example, so that I wouldn't have to resort to Inline::C or XS and more people would be able to run the sample code.
<p>Like <c>Perl_do_openn</c> or <c>Perl_sv_setsv_flags</c>, though, <c>Perl_sv_utf8_upgrade_flags_grow</c> is one of those root functions that's invoked via many wrappers. There are many ways to get at it.
<p>As noted, I discovered the issue via the <c>SvPVutf8</c> XS macro. The devel branch of one of my CPAN distros, [cpan://KinoSearch], is a mostly-C library which uses UTF-8 strings exclusively internally. Therefore, I use <c>SvPVutf8</c> rather than <c>SvPV</c> for accessing string pointers from arguments.
<p>If anybody ever uses $1 as an argument to any XS library function which uses <c>SvPVutf8</c>, it will get upgraded, triggering the bug:
<p><c>
$category =~ /(\w+)/;
my $term_query = KinoSearch::Search::TermQuery->new(
field => 'category',
term => $1,
);
</c>
<p><a href="http://www.google.com/codesearch?hl=en&lr=&q=svpvutf8+-file%3Apport\.h%24+-file%3Asv\.h%24+-file%3Asv\.c%24&sbtn=Search">Other libraries</a> which use <c>SvPVutf8</c> include Mail::SpamAssassin, Glib, Tk, etc. However, I suspect that the problem isn't limited to us. It's more that using <c>m//g</c> is a little esoteric, and many functions reset $1 by turning off the <c>SVf_POK</c> flag -- e.g. <c>length($1)</c> will do it. So the problem tends not to persist for very long -- but while it does, you can get some maddeningly subtle bugs!
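<p>For anyone who wants to sidestep this in calling code, here is a minimal pure-Perl sketch of a defensive pattern: copy the capture into an ordinary lexical and pass that instead of $1. The <c>upgrade_like_xs</c> sub is a hypothetical stand-in for an XS function that reads its argument with <c>SvPVutf8</c> (it just calls <c>utf8::upgrade()</c> on <c>$_[0]</c> in place), and the sample string is made up. I'm assuming the copy is enough to keep the upgrade away from $1 itself; I haven't checked that against every perl version.
<p><c>
use strict;
use warnings;

# Hypothetical stand-in for an XS function that uses SvPVutf8 on its
# argument. utf8::upgrade() works on $_[0] in place, so it is the caller's
# own scalar that gets upgraded.
sub upgrade_like_xs {
    utf8::upgrade( $_[0] );
    return utf8::is_utf8( $_[0] ) ? 1 : 0;
}

my $category = 'fruit salad';    # made-up sample data

my $term;
if ( $category =~ /(\w+)/ ) {
    $term = $1;    # plain copy: later upgrades touch $term, not $1
}

# Pass the copy, never $1 itself.
my $upgraded = upgrade_like_xs($term);
</c>
<p>In the TermQuery call above, that would mean passing <c>term => $term</c> rather than <c>term => $1</c>.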
792157
792161