Re: Problem timing out XML::LibXML parse_html_string call

Sam Tregar writes:
> I'm using XML::LibXML to parse some HTML. Mostly it's working great
> - fast and very useful XPath support. My problem is that it's
> choking on some very bad HTML in a very bad way - it's sitting on
> the CPU until killed manually. I expected some HTML wouldn't parse,
> so this isn't such a tragedy. What is a big problem is that my
> attempt to work around this with alarm() aren't working!

The problem with handling signals in Perl is that they happen
asynchronously. If a signal is delivered while the Perl interpreter
is executing an op, the code in the Perl-level signal handler might
attempt to modify interpreter state in a way that will cause later
crashes.

Perl 5.8 introduced "safe signals" to alleviate this problem. The
approach is to have the OS-level signal handler merely set a flag
indicating that the signal has been received. Then the interpreter
checks the flags at safe points (between ops, effectively), and
invokes your Perl-level handler at that point, when it's known to be
safe.

The only problem with this scheme is that if an op goes into an
infinite loop, the Perl-level signal handler never gets invoked.
That's very unlikely for regular ops in stable releases of Perl, but
a call to an XS function -- a single op -- might ultimately fall into
an infinite loop. And that's what's happening here; libxml2 (or
perhaps the XS component of XML::LibXML) has an infinite-loop bug, so
your signal handler never gets invoked.

You can switch back to the pre-5.8 signal-handling behaviour by
setting the environment variable PERL_SIGNALS to 'unsafe' (honoured
by Perl 5.8.1 and later). The variable must already be set when Perl
starts up; setting it from inside your code is too late. For example,
using env(1):

    $ env PERL_SIGNALS=unsafe perl your_program.pl

If it's not possible for you to put an appropriate wrapper round your
program, something along these lines might help, if placed suitably
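early in your code:

    BEGIN {
        # PERL_SIGNALS is only read at interpreter startup, so set it
        # and re-exec this script under the same perl binary.
        if (!$ENV{PERL_SIGNALS} || $ENV{PERL_SIGNALS} ne 'unsafe') {
            $ENV{PERL_SIGNALS} = 'unsafe';
            exec $^X, $0, @ARGV
                or die "Can't re-exec $^X: $!\n";
        }
    }

See also `perldoc perlipc` and search for "safe signals".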
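
For reference, the alarm()-based guard under discussion usually looks
something like the sketch below. This is an illustration rather than
Sam's actual code: with_timeout is a made-up helper name, and the
select() call merely stands in for the slow operation. With safe
signals the handler runs only between ops, so a guard like this
catches a blocked system call or a slow Perl-level loop, but not a
loop inside a single XS call unless PERL_SIGNALS=unsafe is in effect.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original thread): run $code with a
# limit of $seconds, returning its result or dying with "timeout\n".
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        # The trailing "\n" stops Perl appending "at FILE line N".
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;    # finished in time: cancel the pending alarm
        1;
    };
    my $err = $@;
    alarm 0;        # also cancel if $code died for some other reason
    return $result if $ok;
    die $err;
}

# Stand-in for a slow operation: four-argument select() used as a
# sleep. It is a blocking system call, so the signal interrupts it and
# the Perl-level handler runs as soon as the op returns.
my $outcome = eval {
    with_timeout(1, sub { select(undef, undef, undef, 5); "finished" });
};
print defined $outcome ? "got: $outcome\n" : "gave up: $@";
```

Swapping the select() for the real parse call is where this stops
helping under safe signals: if the XS call never returns, the die in
the handler is never reached.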
--
Aaron Crane ** http://aaroncrane.co.uk/