Improving Perl 5: Grant Report for Month 9

Possibly the most unexpected discovery of May was determining precisely why Merijn's HP-UX smoker wasn't able to build with certain configuration options. The output summary grid looked like this, which is most strange:

As the key says, 'O' is OK. It's what we want. 'm' is very bad - it means that it couldn't even build miniperl, let alone build extensions or run any tests. But what is strange is that ./Configure ... will fail, but the same options plus -Duse64bitall will work just fine. And this is replicated with ithreads - default fails badly, but use 64 bit IVs and pointers and it works. Usually it's the other way round - the default configuration works, because it is "simplest", and attempting something more complex such as 64 bit support, ithreads, shared perl library, hits a problem.

As it turns out, what's key is that the ./Configure ... command line contains -DDEBUGGING. The -DDEBUGGING parameter to Configure causes it to add -DDEBUGGING to the C compiler flags, and to add -g to the optimiser settings (without removing anything else there). So on HP-UX, with HP's compiler that changes the optimiser setting from '+O2 +Onolimit' to '+O2 +Onolimit -g'. Which, it seems, the compiler doesn't accept for building 32 bit object code (the default) but does in 64 bit. Crazy thing.

Except that, astoundingly, it's not even that simple. The original error message was actually "Can't handle preprocessed file". Turns out that that detail is important. The build is using ccache to speed things up, so ccache is invoking the pre-processor only, not the main compiler, to create a hash key to look up in its cache of objects. However, on a cache miss, ccache doesn't run the pre-processor again - to save time by avoiding repeating work, it compiles the already pre-processed source. And that is the key distinction between invoking the pre-processor and then compiling, versus compiling without the pre-processor:

No, it's not just a crazy compiler, it's insane! It handles -g +O2 just fine normally, but for 32 bit mode it refuses to accept pre-processed input. Whereas for 64 bit mode it does.

If HP think that this isn't a bug, I'd love to know what their excuse is.

A close contender for "unexpected cause" came about as a result of James E Keenan, Brian Fraser and Darin McBride's recent work going through RT looking for old stalled bugs related to old versions of Perl on obsolete versions of operating systems, to see whether they are still reproducible on current versions. If the problem isn't reproducible, it's not always obvious whether the bug was actually fixed, or merely that the symptom was hidden. This matters if the symptom was revealing a buffer overflow or similar security issue, as we'd like to find these before the blackhats do. Hence I've been investigating some of these to try to get a better idea whether we're about to throw away our only easy clue about a still-present bug.

One of these was RT #6002, reported back in 2001 in the old system as ID 20010309.008. In this case, the problem was that glob of a long filename would fail with a SEGV. Current versions of perl on current AIX don't SEGV, but did we fix it, did IBM, or is it still lurking? In this case, it turned out that I could replicate the SEGV by building 5.6.0 on current AIX. At which point, I have a test case, so start up git bisect, and the answer should pop out within an hour. Only it doesn't, because it turns out that git bisect gets stuck in a tarpit of "skip"s because some intermediate blead version doesn't build. So this means a digression into bisecting the cause of the build failure, and then patching Porting/bisect-runner.pl to be able to build the relevant intermediate blead versions, so that it can then find the true cause. This might seem like a lot of work that is used only once, but it tends not to be. It becomes progressively easier to bisect more and more problems without hitting any problems, and until you have it you don't realise how powerful a tool automated bisection is. It's a massive time saver.

But, as to the original bug and the cause of its demise. It turned out to be interesting. And completely not what I expected:

The SEGV (due to an illegal instruction) goes away once perl switched to using dlopen() for dynamic linking on AIX. So my hunch that this bug was worth digging into was right, but not for the reason I'd guessed.

A couple of bugs this month spawned interesting subthreads and digressions. RT #108286 had one, relating to the observation that code written like this, with each in the condition of a while loop:

But it also does the same for C<glob>, C<readdir> and C<each> - i.e. the same cases that solicit the warning in 5.004. Is extending the defined insertion to those cases desirable? (glob and readdir seem to make sense, I am less sure about each).

The intent of the changes back then appears to be to retain the 5.003 and earlier behaviour on what gets assigned for each construction, but change the loop behaviour to terminate on undefined rather than simply falsehood for the common simple cases:

while (OP ...)

and

while ($var = OP ...)
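The difference between terminating on falsehood and terminating on undefined is easiest to see with a key that is defined but false. The following is a minimal sketch, not taken from the bug report: the hash %count and the variable names are made up for illustration. It spells the defined test out explicitly, so it does not depend on which checks a particular perl version inserts.

```perl
use strict;
use warnings;

my %count = ('0' => 'zero');    # one key, which is defined but false

# Truth test: each in scalar context returns the key "0", which is
# false, so a loop that tested plain truth would stop before using it.
keys %count;                    # void-context keys resets the iterator
my $key = each %count;
my $seen_by_truth = $key ? 1 : 0;

# Defined test, written out by hand: the loop only stops when each
# returns undef at the end of the hash.
keys %count;
my $seen_by_defined = 0;
while (defined(my $k = each %count)) {
    $seen_by_defined++;
}

print "truth test saw $seen_by_truth key(s), ",
      "defined test saw $seen_by_defined\n";
```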

And there I thought it made sense - fixed in 1998 for readline, glob and readdir, but introducing the inconsistency because each doesn't default to assigning to $_. Except, it turned out that there was a twist in the tail. It turns out that while (readdir D) {...} didn't use to implicitly assign to $_. Both the implicit assignment to $_ and defined test were added in 2009 by commit 114c60ecb1f7, without any fanfare, just like any other bugfix. And the world hasn't ended.

Running a search of CPAN reveals that almost no code uses while (each %hash) [and why should it? The construction does a lot of work only to throw it away], and nothing should break if it's changed. Hence it makes sense to treat this as a bug, and fix it. Which has now happened, but I can't take credit for it - post 5.16.0, Father Chrysostomos has now fixed it in blead.

To conclude this story, the mail archives from 15 years ago are fascinating. Lots of messages. Lots of design discussions, not always helpful. And some of the same unanswered questions as today.

The other digression came from trying to replicate another old bug (ID 20010918.001, now #7698). I'd dug an old machine with FreeBSD 4.6 out from the cupboard under the stairs in the hope of reproducing the period problem with a period OS. Sadly I couldn't do that, but out of curiosity I tried to build blead on it. This is the same 16M machine whose swapping hell prompted my investigation of enc2xs the better part of a decade ago, resulting in various optimisations on its build time memory use, that in turn led to ways to roughly halve the size of the built shared objects, and a lot of the material then used in a tutorial I presented at YAPC::Europe and The German Perl Workshop, "When Perl is not quite fast enough". This machine has pedigree.

Once again, it descended into swap hell, this time on mktables. (And with swap on all 4 hard disks, it's very effective at letting you know that it's swapping.) Sadly after 10 hours, and seemingly nearly finished, it ran out of virtual memory. So I wondered if, like last time, I could get the memory usage down. After a couple of false starts I found a tweak to Perl_sv_grow that gave a 2.5% memory reduction on FreeBSD (but none on Linux), but that wasn't enough. However, the cleanly abstracted internal structure of mktables makes it easy to add code to count the memory usage of the various data structures it generates. One of its low-level types is "Range", which subdivides into "special" and "non-special". There are 368676 of the latter, and the name for each may need to be normalised into a "standard form". The code was taking the approach of calculating the standard form at object creation time. With the current usage patterns of the code, this turns out to be less than awesome - the standard form is only requested for 22047 of them. By changing the code to calculate only when needed (and cache the result) I reduced RAM and CPU usage by about 10% on Linux, and 6% on FreeBSD. Whilst the latter is smaller, it was enough to get the build through mktables, and on to completion. The refactoring is now merged to blead, post 5.16.0. Hopefully everyone's build will be a little bit smaller and a little bit faster as a result.
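The "calculate only when needed, and cache the result" change can be sketched roughly as follows. This is not the mktables code: the LazyRange class, its _normalise helper and the normalisation rule (lower-case, strip spaces, underscores and hyphens) are all hypothetical, chosen only to show the pattern of deferring the work from the constructor to the first call of the accessor.

```perl
use strict;
use warnings;

package LazyRange;    # hypothetical stand-in, not the real mktables class

my $normalise_calls = 0;    # counts how often the expensive step runs

sub new {
    my ($class, $name) = @_;
    # Store only the raw name; the standard form is not computed here.
    return bless { name => $name }, $class;
}

# Illustrative normalisation only: lower-case, drop spaces, '_' and '-'.
sub _normalise {
    my ($name) = @_;
    $normalise_calls++;
    (my $standard = lc $name) =~ s/[\s_-]+//g;
    return $standard;
}

# Compute on first request, then serve the cached value.
sub standard_form {
    my ($self) = @_;
    $self->{standard_form} = _normalise($self->{name})
        unless exists $self->{standard_form};
    return $self->{standard_form};
}

sub normalise_calls { return $normalise_calls }

package main;

my @ranges = map { LazyRange->new("Some_Name $_") } 1 .. 1000;
$ranges[0]->standard_form for 1 .. 3;    # asked three times, computed once
print "objects: ", scalar(@ranges),
      ", normalisations: ", LazyRange::normalise_calls(), "\n";
```

With this shape, objects whose standard form is never requested pay neither the CPU cost of normalising nor the memory cost of storing the result.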

To complete the story, I should note that make harness failed with about 100 tests still to run, snatching defeat from the jaws of victory. Turns out that that also chews a lot of memory to store test results. make test, however, did pass (except for one bug in t/op/sprintf.t, patch in RT #112820). Curiously gcc, even when optimising, isn't the biggest memory hog of the build. It's beaten by mktables, t/harness and a couple of the Unicode regression tests. But even then, our build is very frugal. It should complete just fine with 128M of VM on a 32 bit FreeBSD system, and I'd guess under 256M on Linux (different malloc, different trade offs). I think that this means that blead would probably build and test OK within the hardware of a typical smartphone (without swapping), if they actually had native toolchains. Which they don't. Shame :-(

Part of May was spent getting a VMS build environment set up on the HP Open Source cluster, and using it to test RC1 and then RC2 on VMS.

Long term I'd like to have access to a VMS environment, not to actually do any porting work to VMS, but to permit refactoring of the build system without breaking VMS. George Greer's smoker builds the various smoke-me branches on Win32, so that makes it easy to test changes that would affect the Win32 build system, but no such smoker exists for VMS. Hence historically I've managed to do this by sending patches to Craig Berry and asking him nicely if he'd test them on his system, but this is obviously a slow, inefficient process that consumes his limited time, preventing him using it to instead actually improve the VMS port.

As the opportunity to get access turned up just as 5.16.0 was nearing shipping, I decided to work on getting things set up "right now" to try to get (more) tests of the release candidates on VMS. We discovered various shortcomings in the instructions in README.vms, and as a side effect of debugging a failed build, a small optimisation to avoid needless work when building DynaLoader. So it's likely that my ignorance will continue to be a virtue by finding assumptions and pitfalls in the VMS process that the real experts don't even realise that they are avoiding subconsciously.

We had various scares just before 5.16.0 shipped relating to build or test issues on Ubuntu, specifically on x86_64. This shouldn't happen - x86_64 GNU/Linux is probably the most tested platform, and Ubuntu is a popular distribution, so it feels like there simply shouldn't be any more bugs lurking. However, it seems that they keep breeding.

In this case, it's yet another side effect of Ubuntu going multi-architecture, with the result that the various libraries perl needs to link against are now in system dependent locations, instead of /usr/lib. This isn't a problem (well, wasn't once we coded to cope with it) - we ask the system gcc where its libraries are coming from, and use that library path. The raw output from the command looks like this:

Except that all of a sudden, we started getting reports of build failures on Ubuntu. It turned out that no libraries were found, with the first problem being the lack of the standard maths library, hence miniperl wouldn't link. Why so? After a bit of digging, it turns out that the reason was that the system now had a gcc which localised its output, and the reporter was running under a German locale.

Because in the full output, the string we were searching for, "libraries", isn't there. It's now translated to "Bibliotheken".

Great. Unfortunately, there isn't an alternative machine readable output format offered by gcc, so this single output format has to make do for humans and machines, which means that the thing that we're parsing changes.

This is painful, and often subtle pain because we don't get any indication of the problem at the place where it happens. In this case, a failure in the hints file doesn't become obvious until the end of the link in the build.

The solution is simple - force the locale to "C" when running gcc in a pipeline. But it's whack-a-mole fixing these. It would be nice if more tools made the distinction that git does between porcelain (for humans), and plumbing (for input to other programs).
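As a rough illustration of the fix, here is a sketch in Perl of parsing the compiler's search path with the locale forced to "C". It is not the code from the hints file. The command `gcc -print-search-dirs` and the exact "libraries: =" line format are assumptions for illustration, and the two sample strings are made up rather than real compiler output.

```perl
use strict;
use warnings;

# Extract the directories from a "libraries: =dir1:dir2" line.
# Returns an empty list if no such line is present, which is what
# happens when the label has been translated.
sub parse_library_dirs {
    my ($text) = @_;
    my ($dirs) = $text =~ /^libraries: =(.*)$/m;
    return defined $dirs ? split(/:/, $dirs) : ();
}

# Run the compiler with the locale forced to "C" so that the label we
# search for is not translated. The command line is an assumption.
sub gcc_library_dirs {
    my ($cc) = @_;
    local $ENV{LC_ALL} = 'C';
    my $out = `$cc -print-search-dirs 2>/dev/null`;
    return parse_library_dirs(defined $out ? $out : '');
}

# Illustrative inputs, not real compiler output.
my $english = "install: /usr/lib/gcc/x/\nlibraries: =/lib/x:/usr/lib/x\n";
my $german  = "Bibliotheken: =/lib/x:/usr/lib/x\n";

my @found   = parse_library_dirs($english);
my @missing = parse_library_dirs($german);
print scalar(@found), " dirs from C locale output, ",
      scalar(@missing), " from translated output\n";
```

Using local on $ENV{LC_ALL} keeps the override scoped to the one command, so the rest of the program still sees the user's locale.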

The second Ubuntu failure report just before 5.16.0 was for t/op/filetest.t failing. It turned out that the test couldn't cope with a combination of circumstances - running the test as root, but the build tree not being owned by root, and the file permissions being such that other users couldn't read files in the test tree. This is all because testing that -w isn't true on a read only file goes wrong if you're root, so there's special-case code to detect if it's running as root, which temporarily switches to an arbitrary non-zero UID for that test. Unfortunately it also had a %Config::Config based skip within that section, and the read of obscure configuration information triggers a disk read from lib/, which fails if the build tree's permissions just happened to be restrictive. The problem had actually been around for quite a while, so Ricardo documented it as a known issue and shipped it unchanged.

So post 5.16.0, I went to fix t/op/filetest.t. And this turned into quite a yak shaving exercise, as layer upon layer of historical complexity was revealed. Originally, t/op/filetest.t was added to test that various file test operators worked as expected. (Commit 42e55ab11744b52a in Oct 1998.) It used the file t/TEST and the directory t/op for targets. To test that read-only files were detected correctly, it would chmod 0555 TEST to set it read only.

The test would fail if run as root, because root can write to anything. So logic was added to set the effective user ID to 1 by assigning to $> in an eval (unconditionally), and restoring $> afterwards. (Commit 846f25a3508eb6a4 in Nov 1998.) Curiously, the restoration was done after the test for C<-r op>, rather than before it.

Most strangely, a skip was then added for the C<-w op> test based on $Config{d_seteuid}. The test runs after $> has been restored, so should have nothing to do with setuid. It was added as part of the VMS-related changes of commit 3eeba6fb8b434fcb in May 1999. As d_seteuid is not defined in VMS, this makes the test skip on VMS.

Commit 15fe5983b126b2ad in July 1999 added a skip for the read-only file test if d_seteuid is undefined. Which is actually the only test where having a working seteuid() might matter (but only if running as root, so that $> can be used to drop root privileges).

Commit fd1e013efb606b51 in August 1999 moved the restoration of $> earlier, ahead of the test for C<-r op>, as that test could fail if run as root with the source tree unpacked with a restrictive umask. (Bug ID 19990727.039)

"Obviously no bugs" vs "no obvious bugs". Code that complex can hide anything. As it turned out, the code to check $Config{d_seteuid} was incomplete, as it should also have been checking for $Config{d_setreuid} and $Config{d_setresuid}, as $> can use any of these. So I refactored the test to stop trying to consult %Config::Config to see whether root assigning to $> is going to work - just try it in an eval, and skip if it didn't. Only restore $> if we know we changed it, and as we only change it from root, we already know which value to restore it to.

Much simpler, and avoids having to duplicate the entire logic of which probed Configure variables affect the operation of $>.
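The shape of that approach is roughly as follows. This is a sketch and not the actual code from t/op/filetest.t: the function name, the use of File::Temp and the 'ok' / 'not ok' / 'skip' return values are made up for illustration, and UID 1 is simply the arbitrary non-zero value mentioned above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Returns 'ok', 'not ok' or 'skip' for the check that -w is false on a
# read-only file.
sub check_readonly_not_writable {
    my ($fh, $file) = tempfile(UNLINK => 1);
    close $fh;
    chmod 0444, $file or return 'skip';

    my $dropped_root = 0;
    if ($> == 0) {
        # root can write to anything, so try to become someone else.
        # The eval guards against platforms where assigning to $> is
        # not implemented; no %Config lookup is needed.
        eval { $> = 1 };
        $dropped_root = 1 if $> != 0;
    }

    # If we are still root the check is meaningless, so skip it.
    my $result = $> == 0        ? 'skip'
               : (-w $file)     ? 'not ok'
               :                  'ok';

    # Only restore if we changed it; we only ever change it from root,
    # so the value to restore is known to be 0.
    $> = 0 if $dropped_root;
    return $result;
}

print "read-only file is not writable: ",
      check_readonly_not_writable(), "\n";
```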

Finally, I spotted that I could get rid of a skip by using the temporary file the test (now) creates rather than t/TEST for a couple of the tests. The skip is necessary when building "outside" the source tree using a symlink forest back to it (./Configure -Dmksymlinks), because in that case t/TEST is actually a symlink.

So now the test is clearer, simpler, less buggy, and skips less often.

A more detailed breakdown summarised from the weekly reports. In these: