So you're a Perl master. You've got XS sorted. You know how the
internals work. Hey, there's nothing we can teach you on perl.com
that you don't already know. You think? Where Wizards Fear To Tread
brings you the information you won't find anywhere else concerning
the very top level of Perl hackery.

This month, we look at the Perl op tree. Every Perl program is compiled
into an internal representation before it is executed: functions,
subroutine calls, variable accesses, control structures, and everything
else that makes up a Perl program are converted into a series of
fundamental operations (ops), and these ops are strung together into a
tree data structure.
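You can see this tree for yourself with the core B::Concise backend (no
CPAN modules needed). For example, the following uses B::Concise's
documented compile() interface:

```perl
# From the shell:  perl -MO=Concise -e 'print "hello" if $x'
# Or programmatically:
use B::Concise qw(compile);

sub greet { print "hello\n" }

# Dump greet's ops in execution order to STDOUT:
compile('-exec', \&greet)->();
```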

For more on the different types of ops available, how they fit together,
and how to manipulate them with the B compiler module, look at
the Perl 5 internals tutorial.
Right now, though, we're going to take things a step further.

The B module allows us to get at a wealth of information about an
op, but it can be incredibly frustrating to work out which op you want
to deal with, and to perform even simple manipulations on a range of ops.
B also offers limited functionality for navigating around the op tree,
meaning that you need to hold onto a load of additional state about
which op is where. This gets complicated quickly. Finally,
it's not easy to get at the op trees for particular subroutines or,
indeed, for all subroutines, both named and anonymous.

B::Utils was created at the request of Michael Schwern to address
these issues. It offers much higher-level functionality for
navigating through the tree, such as the ability to move ``upward'' or
``backward,'' to return the original name of an op that has since been
optimized away, to get a list of an op's children, and so on. It can
return arrays of anonymous subroutines, and hashes of subroutine op
roots and starts. It also contains functions for walking through the op
tree from various starting points in various orders, optionally
filtering out ops that don't match certain conditions while performing
actions on the rest. Finally, B::Utils provides carp and croak
routines which report errors from the point of view of the original
source code.
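For instance (a sketch assuming B::Utils is installed from CPAN; the
function names follow its documentation):

```perl
use B::Utils qw(all_roots anon_subs walkallops_filtered carp);

my %roots = all_roots();   # sub name => root B::OP of its tree
my @anon  = anon_subs();   # root/start info for anonymous subs

# Visit every entersub (subroutine call) op in the program:
walkallops_filtered(
    sub { $_[0]->can('name') && $_[0]->name eq 'entersub' },
    sub { carp("subroutine call compiled here") },
);
```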

But one of the most useful functions provided by B::Utils is the
opgrep routine. This allows you to filter a series of ops based on a
pattern that represents their attributes and their position in a tree.
The major advantage over doing it yourself is that opgrep takes care
of making sure that the attributes are present before testing them - the
seasoned B user is likely to be accustomed to the carnage that
results from accidentally trying to call name on a B::NULL object.
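A sketch of opgrep in use (hedged; the pattern keys mirror the op
accessor names, per B::Utils's documentation):

```perl
use B::Utils qw(opgrep);

# @ops holds candidate B::OP objects, e.g. collected with one of
# B::Utils's walkers.  Match logical ops whose "other" leg begins
# with a pushmark; a missing attribute simply fails the match
# instead of crashing.
my @conditional_calls = opgrep(
    { name => [ 'and', 'or' ], other => { name => 'pushmark' } },
    @ops,
);
```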

So, what can we do with all this? The answer is, of course, ``anything we
want.'' If you can mess about with the op tree, then you have complete control
over Perl's operation. Let's take an example.

Damian Conway recently released the Acme::Don't module, which doesn't
do anything:

don't { print "Something\n" }

doesn't print anything. Very clever. But not clever enough. You see, I
like double negatives:

my $x = 1;
don't { print "Something\n" } unless $x;

doesn't print anything either, and if you like double negatives, then
you might agree that it should print something. But how on earth are we
going to get Perl to do something when a test proves false? By messing
about with the op tree, of course.

The way to solve any problem like this is to think about the op tree
that we've currently got, work out what we'd rather do instead, and work
out the differences between the op trees. Then, we write something that
looks for a given pattern in a program's op tree and modifies it to be
what we want.

There are several ways of achieving what we want, but the
simplest is this: add a second parameter to don't which, if set,
actually does run the code. This allows us to replace any occurrence
of

don't { print "Something\n" } if (condition);

with

don't(sub { print "Something\n" }, 1) unless (condition);

Let's now look at this in terms of op trees. The relevant part of
the op tree for don't { ... } if $x can be produced by running
perl -MO=Terse and then using sed to trim out the unsightly hex
addresses. In that output, the if is represented as an and op
internally, which makes sense if you think about it. The two ``legs''
of the and, called ``first'' and ``other,'' are a call to fetch the
value of $x,
and a subroutine call. Look at the subroutine call closely: the
ops ``inside'' this set up a mark to say where the parameters start,
push a reference to anonymous code (that's our { ... }) onto the
stack, and then push the glob for *don::t on there.
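You don't need Acme::Don't installed to see this shape; any conditional
call that passes an anonymous sub compiles the same way:

```perl
# Save as demo.pl and run:  perl -MO=Terse demo.pl
# In the output, look for the LOGOP "and" whose "other" leg holds a
# pushmark, a refgen (our anonymous sub), and the gv for the call.
sub dont { }                                # stand-in named sub
my $x = 1;
dont(sub { print "Something\n" }) if $x;
```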

So, we need to do two things: We need to insert another parameter
between refgen and the null attached to *don::t, and we need to
invert the sense of the test.

This is actually pretty simple: We're looking for either an and or
an or op, and the ``other'' leg of the op is going to be a call to
*don::t. However, we have to be a bit clever here, since Perl
internally performs a few optimizations on the op tree that even the
B::* reporting modules don't tell you about. When Perl threads the
next pointers around an op tree, it does something special for a
short-circuiting binary op like and or or - it sets the other
pointer to be not the first sibling in the tree, but the first op in
execution order. In this case, that's pushmark, as running B::Terse
with the exec option (perl -MO=Terse,exec) shows.

Unfortunately, this doesn't tell us the whole story, since we actually
need to check that the subroutine call is to don't, rather than
to any other subroutine that might happen to be called conditionally.
Hence, our filter also needs to examine the target of the call.
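A sketch of such a filter (hedged; the exact op navigation varies
between Perl versions, and rewrite_call() below is a hypothetical
action sub):

```perl
use B::Utils qw(walkallops_filtered opgrep);

# Accept and/or ops whose "other" leg is the pushmark of a call
# to *don::t.
sub is_dont_call {
    my ($op) = @_;
    return 0 unless opgrep( { name => [ 'and', 'or' ] }, $op );
    # Siblings under the call: pushmark, refgen, then the (ex-rv2cv)
    # null op holding the GV for the called sub.
    my $gv = eval { $op->other->sibling->sibling->first->gv } or return 0;
    return $gv->STASH->NAME eq 'don' && $gv->NAME eq 't';
}

walkallops_filtered( \&is_dont_call, sub { rewrite_call($_[0]) } );
```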

B::Generate was written to allow users to create their own ops and
insert them into the op tree. The original intent was to be able to
create bytecode for other languages to be run on the Perl virtual machine,
but it's found plenty of use manipulating existing Perl op trees.

It provides ``constructor'' methods in all of the B::*OP classes, and makes
many of the accessor methods read-write instead of read-only. Let's see
how we can apply it to this problem. Remember that we want to negate the
sense of the test, and then to add another argument to the call to
don't.

For the first of these tasks, B::Generate provides the handy
mutate and convert methods on each B::OP-derived object to
change one op's type into another. The decision as to which of them to
use is slightly complex: mutate can only be used between ops of the same
class - for instance, you cannot use it to mutate a binary op into a
unary op. However, convert produces a completely new op, which needs
to be threaded back into the op tree. So convert is much more powerful,
but mutate is much more convenient. In this case, since we're just
flipping between and and or, we can get away with using mutate.
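In code, the flip is a one-liner (a sketch; mutate takes the new op's
name, per B::Generate's documentation):

```perl
use B::Generate;        # loads B and makes op accessors read-write

# $op is the B::LOGOP for the 'and'; both 'and' and 'or' are LOGOPs,
# so mutate can rewrite the type in place:
$op->mutate('or');
```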

Now to insert the additional parameter. For this, remember that entersub
works by popping off the top entry in the stack and calling that as a
subroutine, and the remaining stack entries become parameters to the
subroutine. So we want to add a const op to put a constant on the stack.
We use the B::SVOP->new constructor to create a new one, and then
thread the next pointers so that Perl's main loop will call it between
$op->other->sibling (the refgen op) and the op that follows it
(the GV which represents *don::t).
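A hedged sketch of that splice (the argument order for B::SVOP->new -
op name, flags, value - follows B::Generate's documentation):

```perl
use B::Generate;

# Create a const op pushing the true value 1 onto the stack:
my $const  = B::SVOP->new('const', 0, 1);

# Thread it into execution order between refgen and the op that
# follows it (the GV for *don::t):
my $refgen = $op->other->sibling;
$const->next($refgen->next);
$refgen->next($const);
```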

Both of these transformations can be done by applying the method above:
compare the op trees, work out the difference, find the pattern you want
to look for, then write some code to manipulate the op tree into the
desired output. An easy task for the interested reader ...

And we really haven't scratched the surface of what can be done with
B::Generate and B::Utils; the B::Generate test suite shows
what sort of mayhem can be caused to existing Perl programs, and there
have been experiments using B::Generate to generate op trees for
other languages - a B::Generate port of Leon Brocard's
shiny Ruby interpreter
could produce Perl bytecode for simple Ruby programs; chromatic is
working on an idea to turn Perl programs into XML, manipulate them and
use B::Generate to turn them back into Perl op trees.

Later in our ``Where Wizards Fear To Tread'' series, we'll have articles
about Perl and Java interaction, iThreads, and more.

Naming languages after stones is getting a bit old-hat these days. We
all know and love Perl; you might have heard of the Ruby language, which I'll talk more about another time.
There's also Chip Salzenberg's Topaz project, an idea to rewrite Perl in
C++, which ended with the announcement of the Perl 6 effort. And now,
there's Sapphire. So what's this all about?

Sapphire is one of the many projects which was started
purely and simply to prove a point. In this case, the point was that
building a large program from scratch in this day and age is crazy. I was going to prove it by showing how rapidly
software can be developed when using established libraries,
and that was done by seeing how quickly I could rewrite Perl 5.

Also, as a secondary goal, I wanted to show the flexibility of some of
my design ideas for Perl 6. It's dangerous when people are sitting
around for a long time discussing software design without
implementations, without benchmarks and without a single line of code.
I prefer getting up and doing something rather than talking about it. So I
was going to show my ideas in software, not in words.

What do I mean by this? Perl 5 is extremely self-sufficient. Once you've
got the source kit, it'll compile anywhere, on almost any platform and
requiring very few ``support'' libraries. It'll make do with what you have. One of the unfortunate side-effects of this is that if Perl wants
to do something, it implements it itself. As a result, Perl contains a
load of interesting routines, but keeps them to itself. It also doesn't
use any of the perfectly fine implementations of those interesting
routines which are already out there.

Some people think this is a feature; I think it's a wart. If we can
give things back to the open-source community, and work with
them to help improve their tools, then everyone benefits.

One of the great design choices of Perl 5 which appears to have been
completely and utterly rejected in the discussions on Perl 6's proposed
language is that we do things in the most general way possible. This is
why Perl doesn't need huge numbers of new built-ins - it just needs a
way to make user-defined syntax with the same status as built-ins. It
doesn't need beautiful new OO programming models - it just needs a
way to help people write their own OO models. Sapphire tries to do
things in the most general way possible.

Perl 5 consists of a number of parts, including a stunning regular expression
engine and a decent way of dealing with multityped variables.
Unfortunately, in the current implementation, these parts are all highly
interdependent in twisty ways. Separating them out into modules
means that you can test them independently, distribute them as
independent libraries, and upgrade them independently.

Sapphire, then, is a partial implementation of the Perl 5 API. I wasn't
setting out to create a new interpreter, although that would have been
necessary for some of the API routines, such as those which execute Perl
code. What I wanted was to re-create the programming environment which
Perl 5 gives you internally - a sort of ``super C,'' a C customized for
creating things such as Perl interpreters.

Specifically, I wasn't trying to do anything new. I like Perl 5. It
has a lot going for it. Of course, it could be tidier, since five years
of cruft have accumulated all around it now. It could be less quirky,
and it could display the design goals I have just mentioned. That's what
I wanted to do.

I gave myself a week. I was going to hack on it for one week, and see
where that got me. I didn't spend a lot of time on it, but I still managed
to achieve a fair amount: I had scalars, arrays and hashes working, as
well as Unicode manipulation to the level of Perl 5 and slightly beyond.

How? Well, I started by looking at the glib library, at
http://developer.gnome.org/doc/API/glib/. This provided a fantastic
amount of what I needed. The GPtrArray corresponds nicely
with a Perl AV, and glib also implements hashes, which saved a lot
of time, although to have HEs (hash entries) you need to dig a
little into the glib source.

All the Unicode support was there. I initially used GNOME's libunicode,
but then found that the development version of glib added UTF8 support
and was much easier to deal with. There were a few functions I needed which Perl 5 already had, and I'll be pushing those back to the glib
maintainers for potential inclusion.

Perl uses a lot of internal variable types to ensure portability - an
I32 is an integer type guaranteed to be 32 bits, no matter where it
runs. Not surprisingly, I didn't have much work to do there, since
glib provides a family of types, such as gint32, to do the same
thing. Differing byte orders are also catered for. The ``super C''
environment the Perl 5 API provides is largely out there, in
existing code.

Oh, and let's be honest: There was one large piece of existing code
that was just too tempting not to use, and that was Perl
itself. When you're trying to replicate something and you've got a
working version in front of you, it's tricky not to borrow from it; it
seems a shame to throw away five years of work without looking
for what can be salvaged from it. A lot of the scalar handling code
came from Perl 5, although I did rearrange it and make it a lot more
sensible and maintainable. I wasn't just interested in developing with
external libraries. I also wanted to see if I could correct some
other misfeatures of Perl's internals.

The first problem to solve was the insidious use of macros on macros
on macros. The way I went about this was by first outlawing lvalue
macros. That is, for example,

SvPVX(sv) = "foo";

had to turn into

sv_setpv(sv, "foo");

Incidentally, this is how perlapi says it should be done. Perl 5
often optimizes for speed (sometimes too enthusiastically) at the
expense of maintainability - Sapphire questions that trade-off,
preferring to trust compiler optimization and Moore's Law.

Next, I wrote a reasonably sophisticated Perl program to convert
inline functions into macros, wrapping each definition in a set of
preprocessor guards.
Now you can choose whether your macros should be expanded by flipping
on -DEXPAND_MACROS and whether they should be inline by playing
with -DINLINE. But what's EXPAND_HERE for? Well, the guarded
definitions would go into an include file, maybe sv.h, and one C file -
let's call it sv_inline.c - would contain the following code:

#include <sapphire.h>
#define EXPAND_HERE
#include <sv.h>

Then if EXPAND_MACROS was defined, the function definitions would
all be provided in one place; if macros were not expanded,
sv_inline.c would define no functions. The function prototypes
would be extracted automatically with C::Scan.

With the state of compiler optimization these days, it's likely
that turning everything into macros makes no significant speed
difference. In which case, it's best to turn on EXPAND_MACROS to
assist with source-level debuggers which cannot read macros. However,
you can't tell until you benchmark, and the ``optional expansion''
method gives you a nice easy way to do that.

I also took a swipe at memory allocation; it seems the first job in
every large project is to write your own memory
allocator. I had heard from perl5-porters and other places that the
biggest speed overhead in XS routines is SV creation, so I wrote
an allocator which would maintain pools of ready-to-ship variables,
refreshing the pools when there was nothing else to do, like a
McDonald's burger line.

If I had given myself two weeks, where would I be? Sticking with glib,
I could very easily have safe signal handling, portable loadable
module support, a main event dispatch loop and a safe threading
model. It's all there, ready to go. It's free software, and that's
just one library.

To be honest, I wouldn't advocate the use of glib for everything I
could do with it. For example, I replaced Perl's main run loop
(Perl_runops_standard, in run.c) with a GMainLoop and
benchmarked the two. The glib version, although signal safe, was at
least five times slower. (However, you may want to contemplate what it
means for graphical application programming if you have, effectively,
a GNOME event loop right there in your interpreter.)

Heavier Unicode support would probably need libunicode. What about
regular expressions? Well, the glib developers are working on a fully
Unicode-aware Perl-compatible regular expression library (which,
frankly, is more than we have). If they don't finish that, Philip
Hazel's Perl Compatible Regular Expression library does
exactly what it says on the tin.

There are some areas of Perl's programming environment where I'm not
aware of a pre-existing solution. For instance, scope.c in the Perl
source distribution gives C the concept of ``dynamic scope,'' allowing you
to save variables and restore them at the end of a block, just
like the local operator in Perl.
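In Perl itself, that's exactly what local does:

```perl
our $setting = "outer";
sub show { print "$setting\n" }

sub inner {
    local $setting = "inner";   # saved on entry, restored on exit
    show();                     # prints "inner": callees see the
}                               # dynamically scoped value

inner();
show();                         # prints "outer" again
```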

And some problems just can't be solved in C. There's no good way, for
instance, to get a partitioned namespace. I didn't bother trying. Once
you've told the developer what the API is, it's his responsibility
to ensure it works.

On the other hand, C is not meant to be a language that gives you
this type of support. Some would argue that C++ solves these problems,
but in my experience, C++ never solves anything.

Standing for ``Sapphire variables,'' libsvar contains all the
functions for manipulating SVs, AVs and HVs. This is an interesting
library in its own right which can be used for programming outside of
the Sapphire environment - having SVs in an ordinary C program without
the overhead of a Perl interpreter really expands your programming
possibilities, and, as far as I'm aware, there isn't a good variable-handling library around.

The regular expression engine would be separated into its own library,
again so that external applications can use it without an entire Perl
interpreter. I didn't implement this myself, leaving it to PCRE or
glib to provide this.

The present-day scope.c and scope.h solve a problem in C by
giving it dynamic scoping; this is something that contributes to the
friendliness of the Perl programming environment, and something we can
separate and share.

So, what am I going to do with Sapphire now? To be honest, nothing at
all. I hope it has served its purpose by presenting the argument for
reusable code and stable design principles.

I certainly don't, at present, want to be placed in a position where
I'm choosing between spending time fiddling with Sapphire and spending
time contributing to coding Perl 6. Please understand: Sapphire is
emphatically not intended to be a fork of Perl, merely an
interesting interlude, and this is shown by the fact that I
didn't try to make any exciting changes.

If anyone has some interesting ideas on how to take this ball and run
with it, feel free. It's free software, and this is exactly what you
should be doing. Contact me if you'd like a copy of the source.

I do have some thoughts on what my next experiment is going to be, however ... .

What have I learned from all this? I've learned a lot about the
structure of Perl 5. I've realized that roughly half of it is support
infrastructure for the other half, the business half. Is this good or
bad? Well, it certainly means that we're not beholden to anyone else -
an external library may suddenly change its implementation, semantics
or interface, and Sapphire would have to struggle to catch up. Perhaps
it's all about control. By implementing everything ourselves, the
porters retain control over Perl.

I've also learned that Perl 5, internally, has a lot to share, yet,
even though we claim to believe in code reuse where the CPAN's
concerned, we do very little of it on a lower level, neither really
giving nor really taking.

I've learned that rapid development can come out of a number of things.
First, having external code already written to do the work for you
helps a lot, even though you don't have much control over it.

Second, having an existing implementation of what you're trying to program
also helps, although you have to walk a fine line. Taking Perl 5 code
wholesale meant I either had to do a lot of surgery or support things
I didn't really want to support, but ignoring the whole of the
existing code base would feel like throwing the baby out with the
bathwater. (Hence I would caution the Perl 6 internals people to
thresh the Perl 5 code carefully; there is some wheat in that chaff,
or else you wouldn't be using it ... .)

Finally, rapid development can come from having a well-organized and
disciplined team. My team swiftly agreed on all matters of design and
implementation and got down to coding without interminable and
fruitless discussions, taking unanimous decisions on how to get around
problems - because I was my team.

Would I say the Sapphire experiment was a success? Well, since it
taught me all the above, it certainly couldn't have been a failure. Did
it prove the point that developing with reusable code is worth the
sacrifice in terms of control? That remains to be seen.