February 2003 Archives

This is a book I have been looking forward to for a long time. Back when
James Tisdall had just finished his Beginning Perl for
Bioinformatics, I asked him to write an article about how to get into
bioinf from a Perl programmer's perspective.
With bioinformatics being a recently booming sphere and many scientists
finding themselves in need of computer programmers to help them process
their data, I thought it would be good if there was more information for
programmers about the genomic side of things.

Rex Dwyer has produced a book, Genomic Perl, which bridges the gap. As well as teaching
basic Perl programming techniques to biologists, it introduces many
useful genetic concepts to those already familiar with programming. Of
course, as a programmer and not a biologist, I'm by no means qualified
to assess the quality of that side of the book, but I certainly learned
a lot from it.

The first chapter, for instance, taught the basics of DNA transcription
and translation, the basics of Perl string handling, and tied the two
concepts together with an example of transcription in Perl. This is
typical of the format of the book - each chapter introduces
a genetic principle and a related problem, a Perl
principle which can be used to solve the problem, and a program
illustrates both. It's a well thought-out
approach which deftly caters for both sides of the audience.

However, it should be stressed that the book is a substitute neither for
a good introductory book on Perl nor a good textbook on genetics; and
indeed, I think it will turn out to be better for programmers who need
an over-arching idea of some of the problems involved with
bioinformatics than for biologists who need to turn out working code.
For instance, when it states that a hash is the most convenient data
structure for looking up amino acids by their codons, it doesn't say
why, or even what a hash is. On the other hand, amino acids and codons
are both explained in detail.

The book covers a wide range of biological areas - from the structure of
DNA to building predictive models of species, exploring the available
databases of genetic sequences including readers of the GenBank database
and an implementation of the BLAST algorithm, phylology, protein
databases, DNA sequence assembly and prediction, restriction mapping,
and a lot more besides. In all, it's a good overview of the common
areas in which biologists need computer programs.

There's a significant but non-threatening amount of math in there,
particularly in dealing with probabilities of mutation and determining
whether or not events are significant, but I was particularly encouraged
to see discussion of algorithmic running time; as the author is
primarily a computer science professor and secondarily a
bioinformaticists, this should not be too surprising. However, a
significant number of bioinformaticists tend to produce code which
works... eventually. Stopping to say "well, this is order n-to-the-6 and
we can probably do better than that" is most welcome.

Onto the code itself. The first thing any reader will notice about the
book is that the code isn't monospaced. Instead, the code is "ground",
pretty-printed, as in days of old. This means you'll see code snippets
like:

Now, I have to admit I really like this, but others may find it
difficult to read, and those who know slightly less Perl may find it
confusing - the distinction between '' and "
(that's two single quotes and a double quote)
can be quite subtle, and if you're going to grind Perl code, regular
expressions really, really ought to be monospaced. "$elem =~
/^[(^\[(]*)(\(.*\))?$/;" is just plain awkward.

The code is more idiomatic than much bioinformatic code that I've seen,
but still feels a little unPerlish; good use is made of references,
callbacks and object oriented programming, but three-argument
for is used more widely than the fluent Perl programmer
would have it, and things like

main();
sub main {
...
}

worry me somewhat. But it works, it's efficient, and it's certainly
enough to get the point across.

The appendices were enlightening and well thought-out: the first turns
an earlier example, RNA structure folding, into a practical problem of
drawing diagrams of folded RNA for publication; the other two tackle
matters of how to make some of the algorithms in the text more
efficient.

All in all, I came away from this book not just with more knowledge
about genetics and biology - indeed, some of what I learned has been
directly applicable to some work I have - but also with an understanding
of some of the complexity of the problems geneticists face. It fully
satisfies its goals, expressed in the preface: teaching computer
scientists the biological underpinnings of bioinformatics, providing
real, working code for biologists without boring the programmers, and
providing an elementary handling of the statistical considerations of
the subject matter. While it will end up being more used by programmers
getting into the field, it's still useful for the biologists already
there, particularly when combined with something like James Tisdall's
book or Learning
Perl. But for the programmer like me, interested in what biologists
do and how we can help them do it, it's by far the clearest introduction
available, and I would heartily recommend it.

Another week, another Perl 6 Summary, in which you'll find gratuitous
mentions of Leon Brocard, awed descriptions of what Leopold
Tötsch got up to and maybe even a summary of what's been
happening in Perl 6 design and development.

Dan responded to prompting from Leo Tötsch about the use and
reuse of string headers. The problem is that most of the string
functions that produce modified strings do it in new string headers;
there's no way of reusing existing string headers. This can end up
generating loads of garbage. Dan's going through the various string
handling ops and PMC interfaces working out what needs to do what, and
documenting them, as well as adding in versions of the ops that take
their destination string headers as an argument. Dan hopes that 'we
can make the changes quickly and get this out of the way once and for
all', leading Robert Spier to mutter something about 'famous last
words'.

Tupshin Harper has been trying to use an XML parser from within Parrot
and started off by looking at the PXS example (in examples/pxs) but
had problems following the instructions given there as his compiler
spat out errors by the bucket load. Leo Tötsch thought that PXS
was probably deprecated and the native call interface (NCI) was the
thing to use. Being Leo, he provided a port of the PXS Qt example to
NCI. Although PXS appears to be out of sync with the parrot core,
nobody was entirely sure whether it should be removed.

Tupshin Harper had some problems with some of the language examples
not working well with the most recent versions of Parrot. Leo
Tötsch addressed most of the issues he raised, but there are
definitely issues with the interpreter and the languages getting out
of sync.

Leo Tötsch posted an RFD covering his understanding of the
various calling conventions that IMCC would have to deal with which
sparked some discussion. I'm now confused as to whether function calls
in Parrot will be caller saves, callee saves, or some unholy mixture
of the two.

Tupshin Harper decided to port primes.pasm to C and Perl 5 to
compare results. Parrot came out very quick indeed. (Close to C). For
bonus points he then took a python primes algorithm that had been
ported to C and ported that to Parrot as well. In full on, all stops
pulled out, knobs turned to 11 mode, Parrot came in at about 50%
slower than C and around 14 times faster than Python. There was some
muttering about the demo being rigged. However, Jim Meyer redid the
Perl and Python implementations to use a loop that duplicated the
algorithm used in primes.pasm and, whilst it improved their
performance somewhat, Parrot was still substantially faster.

This probably won't be the test that's run when Dan and Guido face the
possibility of custard pies at 10 paces or whatever the performance
challenge stake stands at now.

Leon Brocard patched examples/assembly/life.pasm to use a small
spaceship as its starting pattern. Apparently because it 'can provide
hours of enjoyment at conferences if projected onto a big screen while
answering parrot questions.' Nobody thought to ask him how a spaceship
was going to answer Parrot questions, but Leo Tötsch applied the
patch.

Apparently, Leo Tötsch finds it unbearable that 'optimized
compiled C is still faster than parrot -j' so he's been experimenting
with adding smarts to IMCC, making it add hardware register allocation
hints to its emitted bytecode. Sean O'Rourke liked the basic idea, but
reckoned that the information generated by IMCC should really be
platform-independent, suggesting that it'd be okay to pass a control
flow graph to the JIT, but that hardware register allocation for a
specific number of registers would iffy. He suggested that another
option would be for IMCC to 'just rank the Parrot registers in order
of decreasing spill cost', then the JIT could just move the most
important parrot registers into architectural registers.

Dan thought the idea was interesting too, but worried that the JIT
might spend more time optimizing code than it could possibly gain from
the optimization. The discussion got more than a little technical
after this. I'm afraid I'm a bear of little brain when it comes to
this sort of magic, so you'll have to read the whole thread if you're
interested in the details.

The response was pretty positive throughout the discussion, so Leo
went ahead and implemented it. The new, improved version shaved
slightly more than a tenth of a second from the primes.pasm runtime
(not a great percentage win, but the total runtime includes the
compilation time)

Steve Fink is bothered by the invoke op because it operates
implicitly on P0. He wants to replace it with a new version that takes
a PMC argument. Leo Tötsch is less bothered by the implicit
behaviour of the op, but would like to see an additional invoke_p
op, which would take a single PMC argument. So Steve implemented it,
but hit a couple of snags. I'm not quite sure whether these have been
overcome yet.

Nicholas Clark was unable to build Parrot with computed gotos turned
off. Simon Glover offered a simple patch that fixed Nick's
problem. Leo Tötsch talked about the different reasons for not
building a computed goto core, which depends on both the compiler's
capabilities and the user's choices. This led to a discussion of which
cores were now obsolete. Leo believes that the simple Computed Goto
and Prederef cores have been obsoleted by the CGP core (Computed Goto
Prederefed). Nick thought we should continue to ship code for all the
cores, and ensure that the config system is flexible enough to let
anyone build any arbitrary combination of cores, which convinced
Leo. Dan suggested that, once we'd done this we should revamp the
Tinderbox system to run tests on all the core types.

Tupshin Harper wondered if there were any plans for Parrot to support
a distinct .string asm section. Leo Tötsch pointed to
.constant (in PASM) and .const (in IMCC) as ways of keeping
strings visually together. Tupshin realised he'd missed a chunk of
documentation (perldoc assemble.pl for anyone else who hasn't read
it) which he thinks should probably be moved into
docs/parrot_assembly.pod or somewhere similar.

Tupshin Harper wondered if it was possible and/or meaningful to read
and write from a part of a register (eg. a single word) in
PASM. Answer: Not at the moment, what do you want? We can always add
an intreg.ops set.

David Storrs suggested some possible semantics for (4, 1, 2) + 7,
noting that he doesn't like the Perl5's current behaviour (evaluates
to 9). Michael Lazzaro thinks it should evaluate to 10 (length of the
list + 7) or possibly throw a syntax error. This then morphed into a
discussion of pass by value, pass by reference and pass by constant
reference, which Allison Randal told us would be addressed in the
upcoming Apocalypse 6.

Mmm... the comfy chair... it's the only place to write a summary
in. Especially when you're plied with lashings of Earl Grey tea and
entertained by the antics of a couple of adolescent cats. Things could
be lots worse.

I'd like to apologize to Danny O'Brien for last weeks mention of 'a
Brainf*ck compiler, in Brainf*ck, for the Brainf*ck interpreter
supplied with Parrot, a virtual machine named after a joke, written
for a language that doesn't yet exist.' Apparently this sprained
Danny's head. And I extend similar apologies to anyone else whose head
was sprained by that concept.

The American Odyssey web page is still more of a beautiful idea than an
actual link you can click. Hopefully it'll spring into existence
before we set off, but I wouldn't bet on it.

Aspell is, once more, the weapon of proofreading choice.

If you appreciated this summary, please consider one or more of the
following options:

Why waste time reinventing the wheel, when you could be reinventing the engine?-Damian Conway

As a Perl programmer, sooner or later you'll get an opportunity to build
a search engine. Like many programming tasks - parsing a date,
validating an e-mail address, writing to a temporary file - this turns
out to be easy to do, but hard to get right. Most people try end up with some kind of reverse index, a data structure that
associates words with lists of documents. Onto this, they graft a scheme
for ranking the results by relevance.

Nearly every search engine in use today - from Google on down - works on
the basis of a reverse keyword index. You can write such a keyword
engine in Perl, but as your project grows you will inevitably find
yourself gravitating to some kind of relational database system. Since
databases are customized for fast lookup and indexing, it's no surprise
that most keyword search engines make heavy use of them. But writing code for them isn't much fun.

More to the point, companies like Google and Atomz already offer
excellent, free search services for small Web sites. You can get an
instant search engine with a customizable interface, and spend no time
struggling with Boolean searches, text highlighting, or ranking
algorithms. Why bother duplicating all that effort?

As Perl programmers, we know that laziness is a virtue. But we also know
that there is more than one way to do things. Despite the ubiquity of
reverse-index search, there are many other ways to build a
search engine. Most of them originate in the field of information
retrieval, where researchers are having all kinds of fun. Unfortunately,
finding documentation about these alternatives isn't easy. Most of the
material available online is either too technical or too impractical to
be of use on real-world data sets. So programmers are left with the
false impression that vanilla search is all there
is.

In this article, I want to show you how to build and run a search
engine using a vector-space model, an alternative to reverse index
lookup that does not require a database, or indeed any file storage at
all. Vector-space search engines eliminate many of the disadvantages
of keyword search without introducing too many disadvantages of their
own. Best of all, you can get one up and running in just a few dozen
lines of Perl.

Vector-space search engines use the notion of a term space, where
each document is represented as a vector in a high-dimensional space.
There are as many dimensions as there are unique words in the entire
collection. Because a document's position in the term space is
determined by the words it contains, documents with many words in
common end up close together, while documents with few shared words end
up far apart.

To search our collection, we project a query into this term space and
calculate the distance from the query vector to all the document vectors
in turn. Those documents that are within a certain threshold distance
get added to our result set. If all this sounds like gobbledygook to
you, then don't worry - it will become clearer when we write the code.

The vector-space data model gives us a search engine with several
useful features:

Searches take place in RAM, there is no disk or database access

Queries can be arbitrarily long

Users don't have to bother with Boolean logic or regular expressions

It's trivially easy to do 'find similar' searches on returned documents

You can set up a 'saved results' basket, and do similarity searches on the
documents in it

The easiest way to understand the vector model is with a concrete example.

Let's say you have a collection of 10 documents, and together they contain
50 unique words. You can represent each document as a vector by counting
how many times a word appears in the document, and moving that distance along
the appropriate axis. So if Document A contains the sentence "cats like
chicken", then you find the axis for cat, move along one unit, and then do
the same for like and chicken. Since the other 47 words
don't appear in your document, you don't move along the corresponding
axes at all.

Plot this point and draw a line to it from the origin, and you have your
document vector. Like any vector, it has a magnitude (determined by
how many times each word occurs), and a direction (determined by which
words appeared, and their relative abundance).

There are two things to notice here:

The first is that we throw away all information word order, and there is
no guarantee that the vector will be unique. If we had started with the
sentence "chickens like cats" (ignoring plurals for the moment), then we
would have ended up with an identical document vector, even though the
documents are not the same.

This may seem to be a big limitation, but it turns out that word order
in natural language contains little information about content - you can
infer most of what a document is about by studying the word list. Bad
news for English majors, good news for us.

The second thing to notice is that with three non-zero values out of a
possible 50, our document vector is sparse. This will hold true for
any natural language collection, where a given document will contain
only a tiny proportion of all the possible words in the language. This
makes it possible to build in-RAM search engines even for large
collections, although the details are outside the scope of this article.
The point is, you can scale this model up quite far before having to
resort to disk access.

We'll make things easy by starting with a tiny collection of four documents,
each just a few words long:

"The cat in the hat"
"A cat is a fine pet."
"Dogs and cats make good pets."
"I haven't got a hat."

Our first step is to find all the unique words in the document set. The
easiest way to do this is to convert each document into a word list, and
then combine the lists together into one. Here's one (awful) way do it:

The subroutine splits a text string on whitespace, takes out all punctuation
except hyphens and apostrophes, and converts everything to lower case before
returning a hash of words.

The curious do statement is just a compact way of creating a lookup hash.
%doc_words will end up containing our word list as its keys, and its values
will be the number of times each word appeared.

If we run this 'parser' on all four documents in turn, then we get a master word
list:

a
and
cat
cats
dogs
fine
good
got
hat
haven't
i
in
is
make
pet
pets
the

Notice that many of the words in this list are junk words - pronouns, articles,
and other grammatical flotsam that's not useful in a search context. A common
procedure in search engine design is to strip out words like these using a
stop list. Here's the same subroutine with a rudimentary stop list added
in, filtering out the most common words:

A true stop list would be longer, and tailored to fit our document
collection. You can find a real stop list in the DATA section of
Listing 1, along with a complete implementation of the search engine
described here. Note that because of Perl's fast hash lookup algorithm,
we can have a copious stop list without paying a big price in program
speed. Because word frequencies in natural language obey a power-law
distribution, getting rid of the most common words removes a
disproportionate amount of bulk from our vectors.

Here is what our word list looks after we munge it with the stop list:

cat
cats
dogs
fine
good
got
hat
make
pet
pets

We've narrowed the list down considerably, which is good. But notice that our
list contains a couple of variants ("cat" and "cats", "pet" and "pets"), that
differ only in number. Also note that someone who searches on 'dog' in our
collection won't get any matches, even though 'dogs' in the plural form is a
valid hit. That's bad.

This is a common problem in search engine design, so of course there is a module
on the CPAN to solve it. The bit of code we need is called a stemmer, a set
of heuristics for removing suffixes from English words, leaving behind a common
root. The stemmer we can get from CPAN uses the Porter stemming algorithm,
which is an imperfect but excellent way of finding word stems in English.

use Lingua::Stem;

We'll wrap the stemmer in our own subroutine, to hide the clunky Lingua::Stem
syntax:

Notice that we apply our stop list before we stem the words. Otherwise,
a valid word like beings (which stems to be) would be caught by the
overzealous stop list. It's easy to make little slips like this in search algorithm design, so be extra careful.

With the stemmer added in, our word list now looks like this:

cat
dog
fine
good
got
hat
make
pet

Much better! We have halved the size of our original list, while
preserving all of the important content.

Now that we have a complete list of content words, we're ready for the second
step - mapping our documents into the term space. Because our collection has a
vocabulary of eight content words, each of our documents will map onto an
eight-dimensional vector.

Here is one example:

# A cat is a fine pet
$vec = [ 1, 0, 1, 0, 0, 0, 1 ];

The sentence "A cat is a fine pet" contains three content words. Looking
at our sorted list of words, we find cat, fine, and pet at
positions one, three, and eight respectively, so we create an anonymous
array and put ones at those positions, with zeroes everywhere else.

If we wanted to go in the opposite direction, then we could take the vector
and look up the non-zero values at the appropriate position in a sorted
word list, getting back the content words in the document (but no
information about word order).

The problem with using Perl arrays here is that they won't scale. Perl
arrays eat lots of memory, and there are no native functions for
comparing arrays to one another. We would have to loop through our
arrays, which is slow.

A better data way to do it is to use the PDL module, a set of compiled
C extensions to Perl made especially for use with matrix algebra. You
can find it on the CPAN. PDL stands for "Perl Data Language", and it
is a powerful language indeed, optimized for doing math operations
with enormous multidimensional matrices. All we'll be using is a tiny
slice of its functionality, the equivalent of driving our Ferrari to
the mailbox.

It turns out that a PDL vector (or "piddle") looks similar to our
anonymous array:

We give the piddle constructor the same anonymous array as an
argument, and it converts it to a smaller data structure, requiring
less storage.

Since we already know that most of our values in each document
vector will be zero (remember how sparse natural language is),
passing full-length arrays to the piddle constructor might get a
little cumbersome. So instead we'll use a shortcut to create a
zero-filled piddle, and then set the non-zero values explicitly. For
this we have the zeroes function, which takes the size of the
vector as its argument:

Here we've said "take this vector, and set the value at position 4 to
3".

This turns out to be all we need to create a document vector. Now we
just have to loop through each document's word list, and set the
appropriate values in the corresponding vector. Here's a subroutine that
does the whole thing:

Now that we can generate a vector for each document in our collection,
as well as turn an incoming query into a query vector (by feeding the
query into the make_vector subroutine), all we're missing is a way to
calculate the distance between vectors.

There are many ways to do this. One of the simplest (and most intuitive)
is the cosine measure. Our intuition is that document vectors with
many words in common will point in roughly the same direction, so the
angle between two document vectors is a good measure of their
similarity. Taking the cosine of that angle gives us a value from
zero to one, which is handy. Documents with no words in common will
have a cosine of zero; documents that are identical will have a cosine
of one. Partial matches will have an intermediate value - the closer
that value is to one, the more relevant the document.

The formula for calculating the cosine is this:

cos = ( V1 * V2 ) / ||V1|| x ||V2||

Where V2 and V2 are our vectors, the vertical bars indicate the 2-norm,
and the * indicates the inner product. You can take the math on
faith, or look it up in any book on linear algebra.

We can normalize the vectors to unit length using the norm function, because we're not interested in their absolute magnitude, only the angle between them.

Now that we have a way of computing distance between vectors, we're
almost ready to run our search engine. The last bit of the puzzle is a
subroutine to take a query vector and compare it against all of the
document vectors in turn, returning a ranked list of matches:

All that remains is to sort the results by the cosine (in descending
order), format them, and display them to the user.

You can find an object-oriented implementation of this code in Listing 1, complete with built-in stop list and some small changes to make the
search go faster (for the curious, we normalize the document vectors
before storing them, to save having to do it every time we run the
cosine subroutine).

Our get_words subroutine is rudimentary - the code equivalent of
a sharpened stick. For one thing, it will completely fail on text
containing hyperlinks, acronyms or XML. It also won't recognize
proper names or terms that contain more than one word ( like
"Commonwealth of Independent States"). You can make the parser smarter
by stripping out HTML and other markup with a module like
HTML::TokeParser, and building in a part-of-speech tagger to find
proper names and noun phrases (look for our own Lingua::Tagger::En,
coming soon on the CPAN).

Perl has great Unicode support, and the vector model doesn't care about
language, so why limit ourselves to English? As long as you can write a
parser, you can adapt the search to work with any language.

Most likely you will need a special stemming algorithm. This can be
easy as pie for some languages (Chinese, Italian), and really hard for
others (Arabic, Russian, Hungarian). It depends entirely on the
morphology of the language. You can find published stemming algorithms
online for several Western European languages, including German,
Spanish and French.

It's easy to add a "find similar" feature to your search engine. Just
use an existing document vector as your query, and everything else falls
into place. If you want to do a similarity search on multiple documents,
then add the vectors together.

Term weighting is a fancy way of saying "some words are more
important than others". Done properly, it can greatly improve
search results.

You calculate weights when building document vectors. Local weighting assigns values to words based on how many times they appear in
a single document, while global weighting assigns values based on
word frequency across the entire collection. The intuition is that rare
words are more interesting than common words (global weighting), and
that words that appear once in a document are not as relevant as words
that occur multiple times (local weighting).

If your documents have metadata descriptors (dates, categories, author
names), then you can build those in to the vector model. Just add a slot for
each category, like you did for your keywords, and apply whatever kind
of weighting you desire.

You can add arbitrary constraints to your result set by adding a chain
of filters to your result set. An easy way to do exact phrase matching
is to loop through your result set with a regular expression. This kind
of post-processing is also a convenient way to sort your results by
something other than relevance.

Lucene is a Java-based keyword search engine, part of the Apache project.
It's a well-designed, open-source search engine, intended for larger projects.
The documentation discusses some of the challenges of implementing a large
search engine; it's worth reading even if you don't know Java.

Foundations of Statistical Natural Language Processing, MIT Press (hardcover textbook). Don't be put off by the title - this book is a fantastic introduction to all kinds of issues in text search, and includes a thorough
discussion of vector space models.

An introduction to latent semantic indexing, the vector model on steroids. The document is aimed at nontechnical readers, but gives some more background information on using vector techniques to search and visualize data collections.

The adventurous can also download some Perl code for latent semantic
indexing at http://www.nitle.org/lsi.php. Both the code and the article
come from my own work for the National Institute for Technology and
Liberal Education.

Last week I mentioned that nobody seemed to have commented on Dan's
bet with Guido van Rossum that Parrot would outperform Python by OSCON
2004. (I also missed the fact that the penalty for losing the bet now
involves cream pies as well as $10 and a round of drinks...). After I
posted the summary to the mailing lists, Leopold Tötsch informed
me that he had commented indirectly by announcing his new, improved,
ludicrously quick runloop that combines computed GOTOs and
predereferencing. Whatever that means.

This week, Leo and Nicholas Clark worked out how to combine the
blistering pace of the JIT core (for operations that had been
translated into hand hacked machine code) with the blistering pace of
the CGP runloop (for the other ops). As far as I can tell, this
involved turning the idea 'inside out', the VM actually starts up
running JIT compiled code and calls out to the CGP core to execute
non-JITable sequences of ops. The numbers for this approach look
fantastic (quite stunningly quick...) So Leo checked it in.

Last week we were reminded that JIT and Predereferenced runloops don't
work in a threaded environment. This week Jerome Vouillon pointed out
an approach that looks like it could fix that (it seemed to convince
Leo). Dan (possibly playing mail catchup) said he was okay with having
to fall back to the old fast core (as opposed to the current, stupidly
fast core) in the presence of threads, but Leo seem to think that,
using Jerome's scheme we'll be able to have our cake, eat it and still
throw the cream pie at the Python team. Huzzah!

Leo Tötsch had wondered about why
{get,set}_<type>_keyed_int vtable methods needed to take an INTVAL*
value instead of a plain INTVAL as it introduced some possibly unneeded
conditional code, a stack variable for the key and made life hard for
JIT code. It looks like the pointer is a holdover from an early
approach to doing multidimensional keyed structures.

Jürgen Bömmels announced that he'd implemented macro
expansion in IMCC (Yay Jürgen!). Leo liked it, but requested a
few changes before he'd check it in, so hopefully, some time soon the
mainline IMCC will have macros, which is very nice.

Almost all the discussion was about the difference between arrays and
lists. Deborah Ariel Pickett came up with a good list of questions
about arrays and array references in scalar and list contexts, which
Michael Lazzaro answered (very neatly I thought) with a list of their
different contextual behaviours. Deborah extended Michael's list to
hashes and hashrefs in a reasonably obvious way. Smylers came up with
a possible new sigquote (after paraphrasing): ``We should limit new
features to those that arise from a desire for the functionality
rather than a desire to use up 'spare' syntax''. (Okay, it's not
exactly snappy, but I think it's important).

And that wraps it up for the language list. I'm sure it'll pick up
again soon though. There are rumours of a draft apocalypse in the next
couple of weeks, and presumably that implies a real apocalypse soon
after. Assuming we haven't had another kind of Apocalypse in the mean
time.

This week's summary was again prepared in the comfy chair with
surprisingly few distractions apart from the late arrival of mail from
Leon Brocard announcing that he'd implemented a brainf*ck compiler in
brainf*ck, but that didn't really happen this week so I've got no
excuse for mentioning Leon's name in this summary.

Thanks to everyone who dropped us a line about our American Odyssey;
I'm hoping I'll have one of those web page things up at some point
with a rough itinerary in case anyone wants to come and throw rotten
tomatoes at me (or, preferably, buy me sushi).

Proofreading services were again down to Aspell and me.

If you appreciated this summary, please consider one or more of the
following options:

This article was originally published in February of 2003, and was updated by Dave Rolsky and Michael Schwern in January, 2008.

If you've ever created a Perl module for distribution on CPAN, you've
used the ExtUtils::MakeMaker module. This venerable module goes
back to the dawn of modern Perl, which began with the release of Perl
5.000.

Recently, Ken Williams has created a potential replacement for
ExtUtils::MakeMaker called Module::Build, which was first
released in August of 2002. Hugo van der Sanden, the pumpking for the
current development version of Perl, has expressed interest in
replacing ExtUtils::MakeMaker with Module::Build for the 5.10
release of Perl, and Michael Schwern, the current
ExtUtils::MakeMaker maintainer, agrees with him.
ExtUtils::MakeMaker won't go away any time soon, but we can hope
for a gradual transition to a new and improved build system.

The ExtUtils::MakeMaker module, along with the h2xs script, has
been a huge boon to the Perl community, as it makes it possible to
have a standard way to distribute and install Perl modules. It
automates many tasks that module authors would otherwise have to to
implement by hand, such as turning XS into C, compiling that C code,
generating man pages from POD, running a module's test suite, and of
course, installing the module.

ExtUtils::MakeMaker is a huge part of what makes PAUSE and CPAN
possible, and it is quite a programming feat. Python did not have a
similar utility until the September, 1999 release of distutils, PHP's
PEAR didn't begin until mid-2000, and Ruby is just beginning to work
on a standard library distribution mechanism.

ExtUtils::MakeMaker works by generating a makefile that contains
targets for the various tasks needed when maintaining or installing a
module. Here's an example target:

install :: all pure_install doc_install

If you're familiar with makefile syntax, you'll realize that all this
target does is call three other targets, which are named "all",
"pure_install", and "doc_install". These targets in turn may call
other targets, or use system commands to do whatever action is needed,
in this case installing the module and its associated documentation.

The makefiles generated by ExtUtils::MakeMaker are fairly complex.
For example, using version 6.05 of ExtUtils::MakeMaker to install
my Exception::Class module, a pure Perl distribution containing
just one module, creates a makefile with about 390 lines of makefile
code. Figuring out what this makefile actually does is no simple
feat, because it consists of a maze of twisty macros, all alike, many
of which simply call Perl one-liners from the command line to perform
some task.

The ExtUtils::MakeMaker code itself is extremely complicated, as it
works on many operating systems (almost as many as Perl itself), and
needs to accommodate their file systems, command line shells,
different versions of make, and so on. And this is all done with an
extra layer of indirection in place, because it is generating a
makefile which does all the work.

If you want to customize the module build or installation process,
good luck. To do this, you must subclass ExtUtils::MakeMaker,
override methods that generate specific makefile targets, and then
tweak the generated text to include your custom behavior, all the
while preserving the basic behavior of the target. Considering that
there is no documentation describing what to expect from these
targets, and that the actual target text may change between releases
of ExtUtils::MakeMaker or between different OS's, this can be quite
painful to implement and maintain.

And by the way, you can't really subclass ExtUtils::MakeMaker,
instead you are subclassing the MY package. This is a deeply
strange hack, but the end result is that you can only override certain
pre-defined methods in ExtUtils::MakeMaker.

For example, the HTML::Mason module includes this snippet:

package MY;

sub test {
my $self = shift;

my $test = $self->SUPER::test(@_);

# %MY::APACHE is set in makeconfig.pl.
# If we are not going to test with Apache there is no harm in
# setting this anyway.

The goal of all this code is to pass some additional environment
information to the test scripts when they are run, so we can do tests
with a live Apache server. It accommodates several versions of
ExtUtils::MakeMaker, and attempts to work properly on multiple
operating systems (at least Win32 and *nix), and it has to be careful
about escaping things properly so that it executes properly from the
shell.

All of this prompts the question of "why not just use Perl itself for
all of this?" That's exactly the question that Ken Williams answered
with Module::Build. The goal of Module::Build is to do
everything useful that ExtUtils::MakeMaker does, but to do this all
with pure Perl wherever possible.

This greatly simplifies the build system code, and Module::Build
works on systems which don't normally include make, like Win32 and Mac
OS. Of course, if a module installation requires the compilation of C
code, you'll still need an external C compiler.

Additionally, customizing Module::Build's behavior is often quite
trivial, and only requires that you know Perl, as opposed to knowing
make syntax and possibly having to learn about multiple command line
environments.

Module::Build also aims to improve on some of the features provided
by ExtUtils::MakeMaker. One example is its prerequisite-checking
system, which provides much more flexibility than what
ExtUtils::MakeMaker allows. While these features could be added to
ExtUtils::MakeMaker, it's risky to make major changes to such an
important module, especially one with such complex internals.

From an end-user perspective, a module that uses Module::Build
looks quite a bit like one using ExtUtils::MakeMaker, and
intentionally so. So to install a module using Module::Build you'd
type the following lines from the command line:

perl Build.PL
./Build
./Build test
./Build install

The Build.PL script tells Module::Build to create a Build
script. During this process, Module::Build also writes some files
to a _build/ directory. These files are used to store the build
system's state between invocations of the Build script. This
script, when invoked, simply loads up Module::Build again, and
tells it to perform the specified action. An action is the
Module::Build version of a makefile target, and actions are
implemented in pure Perl whenever possible.

The "module_name" parameter is like the ExtUtils::MakeMaker "NAME"
parameter.

The "license" parameter is new with Module::Build, and its intended
use it allow automated tools to determine under which license your
module is distributed.

To determine your module's version, Module::Build looks in the
module specified by the "module_name" parameter, though this can be
overridden either by specifying a different module to look in, or by
providing the version number directly.

Of course, there are more options than those. For example,
Module::Build implements a prerequisite feature similar to that
implemented by ExtUtils::MakeMaker, so we can write:

If you have any experience with ExtUtils::MakeMaker, you can
probably figure out that this says that our module requires any
version of CGI, and version 2.1013 or greater of DBD::mysql. So
far, this looks just like what ExtUtils::MakeMaker provides, but
Module::Build goes further.

Consider what happens if we know that we need some piece of
functionality first present in DBD::mysql 2.1013. But perhaps a
release after this, 2.1014, introduced a new bug that breaks our
application. If this bug is fixed in version 2.1015, we could
simply require version 2.1015, but this is not ideal. There's no
reason to force someone who already has 2.1013 to upgrade because of a
bug in a version they don't have.

Fortunately, Module::Build provides a more flexible version
specification option that handles exactly this situation, so we can
write:

If the user does have version 3.0 or greater installed, it will at
least let them know that it won't work with our module.
Unfortunately, the only possible way to use our module at this point
is for the end user to manually downgrade their installation of
DBD::mysql, since Perl does not allow multiple versions of a module
to co-exist peacefully. Still, this is better than letting the module
be installed, only to fail at runtime when it tries to use an outdated
API for DBD::mysql.

There are also other options related to prerequisites, such as
"recommends" and "build_requires", which can be helpful for
prerequisites that are required to build the module but don't need to
be present after installation. There is also a "conflicts" option
which can be used to warn a user about potential conflicts between the
module they are installing and one they already have.

This is the default action, and is what happens if you run ./Build
without any additional arguments. It is responsible for creating the
blib/ directory and copying files into it, as well as compiling XS
and C files. If you have any scripts like lib/My/Module.pm.PL,
these are also run during this action.

Runs the module's tests using the Test::Harness module. The
"testdb" action can be used to run the tests under Perl's debugger.
Equivalently, a "debugger" parameter can be passed to the "test"
action to get the same effect.

Creates a directory based on your distribution name and version, and
then copies all the files listed in MANIFEST to that directory.
This directory is what people will see when they download and unpack
your distribution.

Module::Build also creates a file called META.yaml which
contains meta-data about your distribution. In the future, it may be
possible to use a command line tool (written in Perl, of course) to
read this file and use its contents to install your distribution,
without running the Build.PL script. It also makes the meta-data
more readily available to tools like search.cpan.org or the CPAN
shell.

This performs the "distdir" action, switches to the newly created
directory, and then runs perl Build.PL, ./Build, and ./Build
test. This lets you make sure that your distribution is actually
installable.

This version is actually readable, and is unlikely to break regardless
of changes in the Module::Build internals. This highlights just
how difficult it was to accomplish a simple task using
ExtUtils::MakeMaker, and how natural the pure-Perl solution can be.

One difficulty in getting Module::Build into widespread use is the
fact that support for ExtUtils::MakeMaker is so tightly integrated
into CPAN installers.

While the version of CPAN.pm which ships with 5.8 does not know how to deal with Build.PL, the latest versoin available from CPAN does. It will even install Module::Build for you. As of January 2008, CPANPLUS, another CPAN shell, understands Build.PL but will not install Module::Build for you, but this will be remedied in a future release.

However, old versions of CPAN.pm are still in extremely widespread use, and users won't
necessarily upgrade CPAN.pm before attempting to install a
distribution that relies on Module::Build.

There are a couple workarounds for this problem. The simplest is to
just include a Build.PL script and document this in the README
or INSTALL files included with your distribution. This has the
appeal of requiring of very little work to implement, but the downside
is that people who expect things to just work with a CPAN shell will
give up when your distribution doesn't install properly.

Another possibility is to create functionally equivalent Build.PL
and Makefile.PL scripts. If you're using Module::Build because
you need to customize installation behavior in a way that is difficult
to do with ExtUtils::MakeMaker, this pretty much defeats the
purpose of using Module::Build at all, and in any case having two
separate pieces of code that do the same thing is always unappealing.

Then there's the approach which involves using a
Makefile.PL script that simply installs Module::Build if needed,
and then generates a Makefile that passes everything through to the
./Build script. This is known as the "passthrough" method.

I think this approach gives the best result for the effort involved,
and is the method I prefer. The Module::Build distribution
includes a Module::Build::Compat module, which does the dirty work needed for this approach.

Simply add create_makefile_pl => 'passthrough' to the Build.PL parameters and a Makefile.PL will be created as part of the Build dist process.

This first attempts to load version 0.02 or greater of the
Module::Build::Compat module. If it isn't installed we know we
need to install Module::Build. Because we're polite, we ask the
user if they would like to install Module::Build before going
further. Some people dislike interactive installations, but fortunately the promp() command is pretty smart about detecting if there's a user at the end of the line.

Assuming that the user agrees to install Module::Build (if they don't the installer has to give up) this is what comes next:

We want to use CPAN.pm to actually install Module::Build, but we
need to first save our current directory, because CPAN.pm calls
chdir() quite a bit, and we'll need to be in the same directory as
we started in after installing Module::Build.

Then we load CPAN.pm and tell it to install Module::Build.
After that, we chdir() back to our original directory.

First it checks that the Module::Build install worked. Then it simply tells Module::Build::Compat to run the Build.PL
script, and to write out a "passthrough" Makefile.
Module::Build::Compat will attempt to convert
ExtUtils::MakeMaker style arguments, like "PREFIX", to arguments
that Module::Build can understand, like "--prefix".

The "passthrough" Makefile that Module::Build::Compat generates
looks something like this:

The ".DEFAULT" target is called when there is no matching make target
for the one given on the command line. It uses the "$@" make
variable, which will contain the name of the target that was passed to
make. So if "make install" is called, then "$@" contains "install",
and it ends up running "./Build install".

The generated Makefile also contains a comment which specifies the
module's prerequisites, because this is how CPAN.pm figures out
what a module's prerequisites are (scary but true).

This approach is the most elegant of all, but the code that translates
ExtUtils::MakeMaker arguments to something Module::Build
understands is quite minimal and won't handle all possibilities.

I have used this approach for one CPAN module, Thesaurus.pm, and in
my limited testing it did work. If you are inclined to try installing
this module, please send bug reports to me or the Module::Build
users list, module-build-general@lists.sf.net.

Recently, Autrijus Tang submitted a more complex Makefile.PL script
which implements several pieces of additional functionality. First of
all, it makes sure that the script is running as a user that can
actually install Module::Build. Second, it prefers CPANPLUS.pm
to CPAN.pm.

Autrijus' script looks promising, but since it hasn't yet been tested,
I've chosen not to include it here. It's quite likely that some
version of his script will be documented in future versions of
Module::Build

There is plenty of work left to be done on Module::Build. Off the
top of my head, here are some things that still need to be done:

The installation phase does not yet create man pages based on POD
included in the distribution.

Module::Build needs to implement a "local install" feature like the
one provided by the ExtUtils::MakeMaker "PREFIX" argument. The
logic that implements this in ExtUtils::MakeMaker is Byzantine, but
only because doing this correctly is quite complicated. This logic
needs to be implemented for Module::Build as well.

Module::Build needs better backwards compatibility with
ExtUtils::MakeMaker. The argument translation in
Module::Build::Compat is currently just a placeholder. Things like
"PREFIX", "LIB", and "UNINST=1" all need to be translated by
Module::Build::Compat, and the equivalent functionality needs to be
implemented for Module::Build

CPANPLUS.pm could take advantage of more Module::Build features.
For example, it currently ignores the "conflict" information that
Module::Build makes available, and it doesn't attempt to
distinguish between "build_requires", "requires", or "recommends".

Some of what Module::Build provides is intended for use by external
tools, such as the meta-data provided by the META.yaml file.
CPANPLUS.pm could use this to avoid having to run the Build.PL
and Build scripts, thus avoiding the need to install any of the
"build_requires" modules. Package managers like rpm or the Debian
tools could also use it to construct installable packages for Perl
modules more easily.

Adding at least basic support for Module::Build to CPAN.pm would
be nice. If anyone is interested in pursuing this, I have an old
patch that may provide a decent start on this. Contact me if you're
interested.

If you want to learn more about Module::Build, the first thing you
should do is install it (it will install itself just fine under
CPAN.pm) and read the docs for the Module::Build and
Module::Build::Compat modules. There is a project on SourceForge
for Module::Build at
http://www.sourceforge.net/projects/module-build. The source is in
CVS on SourceForge, and is accessible via anonymous CVS and and
online.

Dan announced that he'd made a bet with Guido van Rossum that Parrot
would be faster at executing a pure python benchmark of some sort to
be determined. The details of the challenge will be announced at OSCON
2003 and the results at OSCON 2004. Dan stands to lose $10 and a round
of beer for the Zope/Pythonlabs folks. (Dunno how many of them there
are though...). We don't know what he stands to win yet, but I'd hope
'beers from each of the Zope/Pythonlabs folks' are included.

Jerome Quelin wondered how Parrot objects would handle property
inheritance. Dan pointed out that properties don't get inherited and
Jerome realised he meant to ask about attribute
inheritance. Attributes are inherited but are mostly invisible to
any methods but the methods of the class that defines the attributes
(though they will be accessible (presumably through reflection)).

James Michael DuPont wanted to know what had been decided about
Bytecode metadata and outlined the things that he'd like to know about
a given bytecode. Leo Tötsch reckoned that what James wanted was
either in the bytecode right now, or was handleable by the framework
that was in place. He pointed James to docs/parrotbyte.pod in the
Parrot distribution.

Further discussions centred on the amount of metadata and whether
this would affect loading speed and efficiency, or get in the way of
the desired 'mmap and go' principle. Gregor N. Purdy
pointed out that we also had to be capable of passing meta data from a
compiler 'through' IMCC and on to the final bytecode. There was also a
touching reunion between James Michael DuPont and Gopal V. Ah...

Phil Hassey has been lurking, fascinated on the internals list for a
couple of months. This week he broke the silence by asking a couple of
questions about cross language function dispatch, string compatibility
and global scopes. For instance, PHP, Common Lisp and others are case
insensitive about functions. Which is okay when you're calling such a
function from a case sensitive language, but can be tricky if you call
out from a case insensitive to a case sensitive language. Dan thought
that there wasn't going to be much that could be done about this
problem (at least, not transparently) but seems to think that the
other issues raised shouldn't be too problematic.

David popped up and, after remarking on the amount of progress Parrot
had made since he last looked at it, had a few questions about various
bits and pieces. Leo and Dan provided a bunch of good answers.

Leo Tötsch seems to have got bored of 'just' being the Patch
Monster and has been producing some discussion documents proposing all
sorts of useful tricks for improving the design/speed/extensibility of
Parrot's internals. This week he turned his attention to
core.ops. His plan involves a way of reducing the size of core_ops,
improving cache locality and generally solving a raft of problems I
didn't even know Parrot had. His new scheme allows for a small
core.ops, easy extension and no execution speed penalty for non
core ops. As is usual with Leo's stuff, the scheme came with
code. Gregor had a bunch of observations and suggestions and he and
Leo thrashed out a slightly modified scheme for going forward.

Leo Tö offered a couple of different core runloops this
week. First up was the Computed Goto Prederefed (CGP) runloop which,
essentially combined a two runloop optimization techniques to provide
what can only be described as massive speedups. The -O3 compiled
version ran parrot_mops 6 times faster than the basic 'fast_core' and
more than 3 times faster than the previous fastest core. People were
impressed.

A few days later, Leo reached into his bag of tricks and pulled out
the CSwitch runloop that did predereferenced dispatch via a big C
switch statement. This loop is about 50% faster than the basic
fast_core loop, but slower than all the computed goto dispatched
loops. At this point Jason Gloudon pointed out that predereferencing
can cause problems in a multi threaded environment.

And that about wraps it up for perl6-internals this week. Dan's been
busy in Sebastapol with the Perl 6 Cabal thrashing out the design of
Perl 6 functions and other bits with Larry, Damian, Allison and
chromatic, which probably explains why he's not been driving the
mailing list for the week.

Miko O'Sullivan proposed a new assignment operator, $var ?= $x :
$y, equivalent to $var = $var ? $x : $y. It was pointed out that,
for correctness that should be ??=. Deborah Ariel Pickett wondered
what was the need for such an operator and couldn't imagine ever
needing such an operator.

Miko O'Sullivan proposed that members of the language list provide
summaries of the discussions in the group [What's this? Chopped liver?
-- Ed]. The idea being that each summary would describe a proposed
idea and then summarizes the list's feelings on the idea, and would be
posted on a website. The idea was basically well received, then the
list fell to quibbling about format, whether summaries should be
posted to the list where they could be picked up by the list
summarizer, and whether we were just reinventing RFCs all over again.

As of now, the only summary that actually exists is Miko's summary of
the proposed ??= operator. Which I hadn't even bothered to read
until I came to write the summary.

Stéphane Payrard confessed that he would like newline to be a
statement terminator everywhere it can be. Everybody hated it. Deborah
Ariel Pickett came up with a neat example of why it was a really bad
idea with a simple ambiguous piece of code:

In his quest to exhaustively document what is known about Perl so far,
Michael Lazzaro asked for a definitive answer to the question ``What's
the difference between an array and a list in Perl 6''?

Various answers were given but I think Uri Guttman came up with the
most succinct pair of definitions, which reduce to 'lists are read
only, arrays are read/write'. (That's the data structure itself, not
its contents; ($a, $b) = (1, 2) doesn't modify either list, but it
does modify the lvalues in the left-hand list.)

This week's summary was prepared in the comfy chair with distractions
from cats, irc, former employers, bugs and poring over maps of the
Eastern Seaboard.

We're coming to America. This year, making a virtue out of a necessity,
we're going to give ourselves a good long holiday in the States. We'll
be flying out to Boca Raton for YAPC, then taking a road trip up
through the Appalachians to New England, possibly swinging into Canada
before coming down through Michigan to Chicago and taking a slow train
to Portland, OR for The Perl Conference. If it looks like you might be
on our route, drop me a line...

Proofreading services were provided by Aspell and me. (I could lie and
say that they were provided by Leon Brocard, enabling me to work his
name into the summary somehow, but I think I'll do it in a
parenthetical comment instead.)

If you appreciated this summary, please consider one or more of the
following options:

Correct configuration of the MinSpareServers, MaxSpareServers,
StartServers, MaxClients, and MaxRequestsPerChild parameters
is very important. There are no settings that suit everyone. If they are too low, then you
will underutilize the system's capabilities. If they are too high, then
chances are that the server will bring the machine to its knees.

All the above parameters should be specified on the basis of the
resources you have. With a plain Apache server, it's no big deal if
you run many servers since the processes are about 1Mb and don't eat a
lot of your RAM. Generally, the numbers are even smaller with memory
sharing. The situation is different with mod_perl. I have seen
mod_perl processes of 20Mb and more. Now, if you have MaxClients
set to 50, then 50x20Mb = 1Gb. Maybe you don't have 1Gb of RAM - so how
do you tune the parameters? Generally, by trying different
combinations and benchmarking the server. Again, mod_perl processes
can be made much smaller when memory is shared.
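
The arithmetic above, as a quick Perl sketch (the sizes are the
illustrative ones from the paragraph, not measurements):

  # Worst case memory footprint if nothing is shared between the children.
  my $child_size_mb = 20;                              # a fat mod_perl child
  my $max_clients   = 50;
  my $worst_case_mb = $child_size_mb * $max_clients;   # 1000 MB, about 1Gb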

Before you start this task, you should be armed with the proper weapon.
You need the crashme utility, which will load your server with the
mod_perl scripts you possess. You need it to have the ability to
emulate a multiuser environment and to emulate the behavior of
multiple clients calling the mod_perl scripts on your server
simultaneously. While there are commercial solutions, you can get
away with free ones that do the same job. You can use the
ApacheBench utility that comes with the Apache distribution, the
crashme script (which uses LWP::Parallel::UserAgent), httperf or
http_load, all discussed in one of the previous articles.
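
A typical ApacheBench invocation of the kind used in the runs below
looks something like this (the URL and request counts are illustrative):

  ab -n 10000 -c 50 http://www.example.com/perl/access/access.cgi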

It is important to make sure that you run the load generator (the
client which generates the test requests) on a system that is more
powerful than the system being tested. After all, we are trying to
simulate Internet users, where many users are trying to reach your
service at once. Since the number of concurrent users can be quite
large, your testing machine must be very powerful and capable of
generating a heavy load. Of course, you should not run the clients and
the server on the same machine. If you do, then your test results will be
invalid. Clients will eat CPU and memory that should be dedicated to
the server, and vice versa.

We see that the server is capable of serving 50 concurrent users at
133 requests per second! Let's find the upper limit. Using -n
10000 -c 1000 failed to get results (Broken Pipe?). Using -n 10000
-c 500 resulted in 94.82 requests per second. The server's
performance went down with the high load.

The benchmark timed out with the above configuration. I watched the
output of ps as I ran it; the parent process just wasn't capable
of respawning the killed children at that rate. When I raised the
MaxRequestsPerChild to 10, I got 8.34 requests per second. Very
bad - 18 times slower! You can't benchmark the importance of the
MinSpareServers, MaxSpareServers and StartServers with this
type of test.

Now let's reset MaxRequestsPerChild to 1500, but reduce
MaxClients to 10 and run the same test:

I got 27.12 requests per second, which is better but still four to five times
slower. (I got 133 with MaxClients set to 50.)

Summary: I have tested a few combinations of the server
configuration variables (MinSpareServers, MaxSpareServers,
StartServers, MaxClients and MaxRequestsPerChild). The
results I got are as follows:

MinSpareServers, MaxSpareServers and StartServers are only
important for user response times. Sometimes users will have to wait a
bit.

The important parameters are MaxClients and MaxRequestsPerChild.
MaxClients should not be so big that it abuses your machine's
memory resources, and not so small that your users are forced to wait
for a child to become free to serve them.
MaxRequestsPerChild should be as large as possible, to get the full
benefit of mod_perl, but watch your server at the beginning to make
sure your scripts are not leaking memory, thereby causing your server
(and your service) to die very fast.

Also, it is important to understand that we didn't test the response
times in the tests above, but the ability of the server to respond
under a heavy load of requests. If the test script were heavier, the
numbers would be different but the conclusions similar.

The benchmarks were run with:

HW: RS6000, 1Gb RAM

SW: AIX 4.1.5, mod_perl 1.16, apache 1.3.3

Machine running only mysql, httpd docs and mod_perl servers.

Machine was _completely_ unloaded during the
benchmarking.

After each server restart when I changed the server's configuration, I
made sure that the scripts were preloaded by fetching a script at
least once for every child.

It is important to notice that none of the requests timed out, even if
they were kept in the server's queue for more than a minute! That is the
way ab works, which is OK for testing purposes but will be
unacceptable in the real world - users will not wait for more than
five to ten seconds for a request to complete, and the client
(i.e. the browser) will time out in a few minutes.

Now let's take a look at some real code whose execution time is more
than a few milliseconds. We will do some real testing and collect the
data into tables for easier viewing.

Conclusions: Here I wanted to show that when the application is
slow (not due to perl loading, code compilation and execution, but
limited by some external operation), it almost does not matter what
load we place on the server. The RPS (Requests per second) is almost
the same. Given that all the requests have been served, you have the
ability to queue the clients, but be aware that anything that goes
into the queue means a waiting client and a client (browser) that
might time out!

Now we will benchmark the same script without using mysql (code
limited by perl only): http://www.example.com/perl/access/access.cgi
is the same script, but it just returns the HTML form without making
any SQL queries.

Conclusions: This time the script we executed was pure perl (not
limited by I/O or mysql), so we see that the server serves the
requests much faster. You can see the number of requests per second
is almost the same for any load, but goes lower when the number of
concurrent clients goes beyond MaxClients. With 25 RPS, the
machine simulating a load of 400 concurrent clients will be served in
16 seconds. To be more realistic, assuming a maximum of 100
concurrent clients and 30 requests per second, the client will be
served in about 3.3 seconds. Pretty good for a highly loaded server.

Now we will use the server to its full capacity, by keeping all
MaxClients clients alive all the time and having a big
MaxRequestsPerChild, so that no child will be killed during the
benchmarking.

Conclusions: Very little difference! Ten servers were able to serve
with almost the same throughput as 50. Why? My guess is that it comes
down to CPU scheduling: it seems that each of the 10 servers was
serving requests five times faster than when we worked with 50
servers, because with 50 each child received its CPU time slice five
times less frequently. So having a big value for MaxClients doesn't
mean that the performance will be better. You have just seen the numbers!

Conclusion: mod_cgi is much slower. :) In the first test, when
NR/NC was 100/10, mod_cgi was capable of 1.12 requests per second. In
the same circumstances, mod_perl was capable of 32 requests per
second, nearly 30 times faster! In the first test, each client waited
about 100 seconds to be served. In the second and third tests, they
waited 1,000 seconds!

The MaxClients directive sets the limit on the number of
simultaneous requests that can be supported. No more than this number
of child server processes will be created. To configure more than 256
clients, you must edit the HARD_SERVER_LIMIT entry in httpd.h
and recompile. In our case, we want this variable to be as small as
possible, so we can limit the resources used by the
server children. Since we can restrict each child's process size with
Apache::SizeLimit or Apache::GTopLimit, the calculation of
MaxClients is pretty straightforward:
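
(As a rough sketch in Perl, with illustrative numbers rather than
figures from the tests above:)

  # Divide the RAM you dedicate to the web server by the per-child ceiling
  # you enforce with Apache::SizeLimit or Apache::GTopLimit.
  my $total_ram_kb = 400_000;   # RAM dedicated to httpd
  my $max_child_kb =  20_000;   # hard cap per child
  my $max_clients  = int( $total_ram_kb / $max_child_kb );   # 20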

There is no problem -- any connection attempts over the MaxClients
limit will normally be queued, up to a number based on the
ListenBacklog directive. When a child process finishes serving
another request and is freed, the queued connection will be served.

It is still an error, because clients are being put in the queue
rather than served immediately, even though they do not get an error
response. This condition can be allowed to persist, to balance
available system resources against response time, but sooner or later
you will need to get more RAM so you can start more child
processes. The best approach is to try not to reach this condition at
all; if you reach it often, you should start to worry about it.

It's important to understand how much real memory a child occupies.
Your children can share memory between them when the OS supports that.
You must take action to allow the sharing to happen. We have
discussed this in one of the previous articles, whose main topic was
shared memory. If you do this, then chances are that your MaxClients
can be even higher. But it seems that it's not so simple to calculate
the absolute number. If you come up with a solution, then please let us
know! If the shared memory was of the same size throughout the
child's life, then we could derive a much better formula:
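
(One way to write such a formula, sketched in Perl under that
assumption; the numbers are illustrative:)

  # Count the shared portion once, and only multiply the unshared part
  # of each child by the number of children.
  my $total_ram_kb = 400_000;
  my $max_child_kb =  20_000;   # total size of one child
  my $shared_kb    =  10_000;   # portion shared by every child
  my $max_clients  = int( ($total_ram_kb - $shared_kb)
                        / ($max_child_kb - $shared_kb) );   # 39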

The MaxRequestsPerChild directive sets the limit on the number of
requests that an individual child server process will handle. After
MaxRequestsPerChild requests, the child process will die. If
MaxRequestsPerChild is 0, then the process will live forever.

Setting MaxRequestsPerChild to a non-zero limit solves some memory
leakage problems caused by sloppy programming practices, where a
child process consumes more memory after each request.
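
For example, a hypothetical mod_perl 1.x handler (the names here are
mine, purely for illustration) that accumulates state across requests:

  # Package-level data persists for the life of the child, so this hash
  # grows on every request until MaxRequestsPerChild recycles the process.
  package My::LeakyHandler;           # hypothetical module name
  use strict;
  use Apache::Constants qw(OK);

  my %seen;                           # lives as long as the child does

  sub handler {
      my $r = shift;
      $seen{ $r->uri }++;             # unbounded growth across requests
      $r->send_http_header('text/plain');
      $r->print("requests seen by this child: " . scalar(keys %seen) . "\n");
      return OK;
  }
  1;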

If left unbounded, then after a certain number of requests the
children will use up all the available memory and leave the server to
die from memory starvation. Note that sometimes standard system
libraries leak memory too, especially on OSes with bad memory
management (e.g. Solaris 2.5 on x86 arch).

If this is your case, then you can set MaxRequestsPerChild to a small
number. This will allow the system to reclaim the memory that a
greedy child process consumed, when it exits after
MaxRequestsPerChild requests.

But beware -- if you set this number too low, you will lose some of
the speed bonus you get from mod_perl. Consider using
Apache::PerlRun if this is the case.

Another approach is to use the Apache::SizeLimit or the
Apache::GTopLimit modules. By using either of these modules you
should be able to discontinue using MaxRequestsPerChild,
although for some developers, using both in combination does the
job. In addition, the latter module allows you to kill any servers
whose shared memory size drops below a specified limit.
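
A minimal sketch of that approach for the mod_perl 1.x era (the
threshold values are illustrative, and the exact knobs may differ
between module versions); the handler also needs to be installed in
httpd.conf, e.g. with PerlFixupHandler Apache::SizeLimit:

  # In your startup.pl:
  use Apache::SizeLimit;
  $Apache::SizeLimit::MAX_PROCESS_SIZE       = 20_000;  # KB: recycle children above this
  $Apache::SizeLimit::CHECK_EVERY_N_REQUESTS = 5;       # don't check on every single hit

  # Apache::GTopLimit adds the shared-memory floor mentioned above:
  # use Apache::GTopLimit;
  # $Apache::GTopLimit::MAX_PROCESS_SIZE        = 20_000;  # KB
  # $Apache::GTopLimit::MIN_PROCESS_SHARED_SIZE =  4_000;  # KB

  1;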

Piers Cawley worried about class private attributes and asked if it
was still going to be possible to write an object serializing tool
that wouldn't require tons of Perl class overloads. Dan said that it
should still be possible.

Thomas Whateley has been thinking about how to allow Parrot to run
untrusted code in a sandbox, and proposed a system where code would
provide a list of required rights and Parrot would guarantee that only
those privileges were provided. Matthew Byng-Maddick pointed everyone
at the RFC he'd written along these lines and raised the issue of what
happens when you call out to a module that's written natively, which
he described a 'brick wall'. Others were less sure that this was a
showstopper.

Somewhat annoyingly my mailer seems to have had a bad moment and lost
some messages, including one from Dan in which he outlined his vision
for 'safe mode' and gave some pointers to the VMS docs in this area.

Leon Brocard announced the Parrot Developer World Map, which plots the
locations of all the Parrot developers who sent him their location
details on a map of the world. A cursory glance at the map would seem
to imply that a good place for a Parrot Developers' face to face
meeting would be a Caribbean island. Gregor neatly one-upped Leon with
his interactive SVG version.

Jonathan Sillito found some problems with the coroutine code, so he
wrote a test exposing the errors, patched things to fix the errors and
added a documentation patch to be on the safe side. Go Jonathan. Steve
Fink applied the patches. Jonathan also had a question about naming,
and wondered if anyone would be interested in a howto for compiler
writers. Nobody commented on this, but the howto sounds like a great
idea to me.

Leo Tötsch noted some problems with interpreter->code and
multiple code segments, suggested a way forward and asked for
comments. Nicholas Clark commented, wondering about the potential for
memory leaks, but Leo seemed to think that that was covered.

Leo has also been thinking about Parrot run loops and reckons we're
doing it wrong. So he proposed a different way of doing things,
working through its impact on the various kinds of Parrot core. Steve
Fink liked Leo's idea, but Jason Gloudon was less sure.

Michael Lazzaro is working on documenting Perl Arrays and asked for
confirmations and clarifications on various areas of Array behaviour.
One subthread morphed into a discussion of when to use is
properties and when to use but properties. It turns out that the
'is is compile time, but is runtime' rule of thumb is only a
rule of thumb. Damian says that a better distinction is 'but is a
property on a value, is is a property on a referent'. Variables,
arrays, opaque objects, hashes etc are referents and therefore get
is properties. Values are the things that are held in variables,
arrays, opaque objects etc, and they get but properties.

Martin wondered if it would be possible to use [] to do both hash
and array lookups. Short answer ``No''. Which isn't to say the thread
didn't take off like a runaway train. After far too many messages we
reached the same 'No' conclusion that Damian had stated way back at the
start of the thread. The reasoning goes like this: given
$container[$spec], you can't use the type of $container to decide
whether to do a hash or an array lookup because there are Perl object
types which have both hash and array semantics. And you can't use
$spec's type either because Perl happily autoconverts between
numbers and strings. So, you need some way to specify what sort of
lookup you need, and you do that by using {} or [] as
appropriate. But, frankly, that's beside the point. The reason we're
not using [] to do both array and hash style lookups is Because
Larry Says So.
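
(The autoconversion point is easy to demonstrate even in Perl 5; this
aside is mine, not code from the thread.)

  my %h = ( 1 => 'one' );
  print $h{"1"};   # prints 'one': the number 1 and the string "1" are the
                   # same hash key, so the subscript's type can't choose
                   # between hash and array semantics for you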

Argh! I've really been screwed by problems with my mail here. For some
reason a batch of mail appears to have got dropped on the floor this
week and my record of this thread is the one that's suffered the
most.

The gist of the thread is that adding a default value to an array
opens a whole scary can of worms to do with the distinction between
'undef' and 'exists' (made trickier still by the fact that 'exists' is
a pretty vague concept when you're dealing with an array). Damian's
view is that an array with a default value should do the equivalent of

$content = @foo[$index] // @foo.default;

ie, all undefs in the array should be 'replaced' with the array's
default value. And that includes any 'deliberate' undefs (where the
programmer does @foo[$index] = undef). As I understand it, doing
anything else leads to implementation pain. And it leads to even more
fun when your default is computed and the function doing the
computation gets passed the index as an argument. What happens when
someone does a shift?

For some reason this led into a discussion of whether hashes are
ordered (they're not, but if you leave a particular hash alone, any
iterators (keys/values/each) will step through the hash in the same
order, repeatably, but that order is not predictable).

Damian announced that, owing to time pressures, the need to find
paying work, the need to help Larry get the next Apocalypse out the
door and to get the associated Exegesis written, he's unsubscribed from
the language list for the time being, though he expects that he'll
probably resubscribe once A6 and E6 are released. Folks wished him well.

I'm more interested in Parrot than in Perl 6, since it is the only
really free bytecode format out there. I am planning to build a C# to
Parrot compiler. I did a bunch of opcodes in this direction, but
mostly what I do is lurk around in the mailing list & IRC.

I'm from the DotGNU project and have been playing around in that code
for some time. I've still got my foot placed firmly there and am
peering over the hedge into parrot. I'm still in college, so I still
have to find time to study.

This week's summary was once again prepared in a comfortable chair
with plenty of Earl Grey tea and distractions from two delightfully
ginger kittens and my embarrassing ability to screw up my Gnus
installation (now unscrewed).

Thanks to Gopal V for his answers to the questionnaire this week. If
you've been involved in the Perl 6 development process, please
consider answering the same set of questions and sending them to
5Ws@bofh.org.uk and you'll earn my undying gratitude.

Proofreading was handled by the ever lovely Aspell.

If you appreciated this summary, please consider one or more of the
following options: