I have been using the fastest "Perl interpreter ever" (at
least from my experience) for quite some time now. It seems
stable, so I'd like to share that knowledge with you.

Nicholas gave an excellent talk on the topic, "When Perl is not fast enough", some time ago. He mentions that "compiling your own perl" may be an option and reports speed gains of between 5% and 14%.

With recent improvements to GCC and its autovectorization feature, I thought I could spend a Sunday finding out what it would bring me.

The Results:

I fetched the sources for both GCC 3.4 and Perl 5.8.5, compiled GCC 3.4, and then used it to compile Perl 5.8.5 with the options -msse2 and -O3 (use -msse if your CPU doesn't support SSE2). The "autovectorized" perl is consistently about 30% faster than the plain-vanilla perl that comes with a standard Linux distribution (presumably compiled for a generic Pentium), with the lowest speedup, about 20%, seen for store/retrieval and the highest, about 40%, for some list manipulations.
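For anyone who wants to try this at home, the build boils down to handing Configure a custom optimizer line. This is a sketch only: the GCC install path is an assumption of mine, not the author's actual setup, and the exact flags should match your CPU.

```shell
# Build Perl 5.8.5 with a freshly built GCC 3.4 and aggressive flags.
# /opt/gcc34 is an illustrative install prefix for the new compiler.
cd perl-5.8.5

# -Dcc selects the compiler; -Doptimize sets the C flags Configure
# passes to it. Drop -msse2 in favor of -msse on pre-SSE2 CPUs.
sh Configure -des -Dcc=/opt/gcc34/bin/gcc -Doptimize='-O3 -msse2'

make
make test    # verify the optimized build still passes the test suite
make install
```

Running `make test` before installing is the cheap insurance against the "optimizer alters semantics" risk discussed further down.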

I can tell you that 30% is significant and makes recompiling worthwhile. Moreover, it seems GCC doesn't
autovectorize all the cases it could, so we can probably expect
some further improvements. I also suppose that real GCC cracks could find more optimizations for the P4 architecture, but neither my time nor my knowledge allowed for more experiments.

Update:

More specifications about environment and compared interpreters:

As you may or may not see from the data below, the environment is (SuSE) Linux 8.2 and the CPU a Pentium 4-M at 1.8 GHz. The benchmark was our application for natural language processing/understanding, which performs some heavy operations on N-ary trees (a list-based implementation of our own, not the one on CPAN). For example, the "normalization" of a Swedish lexicon
(removing redundant data, sorting trees, etc.) takes 423 seconds with the standard perl and 288 seconds with the optimized one. This is a pretty hard benchmark, as it shuffles data around heavily. We also have results for a run that gathers information about a lexicon, where the speedup is a factor of ten (10!): about 150,000 lexicon entries are iterated, and the number of meanings per entry is evaluated and added to a total. For the Swedish lexicon this takes 20 seconds with the unoptimized version and 2 (!!) seconds with the optimized one.

As you can see, even the old perl was compiled
with -O3, so one cannot say it was not optimized in any way.

I'd like to reiterate that I saw this as an experiment
that would probably fail, because I was reluctant to sacrifice stability for speed. But I'm using the optimized
Perl on a regular basis now, and it has proven to work
with only one side effect: it's faster. :-)

Using this probably depends on how much risk you can live with. Personally, I've always hesitated to turn on extra optimization during compiles. For example, on AIX the man pages for the built-in compiler carry the caveat that "The -O3 specific optimizations have the potential to alter the semantics of a user's program", which doesn't give me any warm feelings.

I'm not saying gcc is subject to this, since I haven't used it in a while, but the fact that these cool features aren't enabled by default seems to indicate the developers understand there is some risk involved.

Many of the optimizations make it hard to debug the program. Most of us don't debug the perl binary, so that's a non-issue.

There are cases where optimization can break code. This is one good reason to have a good test suite. Usually it happens around particularly hairy code (IIRC, Duff's Device tends to trip up optimizers).

Also, higher optimization levels may start trading off time for space, which might make someone still running Perl on an old VAX angry.

In the general case of a regular Perl programmer, running on a reasonably up-to-date machine, higher optimization is fine.

"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

The original interpreter the "autovectorized" one was compared against was also compiled with -O3, as one can see in the updated node.

Everything runs stably, at least on the P4. I don't know about other architectures, but SSE2 is not of much interest there, I guess. Mac/PPC users could probably see similar results if GCC supports the AltiVec engine.

Very cool. It would be interesting to see what effect this would have on the performance of large-scale Perl apps, like Krang. I've been meaning to try compiling Perl with Intel's C compiler. From what I've heard it's very good.

The gain from removing threads can vary between 10% and 40% in the tests we've done. However, you are not comparing like for like: the installed version is based on 5.8.0 and your fast version on 5.8.5.

Unfortunately RH, and probably several other distros, ship a threaded Perl, even though at the time 5.8.0 was released, ithreads were not recommended for production environments. From what I understand, ithreads support is much more stable now.

They can indeed. However, in this case the biggest factor is not using ithreads. In the tests we did here, we compared like for like, and saw a significant improvement when we simply recompiled without ithread support. If we were to add compiler flags, I'm sure we could improve further, but as others have mentioned, adding further optimisations could have side effects.
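A like-for-like rebuild without ithreads is straightforward with Configure. A sketch, assuming you are rebuilding the same Perl version your distro ships (the version directory below is illustrative):

```shell
# Rebuild perl without ithread support. -Uusethreads disables ithreads
# (threaded distro builds are configured with -Dusethreads instead);
# -des accepts the default answers non-interactively.
cd perl-5.8.5
sh Configure -des -Uusethreads
make && make test
```

Note that an unthreaded perl is binary-incompatible with a threaded one, so XS modules built against the distro perl need recompiling.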

There are some mistakes in your reasoning about the performance increase. (I work in computational optimization, and I'm one of the newbie boys of gcc@...)
Tree-SSA will be available only from GCC 4.0 (the next major release).
The flags -msse and -msse2 may or may not increase your performance, but they don't activate vectorization; they only allow the compiler to use SIMD instructions (see GCC's info pages).
You can use -mfpmath=sse,387 (as per GCC's info pages).
30% is a good result, but how did you obtain this number? (Old Perl version, old flags, old compiler flags, etc.)
Recompiling can, however, increase your performance (Slack + Gentoo rules ;))
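If one wants to act on the flag suggestions above, they simply go on the optimizer line handed to Configure. A sketch only; check `info gcc` for what your GCC version actually accepts:

```shell
# -mfpmath=sse,387 lets GCC schedule floating point on both the SSE
# unit and the x87 FPU. -msse2 merely permits SSE2 instructions;
# on GCC 3.x it does not autovectorize anything by itself.
sh Configure -des -Doptimize='-O3 -msse2 -mfpmath=sse,387'
```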
P.S.
The Intel C compiler does vectorization well, but:
1) no source code
2) poor code generation for non-Intel CPUs
3) no AMD64 support

Well - actually I only wanted to see the effect of using the SIMD instructions of the SSE engine. I may have misunderstood the GCC pages, but I thought that "using the SSE engine" to automatically vectorize sequential code IS autovectorization.

I'm actually considering recompiling Perl once again with more optimizations for my specific CPU.

Well, the faster one was compiled with gccversion=3.4.2,
the slower with 3.3 20030226 (prerelease); two different kernels (!= kernel headers), two different versions of perl...
The slower one was compiled with threads enabled (-Dusethreads); using multithreading can (though not always) hurt the performance of CPU-bound applications on non-SMP machines.
On the other hand, 30% is more or less the estimated gain from using GCC 3.4 vs. 3.3, so this test can be considered a useful data point in that direction. If performance is your objective, you can google for a perl script called cpuflags
and use its output as compiler flags, but beware: some of the flags it suggests can break perl's internal semantics.
If you do it, please report your success vs. failure.
Tried perlcc?

The two different kernels don't matter in this case, as they only indicate which kernel the perl was compiled on. Both interpreters actually run on the same kernel (2.4.26) now.

Also, I don't see any issue in comparing various perls compiled with various compilers. Basically, we tried to be as pragmatic as possible: given Perl X (where X is the parameter space), what can we do to make it faster?

It was mentioned here that disabling threads helps a lot (which I can confirm), but also that some Linux distributions, for example, ship perl with threads enabled by default.

I've recompiled Perl once again, this time also for the specific P4 architecture and with -mfpmath=sse. The result: the binary is about 200k smaller, and the following execution times: