News, June 2013
I've forward-ported these patches to gcc-4.4 and made the 64-bit instructions work
in gcc-4.3 and 4.4 but have not published the work. If interested, please get in touch.

Image found on the "Gears of War" site
during a web search for "Maverick 9312"

Preamble

I've been working on GCC-4 to make it generate working code for the
Cirrus Logic MaverickCrunch FPU, as found in their ARM-based
EP9302, EP9307, EP9312 and EP9315 chips, making floating point-intensive code
between 2.5 and 4 times faster.

What it does

The 20090908 version

performs single and double precision floating point
in the FPU (add, sub, mul, neg, abs, cmp and conversions
from single and double precision floats to integral types).

by default, disables the floating point cfnegs and cfnegd
instructions, which fail to convert 0 to -0 as they should.
You can re-enable them with the -funsafe-math-optimizations
flag, which is one of those enabled by -ffast-math
(gcc-4.3 has an even more specific -fno-signed-zeros flag,
which is one of those enabled by -funsafe-math-optimizations).

by default, does not respect denormalised values, so the smallest
representable values
are ±2^-126 for floats
and ±2^-1022 for doubles
instead of the usual
±2^-149 and
±2^-1074.

has a -mieee flag, which enables handling of denormalized values
by disabling all the buggy instructions.
With this, floating point addition,
subtraction, negation, absolute value and conversion between floats and
integer types are performed in software, leaving only floating point
multiplication and comparison performed in hardware.

has no negative impact on regular ARM code generation.

always works round the hardware bugs in the FPU and no longer has the
-mcirrus-fix-invalid-insns flag since
chip development has stopped and all existing silicon has the same bugs
except for the original revision D0 which is not supported.

passes GCC's IEEE testsuite except for the one specific test that checks
for correct handling of denormalized values. With -mieee it
passes all the math tests.

passes all other testsuites that I've tried (see below) including
the stringent "paranoia" floating point IEEE conformance test.

does not use the FPU's 64-bit integer instructions unless the new
-mcirrus-di flag is given. Programs that do a lot of 64-bit
integer operations (add, sub, mul, neg, abs, shifts) may be faster using
this, but rigorous testing will be necessary to ensure that bad code
is not being produced. OpenSSL's testsuite fails if this is enabled.
There is more detail at the head of the
arm-crunch-cirrus-di-flag.patch file.

Correctness tests

The compiler passes the following floating point-intensive test suites:

Speed tests

LAME encoding
a 30-second
stereo CD-quality file (actually two identical mono tracks) with
default options and the output written to /dev/null,
indicative of unspecialized mixed use of floating point and integer
code.

In other words, using the full Maverick instruction set, LAME is 2.5 times
faster than with softfloat. When using just the -mieee subset, it runs
25% faster than softfloat, or at about half the speed of the full set.
gcc-4.2 produces significantly faster code than gcc-4.3.

(*) Although crunch libgsm is 4 times faster than softfloat, libgsm also has a
fixed-point encoder, selected with MULHACK='', which is faster still
(the same is true of the speex encoder).

Using it

-fno-signed-zeros

(gcc-4.3 only) If your program does not care about the difference
between 0 and -0, you can use this flag to enable the Maverick 'negate'
instructions for a little extra speed.

-ffinite-math-only

This tells the compiler that NaNs and infinities do not need to be
handled, which allows further optimization.

-funsafe-math-optimizations

This enables even more optimizations that may give results not in
accordance with the strict IEEE-754 math standard. Among others,
it enables -fno-signed-zeros and in GCC-4.2 is the least
invasive way to enable the Crunch negate instructions.

-ffast-math

This is the most aggressive math optimization flag, enabling all of
the above and more.

-mieee

Most of Crunch's instructions take denormal values as zero;
this flag enables only the ones that work at full IEEE precision
(just multiply and compare).

-mcirrus-di

The FPU also has 64-bit integer instructions but they appear to be
buggy. This flag enables them (load, store, add, subtract, convert
to/from 32 bit and logical shifts by up to 31 places).

Building it from source

Resource requirements

GCC keeps on growing.
One of gcc-4.3's C source files, automatically generated during the build,
insn-recog.c, is now over 4 MB in size and gcc-4.3 requires 219MB
of virtual memory to compile it with normal optimization.

Memory: If you have less than 160MB of physical RAM plus 64MB of swap,
you will need to interrupt the build when it reaches that file, compile
that one file without optimisation by running make CFLAGS=-g, interrupt
that once the file has been compiled and then carry on with the normal build.

Disk space: The full sources unpack to 500MB (360MB for gcc-4.2)
and a further 200MB (140MB for gcc-4.2) are needed to build the C compiler.
If you have less space, you can fetch a "gcc-core" source tarball instead,
which only contains the C compiler and unpacks to about 200MB,
for a total of 400MB when built.

The tarball script dumps a .tar.gz
of the essential installed files and another of the source patchset in the
../packages directory.

There is also a test directory here with some program
fragments that I used to probe hardware bug presence and characteristics.

Patches for other packages

The patches for GCC work fine for all C software that I've tried.
Some other software packages are known to need Crunch tweaks as well:

binutils

C++: Some C++ files will not compile, saying ".save {mv8}" Error: register expected
although the same files will compile with optimization disabled.

glibc

C: Values held in Maverick registers are not restored when performing a
setjmp/longjmp pair.

C++: Similarly, exception unwinding (performing a throw
back to a catch block in a different function) does not
restore floating point and 64-bit values held in Maverick registers.

When libm is compiled with Maverick support, sin() goes into
an infinite loop on some values, as demonstrated by
this test program.

gdb

A user reports that "print test" (where "test" is a double) shows garbage
while "print *(double *)&test" works correctly. This may be because "test" is
in a register in the first case but in memory in the second and softfloat gdb
doesn't know about MaverickCrunch registers.

Thanks

Thanks to Hasjim Williams for the work that this is based on,
to SimpleMachines for funding the
initial work on these patches and for hosting the tarballs, and
to Arenque Software for encouraging me to complete them.