To FPU or not to FPU

When I work on MUIMapparium usually I only work in Linux and test on AROS Linux-hosted. which is very convenient and fast. When starting the MUIMapparium I also tested at the end on every platform if it works and how is the speed. For the last two Releases I skipped this part, due to lack of time.

But yesterday I tried MUIMapparium on my Amiga 600 with Vampire and was shocked how slow it behave. The map moving is just not usable around a second reaction time. I downloaded/compiled older versions to check when this problem appeared. Deep in the back of my brain I guessed already that the fixed position calculation could be the reason (see here). Thats pure floating point calculation and a lot of them. I tested that on the initial implementation and it seemed not too slow, because for simple map moving and zoom only very two-three times this conversation have to be done, so the influence is not very big.

So why it’s now so slow? The difference is that before I tested with a bare MUIMapparium without any marker or tracks loaded. Marker only add a single conversation to the list. But Tracks need a conversation for every recorded (and maybe drawn) point. Remember the most GPS devices measure the position once per second, that means for an hour walk you get something about 3600 points (usually the GPS already strip them from “not moved” points, nevertheless you get around 1000 points). For NG Amigas with their massive computing power especially on the FPU side, this is not much of a problem, 1000 fpu calculation with some hundreds MFlops are just done some milliseconds.
But on Vampire it’s a different story, no FPU, it has to use the softFPU emulation of FreePascal. This raised the question: How fast is the softFPU emulation on a Vampire in comparison to a real 68060 / 50Mhz FPU. The Vampire integer performance is much higher than the 68060 (around twice as fast, see here) but emulated FPU, there is a lot of code needed to emulate that correctly.
I used two tests for that, a simple Mandelbrot algorithm, in single and double precision and the well known SciMark from NIST. Compiled with either with FPC SoftFPU emulation or the 68881 FPU support.

Mandelbrot results (Runtimes, shorter is better)

Test

68060/50 MHz FPU

68060/50Mhz SoftFPU

Vampire SoftFPU

Mandelbrot single precision

0.12 s

9.53 s

3.81 s

Mandelbrot double precision

0.15 s

23.72 s

13.37 s

When comparing the SoftFPU times of 060 and Vampire you can see the 2-3 times I experienced before already. But the (often called “very slow”) 68060 FPU leaves the SoftFPU Vampire in the dust far behind it. (In fact the dust is already settled down again, before the SoftFPU finished the calculation). Of course the errorbars for the calculations with FPU are huge, the time is too short for a reliable time measurement, but a bigger calculation just would need ages with SoftFPU 😉 and the trend is nicely visible.

Next is the SciMark, it uses various real life floating point calculation, like FFT, matrix multiplication, monte carlo simulation, if you work in science you know that stuff, if not just believe me that is what science programs do all day 😉

So it just shows the same trend. Attention: do not compare this MFlops with the theoretically MFlops most speedtests show you (like sysinfo), you can see, how different the tests behave. It depends very strong on which commands are used and how much memory bandwidth is needed.

In conclusion it shows really nicely why the MUIMapparium with a track on a Vampire is so slow currently, because of the slow SoftFPU. Very sad that the Vampire still lacks a proper FPU support. We (ChainQ and me) believe that it is possible to optimize the SoftFPU performance maybe 50% faster or even double, or lets aim for the stars.. 10 times faster than now (I do not believe that is even close to possible at all). It would still be around 5 times slower than a 68060/50 Mhz FPU, for the people believing a SoftFPU implementation could be a replacement for a native FPU in the FPGA.

That means, if it reacts very slowly on Vampire, just remove the track. 😉 I will work on this, reduce the needed calculations, (by using more memory), see at which places I could possibly go down to single precision (not much hope there ;-)) and of course reduce the number of points, in principle a LOD on the Zoomlevel.