Thursday, January 22, 2009

Spent time reading about hardware performance counters while waiting for builds and test runs. Realised that I had built the RTS with -fvia-c before, so there are still a couple of NCG things to fix before we can do a full stage2 -fasm build.

The RTS code uses a machine op MO_WriteBarrier that compiles to a nop on i386, but uses the LWSYNC instruction on PPC. I still have to read about that and work out what to do for sparc.
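For my own reference, here is a rough sketch of what a write barrier is for, written with C11 atomics rather than anything from the RTS. The Slot type and the publish / try_read names are made up for the example, and the release fence only stands in for MO_WriteBarrier. It says nothing about which instruction the sparc NCG should emit.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical record, not a real RTS type: a payload plus a flag that
   tells readers the payload has been filled in. */
typedef struct {
    uint32_t   payload;
    atomic_int ready;
} Slot;

/* Writer: store the payload, then publish it. The release fence plays the
   role of the write barrier: the payload store must become visible before
   the store to the flag. */
void publish(Slot *s, uint32_t value) {
    s->payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->ready, 1, memory_order_relaxed);
}

/* Reader: returns 1 and fills *out if the slot has been published,
   otherwise returns 0 and leaves *out alone. */
int try_read(Slot *s, uint32_t *out) {
    if (!atomic_load_explicit(&s->ready, memory_order_relaxed)) {
        return 0;
    }
    atomic_thread_fence(memory_order_acquire);
    *out = s->payload;
    return 1;
}
```

On a machine that never reorders one store past another, the fence on the writer side needs no instruction at all, which is presumably why the i386 version is a nop.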

Also spent some time fighting a configure problem. The configure script can detect whether the assembler supports the -x flag, which tells it whether to export local symbol names. The configure script looks at the ld in the current path, but ghc calls gcc to do the assembly. That gcc will use whatever ld /it/ was configured with, and on mavericks the Solaris linker doesn't support -x. (epic sigh). In the same vein, the Solaris assembler doesn't support the .space directive. Fixed that and my builds are running again.

Started off a full run of the testsuite with the stage1 compiler + via-c rts, but that'll take all day and all night. num012 is failing with 16 bit integer arithmetic, but that's probably no big deal.

The hardware performance counters on the T2 are exercised with the cputrack util, which I spent some time learning about. You can get counts of instrs executed, branches taken, cache misses etc. There are lots of measurements available, but you can only collect two at a time - because there are only two PICs (performance instrumentation counters) on the T2.

So, it looks like we're getting about 2.3 clocks per instruction (cpi). About 30% of instructions executed are loads or stores. If the L1 data cache misses all come from those loads and stores, then about 6% of them miss. That might be wrong though - I'll have to work out whether the info tables are stored in the data or instr cache. In any event, another 16% of instructions are branches, of which 67% are taken.
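The ratios above are just divisions of raw counter values. A minimal sketch of the arithmetic, assuming the counts have already been collected. The struct and its field names are invented for the example and are not cputrack event names. Since only two counters can be read at a time, the counts would have to come from several runs of the same program.

```c
/* Raw event counts, gathered two at a time over several runs.
   Field names are hypothetical. */
typedef struct {
    unsigned long long cycles;
    unsigned long long instrs;
    unsigned long long loads_stores;
    unsigned long long dcache_misses;
    unsigned long long branches;
    unsigned long long branches_taken;
} Counts;

/* Safe division: returns 0 rather than dividing by zero. */
double ratio(unsigned long long num, unsigned long long den) {
    return den == 0 ? 0.0 : (double)num / (double)den;
}

/* Clocks per instruction. */
double clocks_per_instr(const Counts *c) {
    return ratio(c->cycles, c->instrs);
}

/* Fraction of executed instructions that are loads or stores. */
double mem_fraction(const Counts *c) {
    return ratio(c->loads_stores, c->instrs);
}

/* Miss rate, ASSUMING every L1 data cache miss is caused by a load or
   store. This is the assumption that may turn out to be wrong. */
double dcache_miss_rate(const Counts *c) {
    return ratio(c->dcache_misses, c->loads_stores);
}

/* Fraction of executed instructions that are branches. */
double branch_fraction(const Counts *c) {
    return ratio(c->branches, c->instrs);
}

/* Fraction of branches that are taken. */
double taken_fraction(const Counts *c) {
    return ratio(c->branches_taken, c->branches);
}
```

Feeding in made-up counts with the same proportions (say 2300 cycles, 1000 instructions, 300 loads/stores, 18 misses, 160 branches, 107 taken) gives back roughly the figures quoted above.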

I'll post some nice graphs and whatnot once my nofib run has gone through. Should probably add the performance counters to nofib-analyse as well, in the place of valgrind on sparc / solaris.

Wednesday, January 21, 2009

Implemented tabled switch, and fixed a problem when converting float to integer formats. The FSTOI instruction leaves its result in a float register, but in integer format. On at least V9 you then have to copy the value to mem and back to get it into an actual int register. Later architectures have instructions to do this without going via mem, but we're sticking with V9 for now. Perhaps we could get some speedup for FP code by using the later instruction set extensions (Vis2.0?)
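To make the float-to-int round trip concrete, here is a small C model of the data movement. It is only a sketch: the "float register" is modelled as 32 raw bits, the stack slot as a byte array, and the function name is made up. The real thing is a short instruction sequence emitted by the NCG, not C code.

```c
#include <stdint.h>
#include <string.h>

/* Models converting a float to an int when the conversion result lands in
   a float register and has to travel through memory to reach an integer
   register. */
int32_t float_to_int_via_memory(float x) {
    /* Step 1: convert. The result is an integer bit pattern, but it sits
       in a float register (modelled here as raw bits). */
    int32_t converted = (int32_t)x;
    uint32_t freg_bits;
    memcpy(&freg_bits, &converted, sizeof freg_bits);

    /* Step 2: store the float register to a scratch slot in memory. */
    unsigned char slot[sizeof freg_bits];
    memcpy(slot, &freg_bits, sizeof slot);

    /* Step 3: load the same bytes back into an integer register. */
    int32_t ireg;
    memcpy(&ireg, slot, sizeof ireg);
    return ireg;
}
```

The extra store and load in steps 2 and 3 are the cost that a direct register-to-register move would avoid.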

The genCCall problem was that calls to out-of-line float ops like sin and exp were disabled for 32 bit floats, maybe because other things were broken before.

Also fixed a 64 bit FFI problem. Closures allocated by the storage manager are 32bit aligned, but the RTS was trying to do misaligned read/writes of 64bit words. The standard SPARC load/store instructions don't support misaligned read/writes, so had to break them up into parts.
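The idea of the fix, sketched in C rather than in the actual RTS code: a 64 bit word at an address that is only 32 bit aligned is read and written as two aligned 32 bit halves. The function names are made up, and the high-half-first order is an assumption matching a big-endian machine like SPARC.

```c
#include <stdint.h>

/* Read a 64 bit value from a 32 bit aligned address using two aligned
   32 bit loads. Assumes big-endian word order: high half first. */
uint64_t read_u64_from_halves(const uint32_t *p) {
    uint64_t hi = p[0];
    uint64_t lo = p[1];
    return (hi << 32) | lo;
}

/* Write a 64 bit value to a 32 bit aligned address using two aligned
   32 bit stores, in the same high-half-first order. */
void write_u64_as_halves(uint32_t *p, uint64_t v) {
    p[0] = (uint32_t)(v >> 32);
    p[1] = (uint32_t)(v & 0xFFFFFFFFu);
}
```

Each access is now a plain aligned 32 bit load or store, at the cost of two instructions plus a shift and an or per 64 bit read.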

Looking good. A few tests still fail, but they fail the same way with all ways. I'm guessing that they are problems with Solaris or other environment stuff, and not the NCG.

I tried to build the stage2 compiler with -fasm, but I made the foolish mistake of pulling from the head beforehand, which broke the build. Did a full distclean, but will have to leave it overnight.

OVERALL SUMMARY for test run started at Wednesday, 21 January 2009 9:05:20 AM EST
    2283 total tests, which gave rise to
    8531 test cases, of which
       0 caused framework failures
    7429 were skipped

Looking into arith004. The code to generate integer remainder / divide instructions was missing. On further investigation, old SPARC implementations didn't have hardware support for this. GHC used to call out to a library. The SPARC T2 has hardware divide, but you have to compute remainders using div/mul/sub. Added code to do so. Not sure if we still want to maintain the software mul/div path - but I'll worry about that when the rest is fixed and refactored. Also fixed code to generate 64 bit operations on 32 bit SPARC, which was the isel64Expr problem.
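The remainder computation is the usual identity x rem y = x - (x / y) * y. A minimal C sketch of it, with a made-up function name. It assumes y is non-zero and that the division does not overflow, and it relies on C's integer division truncating toward zero.

```c
#include <stdint.h>

/* Remainder from divide, multiply and subtract, for hardware that has a
   divide instruction but no remainder instruction.
   Assumes y != 0 and that x / y does not overflow. */
int32_t rem_via_div_mul_sub(int32_t x, int32_t y) {
    int32_t q = x / y;   /* divide: quotient truncated toward zero */
    int32_t p = q * y;   /* multiply the quotient back up */
    return x - p;        /* subtract: what is left over is the remainder */
}
```

With truncating division the result takes the sign of the dividend, so rem_via_div_mul_sub(-17, 5) is -2.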