SSE2 Instructions

For the past several weeks I have been studying X86 assembly language, mainly because I wanted to bring my knowledge of assembly up to date with the latest CPU technology. I had previously taken an X86 assembly language course at NCSU roughly a year ago, but the course covered only the 8086 instruction set and used MASM version 6.0 as the assembler, which is only good for writing MS-DOS applications. I wanted to at least learn how to do floating-point calculations in assembly, and to do it in GNU assembly so that my apps would run on Linux.

There are quite a few extensions to the core X86 instructions, such as FPU, MMX, SSE, and SSE2. The FPU takes care of ordinary floating-point calculations (it has been built into the CPU itself since the 80486), MMX performs multiple integer calculations in a single instruction, SSE does the same for multiple single-precision calculations, and SSE2 for multiple double-precision calculations (again, in a single instruction). Since software these days, and OO.o in particular, seems to do almost all of its floating-point calculations in double precision, I decided to give SSE2 a little benchmark test.
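To make the "multiple double-precision calculations per instruction" idea concrete, here is a minimal sketch using the SSE2 compiler intrinsics from emmintrin.h rather than hand-written assembly. The function name add_arrays_sse2 is made up for illustration and is not part of my benchmark; it assumes an x86 target with SSE2 enabled (the default on x86-64, -msse2 on 32-bit gcc).

```c
#include <emmintrin.h>  /* SSE2 intrinsics; requires an x86 CPU with SSE2 */
#include <stddef.h>

/* Hypothetical helper for illustration: out[i] = a[i] + b[i]. */
void add_arrays_sse2(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;

    /* Each pass adds two doubles at once; _mm_add_pd maps to ADDPD. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  /* unaligned load of a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(b + i);  /* unaligned load of b[i], b[i+1] */
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }

    /* Scalar tail for an odd element count. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

A 128-bit XMM register holds two doubles, so the main loop needs roughly half as many add instructions as a plain scalar loop. Whether that translates into a real speedup depends on the CPU and on memory bandwidth.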

Here is how I did it. I wrote some simple mathematical routines in C and compiled them normally with gcc at -O1 optimization. Then I had gcc generate the assembly code for a routine, cleaned it up a bit, replaced several instructions with SSE2 instructions, and reassembled it into an executable to run the benchmark.
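As a rough sketch of that workflow, assume a routine like the one below. The routine itself (dot_product) and the file names routine.c and main.c are hypothetical stand-ins, not the code I actually benchmarked; only the executable names test_c and test_sse match the ones used in this post. The gcc options shown (-O1 and -S) are standard.

```c
/* routine.c: a hypothetical stand-in for a "simple mathematical routine". */
#include <stddef.h>

/* Sum of element-wise products of two double arrays. */
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/*
 * Possible build steps (main.c is an assumed driver file):
 *
 *   gcc -O1 -S routine.c -o routine.s      # emit assembly instead of object code
 *   (hand-edit routine.s, swapping in SSE2 instructions)
 *   gcc -O1 main.c routine.c -o test_c     # plain C version
 *   gcc -O1 main.c routine.s -o test_sse   # gcc assembles the edited .s file
 */
```

Starting from the compiler's own output means the calling convention and stack handling are already correct, so only the inner arithmetic has to be rewritten by hand.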

The benchmark calls both the C version and the SSE2 assembly version to compare performance. The executables with the original C version and the SSE2 version are named test_c and test_sse, respectively. Here is the result (on my machine):

Indeed, the SSE2 version seems to perform better! I also compared my SSE2 version against the C code compiled with -O3, but for this particular algorithm the -O3 build was not much different from the -O1 build. But of course, YMMV.
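For anyone who wants to try a similar comparison, a timing harness could look something like the sketch below. It is only an illustration, not the driver I used: time_routine, routine_fn, and sample_routine are hypothetical names, and clock() measures CPU time at fairly coarse resolution, so the repeat count has to be large enough to give a measurable interval.

```c
#include <stddef.h>
#include <time.h>

/* Signature shared by the routines under test (an assumption of this sketch). */
typedef double (*routine_fn)(const double *, const double *, size_t);

/* Trivial stand-in routine so the harness has something to call. */
double sample_routine(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/*
 * Calls fn `repeats` times and returns the elapsed CPU time in seconds.
 * The accumulated result is written to *total (if non-NULL) so the
 * compiler cannot discard the calls as dead code.
 */
double time_routine(routine_fn fn, const double *a, const double *b,
                    size_t n, int repeats, double *total)
{
    double acc = 0.0;
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        acc += fn(a, b, n);
    clock_t end = clock();

    if (total)
        *total = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Linking the same harness once against the C routine and once against the hand-edited assembly gives two executables whose timings can be compared directly. Running each several times helps smooth out noise from other processes.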

Does this mean we should start writing assembly code for performance optimization? Probably not. I still think it’s much better to write a better algorithm in the first place than to re-write code in assembly language, because re-writing in assembly is by itself no guarantee of better performance. For serious performance work, however, knowing what kind of assembly code the compiler generates from a given piece of C/C++ code under various optimization flags is undoubtedly useful. You never know: in a few extreme cases it may even make sense to write parts of the application code in assembly, even if that means having to write that part in assembly for every platform the app needs to support. OO.o has some parts of the UNO bridge written in assembly, and I’ve seen some assembly code in the Firefox codebase as well.

Oh, by the way, for anyone looking for a good study guide on GNU assembly for the X86 family of chips, “Professional Assembly Language” by Richard Blum (Wiley Publishing) is a pretty good place to start.