Xordan's blog

Category: Optimisation Framework

2007-06-20

Delays

It's been a busy two weeks since my last entry here. With moving out of uni accom and back home, the VM holding half my SoC stuff failing for whatever reason, and other random things keeping me busy, I've had little coding time to get anything worth committing done. However, I've had plenty of thinking time and progress has been made. Right now, I'm updating my branch from trunk and installing MinGW here so I can get some more testing done. I'm going to quickly lay out my plan and what progress I've made in each area:

It seems that -msse and friends are required to use SIMD instructions on gcc. This is a bit of an inconvenience, but not a serious issue. The way I've decided to handle the problem is to force users (that's you) to put their SIMD code in cpp files separate from the regular C++ code, as suggested by the gcc docs. All SIMD code will then have to be compiled with those compiler flags. What I need to do is make this as painless as possible, so I'm going to run configure checks (AX_CHECK_COMPILER_FLAGS()) to see if the flags are supported by the compiler, then save the results in COMPILER.CFLAGS.SIMD or something of the kind and/or COMPILER.HAS.SSE = "yes/no" etc. Next, I'll add something which allows me to specify in a Jamfile the compiler flags (the ones I saved) to be applied to a specific cpp. I think it might be an idea to put SIMD code in a subdir of the main folder. That way we know that every *.cpp in there is SIMD code, so we don't need to worry about specific files and can apply the flags to everything (so to the whole Jamfile). Of course, we also need to be able to #define out any code which isn't supported by the compiler (or do this in the Jamfile). Any suggestions on how to refine this idea are welcome of course!
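As a rough illustration of the "#define out" part, here is a minimal sketch of what one of those separate SIMD cpp files might contain. It relies on gcc defining __SSE__ when the file is built with -msse; the function name AddFloats4 is made up for this example and is not part of CS.

```cpp
#include <cassert>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics; gcc defines __SSE__ when -msse is active
#endif

// Hypothetical helper: adds four floats element-wise.
// Returns true if the SSE path was compiled in, false for the plain C++ path.
bool AddFloats4(const float* a, const float* b, float* out)
{
    assert(a && b && out);
#if defined(__SSE__)
    // Unaligned loads/stores, so callers need no special alignment.
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
    return true;
#else
    // Compiler (or flags) without SSE: the intrinsic code above is compiled out.
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
    return false;
#endif
}
```

Built without -msse on 32-bit x86, only the loop is compiled; built with it, only the intrinsic version is. Either way the result in out is the same.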

The next area I'm working on is that fairly important code path selector. I've decided on using a function which takes the functions, their arguments and their types, and selects the correct route to take. It looks like this:

This works quite nicely, but it does have some limitations which I'm working towards removing. Right now, it only supports one SIMD path and a C++ fallback. It needs to be able to take several possible SIMD paths and a fallback (MMX, SSE3, AltiVec, C++ for example).

I chose this method mainly for its lack of overhead and its simplicity. The capability detection itself is only run once (the results are cached), most of the functions internal to the check are inlined so there are few function calls, and only one check of the cached result is made per SIMD/C++ function call. A small benchmark I ran showed an overhead which was too small to be measured (<1us). The SIMD code itself ran 8x faster than the C++, which is a good sign. :) I'll probably commit that test as part of the simdtest app.
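To make the idea concrete, here is a minimal sketch of a selector with one SIMD path, a C++ fallback and a cached check. It is not the function from my branch: every name in it (DetectSIMD, HasSIMD, SelectAndCall, SumCpp, SumSimdStandIn) is hypothetical, the detection is a placeholder, and it uses variadic templates for brevity.

```cpp
#include <utility>

// Counts detection runs, only so the caching can be observed in this sketch.
static int g_detectCalls = 0;

// Placeholder detection (assumption): a real version would query the CPU and OS.
static bool DetectSIMD()
{
    ++g_detectCalls;
    return true;
}

// Cached capability check: DetectSIMD() runs on the first call only.
static bool HasSIMD()
{
    static const bool cached = DetectSIMD();
    return cached;
}

// Calls simdFn when SIMD is available, otherwise cppFn.
// Both functions must share one signature; args are forwarded unchanged.
template <typename R, typename... FnArgs, typename... Args>
R SelectAndCall(R (*simdFn)(FnArgs...), R (*cppFn)(FnArgs...), Args&&... args)
{
    if (HasSIMD())
        return simdFn(std::forward<Args>(args)...);
    return cppFn(std::forward<Args>(args)...);
}

// Plain C++ fallback.
static float SumCpp(const float* v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += v[i];
    return s;
}

// Stand-in for a hand-written SIMD version (plain C++ here so it builds anywhere).
static float SumSimdStandIn(const float* v, int n)
{
    return SumCpp(v, n);
}
```

A call then looks like SelectAndCall(SumSimdStandIn, SumCpp, data, 4). However many times it is called, the detection runs once and each call costs a single branch on the cached result.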

Also, I've started to define some CS types to be used for SIMD work. AltiVec and SSE use different methods of declaring the __m128 (SSE :)) type, but the CS version needs to be as platform independent as possible so the user can just 'use' it and not worry about maintaining compatibility. I'll add more details on this when I've written them. :)

Finally (as far as I can think), more testing is needed. As I said earlier, I'm installing MinGW to see if my code compiles fine there. Hopefully I won't run into many problems.

If anyone could point out how to have arrow brackets without this thing spitting errors at me, that'd be cool :P

2007-06-07

Changes and a problematic problem.

It's been a while since my last entry, so I'll quickly update on what I've done.

Right now, basic runtime detection for Windows, x86 Linux and PPC is done. I've changed the original plan for that quite significantly. I now have a base class with the inline bool HasMMX() type functions and the bool hasMMX; type vars. That class is a template, so I can pass the correct platform specific class to it when creating an instance. I then use another class as an access point for the outside world, which has its own Has*() functions (each calling the equivalent in the base class).

When a check for one instruction set is requested, checks for all of them are done and a bitmask is returned. The result for the requested instruction set is then read from that bitmask.
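A minimal sketch of that layout, under a few assumptions: all names here (InstructionBits, FakeX86Detector, DetectionBase, ProcessorCaps) are invented for illustration, the platform class returns a hard-coded mask instead of doing real detection, and the results are kept in the mask itself rather than in separate bool members.

```cpp
// Bit flags, one per instruction set (hypothetical names).
enum InstructionBits
{
    BIT_MMX     = 1u << 0,
    BIT_SSE     = 1u << 1,
    BIT_SSE2    = 1u << 2,
    BIT_ALTIVEC = 1u << 3
};

// Stand-in platform class. A real one would use cpuid, an OS call, etc.
struct FakeX86Detector
{
    static int calls;  // counts full detection runs, for illustration only
    unsigned CheckAll() { ++calls; return BIT_MMX | BIT_SSE; }
};
int FakeX86Detector::calls = 0;

// Base class, templated on the platform specific detector.
template <class Platform>
class DetectionBase
{
public:
    DetectionBase() : checked(false), mask(0) {}
    bool HasMMX()     { return Has(BIT_MMX); }
    bool HasSSE()     { return Has(BIT_SSE); }
    bool HasSSE2()    { return Has(BIT_SSE2); }
    bool HasAltiVec() { return Has(BIT_ALTIVEC); }

private:
    // The first query runs every check at once; later queries reuse the mask.
    bool Has(unsigned bit)
    {
        if (!checked)
        {
            mask = platform.CheckAll();
            checked = true;
        }
        return (mask & bit) != 0;
    }

    Platform platform;
    bool checked;
    unsigned mask;
};

// Access point for the outside world; forwards to the base class.
class ProcessorCaps
{
public:
    bool HasMMX()  { return impl.HasMMX(); }
    bool HasSSE()  { return impl.HasSSE(); }
    bool HasSSE2() { return impl.HasSSE2(); }

private:
    DetectionBase<FakeX86Detector> impl;
};
```

Adding a new check in this sketch means one new bit, one new Has*() function and a change to the platform classes that can detect it.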

I think this is quite a nice solution. It allows us to easily add new checks in the future.

While writing some configure checks for xmmintrin.h and __m128 I ran into a problematic problem. GCC requires -msse to be enabled before I can access the builtin intrinsic functions. However, -msse also tells the compiler that it may use SSE instructions in the code it generates itself :) To quote from the GCC manual:

"These options will enable GCC to use these extended instructions in generated code, even without -mfpmath=sse. Applications which perform runtime CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options."

To me, this is not a great option. I'm not sure why the GCC devs decided to force compiler optimizations upon us if we want to use intrinsics at all, but that's the way it is... maybe. I'm going to experiment with defining whatever the xmmintrin.h header requires to be defined; maybe that will work. If not, then we'll have to try what the manual suggests and compile each file which uses intrinsics with the required flags. The third option is to say "screw this" and write my own versions of the intrinsics using asm. I'd still need to use the builtin stuff for x86_64, but that's okay because -msse and crew are enabled by default on that platform. My hope is that I can trick the headers into thinking all is good without giving the compiler an 'okay' to optimize.

Once a solution for this is done, I need to work out a code path for using these optimizations. Right now I'm favouring either using templates along with my own functions, or having a function like blah(SIMDcode, C++Code, arg1, arg2, argn); but I haven't decided. Obviously I need to keep the overhead and code duplication down to a minimum. More on this later.

2007-05-30

Overview of project

Hey, my name is Mike Gist (aka Xordan). Currently I'm a first year student studying Computing at Imperial College London. I'll be working on the optimisation framework project for the next few months and I'll be keeping note of my progress here. For now, I'll explain a bit about what I'll be doing.

At the moment there are various optimisations that could be done using SIMD instructions, but there is no way to properly detect support and use the correct code path. Obviously an Athlon XP won't be able to use SSE3 instructions, and a PPC processor will be able to use AltiVec only... assuming it supports it. My job is to add runtime and compile time detection of the supported instruction sets (both for the processor and the OS), add a method for the correct code path to be selected and used, and then to make use of this in various places in the existing CS code.

Within the scope of this project I'll be concentrating on MMX, SSE, SSE2, SSE3 and AltiVec. Later I will expand and include SSSE3 and SSE4, but those aren't priorities for now.

Currently there is a class called csProcessorCapability which contains some MMX detection code. I plan to rewrite all of this, but keep the name :) There are different ways of detecting supported instructions (asm or inbuilt compiler functions). With VC on Windows we can call the IsProcessorFeaturePresent() function, which makes detection very simple. For gcc it's a bit harder: we will need to use cpuid to get the info. We will know the minimum instruction set supported by an architecture so we can short cut some checks as well (amd64 processors all support everything from MMX through to SSE2, for example).
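For the gcc side, a minimal sketch of a cpuid query might look like the following. It assumes a gcc (or compatible compiler) recent enough to ship the cpuid.h helper header with __get_cpuid; older compilers would need inline asm instead. The struct and function names are hypothetical, and this only reports what the CPU claims, not whether the OS supports it.

```cpp
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>  // gcc helper header providing __get_cpuid
#define SKETCH_HAVE_CPUID 1
#endif

// Hypothetical result struct for this sketch.
struct X86Features
{
    bool mmx, sse, sse2, sse3;
};

// Queries CPUID leaf 1 and decodes the standard feature bits.
// On non-x86 targets (or without cpuid.h) everything stays false.
X86Features QueryX86Features()
{
    X86Features f = { false, false, false, false };
#ifdef SKETCH_HAVE_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    {
        f.mmx  = ((edx >> 23) & 1u) != 0;  // EDX bit 23: MMX
        f.sse  = ((edx >> 25) & 1u) != 0;  // EDX bit 25: SSE
        f.sse2 = ((edx >> 26) & 1u) != 0;  // EDX bit 26: SSE2
        f.sse3 = ((ecx >> 0)  & 1u) != 0;  // ECX bit 0:  SSE3
    }
#endif
    return f;
}
```

On an amd64 build the MMX, SSE and SSE2 answers are known in advance, so those checks could be skipped as described above and only the SSE3 bit actually read.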

The CheckSupportedInstructions function is run once by the Initialise() function, and contains all the detection code. If it's already known that an instruction is supported or isn't supported then this will be passed to that function and it won't be checked. I'll go into more detail as I progress.

Compile time checks will be for __m128 support (checking for xmmintrin.h), so we know if the compiler actually supports this stuff :) No compiler support, no extra cpu juice. Some other things will be used, like target arch, which will allow us to rule out some instruction sets. I'll add an entry just on this when I get to it. First thing is getting that class written. I'm giving myself two weeks to finish the detection, then I'll move on to making use of it. I'm hoping to get a good chunk of that done by the mid-term evaluation, leaving me about a month to finish it and optimise some areas (to be chosen) of code. I'll detail that a bit more very soon.