The code is based on method 4c from an excellent post by Fabian Giesen on AABB culling: View frustum culling. If you haven't read it yet, check it out now. He presents several different ways of performing the same test, each of which has its own advantages and disadvantages.

There's nothing special about the above code, except from the fact that it breaks out of the inner loop as soon as the box is classified as completely outside of one of the planes. No need to check the others.But we can do better. First thing, precalculate the absolute of the plane normals once, outside of the AABB loop. No need to recalculate the same thing over and over. We can also replace the floating point comparisons with bitwise ANDs. Lets take a look at the code.

Again, since we are processing one AABB at a time, we can break out of the inner loop as soon as the AABB is found to be completely outside one of the 6 planes.

The SSE code should be straightforward. All arithmetic operations are scalar and we use _mm_movemask_ps in order to extract the signs of (d + r) and (d - r). Depending on the signs, we classify the AABB as completely outside or intersecting the frustum. Even though we are testing one AABB at a time, the code is faster than the optimized C++ implementation.

The "problem" with the above snippet is that all SSE operations are scalar. We can do better by swizzling the data from 4 AABBs in such a way that it will be possible to calculate 4 (d + r) and 4 (d - r) simultaneously. Let's try that.

Compared to the original (unoptimized) C++ version, it's 4x faster. But that's unfair. Compared to the scalar SSE version it's about 2.5x faster. Not bad, don't you think? Can it get any better? Probably yes. The problem with the 2 SSE versions is the lack of enough XMM registers to keep all the AABB data in them and avoid using the stack for storing intermediate results. Unfortunately, we need 6 registers for the 4 Center/Extent pairs, and the rest (2) aren't enough for the inner loop. Compiling under 64-bits should give better results because of that.

Another way to optimize this algorithm further is to have the data already laid out the way the code expects them (in SoA form). This will get rid of the shuffles (the memory loads will still be 6).

Finally, the code used to calculate the state of each AABB from the two bitfields can change. One other way of doing it is the following:

This way, we can get rid of the variable shifts (all shifts and ANDs are compile-time constants). The difference from the previous version is that the intersection state is 3 instead of 2. 2 isn't a valid state, that's why we AND the intersection_flag with the in_out_flag. 2 means that the box is outside and intersecting the frustum, which is an invalid case.

Unfortunately, in practice this doesn't make a big difference in performance. So, it's a matter of taste.

Results

Below is a table with more performance data from all the above snippets. Two cases have been tested. For both of them, the frustum is the [0, 1]^3 box. The first one is a random case, where all the boxes are randomly placed in the [-1.0, 2.0] box, and have random sizes in [0.1, 0.2] range. The other case is the worst case, where all boxes are inside the frustum.

Method

Random (32)

Worst (32)

Random (1024) *

Worst (1024)

C++ (ref)

105.2 (14.0)

181.5 (21.0)

102.5 (12.0)

159.3 (20.3)

C++ (opt)

96.1 (10.3)

138.0 (17.2)

84.9 (12.3)

119.8 (16.3)

SSE_1

73.7 (9.8)

93.2 (13.8)

63.9 (10.8)

72.8 (9.8)

SSE_4

26.5 (4.0)

27.6 (4.3)

24.1 (4.2)

22.9 (4.0)

Table: Performance comparison between the 4 snippets. 2 batch sizes, 32 and 1024 AABBs. The column marked with * contains the data shown in the post.

That's all folks. Thanks for reading. Any corrections/suggestions are welcome.

Changes

2013-09-06: Changed C to C++ because the code uses references. Thanks to @zdlr for pointing that out.2013-09-10: Removed underscore from structure names (comment by @NightCreature83). Also moved definition of absFrustumPlaneMask outside of the two SSE functions (see comments by @Matias Goldberg for details).

About the Author(s)

I'm a freelance programmer, currently working on Android, web and desktop apps. I find low level programming and optimizations in general, very interesting subjects, so I try to spent most of my free time on them.

Could you please choose better variable names than 'd' and 'r'? You then exacerbate the lack of clarity by aliasing d + r to 'd_p_r' and d - r to 'd_m_r' - not forgetting the 'd' field on the _Plane struct.

Would distanceFromCenterToPlaneNormal and distanceFromSizeToAbsolutePlaneNormal not be more explicative? Example code like this benefits hugely from such verbosity.

@Servant: Bacterius is right. The numbers are "cycles per AABB". Culling a batch of 1024 AABBs in a single loop ends up being faster than 32 probably because the function overhead (e.g. calculating the abs of the plane normals) is minimal compared to the main loop.

Example: Assume that the initial loop which calculates the abs of the plane normals requires 200 cycles. Also, assume that each AABB requires 100 cycles. For a batch of 10 AABBs, the function would require 1200 cycles to complete. Or in other words, 120 cycles per AABB. If the batch had 1000 AABBs, the function would require 100200 cycles, or 100.2 cycles per AABB. Hope that makes sense.

@zdlr: Variable names were kept like that in order to match the examples in Fabian Giesen's article which I linked above. Also, the term 'reference implementation' doesn't mean that the code uses references. It means that, that specific snippet is used as a baseline for performance comparisons (if this was what you meant).

On your 4 at a time SSE2 code, you're using movemask too often and switching to GPR registers to perform the bitwise arithmetic.
I suggest you keep everything in XMM registers (actually, avoid the switch of registers) and use cmpeqps familiy of instructions together with andps & orps instructions; then use movemask to get the final mask outside the loop

@zdlr: Of course you are right. I forgot about those two. I'll edit the article to read "C++" instead of "C".

@Matias: Thank you very much for the tips. I'll try to find some time to do the changes you suggest and edit the article. I'll also try to make a little VS project to attach to the article at the same time.

About the two dot products. Do you mean method 5 on that article? In this case, the resulting code wouldn't be able to distinguish between fully inside and intersecting states which is a requirement for this article.

I know, it might sound bad trying to optimize something and add arbitrary restrictions which might affect performance, but I think having the ability to distinguish between those two cases might help in the case of an hierarchy. E.g. parent is completely inside, so there's no need to check any of its children.

Btw. a macro is often used to add the restrict keyword since every compiler has it's own way of declaring it (i.e. gcc vs msvc vs clang)

As for the two dot products, yes I'm talking about method 5. I wrongly saw in your code your routine was returning outside and inside and didn't see the "intersect". My apologies.

I wonder though if real life scenarios would just perform better using method 5 (brute force over smartness).

Also, it may be worth noting in the article that your routine is 100% re-entrant, meaning it's very easy to execute in parallel; except for your "staticconstunsignedint absPlaneMask[4]" part. That variable should be declared outside to make it re-entrant.

Being very strict, the snippet "_mm_load_ps((float*)&absPlaneMask[0])" is not standard compliant because it breaks the strict aliasing rule (casting an unsigned int* to a float*). It's a PITA coming from the intrinsics, since movaps can load integers just fine.

The "correct" (standard compliant) thing to do would be either to declare absPlaneMask as unsigned char* and then cast to float*

Another option is to rely on SSE2, and declare the variable as __m128 absPlaneMask( _mm_castsi128_ps( _mm_set1_epi32( 0x7FFFFFFF ) ) );

Absurd, but well, standard compliant (provided the compiler supports the SSE intrinsic extensions, which by definition aren't standard..., but you never know in which compiler your code will be used, and if it won't be ported to a different set of intrinsics)

Btw you shouldnt use _<Captial char> identifiers for C/C++ code as these are reserved as global names, this is one of the reasons why STL code looks so horrible. I am assuming here that this code should also compile for a C++ compiler.

17.4.3.2.1 Global names [lib.global.names]

Certain sets of names and function signatures are always reserved to the implementation:

Each name that contains a double underscore (_ _) or begins with an underscore followed by an uppercase letter (2.11) is reserved to the implementation for any use.

Each name that begins with an underscore is reserved to the implementation for use as a name in the global namespace.165

165) Such names are also reserved in namespace ::std (17.4.3.1).

Also defining a few macros for allignment and typedefing the intrinsics makes the code easier to read in the end.

So, if two or more threads try to enter a function where there is a static variable (declared at that scope) for the first time, there's a race condition which can potentially leave the variable in an inconsistent state.

Once the function has been called once, this danger is no longer present and can be called from multiple threads safely.

In other words, if we look at the generated assembly, it's the variable is not really read-only. To ensure this doesn't happen we need to declare it somewhere else.

As for the re-entrancy, I didn't say it means it's thread safe, but rather makes it very easy to parallelize.

In this case, caller just needs to be sure it only gives the function access to a portion of the data, different to every thread, which is very straightforward to implement.

Perhaps the biggest thing to worry about is that the amount of data to be sent to each thread needs to be multiple of 4 (since the last snippet SSE is operating with 4 objects at a time)

Everyone, thanks for the constructive comments. I've updated the article based on some of your suggestions.

@Matias Goldberg: Unfortunately, changing _mm_load_ps((float*)&absPlaneMask[0]) to be standard compliant as you suggested, as well as adding the __restrict keyword, requires a rerun of the benchmarks, because otherwise the numbers won't be accurate. I'll keep a note and hope to be able to do it sometime soon.

One small note about your last comment. The 4 AABBs at a time SSE version has a loop at the end which should handle the case where the number of AABBs isn't a multiple of 4. The code isn't shown for clarity (it should actually be the same as the 1 AABB at a time version).