Abstract

Numerous and complex mathematical operations are the cornerstone of 2D and 3D game development and execution, and these operations frequently require many vector and matrix transformations with resulting high computation demands. In this article, we will look at matrix multiplication and use "The Last Defender," an Android*-based 3D shooting game popular in the PRC, as a practical case to describe how to use the SIMD technologies available on Intel® Architecture (IA), Intel® Streaming SIMD Extensions (Intel® SSE) and Supplemental Streaming SIMD Extensions (SSSE3), to speed up mathematical operations and improve game performance. At the end of the article, some common optimization solutions for IA-based code are mentioned.

Background

Simply put, Single Instruction Multiple Data, or SIMD, performs multiple data computations in one instruction; it is a form of parallel computing at the instruction level. With its ability to fully exploit parallel computing, Intel® SSE is a natural match for vector and matrix operations in game code, because each element in a vector or matrix can usually be operated on independently (developers must check that elements do not interfere with each other, which occasionally happens). With traditional serial computing, where each element takes one instruction cycle, a 4x4 matrix, for example, would take 16 cycles to process all elements. But if we use SIMD technology and assume an implementation that can handle 4 data operations per instruction in parallel, then 4 element operations finish simultaneously in one instruction cycle, and all 16 element operations take only 4 instruction cycles!

Intel SSE instructions are widely used in SIMD implementations on IA; they are especially common in game engines. Intel SSE includes many extensions (SSE, SSE2, SSE3, and SSE4.x) that support many kinds of integer and floating-point operations, as well as highly efficient memory access. In this article, we refer to them all as "Intel SSE" for simplicity, and we will show how to use them in the matrix multiplication operation.

There are two kinds of matrix storage models in memory:

1. In Column-Major Order (Figure 1), all elements of the matrix are put in column order one by one in memory. If memory is accessed by contiguous addressing, the matrix elements will be accessed column by column. As we know, OpenGL* uses this method to store matrices in memory.

Figure 1. Column-Major Order

2. In Row-Major Order (Figure 2), all elements of a matrix are put in row order, one by one in memory. If memory is accessed by contiguous addressing, the matrix elements will be accessed row by row. “The Last Defender” game uses this method as its high-level matrix storage model.

Figure 2. Row-Major Order

Because this article focuses on "The Last Defender" (Row-Major order) as the use case for game development on Android, and OpenGL ES* (Column-Major order) is the only low-level graphics hardware acceleration API on Android, we will cover both kinds of storage models and their respective optimizations.

In the following discussion, OpenGL ES and "The Last Defender" use the same matrix multiplication (vector transformation) order. Both use "matrix premultiplication," as this equation (V is a vector, M is a matrix) shows:

Vo = Mn x ... x M2 x M1 x Vi

Please refer to Figure 3:

Figure 3. MxV

Figure 3 shows that "MxV" results in a new vector: multiplying a 4x4 matrix by a 4-element vector needs 16 multiplication operations and 12 addition operations. The computational load is very heavy. If these operations are performed serially, they consume many instruction cycles and take a long time. Game engines in particular contain lots of these operations, so this is a key area for optimization.

In Figure 3, the columns in the blue box are the 4 element operations that can be performed simultaneously, which means a whole MxV operation needs only 4 parallel multiplication operations and 3 parallel addition operations; performance can be improved by up to 4 times. Intel SSE is very good at executing all the multiplication and addition operations in each column concurrently, as Figure 4 shows.

Figure 4. Matrix Parallel Operations

Figure 5. Matrix X Matrix

Figure 5 shows an example of multiplying two matrices, which is an extension of multiplying a matrix by a vector, shown previously. But there is one important point to know: this operation is affected by the matrix storage model. The different colors in Figure 5 indicate parallel operations in the two storage modes. Pink shows the parallel operation in Row-Major mode, and orange shows the corresponding parallel operation in Column-Major mode. Because Intel SSE also provides highly efficient memory access instructions for the different matrix storage models, we should use different algorithms to take advantage of these acceleration instructions. Note, however, that these memory access instructions have memory alignment restrictions.

The next two sections show two different solutions based on the two matrix storage models. Both use Intel SSE to parallelize matrix operations to speed up code execution.

Based on “The Last Defender,” optimizing matrix multiplication operation in Row-Major order

Before optimizing "The Last Defender" game engine code with Intel SSE, we analyzed it using Intel® VTune™ Amplifier 2011 for Android to profile its computing consumption, and the "Matrix4f::mul" function stood out, as shown in Figure 6:

We profiled our code on the Motorola MT788 smartphone with the Intel® Atom™ processor Z2480. After running a fixed set of in-game operations, we found the Matrix4f::mul computing consumption reference was 83,340,000, a very time-consuming result. In real code, it looks like this:

Yes, this code is clear and simple. But it is also long-running, and the function is called frequently in the game engine, so it affects performance and is a prime candidate for optimization.

As mentioned above, the simple SSE optimization can be applied as follows (in Row-Major order):

This implementation is based on Intel SSE intrinsics. We recommend that developers use SSE intrinsics instead of writing pure assembly language whenever the compiler supports them: they are easier to use and more intuitive than assembly language, with no performance loss.

__m128 is a data type for SSE intrinsics. It is 128 bits long and can store four 32-bit single-precision floats. One intrinsic used here is:

__m128 _mm_setr_ps(float z, float y, float x, float w);

It sets the four floats into a __m128 value in the order given.

The shuffle intrinsic, __m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int i), is based on the mask "i": it selects four specific single floats from "a" and "b" and combines them into a new __m128 value according to the mask. The mask must be an immediate number. Figure 7 shows the detailed rules:

After this simple optimization, we used Intel VTune Amplifier 2011 for Android to profile the same operation again and got the result shown in Figure 8.

Figure 8. Optimized Matrix4f::mul Computing Consumption Reference

The computing consumption reference was reduced from 83,340,000 to 18,780,000, a performance improvement of over 4 times1. (We executed the same test steps and operations in the same scenario: same scene, nearly the same enemies, same vehicles, same weapons, same test duration, and so on; changes in AI and the number of enemies caused only tiny variations in the result.) This example shows the power of Intel SSE parallel computing.

Column-Major Matrix Multiplication Optimization for OpenGL ES

For matrix operations in OpenGL ES-based applications, the Column-Major storage model is strongly recommended. Not only does this model match the OpenGL ES specification, it can also be parallelized more easily. As mentioned previously, developers can apply many highly efficient memory access techniques to optimize their code. The following code is a classic conversion sample from ARM NEON* to Intel SSE:

Additional Optimization Technologies

There are more optimization technologies and techniques that can be applied to game code. One of them is the Intel® C++ Compiler for Android*, a good, easy-to-use candidate for compiling the NDK portions of the game code. The Intel C++ Compiler for Android OS provides many optimizations specific to Intel® CPU architecture, such as pipeline, cache, and memory utilization. We also use GCC to compile our code, but we need to set compilation options to tune performance and improve cache and memory utilization, as follows:

Optimization Compilation Options for GCC
LOCAL_CFLAGS := -O3 -ffast-math -mtune=atom -msse3 -mfpmath=sse

Optimization Compilation Options for Intel C++ Compiler for Android OS
LOCAL_CFLAGS := -O3 -xSSSE3_atom -ipo -no-prec-div

"Sharp tools make good work," as we know, and Intel VTune Amplifier 2011 for Android can help developers quickly locate a program's hotspots (highly time-consuming code) and check cache and memory usage to improve performance and quality. Intel® Graphics Performance Analyzers (Intel® GPA) is another powerful set of tools: it helps developers monitor the real-time status of executing software from a whole-system perspective, including CPU, GPU, memory, I/O, graphics API, and so on, to find the bottlenecks. Intel GPA is excellent for game development!

Figure 9. Intel® Graphics Performance Analyzers

Summary

By combining Intel® SSE, Intel C++ Compiler for Android OS compilation, and Intel GPA analysis, we achieved an obvious improvement in The Last Defender's performance. Using the same test scenario as before, the FPS improved from 30 to 39, or about 30%2!

Figure 10. Snapshot for Non-optimized Version of The Last Defender (FPS can be turned on via game settings)

Figure 11. Snapshot for Optimized Version of The Last Defender (FPS can be turned on via game settings)

Using Intel® SSE technology to speed up game code is amazing and fun, but also challenging. Although we could only give a short description of our process in this article, we hope we inspired Android game developers to use the features that are available on IA, and helped them optimize their game code to get faster gameplay and better user experience!

Thanks!

Author Bio

YANG Yi is a software application engineer at Intel. He currently focuses on game engine and graphics enablement for Android* on IA in the PRC. Drawing on many advanced Intel® technologies, he engages with and helps PRC game ISVs deliver higher-performance, higher-quality game engines and more popular game titles with excellent gameplay on Intel® x86 Android* platforms.

1 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance

2 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance