Kingsoft made extensive use of Intel® Graphics Performance Analyzers (Intel® GPA) to craft their latest release of JX3 Online*, a 3D massively multiplayer online role-playing game (MMORPG). Kingsoft utilized both primary components of Intel® GPA, the System Analyzer and the Frame Analyzer, to identify bottlenecks and slowdowns in graphics processing. With support from Intel, Kingsoft developers determined overall system performance metrics and pinpointed sections of the code where bottlenecks and unnecessary draw calls slowed graphics performance. Using diagnostic techniques and optimization methods suggested by Intel application engineers, Kingsoft boosted the game's playable frame rate by a factor of 2.1 when run on systems with Intel® 4 Series Express Chipsets graphics.

The specific methods Kingsoft used to identify root causes of graphics slowdowns and improve the code for the target graphics platforms provide a useful guide to the capabilities and functionality of Intel® GPA, as documented in the following sections.

Test Bed Configuration

To perform the optimization of JX3 Online*, Kingsoft used two separate PCs. One, containing the Intel® GPA components, was set up for the performance analysis. This analysis machine was linked through a TCP/IP connection to a target PC equipped with Intel® GMA X4500 Series Graphics. For the testing, Kingsoft created a special benchmark version of JX3 Online* to approximate online behavior while running in standalone mode on a single PC.

To establish a test baseline, Kingsoft selected a typical scene (see Figure 1) and set parameters for the configuration, choosing a screen resolution of 800x600 with the basic game configuration and low visual effects. Running the game benchmark indicated that playback remained stable at 11 frames per second (fps), as measured using the Fraps* real-time video capture and benchmarking application.

Figure 1: Baseline performance of JX3 Online* game benchmark

With the test environment defined, a test scene selected, and a baseline measured, Kingsoft fired up the Intel® GPA System Analyzer and Frame Analyzer to begin an in-depth examination of the graphics performance.

Analyzing System Performance

The System Analyzer provides a high-level view of system performance metrics to determine whether slowdowns and bottlenecks are associated with the CPU, graphics processing unit (GPU), or Microsoft DirectX* (DX) runtime operations. Selecting state overrides to isolate various graphics functions produces a profile of the game workload at particular stages in the graphics pipeline.

The applicable state overrides include:

Null Driver: Removes the graphics driver and GPU workloads to evaluate the effect of game application workloads on the overall frame rate. Specifically, the frame rate in this override can be thought of as the maximum performance of the game with an infinitely fast graphics card and an infinitely fast graphics driver in a system with the same configuration. Put another way, with the work of the driver and graphics card removed, this is the fastest the game, as currently implemented, could run on this system.

Null Hardware: Removes the GPU load to evaluate whether the game is GPU-bound or CPU-bound. If enabling this override yields a substantially higher frame rate, the game is currently GPU-bound.

1x1 Scissor Rect: On systems with Intel® Graphics Technology, this removes all pixel processing except for 1 pixel per draw call. If enabling this override yields a substantially higher frame rate, pixel processing is a bottleneck.

Table 1 shows the indicated frame rates and frame time for the selected test scene running in the test bed environment.

| State override | Frame rate (fps) | Frame time (ms) |
|---|---|---|
| Initial state (no overrides) | 11 | 91 |
| 1x1 Scissor Rect (remove pixel processing) | 19 | 53 |
| Null Hardware (remove GPU load) | 21 | 48 |
| Null Driver (remove graphics driver and GPU load) | 52 | 19 |

Table 1: Test scene frame rates and frame times with selected state overrides

The Table 1 data illustrates the contribution of each stage of the rendering process to the frame time during game playback (as shown in Figure 2). Note that the time shown for each stage is not the actual time spent in that stage for every frame; it is that stage's effective contribution to the overall frame time. Because the stages typically execute in parallel (or, under some conditions, serially), the actual time required to process each stage may be longer.

Figure 2: Load distribution of the graphics pipeline
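The per-stage contributions can be approximated from Table 1 by converting each frame rate to a frame time and subtracting successive overrides. The following is a minimal sketch of that arithmetic, not part of Intel® GPA; the function names and the "other GPU work" label (the gap between 1x1 Scissor Rect and Null Hardware) are assumptions made for illustration.

```python
# Frame rates (fps) reported in Table 1 for each state override.
TABLE_1_FPS = {
    "initial": 11,
    "scissor_1x1": 19,
    "null_hardware": 21,
    "null_driver": 52,
}


def frame_time_ms(fps: float) -> float:
    """Convert a frame rate in frames per second to a frame time in ms."""
    return 1000.0 / fps


def stage_breakdown(fps_by_override: dict) -> dict:
    """Estimate each stage's effective contribution to the frame time.

    Each override removes more of the pipeline than the previous one, so the
    difference between two successive frame times approximates the cost of
    the work that was removed. Stage labels are illustrative assumptions.
    """
    t = {name: frame_time_ms(fps) for name, fps in fps_by_override.items()}
    return {
        "pixel processing": t["initial"] - t["scissor_1x1"],
        "other GPU work": t["scissor_1x1"] - t["null_hardware"],
        "graphics driver": t["null_hardware"] - t["null_driver"],
        "application and DX runtime": t["null_driver"],
    }


breakdown = stage_breakdown(TABLE_1_FPS)
for stage, ms in breakdown.items():
    print(f"{stage:28s} {ms:5.1f} ms")
```

Because the stages overlap in time, these differences are effective contributions to the frame time rather than measured per-stage durations.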

To achieve the target frame rate (20 fps) and frame time (50 ms), shown by the red line in Figure 2, both the GPU and CPU workloads must be reduced (particularly the GPU). The indications from the analysis suggest:

Pixel processing represents a significant bottleneck with a strong effect on the game’s frame rate. However, optimizing the pixel processing alone is unlikely to be enough to achieve the target frame rate.

Graphics driver workloads are the second heaviest burden in the graphics pipeline. Analyzing the DX draw calls further should help determine if the game is sending a high volume of rendering commands and data to the graphics hardware.

The maximum frame rate of the game before optimization (with any graphics hardware) would be only 52 fps (according to the frame rate under Null Driver), which indicates that workloads from the program code and DX runtime are fairly heavy.

The heavy workloads associated with the graphics pipeline will require substantial trimming to reach the target frame rate, and optimization efforts should be directed at each stage of the graphics pipeline.

The DX metrics feature of the System Analyzer helps identify root causes of performance issues, as shown in Figure 3.

Figure 3: Analyzing the Microsoft DirectX* metrics for JX3 Online*.

The System Analyzer determined that the average number of DX draw calls per frame is approximately 1500, a much higher number than is commonly seen for a game of this type. This suggests that DirectX runtime workloads are too high, and too many objects are being rendered in each frame. With this information, Kingsoft recognized that draw calls could be the primary culprit slowing performance and used the Frame Analyzer to obtain an in-depth view of rendering operations for the test frames.

Analyzing Frame Performance

The initial screen of the Frame Analyzer provides an overview of the regions, partitioned by render target. A region consists of one or more work items (primarily draw calls). Regions 2 to 668 render the scene into a texture surface for the reflection in the water. Regions 669 to 1522 render the complete scene in the frame buffer. Other regions clear the color and Z/Stencil buffers.

In this example, regions 2 to 668 consume a large share of the overall frame time (40.0 percent), as shown in Figure 4. The Render Target Viewer (see red circle in Figure 4) shows the generated water reflection texture. When displaying the final rendered scene in the frame buffer, however, the river is not visible within the game camera. Since the river is not visible, the workload required to render the water reflection is unnecessary, and removing it provided a substantial performance boost.

Figure 4: Water reflection texture in the Render Target Viewer
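One way to avoid this kind of wasted pass is to test the water's bounding volume against the camera frustum before scheduling the reflection render. The sketch below illustrates the idea only; it is not Kingsoft's implementation, and `AABB`, `is_visible`, `plan_passes`, and the pass names are hypothetical. Planes are assumed to be given as (a, b, c, d) with ax + by + cz + d >= 0 meaning "inside".

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Plane = Tuple[float, float, float, float]  # (a, b, c, d); inside if >= 0


@dataclass
class AABB:
    """Axis-aligned bounding box (hypothetical helper type)."""
    min_corner: Tuple[float, float, float]
    max_corner: Tuple[float, float, float]


def aabb_outside_plane(box: AABB, plane: Plane) -> bool:
    """True if the whole box lies on the negative side of the plane."""
    a, b, c, d = plane
    # Pick the box corner farthest along the plane normal.
    px = box.max_corner[0] if a >= 0 else box.min_corner[0]
    py = box.max_corner[1] if b >= 0 else box.min_corner[1]
    pz = box.max_corner[2] if c >= 0 else box.min_corner[2]
    return a * px + b * py + c * pz + d < 0


def is_visible(box: AABB, frustum_planes: Sequence[Plane]) -> bool:
    """Conservative test: may report visible for some boxes just outside."""
    return not any(aabb_outside_plane(box, p) for p in frustum_planes)


def plan_passes(water_bounds: AABB, frustum_planes: Sequence[Plane]) -> List[str]:
    """Schedule the reflection pass only when the water can be seen."""
    passes = []
    if is_visible(water_bounds, frustum_planes):
        passes.append("water_reflection")
    passes.append("main_scene")
    return passes


# Toy frustum: a cube spanning -1..1 on every axis (illustrative only).
CUBE_FRUSTUM = [
    (1, 0, 0, 1), (-1, 0, 0, 1),
    (0, 1, 0, 1), (0, -1, 0, 1),
    (0, 0, 1, 1), (0, 0, -1, 1),
]
river_far_away = AABB((5, -1, -1), (6, 1, 1))
river_in_view = AABB((-0.5, -0.5, -0.5), (0.5, 0.5, 0.5))
print(plan_passes(river_far_away, CUBE_FRUSTUM))
print(plan_passes(river_in_view, CUBE_FRUSTUM))
```

The test is deliberately conservative: it never skips the reflection pass for visible water, at the cost of occasionally rendering it for water just outside the view.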

Selecting Draw Calls from the drop-down menu on the left side of the screen lets you sort calls by duration. By identifying the draw calls that consume the most time, developers can explore options for reducing the duration. Right-clicking a shader associated with a selected draw call highlights all the draw calls that use the shader, as shown in Figure 5. The example shows all the draw calls that render the terrain in the scene.

Figure 5: Display Ergs that use Selected Shader

In this example, the Frame Analyzer indicated that the selected terrain rendering required 11.6 percent of the complete frame time, suggesting another area where optimization could improve game performance.

Setting up experiments within the Frame Analyzer provides quantitative feedback on the degree of performance improvement that can be obtained. For example, as shown in Figure 6, the use of the Simple Pixel Shader could result in at most a 50.3 percent improvement in the terrain rendering time. In comparison, the 2x2 Textures experiment could result in at best a 0.7 percent improvement. In this instance, to best optimize the terrain rendering performance, developers can see that reducing the complexity of the pixel shader will yield the best results.

Rendering leaves, as indicated by the Frame Analyzer, also consumes an inordinate amount of the total frame time (21 percent), providing insight into another area where rendering can be optimized. In this case, the Frame Analyzer determined that no pixel shader was defined for rendering the leaves, so there was no need to run the Simple Pixel Shader experiment. The 2x2 Textures experiment, as shown in Figure 7, showed a performance improvement for leaf rendering of more than 40 percent.

Figure 7: Performing the 2x2 Textures experiment on leaf rendering
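To compare experiments across different parts of the frame, it can help to scale each experiment's best-case improvement by that part's share of the frame time. The sketch below does this with the figures quoted above. It assumes that an "improvement" of X percent means the affected rendering time shrinks by X percent, and it uses 40 percent as a lower bound for the leaf experiment; the function name is hypothetical.

```python
def best_case_frame_saving(share_of_frame: float, max_improvement: float) -> float:
    """Upper bound on the fraction of total frame time an optimization saves.

    share_of_frame:  fraction of the frame time spent in the affected work
    max_improvement: best-case fractional reduction reported by an experiment
    """
    return share_of_frame * max_improvement


# Terrain: 11.6 percent of the frame, Simple Pixel Shader gives at most 50.3 percent.
terrain_saving = best_case_frame_saving(0.116, 0.503)
# Leaves: 21 percent of the frame, 2x2 Textures gives more than 40 percent.
leaf_saving = best_case_frame_saving(0.21, 0.40)

print(f"terrain: up to {terrain_saving:.1%} of the frame time")
print(f"leaves:  at least {leaf_saving:.1%} of the frame time in the best case")
```

Under these assumptions the leaf textures offer a larger whole-frame gain than the terrain shader, even though the terrain experiment shows the bigger local improvement.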

Further investigation in Frame Analyzer of the textures applied to leaf rendering revealed a large illumination texture map with a 1024x1024 resolution, used 42 times per frame. Another set of four leaf textures was found to have the same leaf structure, varying only by color. Consolidating and merging the textures used for the leaves saved a significant amount of texture bandwidth.

Draw calls can also be sorted by "Prim Count" to show the batch sizes of each, as shown in Figure 8. By comparing the batch sizes with the pixel coverage on the screen, developers can determine whether the level of detail (LOD) of meshes is appropriate for the scene. Selecting all draw calls with the largest batch size (1540 primitives) did not obviously change the render target. With the “Pop-Out” check box selected, these objects appeared, but covered very few pixels (as shown by the small pink points in the red rectangles). Frame Analyzer indicated the Vertex Shader Duration was much higher than the Pixel Shader Duration. This indicates that LOD was not enabled for detailed meshes that were very far away from the camera. Best practices for Intel® Graphics Technology implementations recommend keeping the batch size between 200 and 1000 primitives. The Prim Count sorting analysis also indicated that nearly 30 percent of the draw calls had batch sizes smaller than 200 primitives.

Figure 8: Sorting draw calls by Prim Count to see the batch sizes
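The same check can be scripted once per-draw-call primitive counts are available. The following is a minimal sketch that buckets batch sizes against the 200 to 1000 primitive guideline; the function name and the sample counts are hypothetical and are not taken from the JX3 Online* capture.

```python
from typing import Dict, Sequence


def classify_batches(prim_counts: Sequence[int],
                     low: int = 200, high: int = 1000) -> Dict[str, int]:
    """Count draw calls below, within, and above the recommended batch range."""
    small = sum(1 for n in prim_counts if n < low)
    large = sum(1 for n in prim_counts if n > high)
    return {
        "small": small,
        "in_range": len(prim_counts) - small - large,
        "large": large,
    }


# Hypothetical primitive counts for five draw calls.
sample_counts = [50, 150, 300, 800, 1540]
buckets = classify_batches(sample_counts)
small_share = buckets["small"] / len(sample_counts)
print(buckets, f"{small_share:.0%} of draw calls are below 200 primitives")
```

Small batches are candidates for merging, while oversized batches that cover few pixels are candidates for a lower level of detail.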

Optimization Results

Here are the areas where optimization could be successfully applied, as highlighted by the Intel® GPA System Analyzer and Frame Analyzer:

Reduce redundant rendering of reflections in water by updating them only when the water is visible in the camera, using lower levels of detail, and disabling the water reflection effects on platforms with low frame rates

Eliminate large illumination textures and consolidate four similar leaf textures into one (also compressing the colors)

Improve terrain rendering by optimizing the pixel shader used for the terrain, simplifying the blend algorithm and texture filter employed by the pixel shader

Reduce the level of detail for model meshes to minimize overdraw for distant models and models in water

Merge draw calls with small batch sizes to reduce the DX overhead
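The last item, merging small batches, can be illustrated with a simple greedy grouping. This is a sketch under stated assumptions, not the approach Kingsoft used: `DrawCall`, `state_key`, and `merge_small_batches` are hypothetical, and the sketch assumes that draw calls sharing a state key (same shader, textures, and render state) can legally be combined and reordered, which generally holds only for opaque geometry.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class DrawCall:
    state_key: str   # hypothetical identifier for shader + textures + state
    prim_count: int  # number of primitives submitted by this call


def merge_small_batches(calls: Sequence[DrawCall],
                        min_prims: int = 200,
                        max_prims: int = 1000) -> List[DrawCall]:
    """Greedily merge small draw calls that share the same render state."""
    by_state: Dict[str, List[DrawCall]] = {}
    for call in calls:
        by_state.setdefault(call.state_key, []).append(call)

    merged: List[DrawCall] = []
    for key, group in by_state.items():
        pending = 0  # primitives accumulated from small calls so far
        for call in group:
            if call.prim_count >= min_prims:
                merged.append(call)  # already large enough, keep as-is
                continue
            if pending and pending + call.prim_count > max_prims:
                merged.append(DrawCall(key, pending))  # flush a full batch
                pending = 0
            pending += call.prim_count
        if pending:
            merged.append(DrawCall(key, pending))
    return merged


# Hypothetical frame: many tiny leaf batches plus one terrain batch.
frame = [DrawCall("leaf", 50)] * 5 + [DrawCall("terrain", 600), DrawCall("leaf", 120)]
result = merge_small_batches(frame)
print(len(frame), "draw calls before,", len(result), "after")
```

The total primitive count is unchanged; only the number of calls, and with it the per-call DX runtime and driver overhead, goes down.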

As shown in Figure 9, these optimizations produced an improvement in the playback rate indicated by the game benchmark from 11 to 23 frames per second on systems using Intel® 4 Series Express Chipsets, a 109 percent improvement in performance.

Figure 9: Optimization improved frame rate from 11 fps to 23 fps

Summary

As illustrated by the experiences of the Kingsoft development team, Intel® GPA helps minimize the effort programmers devote to optimizing games and other graphics-intensive applications to run on desktop and notebook computers equipped with Intel® Graphics Technology. Through the use of various override modes, the System Analyzer provided a high-level profile of the game workload at particular stages in the graphics pipeline. This information was then used within the Frame Analyzer to perform various experiments that identified specific areas for improvement. The bottom line is that the frame rate increased by a factor of 2.1, without negatively impacting the overall quality of the scene being rendered.

About the Author

Sheng Guo is an application engineer in Intel’s Developer Relations Division within the Software and Services Group (SSG), working on optimizing applications to take advantage of the latest Intel software and hardware innovations. Sheng specializes in software design and optimization for massively multiplayer online games, enterprise anti-virus software, web services, and office applications. He holds a master’s degree in Computer Science from the University of Nanjing, China.

Intel® GPA is available at no cost to Visual Adrenaline developer program members. For more information about Intel® GPA and to download the software, look in the Visual Computing tools section.