Download PDF and Source

This article, the fifth in a series, describes the profiling and optimization of a fluid simulation, presented in the third and fourth articles. The first article summarized fluid dynamics; the second surveyed fluid simulation techniques; and the third and fourth presented a vortex-particle fluid simulation with two-way fluid-body interactions, which runs in real time. This article exploits yet another feature of Intel® Threading Building Blocks (Intel® TBB) to spread more work across multiple threads. This article describes a process for profiling CPU usage and uses that information to optimize and further parallelize the code so that it runs faster.

Much of the process of profiling and optimization described in this article mirrors the procedures described in The Software Optimization Cookbook. That process starts with creating a benchmark-a chunk of code used to quantify the performance of the algorithm being optimized. The rest of the process entails iterating on three steps:

Profile to identify so-called “hotspots,” where the application spends most of its time.

Investigate the details of why the hotspot consumes so much time.

Modify code to try to make it faster.

This article chronicles applying these steps to the fluid simulation application presented in the previous two articles.

Benchmark Setup

The benchmark should provide repeatable results so that a profiling tool obtains the same results each time. The test cases provided with the fluid simulation use a random number generator to initialize the fluid properties; formulating a proper benchmark required seeding that random number generator so that the simulation produced exactly the same results for each run. Calling srand with a constant value, from within InteSiVis::InitialConditions accomplished this.

I used Intel® VTune™ Performance Analyzer to perform many of the profiles. Although not necessary, it helps simplify the process if the benchmark automatically terminates after a fixed, constant duration. I therefore added code in GlutDisplayCallback that automatically terminates the process after a fixed number of updates. This code is only compiled when the pre-compiler macro PROFILE is defined with a value exceeding 1.

When the simulation code is built in benchmark mode, profiling the application is fast and easy. The application automatically runs the benchmark test case, executes the predefined number of iterations, and then terminates. The process is automated and easy to run, and the execution is identical for each run, all of which are important features to have in a benchmark.

Profiling

I used two tools to perform CPU profiling: performance counters and VTune. The kernel provides a pair of system calls to facilitate precise measurement of elapsed time. The QueryPerformanceCounter function obtains the current value of a high-resolution timer in units of cycles, and the QueryPerformanceFrequency function obtains the frequency (in cycles per second) of that timer. The simulation code includes macros that simplify the use of these timers. To time a section of code, place QUERY_PERFORMANCE_ENTER just before the start of the region and QUERY_PERFORMANCE_EXIT just after the region, where the EXIT macro also takes an argument used to label the region. These macros measure the duration of each iteration that the region executes, along with average durations of that region. This provides a quick and easy scheme for measuring aggregate code performance and is embedded in the code, so that it can be used even when fancier tools are not available.

Table 1 shows timing data for the simulation as it ran on a system with dual 2.33 GHz quad-core Intel® Xeon® processors, varying the number of cores used by the simulation. I used Microsoft* Visual Studio to compile the application. The shaded routines are parallelized using Intel® TBB, as described in articles three and four. As expected, those routines run faster as the number of cores increases. In contrast, the execution times of other routines, which have not been written to exploit multiple threads, do not vary significantly with the number of cores. In fact, because this process ran on a multiprocessing operating system, variations in runtime have less to do with this process in isolation and more to do with miscellaneous other processes running concurrently on the same machine during these timings. Also note that some of the entries in the table represent aggregates of other routines; so naturally, when those aggregates contain parallelized routines, the aggregate timings also vary with the number of threads.

Table 1. Benchmark Timings Before Optimization

Code region

Duration (ms) when executed on # CPUs

1

2

4

8

VortonSim

CreateInfluenceTree

FindBoundingBox

Vortons

0.01

0.01

0.01

0.01

VortonSim

CreateInfluenceTree

FindBoundingBox

Tracers

3.63

3.45

3.35

3.35

VortonSim

CreateInfluenceTree

FindBoundingBox

3.65

3.47

3.36

3.37

VortonSim

CreateInfluenceTree

MakeBaseVortonGrid

0.05

0.05

0.05

0.05

VortonSim

CreateInfluenceTree

AggregateClusters

0.04

0.04

0.04

0.04

VortonSim

CreateInfluenceTree

3.83

3.66

3.56

3.57

VortonSim

ComputeVelocityGrid

5.30

2.27

1.58

1.11

VortonSim

StretchAndTiltVortons

0.23

0.24

0.24

0.25

VortonSim

DiffuseVorticityPSE

0.24

0.24

0.24

0.25

VortonSim

AdvectVortons

0.05

0.05

0.05

0.06

VortonSim

AdvectTracers

18.51

12.71

7.12

3.94

FluidBodySim

VortonSim_Update

26.73

18.58

12.38

8.89

FluidBodySim

SolveBoundaryConditions

4.04

3.21

3.13

3.14

FluidBodySim

RbSphere_UpdateSystem

0.00

0.00

0.00

0.00

InteSiVis

FluidBodySim_Update

30.78

21.80

15.53

12.05

InteSiVis

Render_SetMaterial

0.01

0.01

0.01

0.01

ParticlesRender

FillVertexBuffer

12.48

8.06

5.83

4.73

ParticlesRender

Draw

11.82

11.76

11.77

11.97

ParticlesRender

24.31

19.83

17.61

16.71

InteSiVis

Render_SwapBuffers

0.24

0.29

0.26

0.27

InteSiVis

Render

24.73

20.30

18.05

17.17

InteSiVis

55.31

41.95

33.50

29.15

A look at Table 1 suggests that the FindBoundingBox routine could benefit from parallelization, because it takes a considerable fraction of the simulation time-38% when the number of cores is 8-and a look at its code reveals that it is a straightforward loop, ripe for data parallelism.

VTune is a sophisticated profiling tool that uses sampling-that is, it periodically interrupts the system to record performance information, such as the current instruction pointer and various event counters. This technique allows the profiler to obtain timing and other information for virtually every machine instruction in the program, so you can tell not only which functions occupy a lot of time but which line of code within the function and indeed which instruction within the line of code. Meanwhile, because the sampling occurs relatively infrequently, it does not unduly interfere with the execution of the algorithm.

Note: Intel® VTune™ Performance Analyzer can also instrument code to provide call graphs. This functionality can be valuable in more complicated code, but this simulation has a simple software architecture. As a result, this article omits use of instrumenting and call graph features. VTune also supports programming the performance monitoring unit (PMU) to obtain far more detailed information about low-level details such as memory bandwidth and cache usage.

One of the nice features of a sampling profile is that it does not require any setup within the code. For example, you can just build the application, and then tell VTune to run your application. Then, you “surf” through VTune to discover where your program runs slowly.

VTune can provide information about many events, and the wealth of data could become overwhelming. David Levinthal’s article “Cycle Accounting Analysis on Intel® Core™2 Processors” provides an in-depth analysis of how to interpret sampling data. Intel® VTune provides a “program” of sampling experiments that it runs on your code automatically and that includes calibration and data collection for dozens of events, such as cache misses, branch mispredictions, and many other events. To get started, let’s keep things simple and look at just two quantities: CPU cycles and instructions retired.

Table 2 shows a tiny subset of the sampling data that VTune provides for a few hotspots. The CPU_CLK column indicates counts of the core CPU cycles in the unhalted state, which is approximately the number of CPU cycles spent in the routine. The INST_RETIRED column indicates the number of instruction retired, and CPI indicates the cycles per instruction. A general rule of thumb is that when CPI > 1, the routine is operating significantly less efficiently than the processor is capable of.

Table 2. Sampling Profiles Before and After Optimization

Before optimization

After optimization

Name

CPU_CLK

INST_RETIRED

CPI

CPU_CLK

INST_RETIRED

CPI

FillVertexBufferSlice

3,291

2,971

1.108

3,166

2,656

1.192

Interpolate†

2,679

3,013

0.889

3,104

4,613

0.673

IndicesOfPosition†

2,130

2,198

0.969

†

†

†

ComputeVelocity

1,617

1,686

0.959

1,158

593

1.953

SolveBoundaryConditions

1,134

587

1.932

1,014

1,264

0.802

FindBoundingBox*

692

908

0.762

677

737

0.919

AdvectTracersSlice

289

409

0.707

377

498

0.757

*After optimization, the original FindBoundingBox was split into three routines, and the vast majority of computation occurred in FindBoundingBoxTracersSlice, whose data appears in the “after optimization” columns.†After optimization, IndicesOfPosition became inlined, primarily contributing to Interpolate.

Table 2 suggests that several routines warrant investigation, including the FillVertexBuffer, Interpolate, and IndicesOfPosition routines.

It is interesting to note that although ComputeVelocityGrid consumes a lot of time, it appears only sixth on this list, and it has a relatively good CPI, implying that it is not terribly inefficient. When you look at its code, that fact comes as somewhat of a surprise, because it is one of the most mathematically and logically complicated parts of the code and runs recursively. You might have expected that the function call overhead and branching logic would come at a huge price. But it turns out that it’s not the most immediate source of concern.

The results of the profiling stage often come as a surprise. Before I ran these profiles, I had expected that the ComputeVelocity routine would consume most of the time, but it is fifth on the list. In practically every situation where I have profiled and optimized code, the profiles revealed a different picture of the situation than the one I predicted. This fact suggests that programmers should avoid premature optimization. Profiles, not guesses, should drive your optimization efforts.

Investigation

The sampling summary indicates which routines to investigate, and VTune provides additional detail about precisely which sections of those routines run slowly. For example, the seemingly simple routine IndicesOfPosition consists of only three lines of code, which are simple assignment statements. Each of these lines of C code generates around 10 assembly instructions. Upon closer investigation, VTune reveals that, of these instructions, one in particular consumes a disproportionate number of cycles.

Understanding why code runs slowly requires a wide range of information about the CPU, which is beyond the scope of this article. But the books and articles listed in the “Further Reading” section provide a wealth of pertinent information.

Ultimately, the investigation phase should result in some hypotheses of what causes the hotspots. VTune can provide more detail by running a suite of tests. Upon clicking "Sampling", the Sampling Configuration Wizard appears. Under “Tuning Assistance,” I selected "Automatically generate tuning advice" and selected the option to generate over 30 different “experiments.” For each experiment, VTune reprograms the PMU to collect data about a different “event.” The validity of these experiments depends on having a repeatable, automated benchmark, as mentioned above. This article presents some of the details of this experiment in the “Hotspot” sections below.

Modification

Armed with details about where the code spends most of its time and hypotheses on why, there are a wide variety of solutions to try. Here, optimization transitions from science to art, because finding the optimal solution entails a lot of trial and error. Do not expect your first attempt at optimization to succeed, and never assume that your code change improved execution time.

Modifications to try include:

Changing the algorithm to one with less complexity.

Changing branching logic to avoid mispredicted branches.

Making the code access data in a more cache-friendly manner.

Avoiding inherently slow operations.

Avoiding or replacing certain floating-point operations.

Using intrinsic vector operations.

Distributing algorithms across multiple threads.

You might naïvely assume that slow code results from using slow operations, but very often, code runs slowly because some portion of the code spends a lot of time idly waiting for memory to arrive. Simply knowing exactly what line of C code is the hotspot can imply that the operation performed is slow, but often the operation is not inherently slow: memory or instruction fetches are. The take-home point is that you should expect to try more than one idea when modifying your code.

Repeat as needed

Finally, each time you modify the code, repeat the entire profile–investigate–modify process. You will run through that loop dozens or hundreds of times, which emphasizes the point that the benchmark needs to be very quick and easy to run.

Hotspots

The profile data indicates that the code spends most of its time on a few operations. This article focuses on just a few of them: Rendering, Interpolate, IndicesOfPosition, and FindBoundingBox.

Hotspot 1: Rendering

VTune reports the following:

Execution may be bound by load operations: 0.54 Number of load instructions per clockticksAdvice:

Reduce the number of load operations.

The slowest part of this code comes from rendering. This series of articles focuses on simulation and side-steps rendering, because rendering fluids is its own large and complicated topic. The code provided with this series of articles does not even use programmable shaders or fancy features like “instancing,” both of which would result in monumental gains in speed.

Still, you can drastically improve the rendering speed simply by eliminating a useless operation that is simply an artifact of the OpenGL rendering application programming interface (API). Rendering vertices entails two steps: filling a buffer with vertices, then handing that buffer to the renderer to draw it. VTune reported that DrvCopyContext within the video driver consumed a large percentage of the total cycles for this process. You can’t change the video driver, but you can change how you use it. The original code used “vertex arrays” in which OpenGL copies that buffer to yet another buffer before rendering. Microsoft* DirectX* technology and more recent versions of OpenGL allow you to skip that step and write vertex data directly into the final destination. I changed the rendering routine to use OpenGL’s vertex buffer objects (VBOs), which sped up the draw call from 11.83 ms to 0.027 ms-an approximately 450× improvement.

Hotspot 2: Interpolate

The Interpolate routine is not necessarily inefficient (because its CPI is relatively low), but for simulations with a lot of particles-tracers in particular-the simulation spends a lot of time interpolating to obtain velocity values for each tracer. VTune reports the following:

I also replaced the Vec3 routines with routines from Devir and Zohar’s “Matrix Library”-but it resulted in no speed improvement, possibly because of the overhead associated with increased memory-alignment restrictions. The optimized code distributed with this article includes the vectorized version of Vec3, but it is disabled, because its use did not result in speed improvements.

Note that using the Intel® SSE instructions effectively requires that the memory address each Vec3 is aligned be an integer multiple 16 bytes. This causes each Vec3 to include padding for a phantom fourth component (which not used) and also requires changing all memory allocations to use 16-byte alignment. (Note that the STL vector container resize method infamously takes its “default value” argument by value, not by reference, which means that it is impossible to use an STL vector with objects that require peculiar alignments. To dodge this, I wrote my own.) The additional memory bandwidth and allocation overhead offset gains provided by using the Intel® SSE instructions.

VTune tells you that a huge fraction of the time spent in this routine goes toward the float-to-int conversions, and indeed, this operation is well known to be slow. The Intel® Tuning Assistant reported the following:

By default, the floating-point unit (FPU) converts floating-point values to integers using rounding; but C specifies truncation. So each float-to-int conversion requires changing the FPU mode to “truncation,” but to do that safely requires first saving the old control word, then restoring it after the conversion. You can see (but the compiler cannot) that you only need to change the control word once for all three conversions. The new code for all three conversions looks like this:

In addition to reducing expensive FPU control word changes, the resulting routine is simple enough that IndicesOfPosition becomes inlined. The Interpolate routine calls IndicesOfPosition, so the cost subsequently gets tallied with Interpolate. That explains why IndicesOfPosition disappeared from the post-optimization profiles. That also explains why, despite the fact that interpolation, overall, runs faster, VTune reports that the optimized Interpolate routine consumes more CPU cycles than it did before. You can measure the aggregate improvement from changing Interpolate and IndicesOfPosition from the AdvectTracers routine, which went from 12.9 ms to 18.5 ms-an approximately 1.4× improvement.

Hotspot 4: ComputeVelocity

Most of the time spent in the ComputeVelocity routine happens in VORTON_ACCUMULATE_VELOCITY:

The 1/sqrt(dist2) stands out as particularly expensive, because it includes both division and a square-root, both of which are expensive. Replacing that with an approximation (see the code for details) sped up this routine nearly 1.6×, from 5.30 ms to 3.36 ms.

VTune also reports that this routine has a lot of mispredicted branches. This is because of the unpredictable nature of recursively descending the octree. The problem requires a more sophisticated algorithm, which will be the topic of a future article.

Hotspot 5: SolveBoundaryConditions

VTune reports that the following loop occupies the vast majority of time spent in the SolveBoundaryCondition routine:

Intel® VTune reports that the fsub instruction that computes (rTracer.mPosition.x–rSphere.mPosition.x) runs slowly. The subtraction itself is very fast, so what is the problem? VTune reports that this instruction is waiting on memory to arrive. The hardware prefetch should be able to predict that it needs this memory, but the memory is not available when needed. Adding a software prefetch five iterations ahead improves that wait only modestly. What’s going on? Because this loop is so computationally simple and accesses such a large amount of memory (thousands of tracers), the process is bound by memory bandwidth. You could speed it up by reducing the amount of memory it accesses or incorporating the same instructions into another loop that reads the same memory. Reducing the memory footprint of a particle would not be helpful, because the data accessed is a set of tracer particles, whose structure is quite common in video games. So shrinking it here would make the results less relevant to real-world situations. And incorporating this routine into another would reduce modularity.

As a last resort, it is tempting to parallelize this process. But alas, because it is memory-bound, parallelization would not speed it up. Because Intel® TBB makes parallelization so easy, however, I went ahead and parallelized this process anyway, just to confirm my suspicions. Each iteration of this process potentially affects the same data (by applying impulses to the rigid body), so you must again use the map-reduce paradigm-that is, parallel_reduce. The results indicate that the process speeds up going from one to two CPUs (probably because that effectively doubles the L1 cache, as each core has its own), but going from two to four or four to eight CPUs does not improve throughput whatsoever, as expected.

Hotspot 6: FindBoundingBox

The FindBoundingBox routine was neither terribly inefficient (its CPI was appreciably below 1) nor very “hot” (it consumed under 7% of the total cycles). But it was not threaded, so as the number of CPUs increased, FindBoundingBox occupied a greater fraction of the time. As earlier articles mentioned, parallelizing this routine is not trivial, because each iteration affects data “shared” across all iterations. Fortunately, however, combining that shared information is trivial. This routine follows the classic map-reduce pattern, which resembles “divide-and-conquer”: the problem can be split into pieces, each piece can be solved separately, and the pieces can be joined efficiently. The “merge sort” algorithm also follows this pattern.

Parallelizing this routine with Intel® TBB parallel_reduce requires two parts: the usual “map” part, which earlier articles applied to other routines, and a new “reduce” part, which joins the results of the individual threads.

The VortonSim_FindBoundingBoxTracers_TBB class below shows a function for use with Intel® TBB parallel_reduce, which performs the same operations that FindBoundingBox did for tracers. The VortonSim_FindBoundingBoxVortons_TBB class performs the analogous operations for vortons, and you can find it in the source code that accompanies this article.

Table 3 shows the final timing results from the optimized version of this simulation code.

Table 3. Benchmark Timings After Optimization

Code region

Duration (ms) when executed on # CPU’s

1

2

4

8

VortonSim

CreateInfluenceTree

FindBoundingBox

Vortons

0.012

0.027

0.028

0.030

VortonSim

CreateInfluenceTree

FindBoundingBox

Tracers

3.593

1.626

1.181

0.924

VortonSim

CreateInfluenceTree

FindBoundingBox

3.613

1.660

1.217

0.961

VortonSim

CreateInfluenceTree

MakeBaseVortonGrid

0.036

0.039

0.040

0.039

VortonSim

CreateInfluenceTree

AggregateClusters

0.037

0.038

0.038

0.039

VortonSim

CreateInfluenceTree

3.787

1.838

1.405

1.153

VortonSim

ComputeVelocityGrid

3.362

1.594

1.028

0.783

VortonSim

StretchAndTiltVortons

0.256

0.273

0.283

0.282

VortonSim

DiffuseVorticityPSE

0.223

0.227

0.227

0.230

VortonSim

AdvectVortons

0.038

0.038

0.038

0.038

VortonSim

AdvectTracers

12.739

12.186

6.625

3.821

FluidBodySim

VortonSim_Update

20.853

16.174

9.627

6.326

FluidBodySim

SolveBoundaryConditions

3.827

2.876

2.928

2.914

FluidBodySim

RbSphere_UpdateSystem

0.002

0.002

0.002

0.002

InteSiVis

FluidBodySim_Update

24.693

19.063

12.569

9.253

InteSiVis

Render_SetMaterial

0.011

0.012

0.011

0.011

ParticlesRender

FillVertexBuffer

11.441

6.149

3.008

2.087

ParticlesRender

Draw

0.033

0.028

0.024

0.021

ParticlesRender

11.506

6.209

3.065

2.141

InteSiVis

Render_SwapBuffers

0.282

0.318

0.289

0.312

InteSiVis

Render

11.889

6.633

3.458

2.558

InteSiVis

36.597

25.705

16.037

11.823

The overall speedup varies from 1.5× on a single CPU to 2.5× on eight CPUs.

Summary

Tuning the performance of any program involves timing measurements that must be precise in both time and space. I started by composing a repeatable, easy-to-run benchmark. I manually instrumented this benchmark with timing code to measure the duration of various chunks of code. That provided a simple value with a simple meaning, and although it is precise in time, it was not very precise in space-that is, it did not tell me precisely which instructions ran slowly. VTune, a sampling profiler, narrowed down the search for hotspots by reporting precisely which instructions of which routines occupied most of the time. VTune also provided a wealth of other information, such as cache misses and branch mispredictions, to alert me of where the code runs less efficiently than it theoretically could. This led to many inspirations about how to make the code more efficient. In addition, I used the parallel_reduce feature of Intel® TBB to parallelize a routine that requires some amount of synchronization between threads. These optimizations led to an overall speedup of 1.5× to 2.5×, depending on the number of CPUs.

Future articles will cover more drastic changes to the algorithm to compute velocity along with other enhancements to make this fluid simulation more practical for use in video games.

About the Author

Dr. Michael J. Gourlay teaches at the University of Central Florida (UCF) Florida Interactive Entertainment Academy (FIEA), an interdisciplinary graduate program that teaches programmers, producers and artists how to make video games and training simulations. He also works as a Senior Software Engineer at Electronic Arts. He currently works on the procedural animation system used by EA. He was also a lead programmer on Madden and NASCAR. He architected the visual effects system used on EA Sports titles worldwide and developed algorithms for interactive, high-bandwidth online applications. Prior to joining EA, he performed scientific research using computational fluid dynamics and the world's largest massively parallel supercomputers. His previous research also includes nonlinear dynamics in mechanical systems, and atomic, molecular and optical physics. Michael received his degrees in physics and philosophy from Georgia Tech and the University of Colorado at Boulder.