Book

Tune CPU performance

This tutorial introduces techniques we use to profile and tune the
CPU performance of PaddlePaddle. We will use Python packages
cProfile and yep, and Google's perftools.

Profiling is the process that reveals performance bottlenecks,
which could be very different from what's in the developers' mind.
Performance tuning is done to fix these bottlenecks. Performance optimization
repeats the steps of profiling and tuning alternatively.

PaddlePaddle users program AI applications by calling the Python API, which calls
into libpaddle.so. written in C++. In this tutorial, we focus on
the profiling and tuning of

the Python code and

the mixture of Python and C++ code.

Profiling the Python Code

Generate the Performance Profiling File

We can use Python standard
package, cProfile,
to generate Python profiling file. For example:

python -m cProfile -o profile.out main.py

where main.py is the program we are going to profile, -o specifies
the output file. Without -o, cProfile would outputs to standard
output.

Look into the Profiling File

cProfile generates profile.out after main.py completes. We can
use cprofilev to look into
the details:

cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py

where -a specifies the HTTP IP, -p specifies the port, -f
specifies the profiling file, and main.py is the source file.

Open the Web browser and points to the local IP and the specifies
port, we will see the output like the following:

We can see that the most time-consuming function is the built-in
method run, which is a C++ function in libpaddle.so. We will
explain how to profile C++ code in the next section. At this
moment, let's look into the third function sync_with_cpp, which is a
Python function. We can click it to understand more about it:

The lists of the callers of sync_with_cpp might help us understand
how to improve the function definition.

Profiling Python and C++ Code

Generate the Profiling File

To profile a mixture of Python and C++ code, we can use a Python
package, yep, that can work with Google's perftools, which is a
commonly-used profiler for C/C++ code.

In Ubuntu systems, we can install yep and perftools by running the
following commands:

apt update
apt install libgoogle-perftools-dev
pip install yep

Then we can run the following command

python -m yep -v main.py

to generate the profiling file. The default filename is
main.py.prof.

Please be aware of the -v command line option, which prints the
analysis results after generating the profiling file. By examining the
the print result, we'd know that if we stripped debug
information from libpaddle.so at build time. The following hints
help make sure that the analysis results are readable:

Use GCC command line option -g when building libpaddle.so so to
include the debug information. The standard building system of
PaddlePaddle is CMake, so you might want to set
CMAKE_BUILD_TYPE=RelWithDebInfo.

Use GCC command line option -O2 or -O3 to generate optimized
binary code. It doesn't make sense to profile libpaddle.so
without optimization, because it would anyway run slowly.

Profiling the single-threaded binary file before the
multi-threading version, because the latter often generates tangled
profiling analysis result. You might want to set environment
variable OMP_NUM_THREADS=1 to prevents OpenMP from automatically
starting multiple threads.

Examining the Profiling File

The tool we used to examine the profiling file generated by
perftools is pprof, which
provides a Web-based GUI like cprofilev.

We can rely on the standard Go toolchain to retrieve the source code
of pprof and build it:

go get github.com/google/pprof

Then we can use it to profile main.py.prof generated in the previous
section:

pprof -http=0.0.0.0:3213 `which python` ./main.py.prof

Where -http specifies the IP and port of the HTTP service.
Directing our Web browser to the service, we would see something like
the following:

Identifying the Performance Bottlenecks

Similar to how we work with cprofilev, we'd focus on tottime and
cumtime.

We can see that the execution time of multiplication and the computing
of the gradient of multiplication takes 2% to 4% of the total running
time, and MomentumOp takes about 17%. Obviously, we'd want to
optimize MomentumOp.

pprof would mark performance critical parts of the program in
red. It's a good idea to follow the hints.