1. Overview

The graph below charts the number of CPU cycles used by SQLite on a
standard workload, for all versions of SQLite going back about 8 years.
As can be seen, the number of CPU cycles used by SQLite has been
halved in just the past three years.

This article describes how the SQLite developers measure CPU usage,
what those measurements actually mean, and the techniques used by
SQLite developers on their continuing quest to further reduce the
CPU usage of the SQLite library.

(Chart: CPU cycles used by successive SQLite versions, measured using
cachegrind on Ubuntu 16.04 on x64 with gcc 5.4.0 and -Os.)

2. Measuring Performance

In brief, the CPU performance of SQLite is measured as follows:

Compile SQLite in an as-delivered configuration, without any special
telemetry or debugging options.

Link SQLite against a test program that runs approximately 30,000
SQL statements representing a typical workload.
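In outline, the two steps can be sketched as shell commands. The file
names, compiler flags, and speedtest1 arguments here are illustrative
assumptions; the speed-check.sh script in the SQLite source tree
performs the real measurement:

```shell
# Sketch only: flags and arguments are assumptions, not the exact
# invocation used by the SQLite developers.
if command -v valgrind >/dev/null 2>&1 \
      && [ -f sqlite3.c ] && [ -f speedtest1.c ]; then
    # Build the as-delivered amalgamation plus the workload driver,
    # with no debugging options and -Os (see below).
    gcc -g -Os -o speedtest1 speedtest1.c sqlite3.c -lpthread -ldl -lm
    # Run the ~30,000-statement workload under cachegrind.
    valgrind --tool=cachegrind ./speedtest1 test.db
    RESULT=measured
else
    echo "valgrind or the SQLite sources are not present; skipping"
    RESULT=skipped
fi
```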

2.1. Compile Options

For performance measurement, SQLite is compiled in approximately the same
way as it would be for use in production systems. The compile-time configuration
is "approximate" in the sense that every production use of SQLite is
different. Compile-time options used by one system are not necessarily
the same as those used by others. The key point is that options that
significantly impact the generated machine code are avoided. For example,
the -DSQLITE_DEBUG option is omitted because that option inserts thousands
of assert() statements into performance-critical sections of the SQLite
library. The -pg option (on GCC) is omitted because it causes the compiler
to emit extra profiling instrumentation (for use by gprof), which would
interfere with the measurements being taken.

For performance measurements,
the -Os option is used (optimize for size) rather than -O2, because
-O2 moves code around so much that it becomes difficult to associate
specific CPU instructions with specific lines of C source code.

2.2. Workload

The "typical" workload is generated by the
speedtest1.c
program in the canonical SQLite source tree. This program strives to
exercise the SQLite library in a way that is typical of real-world
applications. Of course, every application is different, and so
no test program can exactly mirror the behavior of all applications.

The speedtest1.c program is updated from time to time as the SQLite
developers' understanding of what constitutes "typical" usage evolves.

2.3. Performance Measurement

Cachegrind is used to
measure performance because it gives answers that are repeatable to
7 or more significant digits. In comparison, actual (wall-clock)
run times are scarcely repeatable beyond one significant digit.

2.4. Microoptimizations

The high repeatability of cachegrind allows the SQLite developers to
implement and measure "microoptimizations". A microoptimization is
a change to the code that results in a very small performance increase.
Typical microoptimizations reduce the number of CPU cycles by 0.1% or
0.05% or even less. Such improvements are impossible to measure with
real-world timings. But hundreds or thousands of microoptimizations
add up, resulting in measurable real-world performance gains.
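As a back-of-envelope illustration (the numbers here are arithmetic,
not figures from SQLite's records): if each of 1000 microoptimizations
removes 0.05% of the remaining cycle count, the compounded effect is a
reduction of roughly 39%:

```shell
# Compound 1000 improvements of 0.05% each; the remaining fraction
# of the original cycle count is (1 - 0.0005)^1000, about 0.606.
awk 'BEGIN {
    r = 1.0
    for (i = 0; i < 1000; i++) r *= (1 - 0.0005)
    printf "remaining fraction of CPU cycles: %.3f\n", r
}'
```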

3. Performance Measurement Workflow

As SQLite developers edit the SQLite source code, they run the
speed-check.sh
shell script to track the performance impact of changes. This
script compiles the speedtest1.c program, runs it under cachegrind,
processes the cachegrind output using the
cg_anno.tcl Tcl
script, then saves the results in a series of text files.
The two figures in the script's output that the developers pay the most
attention to are the size of the compiled SQLite library and the number
of CPU cycles needed to run the performance test.

The output from the
cg_anno.tcl script
shows the number of CPU cycles spent on each line of code.
The report is approximately 80,000 lines long. Each line of the report
shows a CPU cycle count on the left, followed by the corresponding line
of C source code.
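A hypothetical fragment in the same general format (the cycle counts
and source lines below are invented for illustration; in
cg_annotate-style output, lines that cost no cycles are shown with a
dot):

```
 3,426,000  rc = sqlite3VdbeExec(p);
   981,150  if( rc!=SQLITE_OK ) goto abort_due_to_error;
         .  assert( rc==SQLITE_DONE || rc==SQLITE_ROW );
```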

The cg_anno.tcl script strips extraneous detail from the default
cachegrind annotation output, so that before-and-after reports can be
compared with a side-by-side diff to see exactly how a
microoptimization attempt affected performance.
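To make the comparison step concrete, here is a sketch using two
invented annotation fragments (the cycle counts and source lines are
made up) compared with diff's side-by-side mode; a `|` in the gutter
marks lines whose counts changed:

```shell
# Two fabricated before/after report fragments for illustration.
cat > before.txt <<'EOF'
 12,340  if( pOp->p1>0 ){
  8,220    rc = doStep(p);
EOF
cat > after.txt <<'EOF'
 11,980  if( pOp->p1>0 ){
  8,220    rc = doStep(p);
EOF
# diff exits nonzero when the files differ, so tolerate that here.
diff -y before.txt after.txt || true
```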

4. Limitations

The use of the standardized speedtest1.c workload and cachegrind has
enabled significant performance improvements.
However, it is important to recognize the limitations of this approach:

Performance measurements are done with a single compiler (gcc 5.4.0),
a single optimization setting (-Os), and on a single platform
(Ubuntu 16.04 LTS on x64). Results with other compilers and on other
processors may vary.

The speedtest1.c workload that is being measured tries to be representative
of a wide range of typical uses of SQLite. But every application is
different. The speedtest1.c workload might not be a good proxy for the
kinds of activities performed by some applications. The SQLite developers
are constantly working to improve the speedtest1.c program, to make it
a better proxy for actual SQLite usage. Community feedback is welcomed.

The cycle counts provided by cachegrind are a good proxy for actual
performance, but they are not 100% accurate.

Only CPU cycle counts are being measured here.
CPU cycle counts are a good proxy for energy consumption,
but do not necessarily correlate well with real-world timings.
Time spent doing I/O is not reflected in the CPU cycle counts,
and I/O time predominates in many SQLite usage scenarios.