Compiler optimizations vs PostgreSQL

About two weeks ago I posted a performance comparison of PostgreSQL compiled using various compilers and compiler versions. The conclusion was that for pgbench-like workloads (lots of small transactions), the compilers make very little difference. Maybe one or two percent, if you're using a reasonably fresh compiler version. For analytical workloads (queries processing large amounts of data), the compilers make a bit more difference - gcc 4.9.2 giving the best results (about 15% faster than gcc 4.1.2, with the other compilers/versions somewhere between those two).

Those results were however measured with the default optimization level (which is -O2 for PostgreSQL), and one of the questions in the discussion below that article was what difference the other optimization levels (like -O3, -march=native and -flto) would make. So here we go!

I repeated the two tests described in the previous post - pgbench and TPC-DS, with almost exactly the same configuration. If you're interested in the details, read that post.

The only thing that really changed is that while compiling the PostgreSQL sources, I modified the optimization level or enabled additional options. In total, I decided to test these combinations:

- -O2 (the default)
- -O3
- -O4 (clang only)
- -O3 -march=native (gcc only)
- -O3 -march=native -flto (gcc only)

When combined with all the clang and gcc versions, this amounts to 43 combinations (some of the flags are not supported by the oldest versions). I haven't done the tests for the Intel C Compiler.
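For illustration, the per-combination builds can be sketched roughly like this (the prefix paths are made up for the example; the actual configure options were the same as in the previous post):

```shell
# Sketch: one PostgreSQL build per flag combination (gcc shown; for clang
# the list would be "-O2", "-O3" and "-O4" instead).
for flags in "-O2" "-O3" "-O3 -march=native" "-O3 -march=native -flto"; do
    # derive a directory-safe suffix, e.g. "-O3 -march=native" -> "O3marchnative"
    suffix=$(printf '%s' "$flags" | tr -d ' =-')
    printf './configure CC=gcc CFLAGS="%s" --prefix=/opt/pg-%s\n' "$flags" "$suffix"
    # a real build would then run: make clean && make && make install
done
```

Each build goes into its own prefix, so the binaries can be benchmarked side by side against the same data directory.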

BTW if you're interested in more detailed results, see this spreadsheet, or download this OpenDocument spreadsheet (same data).

pgbench

For the small dataset (~150MB), the results of the read-only test are depicted on the following chart, showing the number of transactions per second (50k-58k range) - so the higher the number, the better.

The results are sorted by compiler (clang, gcc), optimization level and finally compiler version. The bars depict the minimum, average and maximum tps (from 3 runs), to give an idea of how volatile the results are - I haven't found a better chart type in Google Drive or LibreOffice Calc.
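The aggregation behind each bar is trivial; for example, computing the min/average/max over three (made-up) tps values:

```shell
# min/avg/max over the 3 runs of one configuration (tps numbers are hypothetical)
echo "54800 55100 55766" | awk '{
    min = $1; max = $1; sum = 0
    for (i = 1; i <= NF; i++) {
        sum += $i
        if ($i < min) min = $i
        if ($i > max) max = $i
    }
    printf "min=%d avg=%.0f max=%d\n", min, sum / NF, max
}'
# prints: min=54800 avg=55222 max=55766
```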

Now, the first thing you probably notice is that for clang, the higher optimization levels mostly lower the performance. The newer the version, the worse the impact - while clang 3.5 gives >55k transactions with -O2, it drops to 54k with -O3 and 53k with -O4. The 2-4% difference is not a big one, but it's pretty consistent, and it's certainly not in the direction we hoped for.

With gcc, the situation is more complicated - the -O3 and -O3 -march=native levels result in slightly worse performance, although the difference is not as significant as for clang. The results, however, seem less volatile (gcc 4.4 is a good example of that).

The "Link Time Optimization" is a different story, increasing the performance for most versions, especially compared to the -O3 -march=native results. For example 4.8 jumps from 53k to 55k tps, and 4.9 jumps from 54k to 56k. Compared to the -O2 results it's not that great, though (it's still faster, but the difference is smaller).

Another way to look at the data is in tabular form:

| compiler  | -O2   | -O3   | -O4   |
|-----------|-------|-------|-------|
| clang 3.1 | 54808 | 54666 | -     |
| clang 3.2 | 55320 | 54957 | -     |
| clang 3.3 | 55314 | 54909 | -     |
| clang 3.4 | 55144 | 54474 | 54259 |
| clang 3.5 | 55766 | 54628 | 53848 |

| compiler  | -O2   | -O3   | -O3 -march=native | -O3 -march=native -flto |
|-----------|-------|-------|-------------------|-------------------------|
| gcc 4.1.2 | 52808 | 53019 | -                 | -                       |
| gcc 4.2.4 | 53474 | 52829 | 53053             | -                       |
| gcc 4.3.6 | 52355 | 52634 | 52465             | -                       |
| gcc 4.4.7 | 51685 | 52070 | 52194             | -                       |
| gcc 4.5.4 | 53739 | 52828 | 52663             | 56085                   |
| gcc 4.6.4 | 53144 | 53632 | 52899             | 54973                   |
| gcc 4.7.4 | 54354 | 53572 | 53001             | 52451                   |
| gcc 4.8.3 | 54350 | 52390 | 52753             | 54842                   |
| gcc 4.9.2 | 54669 | 54036 | 53758             | 56151                   |

And after computing the difference against the -O2 results for each version, you'll get this:

| compiler  | -O2   | -O3    | -O4    |
|-----------|-------|--------|--------|
| clang 3.1 | 54808 | -0.26% | -      |
| clang 3.2 | 55320 | -0.66% | -      |
| clang 3.3 | 55314 | -0.73% | -      |
| clang 3.4 | 55144 | -1.21% | -0.40% |
| clang 3.5 | 55766 | -2.04% | -1.43% |

| compiler  | -O2   | -O3    | -O3 -march=native | -O3 -march=native -flto |
|-----------|-------|--------|-------------------|-------------------------|
| gcc 4.1.2 | 52808 | 0.40%  | -                 | -                       |
| gcc 4.2.4 | 53474 | -1.20% | -0.79%            | -                       |
| gcc 4.3.6 | 52355 | 0.53%  | 0.21%             | -                       |
| gcc 4.4.7 | 51685 | 0.74%  | 0.98%             | -                       |
| gcc 4.5.4 | 53739 | -1.70% | -2.00%            | 4.36%                   |
| gcc 4.6.4 | 53144 | 0.92%  | -0.46%            | 3.44%                   |
| gcc 4.7.4 | 54354 | -1.44% | -2.49%            | -3.50%                  |
| gcc 4.8.3 | 54350 | -3.61% | -2.94%            | 0.91%                   |
| gcc 4.9.2 | 54669 | -1.16% | -1.67%            | 2.71%                   |
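The percentages in these two tables are simply the relative difference against the -O2 baseline of the same compiler version; for example for clang 3.5 with -O3:

```shell
# relative tps difference of -O3 vs the -O2 baseline (clang 3.5 numbers
# from the table above); negative means -O3 is slower
o2=55766; o3=54628
awk -v base="$o2" -v val="$o3" 'BEGIN { printf "%.2f%%\n", (val - base) / base * 100 }'
# prints: -2.04%
```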

This only confirms that on clang, all the higher optimization levels hurt performance (although only a tiny bit). For gcc, the only thing that makes a bit of difference in the right direction is the -flto flag. But even this makes less difference than the compiler version (gcc 4.9.2 with -O2 is almost as fast as gcc 4.8.3 with -flto).

TPC-DS

OK, so that was a transactional workload. Now let's see the impact on analytical workloads. First, the data load, consisting of the same steps as before:

1. COPY data into all the tables
2. create indexes
3. VACUUM FULL (not really necessary)
4. VACUUM FREEZE
5. ANALYZE

but for all the various optimization combinations:

Clearly, no significant impact - exactly as in the initial post. In case you prefer the tabular form of the results (similar to the one presented for pgbench), this time tracking the total duration of the loading process (in seconds):

| compiler  | -O2 | -O3 | -O4 |
|-----------|-----|-----|-----|
| clang-3.1 | 407 | 407 | -   |
| clang-3.2 | 399 | 396 | -   |
| clang-3.3 | 399 | 395 | -   |
| clang-3.4 | 406 | 397 | 411 |
| clang-3.5 | 405 | 405 | 411 |

| compiler  | -O2 | -O3 | -O3 -march=native | -O3 -march=native -flto |
|-----------|-----|-----|-------------------|-------------------------|
| gcc-4.1.2 | 401 | 406 | -                 | -                       |
| gcc-4.2.4 | 407 | 402 | 397               | -                       |
| gcc-4.3.6 | 401 | 398 | 400               | -                       |
| gcc-4.4.7 | 400 | 402 | 398               | -                       |
| gcc-4.5.4 | 398 | 394 | 391               | 393                     |
| gcc-4.6.4 | 406 | 398 | 400               | 397                     |
| gcc-4.7.4 | 385 | 384 | 384               | 387                     |
| gcc-4.8.3 | 390 | 384 | 390               | 383                     |
| gcc-4.9.2 | 379 | 383 | 374               | 379                     |

and as a speedup versus the -O2 for the same compiler version (negative values mean slowdown):

| compiler  | -O2 | -O3    | -O4    |
|-----------|-----|--------|--------|
| clang-3.1 | 407 | -0.10% | -      |
| clang-3.2 | 399 | 0%     | -      |
| clang-3.3 | 399 | 1%     | -      |
| clang-3.4 | 406 | 2%     | -1.18% |
| clang-3.5 | 405 | 0%     | -1.38% |

| compiler  | -O2 | -O3    | -O3 -march=native | -O3 -march=native -flto |
|-----------|-----|--------|-------------------|-------------------------|
| gcc-4.1.2 | 401 | -1.11% | -                 | -                       |
| gcc-4.2.4 | 407 | 1%     | 2%                | -                       |
| gcc-4.3.6 | 401 | 0%     | 0%                | -                       |
| gcc-4.4.7 | 400 | -0.61% | 0%                | -                       |
| gcc-4.5.4 | 398 | 1%     | 1%                | 1%                      |
| gcc-4.6.4 | 406 | 1%     | 1%                | 2%                      |
| gcc-4.7.4 | 385 | 0%     | 0%                | -0.57%                  |
| gcc-4.8.3 | 390 | 1%     | -0.02%            | 1%                      |
| gcc-4.9.2 | 379 | -1.04% | 1%                | -0.13%                  |

Now, let's see the impact on query performance (notice the chart shows range 150-210, in seconds):

And the results in the tabular form:

| compiler  | -O2 | -O3 | -O4 |
|-----------|-----|-----|-----|
| clang-3.1 | 176 | 174 | -   |
| clang-3.2 | 176 | 172 | -   |
| clang-3.3 | 174 | 185 | -   |
| clang-3.4 | 189 | 176 | 181 |
| clang-3.5 | 174 | 175 | 179 |

| compiler  | -O2 | -O3 | -O3 -march=native | -O3 -march=native -flto |
|-----------|-----|-----|-------------------|-------------------------|
| gcc-4.1.2 | 186 | 200 | -                 | -                       |
| gcc-4.2.4 | 189 | 186 | 186               | -                       |
| gcc-4.3.6 | 189 | 186 | 185               | -                       |
| gcc-4.4.7 | 181 | 178 | 183               | -                       |
| gcc-4.5.4 | 173 | 169 | 166               | 160                     |
| gcc-4.6.4 | 171 | 173 | 172               | 153                     |
| gcc-4.7.4 | 171 | 166 | 183               | 160                     |
| gcc-4.8.3 | 171 | 170 | 172               | 161                     |
| gcc-4.9.2 | 164 | 167 | 162               | 153                     |

and as a speedup versus the -O2 for the same compiler version (negative values mean slowdown):

| compiler  | -O2 | -O3    | -O4    |
|-----------|-----|--------|--------|
| clang-3.1 | 176 | 1.14%  | -      |
| clang-3.2 | 176 | 2.27%  | -      |
| clang-3.3 | 174 | -6.32% | -      |
| clang-3.4 | 189 | 6.88%  | 4.23%  |
| clang-3.5 | 174 | -0.57% | -2.87% |

| compiler  | -O2 | -O3    | -O3 -march=native | -O3 -march=native -flto |
|-----------|-----|--------|-------------------|-------------------------|
| gcc-4.1.2 | 186 | -7.53% | -                 | -                       |
| gcc-4.2.4 | 189 | 1.59%  | 1.59%             | -                       |
| gcc-4.3.6 | 189 | 1.59%  | 2.12%             | -                       |
| gcc-4.4.7 | 181 | 1.66%  | -1.10%            | -                       |
| gcc-4.5.4 | 173 | 2.31%  | 4.05%             | 7.51%                   |
| gcc-4.6.4 | 171 | -1.17% | -0.58%            | 10.53%                  |
| gcc-4.7.4 | 171 | 2.92%  | -7.02%            | 6.43%                   |
| gcc-4.8.3 | 171 | 0.58%  | -0.58%            | 5.85%                   |
| gcc-4.9.2 | 164 | -1.83% | 1.22%             | 6.71%                   |
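For durations, the speed-up is computed with the sign flipped compared to the tps tables (lower is better); for example for clang-3.1 with -O3:

```shell
# query-duration speed-up of -O3 vs the -O2 baseline (clang-3.1 numbers
# from the table above); positive means -O3 is faster
o2=176; o3=174
awk -v base="$o2" -v val="$o3" 'BEGIN { printf "%.2f%%\n", (base - val) / base * 100 }'
# prints: 1.14%
```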

For clang, the results vary from version to version - on 3.3, -O3 results in a ~6% slowdown, while on 3.4 it's a ~7% speed-up. For the latest version (3.5), it's a slight slowdown for both -O3 and -O4.

For gcc, the -O3 and -O3 -march=native flags are a bit unpredictable - on older versions they may give either a slight improvement or a significant slowdown (see for example gcc-4.7.4, where -O3 gives a ~3% speed-up but -O3 -march=native results in a ~7% slowdown).

The only flag that really matters on gcc is apparently -flto (i.e. Link Time Optimization), giving ~5-7% speedup for most versions. That's not negligible, although it's not a ground-breaking speed-up either.

Summary

- The various optimization flags don't have much impact - in most cases the difference is ~1-2%.
- When they do have an impact, it's often in the unexpected (and undesirable) direction, actually making things slower.
- The one flag that makes a measurable difference in the right direction is -flto, giving a ~3% speed-up in pgbench and ~7% in TPC-DS.