I saw all the linked results as well.
Basically, a few percent improvement. About 4-5% over stock.

Free performance is always good, but maybe not at the cost of 3x the build time and 2.5x the RAM use.

When it becomes a more consistent win, it will make sense to use it in release builds of binaries that are redistributed. I think it's definitely worth having packaging take 3x longer if it makes the resulting binary 5% faster and strips out lots of dead code too.

When you compile with just a single command, like gcc -march=native -O3 -flto -fwhole-program ..., it works fine. But when you use a makefile with separate C(XX)FLAGS and LDFLAGS, you need to pass the C(XX)FLAGS along to the LDFLAGS, or else the optimization will suffer greatly. So the LDFLAGS should carry the same optimization flags as the C(XX)FLAGS.

I've done many LTO comparisons and there isn't always a gain (a lot of the benefits of LTO can be had by just defining functions as static when appropriate), but I've never come across regressions like the ones shown here in Michael's tests. Hence I'm thinking he is not passing the C(XX)FLAGS along to the linker through the LDFLAGS in the tests that use a makefile with separate C(XX)FLAGS/LDFLAGS, which in turn means the C(XX)FLAGS optimizations aren't being used when generating the final binary.

Either repeat the optimization flags in LDFLAGS, or just reference the CXXFLAGS variable:
LDFLAGS = $(CXXFLAGS) -Wall
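
For anyone who wants the whole picture, here is a minimal Makefile sketch of that advice. The file names main.cpp and util.cpp and the target name myprog are placeholders of mine, not anything from the benchmarks, so adjust them to your project:

```make
CXX      = g++
CXXFLAGS = -O3 -march=native -flto
# Reuse the compile flags at link time so the LTO link step sees them too.
LDFLAGS  = $(CXXFLAGS)

# Recipes are written after ';' so the fragment does not rely on tab characters.
# Link step: the optimization flags arrive through LDFLAGS.
myprog: main.o util.o ; $(CXX) $(LDFLAGS) -o $@ main.o util.o

# Compile step: every object is built with -flto.
%.o: %.cpp ; $(CXX) $(CXXFLAGS) -c -o $@ $<
```

The point is only that the link rule gets the same -O3 -march=native -flto as the compile rule; anything linker-specific can be appended to LDFLAGS after $(CXXFLAGS).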

I believe this is necessary because LTO can be used on object files written in different languages, but I may be wrong. I haven't really dived into LTO as I haven't gotten any major gains from it for my own code, particularly when compared to PGO, which pretty much always yields gains, often significant ones.

I had never heard of PGO until now, but would love to see some recent benchmarks. Most of the articles I saw were reporting gains of up to ~10%.

Also, from man gcc:

To use the link-time optimizer, -flto needs to be specified at compile time and during the final link.

No, then you get -O0 optimizations. LTO means link-time optimization, which means the optimization is done at link time, which in turn means the link step needs the optimization flags but the compile step does not.

So
CXXFLAGS = -flto
LDFLAGS = -O3 -march=native -flto -fwhole-program

Would work, but your example would not.

Note you can also speed up compilation even more by disabling fat object files. By default GCC produces object files that contain both the code for LTO linking and traditional object code; the latter is not needed if you are going to use LTO on the final link anyway. Edit: Use -fno-fat-lto-objects as a compile-time flag.
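
A rough sketch of where that flag would go in a makefile, again with placeholder names (main.cpp, util.cpp, myprog) that I made up for illustration. Whether fat objects are actually the default depends on the GCC release and target, so check the manual for your version before relying on this:

```make
CXX      = g++
# -fno-fat-lto-objects: objects carry only the LTO intermediate code,
# not regular machine code, so the compile step has less work to do.
CXXFLAGS = -O3 -march=native -flto -fno-fat-lto-objects
LDFLAGS  = -O3 -march=native -flto

# Recipes are written after ';' so the fragment does not rely on tab characters.
# The final link must use -flto, since the objects contain no regular code.
myprog: main.o util.o ; $(CXX) $(LDFLAGS) -o $@ main.o util.o

%.o: %.cpp ; $(CXX) $(CXXFLAGS) -c -o $@ $<
```

The flag only belongs in the compile flags. Objects built this way cannot be linked without -flto.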

This produces individual object files with unoptimized assembler code, but the resulting binary myprog is optimized at -O3. If, instead, the final binary is generated without -flto, then myprog is not optimized.
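
That passage reads like the GCC manual's example of compiling at -O0 with -flto and then linking at -O3 with -flto. Here is a sketch of that setup as a makefile, assuming two source files named foo.c and bar.c (the file names are my assumption; only myprog appears in the text above). How GCC combines compile-time and link-time options has changed between releases, so treat this as an illustration of the quoted text and not as a statement about your compiler version:

```make
CC = gcc

# Recipes are written after ';' so the fragment does not rely on tab characters.
# Link step: -flto together with -O3 is what the quoted text says optimizes myprog.
myprog: foo.o bar.o ; $(CC) -flto -O3 -o $@ foo.o bar.o

# Compile step: -O0, with -flto so each object carries the intermediate code.
%.o: %.c ; $(CC) -c -O0 -flto -o $@ $<
```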