
Open64 Compiler Tuning On AMD Bulldozer FX-8150

11-02-2011, 06:30 PM

Phoronix: Open64 Compiler Tuning On AMD Bulldozer FX-8150

After recently comparing the AMD Bulldozer with the GCC, Open64, and LLVM/Clang compilers, in this article is a look at the performance of AMD's Open64 compiler when using their recommended compiler tuning options for Bulldozer when building software.

Comment

Great. It shows that the "stock" options are pretty much OK normally. I remember getting more and more requests for options to add when trying to compare an array of compilers, which in the end turned into an unmaintainable amount of work.

By the way... Any news on PCC (last I read was that it could compile the FreeBSD kernel - what about the OpenBSD migration to PCC etc. - any progress?) and/or the mob branch of TCC (which can be statically compiled against musl libc or uClibc, which might make it even faster)?

Comment

Pretty predictable results. There are no consistent "magic" flags or compiler that deliver huge gains across all applications. Every application out there responds differently to different flags and compilers, and usually just letting the compiler auto-detect the capabilities and features will end up giving you the best overall performance. One just has to ask whether it is worth going through all the effort of finding the right combination for what are usually minimal gains.

Comment

Pretty predictable results. There are no consistent "magic" flags or compiler that deliver huge gains across all applications.

No, certainly no 'magic' flags. However, by giving the compiler the best possible data with which to make its optimization decisions, you can see very substantial gains. The option which enables this is profile-guided optimization (PGO, also known as feedback-directed optimization).

The reason this is not enabled by default is that it requires you to first compile the program you wish to optimize in an instrumented, information-gathering stage, then execute it to gather runtime data (branching behaviour, cache usage, etc.), and finally compile it again, this time using the gathered runtime data to optimize the code as efficiently as possible.

This can of course be automated; an example would be Firefox, which lets you generate these PGO-optimized binaries, leading to much faster performance (many may recall the debate surrounding the Windows binary of Firefox running much faster under Wine than the native Linux version, which was due to the Linux builds not enabling PGO at that time).

While I've never come across a program where it wasn't faster with PGO than without, it's worth mentioning that the gains depend very much on how badly the optimization heuristics guessed when compiling without PGO. Many optimizations with the potential to bring huge performance gains, such as loop unrolling, are notoriously hard to estimate, which is why no compiler I know of enables them by default. However, when using PGO I know that at least GCC automatically enables them, since they can be applied accurately with the given profile data.

Apart from that, link-time optimization can also yield a decent performance increase by being able to look at an entire program as a whole rather than as separate code chunks. From my tests, link-time optimization yields performance gains of 5% at best, but I'm sure there are exceptions.

Comment

Link-time optimization is always a good idea; code placement can be crucial particularly if you can cram more of the critical path into L1 instruction cache.

Speaking of which - I'm always annoyed by the lack of analysis in these articles. "We ran foo and it yielded this number X. Next."

Articles like this teach readers next to nothing; they offer pretty much zero enhancement to understanding.

I would look at results for, e.g. GraphicsMagick and ask myself "why aren't the BD-specific optimizations helping?"
And then I'd re-run the test using valgrind cachegrind and see what the code that the compiler generated is actually doing, in an instruction-level profile, and look at the cache hits and misses. (Of course, this assumes that you've built a new enough valgrind that has already been updated to support the new AVX etc. instructions....)
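Such a re-run might look like the following sketch (the toy `bench.c` stands in for whatever benchmark binary is under test; this assumes a valgrind build that supports the instruction set the compiler emitted):

```shell
# Stand-in benchmark program, purely for illustration
cat > bench.c <<'EOF'
#include <stdio.h>
int main(void) {
    long sum = 0;
    for (long i = 0; i < 100000; i++) sum += i;
    printf("%ld\n", sum);
    return 0;
}
EOF
gcc -O2 bench.c -o bench

# Simulate the cache hierarchy and record hits/misses per instruction
valgrind --tool=cachegrind ./bench

# Annotate the recorded profile: I1/D1/LL miss counts per function
cg_annotate cachegrind.out.*
```

The annotated output shows instruction-cache (I1) and data-cache (D1/LL) behaviour line by line, which is exactly the information you'd need to explain why a BD-specific flag did or did not help.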

Comment

Speaking of which - I'm always annoyed by the lack of analysis in these articles. "We ran foo and it yielded this number X. Next."

Articles like this teach readers next to nothing; they offer pretty much zero enhancement to understanding.

I would look at results for, e.g. GraphicsMagick and ask myself "why aren't the BD-specific optimizations helping?"
And then I'd re-run the test using valgrind cachegrind and see what the code that the compiler generated is actually doing, in an instruction-level profile, and look at the cache hits and misses. (Of course, this assumes that you've built a new enough valgrind that has already been updated to support the new AVX etc. instructions....)

Thank you, I was starting to think I was alone on that point.
Countless sites have already thrown together graphs and called it a day, leaving readers to do the "e-peen = f(bar length)" math.

I can read the data just fine; please focus on giving me information instead.

Comment

Thank you, I was starting to think I was alone on that point.
Countless sites have already thrown together graphs and called it a day, leaving readers to do the "e-peen = f(bar length)" math.

I can read the data just fine; please focus on giving me information instead.

Totally agree!!
I have to rely on readers' comments to gain some info/clues on what's happening in the figures (provided that no troll wars start or OT hijacking occurs).
Maybe a follow-up article (or some later editing of the article) drawing on the forum discussions would do much to increase the overall value and interest of Phoronix.

Even a wrong guess is better than no guess: forums are there to correct/insult/discuss about 'em.

Comment

Apart from that, link-time optimization can also yield a decent performance increase by being able to look at an entire program as a whole rather than as separate code chunks. From my tests, link-time optimization yields performance gains of 5% at best, but I'm sure there are exceptions.

It was still somewhat broken in 4.6.1 for me, failing to build some apps and libs altogether. Apparently it's also not working too well with MinGW.