Well, it's not very good. I have been testing my Acovea flag results (posted here) against more traditional "optimized" CFLAGS. The results do not argue strongly in favor of using Acovea-based recommendations.

For each test, I ran the given app against sample data three times with my "normal" CFLAGS, then recompiled and ran it three times with the Acovea CFLAGS, averaging the results. No other significant load existed on the machine at the time. No window system was running (GDM was running, and therefore so was Xorg, as were my standard services like NFS and Samba, but none were actively doing anything). The actual tests were performed from an SSH session from another machine.
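The three-run-and-average procedure can be sketched as a small helper; this is a minimal sketch assuming GNU date with nanosecond support, and the command passed in (flac, lame, bzip2, etc.) is whatever app is under test:

```shell
#!/bin/sh
# Run a command three times and report the average wall-clock time in
# milliseconds.  Assumes GNU date (%N nanoseconds).
avg_time() {
    total=0
    for i in 1 2 3; do
        start=$(date +%s%N)            # nanoseconds since epoch
        "$@" > /dev/null 2>&1
        end=$(date +%s%N)
        total=$(( total + end - start ))
    done
    echo $(( total / 3 / 1000000 ))    # average, in milliseconds
}
```

Usage would look like `avg_time flac --best sample.wav`, once per CFLAGS variant.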

Note: I am aware that -march normally implies -mtune. I leave -mtune present in case -march is filtered for some reason. For the Acovea flags, I used the following methodology: I explicitly include all flags marked "Yes", explicitly exclude all flags marked "No", and then vary from -O1 to -O2 and finally -O3. For the Acovea "alt" set I use -O3 and explicitly include only the "Yes" indications, some of which, it should be noted, are logical-NOT conditions that disable particular compilation methods.
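To make the methodology concrete, the variants can be assembled like this; the flag names here are illustrative placeholders, not the actual Acovea output:

```shell
#!/bin/sh
# Sketch of how each tested CFLAGS variant was assembled.  The flag
# lists below are placeholders standing in for the real Acovea results.
BASE="-march=athlon-xp -mtune=athlon-xp"   # -mtune kept in case -march is filtered
YES_FLAGS="-funroll-loops -fno-gcse"       # flags Acovea marked "Yes"
NO_FLAGS="-fno-inline-functions"           # explicit negations of flags marked "No"

# Three variants, stepping the meta optimization level:
for level in -O1 -O2 -O3; do
    echo "CFLAGS=\"$level $BASE $YES_FLAGS $NO_FLAGS\""
done
# The "alt" set: -O3 plus only the "Yes" flags
echo "CFLAGS_ALT=\"-O3 $BASE $YES_FLAGS\""
```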

TESTS

Test for flac-1.1.1

In this test I encoded Tchaikovsky's 1812 Overture using the "--best" flag option for flac.

Results:

Code:

ACOVEA -O1:
real 2m3.003s
user 2m2.616s
sys 0m0.313s

ACOVEA -O2:
real 2m4.853s
user 2m4.430s
sys 0m0.333s

ACOVEA -O3:
real 2m4.395s
user 2m3.971s
sys 0m0.348s

ACOVEA alt:
real 1m2.734s
user 1m2.348s
sys 0m0.323s

REGULAR:
real 1m9.937s
user 1m9.545s
sys 0m0.326s

Test for lame-3.96.1

In this test I encoded the above 1812 Overture from raw .wav to mp3 using no special options.

Results:

Code:

ACOVEA -O1:
real 1m12.179s
user 1m11.916s
sys 0m0.210s

ACOVEA -O2:
real 1m10.361s
user 1m10.109s
sys 0m0.203s

ACOVEA -O3:
FAILED - Segmentation fault (compiled twice to make sure)

ACOVEA alt:
FAILED - Segmentation fault (compiled twice to make sure)

REGULAR:
real 1m6.611s
user 1m6.354s
sys 0m0.189s

Test for bzip2-1.0.2-r3

In this test I compressed the raw .WAV of the previously used Tchaikovsky's 1812 Overture. The file is fairly large, with a size of 166368764 bytes. No flags for bzip2 were used.

Results:

Code:

ACOVEA -O1:
real 0m50.877s
user 0m50.321s
sys 0m0.475s

ACOVEA -O2:
real 0m48.955s
user 0m48.435s
sys 0m0.447s

ACOVEA -O3:
real 0m46.516s
user 0m45.972s
sys 0m0.471s

ACOVEA alt:
real 0m42.366s
user 0m41.845s
sys 0m0.460s

REGULAR:
real 0m43.687s
user 0m43.162s
sys 0m0.450s

Conclusions
I am aware my test cases are drawn from a specific class of programs, namely encode/decode style logic. This is the easiest case for finding reproducible results; if others want to try more complex types of programs with 100% reproducible data sets, by all means please do!

In the examples given, Acovea-based results can't really be recommended. It's true that in one case they produced an approximately 11% performance increase for the flac encoding, but in the other tests they performed worse, much worse, or failed to execute compared to "normal" optimizing CFLAGS. The interaction of the recommended flags appears highly situational and largely just noise compared with GCC's "meta" -O settings.

I would hazard a guess that Acovea's default benchmarks are simply not representative of the programs I used for testing, and therefore made little if any headway in optimizing them. Short of running an Acovea-style analysis on each program individually, I'm not sure how this could be fixed.

It is difficult to set individual flags that will give an overall improvement in speed. It all depends on what the program does and how it does it internally. To optimize a specific program you have to perform the kind of tests you have done and adjust flags individually, which is not desirable in general.

The gcc suite internally sets many flags based on the -O level; they are all documented in the gcc man pages.

I have done numerous tests myself on my AMD64 3200+ and have come up with a set of flags that overall gives the best performance and stability. The latter is not the least important, as you found out with the programs that segfaulted when run.

In general, it is best to stick with a minimal amount of flags and use the ones recommended for each platform.

I think you have done a great job and I applaud your persistence in testing the various combinations. Great write-up!

Erik

Instead of using Acovea I benchmarked my system in much the same way. I used lame and some default optimizations, and md5-summed all of the resulting mp3s. -O3 gave me the best time. Then I started adding other combinations of CFLAGS until I started noticing speed improvements, again md5-summing the resulting mp3s. I threw out the CFLAGS that gave me different md5 sums, most notably -ffast-math. Then I removed the CFLAGS that gave no significant improvement in encoding time, until I was left with the minimal CFLAGS that reduced my encoding time by 40%. In case you were wondering, here are my CFLAGS for my Athlon 1800+ on an Epox Via 8HKA+.
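The md5-based sanity check described above can be sketched as a tiny helper; this is only a sketch, and rebuilding lame with each candidate flag and re-encoding is left to the surrounding workflow:

```shell
#!/bin/sh
# Keep a candidate flag only if the encoder still produces bit-identical
# output to the baseline build.  Uses md5sum from GNU coreutils.
same_output() {
    a=$(md5sum "$1" | cut -d' ' -f1)
    b=$(md5sum "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]                # succeed only if the sums match
}
```

After rebuilding with a candidate flag and re-encoding, something like `same_output baseline.mp3 candidate.mp3 || echo "flag rejected"` implements the throw-out step.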

I'm curious why people use -O3, given that most tests agree that inlining functions slows down code on modern processors. I also wonder about redundant CFLAGS, such as specifying -fomit-frame-pointer with -O2 and above, since the GCC man page states that this is already implied.

Not to start another rant, but actually reading the man (or info) pages can help a lot too, and save time.

Quote:

I'm curious as to people using -O3, due to the fact that most tests agree that inlining functions slow down code on modern processors.

Qualify "most test results". I suspect that's really "some test results I read", as I find that is most often the case, and then people generalize. Not trying to knock you; it's just been a very common experience of mine.

The answer is I don't trust any of them as a generalization and try to test it myself to see. GCC has evolved recently at a very fast pace and its level of support for different processors varies considerably. What is true for one class of processor with a specific cycle rate, cache, and instruction set may be completely different for another. Thus, I test it myself.

Quote:

As well as redundant CFLAGS such as specifying -fomit-frame-pointer on -O2 and above,

For a very simple reason, and yes, many of them have read the man pages. If you read the Portage documentation, you will realize that Portage will occasionally filter some flags without telling you at the ebuild level. It is therefore valid to string individual flags after your "meta" optimization flag, in the hope that if the ebuild filters, say, -O3, you will still retain some optimization behavior. In fairness, however, anything that filters -O2 would most likely filter all flags, so there is not much point there.
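The idea can be sketched as a make.conf fragment; the specific flags here are illustrative, not a recommendation:

```shell
# make.conf sketch: individual flags strung after the "meta" -O flag, so
# some optimization behavior survives if an ebuild filters -O3.
CFLAGS="-O3 -march=athlon-xp -fomit-frame-pointer -funroll-loops"
CXXFLAGS="${CFLAGS}"
```

If an ebuild strips -O3 from that line, -fomit-frame-pointer and -funroll-loops are still passed through.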

The specific combination you point out, "-O2 -fomit-frame-pointer", is not the default behavior for Intel class processors. From the gcc man page:

Quote:

"-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging."

Since omitting the frame pointer interferes with stack unwinding (and thus debugging) on Intel-class processors, GCC does not do this unless explicitly told to on those systems. So hopefully you didn't give your pet-peeve advice to anybody running an Intel-class system =)

Quote:

Since omitting the frame pointer is destructive to rewinding on Intel class processors, GCC does not do this until explicitly indicated on those systems. So hopefully you didn't give your pet peeve advice to anybody running an Intel class system =)

-Twist

Actually, I haven't, because I just read up on it the other day. Sorry about the small rant there, and you are right about "most test results". *backs away slowly*

Quote:

stuff like that, but written by someone who knows what they are talking about.

I used to have one of those, but I got too much abuse from lovech^W clueless ricers over it, so I got rid of it.

Seriously though, I'm trying to get the following in as official policy on how we handle CFLAGS:

Quote:

Guidelines for Flag Filtering

If a package breaks with any reasonable CFLAGS, it is best to filter the problematic flag if a bug report is received. Reasonable CFLAGS are -march=, -mcpu=, -mtune= (depending upon arch), -O2, -Os and -fomit-frame-pointer. Note that -Os should usually be replaced with -O2 rather than being stripped entirely. The -fstack-protector flag should probably be in this group too, although our hardened team claim that this flag never ever breaks anything...

If a package breaks with other CFLAGS, it is perfectly ok to close the bug with a WONTFIX suggesting that the user picks more sensible global CFLAGS. Similarly, if a bug report is received and is determined or suspected to be caused by daft CFLAGS, an INVALID resolution is appropriate.

Take from that what you will about what you should have in make.conf...

Quote:

Actually, I havn't, because I just read up on it the other day. Sorry about the small rant there, and you are right about "most test results". *backs away slowly*

LOL ok I guess I came across a bit too strong there. I was honestly just trying to convey the idea that -fomit-frame-pointer was not automatic with -O or above on Intel arch machines.

As for the 'most test results' thing, it's a common problem that I fall into myself, even as a coder and somebody very conversant with compilers and their behavior. This is why I like Acovea conceptually; it seems either that it's somewhat flawed in implementation (not enough breadth in the example benchmark code) or simply that GCC is prone to many contradictory behaviors that can't be generalized across an architecture, but must be taken in the context of a specific body of code. I tend to favor the latter myself, but again, it means nothing without more extensive testing =)

Quote:

I used to have one of those, but I got too much abuse from lovech^W clueless ricers over it, so I got rid of it.

Seriously though, I'm trying to get the following in as official policy on how we handle CFLAGS:

I think that is an OK set of rules for the general case, sure. While it's annoying to get non-bugs submitted by Gentoo users who are doing unreasonable things with the compiler, it sort of comes with the territory and is part of the Gentoo flexibility/experience, so I would urge you not to turn to the dark side of bitterness on this issue =). I think the "stable" keyword ebuilds should all be responsible for handling any set of input CFLAGS while retaining stable behavior (note that this most likely means rejecting almost all of them), and your proposed policy would get us there.

If wishes were fishes though...I'd love to use the participatory nature of the Gentoo community to get definitive on some of this stuff. For instance, while we can label -fomit-frame-pointer as "safe" in that it doesn't break any known ebuilds, it would be great if we had a bug-buddy like facility to actually KNOW that for sure as part of the base install. Except maybe not as cumbersome and ugly as bug-buddy =). Something like -ftracer with the newer GCC releases, which (according to the GCC mailing list) should be entirely safe and improve the ability of other optimizations. -funit-at-a-time should also be safe, short of consuming extra memory for compiles, but I honestly don't have a feel at all for whether it breaks anything as I don't use it. It would be great if we could poll and consolidate results with some of these flag variants automatically.

Ah well. In the meantime, don't try this at home! Experienced coder here, attempting compilations on a closed course with appropriate safety gear. The sponsors remind you not to exceed your ability or that of your gear: stick with stable keywords and don't override ebuild behavior. Thank you, drive through.

If you want stable, don't set CFLAGS at all in make.conf. Just rely upon the profile-provided settings. Gentoo developers are not here to correct every single possible stupid thing you can do with make.conf.

that kinda throws the whole 'freedom of choice' philosophy out the window though. sorry, just poking your buttons. i do appreciate all the work you do here for us and gentoo in general.

seriously though, i was surprised that "-pipe" isn't on that whitelist. are there actually situations where -pipe needs to be filtered, or has it caused problems? (just curious)

Thanks for your work -- I'm glad someone has done something useful with my reporting scripts.

Comments:

* It seems that, apart from compilation problems, your Acovea "alt" CFLAGS did pretty well. This suggests that Acovea, for the algorithms you have chosen, has more reliably found negatives than affirmatives (apparently, the "maybe"s promoted by -O3 provided a big performance boost).

* The algorithms you have chosen are far more complex and heuristic than those employed by Acovea as benchmarks. On the former, this means that memory-intensive optimizations might be beneficial since you are moving a lot of data and burning a lot of cycles anyway. On the latter, I'm not knowledgeable enough to impute how this would affect the performance of specific switches ....

* Is not GCC optimization for AMD notoriously bad? As you say in another post, the cross-dependencies of the various switches might be too extensive for even Acovea to dissect with its evolution.

Yes - I would hazard a guess that GCC is decent at deciding on its own when a method would hurt (probably based on total instruction/tick count) and simply doesn't use it. So although those options came out as "no" according to Acovea, in real use GCC might benefit from them occasionally.

Quote:

The algorithms you have chosen are far more complex and heuristic than those employed by Acovea as benchmarks.

The biggest fault I can find with my "real world" examples is that they are all memory intensive. They all pump a lot of data in total, and they all do lots of fairly wide address-space lookups and compares. However, it's the nature of the beast that these types of apps are not only good demonstrations but also where I tend to spend a lot of wait time in real life. For purely algorithmic benchmarks I could have used nbench or the like, and for heavy mathematics, xfractint or celestia on a complex solution, I suppose. I might still go back and do that.

Quote:

Is not GCC optimization for AMD notoriously bad?

AMD themselves are actively helping the GCC crew get their instruction scheduling up to par, and it is reportedly vastly improved in the later versions. Since I tested with 3.4.3, I figured that was good enough. It's definitely true that the GCC 2.9 series was simply awful with AMD processors, and the early 3 series (aside from general brokenness and stability issues) wasn't renowned either. I could, and probably will, run the same kind of comparison on one of my P4 machines; I just haven't gotten around to it yet.

Quote:

Something like -ftracer with the newer GCC releases, which (according to the GCC mailing list) should be entirely safe and improve the ability of other optimizations.

Which only goes to show that the GCC mailing list can't be entirely trusted, since -ftracer breaks teTeX in a very weird fashion (executables don't crash, but they weirdly duplicate the file name they get passed, which of course causes the file not to be found). For details see https://bugs.gentoo.org/show_bug.cgi?id=50417 (the ebuild *still* doesn't filter that flag, and I'm pretty peeved about it... I even begged nicely).

As far as I'm aware, teTeX is the only package broken by -ftracer though. I use a bashrc-based filtering so teTeX doesn't get passed -ftracer but the rest do.

The last line of that filter actually takes -O2 to -O3 - it's there because many ebuilds filter -O3. I chose to ignore that, but then that's my choice, and I wholeheartedly agree with the default restrictive filtering.
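A bashrc-based filter of this kind can be sketched roughly as follows; this is only a sketch, assuming Portage exports PN (the package name) and CFLAGS into the build environment, and using sed so it stays plain sh:

```shell
#!/bin/sh
# Sketch of /etc/portage/bashrc-style per-package CFLAGS filtering.
filter_cflags() {
    pn=$1
    flags=$2
    case "$pn" in
        tetex)
            # -ftracer breaks teTeX (Gentoo bug #50417): strip it
            flags=$(echo "$flags" | sed 's/-ftracer//')
            ;;
    esac
    # the last line takes -O2 to -O3, since many ebuilds filter -O3 itself
    echo "$flags" | sed 's/-O2/-O3/'
}
```

In a real bashrc this would run as something like `CFLAGS=$(filter_cflags "${PN}" "${CFLAGS}")`.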

As to your Acovea findings - it's hardly surprising. The best optimizations for any software are, in this order:
(a) Having a good design from the start and not as an afterthought
(b) Using algorithms that are best suited for the task
(c) Using the compiler's profiling facilities to identify bottlenecks
.
.
.
(somewhere around letter m) Compiler flags

Quote:

As to your Acovea findings - it's hardly surprising. The best optimizations for any software are, in this order:
(a) Having a good design from the start and not as an afterthought
(b) Using algorithms that are best suited for the task
(c) Using the compiler's profiling facilities to identify bottlenecks
.
.
.
(somewhere around letter m) Compiler flags

USE flags are specific to Gentoo and indicate a system-level interest (or not) in the application/feature indicated by the flag.

Compile flags are switches to indicate to GCC particular code generation behavior. In this case, -f indicates an "option", whereas -m indicates a "machine option". Most commonly -m is something that is specific to the processor type that is the compile target.

It is correct to use -m to specify fpmath, sse, and mmx switches. All are particular to the processor, not to code generation in general.
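An illustrative CFLAGS line showing the distinction; the values are examples for an Athlon XP class processor, not a recommendation:

```shell
# -f switches control generic code generation; -m switches are
# machine-specific (here: FP math unit, SSE and MMX instruction sets).
CFLAGS="-O2 -march=athlon-xp -mfpmath=sse -msse -mmmx -fomit-frame-pointer"
```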

This test is invalid. Because you are evolving compile flags independently for each test and then accepting the ones that on average give you the best performance, the test is not even as good as:
1) start with no optimizations and run each program, taking a reading.
2) turn on an optimization, test, take a reading.
3) turn on a different optimization and test.
4) Use the optimizations that give benefits and drop the others.
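The one-flag-at-a-time procedure above can be sketched as a loop; `benchmark_with` is a placeholder that in a real run would rebuild the program with the given CFLAGS and time it, and the canned timings below exist only so the sketch is self-contained:

```shell
#!/bin/sh
# One-flag-at-a-time testing: time each candidate flag in isolation
# against a baseline and keep only the flags that help.
benchmark_with() {
    # placeholder: pretend only -funroll-loops changes the timing
    case "$1" in
        *-funroll-loops*) echo 900 ;;
        *) echo 1000 ;;
    esac
}

BASE="-O2"
KEPT="$BASE"
baseline=$(benchmark_with "$BASE")
for flag in -funroll-loops -fno-gcse -ftracer; do
    t=$(benchmark_with "$BASE $flag")
    if [ "$t" -lt "$baseline" ]; then
        KEPT="$KEPT $flag"      # keep flags that improve the time
    fi
done
echo "kept: $KEPT"
```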

The genetic algorithm is probably worse, because it does not do a comprehensive test and takes MUCH longer. The GA test is supposed to show which flags work best IN TANDEM, so taking the best average results will probably give worse performance than -O2 or -O3, which the gcc team has probably already tested for best average performance independently. What you need to do is:
1) Only include in the list of flags to test, those which you will have no qualms using in your final system build, ie, leave out -malign-double
2) For each generation of the GA, *ALL* benchmarks are run and a rating is given to that "set" of flags as the GA fitness function
3) run the GA until you are satisfied with the overall results (since the set of flags is rather small as far as GA's are concerned, 20 generations should be good with a population of 50-100).
4) use ALL the flags of the winner GA on your system, because what you are testing is not "flag -fomg-fast is beneficial" but rather "flags -fsometimes-good -falmost-never -fduh-use-me-always and -mim-a-typewriter when used in tandem beats -O3 on average"

Basically, what I am saying is that if you run six independent GAs and then take the average results, your data is completely meaningless and you're better off sticking with the tried and true "-O2 -pipe". Rewrite this GA if you want real data out of it.
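The fitness function being proposed (step 2 of the list above) can be sketched like this; `run_benchmark` is a placeholder that in a real run would rebuild and time one benchmark app with the candidate flags, and the canned timings exist only to keep the sketch self-contained:

```shell
#!/bin/sh
# Aggregate GA fitness: score ONE candidate flag set by running ALL
# benchmarks with it and summing the times, so flags are selected for
# how they perform in tandem across the whole suite.
run_benchmark() {
    app=$1
    flags=$2   # unused by the placeholder; a real version compiles with these
    # placeholder timing derived from the app name
    echo $(( ${#app} * 100 ))
}

fitness() {
    flags=$1
    total=0
    for app in flac lame bzip2; do
        t=$(run_benchmark "$app" "$flags")
        total=$(( total + t ))
    done
    echo "$total"   # lower is better
}
```

A GA would call `fitness "-O2 -funroll-loops ..."` once per individual per generation and breed the lowest-scoring flag sets.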

Quote:

This test in invalid. Because you are evolving compile flags independently for each test, then accepting the ones that on average give you the best performance, the test is not even as good as:
1) start with no optimizations and run each program, taking a reading.
2) turn on an optimization, test, take a reading.
3) turn on a different optimization and test.
4) The optimizations that give benefits, use, the others drop.

Yes, except that you then lose information about poor interactions altogether. By picking out the best average flags, you are not just extracting the switches that are beneficial over a variety of algorithms, but also those that "play nice" with others. This varies from machine to machine, it seems.

Quote:

What you need to do is:
1) Only include in the list of flags to test, those which you will have no qualms using in your final system build, ie, leave out -malign-double
2) For each generation of the GA, *ALL* benchmarks are run and a rating is given to that "set" of flags as the GA fitness function
3) run the GA until you are satisfied with the overall results (since the set of flags is rather small as far as GA's are concerned, 20 generations should be good with a population of 50-100).
4) use ALL the flags of the winner GA on your system, because what you are testing is not "flag -fomg-fast is beneficial" but rather "flags -fsometimes-good -falmost-never -fduh-use-me-always and -mim-a-typewriter when used in tandem beats -O3 on average"

This is not too different from now, except for step 3. The danger here is that you overoptimize to this particular aggregate situation, which is only a rough mapping to the space of all apps you will be compiling. By testing each algorithm separately, you have a larger base of variegated populations whose best traits you can extract statistically.

The bottom line is that I'm testing for "nice" flags, while you are trying to find an optimum. In the case that interactions are very important to performance (i.e., strong correlation), as you contend, there's no way that the small Acovea tests can predict the performance of real-world apps, so the discussion is moot -- every app would have to be optimized separately anyway. If the optimizing interactions are weak but the interactions that cause breakage are strong (as I contend), then you want to draw "valuable" traits from a broad base of organisms. (*)

This is all borne out by the reports in the old thread (mostly anecdotal): programs aren't any faster, but they build more reliably and run with far more stability than with the canonical -O2 or -O3.

One good suggestion you make is to diligently weed out flags that you would never use anyway, like "-malign-double", from the set of available flags -- they might cause bad interactions with certain flags that are otherwise valuable.

(*) It should be noted that the intended purpose of Acovea is to test compilers against the different supplied benchmarks, or a specific algorithm against a specific compiler. My scripts generate the inference I describe.