I have a few machines that are running either Athlon MPs or Athlon XPs, and it's getting irritating building separate binaries for each one. I'd like to build one binary for all of those machines - what's the most aggressive -march= argument I can use that'll work across both of those? Alternatively, where are those friggin' things documented so I can look this up myself?

Apparently, -march=athlon-mp and -march=athlon-xp both imply -mcpu=athlon -m3dnow -msse (where -mtune is the same thing as -mcpu, except that -mcpu is deprecated in gcc 3.4+).

However, I'm still not entirely clear on what to do here. It looks like I can use -mcpu=athlon -m3dnow -msse on both. Great. So I also use -mfpmath=sse. However, the manual says that the code will only be scheduled optimally for the athlon, but there will be no non-386-compatible code generated without a -march switch. So, do I just use -march=athlon, or should I semi-randomly choose between -march=athlon-xp / -march=athlon-mp since they're pretty much the same thing?

It *appears* that merely including -msse and -m3dnow (as well as -mfpmath=sse) in conjunction with -march=athlon should get what I want - maximum processor-specific compile-time tuning for both my Athlon XP and MPs, but I'd be interested in hearing from someone who's not just pulling that out of their butt after 15 minutes of searching and reading.
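One way to sanity-check which instruction-set options are safe on every machine is to intersect the feature flags each box reports. The following is only a sketch: the two flag strings are hypothetical stand-ins for the `flags` line of /proc/cpuinfo on each machine, and the mapping covers just a handful of GCC's -m options.

```python
# Hypothetical "flags" lines, standing in for /proc/cpuinfo on two machines.
MACHINE_FLAGS = {
    "athlon-mp-box": "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge "
                     "mca cmov pat pse36 mmx fxsr sse syscall mp mmxext "
                     "3dnowext 3dnow",
    "sempron-box": "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge "
                   "mca cmov pat pse36 mmx fxsr sse syscall mmxext "
                   "3dnowext 3dnow",
}

# cpuinfo flag name -> GCC option enabling the matching instruction set.
GCC_OPTION = {
    "mmx": "-mmmx",
    "sse": "-msse",
    "sse2": "-msse2",
    "3dnow": "-m3dnow",
}


def common_gcc_options(machine_flags):
    """Return the -m options whose CPU feature is present on every machine."""
    flag_sets = [set(line.split()) for line in machine_flags.values()]
    shared = set.intersection(*flag_sets)
    return sorted(opt for flag, opt in GCC_OPTION.items() if flag in shared)


print(" ".join(common_gcc_options(MACHINE_FLAGS)))
```

With these sample strings, the "mp" flag appears on only one machine and so drops out of the intersection, while -msse2 is left out because neither box reports sse2.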

The mp has that extra "mp" cpuflag, while the other machines are actually Semprons (which I was told are supposed to use -march=athlon-xp, and that makes sense as AMD is marketing their "budget" XPs under the Sempron name without actually changing much beyond the CPU ID). So, marketing-wise, I'm not working with any XPs - and they're both actually mp capable. Here's some cpuinfo goodness though, for what it's worth:

(among others) and see how that works. That should be identical to -march=athlon-xp, but when I come back to it in a few months, I'll know what I did this way. If I use athlon-xp on the mp machines, I'm apt to wonder what I was thinking (and yes, comments in the file to remind myself, those are out of the question!).

You could use -march=athlon -mtune=athlon-xp -msse
Don't use -mfpmath=sse; that's actually slower on the Athlon XP (and I guess the MP and Sempron as well). Just don't set that flag at all. Other flags that could be helpful are -falign-jumps=16 -falign-loops=16 -falign-functions=64 -mno-push-args -maccumulate-outgoing-args
_________________
"Those who deny freedom to others deserve it not for themselves." - Abraham Lincoln
Free Culture | Defective by Design | EFF


I normally build with -Os, which disables the alignment optimizations.

-maccumulate-outgoing-args implies -mno-push-args, and generates larger code. While both the alignment options and the argument handling options almost always result in slightly faster code, they also increase memory usage, which makes it more likely that I'll run beyond the physical memory limit and start swapping. I tend to not go for optimizations that will run me out of memory in exchange for an imperceptible speed increase...

Thanks for pointing out the -mfpmath=sse thing, though. Do you have any benchmarks or documentation to back that up? I believe you, but was unable to find anything to either agree or disagree (I'm curious as to why it's slower).

I don't understand why you're so concerned about space and memory, unless you use an old box with very limited harddisk space and limited ram. I use -O2 -finline-functions (the only extra one in -O3 that's worth using on x86, as I understand it).

I don't have any documentation ready, but I do read up on what knowledgeable users, both on this forum and elsewhere, say and recommend, and I follow up on their links, as well as the GCC documentation. I don't claim to be an expert, but following the advice of experts and reading on the subject whenever I come across it has led me to use these flags, with good results. I have a stable and speedy system that I am content with.


The size of the executable isn't due to a concern with drive space (the difference is negligible at that level) or really with RAM. I use -Os on my modern SMP systems rather than -O2, because the main difference is that -Os doesn't use the alignment optimizations. Those alignment optimizations do things like inserting padding before functions, loop heads, jump targets, etc. While that can result in a very small performance increase (basically due to simplified mathematical operations), it also spreads code out. That spreading out of code increases the chance of cache misses.

On a uniprocessor system with a modern processor that has lots of on-chip cache, that's not a huge deal. However, on an SMP system with separate chips - and thus separate on-chip caches - the performance hit of cache misses can be pretty significant - significant enough to outweigh the minor alignment benefit. Having to fetch stuff from system memory is *way* slower than fetching from processor cache - it's like the difference between swapping and fitting into physical memory. It's this space concern that's the main reason why it's a bad idea to build everything with -O3 (well, that and -funroll-loops, which isn't part of -O3 itself but tends to get added alongside it, results in slower code when the number of iterations isn't known in advance, which is the case in lots of loops that I write - and presumably in other coders' stuff).
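To put rough numbers on how alignment padding spreads code out, here is a deliberately simplified model. It assumes every function simply starts on the next multiple of the alignment (as with -falign-functions=64) and ignores anything smarter the compiler or linker may do. The function sizes are made up for illustration.

```python
def padded_size(sizes, alignment):
    """Total bytes used when each block starts on a multiple of `alignment`."""
    offset = 0
    for size in sizes:
        # Round the start of this block up to the next aligned address.
        offset = (offset + alignment - 1) // alignment * alignment
        offset += size
    return offset


# Hypothetical function sizes in bytes.
function_sizes = [40, 100, 24, 70]

packed = padded_size(function_sizes, 1)    # no alignment padding at all
aligned = padded_size(function_sizes, 64)  # each function on a 64-byte boundary

print(packed, aligned, aligned - packed)
```

In this toy case the same four functions occupy 234 bytes packed and 326 bytes when aligned to 64, so more cache lines are needed to hold the same code.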

I guess I'm technically worrying about memory, after all, but I'm really worrying about L1 and L2 cache usage, rather than system memory. And there's not much you can do to increase the size of the on-chip cache, which is typically on the order of 128-512K. My pretty current Athlon MP system, for example, has 1.5GB RAM, but the chips only have 256K of cache. When you're trying to fit as much code as possible into 256K, it's worthwhile to worry about space. Not to mention that compilation time is slightly improved over -O2, and significantly improved over -O3. Referring to it as optimizing for size is deceptive, though, since people do usually think of system memory and drive space first - forgetting about the cache, which is arguably more important.

That said, I was able to find more information on the SSE thing which agreed - basically the SSE implementation on the Athlons isn't all that awesome - it's more for compatibility. The Athlons do, however, have a kick-ass 387 unit (which is what's used in place of SSE).

Referring to it as optimizing for size is deceptive, though, since people do usually think of system memory and drive space first - forgetting about the cache, which is arguably more important.

Referring to it as size optimisation is perfectly accurate. Any erroneous assumptions anyone may make about how and where that is relevant are only self-deception. However, thanks for emphasising the effect of code size in relation to the cache; obviously important.

Quote:

That said, I was able to find more information on the SSE thing which agreed - basically the SSE implementation on the Athlons isn't all that awesome - it's more for compatibility. The Athlons do, however, have a kick-ass 387 unit (which is what's used in place of SSE).

That's why it's best to let the compiler decide rather than forcing the issue with -mfpmath.

OK, now I understand your reasoning. But as I understood it, those alignment optimizations are fine-tuned for the Athlon XP (don't know about other processors) to make optimum use of its cache, and are actually promoted by AMD themselves. (If you search or dig through the Cflags Central thread you might find the link.)


"Optimum" in this case probably refers to execution speed, not space. The alignment optimizations use padding, and padding wastes space, period. My concern, like I said, is more directed to my SMP systems than the uniprocessor systems. In an SMP system, if the processor looks for code that's not in its cache, in addition to having to fetch the information from the next level up of memory, it *also* has to wait for the other processor to finish whatever it's doing / free up locks, etc. If there are lots of cache misses, then the other processor could *also* be fetching from memory. So, the cache miss is now a little more than twice as expensive as it would be on a uniprocessor system. It's sort of like moving a uniprocessor system to system RAM that's only half as fast. That can pretty quickly negate any gain you get from slightly easier math operations when calculating jump targets.
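The "twice as expensive" argument can be made concrete with the usual back-of-the-envelope formula: average access time = hit time + miss rate x miss penalty. This sketch uses entirely hypothetical cycle counts and miss rates, and it takes the claim above (a doubled miss penalty under SMP contention) as an assumption rather than a measurement.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average cost of one memory access, in the same unit as the inputs."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical figures: 3-cycle cache hit, 2% miss rate, 100-cycle trip to RAM.
uniprocessor = average_access_time(hit_time=3, miss_rate=0.02, miss_penalty=100)

# Same code on an SMP box, assuming contention doubles the miss penalty.
smp = average_access_time(hit_time=3, miss_rate=0.02, miss_penalty=200)

# Padded code that misses slightly more often, still on the SMP box.
smp_padded = average_access_time(hit_time=3, miss_rate=0.03, miss_penalty=200)

print(uniprocessor, smp, smp_padded)
```

Under these assumed numbers, a small rise in miss rate costs twice as much on the SMP box as it would on the uniprocessor, which is the trade-off being weighed against the alignment gain.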

Of course, that's just a pessimistic generalization - it's plausible that using -O2 won't cause noticeable performance differences, and likely that a difference either way won't be felt under typical use. Since I'm generalizing tuning for both SMP and uniproc systems, the odds are in favor of -Os being better for me than -O2. Were I using just uniproc systems (or mostly uni systems), I'd more than likely use -O2, though.

Optimum in any domain is always relative to what you are trying to do, and once you are down scratching around the sub-1% performance gains, you really need to define exactly what use you are targeting your system for.

dannysauer: your reasoning sounds good, how about running a couple of tests? Could be interesting.

_________________Linux, because I'd rather own a free OS than steal one that's not worth paying for.
Gentoo because I'm a masochist
AthlonXP-M on A7N8X. Portage ~x86

Thanks Gentree, I underclocked it 'cause I was having hard locks on it. But I figured out that my graphics card isn't seated very tightly, and the card keeps slipping out and causing the lock.

I was running my 2500+ at 3200+ by clocking the memory up from 333MHz to 400MHz, so I thought it was something with that and just kicked it down to 200MHz. I'm just too lazy to reboot and kick it back up

(plus I've had it like this for 2 weeks and haven't really missed the speed all that much, but next reboot I'll fix it [damn Linux never needing reboots...])

The -O2 compile took almost 52s against 50.1s with -Os, i.e. -2%; small but nice.

Thanks for getting me interested in this. If files are typically coming out 20% smaller, I'll probably rebuild this machine on -Os.

I'm just doing xorg to see if I can see a speed-up as well.
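For anyone wanting to repeat this kind of comparison, a small harness along these lines takes the median of a few runs and reports the relative change. It is only a sketch: the commands in the `__main__` section are placeholders (they just start the Python interpreter), and you would substitute whatever was built with each set of flags.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Return the median wall-clock time (seconds) of running `cmd` `runs` times."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def percent_change(baseline, candidate):
    """Relative change of `candidate` against `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Placeholder commands: replace with the -O2 and -Os builds being compared.
    baseline = time_command([sys.executable, "-c", "pass"])
    candidate = time_command([sys.executable, "-c", "pass"])
    print(f"{percent_change(baseline, candidate):+.1f}%")
```

Using the median rather than a single run helps a little with noise from whatever else the machine is doing, though differences of a percent or two still deserve several repeats before being trusted.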

Does anyone know where the Socket 754 Semprons fit into the scheme of things? They show up as an unknown Hammer processor in /proc/cpuinfo, and I am led to believe they are essentially an Athlon 64 with less cache and the 64-bit extensions disabled. I can't post my cpuinfo at the moment as the machine is currently out on loan, but I can get it if this would help.


Currently I'm treating it as an Athlon XP.

It's a K8/Athlon64 with a smaller cache and 64-bit addressing disabled. I'd treat it like an Athlon XP with sse2 (so add -msse2) from the compiler's point of view, though.

Cool - I'm glad you've got a bit more time for this than I do presently. The amusing thing is that a big part of my job is benchmarking/tuning new hardware, but I haven't had time to run tests on my own machines.

My Sempron is one of the first 3100+, so I think it's based on the old 130nm core. I guess athlon-xp is the way to go for my cflags. As a slight aside, what would happen if I configured the kernel for a K8/Hammer rather than an Athlon(-XP)/K7?
_________________
www.technomancer.me.uk

Very handy site for getting the low-down on your exact CPU model; you'll need to pull it out and look at the serial number.

If you want to try -march and -mtune for Athlon 64, best check /proc/cpuinfo for sse? in the flags line and explicitly add or disable it in your cflags. This seems to be the main compatibility difference between these families; obviously the cache size etc. is irrelevant in this context.
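Checking for the sse? entries can be scripted rather than eyeballed. This is a rough sketch that pulls the SSE-family names out of cpuinfo-style text; the sample text is a made-up stand-in, not output from a real Socket 754 Sempron.

```python
# Hypothetical sample in the layout /proc/cpuinfo uses ("key<tabs>: value").
SAMPLE_CPUINFO = """\
processor\t: 0
vendor_id\t: AuthenticAMD
model name\t: (hypothetical sample)
flags\t\t: fpu vme de pse tsc msr mmx fxsr sse sse2 syscall mmxext 3dnowext 3dnow
"""


def sse_flags(cpuinfo_text):
    """Return the sorted SSE-family flags from the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return sorted(f for f in value.split() if f.startswith("sse"))
    return []


print(sse_flags(SAMPLE_CPUINFO))
```

On a real Linux box you would pass in the contents of /proc/cpuinfo instead of the sample, then add -msse, -msse2 and so on only for the names that actually show up.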

As for the kernel, I think it is probably safe to try: you will either get a kernel that boots or you won't. Since it is basically a hamstrung Hammer core it should be safe, but don't say I advised you to do it.

I know several people have built Athlon MP setups using Athlon XPs and some clever hardware hacking. To the best of my knowledge the chips are very similar, and since not many MP chips are required and they are built to the same specs, I assume the odds that a given set of XPs would work as MPs with the correct gates connected are quite high. (I'm not encouraging this: it will void your warranty and can make your system go boom - if you break it you get to keep both pieces.)

Anyway, I guess this means you can rely on both CPUs having basically the same characteristics, so the performance drop from using -march=i686 -mtune=athlon-xp would be minimal.

There's the basic guide - remember to research the specifics for your athlon core type and don't complain if things become unstable - the cores are probably untested before shipping at the new setting or could be marked down for failing the specific mp test.