
First disclosed in January 2018, the Meltdown and Spectre attacks have opened the floodgates, leading to extensive research into the speculative execution hardware found in modern processors, and a number of additional attacks have been published in the months since.

Today sees the publication of a range of closely related flaws named variously RIDL, Fallout, ZombieLoad, or Microarchitectural Data Sampling. The many names are a consequence of the several groups that discovered the different flaws. From the computer science department of Vrije Universiteit Amsterdam and Helmholtz Center for Information Security, we have "Rogue In-Flight Data Load." From a team spanning Graz University of Technology, the University of Michigan, Worcester Polytechnic Institute, and KU Leuven, we have "Fallout." From Graz University of Technology, Worcester Polytechnic Institute, and KU Leuven, we have "ZombieLoad," and from Graz University of Technology, we have "Store-to-Leak Forwarding."

Intel is using the name "Microarchitectural Data Sampling" (MDS), and that's the name that arguably gives the most insight into the problem. The issues were independently discovered by both Intel and the various other groups, with the first notification to the chip company occurring in June last year.

A recap: Processors guess a lot

All of the attacks follow a common set of principles. Each processor has an architectural behavior (the documented behavior that describes how the instructions work and that programmers depend on to write their programs) and a microarchitectural behavior (the way an actual implementation of the architecture behaves). These can diverge in subtle ways. For example, architecturally, a processor performs each instruction sequentially, one by one, waiting for all the operands of an instruction to be known before executing that instruction. A program that loads a value from a particular address in memory will wait until the address is known before trying to perform the load and then wait for the load to finish before using the value.

Microarchitecturally, however, the processor might try to speculatively guess at the address so that it can start loading the value from memory (which is slow) or it might guess that the load will retrieve a particular value. It will typically use a value from the cache or translation lookaside buffer to form this guess. If the processor guesses wrong, it will ignore the guessed-at value and perform the load again, this time with the correct address. The architecturally defined behavior is thus preserved, as if the processor always waited for values before using them.

But that faulty guess will disturb other parts of the processor; the main approach is to modify the cache in a way that depends on the guessed value. This modification causes subtle timing differences (because it's faster to read data that's already in cache than data that isn't) that an attacker can measure. From these measurements, the attacker can infer the guessed value, which is to say that the attacker can infer the value that was in cache. That value can be sensitive and of value to the attacker.
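To make the timing side of this concrete, here is a minimal sketch in C of the measurement an attacker performs (illustrative only; the probe array and function names are hypothetical, and real attacks add serialization around the timed load):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtscp (GCC/Clang, x86) */

    /* One cache line (64 bytes) per possible byte value. */
    static uint8_t probe[256 * 64];

    /* Time a single load; a short latency means the line was cached. */
    static uint64_t time_load(volatile uint8_t *p)
    {
        unsigned aux;
        uint64_t start = __rdtscp(&aux);
        (void)*p;                       /* the measured access */
        return __rdtscp(&aux) - start;
    }

    /* After speculative code has touched probe[secret * 64], scan all
       256 lines; the one that loads fast reveals the byte. */
    static int recover_byte(void)
    {
        int best = 0;
        uint64_t best_time = UINT64_MAX;
        for (int v = 0; v < 256; v++) {
            uint64_t t = time_load(&probe[v * 64]);
            if (t < best_time) { best_time = t; best = v; }
        }
        return best;
    }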

Buffering...

MDS is broadly similar, but instead of leaking values from cache, it leaks values from various buffers within the processor. The processor has a number of specialized buffers that it uses for moving data around internally. For example, line fill buffers (LFB) are used to load data into the level 1 cache. When the processor reads from main memory, it first checks the level 1 data cache to see if it already knows the value. If it doesn't, it sends a request to main memory to retrieve the value. That value is placed into an LFB before being written to the cache. Similarly, when writing values to main memory, they're placed temporarily in store buffers. Through a process called store-to-load forwarding, the store buffer can also be used to service memory reads. And finally, there are structures called load ports, which are used to copy data from memory to a register.
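As a rough conceptual sketch of where the store buffer sits (the buffering itself is invisible to the program; this is only to anchor the terminology):

    /* Architecturally this is just a store followed by a load.
       Microarchitecturally, the store sits in a store buffer entry
       until it retires, and the load that follows is typically
       serviced by store-to-load forwarding from that entry rather
       than from the L1 cache. */
    void update_and_check(int *slot, int value)
    {
        *slot = value;       /* enters a store buffer entry */
        int check = *slot;   /* likely forwarded from the store buffer */
        (void)check;
    }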

All three buffers can hold stale data: a line fill buffer will hold data from a previous fetch from main memory while waiting for the new fetch to finish; a store buffer can contain a mix of data from different store operations (and hence, can forward a mix of new and old data to a load buffer); and a load port similarly can contain old data while waiting for the new data from memory.

Just as the previous speculative execution attacks would use a stale value in cache, the new MDS attacks perform speculation based on a stale value from one of these buffers. All three of the buffer types can be used in such attacks, with the exact buffer depending on the precise attack code.

The "sampling" in the name is because of the complexities of this kind of attack. The attacker has very little control over what's in these buffers. The store buffer, for example, can contain stale data from different store operations, so while some of it might be of interest to an attacker, it can be mixed with other irrelevant data. To get usable data, many, many attempts have to be made at leaking information, so it must be sampled many times.


On the other hand, the attacks, like the Meltdown and Foreshadow attacks, bypass the processor's internal security domains. For example, a user mode process can see data leaked from the kernel, or an insecure process can see data leaked from inside a secure SGX enclave. As with previous similar attacks, the use of hyperthreading, where both an attacker thread and a victim thread run on the same physical core, can increase the ease of exploitation.

Limited applicability

Generally, an attacker has little or no control over these buffers; there's no easy way to force the buffers to contain sensitive information, so there's no guarantee that the leaked data will be useful. The VU Amsterdam researchers have shown a proof-of-concept attack wherein a browser is able to read the shadow password file of a Linux system. However, to make this attack work, the victim system is made to run the passwd command over and over, ensuring that there's a high probability that the contents of the file will be in one of the buffers. Intel accordingly believes the attacks to be low or medium risk.

That doesn't mean that they've gone unfixed, however. Today, Intel is shipping a microcode update for chips from Sandy Bridge through first-generation Coffee Lake and Whiskey Lake. In conjunction with suitable software support, operating systems will be able to forcibly flush the various buffers to ensure that they're devoid of sensitive data. First-generation Coffee Lake and Whiskey Lake processors are already immune to MDS attacks using the line fill buffers, as this happened to be fixed as part of the remediation for the L1 Terminal Fault and Meltdown attacks. Moreover, the very latest Coffee Lake, Whiskey Lake, and Cascade Lake processors include complete hardware fixes for all three variants.
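Intel's documented approach overloads the VERW instruction: on processors whose updated microcode enumerates MD_CLEAR, executing VERW with any writable data-segment selector also overwrites the affected buffers. A sketch of the kind of sequence involved (GCC-style inline assembly for x86-64; treat the details as illustrative rather than a drop-in mitigation):

    /* With MD_CLEAR microcode, VERW flushes the store buffers, fill
       buffers, and load ports as a side effect of its legacy
       segment-verification job. */
    static inline void clear_cpu_buffers(void)
    {
        unsigned short ds = 0;
        __asm__ __volatile__(
            "mov %%ds, %0\n\t"   /* any writable segment selector */
            "verw %0"            /* triggers the buffer overwrite */
            : "+m"(ds) : : "cc", "memory");
    }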

For systems dependent on microcode fixes, Intel says that the performance overhead will typically be under three percent but, under certain unfavorable workloads, could be somewhat higher. The company has also offered an official statement:

Microarchitectural Data Sampling (MDS) is already addressed at the hardware level in many of our recent 8th and 9th Generation Intel® Core™ processors, as well as the 2nd Generation Intel® Xeon® Scalable Processor Family. For other affected products, mitigation is available through microcode updates, coupled with corresponding updates to operating system and hypervisor software that are available starting today. We've provided more information on our website and continue to encourage everyone to keep their systems up to date, as it's one of the best ways to stay protected. We'd like to extend our thanks to the researchers who worked with us and our industry partners for their contributions to the coordinated disclosure of these issues.

Like Meltdown, this issue does appear to be Intel-specific. The use of stale data from the buffers to perform speculative execution lies somewhere between a performance improvement and an ease-of-implementation issue, and neither AMD's chips nor ARM's designs are believed to suffer the same problem. Architecturally, the Intel processors all do the right thing—they do trap and roll back faulty speculations, as they should, as if the bad data was never used—but as Meltdown and Spectre have made very clear, that's not enough to ensure the processor operates safely.


161 Reader Comments

...A few things are quite a bit faster, up to 10x. Most are not really all that much faster at all, in the twice as fast to almost nothing range.

The one in that second link for H.264 Video Encoding is particularly relevant, as that is probably something they did hand optimize to some extent. Notice that even though it is an application which should really benefit, it barely speeds up.

Are we looking at the same link/image? Turning on the most basic optimizations speeds up H.264 by 40%. You call that "barely speeding it up"? If Intel made a processor that was 40% faster than its current fastest offering, it would be front-page news for a week.

Looking at the differences in that second link graph-by-graph between O0 and O3, I'm seeing 80%, 252%, 485%, 33%, and 47%. I don't understand how you're not impressed with these speedups.

But regardless, what's your point? This conversation started with somebody saying that programmers are generally going to be worse than compilers at optimizing stuff, if only because of the sheer volume of code involved in most modern projects. Pointing to a web site that indicates how well compilers can optimize stuff is only half of an argument. Basically it's irrelevant unless you can also point to a web site that shows how much humans can optimize the same code... can you? If you can't, I don't know what you're trying to say.

That statement was "Programmers are generally going to be worse at any kind of optimization no matter the architecture."

That certainly is not true for "any kind", as most of the really good optimizations are entirely done by humans. Compilers tend only to get the easy ones which do not require any sort of modification to the logic or knowledge of the specific task, but that is fairly limiting.

It probably is not true for Chromium in general either. My expectation is that if you somehow produced a "clean" version without all of the human optimizations (such that the code is written somewhat competently, but for readability only, not speed), then compiled it with -O3, it would likely be far worse than if you took their hand optimized code and used -O0, or -Og.

A 40% speedup is small if you are optimizing by hand; few would bother with that. You generally go through quite a bit of optimization before you hit things which are that minor, and rarely do you need to go that far.

The video encoder likely only sees that benefit as they did not bother with some of this as well; I am sure they could have optimized the rest of their encoding routine if they had reason to do so. If compilers will generally do a specific optimization for you, why bother? I do not bother suggesting what to inline, for example. That is something a compiler is good at sorting out, even if I could do so.

A benchmark for code would be trivial to come up with, and entirely meaningless if you are talking about human code decisions affecting optimization. That works for a compiler as it is the same code, with a different flag.

A very poorly written program can very easily be tens or hundreds of thousands of times slower than one which someone with time and knowledge wrote to execute quickly. I could trivially produce a slow version and a fast version of some routine with a delta of thousands of times, if I get to write both (and they do enough processing to matter, as a lot of the big numbers are due to scaling).

As a real-world example, I recently fixed a report someone complained about, which was taking almost five minutes to generate, with just a few statements to set an index; it now runs in substantially less than a second. All I really did was go look through some SQL and make sure everything they were looking up in a join had an index. That was very quick, and very effective.

I have a specialty compiler of my own which had an early version which took approximately four hours to crunch through a large directory of input source code. Version 2 after some optimizations? 18 seconds.

I had a data analysis utility in Java which I was asked to look at as a consultant many years ago when their monthly cycle was taking more than a month. It was well written, but my C rewrite for speed ran as quickly as they could feed data to the system, and could complete in less than five minutes while doing quite a bit more than the original system after some hardware upgrades to better saturate its input (this would not have affected the run time of the prior system, as it could easily feed it). I did chase down improvements of just 20% or so for that one, but I also spent maybe four months on optimizing it by the end.

Programmer optimization will generally change what the code does in a more substantial way, simply because it can. I could go look at the assembly and try to improve upon it a little bit by rearranging instructions. Maybe I could make it 2% faster than the compiler can in this way, maybe five. I have other options which are not available to a compiler though, and they are much superior.

A 40% speedup would not be something I would target manually in most cases, the first passes are looking for orders of magnitude (I want more like 1,000%, or 80,000% if I am going to bother). It takes rather a lot of optimization before you get down to thinking a 2x speedup is a good use of time, and 40% gets into the range where you need to ask how universal that is between CPUs with various cache sizes and other features.

To some degree it is cheating, as I am aware of things the compiler is not. I do not need to make the same guarantees in a task specific program as a compiler does, and can substantially change how the code operates to fit what I want to do.

Even for really simple stuff, my options are better: say you have a set of lists such that some are lookup lists, which are obviously kept sorted so you can search them rapidly when you need to find a given value, and some will need to be processed entirely and are therefore in a random order.

I can do things like decide that sorting all of them on the key of the most frequently used lookup is a gain, since when it goes to search for an item, the entries it checks during a binary search will tend to already be in the cache from the last lookup. I can even have it prefetch both possibilities in what it expects to be the last loop iteration or two, as that is likely where searches will differ; this both guards against a miss and acts as a preload for the next search.
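A sketch of that prefetching trick in C (hypothetical names; __builtin_prefetch is the GCC/Clang hint, and the exact placement would need tuning):

    #include <stddef.h>

    /* Binary search that prefetches the midpoints of both possible
       next halves: whichever way the comparison goes, the needed line
       is already on its way into the cache. */
    static ptrdiff_t search(const int *keys, size_t n, int target)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            __builtin_prefetch(&keys[lo + (mid - lo) / 2]);          /* left half  */
            __builtin_prefetch(&keys[mid + 1 + (hi - mid - 1) / 2]); /* right half */
            if (keys[mid] < target)
                lo = mid + 1;
            else if (keys[mid] > target)
                hi = mid;
            else
                return (ptrdiff_t)mid;
        }
        return -1;
    }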

I can also do things like decide that nice human readable data structure someone created is not ideal, as you spend your time looking at only one or two values. Those get moved to their own structure and grouped together so you have more values you care about in the same page or cache line.
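And a sketch of that hot/cold structure split (hypothetical record layout):

    /* Instead of scanning an array of full records and dragging every
       field through the cache, keep the one or two hot fields in
       their own densely packed array. */
    struct record_cold {        /* rarely-read fields */
        char name[64];
        char description[192];
    };

    struct record_hot {         /* the fields every scan looks at */
        int key;
        int status;
    };

    /* Parallel arrays: a pass over 'hot' touches far fewer cache
       lines (8 entries per 64-byte line here) than a pass over the
       combined 264-byte records would. */
    static struct record_hot  hot[10000];
    static struct record_cold cold[10000];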

That changes logic, so the compiler cannot do any of that for you.

The cost of running a compiler for an extra 20 seconds or so is very low though, so it certainly does win from a cost perspective. Much of the time nobody even tried to optimize the thing you are running even a little bit, and if you want a speedup it is coming from the compiler or the CPU. If someone cared though, there are a lot of cases where someone could go make it a thousand times faster without resorting to anything as effort intensive as individually choosing instructions.

...As someone who does that for a living, I have to disagree, and there are a lot of people optimizing web browsers "by hand". They wouldn't be as fast as they are without them. In cases like javascript performance or video decoding, it goes all the way down to assembler level or intrinsics triggering specific instructions.

Sure, those who do that are not average programmers, and average programmers are more likely to waste their time by overriding the compiler, but saying it is literally impossible is still flat-out wrong.

The OP was saying that nobody has the time to hand-optimize an entire web browser. Saying that some small pieces are hand-optimized doesn't contradict that point.

EDIT: The implication that started this sub-thread is wrong anyway. No amount of programming skill is going to give you the same benefit as speculative execution. They are two different things that complement each other; they can't replace each other. So this whole discussion about compilers or hand-optimization of assembly is irrelevant.

That is all true. I was only objecting to the last statement as it was going too far, which is why I quoted only that part of the comment to make it clear I wasn't objecting to the rest.

Though I would call hand-optimizing small pieces, if those pieces have been "manually" identified as central to performance, hand-optimizing the entire application. But I know what you meant, and am just being pedantic now.

I think Intel made reasonable decisions. It turns out that some of the industry's assumptions are possibly wrong, and that at least some side channels are hard to address in software, or, plausibly, need greater control of the hardware to suppress the side channel. Further, there are many more side channels than previously considered.

Well, it was a matter of taking a risk the previous time. They won in performance and lost in security.

It's like I'm sending you an important document that you need. You can either check the file name, title, chapters, and content; or check only the name and title; or even only the name, and trust that the rest is good. Of course, only checking the name and title is much more time efficient for you, so it's very attractive to do just that.

That's kind of what Intel did. They assumed that all software is good and they can trust it, which obviously is not a correct assumption.

This time around things were more complicated than that, but the possibility did exist to prevent such issues; it just was not considered important enough to fully implement.

Inevitably, CPU architectures will nowadays have to take VM environments into consideration much more than before, since virtualization is more or less taking over the server market.

I think Intel made reasonable decisions. It turns out that some of the industry's assumptions are possibly wrong, and that at least some side channels are hard to address in software, or, plausibly, need greater control of the hardware to suppress the side channel. Further, there are many more side channels than previously considered.

Given Intel is vulnerable to this and Meltdown, and is worse off on Spectre: no. I'd argue the architectural behaviour is wrong.

They were not ever reasonable decisions - putting performance ahead of security on machines designed with virtualisation instructions, specifically to enable multiple virtual machines with different security domains (i.e., certainly anything Pentium onwards) to be run on said processor(s).

Side-channels have traditionally been regarded as software problems, not hardware problems

That might have something to do with said side channels having traditionally been implemented in software, not hardware. This is not what happened here.

Cryptographic cache side channels say hi. Been a thing forever. Cryptographic power side channels say hi. Been a thing forever. Cryptographic latency side channels say hi. Been a thing forever.

All of those are well-known side channels that are hardware-based and have existed pretty much forever. No hardware solution is viable. Must be solved in software.

Yes, because the behavior feeding information to those side channels is defined in the software. That's why I consider those side channels to be implemented in software. You fix the software, you close the channel.

Quote:

These are also, FYI, primarily software problems as well. The architectural state makes no guarantees around these functions as they are non-architectural. If you need them in a secure known state, your only option is to flush.

Yes, because the behavior feeding information to these side channels is defined in the hardware. That's why I consider these side channels to be implemented in hardware. You fix the hardware, you close the channel.

You can't fix these ones solely in software, so if you need a secure architectural state you do need to fix that behavior in hardware. If it is enough to work around it you can do that by crippling your software with otherwise spurious flushes.


And no, the architectural state for pretty much every architecture out there does not deal with temporal structures like branch predictors, caches, etc. The closest they ever get is memory order models, which none of these side channel attacks violate.

...That statement was "Programmers are generally going to be worse at any kind of optimization no matter the architecture."

That certainly is not true for "any kind", as most of the really good optimizations are entirely done by humans. Compilers tend only to get the easy ones which do not require any sort of modification to the logic or knowledge of the specific task, but that is fairly limiting.

This is a pretty stupid conversation. If you're going to say that a human is better or worse than a compiler at optimizing, then the presumption must be that they're doing the same basic task. In other words, they're taking the logic and algorithms expressed by the source code and converting it to machine code. The human doesn't get to change a bubble sort to a quicksort or whatever, since that's not the algorithm in the source code.

The task of coding higher-level algorithms isn't something that a compiler does, so to say that a human is better at it makes about as much sense as saying that a human is better at making a pizza than a compiler. Well, duh.

Caveat; I am not an engineer. But, does the old CISC (x86)/RISC discussion fit in here? I believe that CISC gets around the x86 limited number of CPU registers by creating a whole slew of virtual ones and then rotating them in at need to the physical ones. This shuffle is less important for RISC as it can have many more physical registers. Does this shuffling make speculative computation more important for x86 CISC than RISC?


Any modern processor will essentially translate CISC instructions to RISC instructions before doing anything else. So whether the instructions stored in memory are CISC or RISC is almost irrelevant. (Only relevant due to the small overhead of this translation process.)

...The human doesn't get to change a bubble sort to a quicksort or whatever, since that's not the algorithm in the source code. The task of coding higher-level algorithms isn't something that a compiler does, so to say that a human is better at it makes about as much sense as saying that a human is better at making a pizza than a compiler. Well, duh.

And yet, this is generally the nature of human optimization. Why go small at the cost of much effort, when you can go big relatively cheaply?

At a base level it is all a change to the algorithm; the computer just has a far more limited view. This limits what it can do to optimizations which can be done with possibly a lot of effort, but little knowledge. Some of those optimizations at O3 are not in fact safe, just likely safe (it looks like this has changed, and recent versions are just very intolerant of undefined behavior, and possibly slower than O2).

Compilers have actually gotten quite a bit better in their reach (especially once things like link-time code generation and profiling became popular), but they still have nothing on some humans.

I can write assembly too (x86 assembly was my first language, 27 years ago). The last time I tried (about two years ago at this point), I could beat a compiler in the instruction-generation game. I do not beat it by much though, so we could say the compiler does a pretty reasonable job here.

That makes the obvious focus the things the compiler cannot do well, which can nevertheless have a far greater effect.

Yes, I am "cheating" in a way, but the people who pay me, and the users who use my programs, would prefer I cheat where possible.

...Yeah that doesn't sound like branch prediction, just straight-line prefetching into the instruction cache is what that is, no?

Indeed, raxx7 seems to be trying to make the point that branch prediction is the same as speculative execution (it isn't) or that anything that has a branch predictor also does speculative execution (it doesn't). A branch predictor is just some logic that predicts branches, that's it. Could be used for a bunch of stuff from prefetching instructions to full-on OoO speculative execution.

Erm... no. raxx7 is trying to make the point that any pipelined processor tends to need to speculatively work on the results of branch prediction, or you get a very slow processor.

A single-issue in-order pipelined processor takes N cycles to process an instruction, but the execution is staged so it can start work on the next instruction in the clock cycle immediately after. This causes a problem with branches: the branch instruction will take M (<=N) cycles to be resolved. One option is to stall the pipeline frontend when a branch is encountered; I can't think of a single processor which does this.

The common option is to keep the pipeline working based on the outcome of branch prediction* and flushing the pipeline when the prediction turns out wrong.
(* Where "prediction" can be as simple as assuming all branches are not taken.)

For a minimal pipeline length, M can be so short that the speculatively worked instructions will not reach the execution stage (eg, go to ALU or access memory) before the pipeline is flushed.

But this does not necessarily hold true for wider and deeper in-order CPUs.

TL;DR: Speculative execution can be found in (deep/wide) in-order CPUs; it's not exclusive to OoO CPUs.

raxx7 has an excellent point, which I have been meaning to support.

I am a programmer, but I do know VHDL and Verilog. From the programming side, this is basically a matter of the hardware needing maybe 5 cycles to actually finish your addition. It therefore needs at least 5 instructions it can dispatch in the meantime which do not rely upon the results of that addition, or it needs to wait and do nothing.

That would lead to the natural requirement that anything which depends upon that result be several operations later at a minimum, and this frequently does not fit real world usage (anything with a lot of if or case statements suffers greatly).

The processor therefore guesses, and starts down the path it thinks is the answer. If it is right, it is five times faster. It can usually predict this with some efficiency, but I must admit I do a fair bit of optimization which is intended to make the code more predictable to the CPU. In no small part that boils down to making sure you get the same answer repeatedly if you can.
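A classic example of that kind of tuning, as a sketch: arranging for a branch to resolve the same way in long runs, which is exactly what the predictor learns.

    #include <stddef.h>

    /* On sorted input the comparison below is false for a long run,
       then true for a long run, so the predictor is nearly always
       right. On random input it's right about half the time, and
       each miss costs a pipeline flush. */
    static long sum_above(const int *v, size_t n, int threshold)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            if (v[i] >= threshold)      /* predictable if v is sorted */
                sum += v[i];
        return sum;
    }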

...And yet, this is generally the nature of human optimization. Why go small at the cost of much effort, when you can go big relatively cheaply? ...

Yes, of course algorithms should be optimized. Do you think anybody ever said otherwise?

I have no idea what your point is. It seems like you're just posting to disagree with anything I post. Do you have a point?

My point is that compilers barely scratch the surface when it comes to optimizing your algorithms (which is really all you or the computer are doing in any case).

It is quite nice that they do so much for you, but what they do not do is far greater. You want a better compiler because it speeds up any program where someone bothers to set an optimization flag (you might be depressed at how often even this is not done, and the program you are running was compiled with the default -O0 or the equivalent on another OS).

You want a better CPU because it speeds everything up, as both compiler optimizations and manual ones will target it to some degree. You actually get a better relative speedup from the CPU with unoptimized code, as there is more to work with.

This is what tempts so many to think they can write a compiler which will do it all for you, but usually that ignores just how much you need to know about the task. This is a very hard problem, all considered.

...My point is that compilers barely scratch the surface when it comes to optimizing your algorithms (which is really all you or the computer are doing in any case).

Only if you start out with a bad/slow algorithm. Why would you assume that?

Quote:

This is what tempts so many to think they can write a compiler which will do it all for you, but usually that ignores just how much you need to know about the task. This is a very hard problem, all considered.

I don't think anybody ever thought or said this. You're arguing with nobody against a strawman here. Remember, this conversation started with somebody wondering if we should revisit VLIW/EPIC architectures as a way to mitigate security problems with speculative execution. EPIC processors lean heavily on the compiler to expose parallelism. All other things being equal, compilers are pretty good at this. Maybe if we put some more effort into it, they would be better. If you disagree with any of this, say so. Otherwise you're talking to nobody.


I do disagree with this; compilers are not very good at exposing parallelism, which is why they failed. I also do not think they could easily do better.

CPUs do this by actively violating the logical rules, and trying to undo it. That is much easier in some ways, but it has a real cost if you try to do it in software, and seems not to end up faster in most cases. My opinion is that it would need to come with a great CPU speed boost to possibly work in the real world.


I also think the rationale behind IA64-like architectures is flawed but that's unrelated to whether or not compilers can produce good code for them.

You say that compilers aren't good at exposing parallelism... what's your rationale for thinking this?

Remember the original Pentium (P5)? It required compilers to group pairs of independent instructions so they could be executed in parallel. This is a pretty straightforward task, compilers can do it well, and software that was recompiled for the Pentium usually ran 20-30% faster on a Pentium. But you're saying that compilers can't do this sort of thing very well? I mean, what's your evidence? What's your rationale?


As with many things, they are good at it up to a point, but this point is insufficient to make a VLIW CPU work well, never mind something like automatic scaling to multiple cores.

I am not saying they are entirely unable, but they are very limited compared to a specialized human.

Keep in mind that when your compiler does something for you, not only do people know how to do that optimization, they could generalize it. That is far harder to do in several ways.

...As with many things they are good at it up to a point, but this point is insufficient to make a VLIW CPU work well, nevermind something like automatic scaling to multiple cores....

I agree that no amount of compiler or human optimization would make an EPIC processor competitive with a modern OoO processor, but that's independent of whether or not compilers are good at the sort of optimization necessary to take advantage of an EPIC processor's available resources.

If you're so convinced that IA64 compilers do a bad job vs. humans optimizing IA64 code by hand, surely you have examples... ? I can't find any after a couple minutes of googling.

...TL;DR: Speculative execution can be found in (deep/wide) in-order CPUs; it's not exclusive to OoO CPUs.

It is not speculative execution unless the instructions actually do reach the execution stage.

The claim that the 486 predicts branches as not taken is misleading. It doesn't do any prediction; by its pipelined nature it will have started decoding the instructions following the jump in a straight line simply because it doesn't know that there is a branch coming up. When it gets around to noticing that this instruction is a conditional branch, it starts prefetching the target in the same cycle as the jump is executing. This article (http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf, p. 34) has the description, written by the chief architect of the 486. Notice that it works this way whether the branch is conditional or unconditional. While it might vaguely make sense to call straight-line decoding after a conditional branch a "prediction" that it isn't taken, it's clearly nonsense to claim that an unconditional branch is predicted to be not taken.

A few things are quite a bit faster, up to 10x. Most are not really all that much faster at all, in the twice as fast to almost nothing range.

The one in that second link for H.264 Video Encoding is particularly relevant, as that is probably something they did hand optimize to some extent. Notice that even though it is an application which should really benefit, it barely speeds up.

In fact, looking more closely, their actual note is "In the case of programs like x264 that tend to already rely upon hand-tuned code, there isn't much of a difference beyond the most basic optimization levels."

Um, that H.264 thing goes the opposite way to your assertion. Even though it was already hand-optimized, which by your argument should mean there's nothing left for the compiler to do, the compiler was still able to measurably speed it up.

...That certainly is not true for "any kind", as most of the really good optimizations are entirely done by humans... If someone cared though, there are a lot of cases where someone could go make it a thousand times faster without resorting to anything as effort intensive as individually choosing instructions.

You've gone off on a complete tangent here. The comment was not about the kind of "optimizations" that involve changing a quadratic algorithm to linear etc. [although amazingly enough, these days compilers can solve linear recurrence relations at compile time and turn two nested loops into a few multiplications/additions].
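For instance, a loop like the following is collapsed by GCC's "final value replacement" at -O2 and above into its closed form (the simplest case of what's described above):

    /* GCC replaces the whole loop with the closed form n*(n-1)/2,
       computed with a multiply and a shift; no loop survives. */
    unsigned triangular(unsigned n)
    {
        unsigned sum = 0;
        for (unsigned i = 0; i < n; i++)
            sum += i;
        return sum;
    }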

The original statement was about VLIW architectures and compilers having a hard time optimizing for them, and humans not having enough time to do so. This is obviously not a question about picking a better algorithm -- this is absolutely a question of loop unrolling, removing data dependencies, etc. Humans are not capable of doing that for a large project for any architecture, whether it's CISC, RISC, or VLIW.

If 40% isn't all that important, the H.264 code would probably be fine written in C and vectorized by an optimizing compiler, assuming it's got the right basic algorithms implemented. Any hand-written assembler is probably unnecessary by now; it may have been needed a few years ago -- back then I think only Intel's proprietary compiler did a decent job at vectorizing.

...From the programming side, this is basically a matter of the hardware needing maybe 5 cycles to actually finish your addition. It therefore needs at least 5 instructions it can dispatch in the meantime which do not rely upon the results of that addition, or it needs to wait and do nothing.

That is not how it works, actually. The 486 that was being talked about is tuned to execute one instruction per cycle, even though it needs a couple more cycles to decode the instructions. If you're working purely in the registers (or on-chip cache), data dependencies don't matter to it, because it is a purely in-order processor with no parallelism; i.e., if you have an add instruction followed by another one that depends on the result, there's no problem even though each instruction needs a total of 3 cycles (2 decodes and one execute) to finish. Data dependencies become an issue once you have multiple execution units you need to keep occupied, and when your clock speed is so high that many instructions take more than one cycle to execute, not as a total of decode + execute.


Any modern processor will essentially translate CISC instructions to RISC instructions before doing anything else. So whether the instructions stored in memory are CISC or RISC is almost irrelevant. (Only relevant due to the small overhead of this translation process.)

.... and that "Small overhead" of the translation process may well even be outweighed by the more compact CISC code being more L1/L2 cache memory efficient. Thus more program code can be cached. There's a small win there to help offset...


Yeah might be more efficient re: main memory bandwidth and L2 utilization... but caching x86 instructions in L1 has been a bummer... at one point Intel was using trace caches, not sure if they still do... at another point, they reworked the L1 I-cache to include data about where instructions began... both involve more complication and overhead than if they just cached the instructions just like any other data.

No they don't. Superscalar is the ability to execute multiple instructions in parallel. There are lots of processors that can do this without the ability to do any speculative execution. (The original Pentium is a good example, as are much more recent ARM processor cores.)

Actually... no. The original Pentium had speculative execution, and so did the 80486 (branch prediction). The 486 only had static branch prediction, but the Pentium had a simple form of dynamic branch prediction. The various in-order superscalar modern ARM cores have more sophisticated branch prediction.

Although you are correct that superscalar execution is independent of speculative execution, processors actually became speculative before they became superscalar, as even for a single-issue pipelined processor, branch prediction is a big performance improvement.

Do you have a citation for this? The interwebs seem to indicate that the Pentium introduced branch prediction, and the Pentium Pro introduced speculative execution.

And branch prediction is a primitive form of speculative execution. It's right there in the name... if the computer does calculations based on a prediction of the results of computations that haven't been performed yet, that is speculation, and trying to predict which way a branch will go ahead of time is, well, speculative.

Now some might say this isn't true speculation since the prefetched instructions aren't fully executed until after the branch is taken, but they are in the pipeline and are being decoded and other speculative preparations for their execution are taking place. It's just that those preparations are easy to unwind on failure, since no registers or other visible computational state have been modified, just the initial stages of the pipeline, which are invisible and can just be dropped.

A few things are quite a bit faster, up to 10x. Most are not really all that much faster at all, in the twice as fast to almost nothing range.

The one in that second link for H.264 Video Encoding is particularly relevant, as that is probably something they did hand-optimize to some extent. Notice that even though it is an application which should really benefit, it barely speeds up.

In fact, looking more closely, their actual note is "In the case of programs like x264 that tend to already rely upon hand-tuned code, there isn't much of a difference beyond the most basic optimization levels."

Um, that H.264 thing goes the opposite way to your assertion. Even though it was already hand-optimized, which by your argument should mean there's nothing left for the compiler to do, the compiler was still able to measurably speed it up.

I think it fits what I said perfectly, and it does not substantially speed up.

That is in comparison to the programs which were not hand-optimized, where you see a speedup of several times to a full order of magnitude from compilation with an optimization flag on. 40% is small compared to that. The author of that benchmark seems to agree with me; I am not the only one who looks at that and says it barely speeds up.

They did optimize it until there was nothing left to do, but they almost certainly did that using -Og, not -O0. Nobody actually uses -O0 for development, as it essentially disables your debugging and profiling utilities too, so if you want to look at the improvement the compiler made to what they did by hand, you should be looking at -Og as the baseline.

-Og is your debug build, and working at -O0 would actively make your life harder in several ways, aside from not being representative of the performance of the final build, which would make it harder to optimize by hand in the first place. It does not get much faster than -Og in that chart. It was probably never built with -O0 at any point during development.

That speedup only applies taking it from -O0 to anything else, so only very basic optimizations matter. If you ask the compiler for more extensive optimizations, it does not speed up any further, and some levels slow down (-O2 is slower than -O1). Considering that they likely never intended a compile with no optimization, they likely did not bother with the really basic stuff any compiler would do for them. They may not even know, as it is basically the same speed all the way down to -Og. If you want an accurate assessment of what the compiler did to improve on their starting point, it would be scores of 135.5 vs 142, so very close (about a 5% difference).

If you are going to object to me changing an algorithm in order to speed things up, I am going to object to the compiler doing so. If it unrolls a loop or inlines something which was specified as a separate function, it changed the program. The programmer specified one iteration per loop, much as they specified a certain style of sort. Compilers do not restrict themselves to merely rearranging instructions these days; they do anything they can that still results in correct operation. So does a human.
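
For what it's worth, here is a tiny C sketch of the kind of rewriting being described; the gcc behaviors noted in the comments are typical at -O2 but not guaranteed:

/* build (assumed): gcc -O2 -S sketch.c   -- then inspect sketch.s */

static int square(int x) {      /* at -O2 gcc will usually inline this: the call disappears */
    return x * x;
}

long sum_of_squares(const int *v, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) /* may be unrolled and/or vectorized, despite being
                                   written as one addition per iteration */
        sum += square(v[i]);
    return sum;
}

The observable result is the only contract the compiler honors; everything else, including the loop structure and the function boundary the programmer "specified", is fair game.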

Also, it probably would be much, much slower if written in simple C and only optimized by the compiler. A 40% difference might be fine, but this is not going to be a 40% difference. It is nearly inconceivable that someone managed to optimize this to the point that gcc cannot really speed it up, yet failed to check it against a basic build to ensure they were really speeding it up substantially.

Erm... gcc -O0 (no optimization) is the default; for that reason it's certainly the most common use case. gcc -O0 does not disable or interfere with debugging and profiling.

gcc -Og is an optimization mode which does not interfere with debugging and profiling.

Optimize debugging experience. -Og should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience. It is a better choice than -O0 for producing debuggable code because some compiler passes that collect debug information are disabled at -O0.

Like -O0, -Og completely disables a number of optimization passes so that individual options controlling them have no effect. Otherwise -Og enables all -O1 optimization flags except for those that may interfere with debugging.
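
A small C sketch of the practical difference being quoted; the flags are real gcc options, but exactly what a given gcc version keeps visible will vary:

/* build (assumed):
 *   gcc -O0 -g demo.c   -- slowest code, everything observable in a debugger
 *   gcc -Og -g demo.c   -- optimized, but passes that hurt debugging stay off
 *   gcc -O2 -g demo.c   -- 'scale' and even the loop below may be folded away,
 *                          so a debugger can report them as <optimized out>
 */
#include <stdio.h>

int main(void) {
    int scale = 3;              /* a candidate for constant propagation */
    int total = 0;
    for (int i = 0; i < 10; i++)
        total += i * scale;     /* at -O2 this whole loop can become the constant 135 */
    printf("%d\n", total);
    return 0;
}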

Not specifying one is indeed the most common, but not if you are optimizing at all. Leaving it unspecified usually happens simply because the programmer is not aware of the -O flag. It also tends to mean they are not debugging it on that system.

Wilco Dijkstra of ARM argues, "Unfortunately unlike other compilers, GCC generates extremely inefficient code with -O0. It is almost unusable for low-level debugging or manual inspection of generated code. So a -O option is always required for compilation. -Og not only allows for fast compilation, but also produces code that is efficient, readable as well as debuggable. Therefore -Og makes for a much better default setting."

You pretty much always use -Og if you want to look at the assembly.

People not only test and debug software without optimization, people actually often distribute and use software without any optimization. E.g., commonly used build tools like autoconf and CMake default to no optimization flags. IIRC, most of the software packages you'll find in Debian/Ubuntu are built without any optimization flags. I'm too lazy to check about Red Hat.

As for the cases when one wants to look at the assembly produced by the compiler, there are several distinct situations. The most common is wanting maximum performance and checking whether the code generated by the compiler is fast. But at that point the developer is usually aware of optimization flags and is only interested in code generated by the desired release optimization level (e.g., -O2).

The other case is when the developer finds weird bugs and starts suspecting that the compiler is doing something funny.

What the author of that benchmark is saying is that the various "ricer" options don't do all that much more, not that optimization does nothing. Taking -Og or -O1 etc as the baseline doesn't make any sense if you want to know what the compiler is doing. That's just moving the goalposts.

I agree with most of what you've said (i.e. you don't want -O0 if you're debugging) but do you have any citation for the claim that most people leave it at -O0 because they're not aware of the -O option? I find that unbelievable. Nobody outside of a complete beginner just starting to code is going to be unaware of the options. I would wager that -O2 is the most common setting in production code.

autoconf and CMake default to no optimization because it is up to the person using autoconf or CMake to specify the flags. Your assertion that most packages in Debian are built without optimization is utter nonsense.

In major projects it is definitely my expectation that -O2 would be the setting. I am currently trying to find something to contest the assertion raxx made that most projects in the repos are built at -O0.

When you are talking about random programmers, I frequently see people miss it entirely and not specify the -O option at all. Many of them come from Windows, and are developing there. By volume, this is a lot of code, even if it is not heavily used code.

Edit: Also, I wanted to mention that when programmers feel they have found a compiler bug, because they can get code to work at -O0 but not -O2, it is rarely actually a compiler issue.

What really happens in the majority of cases is that the programmer relied upon undefined behavior, or the code contains a real error which is simply not exposed in all cases. While someone claims to be running into a compiler issue probably every couple of months, the ones which turn out to be real are very rare. I probably have enough fingers to count the real bugs I have seen in major C compilers.
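
A classic minimal C example of that pattern, assuming gcc on a typical two's-complement machine; the output genuinely differs between optimization levels, and that is the programmer's fault, not the compiler's:

/* build and compare (flags are real; exact results vary by gcc version):
 *   gcc -O0 ub.c && ./a.out
 *   gcc -O2 ub.c && ./a.out
 */
#include <stdio.h>
#include <limits.h>

/* Signed overflow is undefined behavior, so the optimizer is allowed to
 * assume it never happens and fold this comparison to "always true". */
static int next_is_bigger(int i) {
    return i + 1 > i;
}

int main(void) {
    /* Often prints 0 at -O0 (the addition actually wraps to INT_MIN) and
     * 1 at -O2 (the comparison was folded away), with no compiler bug involved. */
    printf("%d\n", next_is_bigger(INT_MAX));
    return 0;
}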

CFLAGS Options for the C compiler. The default value set by the vendor includes -g and the default optimization level (-O2 usually, or -O0 if the DEB_BUILD_OPTIONS environment variable defines noopt).

There is at least one other default in there which notes it does not work below -O1, so making it -O0 is actually going to take some effort in comparison. I do not find this likely unless the program only works at -O0 (which is generally due to bad code).

...I think it fits what I said perfectly, and it does not substantially speed up.

Can we stop with this absolute insanity of claiming that a 40% speedup is somehow not substantial? What if I made your computer 40% faster? What if I made your car 40% faster? What if you flew in an airplane that was 40% faster than other airplanes? Stop being a godd**n idiot.

Quote:

That is in comparison to the programs where there were not hand optimized

Again, why are you talking about programs where the algorithm/source code isn't already optimized? Who brought up such programs? Who ever wanted to talk about such programs? The conversation from the outset was about how compilers perform at optimizing for EPIC architectures (presumably vs. other architectures), and actually had nothing to do with how well humans can optimize anything, let alone how well humans can optimize stuff that was s**ttily written from the outset. The discussion is over here, and you're about a mile away from it.

Take a walk, compare yourself to a jogger. It looks like they might have a substantial improvement in speed.

Compare that speed increase to someone taking a rocket to orbit. Does it still look substantial?

Does it make sense to fly all passengers on a Concorde, when a 737 would do it much more cheaply?

I do contest that it is substantial in that case; it only seems so to you because, as an end user, you have no ability to control what the program really does. If all you have is a binary (or source and no programming knowledge), it does not matter that someone could make it many hundreds or thousands of times faster; you are stuck with what you have.

1) Nobody bothers with manual optimization for a 40% speedup. Even if you wanted to, you would be doing it in your free time, as vanishingly few people would pay you prevailing programmer wages to do something like hand-pick instructions for such a piddly gain. It is enormously ineffective from a cost perspective unless you have already done a great deal of optimization, to the point that very few programs will ever see it (not even major, heavily used ones). Even then, you would write it in assembly, and there would be no C version to compare it to unless you wrote a reference version as documentation.

2) They did not optimize it from -O0; they did it from -Og. The speedup is much less than 40% when you look at the version they actually optimized: it is low single digits. I would not assume that holds on all systems either; it is too close for such an assumption.

3) It is meaningful when we are discussing what a compiler could reasonably do for you. EPIC does not work out because you are asking the compiler to do too much: things it currently takes a human to do, or which require the machine to guess. I remain unconvinced we can write a compiler for an EPIC system such that it will generally be faster than a more conventional processor devoting that die area to attempting the same thing in hardware.

The programs which are not well optimized at all are where both compilers and CPUs shine. Some in that list are several times faster from -Og to -O2.

Maybe you can write such a compiler, but I think writing a compiler of this type is a very hard problem, and I am not confident I could do so effectively enough to make it worthwhile. The chip may in theory be faster, but if it takes a human optimizing for it to achieve its potential, it is doomed.

That 40% is great if it is free, but not if you had to spend a lot of time on the specific program.

And there is no point in any advantage people might say Intel has over AMD when, every time a flaw is discovered in Intel processors, we lose another 3%, 5%, 25%, or more in performance; that adds up pretty quickly.

...Take a walk, compare yourself to a jogger. It looks like they might have a substantial improvement in speed.

Compare that speed increase to someone taking a rocket to orbit. Does it still look substantial?

So what's your point again? That it's irrelevant that compilers can make the H.264 code 40% faster because you can make it a million times faster if you optimize the algorithms yourself as a human being? Well f**king have at it then. Let us know how that works out for you.

Quote:

1) Nobody bothers with manual optimization for a 40% speedup.

Nonsense. Apparently you've never worked on anything that takes a significant amount of compute time. Optimizing something such that it's 5% faster can often save hours of compute time and meaningful amounts on AWS bills (or whatever cloud compute cluster you use... or your own electricity bills if you own your own hardware). Continuing this discussion with you when you're going to post such trolling garbage is pointless.

Well, my actual point is that humans just have better options, so nobody would chase something so small... also, they already optimized this one, likely to much greater effect than you recognize.

You are making a large and very incorrect assumption: that the compiler is primarily responsible for that run time. It is highly likely that, had nobody gone over it by hand, it would be several times to tens of times slower regardless of the settings you give the compiler (as in, there are no possible settings you could feed it which would make a program written by someone with no idea what is efficient come within a few times the performance).

My second point is that it is not 40% either. To get that number you need to use a setting it is likely nobody actually used at any point while writing it, meaning you are not testing their optimized version; you are testing something else entirely. What are you comparing? It certainly is not their work product, which is -Og, so what is it?

I am sure that if they can beat -O2 (which they did, if -O1 is faster than -O2; keep in mind the compiler is basically making it *slower* if you tell it to optimize their optimized code), they could beat it starting at -O0, had they had a reason to do so. They just do not, as it would be a waste of time and probably harder than writing it in asm at that point.

Someone would need to write a rather large check for it to be likely anybody is going to "have at it"; this is universally not free (which seems to be the real issue you are not understanding here: we leave much, much better things on the table over lesser costs than this). That may realistically have a six- or seven-figure cost in this case. Are you in a position to pay for something like this, and would you spend your money on it if so? Do you think someone else will? Do you think someone who does this as a day job is going to volunteer to do it for free?

A "fair" test would be to take something which appears unoptimized in any case (meaning one of the ones seeing a 500% or more boost from compilation at -O2... otherwise you are really just seeing if I can optimize it better than the human who already did so), then see if a person could revise that program to beat the compiler. That would still be expensive, note how that pretty much only happens for programs where it really matters.

I regularly have people skip optimizations which are expected to produce a full order-of-magnitude performance increase if they would take real time. It takes a lot of AWS billing to make up for even a little programming time, and there are always too many projects on the schedule, such that even if it did, it may not really be worth it.

Having a programmer chase a 40% speedup because hardware only really works if you do that... no thanks, I will buy something else.

Edit: Also, I just want to throw out there, as a supporting point, how different performance is between languages.

You end up with a lot of cases which max out the measured time, because the multiple between languages is in the thousands.

It is very common for this multiple to be in the 5 - 10 range. Language alone can lead to a performance difference of tens to thousands of times, and next to that, the numbers here are really trivial.

If they allowed assembly, you would get another 10x at least, in my opinion. This is rarely worth it, though, and obviously it is not the target of a high-level language benchmark.

I want to reiterate that not only are the optimizations gcc makes known, they are ones which generalize, and generalizing is the far harder problem. There are still a number of things which can be done if you can analyze the code in the kind of detail a human is capable of, which a compiler will not (and cannot safely) touch.