Premature PessimizationEasy on yourself, easy on the code: All other things being equal, notably code complexity and readability, certain efficient design patterns and coding idioms should just flow naturally from your fingertips and are no harder to write than the pessimized alternatives. This is not premature optimization; it is avoiding gratuitous pessimization.— Herb Sutter, Andrei Alexandrescu —Whenever we need to optimise the code, we should profile it, plain and simple. However, sometimes it makes sense just to know ballpark numbers for relative costs of some popular operations, so you won’t do grossly inefficient things from the very beginning (and hopefully won’t need to profile the program later 🙂 ).

So, here it goes – an infographics which should help to estimate costs of certain operations in CPU clocks cycles – and answer the questions such as “hey, how much L2 read usually costs?”. While answers to all these questions are more or less known, I don’t know of a single place where all of them are listed and put into perspective. Let’s also note that while the listed numbers, strictly speaking, apply only to modern x86/x64 CPUs, similar patterns of relative operation costs are expected to be observed on other modern CPUs with large multi-level caches (such as ARM Cortex A, or SPARC); on the other hand, MCUs (including ARM Cortex M) are different enough so some of the patterns may be different there.

Last but not least, a word of caution: all the estimates here are just indications of the order of magnitude; however, given the scale of differences between different operations, these indications may still be of use (at least to be kept in mind to avoid “premature pessimisation”).

On the other hand, I am still sure that such a diagram is useful to avoid saying things “hey, virtual function calls cost nothing” – which may or may not be true depending on how often you call them. Instead, using the infographics above – you’ll be able to see that

if you call your virtual function 100K times per second on a 3GHz CPU – it probably won’t cost you more than 0.2% of your CPU total; however, if you’re calling the same virtual function 10M times per second – it can easily mean that virtualisation eats up double-digit percentages of your CPU core.

Another way of approaching the same question is to say that “hey, I’m calling virtual function once per piece of code which is like 10000 cycles, so virtualisation won’t eat more than 1% of the program time” – but you still need some kind of way to see an order of magnitude for the related costs – and the diagram above will still come in handy 🙁 .

Preliminaries aside, let’s take a closer look at those items on our infographics above.

ALU/FPU Operations

For our purposes, when speaking about ALU operations, we will consider only register-register ones. If memory is involved, the costs can be VERY different – and will depend on “how bad the cache miss was” for the memory access, as discussed below.

“Simple” Operations

These days (and on modern CPUs), “simple” operations such as ADD/MOV/OR/… can easily have costs of less than 1 CPU cycle. This doesn’t mean that the operation will be literally performed in half a cycle. Instead –

while all operations are still performed in a whole number of cycles, some of them can be performed in parallel

In [Agner4] (which BTW is IMO the best reference guide on CPU operation costs), this phenomenon is represented by each operation having two associated numbers – one is latency (which is always a whole number of cycles), and another is throughput. It should be noted, however, that in real-world, when going beyond order of magnitude estimates, exact timing will depend a lot on the nature of your program, and on the order in which the compiler has put seemingly-unrelated instructions; in short – whenever you need something better than an order-of-magnitude guesstimate, you need to profile your specific program, compiled with your specific compiler (and ideally – on a specific target CPU too).

Further discussion of such techniques (known as “out of order execution”), while being Really Interesting, is going to be way too hardware-related (what about “register renaming” which happens under the hood of CPU to reduce dependencies which prevent out-of-order from working efficiently?), and is clearly out of our scope at the moment 🙁 .

Floating-Point Operations

Costs of floating-point operations are taken from [Agner4], and range from 1-3 CPU cycles for addition (FADD/FSUB) and 2-5 cycles for multiplication (FMUL), to 37-39 cycles for division (FDIV).

If using SSE scalar operations (which apparently every compiler and its dog does these days), the numbers will go down to 0.5-5 cycles for multiplication (MULSS/MULSD), and to 1-4o cycles for division (DIVSS/DIVSD); in practice, however, you should expect more like 10-40 cycles for division (1 cycle is “reciprocal throughput”, which is rarely achievable in practice).

128-bit Vector Operations

For quite a few years, CPUs are providing “vector” operations (more precisely – Single Instruction Multiple Data a.k.a. SIMD operations); in Intel world they’re known as SSE and AVX and in ARM world – as ARM Neon. One funny thing about them is that they operate on “vectors” of data, with data being of the same size (128 bit for SSE2-SSE4, 256 bit for AVX and AVX2, and 512 bits for upcoming AVX-512) – but interpretations of these bits being different. For example, 128-bit SSE2 register can be interpreted as (a) two doubles, (b) four floats, (c) two 64-bit integers, (d) four 32-bit integers, (e) eight 16-bit integers, (f) 16 8-bit integers.

[Agner4] gives the costs of integer addition over 128-bit vector at < 1 cycle if the vector is interpreted as 4×32-bit integers, and at 4 cycles if it is 2×64-bit integers; multiplication (4×32 bits) goes at 1-5 cycles – and last time I checked, there were no integer division vector operations in x86/x64 instruction set. For floating-point operations over 128-bit vectors, the numbers go from 1-3 CPU cycles for addition and 1-7 CPU cycles for multiplication, to 17-69 cycles for division.

Bypass Delays

“switching between integer and floating-point instructions is not freeOne not-so-obvious thing related to calculation costs, is that switching between integer and floating-point instructions is not free. [Agner3] gives this cost (known as “bypass delay”) at 0-3 CPU cycles depending on the CPU. Actually, the problem is more generic than that, and (depending on CPU) there can also be penalties for switching between vector (SSE) integer instructions and usual (scalar) integer instructions.

Branching

The next thing which we’ll be discussing, is branching code. Branch (an if within your program) is essentially a comparison, plus a change in the program counter. While both these things are simple, there can be a significant cost associated with branching. Discussing why it is the case, is once again way too hardware-related (in particular, pipelining and speculative execution are two things involved here), but from the software developer’s perspective it looks as follows:

if the CPU guesses correctly where the execution will go (that’s before actually calculating condition of if), then cost of branching operation is about 1-2 CPU cycles.

however, if the CPU makes an incorrect guess – it results in CPU effectively “stalling”

“on modern Intel CPUs branch prediction is always dynamic (or at least dominated by dynamic decisions)Let’s note that while GCC’s __builtin_expect() macro is widely believed to affect branch prediction – and it used to work this way just 15 years ago, it is no longer the case at least for Intel CPUs anymore (since Core 2 or so). As described in [Agner3], on modern Intel CPUs branch prediction is always dynamic (or at least dominated by dynamic decisions); this in turn, implies that __builtin_expect()-induced differences in the code are not expected to have any effect on branch prediction (on modern Intel CPUs, that is). However, __builtin_expect() still has effect on the way code is generated, as described in “Memory Access” section below.

Memory Access

“Back in 80s, it was possible to calculate the speed of the program just by looking at assembly.Back in 80s, CPU speed was comparable with memory latency (for example, Z80 CPU, running at 4MHz, spent 4 cycles on a register-register instruction, and 6 cycles on a register-memory instruction). At that time, it was possible to calculate the speed of the program just by looking at assembly.

Since that point, speeds of CPUs have grown by 3 orders of magnitude, while memory latency has improved only 10-30-fold or so. To deal with remaining 30x+ discrepancy, all kinds of caches were introduced. Modern CPU usually has 3 levels of caches. As a result, speed of memory access depends very significantly on the answer to “where the data we’re trying to read, is residing?” The lower the cache level where your request was found – the faster you can get it.

L1 and L2 cache access times can be found in official documents such as [Intel.Skylake]; it lists access L1/L2/L3 times at 4/12/44 CPU cycles respectively (NB: these numbers vary slightly from one CPU model to another one). Actually, as mentioned in [Levinthal], L3 access times can go as high as 75 cycles if the cache line is shared with another core.

However, what is more difficult to find, is information about main RAM access times. [Levinthal] gives it at 60ns (~180 cycles if CPU is running at 3GHz).

Optimisation hint: DO improve data locality. For more discussion on it, see, for example, [NoBugs].

Besides memory reads, there are also memory writes. While intuitively write is perceived to be more expensive than read, most often it is not; the reason for it is simple – CPU doesn’t need to wait for the write to complete before going forward (instead, it just starts writing – and goes ahead with the other business). This means that most of the time, CPU can perform memory write in ~1 cycle; this is consistent with my experience, and seems to correlate with [Agner4] reasonably well. On the other hand, if your system happens to be memory-bandwidth-bound, numbers can get EXTREMELY high; still, from what I’ve seen, having memory bandwidth overloaded by writes is a very rare occurrence, so I didn’t reflect it on the diagram.

And besides data, there is also code.

Another optimisation hint: try to improve code locality too. This one is less obvious (and usually has less drastic effects on performance than poor data locality). Discussion on the ways to improve code locality can be found in [Drepper]; these ways include such things as inlining, and __builtin_expect().

Let’s note that while __builtin_expect(), as mentioned above, doesn’t have effect on branch prediction on Intel CPUs anymore, it still has an effect on the code layout, which in turn impacts code spatial locality. As a result, __builtin_expect() doesn’t have effects which are too pronounced on modern Intel CPUs (on ARM – I have no idea TBH), but still can affect a thing or three performance-wise. Also there were reports that under MSVC, swapping if and else branches of if statement has effects which are similar to __builtin_expect() ones (with “likely” branch being the if branch of two-handed if), but make sure to take it with a good pinch of salt.

NUMA

One further thing which is related to memory accesses and performance, is rarely observed on desktops (as it requires multi-socket machines – not to be confused with multi-core ones). As such, it is mostly server-land; however, it does affect memory access times significantly.

“When multiple sockets are involved, modern CPUs tend to implement so-called NUMA architecture, with each processor having its own RAMWhen multiple sockets are involved, modern CPUs tend to implement so-called NUMA architecture, with each processor (where “processor” = “that thing inserted into a socket”) having its own RAM (opposed to earlier-age FSB architecture with shared FSB a.k.a. Front-Side Bus, and shared RAM). In spite of each of the CPUs having its own RAM, CPUs share RAM address space – and whenever one needs access to RAM physically located within another one – it is done by sending a request to another socket via ultra-fast protocol such as QPI or Hypertransport.

Surprisingly, this doesn’t take as long as you might have expected – [Levinthal] gives the numbers of 100-300 CPU cycles if the data was in the remote CPU L3 cache, and of 100ns (~=300 cycles) if the data wasn’t there, and remote CPU needed to go to its own main RAM for this data.

CAS

Sometimes (in particular, in non-blocking algorithms and while implementing mutexes), we want to use so-called atomic operations. In academy, only one atomic operation, known as CAS (Compare-And-Swap) is usually considered (on the grounds that everything else can be implemented via CAS); in real-world, there are usually more of them (see, for example, std::atomic in C++11, Interlocked*() functions in Windows, or __sync_*_and_*() in GCC/Linux). These operations are quite weird beasts – in particular, they require special CPU support to work properly. On x86/x64, appropriate ASM instructions are characterised by having LOCK prefix, so CAS on x86/x64 is usually written as LOCK CMPXCHG.

What matters from our current perspective is that these CAS-like operations are going to take significantly longer than usual memory access (to provide atomic guarantees, CPU needs to synchronise things at least between different cores – or in case of multi-socket configurations, also between different sockets).

[AlBahra] gives the cost of CAS operations at about 15-30 clock cycles (with little difference between x86 and IBM Power families). Let’s note that this number is valid only when two assumptions stand: (a) we’re working with a single-core configuration, and (b) that CAS-ed memory is already in L1 cache.

As for CAS costs in multi-socket NUMA configurations – I wasn’t able to find the data about CAS, so I will need to speculate for the time being 🙁 . On the one hand, IMO it will be next-to-impossible to have latencies of CAS operation on “remote” memory less than round-trip of HyperTransport between the sockets, which in turn is comparable to the cost of NUMA L3 cache read. On the other hand, I don’t really see the reasons to go higher than that :-). As a result, I am guesstimating the cost of NUMA different-socket CAS (and CAS-like) operations at around 100-300 CPU clock cycles.

TLB

Whenever we’re working with modern CPUs and modern OS’s, at app-level we are usually dealing with “virtual” address space; in other words – if we run 10 processes, each of these processes can (and probably will) have its own address 0x00000000. To support such isolation, CPUs implement so-called “virtual memory”. In x86 world – it was first implemented via “protected mode” introduced as early as 1982 in 80286.

Usually, “virtual memory” works on per-page basis (for x86 each page is either 4K or 2M or at least in theory – even 1G(!) in size), with CPU being aware of the current process being run (!), and re-mapping virtual addresses into physical addresses – on each memory access, that is. Note that this re-mapping occurs completely behind the scenes, in a sense that all CPU registers (except for those specifically dealing with mapping) contain all the pointers in “virtual memory” format.

And as soon as we said “mapping” – well, the information about this mapping needs to be stored somewhere. Moreover, as this mapping (from virtual addresses into physical addresses) happens on each and every memory access – it needs to be Damn Fast. To help with it, a special kind of cache, known as Translation Lookaside Buffer (TLB) is normally used.

As for any type of cache, there is a cost of missing TLB; for x64 it is reported between 7-21 CPU cycles [7cpu]. Overall, TLBs are quite difficult to affect; however, a few recommendations can still be made in this regard:

once again – improving overall memory locality helps to reduce TLB misses too; the more local your data is – the less your chances are to get out of TLB.

consider using “huge pages” (those 2M pages on x64). The larger pages are – the less entries in TLB you’ll need; on the other hand, using “huge pages” comes with some caveats, and as a result – is a two-edged sword. Which means that you need to make sure to test it for your specific app.

consider turning off ASLR (=”Address Space Layout Randomization”). As discussed in [Drepper], enabling ASLR, while being good for security, hits performance, and exactly because of TLB misses too 🙁 .

Software Primitives

Now we’re done with those things which are directly hardware-related, and will be speaking about certain things which are more software-related; still, they’re really ubiquitous, so let’s see how much we spend every time we’re using them.

C/C++ Function Calls

First, let’s see the cost of C/C++ function call. Actually, C/C++ caller does a damn lot of stuff before making a call – and callee makes another few things too.

[Efficient C++] estimates CPU costs for a function call at 25-250 CPU cycles depending on number of parameters; however, it is quite an old book, and I don’t have a better reference of the same caliber 🙁 . On the other hand, from my experience, for a function with a reasonably small number of parameters, it is more like 15-30 cycles; this also seems to apply to non-Intel CPUs as measured by [eruskin].

“keep in mind that these days compilers tend to ignore inline specifications more often than notOptimisation hint: Use inline functions where applicable. However, keep in mind that these days compilers tend to ignore inline specifications more often than not 🙁 . Therefore, for really time-critical pieces of code you may want to consider __attribute__((always_inline)) for GCC, and __forceinline for MSVC compilers to make them do what you need. However, do NOT overuse this forced-inline stuff for not-so-critical pieces of code, it can make things worse rather easily.

BTW, in many cases gains from inlining can exceed simple removal of call costs. This happens because inlining enables quite a few additional optimisations (including those related to reordering to achieve the proper use of hardware pipeline). Also let’s not forget that inlining improves spatial locality for the code – which tends to help a bit too (see, for example, [Drepper]).

Indirect and Virtual Calls

Discussion above was related to usual (“direct”) function calls. Costs of indirect and virtual calls are known to be higher, and there is pretty much a consensus on that indirect call causes branching (however, as [Agner1] notes, as long as you happen to call the same function from the same point in code, branch predictors of modern CPUs are able to predict it pretty good; otherwise – you’ll get a misprediction penalty of 10-30 cycles). As for virtual calls – it is one extra read (reading VMT pointer), so if everything is cached at this point (which it usually is), we’re speaking about additional 4 CPU cycles or so.

On the other hand, practical measurements from [eruskin] show that the cost of virtual functions is roughly double of the direct call costs for small functions; within our margin of error (which is “an order of magnitude”) this is quite consistent with the analysis above.

Curiously recurring template patternThe curiously recurring template pattern (CRTP) is an idiom in C++ in which a class X derives from a class template instantiation using X itself as template argument— Wikipedia —Optimisation hint: IF your virtual calls are expensive – in C++ you may want to think about using templates instead (implementing so-called compile-time polymorphism); CRTP is one (though not the only one) way of doing it.

Allocations

These days, allocators as such can be quite fast; in particular, tcmalloc and ptmalloc2 allocators can take as little as 200-500 CPU cycles for allocation/deallocation of a small object [TCMalloc].

However, there is a significant caveat related to allocation – and adding to indirect costs of using allocations: allocation, as a Big Fat rule of thumb, reduces memory locality, which in turn adversely affects performance (due to uncached memory accesses described above). Just to illustrate how bad this can be in practice, we can take a look at a 20-line program in [NoBugs]; this program, when using vector<>, happens to be from 100x to 780x faster (depending on compiler and specific box) than an equivalent program using list<> – all because of poor memory locality of the latter :-(.

“In some real-world cases flattening your data structures can speed up your program as much as 5x.Optimisation hint: DO think about reducing number of allocations within your programs – especially if there is a stage when lots of work is done on a read-only data. In some real-world cases flattening your data structures (i.e. replacing allocated objects with packed ones) can speed up your program as much as 5x. A real-world story in this regard. Once upon a time, there was a program which used some gigabytes of RAM, which was deemed too much; ok, I rewrote it to a “flattened” version (i.e. each node was first constructed dynamically, and then an equivalent “flattened” read-only object was created in memory); the idea of “flattening” was to reduce memory footprint. When we ran the program, we observed that not only memory footprint was reduced by 2x (which was what we expected), but that also – as a very nice side effect – execution speed went up by 5x.

Kernel Calls

If our program runs under an operating system,1 then we have a whole bunch of system APIs. In practice,2 quite a few of those system calls cause kernel calls, which involve switches to kernel mode and back; this includes switching between different “protection rings” (on Intel CPU – usually between “ring 3” and “ring 0”). While this CPU-level switching back and forth itself takes only ~100 CPU cycles, other related overheads tend to make kernel calls much more expensive, so usual kernel call takes at least 1000-1500 CPU cycles 🙁 [Wikipedia.ProtectionRing].

1 yes, there are still programs which run without it

2 at least if we’re speaking about more or less conventional OS

C++ Exceptions

These days, C++ exceptions are said to be zero-cost until thrown. Whether it is really zero – is still not 100% clear (IMO it is even unclear whether such a question can be asked at all), but it is certainly very close.

“these 'zero-cost until thrown' exception implementations come at the cost of a huge pile of work which needs to be done whenever an exception is thrownHowever, these “zero-cost until thrown” implementations come at the cost of a huge pile of work which needs to be done whenever an exception is thrown. Everybody agrees that the cost of exception thrown is huge, however (as usual) experimental data is scarce. Still, an experiment by [Ongaro] gives us a ballpark number of around 5000 CPU cycles (sic!). Moreover, in more complicated cases, I would expect it to take even more.

Return Error and Check

An ages-old alternative to exceptions is returning error codes and checking them at each level. While I don’t have references for performance measurements of this kind of things, we already know enough to make a reasonable guesstimate. Let’s take a closer look at it (we don’t care much about performance in the case when error arises, so will concentrate on costs when everything is fine).

Basically, cost of return-and-check consists of three separate costs. The first one is the cost of conditional jump itself – and we can safely assume that 99+% of the time it will be predicted correctly; which means the cost of conditional jump in this case is around 1-2 cycles. The second cost is the cost of copying the error code around – and as long as it stays within the registers, it is a simple MOV – which is, given the circumstances, is 0 to 1 cycles (0 cycles means that MOV has no additional cost, as it is performed in parallel with some other stuff). The third cost is much less obvious – it is a cost of the extra register necessary to carry the error code; if we’re out of registers – we’ll need PUSH/POP pair (or a reasonable facsimile), which is in turn a write + L1 read, or 1+4 cycles. On the other hand, let’s keep in mind that chances of PUSH/POP being necessary, vary from one platform to another one; for example, on x86 any realistic function would require them almost for sure; however, on x64 (which has double number of registers) – chances of PUSH/POP being necessary, go down significantly (and in quite a few cases, even if register is not completely free, making it available may be done by compiler cheaper than a dumb PUSH/POP).

Adding all three costs together, I’d guesstimate costs of return-error-code-and-check (in normal case) at anywhere between 1 and 7 CPU cycles. Which in turn means that if we have one exception per 10000 function calls – we’re likely to be better with exceptions; however, if we have one exception per 100 function calls – we’re likely to be better with error codes. In other words, we’ve just reconfirmed a very well-known best practice – “use exceptions only for abnormal situations” 🙂 .

Thread Context Switches

Last but certainly not least, we need to speak about costs of thread context switches. One problem with estimating them is that, well, it is very difficult to figure them out. Common wisdom tells that they’re “damn expensive” (hey, there should be a reason why nginx outperforms Apache) – but how much this “damn expensive” is?

From my personal observations, the costs were at least 10000 CPU cycles; however, there are lots of sources which are giving MUCH lower numbers. In fact, however, it is all about “what exactly we’re trying to measure”. As noted in [LiEtAl], there are two different costs with relation to context switches.

The first cost is direct costs of thread context switching – and these are measured at about 2000 CPU cycles3

However, the second cost is MUCH higher; it is related to cache invalidation by the thread; according to [LiEtAl], it can be as large as 3M CPU clocks. In theory, with completely random access pattern, modern CPU with 12M of L3 cache (and taking penalty of the order of 50 cycles per access) can take a penalty of up to 10M cycles per context switch; still, in practice the penalties are usually somewhat lower than that, so the number of 1M from [LiEtAl] makes sense. This “much higher” estimate is also consistent with the number of spinlocks on x64 (which defaults to 4000 at least for Windows/x64): if it is usually beneficial to wait for 4000 iterations (amounting at the very least to 15-20K CPU cycles, and more like 40-50K cycles from what I’ve seen) reading that variable-which-is-currently-locked within a busy loop – just in hope that the variable will unlock before 4000 iterations is over, all of this trouble and CPU cycles merely to avoid a context switch – it means that the cost of the context switch is usually much higher than those tens-of-thousands-of-CPU-cycles-we’re-ready-to-spend-in-a-busy-loop-doing-nothing-useful.

3 that is, if my math is correct when converting from microseconds into cycles

Wrapping it Up

Phew, it was quite a bit of work to find references for all these more-or-less known observations.

Also please note that while I’ve honestly tried to collect all the related costs in one place (checking 3rd-party findings against my own experiences in the process), it is just a very first attempt at this, so if you find reasonably compelling evidence that certain item is wrong – please let me know, I will be happy to make the diagram more accurate.

Comments

Ola,
I am quite surprised by to low cost of L1/L2/L3 cache misses .
What kind of CPU are you refering to ? Most likely top notch PC 🙂
I would be really interested how these (and all the others) look like on consoles PS3/4/NEO XBOX 360 , XBox ONE .
From my experience these are F***** expensive 😉
Martin

Wait, what is listed is not the cost of “cache miss” – it is time when it takes when there is a “hit”. So to calculate cost of “L3 miss” – you need to subtract “L3 read” from “main RAM read” (and it can easily reach 100 cycles, which is quite a lot, especially as it can happen all the time).

I’ve took a quick look at your OS, and I LOVE it :-). The problem, however, is that most of us still need to program for Windows (ouch) and/or Linux (which is getting bigger and bigger all the time – days of DamnSmallLinux which was able to do something useful in 4M of RAM, are long gone 🙁 ). 🙁

re: exceptions
their cost is not so significant because exceptions are (supposedly) used when otherwise a serious problem would occur (out of memory, division by zero, etc). In other words, an exception is (supposedly) raised when normal program execution can not continue. Please correct me if i am wrong.

You’re right, though I am right too :-). I am speaking about cost per occurrence – and this cost is indeed damn high for exceptions. Which means that exceptions SHOULD NOT be used for anything which happens more than once in a long while. However, if somebody is careless enough to raise them 1M times per second – well, they will be in trouble. And this is perfectly consistent with what you’ve just said :-).

This cost is important to answer the question “what exactly qualifies as a ‘normal’ execution flow”; quite often it is not as obvious (as in “should we consider malformed incoming packet as normal occurrence or not?”); having this guesstimate, it is easier to answer such questions at least from performance point of view.

Regarding C++ Exceptions… if they cost only 5000 cycles then contrary to the common “wisdom” there can actually be times when they’d be quite efficient as a control structure! Think of a search algorithm which throws an ‘Item’ object (if the item was found) or an ‘EndOfContainer’ exception (if an item was not found). If your container automatically throws the exception when it’s exhausted, you will save a boolean evaluation and a conditional jump (because the search can be hard-coded as an ‘infinite’ loop). So then, if your data structure contains tens of thousands (or more) items, the accumulated savings can easily exceed the cost of a throw.

“automatically throws the exception” – can you elaborate on how you’re going to implement it? Do you mean playing around with VirtualAlloc()/mmap() and then relying on OS “access violation” exception (but this is quite different from C++ one) – or something different?

I think that this is a great chart, and every software engineer should have a rough idea of these costs in their head when they are designing code.

One thing that might be a useful addition to your diagram is to add the cost of acquiring uncontended locks (e.g. a mutex and/or read/writer lock). I find that the cost of acquiring a shared read lock between two threads running on different CPUs can be quite large (because they need to read/update a count of the number of readers which can cause cache contention between the separate caches on the separate CPUs).

Please compare the exception cost to realistic alternatives. You throw an exception and catch it, say, three function calls above.The alternative is at least 4 tests and alternatives. That leads to more code. Also, the stack unraveling code is ad hoc and not easily optimized.

That said, there is no excuse for walking the stack twice for a throw in “release”/”-o2″ mode.

I should say that comparing things as such is beyond the scope of this exercise (comparing all the pairs which might be of interest, would take way too long). On the other hand, answering a question “how much error-return-and-check costs” is well within the scope, so I’ve added it to the post (and will ask to add it to the picture too); while it is a guesstimate, I’m confident that – like anything else in this table – it stands with “within an order of magnitude” accuracy.

Basically, the cost of return-and-check seems to consist of three components: 1. cost of conditional jump (correctly predicted most of the time), 2. cost of R/R copying error code around, 3. cost of PUSH/POP to pass error code around if we’re out of registers. Which adds up to 1-7 cycles (I’ve also added more detailed analysis to the post). In turn, it means that if we have one exception per 10000 calls with return-and-check, we’ll be better with exceptions, and if we have one exception per 100 such calls – we’ll probably be better with error codes. Which is perfectly consistent with long-standing “common wisdom”/”best practice” of “using exceptions only for error handling” (which BTW I’ve learned over 20 years ago from a 2nd edition of a certain book titled “The C++ Programming Language” :-)).

For the Exceptions vs. Return Code & Check comparison, in the latter case, assuming we are considering the cost of handling an error, the first cost, that of the conditional jump would *not* be predicted, and thus would cost the estimated 10-20 cycles. The error-case happens after a bunch of non-error-cases, so the branch predictor would continue to predict non-error. ( If there were a bunch of error-case paths before this, this is not an exceptional error at all. 🙂 )

> that of the conditional jump would *not* be predicted, and thus would cost the estimated 10-20 cycles.

For the case when the error happens – sure, but for the case when it doesn’t (and it should be 99.(9)% of the time) – it will be predicted. Therefore, to calculate costs of the “normal” case we need to count for prediction to be successful.

Okay, ‘Return Error and Check’ doesn’t clearly say it is the normal case.

I think it would be informative to have the ‘Return Error and Check’ exceptional case, along with the ‘C++ Exceptions’ normal case. As it stands right now, it looks like you are directly comparing the two, and that comparison is currently misleading.

Can you please include the costs of atomic operations, including both a simple “lock inc” or “lock xadd” and a full “lock cmpxchg”? For the costs of those, see Paul McKenney’s work. (Note that the answers decreased significantly a few generations ago on Intel hardware, though those operations still take significantly longer than a normal instruction.)

This is one thing I’d very interested in myself, thanks a lot for the reference, I’ve already included it to the article and will add to the picture ASAP! On the other hand – I wasn’t able to find information on the costs of CAS/atomics in multi-socket configurations, so needed to guesstimate (if you have any references about CAS on multi-socket NUMAs – please LMK).

> Note that the answers decreased significantly a few generations ago on Intel hardware

Let me guess – it happened around the time when FSB was abolished (and therefore bus lock was no longer necessary for sync) ;-).