I sometimes see the terms software benchmarking and profiling used interchangeably, but as far as I understand there is a subtle difference.

Both are connected by time. But whereas benchmarking is mainly about determining a speed score that can be compared with other applications, profiling gives you exact information about where your application spends most of its time (or most of its cycles).

For me it was always like: integration testing is the counterpart to benchmarking, and unit testing the counterpart to profiling. But how does micro-benchmarking fit into this?

Profiling and benchmarking are flip sides of the same coin: profiling helps you narrow down where optimization would be most useful, while benchmarking lets you easily isolate optimizations and cross-compare them.

3 Answers

A benchmark is something that measures the time for some whole operation. e.g. I/O operations per second under some workload. So the result is typically a single number, in either seconds or operations per second. Or a data set with results for different parameters, so you can graph it.

You might use a benchmark to compare the same software on different hardware, or different versions of some other software that your benchmark interacts with, e.g. benchmark max connections per second with different Apache settings.
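A minimal sketch of that idea in Python (the workload, iteration count, and the resulting number are all invented for illustration; a real benchmark would exercise I/O, a server, or whatever you care about):

```python
import time

def benchmark(workload, iterations=1000):
    """Time a whole operation end to end and report operations per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        workload()
    elapsed = time.perf_counter() - start
    # A benchmark's result is typically this single number.
    return iterations / elapsed

# Stand-in workload for illustration only.
ops_per_sec = benchmark(lambda: sorted(range(1000)))
print(f"{ops_per_sec:.0f} ops/sec")
```

Running it with several parameter values would give you the data set you can graph, as described above.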

Profiling is not aimed at comparing different things: it's about understanding the behaviour of a program. A profile result might be a table of time taken per function, or even per instruction with a sampling profiler. You can tell it's a profile not a benchmark because it makes no sense to say "that function took the least time so we'll keep that one and stop using the rest".

You use a profile to figure out where to optimize. A 10% speedup in a function where your program spends 99% of its time is more valuable than a 100% speedup in any other function. Even better is when you can improve your high-level design so the expensive function is called less, as well as just making it faster.
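The arithmetic behind that claim is worth spelling out (the numbers here are illustrative): suppose a 100-second run spends 99 seconds in one hot function and 1 second everywhere else.

```python
hot, rest = 99.0, 1.0  # seconds: 99% in one function, 1% elsewhere

# 10% speedup of the hot function (it now runs at 1.10x speed):
after_hot = hot / 1.10 + rest    # 90.0 + 1.0 = 91.0 s

# 100% speedup (2x) of everything else:
after_rest = hot + rest / 2.0    # 99.0 + 0.5 = 99.5 s

print(after_hot, after_rest)
```

The modest improvement to the hot function saves 9 seconds; doubling the speed of all the other code saves only half a second.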

Microbenchmarking is a specific form of benchmarking. It means you're testing one super-specific thing to measure just that in isolation, not the overall performance of anything that's really useful.

Micro-benchmarking is a special case of benchmarking. If you do it right, it tells you which operations are expensive and which are cheap, which helps you while trying to optimize. If you do it wrong, you probably didn't even measure what you set out to measure at all. e.g. you wrote some C to test for loops vs. while loops, but the compiler made different code for different reasons, and your results are meaningless. (Different ways to express the same logic almost never matter with modern optimizing compilers; don't waste time on this.) Micro-benchmarking is hard.

The other way to tell it's a micro-benchmark is that you usually need to look at the compiler's asm output to make sure it's testing what you wanted it to test. (e.g. that it didn't optimize across iterations of your repeat-10M-times loop by hoisting something expensive out of the loop that's supposed to repeat the whole operation enough times to give duration that can be accurately measured.)
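As a sketch of the repeat-many-times idea in Python, where the standard `timeit` module plays the role of that loop (the two operations compared here are arbitrary illustrations, and the same caveat applies: you must check that the runtime is actually doing the work you think it is):

```python
import timeit

# Micro-benchmark one super-specific operation in isolation, repeated
# enough times that the total duration can be measured accurately.
list_time = timeit.timeit("x = [i for i in range(100)]", number=10_000)
map_time = timeit.timeit("x = list(map(int, range(100)))", number=10_000)

print(f"listcomp: {list_time:.4f}s  map: {map_time:.4f}s")
```

`timeit` also disables garbage collection during the measurement by default, which is itself an example of how micro-benchmarks run the code under test in an artificially clean environment.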

Micro-benchmarks can distort things, because they test your function with caches hot and branch predictors primed, and they don't run any other code between invocations of the code under test. This can make huge loop unrolling look good, when as part of a real program it would lead to more cache misses. Similarly, it makes big lookup tables look good, because the whole lookup table ends up in cache. The full program usually dirties enough cache between calls to the function that the lookup table doesn't always hit in cache, so it would have been cheaper just to compute something. (Most programs are memory-bound. Re-computing something not too complex is often as fast as looking it up.)

Thank you for taking the time. But my confusion starts with micro-benchmarking: would that mean I just benchmark only a certain function of my program? And what exactly is then the difference from profiling a certain function?
– Jim McAdams, Sep 8 '16 at 9:28

@JimMcAdams: Yes, that's exactly the sort of thing microbenchmarking is all about: repeating the same work many times. In a profile result, drilling down to a single function would hopefully show you % of total time on a line-by-line basis. (Or instruction by instruction, since at this granularity it matters more what the asm looks like than the source.) Or you could profile recording cache misses instead of clock cycles.
– Peter Cordes, Sep 8 '16 at 10:43

Often people do profiling not to measure how fast a program is, but to find out how to make it faster.

Often they do this on the assumption that slowness is best found by measuring the time spent by particular functions or lines of code.

There is a clear way to think about this: If a function or line of code shows an inclusive percent of time, that is the fraction of time that would be saved if the function or line of code could be made to take zero time (by not executing it or passing it off to an infinitely fast processor).

There are other things besides functions or lines of code that can take time.
Functions and lines of code are descriptions of what the program is doing, but they are not the only possible descriptions.

Suppose you run a profiler that, every N seconds of actual time (not just CPU time) collects a sample of the program's state, including the call stack and data variables.
The call stack is more than a stack of function names - it is a stack of call sites where those functions are called, and often the argument values.
Then suppose you could examine and describe each of those.
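A toy version of such a sampler can be built in CPython with a background thread that periodically snapshots the main thread's call stack via `sys._current_frames()` (a CPython implementation detail, and the `busy` workload is invented, but it illustrates the mechanism):

```python
import sys
import threading
import time
import traceback

def sample_stacks(target, interval=0.005, duration=0.3):
    """Run target() repeatedly while a background thread samples the stack."""
    samples = []
    main_id = threading.get_ident()
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            frame = sys._current_frames().get(main_id)
            if frame is not None:
                # Each sample is the whole stack of call sites, not just
                # the name of the function currently executing.
                samples.append(traceback.extract_stack(frame))
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    end = time.time() + duration
    while time.time() < end:
        target()
    stop.set()
    t.join()
    return samples

def busy():
    sum(i * i for i in range(10_000))  # stand-in for real work

stacks = sample_stacks(busy)
print(len(stacks), "samples collected")
```

Each sample records every level of the stack, which is what makes the richer descriptions below possible; real tools (py-spy, perf, etc.) do the same thing far more efficiently and can also capture argument values.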

For example, descriptions of a sample could be:

Routine X is in the process of allocating memory for the purpose of initializing a dictionary used in recording patients by routine Q when such a thing becomes necessary.

The program is in the process of reading a dll file for the purpose of extracting a string resource that, several levels up the call stack, will be used to populate a text field in a progress bar that exists to tell the user why the program is taking so long :)

The program is calling function F with certain arguments, and it has called it previously with the same arguments, giving the same result. This suggests one could just remember the prior result.

The program is calling function G which is in the process of calling function H just to decipher G's argument option flags. The programmer knows those flags are always the same, suggesting a special version of G would save that time.

etc. etc.
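The repeated-call description above (function F called again with the same arguments, giving the same result) points straight at memoization; in Python the standard tool is `functools.lru_cache` (the function `f` here is a made-up stand-in for an expensive pure function):

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def f(x):
    """Stand-in for the expensive pure function the samples caught repeating."""
    global calls
    calls += 1       # count how often the body actually runs
    return x * x     # imagine heavy work here

f(10); f(10); f(10)  # same arguments three times...
print(calls)         # ...but the body ran only once
```

The point is that this optimization opportunity is visible in the samples' arguments and results, not in any per-function timing table.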

Those are possible descriptions.
If a description accounts for F percent of time, then F/100 is the probability that any given sample will meet that description.
Simpler descriptions are:

Routine or line of code X appears on Q percent of stack samples. That is measured inclusive percent.

Routine D appears immediately above routine E on R percent of stack samples. That number could be put on the arc of a call graph from D to E.

Stack sequence main->A->B->C->D->E is the sequence that appears on the largest number of samples. That is the "hot path".

The routine that appears most often at the bottom of the stacks is T. That is the "hot spot".
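Given a pile of stack samples, these simple descriptions are cheap to compute. A minimal sketch with invented sample data (each stack ordered from main to the currently executing routine):

```python
from collections import Counter

# Four invented samples; real ones would come from a sampling profiler.
samples = [
    ("main", "A", "B", "C"),
    ("main", "A", "B", "C"),
    ("main", "A", "D"),
    ("main", "A", "B", "C"),
]

def inclusive_percent(routine, samples):
    """Percent of samples on which the routine appears anywhere in the stack."""
    hits = sum(routine in stack for stack in samples)
    return 100.0 * hits / len(samples)

hot_path = Counter(samples).most_common(1)[0][0]  # most frequent whole stack
hot_spot = Counter(s[-1] for s in samples).most_common(1)[0][0]  # executing routine

print(inclusive_percent("B", samples))  # 75.0
print(hot_path)                         # ('main', 'A', 'B', 'C')
print(hot_spot)                         # 'C'
```

Note that all three numbers fall out of the same set of samples; the tool's job is only to aggregate them in different ways.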

Most profiler tools only give you these simple descriptions.
Some programmers understand the value of examining the samples themselves, so they can make more semantic descriptions of why the program is spending its time.
If the objective were to accurately measure the percentage of time due to a particular description, then one would have to examine a large number of samples.
But if a description appears on a large fraction of a small number of samples, one has not measured it accurately, but one knows it is large, and it has been found accurately.
See the difference?
You can trade off accuracy of measurement for power of speedup finding.

A benchmark can help you observe the system's behavior under load,
determine the system's capacity, learn which changes are important, or
see how your application performs with different data.

Profiling is the primary means of measuring and analyzing where time
is consumed. Profiling entails two steps: measuring tasks and the time
elapsed, and aggregating and sorting the results so that the important
tasks bubble to the top. -- High Performance MySQL

What I understand is: you benchmark to measure and so know your application, while you profile to measure and so improve your application.