In today's server and cloud-based systems, multi-layered complexity can make it hard to pinpoint performance issues and bottlenecks. This book is packed full of useful analysis techniques covering tracing, kernel internals, tools and benchmarking.

Getting a well-balanced and tuned system requires attention to all of its components, and the book has chapters covering CPU optimisation (cores, threading, caching and interconnects), memory optimisation (virtual memory, paging, swapping, allocators, buses), file system I/O, storage, networking (protocols, sockets, physical connections) and typical issues facing cloud computing.

The book is full of useful examples and practical instructions on how to drill down and discover performance issues in a system, and it also includes some real-world case studies.

It has helped me become even more focused on how to analyse performance issues, and on how to use deep system instrumentation to understand where and why performance regressions occur.

All in all, a most systematic and well-written book that I'd recommend to anyone running large, complex server and cloud computing environments.

We had a discussion some time ago about the speed of malloc and whether you should malloc/free small objects as needed or reuse them with e.g. memory pools. There were strong opinions on either side which made me search the net for some hard numbers on malloc speed.

It turns out there weren’t any. Search engines only threw up tons of discussions about what the speed of malloc would theoretically be.

So I wrote a test program that allocates chunks of different sizes, holds on to them for a while and frees them at random times in a random order. Then I made it multithreaded to add lock contention into the mix. I used 10 threads.

On a quad-core laptop with 4 GB of memory, glibc can do roughly 1.5 million malloc/free pairs per second.

On a crappy ARM board with a single core, glibc can do 300 000 malloc/free pairs per second.

What does this mean in practice? If you are coding any kind of non-super-high-performance app, the only reasons you would care about reducing mallocs are:

you do hundreds of malloc/free calls per second for long periods at a time (can lead to severe memory fragmentation)

you have absolute latency requirements in the sub-millisecond range (very rare)

your app is used actively for hours on end (e.g. Firefox)

you know, through measurement, that the mallocs you are removing constitute a notable part of all memory allocations
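For the rare cases above where reuse does pay off, the usual alternative is a free-list pool of the kind mentioned at the start of this post: freed objects are pushed onto a list and handed back out instead of going through malloc again. A minimal sketch (the names here are mine, purely for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A fixed-object-size pool: the free list is threaded through the
 * freed objects themselves, so it needs no extra bookkeeping memory. */
struct pool {
    void  *free_list;   /* singly linked list of freed objects */
    size_t obj_size;
};

static void pool_init(struct pool *p, size_t obj_size)
{
    assert(obj_size >= sizeof(void *));  /* need room for the link pointer */
    p->free_list = NULL;
    p->obj_size = obj_size;
}

static void *pool_alloc(struct pool *p)
{
    if (p->free_list) {                  /* reuse a previously freed object */
        void *obj = p->free_list;
        p->free_list = *(void **)obj;    /* pop it off the list */
        return obj;
    }
    return malloc(p->obj_size);          /* list empty: fall back to malloc */
}

static void pool_free(struct pool *p, void *obj)
{
    *(void **)obj = p->free_list;        /* push onto the free list */
    p->free_list = obj;
}
```

Allocation from a warm pool is just a pointer pop, which is why pools can beat even a fast malloc for small, hot objects; the trade-off is that the pooled memory is never returned to the system.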

But, as Knuth says, over 97% of the time malloc is so fast that you don't have to care.

No, really! You don’t!

Update: The total memory pool size I had was relatively small, to reflect the fact that the working set is usually small. I re-ran the test with 10x and 100x the pool size. The first was almost identical to the original test; the latter was about 10 times slower. That is still ~175 000 allocations per second, which should be plenty fast. I have also uploaded the code here for your enjoyment.