One day Aurelien Broszniowski from Terracotta emailed me his list of bottlenecks. We cc’ed Russell in on the conversation, he gave me his list, I have a list of my own, and here’s the resulting stone soup.

Russell said this is his “I wish I knew when I was younger” list and I think that’s an enriching way to look at it. The more experience you have, and the more different types of projects you tackle, the more lessons you’ll be able to add to a list like this. So when you read this list, and when you make your own, you are stepping through years of accumulated experience and more than a little frustration, but in each item there is a story worth grokking.

Bad design: The developers create an app that runs fine on their computers. The app goes into production and runs fine with a couple of users. Months or years later, the application can't handle thousands of users and needs to be completely re-architected and rewritten.

Algorithm complexity

Dependent services like DNS lookups and whatever else you may block on.

Stack space

Disk:

Local disk access

Random disk I/O -> disk seeks

Disk fragmentation

SSD performance drops once the amount of data written exceeds the SSD's size

OS:

Fsync flushing, the Linux buffer cache filling up

TCP buffers too small

File descriptor limits

Power budget

Caching:

Not using memcached (database pummeling)

In HTTP: headers, ETags, not gzipping, etc.

Not utilising the browser's cache enough

Byte code caches (e.g. PHP)

L1/L2 caches. This is a huge bottleneck. Keep important/hot data in L1/L2. This spans so much: Snappy for network I/O, column DBs running algorithms directly on compressed data, etc. Then there are techniques to avoid destroying your TLB. The most important idea is to have a firm grasp of computer architecture: multi-core CPUs, L1/L2 caches per core, a shared L3, NUMA RAM, the data transfer bandwidth/latency from DRAM to the chip, how DRAM caches disk pages and dirty pages, and how TCP packets travel through CPU <-> DRAM <-> NIC.
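To make the L1/L2 point above concrete, here is a minimal Java sketch (the array, its size, and the crude timing harness are made up, and there is no JIT warm-up, so treat the numbers as illustrative only). The same summation is done twice; only the traversal order changes, which decides whether reads hit cache lines that were just fetched or miss out to DRAM.

    public class CacheTraversal {
        static final int N = 4096;
        static final int[][] grid = new int[N][N];

        // Row-major: walks each row sequentially, so consecutive reads land on
        // the cache lines that were just fetched.
        static long sumRowMajor() {
            long sum = 0;
            for (int row = 0; row < N; row++)
                for (int col = 0; col < N; col++)
                    sum += grid[row][col];
            return sum;
        }

        // Column-major: each read comes from a different row array, so nearly
        // every access pulls in a fresh cache line and evicts ones still needed.
        static long sumColumnMajor() {
            long sum = 0;
            for (int col = 0; col < N; col++)
                for (int row = 0; row < N; row++)
                    sum += grid[row][col];
            return sum;
        }

        public static void main(String[] args) {
            long t0 = System.nanoTime();
            sumRowMajor();
            long t1 = System.nanoTime();
            sumColumnMajor();
            long t2 = System.nanoTime();
            System.out.printf("row-major: %d ms, column-major: %d ms%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }

The point is not the exact numbers but the shape of the difference: identical work, very different memory behavior.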

CPU:

CPU overload

Context switches -> too many threads on a core, bad luck with the Linux scheduler, too many system calls, etc.

IO waits -> all CPUs wait at the same speed

CPU caches: caching data close to the cores is a fine-grained process (in Java, think volatile, for instance); the trick is to find the right balance between cores holding multiple copies of the data with different values and heavy synchronization to keep the cached data consistent. (A short Java sketch of this trade-off follows the CPU items below.)

Backplane throughput
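Picking up the CPU-cache item above, here is a minimal sketch of the visibility side of that trade-off (class and field names are made up). Without volatile or other synchronization, the reader thread is allowed to keep re-reading a stale, core-local copy of the flag and may never see the update; volatile forces the write to become visible, at the cost of defeating some of that caching.

    public class StopFlag {
        private volatile boolean running = true;   // drop volatile and the loop may never exit

        public void runWorker() {
            while (running) {
                // ... do a unit of work ...
            }
        }

        public void stop() {
            running = false;                       // made visible to other cores by volatile
        }

        public static void main(String[] args) throws InterruptedException {
            StopFlag f = new StopFlag();
            Thread worker = new Thread(f::runWorker);
            worker.start();
            Thread.sleep(100);
            f.stop();
            worker.join();
            System.out.println("worker stopped");
        }
    }
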

Network:

NIC maxed out, IRQ saturation, soft interrupts taking up 100% CPU

DNS lookups

Dropped packets

Unexpected routes within the network

Network disk access

Shared SANs

Server failure -> no answer from the server anymore

Process:

Testing time

Development time

Team size

Budget

Code debt

Memory:

Out of memory -> the process gets killed, or it goes into swap and grinds to a halt

Out of memory causing Disk Thrashing (related to swap)

Memory library overhead

Memory fragmentation

In Java, it leads to GC (compaction) pauses

In C, calls to malloc start taking forever

If you have any more to add or you have suggested fixes, please jump in.

Nice list, have to say I've experienced many of these, but not all. I have some to add:

To somewhat mitigate the issues of database working size exceeding RAM (at least with MySQL), use partitioning to limit the amount of data the database engine needs to run through/update. Depending on your access patterns, you should partition based on insertion order or select locality.
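As a rough illustration of that partitioning idea, here is a sketch of range partitioning by insertion date (the events table, columns, partition boundaries, and JDBC connection details are all hypothetical, the DDL assumes MySQL, and a MySQL JDBC driver is assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionSetup {
        public static void main(String[] args) throws Exception {
            // Partition by insertion date so queries constrained on `created`
            // only touch a few partitions, and old data can be dropped cheaply.
            String ddl =
                "CREATE TABLE events (" +
                "  id BIGINT NOT NULL AUTO_INCREMENT," +
                "  created DATE NOT NULL," +
                "  payload VARCHAR(255)," +
                "  PRIMARY KEY (id, created)" +          // partition column must be in the key
                ") PARTITION BY RANGE (TO_DAYS(created)) (" +
                "  PARTITION p_q1 VALUES LESS THAN (TO_DAYS('2012-04-01'))," +
                "  PARTITION p_q2 VALUES LESS THAN (TO_DAYS('2012-07-01'))," +
                "  PARTITION p_max VALUES LESS THAN MAXVALUE" +
                ")";
            try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://localhost/test", "user", "password");
                 Statement s = c.createStatement()) {
                s.execute(ddl);
                // Purging a quarter is now ALTER TABLE events DROP PARTITION p_q1
                // instead of a huge DELETE that churns the whole table.
            }
        }
    }
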

When working with the web and if real time consistency is not an issue (eventual consistency is sufficient), store data (mostly) denormalized. It's a little more effort to synchronize, but worth it with the reduced query overhead. When you're not typing like a mad man with 0 hours of sleep trying to keep a server up under load, you have the time to think about and solve the synchronization problem while taking a shower.

With event-driven programming, keeping track of state object references and releasing them when done is easily forgotten (a small sketch of that leak follows below).
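Here is a minimal sketch of that leak (Dispatcher, Handler, and the 1 MB of per-request state are made-up names, not a real framework): a handler registers itself with a long-lived dispatcher, so unless it is deregistered when its work is done, its state can never be collected.

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class EventLeakDemo {
        interface Handler { void onEvent(String event); }

        static class Dispatcher {
            private final List<Handler> handlers = new CopyOnWriteArrayList<>();
            void register(Handler h)   { handlers.add(h); }
            void deregister(Handler h) { handlers.remove(h); }
            void publish(String event) { for (Handler h : handlers) h.onEvent(event); }
        }

        static class RequestHandler implements Handler {
            private final byte[] state = new byte[1024 * 1024];  // per-request state
            private final Dispatcher dispatcher;

            RequestHandler(Dispatcher d) { dispatcher = d; d.register(this); }

            @Override public void onEvent(String event) {
                // ... handle the event using `state` ...
                dispatcher.deregister(this);  // forget this line and every request leaks ~1 MB
            }
        }

        public static void main(String[] args) {
            Dispatcher d = new Dispatcher();
            for (int i = 0; i < 10_000; i++) {
                new RequestHandler(d);        // each handler holds state until deregistered
                d.publish("request-" + i);
            }
            System.out.println("done");
        }
    }
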

With the network, if using an iptables based firewall, keep track of your conntrack limit and the number of connections being tracked. There are no log messages when packets are dropped because you've exceeded the limit.

- Hardware: Make sure your RAID controllers are performing optimally. I ran into an issue where the battery on the RAID controller had drained and, even though the controller was not showing as being down, performance was severely degraded because it stopped using the onboard cache.

- DB: Make sure your logs are being written to a different disk subsystem than your data. It helps improve both performance and the recovery process.

I blogged about this topic some time ago at http://kudithipudi.org/2009/03/11/lessons-of-the-trade-troubleshooting-database-perfromance/

I'd like to add some bullet points from our company experience:

- Bad use of ORMs, like the classic N+1 problem, or making an abnormal number of database queries to update/insert objects one by one instead of using plain old SQL (a sketch of the N+1 pattern follows below)

- Neglecting asynchronous processing. Examples: interacting with an external service API synchronously from a web app, which leads to lots of IO-blocked threads and incoming web requests stacking up in the processing queue as a result; not using background jobs/message queue processors to offload web application resources

- Forgetting about HTML caching on the server side

- Misuse of NoSQL solutions, using them for reporting and bulk processing

- Forgetting about DB isolation levels and their consequences

- Bad data model
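To make the N+1 point concrete, here is a sketch in plain JDBC (the orders/order_lines schema is hypothetical; an ORM lazily walking a collection generates the same query traffic as the first method):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class NPlusOneDemo {

        // N+1: one query for the order ids, then one extra round trip per order.
        static void countLinesNPlusOne(Connection c) throws Exception {
            List<Long> ids = new ArrayList<>();
            try (Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery("SELECT id FROM orders")) {
                while (rs.next()) ids.add(rs.getLong(1));
            }
            for (long id : ids) {
                try (PreparedStatement p = c.prepareStatement(
                        "SELECT COUNT(*) FROM order_lines WHERE order_id = ?")) {
                    p.setLong(1, id);
                    try (ResultSet rs = p.executeQuery()) {
                        rs.next();
                        System.out.println(id + ": " + rs.getLong(1));
                    }
                }
            }
        }

        // One round trip: push the join and the grouping down into the database.
        static void countLinesOneQuery(Connection c) throws Exception {
            String sql = "SELECT o.id, COUNT(l.id) FROM orders o "
                       + "LEFT JOIN order_lines l ON l.order_id = o.id GROUP BY o.id";
            try (Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(sql)) {
                while (rs.next()) System.out.println(rs.getLong(1) + ": " + rs.getLong(2));
            }
        }
    }
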

Interesting read. I have also encountered many of these, and in today's world of data-intensive applications I am seeing more and more of the load fall on the DB, because the architecture was not designed (x years ago) to support horizontal (or even vertical) partitioning.

I do have a question: how does "Lack of profiling, lack of tracing, lack of logging" fall into the category of a bottleneck? I'm not quite following that. While absolutely necessary when prudently applied, in my experience logging and tracing often cause performance bottlenecks themselves.