Caches losing a small percentage of hits can mean a large increase in datastore load (eg a 95% hit ratio dropping to 90% means DB load goes from 5% to 10% - a doubling of load; the same 5-point drop from 99.5% to 94.5% means DB load goes from 0.5% to 5.5% - an 11x increase in DB load)
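
The arithmetic above can be checked with a tiny helper - a minimal sketch, with illustrative names:

```java
// Datastore load as a function of cache hit ratio: every miss falls
// through to the datastore, so DB load is simply the miss rate.
public class CacheMath {
    // Fraction of requests that reach the datastore.
    public static double dbLoad(double hitRatio) {
        return 1.0 - hitRatio;
    }

    // How many times the datastore load multiplies when the hit ratio
    // drops from 'before' to 'after'.
    public static double loadMultiplier(double before, double after) {
        return dbLoad(after) / dbLoad(before);
    }
}
```

The same 5-point drop is far more damaging the closer the hit ratio is to 100%, because the baseline miss rate it divides is so small.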

Client timeouts that close connections and open new ones can cause a connection storm against the (remote) cache. TCP connection setup is expensive compared to a read request, so the storm can severely impact performance. Worse, the timeouts tend to happen in the first place because the cache is already under heavier than normal load - so a load increase causes a storm, which causes even worse performance

Periodic regular slowdowns are often caused by intensive I/O on the box - possibly from other processes, backups, or log copies

Hash table expansion takes time - caches can have "blips" of bad performance while they rehash to handle more data (especially if it needs to lock during rehashing)
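
One mitigation, sketched here for Java's HashMap: pre-size the table for the expected entry count so it never rehashes while serving traffic (the helper name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class PreSized {
    // HashMap resizes (and rehashes every entry) once size exceeds
    // capacity * loadFactor, so ask for enough capacity up front.
    public static <K, V> Map<K, V> newMapFor(int expectedEntries) {
        // Same formula as HashMap.newHashMap (JDK 19+), written out here
        // so the sketch runs on older JDKs too; 0.75 is the default
        // load factor.
        int capacity = (int) Math.ceil(expectedEntries / 0.75);
        return new HashMap<>(capacity);
    }
}
```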

After running for a sufficiently long time, memory fragmentation may cause a spike in the memory needed - and if that is more memory than you have, you run out of memory

Fragmented memory uses more memory than the data itself needs, and this can exceed what was provisioned - which can cause the process to run out of system memory

Put different kinds of operations on different thread pools so that high-priority tasks are not blocked behind lower-priority, slow ones
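
A minimal sketch of that isolation - pool sizes, names, and the example tasks are all illustrative, not a prescription:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Pools {
    // Daemon threads so this sketch never keeps a JVM alive on its own.
    private static ExecutorService pool(int threads) {
        return Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
    }

    // Pool for fast, high-priority work (eg serving reads).
    static final ExecutorService fast = pool(8);
    // Separate pool for slow, low-priority work (eg eviction sweeps,
    // stats, backups) - a backlog here never steals a fast thread.
    static final ExecutorService slow = pool(2);

    // Demo: run one task on each pool and combine the results.
    public static int demo() {
        try {
            return fast.submit(() -> 1).get() + slow.submit(() -> 2).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The key property is that a flood of slow tasks can only exhaust the slow pool; the fast pool's threads stay available.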

Make slow operations lock-free - a slow operation is the worst thing to hold a lock across

Cap all memory requirements and avoid churn (reuse buffers) so that memory usage is deterministic and doesn't fragment
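
A sketch of capped, churn-free memory via a fixed buffer pool (class name and sizes are illustrative): memory use is bounded at poolSize * bufferSize, and buffers are recycled rather than reallocated, so the allocator sees no churn to fragment over.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BufferPool {
    private final BlockingQueue<ByteBuffer> free;

    public BufferPool(int poolSize, int bufferSize) {
        free = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            // Direct buffers: off-heap, so recycling also spares the GC.
            free.add(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    public ByteBuffer acquire() {
        try {
            // Blocks when every buffer is in use - that back-pressure is
            // what makes memory usage deterministic.
            return free.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public void release(ByteBuffer buf) {
        buf.clear();
        free.add(buf);
    }
}
```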

Use higher-level concurrent structures (eg ConcurrentHashMap) if you can adapt your application to them, rather than building your own
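
For example, a concurrent counter map that leans on ConcurrentHashMap instead of hand-rolled locking (the class and key names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Counters {
    // ConcurrentHashMap gives atomic per-key read-modify-write without
    // any explicit locking in application code.
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    public void increment(String key) {
        // merge() is atomic per key: no lost updates under concurrency.
        counts.merge(key, 1L, Long::sum);
    }

    public long get(String key) {
        return counts.getOrDefault(key, 0L);
    }
}
```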

If you have to use low-level concurrency tools, try to keep your use hidden behind higher level APIs

Simply lazily initializing a singleton doesn't work in a multi-threaded context, and double-checked locking is fraught with concurrency errors in many implementations. Fast wrong results are NOT better than slightly slower correct results

Concurrency bugs that look like they'll happen very infrequently can actually happen often enough to be quite painful

Before you make a clever concurrency optimization, look at best practices. eg there is no need to contend on a lock at all if it is protecting unrelated critical sections (they should use different locks); and if the lock is necessary, slow operations can often be moved outside the locked section.
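
A sketch of the second practice - do the slow work with no lock held, and lock only for the cheap publish step (names are illustrative, and the "slow work" is stubbed with string formatting):

```java
public class Report {
    private final Object lock = new Object();
    private String latest = "";

    public void refresh(String rawData) {
        // Slow work (parsing, formatting, I/O) done outside the lock,
        // so other threads are never blocked behind it.
        String formatted = rawData.trim().toUpperCase();
        // Lock held only for the quick field update.
        synchronized (lock) {
            latest = formatted;
        }
    }

    public String latest() {
        synchronized (lock) {
            return latest;
        }
    }
}
```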

Don't lock on mutable fields - the field can be set to a different object which means different threads using the same critical section can lock on different objects

Synchronizing on an object that differs between threads (eg 'this') while changing a common field (like a static field) is a known concurrency anti-pattern.

Any field guarded by a lock needs the lock to be accessible as widely as the field itself
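
Putting the last three notes together - a static field needs a static lock, and the lock object itself must be final so it can never be swapped out from under a thread (the class is illustrative):

```java
public class IdGenerator {
    // static: 'this' would give each instance its own lock and leave the
    // shared counter unprotected.
    // final: the lock reference can never change, so every thread always
    // locks the same object.
    private static final Object LOCK = new Object();

    // Guarded by LOCK, which is accessible everywhere the field is.
    private static long next = 0;

    public static long nextId() {
        synchronized (LOCK) {
            return next++;
        }
    }
}
```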

Use notifyAll() rather than notify()

Waiting on a stale condition (a wait() where the notify() that should wake it has already happened before entering the wait()) can look very similar to a deadlock without being an actual deadlock, ie things are available for processing, but nothing progresses, because the threads that could progress are waiting for a signal that never comes (or only comes at some point in the future, when the condition that triggers the signal happens to occur again)

Always check the wait() condition while holding the lock, ie the condition that let you enter the block to call the wait()

Object.wait() can unblock without a signal - a spurious wakeup. This can result in unintended progress when the condition that put the thread into the wait() still holds, so it should keep waiting. Check your wait() condition in a loop:

    synchronized (lock) {
        while (shouldWait()) {
            lock.wait();
        }
        progress();
    }
    finish();
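
A runnable version of the full pattern - condition checked in a loop while holding the lock, and notifyAll() rather than notify() so no waiter is left behind. A one-slot queue keeps the sketch short; the class name is illustrative:

```java
public class OneSlot<T> {
    private T slot; // null means empty

    public synchronized void put(T value) throws InterruptedException {
        while (slot != null) {   // loop guards against spurious wakeups
            wait();
        }
        slot = value;
        notifyAll();             // wake any thread waiting in take()
    }

    public synchronized T take() throws InterruptedException {
        while (slot == null) {   // re-check the condition after every wake
            wait();
        }
        T value = slot;
        slot = null;
        notifyAll();             // wake any thread waiting in put()
        return value;
    }

    // Demo: a producer thread puts a value, the calling thread takes it.
    public static int demo() {
        OneSlot<Integer> q = new OneSlot<>();
        Thread producer = new Thread(() -> {
            try { q.put(42); } catch (InterruptedException ignored) {}
        });
        producer.start();
        try {
            return q.take();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```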

Ask yourself: does it actually need to be multi-threaded? Does eager initialization cause any actual performance issue? Is there a high-level pre-built structure that provides the thread-safe functionality you need, so you can avoid low-level concurrency primitives?

Looking for concurrency bugs? Search for "synchronized" (or use a tool)

Distributed tracing is not just for latency - it also gives you an architecture view, showing which services call which other services

Trace across microservices by adding headers that carry trace context and passing them downstream - but make sure there is no significant overhead (maybe even sample if needed to keep overheads down)
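
A minimal sketch of that propagation; the header name and ID format here are illustrative assumptions, not any particular tracing standard:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TracePropagation {
    // Hypothetical header name - real systems use standards like
    // W3C Trace Context's "traceparent".
    static final String TRACE_HEADER = "X-Trace-Id";

    // Build the headers for a downstream call: reuse the incoming trace
    // id if present, otherwise start a new trace. This is all the work
    // needed per hop, so per-request overhead stays tiny.
    public static Map<String, String> outgoingHeaders(Map<String, String> incoming) {
        Map<String, String> out = new HashMap<>();
        String traceId = incoming.get(TRACE_HEADER);
        if (traceId == null) {
            traceId = UUID.randomUUID().toString();
        }
        out.put(TRACE_HEADER, traceId);
        return out;
    }
}
```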

Use size limited buffers and size limited trace entries to ensure that memory overhead of tracing is small