Chapter 11 - Performance and Scalability

Avoid premature optimization: first make it right, then make it fast, and only if actual performance measurements show it is not already fast enough.

Tuning for scalability is often different from tuning for performance; the two goals are frequently at odds.

Amdahl's Law: Speedup <= 1 / (F + (1 - F)/N), where F is the fraction of the computation that must be executed serially and N is the number of processors.
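As a worked example of the bound, a minimal sketch (class and method names are illustrative, not from the book):

```java
// Sketch: Amdahl's Law bound for a few processor counts,
// assuming a serial fraction F of 10%.
public class AmdahlDemo {
    // Speedup <= 1 / (F + (1 - F) / N)
    static double speedup(double f, int n) {
        return 1.0 / (f + (1.0 - f) / n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 10, 100, 1000}) {
            System.out.printf("N=%4d  max speedup = %.2f%n", n, speedup(0.10, n));
        }
        // With F = 10%, speedup can never exceed 1/F = 10,
        // no matter how many processors are added.
    }
}
```

The takeaway: the asymptotic limit is 1/F, so even a small serial fraction caps scalability hard.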

A shared work queue adds some (often overlooked) serial processing. Result handling is another form of serialization hidden inside otherwise seemingly 100% concurrent programs.

Costs of using threads

Context switches: managing the shared data structures in the OS and JVM consumes memory and CPU. A context switch can also cause a flurry of processor cache misses as the incoming thread's working set is loaded.

When a thread blocks on a lock, it is switched out before using its full scheduling quantum, leading to more context-switch overhead.

Context switching costs 5000-10000 clock cycles (a few microseconds). Use vmstat to find the percentage of time the program spent in the kernel; a high percentage can indicate heavy context switching.

synchronized and volatile result in the use of special CPU instructions called memory barriers, which involve flushing or invalidating CPU caches, stalling execution pipelines, and flushing hardware write buffers. They also inhibit compiler optimizations, because operations cannot be reordered across them.

The performance of contended and uncontended synchronization is very different. synchronized is optimized for the uncontended case (20 to 250 clock cycles). volatile is always uncontended.

Modern JVMs can optimize away locking code that can be proven to never contend.

Do not use object pools. Object allocation and GC were slow in earlier versions of Java; now object allocation is faster than a C malloc, on the order of ten machine instructions. Object pools also introduce synchronization overhead.

Chapter 12 - Testing Concurrent Programs

Every test must wait until all the threads it created terminate. It should then report any failures in tearDown().
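One way to structure this, sketched below with illustrative names (this is not the book's framework): a harness that records failures thrown in any test thread, joins all threads, and then rethrows.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch: a test harness that captures per-thread failures so they can
// be reported from tearDown() on the main test thread.
public class ThreadedTestHarness {
    private final List<Throwable> failures = new CopyOnWriteArrayList<>();
    private final List<Thread> threads = new CopyOnWriteArrayList<>();

    public Thread startTestThread(Runnable task) {
        Thread t = new Thread(() -> {
            try {
                task.run();
            } catch (Throwable e) {
                failures.add(e);   // remember failures from any thread
            }
        });
        threads.add(t);
        t.start();
        return t;
    }

    // Call from tearDown(): wait for all threads, then surface failures.
    public void tearDown() throws InterruptedException {
        for (Thread t : threads)
            t.join();
        if (!failures.isEmpty())
            throw new AssertionError("failures in test threads: " + failures);
    }
}
```

Without this, an exception thrown in a worker thread silently dies with the thread and the test passes spuriously.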

Testing blocking operations needs some way to unblock a thread that has blocked as expected. This is usually done by performing the blocking operation in a new thread and interrupting that thread after waiting for some time. An InterruptedException is thrown if the operation blocked as expected.
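A minimal sketch of this pattern, assuming a java.util.concurrent.BlockingQueue and an illustrative 500 ms wait before interrupting:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: verify that take() on an empty queue blocks, by interrupting
// the taker and expecting InterruptedException.
public class TakeBlocksWhenEmpty {
    static boolean verifyTakeBlocks() throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        final boolean[] blocked = {false};
        Thread taker = new Thread(() -> {
            try {
                queue.take();               // should block: the queue is empty
            } catch (InterruptedException success) {
                blocked[0] = true;          // blocked, then unblocked by interrupt
            }
        });
        taker.start();
        Thread.sleep(500);    // give take() time to block (a heuristic, not a guarantee)
        taker.interrupt();    // unblock the taker
        taker.join(1000);     // bounded join so a broken test cannot hang forever
        return blocked[0] && !taker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(verifyTakeBlocks() ? "OK: blocked as expected" : "FAIL");
    }
}
```

The bounded join matters: if take() wrongly returned or the thread hung, an unbounded join would hang the whole test run.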

Thread.getState() should not be used for concurrency control or testing. Useful only for debugging.

One approach to test producer-consumer programs is to check that everything that is put into a queue or buffer eventually comes out of it, and nothing else does.

For single-producer, single-consumer designs, use an order-sensitive checksum of the elements as they are added, and verify it as elements are removed. Do not use a synchronized shadow list to track the elements, as that would introduce artificial serialization.

For multiple-producer, multiple-consumer designs, use an order-insensitive checksum that can be combined at the end of the test to verify that all enqueued elements have been dequeued.

Make sure that the checksums are not guessable by the compiler (e.g., consecutive integers), so that they cannot be precomputed. Use a simple random number generator such as: int xorShift(int y) { y ^= (y << 6); y ^= (y >>> 21); y ^= (y << 7); return y; }
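A sketch of how the order-insensitive combination might look, using the xorShift generator above and addition as the combiner. The single-threaded loop only illustrates the structure; in a real test putSum would be updated by producer threads and takeSum by consumer threads, with the comparison made after all threads join.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: producers add each element's checksum to putSum, consumers add
// each dequeued element's checksum to takeSum. Addition is commutative,
// so any interleaving that moves every element through the queue makes
// the two totals equal.
public class ChecksumDemo {
    static int xorShift(int y) {
        y ^= (y << 6);
        y ^= (y >>> 21);
        y ^= (y << 7);
        return y;
    }

    public static void main(String[] args) {
        AtomicLong putSum = new AtomicLong(0);
        AtomicLong takeSum = new AtomicLong(0);
        int value = (int) System.nanoTime();  // seed the compiler cannot predict
        for (int i = 0; i < 1000; i++) {
            value = xorShift(value);
            putSum.addAndGet(value);          // producer side
            takeSum.addAndGet(value);         // consumer side
        }
        System.out.println(putSum.get() == takeSum.get()
                ? "checksums match" : "checksums differ");
    }
}
```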

Test on multi-processor machines with fewer processors than active threads.

Generate more thread interleavings by using Thread.yield() to encourage more context switches during operations that access shared state.
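A sketch of the idiom, with an illustrative counter class and a one-in-two yield rate (both are assumptions, not from the source):

```java
import java.util.Random;

// Sketch: occasionally yield in the middle of an operation on shared
// state to widen the window in which another thread can be scheduled.
public class YieldingCounter {
    private final Random rnd = new Random();
    private int count;                  // shared state, guarded by this

    public synchronized void increment() {
        int next = count + 1;
        if (rnd.nextInt(2) == 0)
            Thread.yield();             // encourage a context switch mid-operation
        count = next;
    }

    public synchronized int get() { return count; }
}
```

In production code the yield would be compiled out or guarded by a test flag; it exists only to shake loose timing-sensitive bugs during testing.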

Always include some basic functionality testing when doing performance testing to make sure that you are not measuring performance of broken code.

Ensure that garbage collection does not run at all during your test (check this with the -verbose:gc flag), or ensure that garbage collection runs many times during the test (which requires running the test for a long time).

Your tests should run only after all code has been compiled; there is no point in measuring the performance of interpreted bytecode. Dynamic compilation consumes CPU resources, and compiled code executes much faster.

Code may be decompiled and recompiled multiple times during execution, e.g. if a prior assumption made by the JVM is invalidated, or to recompile with better optimizations based on recently gathered performance statistics.

Run the program long enough (several minutes) so that compilation and interpreted execution represent a small fraction of the results and do not bias them.

Or have an unmeasured warm-up run before starting to collect performance statistics.

Run JVM with -XX:+PrintCompilation so that we know when dynamic compilation happens.

When running multiple unrelated computationally intensive tests in a single JVM, place explicit pauses between tests in order to give the JVM a chance to catch up with its background tasks. Don't do this when measuring multiple related activities, since omitting CPU required by background tasks gives unrealistic results.

In order to obtain realistic results, concurrent performance tests should approximate the thread-local computation done by a typical application. Otherwise, there will be unrealistic contention.

Make sure that compilers do not optimize away benchmarking code.

Trick to make sure that benchmarking calculation is not optimized away: if (fox.hashCode() == System.nanoTime()) System.out.print(" ");

Complementary Testing Approaches

Code Review

Static analysis tools: FindBugs has detectors for:

Inconsistent synchronization.

Invoking Thread.run() instead of Thread.start() (calling run() directly executes the task in the current thread rather than starting a new one)

Unreleased lock

Empty synchronized block

Double-checked locking

Starting a thread from a constructor

Notification errors

Condition wait errors: Object.wait() or Condition.await() should be called in a loop with the appropriate lock held after testing some state predicate.
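The idiom this detector expects might be sketched as follows, with a hypothetical `ready` predicate (class and method names are illustrative):

```java
// Sketch: the canonical condition-wait idiom. wait() is called in a loop,
// with the lock held, re-testing the state predicate after every wakeup.
public class BoundedFlag {
    private boolean ready = false;      // state predicate, guarded by this

    public synchronized void awaitReady() throws InterruptedException {
        while (!ready)                  // loop guards against spurious wakeups
            wait();                     // releases the lock while waiting
    }

    public synchronized void setReady() {
        ready = true;
        notifyAll();                    // wake all waiters; they re-test the predicate
    }
}
```

Testing the predicate in a while loop (not an if) is what makes the code robust to spurious wakeups and to notifications meant for other waiters.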