Anonymous classes can hinder JIT optimization: they add an extra subclass, which defeats the JIT compiler's speculative optimization that assumes a class has no subclasses.

Static methods are easy for the JIT compiler to inline because they cannot be overridden in subclasses. For efficiency, utility classes should favour small static methods over large or non-static ones.

The more predictable the outcome of a sequence of instructions, the faster it can run: the JIT compiler can detect the common path and optimize for it.

Profiling tools can lie; consider their results carefully.

JITWatch is a tool for analysing JIT-compiled instructions; JMH is a tool for microbenchmarking.

False sharing (where fields that are close together share a CPU cache line but are updated by different threads, which makes the caches thrash) can be avoided by adding padding between the fields, or by using the @Contended annotation.
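A minimal sketch of the manual-padding approach, assuming 64-byte cache lines and 8-byte longs (actual field layout is JVM-dependent, and the @Contended annotation lives in an internal JDK package, so padding is the portable fallback):

```java
// Manual padding to keep two hot counters on different cache lines.
// Seven long padding fields (~56 bytes) push the next object's hot
// field out of this field's cache line on typical 64-byte-line CPUs.
public class PaddedCounters {
    static final class Padded {
        volatile long value;             // hot field, updated by one thread
        long p1, p2, p3, p4, p5, p6, p7; // padding to fill the cache line
    }

    final Padded a = new Padded(); // written only by thread A
    final Padded b = new Padded(); // written only by thread B

    public static void main(String[] args) {
        PaddedCounters c = new PaddedCounters();
        c.a.value = 1;
        c.b.value = 2;
        System.out.println(c.a.value + c.b.value); // 3
    }
}
```

The padding fields are never read; they exist purely to influence layout.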

Logging is a common performance problem, especially when done synchronously. The most popular logging frameworks have had many of these issues eliminated, provided you use them properly.

A non-G1 GC log line breaks down as:
- timestamp of the event (wall-clock time)
- time since the JVM started
- type of event (minor, major, full = minor+major)
- then, possibly repeated for each generation collected: the generation being collected; the heap of that generation used before the GC started; the heap of that generation used after the GC finished; the available heap of that generation; the time taken for the event
- the combined (young + old gen) before, after, and available figures
- the same for permgen, if relevant
- finally user (CPU time spent in user space), sys (CPU time spent in kernel space), and real (wall-clock) times for the GC.

Normally user+sys is the total CPU time. In serial mode, real should be approximately user+sys. For a parallel collector you would typically see real*N = user+sys, where N is the parallelism.

A G1 GC log entry spans multiple lines, breaking down as:
- timestamp of the event (wall-clock time)
- time since the JVM started
- phase of the event
- generation of the event
- "GC Workers: N" gives the number of GC threads
- the "Eden" line gives details of the young gen in repeated sections of: the space being collected; the heap of that space used before the GC started; the heap of that space used after the GC finished; the available heap of that space; the time taken for the event — and the same for the generation total
- finally user (CPU time spent in user space), sys (CPU time spent in kernel space), and real (wall-clock) times for the GC.

There are many dozens of GC log line formats depending on JVM vendor, JVM version, GC algorithm, the flags chosen for printing, and the type of GC event. Most present roughly similar information in outline, with some or all of these items: the timestamp; the type of event; the generation or heap space the event happens in; the before, after, and available heap sizes; and the time taken for the event.

GC KPIs are throughput (the proportion of time the application spends doing application work, ie not paused in GCs), pause times, and GC CPU consumption.

To identify memory leaks, look at the heap used after GCs; if this is continually increasing (especially after full GCs) it indicates a memory leak.

Streams only start evaluating when the terminal operation (eg .sum()) is encountered.
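A small sketch of this laziness (the class name and printed messages are illustrative): the map() callback only fires once the terminal sum() runs, not when the pipeline is built.

```java
import java.util.stream.Stream;

public class LazyStreams {
    public static void main(String[] args) {
        // Build a pipeline with an intermediate map(); nothing runs yet.
        Stream<Integer> doubled = Stream.of(1, 2, 3)
            .map(x -> {
                System.out.println("mapping " + x);
                return x * 2;
            });

        System.out.println("pipeline built, nothing mapped yet");

        // The terminal operation triggers evaluation of the whole pipeline.
        int sum = doubled.mapToInt(Integer::intValue).sum();
        System.out.println("sum=" + sum); // 12
    }
}
```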

You can avoid executing conditional code by wrapping it in a lambda expression. For example, a method called to produce data for a logger will always execute if passed as a plain method call, even when the logging level means nothing will be logged; wrapping the call in a lambda lets the logger invoke the method only when it actually needs the value.
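A sketch using java.util.logging, whose Logger methods accept a Supplier overload for exactly this purpose (the logger name, expensive() helper, and call counter are illustrative):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogging {
    static int calls = 0;

    static String expensive() {
        calls++; // count how often the "expensive" work actually runs
        return "expensive result";
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        log.setLevel(Level.INFO); // FINE messages will be discarded

        log.fine(expensive());       // eager: argument evaluated even though FINE is off
        log.fine(() -> expensive()); // lazy: supplier never invoked at INFO level

        System.out.println("calls=" + calls); // 1: only the eager call paid the cost
    }
}
```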

Functional programming avoids explicit state management, which makes it easy to parallelise - but you need to ensure that you don't explicitly manage state. Don't use loops (for, while, do-while), even implicitly (forEach()).

Collection.parallelStream() or Stream.parallel() makes a stream parallel; Stream.sequential() makes it sequential. Multiple calls to parallel() and sequential() are lazily evaluated at the terminal operation, and the last call wins: the whole filter-map-reduce pipeline executes in that mode (all operations sequential or all operations parallel).
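The last-call-wins behaviour can be observed via Stream.isParallel(), a sketch (class name illustrative):

```java
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ParallelFlag {
    public static void main(String[] args) {
        // parallel()/sequential() just toggle a flag on the whole pipeline;
        // the last call before the terminal operation wins.
        Stream<String> s = Stream.of("a", "b").parallel().sequential();
        System.out.println("parallel=" + s.isParallel()); // false

        // Here the pipeline ends up parallel, and the whole sum runs that way.
        long sum = IntStream.rangeClosed(1, 100).sequential().parallel().sum();
        System.out.println("sum=" + sum); // 5050
    }
}
```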

A parallel stream uses the single common fork-join pool shared by all streams. Its parallelism defaults to one less than Runtime.getRuntime().availableProcessors() - the invoking thread also participates, making up the difference - and it can be changed with the command-line option -Djava.util.concurrent.ForkJoinPool.common.parallelism=N.

Nesting parallel stream operations distributes the work across the same set of threads; you don't get extra threads.

IO operations in stream operations may block a thread in the underlying fork-join pool.

You can use a custom ForkJoinPool to execute stream operations by submitting your combined stream operation to the pool.
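A sketch of that technique (the pool size of 4 is an arbitrary illustration): because the stream's fork-join tasks run in whichever pool the submitting task is executing in, wrapping the whole pipeline in pool.submit() keeps it off the common pool.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class CustomPool {
    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(4); // dedicated pool, not the common pool

        // Submit the combined stream pipeline as one task; its parallel
        // work is then scheduled on this pool's workers.
        long sum = pool.submit(
            () -> LongStream.rangeClosed(1, 1000).parallel().sum()
        ).get();

        System.out.println("sum=" + sum); // 500500
        pool.shutdown();
    }
}
```

Note this relies on an implementation detail of the stream machinery rather than a documented API, but it is a widely used idiom.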

A parallel stream is only faster than a serial stream if the combination of the number of elements, the work per element, and the independence of the parallel operations is sufficient to overcome the overhead of operating a parallel pool.

Files.lines() in JDK 9 maps the file into memory and splits it on the line separator closest to the middle of each chunk, allowing multiple threads to actually read lines in parallel.
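A sketch of driving that in practice (the temp-file setup is just scaffolding for the example):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class ParallelLines {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("lines-demo", ".txt");
        Files.write(tmp, List.of("alpha", "beta", "gamma", "delta"));

        // Files.lines() returns a stream backed by the file; on JDK 9+ its
        // spliterator can split near the middle on a line separator, so the
        // parallel count can proceed on multiple threads.
        long count;
        try (Stream<String> lines = Files.lines(tmp)) {
            count = lines.parallel().count();
        }
        System.out.println("lines=" + count); // 4

        Files.delete(tmp);
    }
}
```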

Pooling and reusing connections rather than setting up and tearing them down for each request is a big performance improvement.

If you will be sending the same data component repeatedly, you can marshall that once and cache the marshalled form or a reference to it (ideally in a send buffer close to the wire) for improved scalability.

Batching multiple messages into one payload improves scalability but impacts latency so is a balancing act.

Too many thread pools can result in too much context switching which impacts performance.

Distributed requests can result in cycles that block the request from progressing (a distributed deadlock). One solution is to provide priority request handling so that a request can be prioritised above existing work and so ensure progression (you'll need a mechanism to ensure that higher priority requests have available processing capability).

Don't combine timing measures into one resultant time; this hides the underlying cause. Retain all timing components so that it is easy to determine what is causing a long latency.

If using network IPC on the same host, try to optimize the loopback pathways.

When communicating across hosts, quickest-route (fewest-hops) routing is important for low latency.

On virtualized hosts, pinned VMs show lower latency than floating VMs, so use them if you can.

Setting threads (and processes) to have affinity with specific cores minimizes latency, provided there is not much contention for the core.

Timeouts are difficult to set in distributed systems; too small a timeout will cause more drops or retries than necessary when the remote system happens to hit a slight bottleneck; too large a timeout ties up resources waiting for responses and can cause artificial limits to be hit when the requests should have been dropped or failed.

Optimize communications by: reducing traffic as much as possible; sending in big chunks; batching; optimising throughput using async but response time using sync; re-using connections; checking and tracking errors; profiling and tuning. Edge cases happen more often than you expect.

A resource modified from different threads concurrently can become corrupt unless concurrency controls are applied.

The synchronized keyword applies to code blocks and methods, providing concurrency protection by ensuring that a critical section of code is never executed concurrently by two different threads.

The volatile keyword applies to fields and ensures that the data in the field is always that held in main shared memory rather than thread-local memory.

In the absence of explicit concurrency management, threads can store and modify data in thread-local memory and flush to or fault from shared memory non-deterministically.

"volatile" provides memory visibility but execution control can only be achieved via "synchronized".

"volatile" guarantees read and write operations on the field are atomic even for long and doubles, non-volatile long and double fields reads and writes are not guaranteed to be atomic (all other field types are guaranteed to be atomic).

Synchronization does not come for free: it introduces latency when a thread accesses a lock currently held by another thread. This is known as thread contention, and it can also lead to deadlocks and livelocks.

Every Java object has an exclusive intrinsic lock, acquired by entering a synchronized block guarded by that object and released when exiting the block. A synchronized method treats the whole method as the block, using "this" as the lock object (or the Class instance for static methods). Thread coordination is enabled using wait(), notify(), and notifyAll().
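A minimal sketch of intrinsic locking plus wait()/notify() coordination (the lock object, flag, and class name are illustrative):

```java
public class WaitNotify {
    private static final Object lock = new Object();
    private static boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            synchronized (lock) {   // acquire the same intrinsic lock
                ready = true;
                lock.notify();      // wake a thread waiting on this lock
            }
        });

        synchronized (lock) {
            worker.start();
            // wait() releases the lock and blocks; loop guards against
            // spurious wakeups, re-checking the condition on each wake.
            while (!ready) {
                lock.wait();
            }
        }
        System.out.println("ready=" + ready);
    }
}
```

Note wait() and notify() may only be called while holding that object's lock, otherwise they throw IllegalMonitorStateException.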

Synchronizing over different lock objects allows you to synchronize fields separately.

CountDownLatch allows multiple threads to wait until the latch has counted down to 0. It is initialised with a number; each thread-safe call to countDown() decrements the count, and await() blocks until the count reaches 0.
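A sketch with three worker threads signalling one waiter (the count of 3 is arbitrary):

```java
import java.util.concurrent.CountDownLatch;

public class LatchDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(3);

        for (int i = 0; i < 3; i++) {
            new Thread(() -> {
                // ... some work would happen here ...
                latch.countDown(); // thread-safe decrement
            }).start();
        }

        latch.await(); // blocks until the count reaches 0
        System.out.println("count=" + latch.getCount()); // 0
    }
}
```

Note the count cannot be reset; for a reusable equivalent, see CyclicBarrier.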

CyclicBarrier allows threads to wait for each other to reach a common barrier point. It behaves similarly to CountDownLatch but counts up to N, with support for an optional Runnable that runs after the last party arrives. It can be reused after the threads are released.
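A sketch with three parties and a barrier action (party count and messages are illustrative); the action runs in the last-arriving thread before any party is released:

```java
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) throws Exception {
        CyclicBarrier barrier = new CyclicBarrier(3,
            () -> System.out.println("all parties arrived")); // runs once per trip

        Thread[] workers = new Thread[3];
        for (int i = 0; i < 3; i++) {
            workers[i] = new Thread(() -> {
                try {
                    barrier.await(); // each thread blocks here until all 3 arrive
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[i].start();
        }

        for (Thread w : workers) w.join();
        System.out.println("done");
    }
}
```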

Exchanger blocks until its counterpart presents its information, at which point the two values are swapped; the same behaviour occurs on both sides.
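A two-thread sketch of the swap (the exchanged strings are illustrative); each side blocks in exchange() until the other arrives, then each receives the other's value:

```java
import java.util.concurrent.Exchanger;

public class ExchangerDemo {
    public static void main(String[] args) throws InterruptedException {
        Exchanger<String> exchanger = new Exchanger<>();

        Thread worker = new Thread(() -> {
            try {
                // blocks until main also calls exchange(), then values swap
                String received = exchanger.exchange("from-worker");
                System.out.println("worker got " + received);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        String received = exchanger.exchange("from-main"); // blocks for the worker
        worker.join(); // let the worker print first, so output order is stable
        System.out.println("main got " + received);
    }
}
```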

Phaser is similar to CountDownLatch and CyclicBarrier, but more flexible: parties can register and deregister at any time; you can block until all parties arrive; or it can be terminated, forcing all synchronization methods to return.