In a large system you can profile for average latency or throughput. But to achieve consistent latencies you need to analyse key portions of your system - having simple components which run independently and can be tested standalone helps achieve this.

Latencies below 100 microseconds 99.9% of the time can be achieved with Java on commodity hardware by: using a low-latency infrastructure for messaging and logging (around 1 microsecond for short messages); minimizing network hops; focusing on the worst-performing requests; dedicating each CPU core to a specific task/service, with its own CPU cache data and code; and using the L2 cache coherence bus as the messaging layer between these core-bound high-performance services.

As a service is decomposed into parallel operations over shared resources, the application becomes more efficient and its responses exhibit lower latency - but only up to the limit imposed by those shared resources.
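That limit can be made concrete with Amdahl's law; a minimal sketch, where the 10% serial fraction is an illustrative assumption standing in for contention on the shared resources:

```java
public class Amdahl {
    // Amdahl's law: with serial fraction s, the speedup on n parallel workers
    // is 1 / (s + (1 - s) / n), capped at 1/s however many workers you add.
    static double speedup(double serialFraction, int workers) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
    }

    public static void main(String[] args) {
        // With 10% of the work serialized on a shared resource,
        // 8 workers give roughly 4.7x, and no worker count exceeds 10x.
        System.out.println(speedup(0.1, 8));
    }
}
```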

The most basic threading model is the single-threaded, single-process model - the simplest way to write code. Its throughput will not increase with additional load, and CPU utilization is limited to one core.
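A minimal sketch of that model (the handler names are illustrative): every request is handled in arrival order on the one thread, so each request waits for all the work queued before it:

```java
import java.util.ArrayList;
import java.util.List;

public class SingleThreadedServer {
    // One thread handles every request sequentially; no request overlaps
    // another, so a single core is the hard throughput ceiling.
    static String handle(String request) {
        return "ok:" + request;
    }

    static List<String> serve(List<String> requests) {
        List<String> responses = new ArrayList<>();
        for (String request : requests) {
            responses.add(handle(request)); // each request waits for all before it
        }
        return responses;
    }

    public static void main(String[] args) {
        System.out.println(serve(List.of("a", "b", "c")));
    }
}
```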

A step up from the basic model is a single-threaded, multi-process model, where a new process is created for each request. This adds the overhead of process creation and of constantly creating and destroying per-process resources, unless those resources can be cached across process lifecycles.

A step up from the single-threaded, multi-process model is a single-threaded, reused multi-process model: the reusable processes form a worker pool. Complexity is added in managing the process lifecycle so that requests remain independent.

Multi-threaded, single-process threading models are usually more efficient than single-threaded models, at the cost of additional complexity and having to handle more types of race conditions.
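A minimal sketch of the multi-threaded, single-process model using a fixed pool; the AtomicLong shows the kind of shared-state care those extra race conditions demand (a plain long++ here would silently lose updates):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class MultiThreadedServer {
    // Many requests in flight inside one process; any shared mutable
    // state must be made thread-safe (here, an atomic counter).
    static long handleAll(int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicLong handled = new AtomicLong(); // plain long++ would race
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> handled.incrementAndGet());
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled.get();
    }

    public static void main(String[] args) {
        System.out.println(handleAll(10_000));
    }
}
```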

An API should return a (paged) collection rather than a single item if multiple items could be returned; otherwise you force clients into multiple inefficient calls to obtain all the items.
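A minimal sketch of a paged return shape (Page and findItems are hypothetical names, not from any particular framework): the page carries its items plus enough metadata for the client to fetch the rest without guessing.

```java
import java.util.List;

public class PagedApi {
    // A page of results plus a flag telling the client whether to page on.
    record Page<T>(List<T> items, boolean hasNext) {}

    static Page<String> findItems(List<String> all, int pageIndex, int pageSize) {
        int from = Math.min(pageIndex * pageSize, all.size());
        int to = Math.min(from + pageSize, all.size());
        return new Page<>(all.subList(from, to), to < all.size());
    }

    public static void main(String[] args) {
        Page<String> first = findItems(List.of("a", "b", "c"), 0, 2);
        System.out.println(first.items() + " hasNext=" + first.hasNext());
    }
}
```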

Third-party frameworks have specific tuning options which you need to understand to get the best performance from them. Invest time in learning those.

Splitting component-to-component calls across processes turns what was previously an in-process call into a remote call. This is acceptable as long as such calls are infrequent; otherwise they cause many round trips, which carry a very high latency cost.

Look for key architectural metrics such as call frequency and call count between servers, the amount of data transferred, and which services talk to each other.

Bad Service Access Patterns: Excessive Service Calls (too many service calls per single end-to-end use case); N+1 Service Call Pattern (one service call per item where a single batched call would do); High Service Network Payload (WAN overheads are dramatically higher than LAN); Connection & Thread Pool Sizing (starvation causes blocking); Excessive use of Asynchronous Threads; Architectural Violations (all services should operate through the appropriate component/service); Lack of Caching.
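A sketch of fixing the N+1 pattern, assuming the remote service offers (or can be given) a batch endpoint; fetchOne and fetchBatch are hypothetical stand-ins for remote calls:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchedCalls {
    // N+1 anti-pattern: one call to list the ids, then one remote call per id.
    static String fetchOne(int id) { // stands in for one remote round trip
        return "detail-" + id;
    }

    // Batched alternative: all details in a single round trip.
    static Map<Integer, String> fetchBatch(List<Integer> ids) {
        Map<Integer, String> out = new HashMap<>();
        for (int id : ids) {
            out.put(id, "detail-" + id);
        }
        return out;
    }

    public static void main(String[] args) {
        // N+1 costs ids.size() round trips; the batch costs one.
        System.out.println(fetchBatch(List.of(1, 2, 3)));
    }
}
```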

Uncontrolled thread creation is highly inefficient; if you are using more threads overall than available hardware threads, you should be controlling thread usage with size-limited pools. Some metrics to monitor are: incoming requests/transactions and the total number of active threads involved in execution; CPU utilization; and context switching.
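A minimal sketch of a size-limited pool sized from the hardware thread count; the queue bound and rejection policy are illustrative choices that make overload visible instead of spawning threads without limit:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    static ThreadPoolExecutor create() {
        int cores = Runtime.getRuntime().availableProcessors();
        // No more pool threads than hardware threads; a bounded queue plus
        // CallerRunsPolicy applies back-pressure when the system is overloaded.
        return new ThreadPoolExecutor(
                cores, cores,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1024),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create();
        pool.execute(() -> System.out.println("handled on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```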

Why distribute your system? Scale and fault tolerance. But these are in conflict with each other: higher scale means more failures, and more failure tolerance means more overhead, which reduces scaling. Everything is a tradeoff between these.

Data locality is important for efficiency. Shard data so that each node can operate efficiently on its own subset of the dataset.
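A minimal hash-sharding sketch (shardFor is a hypothetical helper); any scheme works as long as the same key deterministically lands on the same node:

```java
public class Sharding {
    // Map a key deterministically to one of shardCount nodes so all
    // operations on that key stay local to a single node's dataset.
    static int shardFor(String key, int shardCount) {
        return Math.floorMod(key.hashCode(), shardCount); // floorMod: never negative
    }

    public static void main(String[] args) {
        System.out.println("user-42 -> shard " + shardFor("user-42", 8));
    }
}
```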

Synchronous systems are easier to reason about, but less efficient: threads blocked waiting on replies are capacity you have paid for but cannot use.

Misbehaving clients can cause issues throughout the system if you don't protect against their (possibly coordinated) errors, such as synchronized retry storms.

Build "silence" into your systems - think about what needs to communicate, and when, then don't over-communicate. Can cache items last longer? Can some components be isolated from each other? The more silence, the more efficient the system.
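One way to buy silence is longer cache lifetimes; a minimal TTL-cache sketch (the class and method names are illustrative), where raising the TTL directly reduces refresh traffic to the backing service:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TtlCache<K, V> {
    private record Entry<V>(V value, long expiresAtNanos) {}

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlNanos;

    public TtlCache(long ttlMillis) {
        // A longer TTL means fewer refresh calls to the backing service.
        this.ttlNanos = ttlMillis * 1_000_000L;
    }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.nanoTime() + ttlNanos));
    }

    public V get(K key) {
        Entry<V> entry = map.get(key);
        if (entry == null || System.nanoTime() - entry.expiresAtNanos() > 0) {
            map.remove(key); // expired or absent: caller must refetch
            return null;
        }
        return entry.value();
    }

    public static void main(String[] args) {
        TtlCache<String, String> cache = new TtlCache<>(60_000);
        cache.put("config", "v1");
        System.out.println(cache.get("config"));
    }
}
```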

The JVM can pause all application threads for a variety of reasons. Common ones include: garbage collection, various JIT actions (e.g. deoptimization and other on-stack-replacement causes), biased lock revocation, and various JVMTI operations.

Logging JVM pauses from all safepoints can be achieved by adding the flag -XX:+PrintGCApplicationStoppedTime (despite the flag's name, this logs stopped times from all causes, not just GC).

To determine the reason for a pause, add the flags -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1, logging to a file with -XX:+LogVMOutput -XX:LogFile=something.log (LogVMOutput is a diagnostic flag, so -XX:+UnlockDiagnosticVMOptions is needed as well). The resulting log lines consist of the following fields: a timestamp; the operation that caused the pause ("no vm operation" corresponds to the periodic pause, by default once a second, that processes queued-up non-urgent operations, tunable with GuaranteedSafepointInterval); the number of threads stopped; the number of threads running; the number of threads blocked; and internal safepoint timings.
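Putting the flags together (MyApp and the log file name are placeholders; these are HotSpot flags from the pre-unified-logging era, so check them against your JDK version):

```shell
# Log every safepoint pause, its cause, and per-thread counts to safepoint.log.
# LogVMOutput is diagnostic, hence the UnlockDiagnosticVMOptions flag.
java -XX:+PrintGCApplicationStoppedTime \
     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
     -XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput -XX:LogFile=safepoint.log \
     MyApp
```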