Memory size is tuned by: reducing the data held in memory; reducing the overhead of holding data in memory (so that the structures take up less space); and eliminating memory leaks.

Analyse gross memory usage with verbosegc and equivalent flags, noting heap sizes before->after and the times taken. Gross tuning is to resize the heap. Application object tuning is to use caches or weakly held objects.
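
The same gross figures can also be read from inside the JVM. The following is a minimal sketch, assuming the standard java.lang.management beans are available; the class name GcSnapshot and its helper methods are hypothetical and only for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper class; the name GcSnapshot is an illustration only.
final class GcSnapshot {
    /** Bytes of heap currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Approximate total milliseconds spent in all collectors since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("heap used: %d bytes%n", usedHeapBytes());
    }
}
```

Sampling these values periodically gives a rough before/after picture, though the GC log remains the more detailed source.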

Objects have an overhead that may include padding and depends on the JVM implementation; it can be 32 bytes per object. Below about 32GB of heap you can use compressed object pointers (done automatically in recent JVMs), which gives a 30% saving in memory usage. This means there is a jump in memory cost above 32GB of heap: above 32GB you can typically fit fewer objects in the heap until you go above 48GB.

Primitive wrappers (whether implicit or explicit) are a large memory overhead on data.

Only use queueing when tight latency constraints don't matter.

Thrift generated communications have lower latency and bandwidth costs but higher maintenance overheads. They are not efficient to use internally as domain objects.

ThreadLocals stay around if you use pooled threads; this can easily become a difficult-to-detect memory leak.
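
One common way to limit this is to clear the thread-local in a finally block before the pooled thread is handed back. This is a minimal sketch; the class ThreadLocalCleanup, the BUFFER field and the handle method are hypothetical names used only for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical request handler showing per-request thread-local cleanup.
final class ThreadLocalCleanup {
    // Per-request scratch buffer; no initial value, so get() returns null when unset.
    private static final ThreadLocal<StringBuilder> BUFFER = new ThreadLocal<>();

    static String handle(String request) {
        BUFFER.set(new StringBuilder());
        try {
            return BUFFER.get().append("handled:").append(request).toString();
        } finally {
            // Without this, the pooled thread keeps the buffer reachable after the request ends.
            BUFFER.remove();
        }
    }

    /** True when the calling thread holds no buffer. */
    static boolean bufferIsCleared() {
        return BUFFER.get() == null;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        System.out.println(pool.submit(() -> handle("a")).get());
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```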

Tuning tradeoffs tend to be between memory footprint, latency and throughput. Improving one tends to make the others worse - unless you have spare CPU available, when sometimes you can trade the spare CPU for improvement in one of these without affecting the others.

The ideal young generation size is: big enough to hold one set of all concurrent request handling objects; with survivor space big enough to hold all active objects and any tenuring ones; with a tenuring threshold that tenures long-lived objects as fast as possible while not tenuring any short-lived objects.

Throughput collectors can tune themselves after you give them hints: -XX:+UseAdaptiveSizePolicy -XX:MaxGCPauseMillis=NNN -XX:GCTimeRatio=MMM (the last is a ratio rather than a percentage: the goal is to spend no more than 1/(1+MMM) of total time in GC).

Garbage collector recommended starting points: for bulk services, the throughput collector with no adaptive sizing; for all others, the throughput collector with adaptive sizing or, if that fails, concurrent mark-sweep (CMS).

Tune the young generation size first (after providing the old generation with enough size to retain the maximum working heap plus space overhead for GCs), focusing on the tenuring to size the survivor space and making Eden as large as possible up to pause time constraints. Tenuring should be strongly declining across ages.

Tuning for CMS: find the minimum and maximum working sizes (the stop-the-world throughput collector will tell you that), then overprovision memory above that by 25%-35% - CMS needs a cushion while it works and that's what this provides.
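
The arithmetic behind the cushion is simple; the sketch below only illustrates it. The class name CmsHeadroom and the 4000 MB working-set figure are hypothetical, not measured values.

```java
// Hypothetical calculator for the 25%-35% CMS cushion described above.
final class CmsHeadroom {
    /** Heap size (same unit as the input) after adding a fractional cushion, e.g. 0.25 for 25%. */
    static long provisionedHeap(long maxWorkingSet, double cushion) {
        return Math.round(maxWorkingSet * (1.0 + cushion));
    }

    public static void main(String[] args) {
        long maxWorkingMb = 4000; // assumed figure, as read from throughput-collector GC logs
        System.out.println("low end:  " + provisionedHeap(maxWorkingMb, 0.25) + " MB");
        System.out.println("high end: " + provisionedHeap(maxWorkingMb, 0.35) + " MB");
    }
}
```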

When a full GC happens, every thread-stack must be walked so more threads means longer GCs.

With JMS, the key problem with Queues is ensuring that there is only one reader that actually successfully executes a job (as opposed to none). The key problem with Topics is ensuring that only one reader of the many actually successfully executes a job (as opposed to many readers executing it).

Persistence typically reduces throughput by an order of magnitude or more.

One solution for failover is to maintain a set of JVMs with one active and the rest on hot standby, using heartbeats to detect if the active dies and negotiate a failover to becoming active.

It is easy to add memory overhead - especially the more you abstract the code/frameworks. Good coding practices actually tend to encourage additional memory overhead.

A sample of a range of applications shows memory overheads of 50%-90% were usual, i.e. the actual data needed for processing took only 10%-50% of the memory used by the application to hold the data.

Distinguishing the actual data memory requirements from the additional overhead used by collections, data types, delegations, etc, allows you to determine whether the overhead cost is worth the functionality added and whether there is significant scope for improvement if memory needs reducing.

JVM and hardware impose significant memory overheads for small objects (headers and alignments).

HashSet is not a good collection choice for small collections in terms of memory overhead, though it is often used exactly that way.

Default sizes for collections tend to add significant memory overheads: the actual number of elements held rarely matches the collection capacity. Empty collections are particularly bad for memory overheads. Concurrent collections also have high memory overheads.
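
Two inexpensive habits follow from this: size a collection to the known element count, and return a shared immutable empty collection instead of allocating a new empty one. This is a minimal sketch; the class RightSizedCollections and the method tagsFor are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example of avoiding default-capacity and empty-collection overhead.
final class RightSizedCollections {
    static List<String> tagsFor(String[] raw) {
        if (raw.length == 0) {
            // Immutable empty list; avoids allocating a fresh ArrayList per call.
            return Collections.emptyList();
        }
        // Capacity matches the known element count, so no spare slots are reserved.
        List<String> tags = new ArrayList<>(raw.length);
        for (String s : raw) {
            tags.add(s.trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tagsFor(new String[] {" a ", "b"}));
        System.out.println(tagsFor(new String[0]));
    }
}
```

Callers must not try to modify the returned empty list, since it is immutable.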

GNU Trove includes many space-efficient collections.

Caching unnecessarily has a high memory overhead. For example, caching the result of toString() is frequently encountered but is often unnecessary for performance; immutable data is also often duplicated.

Simple short-lived objects are mostly free, but some objects are designed to live longer and are inefficient to create repeatedly, e.g. SimpleDateFormat creation costs are designed to be amortized over many uses (it should be reused via a thread-local), as are those of formatters, converters, factory objects, schemas, connections, etc.
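
A per-thread SimpleDateFormat might be reused along these lines. This is a sketch only; the class ReusableFormatter, the field ISO_DAY and the chosen pattern are assumptions made for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical holder that amortizes SimpleDateFormat creation across many calls.
final class ReusableFormatter {
    // SimpleDateFormat is not thread-safe, so each thread gets its own lazily created instance.
    private static final ThreadLocal<SimpleDateFormat> ISO_DAY =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
                f.setTimeZone(TimeZone.getTimeZone("UTC"));
                return f;
            });

    static String formatDay(long epochMillis) {
        return ISO_DAY.get().format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(formatDay(System.currentTimeMillis()));
    }
}
```

Here the per-thread instance is intentionally long-lived; that is acceptable because it is small and bounded by the number of threads.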

You should be careful with the lifetime of objects. Three typical reasons for long-lived data are: In-memory design; Cached/pooled/thread-local objects (space used to reduce time costs); Long transaction/request support objects.

Caches and pools should always be bounded. Large caches are not necessarily better - they may just use more resources for no benefit, and may even cause problems by using too much memory.
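
One simple way to bound a cache with the standard library is an access-ordered LinkedHashMap that evicts its eldest entry. This is a minimal sketch; the class name BoundedLruCache is hypothetical, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache with a hard upper bound on the number of entries.
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: iteration order is least recently used first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put(); returning true evicts the least recently used entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedLruCache<String, Integer> cache = new BoundedLruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" becomes the most recently used entry
        cache.put("c", 3); // evicts "b"
        System.out.println(cache.keySet());
    }
}
```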

Soft references can be useful for simple caches/pools, but shouldn't be relied on as the sole ejection policy; also, they may not leave enough headroom for temporary objects, causing the GC to run more often.
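
A soft-reference cache for simple cases could be sketched as follows. The class SoftValueCache and its loader parameter are hypothetical; the sketch is not thread-safe, and the key map itself is unbounded, so a real cache would still need its own size limit.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache whose values may be cleared by the GC under memory pressure.
final class SoftValueCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();
    private final Function<K, V> loader;

    SoftValueCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            // Either never cached, or the GC cleared the reference: reload and re-cache.
            value = loader.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftValueCache<String, String> cache = new SoftValueCache<>(String::toUpperCase);
        System.out.println(cache.get("x"));
    }
}
```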

Weak references are ideal for objects tied to the lifetime of other objects, e.g. listeners, shared pools, annotations. Failure to unregister listeners is a common cause of leaks.
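
A listener registry that holds its listeners weakly might look like the sketch below. The class WeakListeners and its methods are hypothetical, and the sketch is not thread-safe. It assumes the code that registers a listener keeps its own strong reference to it; otherwise the listener can be collected almost immediately.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical registry: a forgotten unregister call no longer pins the listener in memory.
final class WeakListeners<E> {
    private final List<WeakReference<Consumer<E>>> listeners = new ArrayList<>();

    void add(Consumer<E> listener) {
        listeners.add(new WeakReference<>(listener));
    }

    /** Delivers the event to live listeners, prunes collected ones, and returns the number notified. */
    int fire(E event) {
        int notified = 0;
        for (Iterator<WeakReference<Consumer<E>>> it = listeners.iterator(); it.hasNext();) {
            Consumer<E> listener = it.next().get();
            if (listener == null) {
                it.remove(); // the listener was garbage collected
            } else {
                listener.accept(event);
                notified++;
            }
        }
        return notified;
    }

    public static void main(String[] args) {
        WeakListeners<String> registry = new WeakListeners<>();
        Consumer<String> printer = System.out::println; // strong reference held by the caller
        registry.add(printer);
        registry.fire("event");
    }
}
```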

Beware autoboxing where compact data types are automatically converted into fatter Objects. If memory is really tight, a plain old array of primitive types can be the best choice.
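
The contrast can be illustrated with the same data held both ways. The class BoxingCost and its method names are hypothetical; the sketch shows where boxing happens rather than measuring it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical comparison of boxed and primitive storage for the same values.
final class BoxingCost {
    static long sumBoxed(int n) {
        List<Integer> values = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            values.add(i); // autoboxing: each element is held as an Integer object reference
        }
        long sum = 0;
        for (Integer v : values) {
            sum += v; // auto-unboxing on every read
        }
        return sum;
    }

    static long sumPrimitive(int n) {
        int[] values = new int[n]; // one contiguous block, 4 bytes per element plus one array header
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1000) == sumPrimitive(1000));
    }
}
```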

Beneath 32GB of heap, the JVM will only use 4B per pointer instead of 8B. However, this means if you want a heap bigger than 32GB, you need to jump up a lot; Attila says 48GB.

Java optimizes for the common case of short-lived trash that can get quickly collected in the young generation - young generation allocation and garbage collection is really cheap: allocation is just a pointer shift and zeroing (if needed) the space between the last pointer and the new one; garbage collection only copies live objects to older space, then resets the pointer to the start of the young generation (no explicit deallocation needed).
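
The pointer-shift idea can be illustrated with a toy allocator over a byte array. This is only a model of the principle, not how any JVM is implemented; the class BumpAllocator and its methods are hypothetical.

```java
import java.util.Arrays;

// Toy model of bump-pointer allocation; offsets into a byte array stand in for addresses.
final class BumpAllocator {
    private final byte[] space;
    private int top; // offset of the next free byte

    BumpAllocator(int capacity) {
        space = new byte[capacity];
    }

    /** Returns the start offset of the new block, or -1 when full (where a real JVM would run a young GC). */
    int allocate(int size) {
        if (size < 0 || size > space.length - top) {
            return -1;
        }
        int start = top;
        top += size; // allocation is just a pointer shift...
        Arrays.fill(space, start, top, (byte) 0); // ...plus zeroing the newly claimed range
        return start;
    }

    /** Once live objects have been copied elsewhere, everything left is garbage: reset the pointer. */
    void reset() {
        top = 0;
    }

    int used() {
        return top;
    }

    public static void main(String[] args) {
        BumpAllocator eden = new BumpAllocator(16);
        System.out.println(eden.allocate(8));
        System.out.println(eden.allocate(8));
        System.out.println(eden.allocate(1));
    }
}
```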

Young generation GC time is proportional to the number of live young objects, which is usually small compared to the amount of trash.

The more memory you can give the young generation, the better, since allocation and deallocation is so cheap (though very big young generations could cause too large pauses).

You want a young generation big enough to hold active and tenuring objects; for long-lived objects to quickly tenure and reach the old generation; but you don't want survivors that could be collected in the young generation to get forced to the old generation early by memory pressure on the young generation.

Use -Xshare:on to enable class data sharing and -client for faster startup or -server for overall better speed. -XX:+TieredCompilation enables a "JIT compilation policy" similar to that used for -client for rapid startup time.

Start with the ParallelOld/Parallel GC (-XX:+UseParallelOldGC/-XX:+UseParallelGC) with -XX:+UseAdaptiveSizePolicy and -XX:+PrintAdaptiveSizePolicy, and then move to CMS (-XX:+UseConcMarkSweepGC, which uses -XX:+UseParNewGC automatically) or G1 (-XX:+UseG1GC with -XX:MaxGCPauseMillis=.. to set the target time) if latency requirements are not met. -XX:ParallelGCThreads can be used to specify the number of parallel garbage collection threads to use, and -XX:ParallelCMSThreads specifies the number of parallel CMS threads.

When -XX:+PrintReferenceGC output shows a high Reference reclamation time, enable -XX:+ParallelRefProcEnabled.

Specify the survival time of a soft reference after the last strong reference to the object has been collected using -XX:SoftRefLRUPolicyMSPerMB (milliseconds per free megabyte of heap); smaller values mean more aggressive collection.

-XX:+ScavengeBeforeFullGC should be on (it is by default, but some people disable it)

-XX:+DisableExplicitGC, -XX:+ExplicitGCInvokesConcurrent and -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses can be used when explicit garbage collection via System.gc() is used.

For fine tuning the young generation: -XX:+PrintTenuringDistribution is very useful for monitoring; -XX:SurvivorRatio sets the ratio of eden space size to the size of a single survivor space; -XX:TargetSurvivorRatio sets the target survivor space occupancy (as a percentage) after a minor garbage collection; and set -XX:MaxTenuringThreshold too high rather than too low to avoid a full GC.
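
As a rough worked example, the young generation consists of eden plus two survivor spaces, so with -XX:SurvivorRatio=R each survivor space is about young/(R+2). The sketch below assumes fixed (non-adaptive) sizing and ignores the alignment the JVM applies; the class YoungGenLayout and the 1000 MB figure are hypothetical.

```java
// Hypothetical calculator for approximate young generation layout.
final class YoungGenLayout {
    /** Approximate size of one survivor space: young = eden + 2 survivors, eden = ratio * survivor. */
    static long survivorSize(long youngSize, int survivorRatio) {
        return youngSize / (survivorRatio + 2);
    }

    /** Approximate eden size: whatever the two survivor spaces leave over. */
    static long edenSize(long youngSize, int survivorRatio) {
        return youngSize - 2 * survivorSize(youngSize, survivorRatio);
    }

    /** Survivor occupancy the collector aims for after a minor GC, given -XX:TargetSurvivorRatio as a percent. */
    static long targetSurvivorOccupancy(long youngSize, int survivorRatio, int targetSurvivorRatioPercent) {
        return survivorSize(youngSize, survivorRatio) * targetSurvivorRatioPercent / 100;
    }

    public static void main(String[] args) {
        long youngMb = 1000; // assumed young generation size in MB
        System.out.println("survivor: " + survivorSize(youngMb, 8) + " MB");
        System.out.println("eden:     " + edenSize(youngMb, 8) + " MB");
        System.out.println("target:   " + targetSurvivorOccupancy(youngMb, 8, 50) + " MB");
    }
}
```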

-XX:+PrintGCApplicationStoppedTime and -XX:+PrintGCApplicationConcurrentTime are useful for tracking down latency induced into the application as a result of JVM safepoint operations.

Avoid using Object-Relational Mappers (ORM); ORM SQL queries are often complex queries that the database cannot optimize well nor do they allow easy tweaking of queries, slowing down the tuning process.

Locks are like stop signs; non-locking solutions are usually faster and more scalable. Row-level locking is better than table-level locking. Use asynchronous replication and "eventual consistency" for clusters.

A single database is a bottleneck. Use parallel databases and let a driver select between them.