Note that this page is very large. The tips on this page are categorized
in other pages. Use
the tips index page to access smaller focused listings of tips.

This page lists many other pages available on the web, together with a condensed
list of tuning tips that each page includes. For the most part I've eliminated
any tips that are wrong, but one or two may have slipped past me. Remember that
the tuning tips listed are not necessarily good coding practice. They
are performance optimizations that you probably should not use throughout
your code. Instead they apply to speeding up critical sections of code
where performance has already been identified as a problem.

The tips here include only those that are available online for free. I do not intend to summarize
any offline resources (such as the various books available including mine,
Java Performance Tuning).
The tips here are of very variable quality and usefulness, some real gems but
some dross and quite a bit of repetition. Comments in square brackets, [], have
been added by me.

Use this page by using your browser's "find" or "search" option to identify
particular tips you are interested in on the page, and follow up by reading
the referenced web page if clarification is necessary.

This page is currently 411KB. This page is updated once a month. You can receive
email notification of any changes by subscribing to the
newsletter

ArrayList is faster than Vector except when there is no lock acquisition required in HotSpot JVMs (when they have about the same performance).

Vector and ArrayList implementations have excellent performance for indexed access and update of elements, since there is no overhead beyond range checking.

Adding elements to, or deleting elements from the end of a Vector or ArrayList also gives excellent performance except when the capacity is exhausted and the internal array has to be expanded.

Inserting and deleting elements to Vectors and ArrayLists always require an array copy (two copies when the internal array must be grown first). The number of elements to be copied is proportional to [size-index], i.e. to the distance between the insertion/deletion index and the last index in the collection. The array copying overhead grows significantly as the size of the collection increases, because the number of elements that need to be copied with each insertion increases.

For insertions to Vectors and ArrayLists, inserting to the front of the collection (index 0) gives the worst performance, inserting at the end of the collection (after the last element) gives the best performance.
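
The following minimal timing sketch (class name, list sizes and element type are illustrative) shows the cost difference between appending to the end of an ArrayList and inserting at index 0:

    import java.util.ArrayList;
    import java.util.List;

    public class ListInsertTiming {
        public static void main(String[] args) {
            int n = 100000;

            List<Integer> atEnd = new ArrayList<Integer>();
            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                atEnd.add(i);        // append: no shifting of existing elements
            }
            long appendTime = System.currentTimeMillis() - start;

            List<Integer> atFront = new ArrayList<Integer>();
            start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                atFront.add(0, i);   // insert at index 0: copies every existing element each time
            }
            long frontTime = System.currentTimeMillis() - start;

            System.out.println("append: " + appendTime + "ms, insert at front: " + frontTime + "ms");
        }
    }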

LinkedLists have a performance overhead for indexed access and update of elements, since access to any index requires you to traverse multiple nodes.

LinkedList insertion/deletion overhead is dependent on how far away the insertion/deletion index is from the closer end of the collection.

Synchronized wrappers (obtained from Collections.synchronizedList(List)) add a level of indirection which can have a high performance cost.

Only List and Map have efficient thread-safe implementations: the Vector and Hashtable classes respectively.

List insertion speed is critically dependent on the size of the collection and the position where the element is to be inserted.

For small collections ArrayList and LinkedList are close in performance, though ArrayList is generally the faster of the two. Precise speed comparisons depend on the JVM and the index where the object is being added.

WeakHashMap can be used to reduce memory leaks. Keys that are no longer strongly referenced from the application will automatically make the corresponding value reclaimable.

To use WeakHashMap as a cache, the keys that evaluate as equal must be recreatable.

Using WeakHashMap as a cache gives you less control over when cache elements are removed compared with other cache types.
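
A minimal sketch of a WeakHashMap-based cache; the loader method and the key are hypothetical, and the key is deliberately created with new String so that it is not pinned by the class's constant pool:

    import java.util.Map;
    import java.util.WeakHashMap;

    public class WeakCacheDemo {
        // hypothetical loader standing in for an expensive computation
        static byte[] loadExpensiveData(String key) { return new byte[1024]; }

        public static void main(String[] args) {
            Map<String, byte[]> cache = new WeakHashMap<String, byte[]>();

            String key = new String("report");     // not a pinned literal, so it can be reclaimed
            cache.put(key, loadExpensiveData(key));

            // An equal key, recreated from scratch, still finds the cached value
            System.out.println(cache.containsKey(new String("report")));  // true

            key = null;    // drop the only strong reference to the key
            System.gc();   // after collection the entry becomes eligible for removal
            System.out.println(cache.size());      // typically 0, though not guaranteed
        }
    }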

Clearing elements of a WeakHashMap is a two stage process: first the key is reclaimed, then the corresponding value is released from the WeakHashMap.

String literals and other objects like Class which are held directly by the JVM are not useful as keys to a WeakHashMap, as they are not necessarily reclaimable when the application no longer references them.

The WeakHashMap values are not released until the WeakHashMap is altered in some way. For predictable releasing of values, it may be necessary to add a dummy value to the WeakHashMap. If you do not call any mutator methods after populating the WeakHashMap, the values and internal WeakReference objects will never be dereferenced [no longer true from 1.4, where most methods now allow values to be released].

WeakHashMap wraps an internal HashMap adding an extra level of indirection which can be a significant performance overhead. [no longer true from 1.4].

Every call to get() creates a new WeakReference object. [no longer true from 1.4].

WeakHashMap.size() iterates through the keys, making it an operation that takes time proportional to the size of the WeakHashMap. [no longer true from 1.4].

WeakHashMap.isEmpty() iterates through the collection looking for a non-null key, so a WeakHashMap which is empty requires more time for isEmpty() to return than a similar WeakHashMap which is not empty. [no longer true from 1.4, where isEmpty() is now slower than previous versions].

Start tuning by examining the application architecture for potential bottlenecks.

Architecture bottlenecks are often easy to spot: they are the connecting lines on the diagrams; the single threaded components; the components with many connecting lines attached; etc.

Ensure that application performance is measurable for the given performance targets.

Ensure that there is a test environment which represents the running system. This test-bed should support testing the application at different loads, including a low load and a fully scaled load representing maximum expected usage.

After targeting design and architecture, the biggest bang for your buck in terms of improving performance is choosing a better VM, and then choosing a better compiler.

Start code tuning with proof of concept bottleneck removal: this consists of using profilers to identify bottlenecks, then making simplified changes which may only improve the performance at the bottleneck for a specialized set of activities, and proceeding to the next bottleneck. After tuning competence is gained, move to full tuning.

Each multi-user performance test can typically take a full day to run and analyse. Even simple multi-user performance tuning can take several weeks.

After the easily identified bottlenecks have been removed, the remaining performance improvements often come mainly from targeting loops, structures and algorithms.

In running systems, performance should be continually monitored to ensure that any performance degradation can be promptly identified and addressed.

Performance is dependent on data as well as code. Different data can make identical code perform very differently.

Always start tuning with a baseline measurement.

The System.currentTimeMillis() method is the most basic measuring tool for tuning.

You may need to repeatedly call a method in order to reliably measure its average execution time.

Minimize the possibility that CPU time will be allocated to anything other than the test while it is running by ensuring no other processes are running during the test, and that the test remains in the foreground.

Baseline measurements normally show some useful information, e.g. the average execution time for one call to a method.

Multiplying the average time taken to execute a method or sequence of methods, by the number of times that sequence will be called in a time period, gives you an estimate of the fraction of the total time that the sequence takes.
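
A minimal timing harness along these lines; the method under test is a hypothetical stand-in:

    public class MicroTimer {
        // hypothetical method standing in for the code being measured
        static void methodUnderTest() {
            Integer.parseInt("12345");
        }

        public static void main(String[] args) {
            int iterations = 100000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                methodUnderTest();
            }
            long elapsed = System.currentTimeMillis() - start;
            double averagePerCall = (double) elapsed / iterations;
            System.out.println("average per call: " + averagePerCall + " ms");
            // multiply by the expected calls per time period to estimate the
            // fraction of total time this method will account for
        }
    }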

There are three routes to tuning a method: Consider unexpected differences in different test runs; Analyze the algorithm; Profile the method.

Creating an exception is a costly procedure, because of filling in stack trace.

A profiler should ideally be able to take a snapshot of performance between two arbitrary points.

Tuning is an iterative process: you normally find one bottleneck, make changes that improve performance, test those changes, and then start again.

Algorithm changes usually provide the best speedup, but can be difficult to find.

Examining the code for the causes of the differences in speed between two variations of test runs can be useful, but is restricted to those tests for which you can devise alternatives that show significant timing variations.

Profiling is always an option and almost always provides something that can be speeded up. But the law of diminishing returns kicks in after a while, leaving you with bottlenecks that are not worth speeding up, because the potential speedup is too small for the effort required.

Generic integer parsing (as with the Integer constructors and methods) may be overkill for converting simple integer formats.

Simple static methods are probably best left to be inlined by the JIT compiler rather than by hand.

String.equals() is expensive if you are only testing for an empty string. It is quicker to test if the length of the string is 0.
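
For example, a sketch of the two alternatives:

    public class EmptyCheck {
        // Slower: a full String comparison against a separate object
        static boolean isEmptySlow(String s) {
            return s.equals("");
        }

        // Faster: a simple read of the length field
        static boolean isEmptyFast(String s) {
            return s.length() == 0;
        }
    }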

Set a target speedup to reach. With no target, tuning can carry on for much longer than is needed.

A generic tuning procedure is: Identify the bottleneck; Set a performance target; Use representative data; Measure the baseline; Analyze the method; Test the change; Repeat.

Double.toString(double) is slow. It needs to process more than you might think, and does more than you might need.

Proprietary conversion algorithms can be significantly faster. One such algorithm is presented in the article.

Converting integers to strings can also be faster than the SDK. An algorithm that successively strips off the highest digits is used in the article.

Formatting numbers using java.text.DecimalFormat is always slower than Double.toString(double), because it first calls Double.toString(double) then parses and converts the result.

Formatting using a proprietary conversion algorithm can be faster than any of the methods discussed so far, if the number of digits being printed is not large. The actual time taken depends on the number of digits being printed.

[Article discusses how to create JDBC wrappers to measure the performance of database calls].

If more than a few rows of a query are being read, then the ResultSet.next() method can spend a significant amount of time fetching rows from the database, and this time should be included in measurements of database access.

JDBC wrappers are simple and robust, and require very little alteration to the application using them (i.e., they are low maintenance), so they are suitable to be retained within a deployed application.

A java.util.List object which implements RandomAccess should be faster when using List.get() than when using Iterator.next().

Use instanceof RandomAccess to test whether to use List.get() or Iterator.next() to traverse a List object.

[Article describes how to guard the test to support all versions of Java].
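
A sketch of the traversal choice; the version guard described in the article is not shown, and RandomAccess exists only from SDK 1.4:

    import java.util.Iterator;
    import java.util.List;
    import java.util.RandomAccess;

    public class ListTraversal {
        static long sum(List<Integer> list) {
            long total = 0;
            if (list instanceof RandomAccess) {
                // indexed access is cheap, e.g. ArrayList
                for (int i = 0, n = list.size(); i < n; i++) {
                    total += list.get(i);
                }
            } else {
                // sequential access, e.g. LinkedList
                for (Iterator<Integer> it = list.iterator(); it.hasNext();) {
                    total += it.next();
                }
            }
            return total;
        }
    }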

http://www.cs.berkeley.edu/~mdw/proj/java-nbio/
Whoopee!! A non-blocking I/O library for Java. This is the single most important functionality missing from the SDK for scalable server applications. The important class is SelectSet which allows you to multiplex all your i/o streams. If you want a scalable server and can use this class then DO SO. NOTE THAT SDK 1.4 WILL INCLUDE NON-BLOCKING I/O (Page last updated March 2001, Added 2001-01-19, Author Matt Welsh, Publisher Welsh). Tips:

[The system select(2)/poll(2) functions allow you to take any collection of i/o streams and ask the operating system to check whether any of them can execute read/write/accept without blocking. The system call will block if requested until any one of the i/o streams is ready to execute. Before Java, no self-respecting server would sit on multiple threads in blocked i/o mode, wasting thread resources: instead select/poll would have been used.]

http://www.cs.cmu.edu/~jch/java/optimization.html
For years, Jonathan Hardwick's old but classic site was the only coherent Java performance tuning site on the web. He built it while doing his PhD. It wasn't updated beyond March 1998, when he moved to Microsoft, but most tips are still useful and valid. The URL is for the top page, there are another eight pages. Thanks Jonathan. (Page last updated March 1998, Added 2000-10-23, Author Jonathan Hardwick, Publisher Hardwick). Tips:

Don't optimize as you go. Write your program concentrating on clean, correct, and understandable code.

Use profiling to find out where that 80% of execution time is going, so you know where to concentrate your effort.

Hardware traffic managers redirect user requests to a farm of servers based on server availability, IP address, or port number. All traffic is routed to the load balancer, then requests are fanned out to servers based on the balancing algorithm.

Popular load-balancing algorithms include: server availability (find a server with available processing capability); IP address management (route to the nearest server by IP address); port number (locate different types of servers on different machines, and route by port number); HTTP header checking (route by URI or cookie, etc).

Web servers should be sized to handle the peak hit rate, not the average rate.

You can model hit rates using a Gaussian distribution to determine the average hit rate per time unit (e.g. per second) at peak usage, then a Poisson probability gives the probability of a given number of users simultaneously hitting the server within that time unit. [Article gives an example with a Gaussian fitted to peak traffic of 4000 users with a standard deviation of 20 minutes resulting in an average of 1.33 users per second at the peak, which in turn gives the probabilities that 0, 1, 2, 3, 4, 5, 6 users hit the server within one second as 26%, 35%, 23%, 10%, 3%, 1%, 0.2%. Service time was 53 milliseconds, which means that the server can service 19 hits per second without requests needing to be queued.]

System throughput is the arrival rate divided by the service rate. If the ratio becomes greater than one, requests exceed the system capability and will be lost or need to be queued.

If requests are queued because capacity is exceeded, the throughput must drop sufficiently to handle the queued requests or the system will fail (the service rate must increase or arrival rate decrease). If the average throughput exceeds 1, then the system will fail.

Sort incoming requests into different priority queues, and service the requests according to the priorities assigned to each queue. [Article gives the example where combining user and automatic requests in one queue can result in a worst case user wait of 3.5 minutes, as opposed to less than 0.1 seconds if priority queues are used].

[Note that Java application servers often do not show a constant service time. Instead the service time often increases with higher concurrency due to non-linear effects of garbage collection].

AWT components are not useful as game actors (sprites) as they do not overlap well, nor are they good at being moved around the screen.

Celled image files efficiently store an animated image by dividing an image into a rectangular grid of cells, and allocating a different animation image to each cell. A sequence of similar images (as you would have for an animation) will be stored and transferred efficiently in most image formats.

Examining pixels using PixelGrabber is slow.

drawImage() can throw away and re-load images in response to memory requirements, which can make things slow.

Pre-load and pre-scale images before using them to get a smoother and faster display.

The more actors (sprites), the more time it takes to draw and the slower the game appears.

Use double-buffering to move actors (sprites), by redrawing the actor and background for the relevant area.

Redraw speed depends on: how quickly each object is drawn; how many objects are drawn; how much of each object is drawn; the total number of drawing operations. You need to reduce some or all of these until you get to about 30 redraws per second.

Don't draw actors or images that cannot be seen.

If an actor is not moving then incorporate the actor as part of the background.

Only redraw the area that has changed, e.g. the old area where an actor was, and the new area where it is. Redrawing several small areas is frequently faster than drawing one large area. For the redraws, eliminate overlapping areas and merge adjacent (close) areas so that the number of redraws is kept to a minimum.

Put slow and fast drawing requirements in separate threads.

Bounding-box detection can use circles for the bounding box, which only requires a simple radius check.

Load sounds in a background thread.

Make sure you have a throttle control that can make the game run slower (or pause) when necessary.

The optimal network topology for network games depends on the number of users.

If the cumulative downloading of your applet exceeds the player's patience, you've lost a customer.

The user interface should always be responsive. A non-responsive window means you will lose your players. Give feedback on necessary delays. Provide distractions when unavoidable delays will be lengthy [more than a few seconds].

Transmission time varies, and is always slow compared to operations on the local hardware. You may need to decide the outcome of the action locally, then broadcast the result of the action. This may require some synchronization resolution.

Latency between networked players can easily lead to de-synchronized action and player frustration. Displays should locally simulate remote action as continuing current activities/motions, until the display is updated. On update, the actual current situation should be smoothly resolved with the simulated current situation.

Sending activity updates more frequently ensures smoother play and better synchronization between networked players, but requires more CPU effort and so affects the local display. In order to avoid adversely affecting local displays, send activity updates from a low priority thread.

Discard any out-of-date updates: always use the latest dated update.

A minimum broadcast delay of one-third the average network connection travel time is appropriate. Once you exceed this limit, the additional traffic can cause more grief than benefit.

Put class files into a (compressed) container for network downloading.

Avoid repeatedly evaluating invariant expressions in a loop.

Take advantage of inlining where possible (using final, private and static keywords, and compiling with javac -O)

Profile the code to determine the expensive methods (e.g. using the -prof option)

Use a disassembler (e.g. like javap) to determine which of various alternative coding formulations produces smaller bytecode.

To reduce the number of class files and their sizes: use the SDK classes as much as possible; and implement common functionality in one place only.

Raycasting is faster than raytracing. Raycasting maps 2D data into a 3D world, drawing entire vertical lines using one ray. Use precalculated values for trigonometric and other functions, based on the angle increments chosen for your raycasting.

In the absence of a JIT, the polygon drawing routines from the AWT are relatively efficient (compared to array manipulation) and may be faster than texture mapping.

Without texture mapping, walls can be drawn faster with one call to fillPolygon (rather than line by line).

An exponential jump search algorithm can be used to reduce ray casts - by quickly finding boundaries where walls end (like a binary search, but doubling increments until you overshoot, then halving increments from the last valid wall position).

It is usually possible to increase performance at the expense of image quality and accuracy. Techniques include reducing pixel depth or display resolution, field interlacing, aliasing. The key, however, is to degrade the image in a way that is likely to be undetectable or unnoticeable to the user. For example a moving player often pays less attention to image quality than a resting or static player.

Use information gathered during the rendering of one frame to approximate the geometry of the next frame, speeding up its rendering.

If the geometry and content is not too complicated, binary space partition trees map the view according to what the player can see, and can be faster than ray casting.

Calling a remote method that returns multiple values contained in a temporary object (such as a Point), rather than making multiple consecutive method calls to retrieve them individually, is likely to be more efficient. (Note that this is exactly the opposite of the advice offered for good performance of local objects.)

Set the initial StringBuffer size to the maximum string length, if it is known.

StringTokenizer is very inefficient, and can be optimized by storing the string and delimiter in a character array instead of in String, or by storing the highest delimiter character to allow a quicker check.

Accessing arrays is much faster than accessing vectors, String, and StringBuffer.

Use System.arraycopy() to improve performance.

Vector is convenient to use, but inefficient. Ensure that elementAt() is not used inside a loop.

FastVector is faster than Vector by making the elementData field public, thus avoiding (synchronized) calls to elementAt().

Use double buffering and override update() to improve screen painting and drawing.

Use custom LayoutManagers.

Repaint only the damaged regions (use ClipRect).

To improve image handling: use MediaTracker; use your own imageUpdate() method; pre-decode and store the image in an array - image decoding time is greater than loading time. Pre-decoding using PixelGrabber and MemoryImageSource should combine multiple images into one file for maximum speed.

Increase the initial heap size from the 1-MByte default with -ms and -mx [-Xms and -Xmx].

Use -verbosegc.

Take size into account when allocating arrays (for instance, if short is big enough, use it instead of int).

Avoid allocating objects in loops (readLine() is a common example).

Minimize synchronization.

Polling is only acceptable when waiting for outside events and should be performed in a "side" thread. Use wait/notify instead.

Move loop invariants outside the loop.

Make tests as simple as possible.

Perform the loop backwards (this actually performs slightly faster than forward loops do). [Actually it is converting the test to compare against 0 that makes the difference].

Use only local variables inside a loop; assign class fields to local variables before the loop.

Move constant conditionals outside loops.

Combine similar loops.

Nest the busiest loop, if loops are interchangeable.

Unroll the loop, as a last resort.
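
A sketch combining several of these loop tips; the field names and array size are illustrative:

    public class LoopTuning {
        private double[] values = new double[100000];
        private double factor = 1.5;

        // Plain version: the invariant expression is recomputed and the
        // class field re-read on every iteration.
        double sumPlain() {
            double total = 0;
            for (int i = 0; i < values.length; i++) {
                total += values[i] * (factor * Math.PI);
            }
            return total;
        }

        // Tuned version: invariant hoisted, field assigned to a local
        // variable before the loop, and the loop counts down towards 0.
        double sumTuned() {
            double total = 0;
            double scale = factor * Math.PI;   // loop-invariant moved outside the loop
            double[] local = values;           // class field cached in a local variable
            for (int i = local.length - 1; i >= 0; i--) {
                total += local[i] * scale;
            }
            return total;
        }
    }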

Convert expressions to table lookups.

Use caching.

Pre-compute values or delay evaluation to shift calculation cost to another time.

Initially you should identify the probable performance and scalability based on the requirements. You should be asking about: numbers of users/components; component interactions; throughput and transaction rates; performance requirements.

Factor in batch requirements and performance characteristics of dependent (sub)systems. Note that additional layers, like security, add overheads to performance.

Logging and stateful EJB can degrade performance.

After the initial identification phase, the target should be for a model architecture that can be load-tested to feedback information.

Scalability hotspots are more likely to exist in the tiers that are shared across multiple client sessions.

Performance measurements should be from presentation start to presentation completion, i.e. user clicks button (start) and information is displayed (completion).

Use load-test suites and frameworks to perform repeatable load testing.

Note that two database calls are made for each row in a ResultSet: one to describe the column, the second to tell the db where to put the data. PreparedStatements make the description calls at construction time, Statements make them on every execution.

Avoid retrieving unnecessary columns: don't use "SELECT *".

If you are not using stored procedures or triggers, turn off autocommit. All transaction levels operate faster with autocommit turned off, and doing this means you must code commits. Coding commits while leaving autocommit on will result in extra commits being done for every db operation.
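
A sketch combining the PreparedStatement and autocommit tips above; the connection URL, table and column names are illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class ExplicitCommitExample {
        public static void main(String[] args) throws SQLException {
            Connection con = DriverManager.getConnection("jdbc:somedb://host/mydb");
            try {
                con.setAutoCommit(false);     // one commit per transaction, not per statement
                PreparedStatement ps =
                        con.prepareStatement("UPDATE account SET balance = ? WHERE id = ?");
                ps.setDouble(1, 99.95);
                ps.setInt(2, 42);
                ps.executeUpdate();           // description calls were made at construction time
                con.commit();                 // explicit commit required with autocommit off
                ps.close();
            } catch (SQLException e) {
                con.rollback();
                throw e;
            } finally {
                con.close();
            }
        }
    }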

Use java.awt.GraphicsDevice.isFullScreenSupported() to determine if full-screen exclusive mode is available. If it is not available, full-screen drawing can still be used, but better performance will be obtained by using a fixed size window in normal screen mode. Full-screen exclusive applications should not be resizable.

Turn off decoration using the setUndecorated() method.

Change the screen display mode (size, depth and refresh rate), to the best match for your image bit depth and display size so that scaling and other image alterations can be avoided or minimized.

Don't define the screen painting code in the paint() method called by the AWT thread. Define your own rendering loop for screen drawing, to be executed in any thread other than the AWT thread.

Use the setIgnoreRepaint() method on your application window and components to turn off all paint events dispatched from the operating system completely, since these may be called during inappropriate times, or worse, end up calling paint, which can lead to race conditions between the AWT event thread and your rendering loop.

Do not rely on the update or repaint methods for delivering paint events.

Do not use heavyweight components, since these will still incur the overhead of involving the AWT and the platform's windowing system.

Use double buffering (drawing to an off-screen buffer, then copying the finished drawing to the screen).

Use page-flipping (changing the video pointer so that an off-screen buffer becomes the on-screen buffer, with no image copying required).

Use a flip chain (a sequence of off-screen buffers which the video pointer successively points to one after the other).

java.awt.image.BufferStrategy provides getDrawGraphics() (to get an off-screen buffer) and show() (to display the buffer on screen).
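
A minimal render loop built on these classes; the window size and the drawing done per frame are illustrative:

    import java.awt.Canvas;
    import java.awt.Color;
    import java.awt.Frame;
    import java.awt.Graphics;
    import java.awt.image.BufferStrategy;

    public class RenderLoop {
        public static void main(String[] args) {
            Frame frame = new Frame();
            frame.setIgnoreRepaint(true);        // we drive all painting ourselves
            Canvas canvas = new Canvas();
            canvas.setIgnoreRepaint(true);
            frame.add(canvas);
            frame.setSize(640, 480);
            frame.setVisible(true);

            canvas.createBufferStrategy(2);      // double buffering, or page flipping if available
            BufferStrategy strategy = canvas.getBufferStrategy();

            while (true) {                       // rendering loop, not run on the AWT thread
                Graphics g = strategy.getDrawGraphics();   // draw to the off-screen buffer
                g.setColor(Color.black);
                g.fillRect(0, 0, 640, 480);
                // ... draw the current frame here ...
                g.dispose();
                strategy.show();                 // copy or flip the buffer onto the screen
            }
        }
    }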

Use java.awt.BufferCapabilities to customize the BufferStrategy for optimizing the performance of your application.

If you use a buffer strategy for double-buffering in a Swing application, you probably want to turn off double-buffering for your Swing components.

Multi-buffering is only useful when the drawing time exceeds the time spent to do a show.

Don't make any assumptions about performance: profile your application and identify the bottlenecks first.

http://www.devresource.hp.com/JavaATC/JavaPerfTune/index.html
HP Java tuning site, including optimizing Java and optimizing HPUX for Java. This is the top page, but several useful pages lie off it (tips extracted for inclusion below). Includes a nice "procedure" list for tuning apps, and some useful forms for what you should record while tuning. (Page last updated 2000, Added 2000-10-23, Author ?, Publisher HP). Tips:

Have a performance target.

Consider architecture and components for bottlenecks.

Third-party components may have options that cause bottlenecks.

Having debugging turned on can cause performance problems.

Having logging turned on can cause performance problems.

Is the underlying machine powerful enough?

Carefully document any tests and changes.

Create a performance baseline.

Make one change at a time.

Be careful not to lose a winning tune because it's hidden by a bad tune made at the same time.

Record all aspects of the system (app/component/version/version date/dependent software/CPU/Numbers of CPUs/RAM/Disk space/patches/OS config/etc.)

Give the JVMs top system priority.

Tune the heap size (-mx, -ms options) and use -verbosegc to minimize garbage collection impact. A larger heap reduces the frequency of garbage collection but increases the length of time that any particular garbage collection takes.

Rules of thumb are: 50% of free space available after a gc; set the maximum heap size to be 3-4 times the space required for the estimated maximum number of live objects; set the initial heap size to a little below the space required for the average data set, and the maximum value large enough to handle the largest data set; increase -Xmn for applications that create many short-lived objects [is -Xmn a standard option?]. [These rules of thumb should only be considered as starting points. Ultimately you need to tune the VM heap empirically, i.e. by trial and error].

You may need to add flags to third party products running in the JVM to eliminate explicit calls to garbage collect (VisiBroker has this known problem).

Watch out for bottlenecks introduced from third party products. Make sure you know and use the options available, many of which can affect performance (for better or worse). Document the changes you make so that you will be able to reproduce the performance.

Computationally intensive applications should increase the number of CPUs to increase overall system performance and throughput.

Be certain that the application's CPU usage is a factor limiting performance: often, highly contended locks and garbage collections that are too frequent will make the system look busy, but little work is done by the application.

ServletRequest.getRemoteHost() is very inefficient, and can take seconds to complete the reverse DNS lookup it performs.

OutputStream can be faster than PrintWriter. JSPs are only generally slower than servlets when returning binary data, since JSPs always use a PrintWriter, whereas servlets can take advantage of a faster OutputStream.

Excessive use of custom tags may create unnecessary processing overhead.

Using multiple levels of BodyTags combined with iteration will likely slow down the processing of the page significantly.

Use optimistic transactions: write to the database while checking that the data has not been overwritten in the meantime, by using WHERE clauses containing the old data. However note that optimistic transactions can lead to worse performance if many transactions fail.

Use lazy-loading of dependent objects.

For read-only queries involving large amounts of data, avoid EJB objects and use JavaBeans as an intermediary to access, manipulate, and store the data for JSP access.

"The mythology surrounding the slowness of garbage-collected systems is just that, myth. I can show that the number of instructions executed is the same whether I call malloc() and free() or I only call malloc() and some other code calls free()."

Simple designs can easily run through many unnecessary objects, e.g. data wrapper objects like Integer.

Reuse objects where possible.

Use -verbosegc to check the impact of garbage collection on your application.

Defining a utility class which is applied to data required by its constructor means that you must create a new object for every piece of data to run it on. Instead, do not require data in the constructor.

Do not force methods to provide arguments with input in the form that is convenient rather than efficient. For example, don't require that arguments be passed only as String objects if a byte array or char array would also be functionally equivalent (try to support all formats, especially the efficient ones).

Defining a method signature in terms of an interchange type (the type of object passed from a caller method to the callee method as an argument) reduces the interface's complexity while maintaining its flexibility, but sometimes this simplicity comes at the cost of performance.

Sun recommends you no longer use object pools [this is rather a sweeping and inappropriate statement. Object pools are still useful even with HotSpot, but presumably not as often as previously].

Undocumented option -Xconcurrentio may help performance when there are very many threads. It uses a lighter thread synchronization model.

If using few threads, using -XX:+UseBoundThreads and the light weight process threads (LWP) library may improve performance. LWP threads are scheduled by the JVM, system threads have kernel scheduling.

Monitor the application. Primary statistics worth analyzing are: the number of concurrent users; number of transactions per unit of time; duration of the longest and shortest transactions; and the average response time.

Specify the performance targets.

Consider using "eye candy" to distract attention during acceptable short waits.

Identify which application tier contains the bottleneck and fix that. It might be hardware or software; low-level or architecture.

Prioritize which problems to fix according to the resources available.

Objects have a space overhead in addition to the space taken by the data held by the object. The overhead is dependent on the particular JVM, but there is always some. The space overhead is a per object value, so the percentage of overhead decreases with larger objects. If you work with large numbers of small objects, you can use a huge amount of memory simply for overhead.

Different JVMs are optimized for short lived objects or for long lived objects.

If you're working with a large number of primitive data types, you can avoid the excessive object overhead of wrappers by storing and passing values of the underlying primitive types, and only converting the values into the full objects when necessary for use with methods in the class libraries.

Avoid convenience classes like Point if you can manage the underlying data directly.

Reuse objects where possible.

Use object pools where this is helpful in reusing objects, but be careful that the pool implementation does actually give a performance improvement (dedicated pools within the class can be significantly faster than abstract pool implementations).

Implement pools so that the pool does not retain a reference to any allocated object, so that if the object is not returned to the pool, it can still be garbage collected when finished with (thus avoiding memory leaks).
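
A sketch of such a pool: take() removes the object from the pool's internal list, so an object that is never returned is simply garbage collected rather than leaked. The class name and pooled type are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    public class SimplePool<T> {
        private final List<T> free = new ArrayList<T>();

        // Removes and returns a pooled object, or null if the pool is empty.
        // No reference to the returned object is retained by the pool.
        public synchronized T take() {
            int last = free.size() - 1;
            return (last < 0) ? null : free.remove(last);
        }

        // Returns an object to the pool for later reuse.
        public synchronized void give(T obj) {
            free.add(obj);
        }
    }

Typical use: take() an object, fall back to creating a new one if null is returned, and give() it back when finished; if the caller forgets the last step, the object is reclaimed normally by the garbage collector.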

A website must be easy to navigate and have a quick display and response time.

Bad navigation metrics include: abandoned shopping carts; first time visitors look at one or two pages and disappear; dead ends require the "back" button; less than 5% buy something; any broken links.

Good navigation metrics include: three pages or less from website entry to desired information; no streaming video or Flash introductions; multiple ways to reach the required information; up-to-date search engines; basic company and contact info one click away from the homepage.

Root causes of performance problems come equally from four main areas: databases, Web servers, application servers and the network, with each area typically causing about a quarter of the problems.

The most common database problems are insufficient indexing, fragmented databases, out-of-date statistics and faulty application design. Solutions include tuning the index, compacting the database, updating the database and rewriting the application so that the database server controls the query process.

The most common network problems are undersized, misconfigured or incompatible routers, switches, firewalls and load balancers, and inadequate bandwidth somewhere along the communication route.

[Brilliantly amusing search to make the smallest "Hello World" program.]

Use the -g:none option to strip debugging bytes from classfiles.

Most bytes in Java class files are from the constant pool, then the method declarations. The constant pool includes class and method names as well as strings.

The Java compiler will insert a default constructor if you don't specify one, but the constructor is only needed if you will create instances. You can remove the constructor if you will not be creating instances.

Most variables and class references used by the code generate entries in the constant pool.

Use smart proxies to prevent returning multiple copies of the same remote object to client code.

http://www-4.ibm.com/software/webservers/appserv/ws_bestpractices.pdf
Paper detailing the "Best Practices for Developing High Performance Web and Enterprise Applications" using IBM's WebSphere. All the tips are generally applicable to servlet/EJB development, as well as other types of server development. (Page last updated September 2000, Added 2001-01-19, Author Harvey W. Gunther, Publisher IBM). Tips:

Do not store large object graphs in javax.servlet.http.HttpSession. Servlets may need to serialize and deserialize HttpSession objects for persistent sessions, and making them large produces a large serialization overhead.

Use the tag "<%@ page session="false"%>" to avoid creating HttpSessions in JSPs.

Use the HttpServlet init() method to perform expensive operations that need only be done once.
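
For example, a sketch of a servlet that does its one-off expensive work in init(); the resource field and loader helper are hypothetical:

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;

    public class ReportServlet extends HttpServlet {
        private Object expensiveResource;            // e.g. a parsed template set or primed cache

        public void init() throws ServletException {
            expensiveResource = loadConfiguration(); // done once by the container, not per request
        }

        private Object loadConfiguration() {         // hypothetical expensive one-off operation
            return new Object();
        }
    }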

Minimize use of System.out.println.

Avoid String concatenation "+=".

Access entity beans from session beans, not from client or servlet code.

Reuse EJB homes.

Use Read-Only methods where appropriate in entity-beans to avoid unnecessary invocations to store.

Use the lowest impact transaction level possible for each transaction.

The EJB "remote programming" model always assumes EJB calls are remote, even where this is not so. Where calls are actually local to the same JVM, try to use calling mechanisms that avoid the remote call.

Remove stateful session beans (and any other unneeded objects) when finished with, to avoid extra overheads should the container need to passivate them.

Beans.instantiate() incurs a filesystem check to create new bean instances. Use "new" to avoid this overhead.

A size restricted queue (closed queue) allows system resources to be more tightly managed than an open queue.

The network provides a front-end queue. A server should be configured to use the network queue as its bottleneck, i.e. only accept a request from the network when there are sufficient resources to process the request. This reduces the load on an app server. However, sufficient requests should be accepted to ensure that the app server is working at maximum capacity, i.e. try not to let a component sit idle while there are still requests that can be accepted even if other components are fully worked.

The desirable target bottleneck is the CPU, i.e. a server should be tuned until the CPU is the remaining bottleneck. Adding CPUs is a simple remedy to this.

Use connection pools and cached prepared statements for database access.

Object memory management is particularly important for server applications. Typically garbage collection could take between 5% and 20% of the server execution time. Garbage collection statistics provide a useful monitor to determine the server's "health". Use the verbosegc flag to collect basic GC statistics.

GC statistics to monitor are: total time spent in GC (target less than 15% of execution time); average time per GC; average memory collected per GC; average objects collected per GC.

For long lived server processes it is particularly important to eliminate memory leaks (references retained to objects and never released).

Use -ms and -mx to tune the JVM heap. Bigger means more space but GC takes longer. Use the GC statistics to determine the optimal setting, i.e. the setting which provides the minimum average overhead from GC.

The ability to reload classes is typically achieved by testing a filesystem timestamp. This check should be done at set intermediate periods, and not on every request as the filesystem check is an expensive operation.

[The Red book lists and discusses tuning parameters available to Websphere]

Run an application server and any database servers on separate server machines.

JVM heap size: -mx, -ms [-Xmx, -Xms]. As a starting point for a server based on a single JVM, consider setting the maximum heap size to 1/4 the total physical memory on the server and setting the minimum to 1/2 of the maximum heap. Sun recommends that ms be set to somewhere between 1/10 and 1/4 of the mx setting. They do not recommend setting ms and mx to be the same. Bigger is not always better for heap size. In general increasing the size of the Java heap improves throughput to the point where the heap no longer resides in physical memory. Once the heap begins swapping to disk, Java performance drastically suffers. Therefore, the mx heap setting should be set small enough to contain the heap within physical memory. Also, large heaps can take several seconds to fill up, so garbage collection occurs less frequently which means that pause times due to GC will increase. Use verbosegc to help determine the optimum size that minimizes overall GC.

In some cases turning off asynchronous garbage collection ("-noasyncgc", not always available to all JVMs) can improve performance.

Synchronization means mutual exclusion (if the same monitor is used), atomicity of the synchronized block (again with respect to other threads using the same monitor) and synchronization of thread memory with main memory.

Because synchronization synchronizes thread memory with main memory, there is a cost to synchronization beyond simply acquiring a lock.

Too little synchronization can lead to corrupt data; too much can lead to reduced performance and deadlock.

The costs of synchronization vary with JVMs, with more recent JVMs being more efficient.

The cost of synchronization differs depending on whether threads are actually contending for locks (more expensive, slower), or the synchronization is uncontended and the thread is basically acting in single-threaded mode (cheaper, faster).

You need to synchronize or make volatile variables holding data that will be shared between threads.

Composite operations may need synchronizing to make them atomic even if each individual operation is already synchronized.
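
A sketch of a composite operation that needs its own synchronization even though the underlying map is a synchronized wrapper; the class and key names are illustrative:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class Counters {
        private final Map<String, Integer> counts =
                Collections.synchronizedMap(new HashMap<String, Integer>());

        // Broken: get() and put() are each synchronized, but the check-then-update
        // sequence is not atomic, so two threads can interleave and lose an increment.
        public void incrementBroken(String key) {
            Integer current = counts.get(key);
            counts.put(key, current == null ? 1 : current + 1);
        }

        // Correct: holding the wrapper's lock around the whole sequence makes it atomic.
        public void increment(String key) {
            synchronized (counts) {
                Integer current = counts.get(key);
                counts.put(key, current == null ? 1 : current + 1);
            }
        }
    }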

Response time is affected by: contention and wait times, particularly for shared resources; and software and hardware component performance, i.e. the amount of time that resources are needed.

A well-designed application can increase performance by simply adding more resources (for instance, an extra server).

Use clustered or multi-processing machines; use a JIT-enabled JVM; use Java 2 rather than JDK 1.1;

Use -noclassgc. Use the maximum possible heap size that also is small enough to avoid the JVM from swapping (e.g. 80% of RAM left over after other required processes). Consider starting with minimum initial heap size so that the garbage collector doesn't suddenly encounter a full heap with lots of garbage. Benchmarkers sometimes like to set the heap as high as possible to completely avoid GC for the duration of the benchmark.

Distributing the application over several server JVMs means that GC impact will be spread in time, i.e. the various JVMs will most likely GC at different times from each other.

On Java 1.1 the most effective heap size is that which limits the longest GC incurred pause to the longest acceptable pause in processing time. This will typically require a reduction in the maximum heap size.

Too many threads causes too much context switching. Too few threads may underutilize the system. If n=number of threads, k=number of CPUs, then: (n < k) results in an under utilized CPU; (n == k) is theoretically ideal, but each CPU will probably be under utilized; (n > k) by a "moderate amount of threads" is practically ideal; (n > k) by "many threads" can lead to significant performance degradation from context switching. Blocked threads count for less in the previous formulae.

Symptoms of too few threads: CPU is waiting to do work, but there is work that could be done; Can not get 100% CPU; All threads are blocked [on i/o] and runnable when you do an execution snapshot.

Symptoms of too many threads: An execution snapshot shows that there is a lot of context switching going on in your JVM; Your performance increases as you decrease the number of threads.

If many client connections are dropped or refused, the TCP listen queue may be too short.

Try to avoid excessive cycling (creation/deletion or activation/passivation) of beans.

Use connection pools to the database and reuse connections rather than repeatedly opening and closing connections. Optimal pool size is when the connection pool is just large enough to service requests without waits.

Cache frequently requested data in the JVM and avoid the unnecessary database requests.

Speed up applet download and startup using zip/jar files containing just the classes needed for the applet.

Avoid accessing the database wherever possible.

Fetch rows in batches rather than one at a time, using the batch as a read-ahead mechanism (i.e. pre-fetch rows in batches). Tune the batch size and the number of rows pre-fetched. Avoid pre-fetching BLOBs.

Avoid moving data unless absolutely necessary. Process the data and produce results as close to its source as possible. Use stored procedures.

Streamline data before the result crosses the network.

Use stored procedures to avoid extra network transfers.

Use built-in DBMS set-based processing to operate on multiple rows/tables in one request.

Avoid row at a time processing, process multiple rows together wherever possible.

Counting entries in a table (e.g. using SELECT count(*) from myTable, yourTable where ... ) is resource intensive. Try first selecting into temporary tables, returning only the count, and then sending a refined second query to return only a subset of the rows in the temporary table.

Proper use of SQL can reduce resource requirements. Use queries which return the minimum of data needed: avoid SELECT * queries. A complex query that returns a small subset of data is more efficient than a simple query that returns more data than is needed.

Make your queries as smart as possible, i.e. as precise as possible to minimize the data transferred to just that subset that is required.

Try to batch updates: collect statements together and execute them together in one transaction. Use conditional logic and temporary variables if necessary to achieve statement batching.
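
A sketch of statement batching with JDBC; the table and column names are illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class BatchInsert {
        static void insertOrders(Connection con, int[] orderIds) throws SQLException {
            con.setAutoCommit(false);
            PreparedStatement ps =
                    con.prepareStatement("INSERT INTO orders (id, status) VALUES (?, ?)");
            for (int i = 0; i < orderIds.length; i++) {
                ps.setInt(1, orderIds[i]);
                ps.setString(2, "NEW");
                ps.addBatch();         // collect the statement instead of executing it immediately
            }
            ps.executeBatch();         // all the inserts execute together
            con.commit();              // and commit as one transaction
            ps.close();
        }
    }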

Never let a DBMS transaction span user input.

Consider using optimistic locking. Optimistic locking employs timestamps to verify that data has not been changed by another user, otherwise the transaction fails.

Use in-place updates, i.e. change data in rows/tables that already exist rather than adding or deleting rows/tables. Try to avoid moving rows or changing their sizes.

Keep your operational data set as small as possible, to avoid having to read through data that is irrelevant.

DBMSs work well with parallelism. Try to design the application to do other things while interacting with the DBMS.

Use pipelining and parallelism. Designing applications to support lots of parallel processes working on easily distinguished subsets of the work makes the application faster. If there are multiple steps to processing, try to design your application so that subsequent steps can start working on the portion of data that any prior process has finished, instead of having to wait until the prior process is complete.

Choose the right driver for your application, i.e. the fastest JDBC driver.

Minimize the data retrieved from the database, both columns and rows. Use setMaxRows, setMaxFieldSize, and setFetchSize.
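
A sketch of the row- and column-limiting calls; the query, table and limit values are illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class LimitedQuery {
        static void listRecentCustomers(Connection con) throws SQLException {
            PreparedStatement ps = con.prepareStatement(
                    "SELECT id, name FROM customer ORDER BY created DESC");
            ps.setMaxRows(100);        // never retrieve more than 100 rows
            ps.setFetchSize(25);       // hint: fetch rows from the server 25 at a time
            ps.setMaxFieldSize(256);   // cap the data returned for character and binary columns
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getInt("id") + " " + rs.getString("name"));
            }
            rs.close();
            ps.close();
        }
    }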

Use the most efficiently handled data type: character strings are faster than integers, which are in turn more efficient than floating-point and timestamps.

Use programmatic updates: updateXXX() calls on updatable resultsets. The resultset is already positioned at a row, eliminating the usual overhead of finding the row to be updated when using an UPDATE statement.

Cache any required metadata and use metadata methods as rarely as possible as they are quite slow.

Avoid using null parameters in metadata queries.

Use a dummy query to get the metadata for a column, rather than using the getColumns() method.

Use parameter markers with stored procedures, rather than embedding data literally in the statement, to minimize parsing overheads.

Use prepared statements for repeatedly executing SQL statements

Choose the optimal cursor: forward-only for sequential reads; insensitive for two-way scrolling. Avoid insensitive cursors for queries that only return one row.

Before SDK 1.4, servers had a number of performance problems: i/o could easily be blocked; garbage was easily generated when reading i/o; many threads are needed to scale the server.

Many threads each blocked on i/o is an inefficient architecture in comparison to one thread blocked on many i/o calls (multiplexed i/o).

Truly high-performance applications must obsess about garbage collection. The more garbage generated, the lower the application throughput.

A Buffer (java.nio.*Buffer) is a reusable portion of memory. A MappedByteBuffer can map a portion of a file directly into memory.

Direct Buffer objects can be read/written directly from Channels, but nondirect Buffer objects have a data copy performed for read/writes to i/o (and so are slower and may generate garbage). Convert nondirect Buffers to direct Buffers if they will be used more than once.

Scatter/gather operations allow i/o to operate to and from several Buffers in one operation, for increased efficiency. Where possible, scatter/gather operations are passed to even more efficient operating system functions.

Channels can be configured to operate blocking or non-blocking i/o.

Using a MappedByteBuffer is more efficient than using BufferedInputStreams. The operating system can page into memory more efficiently than BufferedInputStream can do a block read.
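
A sketch of reading a file through a MappedByteBuffer; the file name is illustrative:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedRead {
        public static void main(String[] args) throws Exception {
            RandomAccessFile file = new RandomAccessFile("data.bin", "r");
            FileChannel channel = file.getChannel();
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            long checksum = 0;
            while (buffer.hasRemaining()) {
                checksum += buffer.get();   // reads come straight from the mapped pages
            }
            System.out.println("checksum=" + checksum);
            channel.close();
            file.close();
        }
    }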

Use Selectors to multiplex i/o and avoid having to block multiple threads waiting on i/o.

Type 1 drivers are JDBC-ODBC bridges, plus an ODBC driver. Recommended only for prototyping, not for production. Not suitable for high-transaction environments. Not well supported, and limited in functionality.

Type 2 drivers use a native API, and are part-Java drivers. Have a binary-code client loading overhead, and may not be fully-featured.

Type 3 drivers are pure Java drivers which connect to database middleware. They can be server-based, which is frequently faster than types 1 and 2.

Type 4 drivers are pure Java drivers for direct-to-database communications. This can minimize overheads, and generally provides the fastest driver.

JDBC 3.0 has additional features to improve performance such as advancements in connection pooling, statement pooling, and RowSet objects.

Opening a connection is the most resource-expensive step in database transactions. Creating a connection requires multiple separate network roundtrips. However, once the connection object has been created, there is little penalty in leaving the connection object in place and reusing it for future connections.

Connection pooling keeps open a cache of database connection objects, making them available for immediate use. Instead of performing expensive network roundtrips to the database server to open a connection, a connection attempt results in the re-assignment of a connection from the local cache.

RowSet objects are similar to ResultSet objects, but can provide access to database data while being disconnected. This allows data to be efficiently cached in its simplest form.

Prepared statement pooling (available from JDBC 3.0) caches SQL queries that have been previously optimized and run so that, should they be needed again, they do not have to go through optimization pre-processing again (avoiding optimization steps, such as checking syntax, validating addresses, and optimizing access paths and execution plans). Statement pooling can be a significant performance booster.

Statement pooling and connection pooling in JDBC 3.0 can cooperate to share statement pools, so that a connection can use a cached statement prepared by another connection, incurring statement preparation overheads only once, on the first execution of some SQL by any connection.

Database drivers developed by vendors other than the database vendor can be better performing and more fully featured. (Driver vendors concentrate on the driver; database vendors have many other things to consider).

Try to use a driver that supports JDBC 3.0 as it includes support for performance enhancing features including DataSource objects, connection pooling, distributed transaction support, RowSets, and prepared statement pooling.

Type 3 and Type 4 drivers are the drivers to use when performance is important.

Optimizing code is one of the last things that programmers should be thinking about, not one of the first.

Don't optimize code that already runs fast enough.

Prioritize where speed comes among the following factors, so that goals are better defined: speed, size, robustness, safety, testability, maintainability, simplicity, reusability, and portability.

The most important factors in looking for code to optimize are fixed overhead and performance on large inputs: fixed overhead dominates speed for small inputs and the algorithm dominates for large inputs (a program that works well for both small and large inputs will likely work well for medium-sized inputs).

Operations that take a particular amount of time, such as the way that memory and buffers are handled, often show substantial time variations between platforms.

Users are sensitive to particular delays: users will likely be happier with a screen that draws itself immediately and then takes eight seconds to load data than with a screen that draws itself after taking five seconds to load data.

Give users immediate feedback: you do not always need to make your code run faster to optimize it in the eyes of your users.

Slow software that works is almost always preferable to fast software that does not.

If you don't require powerful search capabilities, using flat files may be faster than dealing with a database.

Basic file operations (deletion, creation, renaming) are atomic. Other operations and combinations of operations are not atomic. Atomicity can be built but comes at a performance cost. You will have to determine whether the increase in robustness is worth the slowdown in your application.

Do the I/O in a background thread to mitigate the performance impact of adding atomicity to file transactions.

[Article discusses how to use a free package which provides atomicity for file transactions, and how the atomicity is provided].

Local interfaces in EJB 2.0 are one attempt to improve overall performance: local interfaces allow beans in the same container to interact locally without involving RMI.

The most effective way to improve the overall performance of EJB-based applications is to minimize the amount of method invocations, making the communications overhead negligible compared with the execution time. This can be achieved by implementing coarse-grained methods.

Entity beans should not be simply mapped to database tables. Treating entity beans as such fine-grained objects which are effectively wrappers on table rows leads to increased network communications and heavier database communications than if entity beans are treated as coarse-grained components.

For optimal performance, entity beans should be designed to: have large granularity, which usually means they should contain multiple Java classes and support multiple database tables; be associated with a certain amount of persistent data, typically multiple database tables, one of which should define the primary key for the whole bean; support meaningful business methods and encapsulate business rules to access the data.

Don't use client transactions in the EJB environment, since long-running transactions can cause database lockup.

Entity beans are transactional resources due to their stateful nature, but application server vendors often rely on the underlying database to lock and resolve access appropriately. Although this approach greatly improves performance, it provides the potential for database lockup.

[Article discusses several design patterns: Intercepting Filter, Front Controller, View Helper, Composite View, Service To Worker, Dispatch View. Performance is not explicitly covered, but at least a couple are relevant to getting good performance].

If optimization and performance tools are used throughout development rather than tacked on at the end as a final "optimization phase," time to market and costs can actually be decreased by speeding up the process of locating problems and bottlenecks in code.

Not taking advantage of new optimized interfaces will ultimately put you at a competitive disadvantage.

The large number of extra features and the increased cross-platform compatibility added to the Java Graphics framework in SDK 1.2 made graphics slower than in 1.1. SDK 1.4 targeted these performance issues head on.

VolatileImage allows you to create hardware-accelerated offscreen images, resulting in better performance of Swing and gaming applications in particular and faster offscreen rendering.

When filling a shape with a complex paint, Java 2D must query the Paint object every time it needs to assign a color to a pixel whereas a simple color fill only requires iterating through the pixels and assigning the same color to all of them.

The graphics pipeline (from SDK 1.4) only gets invalidated when an attribute is changed to a different type of value, rather than when an attribute is changed to a different value of the same type. For example, rendering one opaque color is the same type of operation as rendering another opaque color, so changing between them would not invalidate the pipeline. But changing an opaque color to a transparent color would invalidate the pipeline.

Smaller fonts are rendered faster than larger fonts.

Hardware-accelerated scaling is currently (1.4.0 release) disabled on Win32 because of quality problems, but you can enable it with a runtime flag, -Dsun.java2d.ddscale=true.

From SDK 1.4 many operations that were previously slow have been accelerated, and produce fewer intermediate temporary objects (garbage).

Alpha blending and anti-aliasing adversely affect performance.

Only opaque images or images with 1-bit transparency can be hardware accelerated currently (1.4.0).

Use 1-bit transparency to make the background color of a sprite rectangle transparent so that the character rendered in the sprite appears to move through the landscape of your game, rather than within the sprite box.

Create images with the same depth and type as the screen to avoid pixel format conversions. Use either Component.createImage() or GraphicsConfiguration.createCompatibleImage(), or use a BufferedImage created with the ColorModel of the screen.
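
As a minimal sketch (the class and method names are illustrative, not from the article), an offscreen image matching the screen's pixel format can be obtained from the component's GraphicsConfiguration:

    import java.awt.Component;
    import java.awt.GraphicsConfiguration;
    import java.awt.image.BufferedImage;

    public class CompatibleImages {
        // Assumes comp is already displayable, so it has a GraphicsConfiguration.
        public static BufferedImage createScreenCompatible(Component comp, int width, int height) {
            GraphicsConfiguration gc = comp.getGraphicsConfiguration();
            // The returned image shares the screen's depth and pixel layout,
            // so copying it to the screen needs no per-pixel format conversion.
            return gc.createCompatibleImage(width, height);
        }
    }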

Rectangular fills (including horizontal and vertical lines) tend to perform better than arbitrary or non-rectangular shapes, whether they are rendered in software or with hardware acceleration.

If your application must repeatedly render non-rectangular shapes, draw the shapes into 1-bit transparency images and copy the images as needed.

If you experience low frame rates, try commenting out pieces of your code to find the particular operations that are causing problems, and replace these problem operations with something that might perform better.

You can trace graphics performance using the flag -Dsun.java2d.trace=<optionname>,<optionname>,... where the options are: log (print primitives as they execute); timestamp (timestamp log entries); count (print total calls of each primitive used); out:<filename> (send logs to filename); verbose (verbose output); help (print usage help).

The point when garbage collection kicks in is out of the control of the application. This can cause a sequential overhead on the application, as the garbage collector suspends all application threads when it runs, causing inconsistent and unacceptable application pauses, leading to high latency and decreased application efficiency.

The verbosegc JVM option provides detailed logs of garbage collector activity.

The live "transient memory footprint" of an application is the (Garbage generated per call) * (duration of the call) * (number of calls per second).

GC pause time caused by two-space collection of short-lived objects is directly proportional to the size of the memory space allocated to holding short-lived objects. But smaller available space can mean more frequent GCs.

Higher frequency GC of short-lived objects can inadvertently promote short-lived objects to "old" space where longer lived objects reside [because if the object survives several GCs in the short-lived object area, the GC decides it is long-lived]. The promoteAll option will force the GC to assume that any object surviving GC of young space is long-lived, and it is immediately promoted to old space.

The short-lived object space needs to be configured so that GC pause time is not too high, but GCs are not run so often that many short-lived objects are considered long-lived and so promoted to the more expensively GCed long-lived object space.

The long-lived object space needs to be large enough to avoid an out-of-memory error, but not so high that a full GC of old space pauses the JVM for too long.

[Article covers 1.2 and 1.3 GC memory space models].

A significant GC value to focus on is the GC sequential overhead, which is the percentage of system time during which GC is running and application threads are suspended: (sum of sequential GC pause times) * 100 / (total application run time).

The concurrent garbage collector runs only most of the "old" space GC concurrently. Some of the "old" space GC and all the "young" space GC is sequential.

GC activity can take hours to settle down to its final pattern. Fragmentation of old space can cause GC times to degrade, and it may take a long time for the old space to become sufficiently fragmented to show this behavior.

GC options can reduce fragmentation (such as bestFitFirst).

The promoteAll option produced a significant improvement in performance [which I find curious].

Prototype to determine the performance of your device. Wireless transmissions require testing to determine if the transfer rates and processing times are acceptable.

Attempt to create applications that can accomplish 80% or more of their operations through the touch of a single key/button or the "tap" or touch of the stylus to the screen.

Trying to manipulate a very small scroll bar on a small screen can be an exercise in hand-eye coordination. Horizontal scrolling should be avoided at all costs. Use "jump-to" buttons rather than scrollbars.

Try to avoid having the user remember any data, or worse, having to compare data across screens.

Performance will always be a concern in J2ME.

Avoid garbage generation: Use StringBuffer for mutable strings; Pool reusable instances of objects like DateFormat; Use System.gc() to jump-start or push the garbage collection process.

Compile the code with debugging information turned off using the -g:none switch. This increases performance and reduces its footprint.

Avoid deep hierarchies in your class structure.

Consider third-party JVMs; many are faster than Sun's.

Small XML parsers and micro databases are available for purchase where necessary.

Avoid inner classes: make the main class implement the required Listener interfaces and handle the callbacks there.

Use built-in classes if functionality is close enough, and work around their limitations.

Collapse inheritance hierarchies, even if this means duplicating code.

Shorten all names (packages, classes, methods, data variables). Some obfuscators can do this automatically. MIDP applications are completely self-contained, so you can use the default package with no possible name-clash.

Convert array initialization from code that assigns each element to code that extracts the data from a binary string or data file. Array initialization in code generates many bytecodes, as each element is initialized separately.

Different versions of the Sun JVM support different optimization flags. Some flags may allow you to configure the garbage collector generational spaces.

Configure heap space using -Xms and -Xmx [-ms and -mx for 1.1.x JVMs] to optimize the JVM heap memory for improved performance.

If the JVM supports configuring the garbage collector generational spaces (-Xgenconfig in 1.2.2; -XX:newSize, -XX:MaxNewSize, -XX:SurvivorRatio in 1.3), then you can improve performance by specifying generation spaces more appropriate for your application [you can start with some appropriate configuration depending on the ratios of short-lived to medium-lived to long-lived objects, then test multiple configurations to determine the optimal config].

The 1.3 JVM appears to be faster when run with the -server flag.

The -Xoptimize flag seems to improve performance on those 1.2.x JVMs that support it.

Be methodical to ensure that changes for performance do actually improve performance.

Eliminate memory leaks before tuning execution speed.

Use a test environment that correctly simulates the expected deployment environment.

Simulate the expected client activity, and compare the performance against your expected goals.

Consider which metrics to measure, such as: Max response time under heavy load; CPU utilization under heavy load; How the application scales as additional users are added.

Profile the application to find the bottlenecks. Correct bottlenecks by making one change at a time and testing for improvement.

Generate stack traces to look for bottlenecks which are multi-thread conflicts (waiting for locks).

Improving the performance of a method that is called 1000 times and takes a tenth of a second per call (100 seconds in total) is better than improving the performance of a method that is called only 10 times but takes 1 second per call (10 seconds in total).

Don't cache data unless you know how and when to invalidate the cached entries.

Declare method arguments final if they are not modified in the method. In general declare all variables final if they are not modified after being initialized or set to some value.

Declare methods private and/or final whenever that makes sense. This can help the compiler inline methods. [final methods are of dubious value]

Buffer I/O. Use BufferedReaders.

DON'T create static Strings via new String().

Use String.intern() to reduce the number of strings in your runtime. [but this is an expensive operation]

Use char[] arrays for all character processing in loops, rather than using the String or StringBuffer classes.

StringBuffer default size is 16 chars. Set this to the maximum expected string length.

StringTokenizer is inefficient. It can be optimized by storing the string and delimiter in a character array instead of in a String, or by storing the highest delimiter character to allow a quicker check.

Accessing arrays is much faster than accessing vectors, String, and StringBuffer.

Use System.arraycopy() to improve performance.

Initialize expensive arrays in class static initializers, and create a per-instance copy of this array initialized with System.arraycopy().
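
A minimal sketch of this idiom (the table contents and class name are illustrative): the expensive array is built once in a static initializer, and each instance copies it with System.arraycopy() instead of recomputing it:

    public class LookupTable {
        // Built once, when the class is loaded.
        private static final int[] MASTER = new int[256];
        static {
            for (int i = 0; i < MASTER.length; i++) {
                MASTER[i] = i * i;   // stands in for an expensive computation
            }
        }

        // Per-instance copy, filled by a fast array copy rather than recomputed.
        private final int[] table = new int[MASTER.length];

        public LookupTable() {
            System.arraycopy(MASTER, 0, table, 0, MASTER.length);
        }
    }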

Vector is convenient to use, but inefficient. For best performance, use it only when the structure size is unknown, and efficiency is not a concern.

When using Vector, ensure that elementAt() is not used inside a loop.

Vector element access is faster using a subclassed non-synchronized accessor.

Re-use Vectors by using Vector.removeAllElements().

Initialize Vector to the maximum expected size.

Re-use Hashtables by using Hashtable.clear().

Set the Hashtable size to be large enough to hold the expected elements. Use a prime number for table size.
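
For example (a sketch; the expected size is illustrative), if roughly 700 elements are expected, a prime capacity comfortably above that avoids any rehashing at the default 0.75 load factor:

    import java.util.Hashtable;

    // 1009 is prime, and 1009 * 0.75 > 700, so the table should never need to rehash.
    Hashtable table = new Hashtable(1009);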

Use CMP except in specific cases when BMP is necessary: fields use stored procedures; persistence is not simple JDBC (e.g. JDO); One bean maps to multiple tables; non-standard SQL is used.

CMP can make many optimizations: optimal locking; optimistic transactions; efficient lazy loading; efficiently combining multiple queries to the same table (i.e. multiple beans of the same type can be handled together); optimized multi-row deletion to handle deletion of beans and their dependents.

Stateless session beans are much more efficient than stateful session beans.

Stateless session beans have no state. Most containers have pools of stateless beans. Each stateless bean instance can serve multiple clients, so the bean pool can be kept small and doesn't need to change in size, avoiding the main pooling overheads.

A separate stateful bean instance must exist for every client, making bean pools larger and more variable in size.

[Article discusses how to move a stateful bean implementation to a stateless bean implementation].

The 'new' operator is not object oriented, and prevents proper polymorphic object creation.

Constructors must be made non-public and preferably private to limit the number of objects of a class.

The Singleton pattern and the Flyweight (object factory) pattern are useful to limit numbers of objects of various types and to assist with object reuse and reduce garbage collection.

The real-time specification for Java allows 'new' to allocate objects in a 'current memory region', which may be other than the heap. Each such region is a type of MemoryArea, which can manage allocation.

Using variables to provide access to limited numbers of objects is efficient, but a maintenance problem if you need to change the object access pattern, for example from a global singleton to a ThreadLocal Singleton.

A non-static factory method is polymorphic and so provides many advantages over static factory methods.

The Abstract Factory design pattern uses a single class to create more than one kind of object.

An alternative to the Flyweight pattern is the Prototype pattern, which allows polymorphic copies of existing objects. The Object.clone() method signature provides support for the Prototype pattern.

Prototypes are useful when object initialization is expensive, and you anticipate few variations on the initialization parameters. Then you could keep already-initialized objects in a table, and clone an existing object instead of expensively creating a new one from scratch.

Immutable objects can be returned directly when using Prototyping, avoiding the copying overhead.

Current Web-application architectures consist of many small servers that are accessed through a load balancer, providing a front-end to a powerful database server. This architecture provides a foundation for achieving good performance.

Load testing of web applications should include: State machine testing (entries in a shopping basket should still be there when checked out); Really long session testing (session started then continued several hours later); Hordes of savage users testing (users do lots of nonsensical activity); Privileged testing (only some users should be able to access some functionality); Speed testing (do tasks complete within the required times?). Each type of test should be run with several different user loads.

Test suites should be automated and easily changed.

[Article discusses Load, an open-source set of tools with XML scripting language]

Use the EXPLAIN PLAN facility to explain how the database's optimizer plans to execute your SQL statements, to identify performance improvements such as additional indexes.

If more than one SQL statement is executed by your program, you can gain a small performance increase by turning off auto-commit.

It takes about 65 iterations of a prepared statement before its total execution time catches up with that of an ordinary statement, because of prepared statement initialization overheads.

Use PreparedStatements to batch statements for optimal performance.

The Thin driver is faster than the OCI driver. This is contrary to Oracle's recommendation.

A SELECT statement makes two round trips to the database, the first for metadata, the second for data. Use OracleStatement.defineColumnType() to predefine the SELECT statement, thus providing the JDBC driver with the column metadata which then doesn't require the first database trip.

Given a simple SQL statement and a stored procedure call that accomplishes the same task, the simple SQL statement will always execute faster because the stored procedure executes the same SQL statement but also has the overhead of the procedure call itself. On the other hand complex tasks requiring several SQL statements can be faster using stored procedures as fewer network trips and data transfers will be needed.

Thoughtful page design makes for a better user experience by enabling the application to seem faster than it really is.

Use the flush method associated with the out object to display static text and graphics on the browser page before the database query returns, to prevent the user from having to look at a blank page for a long time.

ResultSet types affect updates. TYPE_FORWARD_ONLY: no updating allowed; TYPE_SCROLL_SENSITIVE: update immediately; TYPE_SCROLL_INSENSITIVE: update when the connection is closed. (Concurrency type must be set to CONCUR_UPDATABLE to allow the table to be updated.)

Performance can be better if changes to the database are batched: turn off autocommit; add multiple SQL statements using the Statement.addBatch() method; execute Statement.executeBatch().
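
A minimal sketch of that recipe (the table and column names are illustrative), assuming an open java.sql.Connection:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class BatchUpdateExample {
        public static void logBoth(Connection conn) throws SQLException {
            conn.setAutoCommit(false);              // one transaction for the whole batch
            Statement stmt = conn.createStatement();
            stmt.addBatch("INSERT INTO audit_log (msg) VALUES ('started')");
            stmt.addBatch("INSERT INTO audit_log (msg) VALUES ('finished')");
            stmt.executeBatch();                    // many drivers send the batch in one round trip
            conn.commit();
            stmt.close();
        }
    }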

Scaled systems need optimized SQL calls, querying the right amount of data, and displaying pages before the query is complete.

Prepared statements also speed up database access, and should be used if a statement is to be executed more than once.
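
A minimal sketch (the query and table are illustrative): prepare the statement once and execute it repeatedly with different parameter values, letting the database reuse its query plan:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class PreparedStatementExample {
        public static void printNames(Connection conn, long[] ids) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT name FROM customer WHERE id = ?");
            for (int i = 0; i < ids.length; i++) {
                ps.setLong(1, ids[i]);              // only the parameter changes
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
                rs.close();
            }
            ps.close();
        }
    }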

The higher the level of transaction protection, the higher the performance penalty. Transaction levels in order of increasing level are: TRANSACTION_NONE, TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, TRANSACTION_REPEATABLE_READ, TRANSACTION_SERIALIZABLE. Use Connection.setTransactionIsolation() to set the desired transaction level.

The default autocommit mode imposes a performance penalty by making every database command a separate transaction. Turn off autocommit (Connection.setAutoCommit(false)), and explicitly specify transactions.

Batch operations by combining them in one transaction, and in one statement using Statement.addBatch() and Statement.executeBatch().

Savepoints (from JDBC3.0) require expensive resources. Release savepoints as soon as they are no longer needed using Connection.releaseSavepoint().

Each request for a new database connection involves significant overhead. This can impact performance if obtaining new connections occurs frequently. Reuse connections from connection pools to limit the cost of creating connections. [The tutorial lists all the overheads involved in creating a database connection].

The ConnectionPoolDataSource (from JDBC3.0) and PooledConnection interfaces provide built-in support for connection pools.

Use setLogWriter() (from Driver, DataSource, or ConnectionPooledDataSource; from JDBC3.0) to help trace JDBC flow.

For read-only access to a set of data that does not change rapidly, use the Fast Lane Reader pattern which bypasses the EJBs and uses a (possibly non-transactional) data access object which encapsulates access to the data. Use the Fast Lane Reader to read data from the server and display it all in one shot.

When you need to access a large remote list of objects, use the Page-by-Page Iterator pattern which sends smaller subsets of the data as requested until the client no longer wants any more data. Use the Page-by-Page Iterator to send lists of simple objects from EJBs to clients.

When the client would request many small data items which would require many remote calls to satisfy, combine the multiple calls into one call which results in a single Value Object which holds all the data required to be transferred. Use the Value Object to send a single coarse-grained object from the server to the client(s).

Use double buffering: draw into an offscreen buffer, then copy into the display buffer. Copying buffers is very fast on most devices, while directly drawing to a display sometimes causes users to see a flicker, as individual parts of the display are updated. Double buffering avoids flickering by combining multiple individual drawing operations into a single copy operation.

Use the Canvas.isDoubleBuffered() method to determine if double buffering is already automatically used: on some implementations the Graphics object passed to the Canvas's paint method already draws to an offscreen buffer managed by the system. (The system then takes care of copying the offscreen buffer to the display.)

Use javax.microedition.lcdui.Image class to create an offscreen memory buffer, and use Graphics to draw to the offscreen buffer and to copy the contents of the offscreen buffer onto the display. The offscreen buffer is created by calling one of the Image.createImage methods.

Double buffering does have some overhead: if only making small changes to the display, it might be slower to use double buffering.

On some systems image copying isn't very fast and flicker can happen even with double buffering.

Keep the number of offscreen buffers to a minimum. There is a memory penalty to pay for double buffering: the offscreen memory buffer can consume a large amount of memory.

Free the offscreen buffer whenever the canvas is hidden (use the canvas' hideNotify() and showNotify() methods.)

A Vector may be convenient and generalized, but it's almost always overkill, and you pay the price for it in speed and other ways. --Greg Guerin on the MRJ-dev mailing list

A lot of speed (or memory) can go down the drain if the underlying structure is a poor fit to the problem, or is inefficient for a particular program's common actions. --Greg Guerin on the MRJ-dev mailing list

It is perfectly legal for available() to always return 0, even when there are a zillion bytes available, and in fact the default implementation in InputStream.available() does just that. --Thomas Maslen on the mrj-dev mailing list

Seeing the wrong solution to a problem (and understanding why it is wrong) is often as informative as seeing the correct solution. --W. Richard Stevens

You need to run your full QA cycle on _all_ platforms you plan on supporting your app on ... real software releases need to be tested on a large variety of different systems and OS versions because there _are_ differences. Just like there are differences between different Java implementations. --Jens Alfke on the mrj-dev mailing list

I often find with Java that if you run the same program twice, the second run is significantly faster, presumably because the JVM is remembering something. --Michael Kay on the xsl-list mailing list

Java isn't inherently slow, it just encourages a "create and forget" [objects] type of programming which is. --Oren Ben-Kiki on the XSL mailing list

Java does not expose many of the I/O capabilities that are synonymous with high performance. Examples include memory mapped files and asynchronous I/O. Heck, it doesn't even expose non-blocking I/O. --Gabe Beged-Dov on the xml-dev mailing list

I/O performance issues, usually overshadow all other performance issues making them the first area to concentrate on when tuning performance. Unfortunately, optimal reading and writing can be challenging in Java. --Daniel Lord and Achut Reddy, http://www.sun.com/workshop/java/wp-javaio/

Streamlining the use of I/O often results in greater performance gains than all other possible optimizations combined. --Daniel Lord and Achut Reddy http://www.sun.com/workshop/java/wp-javaio/

Modern super-scalar processors with deep memory hierarchies and complex compiler optimization stages make it *extremely* difficult to predict which code or data structure variant is more efficient. Old rules of thumb and "common sense" are not of much use any more for distinguishing more and less performant algorithms of comparable complexity on a late 1990s processor. Surprises are frequent. Design decisions on performance grounds should today only be made after real measurements and much of what you learned 10 years ago about manual optimization is obsolete these days. --Markus Kuhn on the Unicode mailing list

Most Java VM implementations search the interface list back to front so that most often used interface should be the last interface in the 'implements' list. --Don Park on the xml-dev mailing list

http://www.theparticle.com/javadata2.html
Particle's pretty good coverage of the main Java data structures. Only a few tuning tips: reuse, pools, optimized sorting. But knowing which structure to use for a particular problem is an important performance tuning technique. (Page last updated April 2000, Added 2000-12-20, Author J. Particle, Publisher Particle). Tips:

Make linked lists faster by having dummy first and last nodes.

Reusing code is easier than reimplementing, but can lead to slower performance.

Use node pools to reduce memory impact.

Sorting elements on insertion means they don't need to be sorted later.

[Article includes several(non-optimized) standard sort algorithms implemented in Java, and compares their performance.]

[Article discusses optimizing a quicksort.]

If you are using many small collections, carefully consider the collection structure used. Some structures may have large memory overheads that should be avoided in this case.

Use the builder pattern: break the construction of complex objects into a series of simpler Builder objects, and a Director object which combines the Builders to form the complex object. Then you can use a Recycler (a type of Director) to replace only the broken parts of the complex object, reducing the number of objects that need to be recreated.

Minimize the use of Metadata: Cache all metadata as they will not change; Avoid using null arguments in metadata methods; Use a dummy query with getMetadata() rather than getColumns().

Retrieve data as efficiently as possible: Minimize the amount of data returned by the query; Don't make average users pay the same query cost as users with extensive query requirements; Remember that users seldom want to see too much data in one go; Use setMaxRows(), setMaxFieldSize(), and setFetchSize(); Decrease the column size; Use the smallest packet size that will meet your needs (if the driver supports packet sizing).

Use a parametrized remote procedure call (RPC) rather than passing parameters as part of the RPC call, e.g. prepare the call with Connection.prepareCall("Call getCustName (?)") and set the parameter with setLong(1, 12345), rather than using Connection.prepareCall("Call getCustName (12345)").

Minimize connections; try to reuse connections.

Turn autocommit off.

Avoid using distributed transactions.

Use getBestRowIdentifier() to determine the optimal set of columns to use in the WHERE clause for updating data. (The columns returned could be pseudo-columns that provide pointers to the exact location of the data, and are not obtained by getColumns().)

EJB calls are expensive. A method call from the client could cover all the following: get Home reference from the NamingService (one network round trip); get EJB reference (one or two network roundtrips plus remote creation and initialization of Home and EJB objects); call method and return value on EJB object (two or more network roundtrips: client-server and [multiple] server-db; several costly services used such as transactions, persistence, security, etc.; multiple serializations and deserializations).

If you don't need EJB services for an object, use a plain Java object and not an EJB object.

Use Local interfaces (from EJB2.0) if you deploy both EJB Client and EJB in the same JVM. (For EJB1.1 based applications, some vendors provide pass-by-reference EJB implementations that work like Local interfaces).

Wrap multiple entity beans in a session bean to change multiple EJB remote calls into one session bean remote call and several local calls (pattern called SessionFacade).

Change multiple remote method calls into one remote method call with all the data combined into a parameter object.

Use the fastest driver available to the database: normally type 4 (preferably) or type 3.

Tune the defaultPrefetch and defaultBatchValue settings.

Get database connections from a connection pool: use javax.sql.DataSource for optimal configurability. Use the vendor's connection pool; or ConnectionPoolDataSource and PooledConnection from JDBC2.0; or a proprietary connection pool.

Batch your transactions. Turn off autocommit and explicitly commit a set of statements.

The SessionFacade Pattern reduces network calls by combining accesses to multiple Entity beans into one access to the facade object.

The MessageFacade/ServiceActivator Pattern moves method calls into a separate object which can execute asynchronously.

The ValueObject Pattern combines remote data into one serializable object, thus reducing the number of network transfers required to access multiple items of remote data.

The ValueObjectFactory/ValueObjectAssembler Pattern combines remote data from multiple remote objects into one serializable object, thus reducing the number of network transfers required to access multiple items of remote data.

The ValueListHandler Pattern: avoids using multiple Entity beans to access the database, using Data Access Objects which explicitly query the database; and returns the data to the client in batches (which can be terminated) rather than in one big chunk, according to the Page-by-Page Iterator pattern.

The CompositeEntity Pattern reduces the number of actual entity beans by wrapping multiple java objects (which could otherwise be Entity beans) into one Entity bean.

Switching audio streams from one piece of sound to another requires some fiddly managing of the transition delay in order to avoid a gap in the audio output.

To avoid the transition delay, you need to: flush the output buffer; find out how much data was dumped; add a fudge factor; and combine these values to determine from where to start playing the new audio stream.

If a complex interpreted procedure is expected to be used more than once, it can be more efficient to convert the procedure into an expression tree which will apply the procedure optimally.

Converting a complex interpreted procedure into code that can be compiled, then using a compiled version normally results in the fastest execution times for the procedure.

Sun's javac is not a very efficient compiler. Faster compilers are available, such as jikes.

Compiling code at runtime can take a significant amount of time. If the compile time needs to be minimized, it is important to use the fastest compiler available.

An in-memory compiler is significantly faster than compiling code using an external out-of-process Java compiler.

Generating bytecode directly in-process is significantly faster than compiling code using an external out-of-process Java compiler, and is also faster than using an in-memory compiler. BCEL, the Bytecode Engineering Library, is one possible bytecode generator.

The Service Locator pattern improves performance by caching service objects that have a high-lookup cost.

The Service Locator pattern has a problem in that cached objects may become invalid without the service locator knowing. The Verified Service Locator pattern periodically tests the validity of the cached objects to avoid providing invalid service objects to requestors.

For very large transactions, use transaction attribute TX_REQUIRED for EJB methods to have all the method calls in a call chain use the same transaction.

Make tightly coupled components local to each other. Use remote beans primarily as facades across subsystems.

The page-by-page pattern is designed to handle cases where the result set is large, and the end-user is not interested in seeing all of the results. There is really no upper threshold for the size of the result set in the pattern.

A hardware- or software-based HTTP load-balancer usually sits in front of the application servers within a cluster. The load balancer can decrypt HTTPS requests and distribute load.

HTTP session replication is expensive for a J2EE application server. If you can live with forcing a user to log in again after a server failure, then an HTTP load-balancer probably provides all of the fail-over and load-balancing functionality you need.

If you are storing things other than EJB Home references in your JNDI tree, then you may need clustered JNDI.

24/7 availability needs the ability to hot-deploy and undeploy new applications and new versions, and to apply patches, without bringing down the application server for maintenance.

Smart proxies can be used to implement load-balancing and fail-over for EJB remote clients. These proxies manage a list of available RMI connections one of which it will use to service an invocation.

Databases analyze query statements to decide how to process them optimally, then cache the resulting query plan, keyed on the full statement. Reusing identical statements reuses the query plan.

Altering the statement causes a new query plan to be generated for each new statement. However statements with parameters can have the query plan reused, so use parameters rather than regenerating the statement with different values.

Using a new connection requires a prepared statement to be recreated. Reusing connections allows a prepared statement to be reused.

Connection pools should have associated PreparedStatement caches so that the PreparedStatements are automatically reused.

Redraw events can easily be generated faster than the redraw can execute. Ignore redraw events (or block their generation) until the current redraw is finished. Don't queue up redraw events.

Consider holding a redraw event for a few milliseconds to see whether it can be discarded because another redraw event arrives.

If possible, consider drawing to off-screen buffers, and execute copies from that buffer in response to redraws, rather than actually redrawing.

Extend from JPanel, not Canvas; override paintComponent(), not paint().

Action listeners are all executed in the one event-dispatching thread. Time-consuming listeners should execute their work in a separate thread and should avoid blocking the event-dispatching thread. (To re-enter the event-dispatching thread, call SwingUtilities.invokeLater() or invokeAndWait().)
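
A minimal sketch (the worker method and label are illustrative): do the slow work on a separate thread, then hand the UI update back to the event-dispatching thread with invokeLater():

    import javax.swing.JLabel;
    import javax.swing.SwingUtilities;

    public class BackgroundWorkExample {
        public static void startSlowLoad(final JLabel status) {
            new Thread(new Runnable() {
                public void run() {
                    final String result = loadDataSlowly();     // long-running work, off the EDT
                    SwingUtilities.invokeLater(new Runnable() {
                        public void run() {
                            status.setText(result);             // UI update back on the EDT
                        }
                    });
                }
            }).start();
        }

        private static String loadDataSlowly() {
            try { Thread.sleep(3000); } catch (InterruptedException e) { }
            return "done";
        }
    }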

Use the latest version of Swing available, as the Swing development team have an ongoing project to improve performance.

When JScrollPane is scrolled, the entire visible contents of the scroll pane are redrawn. A backing store (off screen buffer) can be enabled using setBackingStoreEnabled(true) to speed up redraws, but this has some limitations: an extra buffer to copy can be significant for simple drawing operations; the backing store doesn't work when scrollRectToVisible() is called directly by the programmer (depends on Swing version); extra RAM is needed to maintain the extra backing buffer.

Web application scalability is the ability to sustain the required number of simultaneous users and/or transactions, while maintaining adequate response times to end users.

The first solution built with new skills and new technologies will always have room for improvement.

Avoid deploying an application server that will cause embarrassment, or that could weaken customer confidence and business reputation [because of bad response times or lack of scalability].

Consider application performance throughout each phase of development and into production.

Performance testing must be an integral part of designing, building, and maintaining Web applications.

There appears to be a strong correlation between the use of performance testing tools and the likelihood that a site would scale as required.

Automated performance tests must be planned for and iteratively implemented to identify and remove bottlenecks.

Validate the architecture: decide on the maximum scaling requirements and then performance test to validate the necessary performance is achievable. This testing should be done on the prototype, before the application is built.

Have a clear understanding of how easily your configurations of Web, application, and/or database servers can be expanded.

Factor in load-balancing software and/or hardware in order to efficiently route requests to the least busy resource.

Consider the effects security will have on performance: adding a security layer to transactions will impact response times. Dedicate specific server(s) to handle secure transactions.

Select performance benchmarks and use them to quantify the scalability and determine performance targets and future performance improvements or degradations. Include all user types such as "information-gathering" visitors or "transaction" visitors in your benchmarks.

Perform "Performance Regression Testing": continuously re-test and measure against the established benchmark tests to ensure that application performance hasn?t been degraded because of the changes you?ve made.

Performance testing must continue even after the application is deployed. For applications expected to perform 24/7, seemingly inconsequential issues like database logging can degrade performance. Continuous monitoring is key to spotting even the slightest abnormality: set performance capacity thresholds and monitor them.

When application transaction volumes reach 40% of maximum expected volumes, it is time to start executing plans to expand the system.

The only reliable way to determine a system's scalability is to perform a load test in which the volume and characteristics of the anticipated traffic are simulated as realistically as possible.

It is hard to design and develop load tests that come close to matching real loads.

Characterize the anticipated load as objectively and systematically as possible: use existing log files where possible; characterize user sessions (pages viewed - number and types; duration of session; etc). Determine the range and distribution of variations in sessions. Don't use averages, use representative profiles.

The user's view of the response time for a page view in his browser depends on download speed and on the complexity of the page, e.g. the number of graphics. A poorly-designed, highly graphical dynamic website could be seen as 'slow' even if the individual web downloads are quite fast.

No web application can handle an unlimited number of requests; the trick in optimization is to anticipate the likely user demand and ensure that the web site can gracefully scale up to the demand while maintaining acceptable levels of speed.

Profile the server to identify the bottlenecks. Note that profiling can be done by instrumenting the code with measurement calls if a profiler is unavailable.

One stress test methodology is: determine the maximum acceptable response time for getting a page; estimate the maximum number of simultaneous users; simulate user requests, gradually adding simulated users until the web application response delay becomes greater than the acceptable response time; optimize until you reach the desired number of users.

Pay special attention to refused connections during your stress test: these indicate the servlet is overwhelmed.

I/O reads are normally faster than writes. This means that I/O performance can be improved by decoupling reading and writing to dedicated threads, rather than interleaving reads and writes.

NOTE THE TIP "volatile primitive datatypes have atomic ++ operations" HAS BEEN SHOWN TO BE INVALID

[The chapter describes implementations for lock objects (wait until unlocked), counting semaphore objects (wait until positive), barrier semaphore objects (wait until the last thread is finished), and future objects (wait until a variable is first set). These do not directly improve performance, but provide useful techniques for synchronizing threads that help a multi-threaded program be efficient].

Use resource enumeration (acquire resources in a set order) to avoid deadlocks.

Java monitors are not necessarily the most efficient synchronization mechanism, especially if transferring the lock can lead to a race condition [chapter discusses a more complete Monitor class].

volatile fields can be slower than non-volatile fields, because the system is forced to store to memory rather than use registers. But they may be useful to avoid concurrency problems.

[The chapter discusses various policies for synchronizing threads trying to read from or write to shared resources, which provide different scheduling policies: one thread at a time; readers-preferred (readers have priority); writers-preferred (writers have priority); alternating readers-writers (alternates between a single writer and a batch of readers); take-a-number (first-come, first-served)].

Scaling middleware exposes a number of issues such as threading contention, network bottlenecks, message persistence issues, memory leaks, and overuse of object allocations.

[Article discusses questions to ask when setting up benchmarks for messaging middleware].

Message traffic under high-volume conditions is unpredictable and bursty. Messages can be produced far faster than they can be consumed, causing congestion. This condition requires the message sends to be throttled with flow control (which could be an exception, or an automatic resend).

When testing performance, run overnight and over weekends to generate longer term trends. Some concerns are: testing without a real network connection can give false measures; low user simulation can be markedly different from high user simulations; network throughput may be larger than in the deployed environment; nonpersistent message performance is dependent on processor and memory; disk speed is crucial for persistent messages.

Watch out for method interfaces which force unnecessary or inefficient object creation.

Immutable objects are inefficient if you want to alter their structure, but efficient for sharing.

One way to avoid creating objects simply for information is to provide finer-grained methods which return information as primitives. This swaps object creation for increased method calls.

A second technique to avoid creating objects is to provide methods which accept dummy information objects that have their state overwritten to pass the information.

A third technique to avoid creating objects is to provide immutable classes with mutable subclasses, by having state defined as protected in the superclass, but with no public updators. The subclass provides public updators, hence making it mutable.
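
A minimal sketch of that third technique (class and field names are illustrative):

    // Immutable to outside callers: state is protected and there are no public updators.
    public class Point {
        protected int x;
        protected int y;
        public Point(int x, int y) { this.x = x; this.y = y; }
        public int getX() { return x; }
        public int getY() { return y; }
    }

    // Mutable subclass: adds public updators, so a single instance can be reused
    // to pass varying information instead of creating a new object each time.
    class MutablePoint extends Point {
        public MutablePoint(int x, int y) { super(x, y); }
        public void setX(int x) { this.x = x; }
        public void setY(int y) { this.y = y; }
    }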

Don't try to speed up the application if there is no performance problem.

Caching data on the client can improve performance, reduce communication overheads and increase the scalability of an application.

Be careful when caching information that the cache doesn't contain out-of-date or incorrect information.

Servlet sessions expire after a settable timeout, but screens that automatically refresh can keep a session alive indefinitely, even when the screen is no longer in use.

Database connection pools can take one of two strategies: a limited size pool, where attempts to make connections beyond the pool size must wait for a connection to become idle; or a flexible sized pool with a preferred size which removes idle connections as soon as the preferred size is exceeded (i.e. temporarily able to exceed the preferred size). The fixed size pool is generally considered to be the better choice.

A time-based expiration strategy is appropriate for most types of cache elements. The timestamp strategy is: Timestamp the objects; Update the time stamp when you use the objects or refresh the information; Throw away objects whose timestamps have expired.
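
A minimal sketch of that timestamp strategy (the timeout and key type are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class TimedCache {
        private static final long TIMEOUT_MILLIS = 60 * 1000;   // illustrative expiry

        private static class Entry {
            Object value;
            long timestamp;
            Entry(Object value) { this.value = value; timestamp = System.currentTimeMillis(); }
        }

        private final Map entries = new HashMap();

        public synchronized void put(Object key, Object value) {
            entries.put(key, new Entry(value));
        }

        public synchronized Object get(Object key) {
            Entry e = (Entry) entries.get(key);
            if (e == null) {
                return null;
            }
            if (System.currentTimeMillis() - e.timestamp > TIMEOUT_MILLIS) {
                entries.remove(key);                             // expired: throw the object away
                return null;
            }
            e.timestamp = System.currentTimeMillis();            // refresh the timestamp on use
            return e.value;
        }
    }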

Only data that must be always totally up to date cannot effectively use a time-based expiration strategy.

J2ME device memory and speeds are very limited which affects everything from the data-loading speed to the frame/refresh rate, and seriously limits the ability to animate characters or otherwise rapidly change the screen.

Smart graphics is important: you need to draw clear, concise images at extremely low resolutions and with very small palettes. Animated characters need dynamic, easily-read poses which avoid kicks looking like dance steps, or punches looking like arm waves.

Use public variables in your classes, rather than using accessors. This is technically bad programming practice but it saves bytecode space.

Be extra careful to place things in memory only when they are in use. For example, discard an introduction splash screen after display.

Try to reduce the number of classes used. Combine classes into one if they vary only slightly in behavior. Every class adds size overheads.

Remember that loading and installing applications into J2ME phones is a relatively slow process.

LinkedHashMap preserves various ordering information, optionally including access ordering which makes LinkedHashMap appropriate for a least recently used (LRU) cache.
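
A minimal sketch of an LRU cache built on LinkedHashMap (the capacity handling is illustrative): the third constructor argument enables access ordering, and overriding removeEldestEntry() evicts the least recently used entry once the cache is full:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruCache extends LinkedHashMap {
        private final int maxEntries;

        public LruCache(int maxEntries) {
            super(16, 0.75f, true);   // accessOrder = true: iteration order is least to most recently used
            this.maxEntries = maxEntries;
        }

        protected boolean removeEldestEntry(Map.Entry eldest) {
            // Returning true tells LinkedHashMap to drop the least recently used entry.
            return size() > maxEntries;
        }
    }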

ArrayList has fast random access of elements, LinkedList has slow random access of elements. List classes that implement the RandomAccess interface have fast random access and using get() to iterate their elements is efficient. If RandomAccess is not implemented, use an Iterator to iterate the elements.
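
A minimal sketch of choosing the traversal based on RandomAccess (from SDK 1.4):

    import java.util.Iterator;
    import java.util.List;
    import java.util.RandomAccess;

    public class ListTraversal {
        public static void printAll(List list) {
            if (list instanceof RandomAccess) {
                // Indexed access is cheap (e.g. ArrayList), so get() in a loop is fine.
                for (int i = 0; i < list.size(); i++) {
                    System.out.println(list.get(i));
                }
            } else {
                // Indexed access may require traversal (e.g. LinkedList), so use an Iterator.
                for (Iterator it = list.iterator(); it.hasNext(); ) {
                    System.out.println(it.next());
                }
            }
        }
    }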

Painting pixel by pixel by repeatedly calling fillRect() is slow. Instead create the offscreen image as a decorator for a java.awt.image.MemoryImageSource object containing a byte array in RGB format with the pixel data. The rendering code updates the byte array and then calls MemoryImageSource.newPixels() to notify the object that the data has been updated.

Pre-render common images or pixel combinations, retain them as Image objects and use java.awt.Graphics.drawImage() (Java 1) or java.awt.image.BufferedImage.setRGB() (Java 2) to render the image to the graphics buffer.

If your dataset is small enough, read it all into memory or use an in-memory database (keeping the primary copy on disk for recovery).

An in-memory database avoids the following overheads: no need to pass data in from a separate process; less memory allocation by avoiding all the data copies as data is passed between processes and layers; no need for data conversion; fine-tuned sorting and filtering is possible; other optimizations become simpler.

Pre-calculation makes some results faster by making the database data more efficient to access (by ordering it in advance for example), or by setting up extra data in advance, generated from the main data, to make calculating the results for a query simpler.

Pre-determine possible data values in queries, and use boolean arrays to access the chosen values.

Pre-calculate all formatting that is invariant for generated HTML pages. Cache all reused HTML fragments.

Caching many strings may consume too much memory. If memory is limited, it may be more effective to generate strings as needed.

Write out strings individually, rather than concatenating them and writing the result.

Extract common strings into an identical string object.

Compress generated html pages to send to the user, if their browser supports compressed html. This is a heavier load on the server, but produces a significantly faster transfer for limited bandwidth clients.
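
A minimal sketch for a servlet environment (the helper name is illustrative): check the Accept-Encoding header and wrap the response stream in a GZIPOutputStream only when the browser advertises gzip support. The caller must close the returned stream so the gzip trailer is written.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CompressedResponse {
        public static OutputStream open(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            String accepted = request.getHeader("Accept-Encoding");
            if (accepted != null && accepted.indexOf("gzip") >= 0) {
                response.setHeader("Content-Encoding", "gzip");
                // More CPU on the server, much less data over a slow client link.
                return new GZIPOutputStream(response.getOutputStream());
            }
            return response.getOutputStream();
        }
    }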

Some pages are temporarily static. Cache these pages, and only re-generate them when they change.

Web services best practices are mainly the same as guidelines for developing other distributed systems.

Stay away from using XML messaging to do fine-grained RPC, e.g. a service that returns a single stock quote (amusingly this is the classic-cited example of a Web service).

Do use coarse-grained RPC, that is, use Web services that "do a lot of work, and return a lot of information".

When the transport may be slow and/or unreliable, or the processing is complex and/or long-running, consider an asynchronous messaging model.

Always take the overall system performance into account. Don't optimize until you know where the bottlenecks are, i.e., don't assume that XML's "bloat" or HTTP's limitations are a problem until they are demonstrated in your application.

Take the frequency of the messaging into account. Replicate data as necessary.

For aggregation services, try to retrieve data during off-hours in large, coarse-grained transactions.

String concatenation '+' is implemented by the Sun compiler using StringBuffer, but each concatenation creates a new StringBuffer so is inefficient for multiple concatenations.
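
A minimal sketch contrasting the two approaches (method names are illustrative):

    public class Concat {
        // Each '+' inside the loop creates a new StringBuffer and an intermediate String.
        public static String slowJoin(String[] parts) {
            String result = "";
            for (int i = 0; i < parts.length; i++) {
                result = result + parts[i];
            }
            return result;
        }

        // One StringBuffer reused for every append.
        public static String fastJoin(String[] parts) {
            StringBuffer buf = new StringBuffer(parts.length * 16);   // rough presize
            for (int i = 0; i < parts.length; i++) {
                buf.append(parts[i]);
            }
            return buf.toString();
        }
    }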

Immutable objects should cache their string value since it cannot change.

Operating systems can keep files in their own file cache in memory, and accessing such a memory-cached file is much faster than accessing from disk. Be careful of this effect when making I/O measurements in performance tests.

Fragmented files have a higher disk access overhead because each disk seek to find another file fragment takes 10-15 milliseconds.

Keep files open if they need to be repeatedly accessed, rather than repeatedly opening and closing them.

Use buffering when accessing file contents.

Explicit buffering (reading data into an array) gives you direct access to the array of data which lets you iterate over the elements more quickly than using a buffered wrapper class.

Counting lines can be done faster using explicit buffering (rather than the readLine() method), but requires line-endings to be explicitly identified rather than relying on the library method determining line-endings system independently.
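
A minimal sketch of the explicit-buffering version (it assumes '\n' marks line endings, which is exactly the simplification the tip describes):

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    public class LineCounter {
        public static int countLines(String fileName) throws IOException {
            Reader in = new FileReader(fileName);
            char[] buf = new char[8192];                 // explicit buffer, iterated directly
            int lines = 0;
            int read;
            while ((read = in.read(buf)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (buf[i] == '\n') {                // naive line-ending test, unlike readLine()
                        lines++;
                    }
                }
            }
            in.close();
            return lines;
        }
    }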

Quality of service requirements for web services are: availability (is it running); accessibility (can I run it now); integrity/reliability (will it crash while I run it/how often); throughput (how many simultaneous requests can I run); latency (response time); regulatory (conformance to standards); security (confidentiality, authentication).

HTTP is a best-effort delivery service. This means any request could simply be dropped. Web services have to handle this and retry.

Web service latencies are measured in the tens to thousands of milliseconds.

Asynchronous messaging can improve throughput, at the cost of latency.

SOAP overheads include: extracting the SOAP envelope; parsing the contained XML information; XML data cannot be optimized very much; SOAP requires typing information in every SOAP message; binary data gets expanded (by an average of 5-fold) when included in XML, and also requires encoding/decoding.

Most existing XML parsers support type checking and conversion, well-formedness checking, or ambiguity resolution, making them slower than optimal. Consider using a stripped-down XML parser which only performs essential parsing.

DOM based parsers are slower than SAX based ones.

Compress the XML when the CPU overhead required for compression is less than the network latency.

A server that caters to hundreds of clients simultaneously must be able to use I/O services concurrently. Prior to 1.4, an almost one-to-one ratio of threads to clients made servers written in Java susceptible to enormous thread overhead, resulting in both performance problems and lack of scalability.

The Reactor design pattern demultiplexes events and dispatches them to registered object handlers. (The Observer pattern is similar, but handles only a single source of events where the Reactor pattern handles multiple event sources).

[Article covers the changes needed to use java.nio to make a server efficiently multiplex non-blocking I/O from SDK 1.4].

Executing a search against the database calls one of the finder() methods. finder() methods must return a collection of remote interfaces, not ValueObjects. Consequently the client would need to make a separate remote call for each remote interface received, to acquire data. The SessionFacade pattern suggests using a session bean to encapsulate the query and return a collection of ValueObjects, thus making the request a single transfer each way.

The Value Object Assembler pattern uses a Session EJB to aggregate all required data as various types of ValueObjects. This pattern is used to satisfy one or more queries a client might need to execute in order to display multiple data types.

Applications with high screen performance needs, like games, need finer control over MIDP screens and should use the javax.microedition.lcdui package which provides the low-level API for handling such cases.

Always check the drawing area dimensions using Canvas.getHeight() and Canvas.getWidth() [so that you don't draw unnecessarily off screen].

Not all devices support color. Use Display.isColor() and Display.numColors( ) to determine color support and avoid color mapping [overheads].

Double buffering is possible by using an offscreen Image the size of the screen. Creating the image: i = Image.createImage(width, height); Getting the Graphics context for drawing: i.getGraphics(); Copying to the screen: g.drawImage(i, 0, 0, Graphics.TOP | Graphics.LEFT);

Check with Canvas.isDoubleBuffered(), and don't double-buffer if the MIDP implementation already does it for you.
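
A minimal MIDP sketch pulling these pieces together (the drawing content is illustrative): allocate the offscreen Image only when the implementation is not already double buffered, and release it when the canvas is hidden:

    import javax.microedition.lcdui.Canvas;
    import javax.microedition.lcdui.Graphics;
    import javax.microedition.lcdui.Image;

    public class BufferedCanvas extends Canvas {
        private Image offscreen;

        public void paint(Graphics g) {
            if (isDoubleBuffered()) {
                drawScene(g);                            // the system already buffers for us
            } else {
                if (offscreen == null) {
                    offscreen = Image.createImage(getWidth(), getHeight());
                }
                drawScene(offscreen.getGraphics());
                g.drawImage(offscreen, 0, 0, Graphics.TOP | Graphics.LEFT);
            }
        }

        private void drawScene(Graphics g) {
            g.setColor(0xFFFFFF);                        // illustrative: just clear to white
            g.fillRect(0, 0, getWidth(), getHeight());
        }

        protected void hideNotify() {
            offscreen = null;                            // free the buffer while hidden
        }
    }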

To avoid deadlock paint() should not synchronize on any object already locked when serviceRepaints() is called.

Entering alphanumeric data through a handheld device can be tedious. If possible, provide a list of choices from which the user can select.

The out-of-the-box configuration for Entity EJB engines, such as WebLogic, is designed to handle read-write transactional data with the best possible performance.

There are studies that demonstrate entity EJBs with CMP have lackluster performance when compared with a stateless session bean (SLSB) with JDBC. [Author points out however that SLSB/JDBC combination is less robust, less configurable, and less maintainable].

Local entity beans do not need to be marshalled, and do not incur any marshalling overhead for method calls either: parameters are passed by reference.

Local entity beans are an optimization for beans which are known to be in the same JVM as their callers.

Facade objects (wrappers) allow local entity beans to be called remotely. This pattern incurs very little overhead for remote calls, while at the same time optimizing local calls between local beans which can use local calls.

http://developer.java.sun.com/developer/Books/programming/performance/eperformance/eJavaCh01.pdf
Chapter 1 of "Enterprise Java Performance", "Performance in General". Includes the infamous sentences "It is likely that the code will not meet the performance requirements the very first time it runs. Even if it does, it may be worthwhile to look for some ways to improve it." NO NO NO! If the code meets the performance requirements, DON'T CHANGE IT. Next time guys, ask me to review your book before you publish. (Page last updated 2000, Added 2000-10-23, Authors Steven Halter & Steven Munroe, Publisher Sun). Tips:

The simplest code usually performs best.

Consider performance requirements before coding.

Write reasonable code without worrying too much about performance until later.

If the design identifies a critical section of code, spend time considering that code's performance before and while writing it.

MarshalledObject lets you postpone deserializing objects. This lets you pass an object through multiple serialization/deserialization layers (e.g. passing an object through many JVMs), without incurring the serialization/deserialization overheads until absolutely necessary.
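
A minimal sketch (the payload type is illustrative): the object is serialized once when wrapped, passed around as a MarshalledObject, and only deserialized where it is actually needed:

    import java.io.IOException;
    import java.rmi.MarshalledObject;
    import java.util.Date;

    public class DeferredDeserialization {
        // Serializes the payload once, into the MarshalledObject's internal byte form.
        public static MarshalledObject wrap(Date payload) throws IOException {
            return new MarshalledObject(payload);
        }

        // The wrapper can be passed through further serialization layers cheaply;
        // the payload is only deserialized here, when it is finally needed.
        public static Date unwrap(MarshalledObject mo) throws IOException, ClassNotFoundException {
            return (Date) mo.get();
        }
    }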

Don't optimize unless necessary. Optimizing can: introduce new bugs; make code harder to understand and maintain; reduce the extensibility of the code.

90 percent of a program's execution time is spent executing 10 percent of the code. (Some people use the 80 percent/20 percent rule.) Optimizing the other 90 percent of the program (where only 10 percent of the execution time is spent) has no noticeable effect on performance.

All the following affect embedded Java performance: hardware processor selection; (real-time) operating system selection; supported Java APIs; application reliability and scalability; graphics support; and the ability to put the application code into ROM.

Various approaches for boosting bytecode execution speed include: a JIT compiler (usually too big for embedded systems); an ahead-of-time compiler (requires more ROM, may disallow or slow down dynamically loaded classes); a dynamic adaptive compiler (a halfway house between the last two options); putting the Java application code into ROM; rewriting the JVM interpretation loop in assembly; using a Java hardware accelerator.

Use the lightweight graphical toolkit.

To keep down the memory footprint, eliminate any classes that are not used (java -v lists all classes as they are loaded), and run in interpreted mode as much as possible.

Benchmark results are not necessarily applicable to your application [article reviews the applicability of standard and proprietary benchmarks].

Thoroughly test any framework in a production-like environment to ensure that stability and performance requirements are met.

Each component should be thoroughly reviewed and tested for its performance and security characteristics.

Using the underlying EJB container to manage complex aspects such as transactions, security, and remote communication comes with the price of additional processing overhead.

To ensure good performance use experienced J2EE builders and use proven design patterns.

Consider the impact of session size on performance.

Avoid the following common mistakes: Failure to close JDBC result sets, statements, and connections; Failure to remove unused stateful session beans; Failure to invalidate HttpSession.

Performance test various options, for example, test both Type 2 and Type 4 JDBC drivers; Use a load-generation tool to simulate moderate loads; monitor the server to identify resource utilization.

Perform code analysis and profiling.

Performance requirements include: the required response times for end users; the perceived steady state and peak user loads; the average and peak amount of data transferred per Web request; the expected growth in user load over the next 12 months.

Note that peak user loads are the number of concurrent sessions being managed by the application server, not the number of possible users using the system.

Application server caching should include web-page caches and data access caches. Other caches include caching servers which "guard" the application server, intercepting requests and either serving those that do not need to go to the server, or rejecting or delaying those that may overload the app server.

Application servers should use connection pooling and database caching to minimize connection overheads and round-trips.

Load balancing mechanisms include: round-robin DNS (alternating different IP-addresses assigned to a server name); and re-routing mechanisms to distribute requests across multiple servers. By maintaining multiple re-routing servers and a client connection mechanism that automatically checks for an available re-routing server, fault tolerance is added.

Using one thread per user can become a bottleneck if there are a large number of concurrent users.

Distributed components should consider the proximity of components to their data (i.e., avoid network round-trips) and how to distribute any resource bottlenecks (i.e., CPU, memory, I/O) across the different nodes.

The include directive (<%@ include file="filename.inc" %>) is faster than the include action (<jsp:include page="pagename.jsp" flush="true"/>).

Redirects are slower than forwards because the browser has to make a new request.

Database access is typically very expensive in terms of server resources. Use a connection pool to share database connections efficiently between all requests, but don't use the JDBC ResultSet object itself as the cache object.

Pessimistic locking, where database data is locked when read, can lead to high lock contention.

Optimistic locking only checks data integrity at update time, so has no lock contention [but can have high rollback costs]. This Optimistic Locking pattern is usually more scalable than pessimistic locking.

Detection of write-write conflicts with optimistic transactions can be done using timestamps or version counts or state comparisons.

Reduce compiled code size by using implicit instruction bytecodes wherever possible. For example, limiting a method to four or fewer local variables (three on non-static methods as "this" takes the first slot), allows the compiler to use implicit forms of instructions (such as aload, iload, fload, astore, istore, fstore, and so on).

Similarly, the numbers -1, 0, 1, 2, 3, 4 and 5 have special bytecodes.

Java class files are standalone - no data is shared between class files. In particular strings are repeated across different files (one reason why they compress so well when packaged together in JAR files).

An empty class compiles to about 200 bytes, of which only 5 bytes are bytecode.

There are no instructions for initializing complete arrays in the Java VM. Instead, compilers must generate a series of bytecodes that initialize the array element by element. This can make array initialization slow, and adds bytecode to the class.

You can reduce bytecode bloat from array initialization by encoding values in strings and using those strings to initialize the arrays.
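
[A sketch of the string-encoding idea; the packed values are arbitrary examples:]

    public class PackedTable {
        // Each char holds one small int value, packed at build time.
        private static final String PACKED = "\u0001\u0002\u0005\u0007\u0010\u0015";

        static final int[] TABLE = unpack(PACKED);

        private static int[] unpack(String s) {
            int[] values = new int[s.length()];
            for (int i = 0; i < s.length(); i++) {
                values[i] = s.charAt(i);   // one small loop instead of per-element bytecode
            }
            return values;
        }
    }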

Explicitly set references to null when they are no longer needed to ensure that the objects can be garbage collected.

Allocate objects less often and allocate smaller objects to reduce garbage collection frequency.

Use the MediaTracker to load all required images before drawing, using checkID(anInt, true)/checkAll(true) [asynchronous] or waitForID()/waitForAll() [synchronous]. [example code included in article]

Combine images in a single file (e.g. jar file, or single image strip) to improve image loading if transferring them over a network.

Large RAM requirements can force the OS to use virtual memory, which slows down the application.

Most JVM implementations will not release a local variable's reference to a temporary object until the method exits, even if the object was created in an inner block that has already gone out of scope. So you need to explicitly null the variable if you want the object to be collectable earlier.

Adding a finalizer method extends the life of the object, since it cannot be collected until the finalize() method is run.

DNS round-robin sends each subsequent DNS lookup request to the next entry for that server name. This provides a simple machine-level load-balancing mechanism, but is only appropriate for session independent or shared-session servers.

DNS round-robin has no server load measuring mechanisms, so requests can still go to overloaded servers, i.e. the load balancing can be very unbalanced.

Hardware load-balancers solve many of the problems of DNS round-robin, but introduce a single point of failure.

A web server proxy can also provide load-balancing by redirecting requests to multiple backend webservers.

Every network communication has several overheads: the distance between the sender and the receiver adds a minimum latency (limited by the speed the signal can travel along the wire, about two-thirds of the speed of light: London to New York would take roughly 27 milliseconds); each network router and switch adds time to respond to data, on the order of 0.1 milliseconds per device per packet.

Part of most network communications consists of small control packets, adding significant overhead.

One RMI call does not generally cause a noticeable delay, but even tens of RMI calls can be noticeable to the users.

Beans written with many getXXX() and setXXX() methods can incur an RMI round trip for every data attribute.

Messaging is naturally asynchronous, and allows an application to decouple network communications from ongoing processing, potentially avoiding having threads blocked on communications.

Generative programming is a class of techniques that allows for more flexible designs without the performance overhead often encountered when following a more traditional programming style. JSP engines are one example. java.lang.reflect.Proxy is another.

More advanced code obfuscations (such as control-flow obfuscation) can produce slower programs as the obfuscated bytecode is more difficult to optimize by the JIT or HotSpot compiler.

A reflective lookup [obtaining the method reference from its name] is much slower than a reflective invoke [invoking the method through a reference you already have].

[Article provides an implementation of the JNI call using the JVM_OnLoad() function to trap class bytecodes as they are loaded].

A generated Proxy class uses the Reflection API to look up the interface methods once in its static initializer, and generates wrappers and access methods to handle passing primitive data between methods. [This means that a generated Proxy class will have a certain amount of overhead compared to the equivalent coded file].

GC is single-threaded (at least up to 1.3.x), so it cannot take advantage of multiple CPUs (i.e. a multi-processor machine can end up mostly idle during GC phases if running a single JVM).

Too many threads can lead to thread "starvation" [presumably thrashing].

Use at least one thread per CPU, more if any threads will be i/o blocked. On Solaris use the mpstat utility to monitor CPU utilization.

1.4 will include concurrent GC that should avoid large GC pauses.

The biggest performance problem is bad design.

Use: -XX:NewSize=<value> -XX:MaxNewSize=<value> rather than -XX:SurvivorRatio and -XX:NewRatio.

Set the initial heap size to the maximum heap size when you know what size heap you'll need, to avoid wasting time growing the heap as the space fills up. If you're not sure how big a heap you'll want, set a smaller initial size and let the heap grow only if the space is needed.

-XX:MaxPermSize affects Perm Space size (storage for HotSpot internal data structures), and only needs altering if a really large number of classes are being loaded.
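
[Putting the preceding flags together, an illustrative command line might look like the following; MyApp is a placeholder main class and the sizes are examples only:]

    java -Xms256m -Xmx256m -XX:NewSize=64m -XX:MaxNewSize=64m -XX:MaxPermSize=128m MyApp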

[The session also discussed some Solaris OS parameters to tune].

For JDK 1.3, the heap is: TotalHeapSize = -Xmx setting + MaxPermSize; with -Xmx split into new and old spaces [i.e. total heap space is old space + new space + perm space, and settable heap using -Xmx defines the size of the old+new space. -XX:MaxNewSize defines how much of -Xmx heap space goes to new space].

When dealing with large numbers of active listeners, multicast publish/subscribe is more efficient than broadcast or multiple individual connections (unicast).

When dealing with large numbers of listeners with only a few active, or if dealing with only a few listeners, multicasting is inefficient. This scenario is common in enterprise application integration (EAI) systems. Inactive listeners require all missed messages to be resent to them in order when the listener becomes active.

A unicast-based message transport, such as message queuing organized into a hub-and-spoke model, is more efficient than multicast for most application integration (EAI) scenarios.

GatheringByteChannel lets you write a sequence of bytes from multiple buffers, and ScatteringByteChannel allows you to read a sequence of bytes into multiple buffers. Both let you minimize the number of system calls made by combining operations that might otherwise require multiple system calls.

Selector allows you to multiplex I/O channels, reducing the number of threads required for efficient concurrent I/O operations.

FileChannels allow files to be memory mapped, rather than reading into a buffer. This can be more efficient. [But note that both operations bring the file into memory in different ways, so which is faster will be system and data dependent].
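
[A sketch of memory mapping a file with the 1.4 NIO classes; the file name is an example:]

    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedRead {
        public static void main(String[] args) throws Exception {
            FileChannel channel = new FileInputStream("data.bin").getChannel();
            // Map the whole file instead of reading it into a byte array.
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            int checksum = 0;
            while (buffer.hasRemaining()) {
                checksum += buffer.get();
            }
            channel.close();
            System.out.println("checksum=" + checksum);
        }
    }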

Compression techniques have efficiencies that vary depending on the data being compressed. It's possible a proprietary compression technique could be the most efficient for a particular application. For example, instead of transmitting a compressed picture, the component objects that describe how to draw the picture may be a much smaller amount of data to transfer.

ZipOutputStream and GZIPOutputStream use an internal buffer size of 512 bytes, so a BufferedOutputStream is unnecessary unless its buffer size is significantly larger. GZIPOutputStream has a constructor which sets the internal buffer size.

Zip entries are not cached when a file is read using ZipInputStream and FileInputStream, but using ZipFile does cache data, so creating more than one ZipFile object on the same file only opens the file once.

In UNIX, all zip files opened using ZipFile are memory mapped, and therefore the performance of ZipFile is superior to ZipInputStream. If the contents of the same zip file are frequently changed, then ZipInputStream is the better choice.

Compressing data on the fly only improves performance when the data being compressed are more than a couple of hundred bytes.

An object is only counted as being unused when it is no longer referenced. If objects remain referenced unintentionally, this is a memory leak.

If you get a java.lang.OutOfMemoryError after a while, memory leakage is a strong suspect.

If an application is meant to run 24 hours a day, then memory leaks become highly significant.

Most JVMs grow towards the upper heap limit (-Xmx/-mx options) when more memory is required, and do not return memory to the operating system, even if the memory is no longer needed, until the JVM process terminates.

BigDecimal provides arbitrary-precision floating point number arithmetic, at the cost of performance.

Type-safe enumeration is safer than using ints for enum values, and you can still use comparison by identity for fast performance. But you lose the performance potential of using the enum values directly as array indices, switch constants and bitmasks.
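
[A sketch of the pre-1.5 type-safe enumeration idiom being referred to; the class is invented for illustration, and comparison is by identity:]

    public final class Status {
        public static final Status OPEN   = new Status("open");
        public static final Status CLOSED = new Status("closed");

        private final String name;
        private Status(String name) { this.name = name; }
        public String toString() { return name; }
    }

    // Usage: identity comparison is as cheap as comparing ints...
    //     if (status == Status.OPEN) { ... }
    // ...but a Status cannot be used directly as an array index, switch constant or bitmask.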

Graphics performance in 1.2 is worse than 1.1. 1.3 is better, and 1.4 should be the fastest yet.

From 1.2 direct access to image pixels was available, but was too slow to be usable because it involved copying many bits around in memory.

Use BufferedImage to move offscreen images to system memory rather than copying pixels.

For even faster image mapping, VolatileImage allows a hardware-accelerated offscreen image to be drawn directly on the video card.

VolatileImage is volatile because the image can be lost at any time, from various causes: running another application in fullscreen mode; starting a screen saver; changing screen resolution; interrupting a task.

Only constantly re-rendered images need to be explicitly created as VolatileImage objects to be hardware accelerated. Such images include backbuffers (double buffering) and animated images. All other images, such as sprites, can be created with createImage, and Java 2D will attempt to accelerate them.

If an image, such as a sprite, is drawn once and copied from many times, Java 2D makes a copy of it in accelerated memory and future copies from the image can perform better.

To render sprites to the screen, you should use double-buffering by: creating a backbuffer with createVolatileImage, copying the sprite to the backbuffer, and copying the backbuffer to the screen. If content loss occurs, Java 2D re-copies the sprite from software memory to accelerated memory.
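
[A sketch of that render loop, assuming a displayable component, a pre-loaded sprite image and the 1.4 APIs:]

    import java.awt.Component;
    import java.awt.Graphics;
    import java.awt.Image;
    import java.awt.image.VolatileImage;

    public class BackBufferRenderer {
        public static void render(Component target, Image sprite, int x, int y) {
            int w = target.getWidth();
            int h = target.getHeight();
            VolatileImage back = target.createVolatileImage(w, h);
            Graphics screen = target.getGraphics();
            do {
                if (back.validate(target.getGraphicsConfiguration())
                        == VolatileImage.IMAGE_INCOMPATIBLE) {
                    back = target.createVolatileImage(w, h);   // recreate if incompatible
                }
                Graphics g = back.getGraphics();
                g.drawImage(sprite, x, y, null);               // sprite to backbuffer
                g.dispose();
                screen.drawImage(back, 0, 0, null);            // backbuffer to screen
            } while (back.contentsLost());                     // redo if contents were lost
            screen.dispose();
        }
    }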

Only some graphics operations (e.g. curved shapes) are accelerated on some platforms. Use profiling to determine what works best for your situation.

From 1.4 Swing uses VolatileImage for its double buffering.

VolatileImage.getCapabilities() provides an ImageCapabilities object which gives details of the runtime capabilities of the VolatileImage. This allows the application to decide to use fewer images, images of lower resolution, different rendering algorithms, or various other means to attempt to get better performance in the current situation and platform.

Define the life cycles of objects and the duration of object interrelationships. Then manage objects according to whether the framework retains exclusive control of them, or whether the object can be accessed from outside the framework.

Minimize the number of objects that can be accessed from outside the framework.

In general, the creator of an object should be responsible for the object's life cycle. Where this is not the case, the transfer of ownership of the object should be explicit and emphasized. Similarly object relationship management should be explicit and reversible: for every add() action, there must be a remove(); for every register() action, there must be a deregister().

Obtain and release pooled connections within each method that requires the resource if the connection use is very brief (termed the "Quick Catch-and-Release Strategy" in the article). However do not release the connection only to use it again almost immediately; instead hold the connection until it will not be immediately needed.

The performance penalty of obtaining and releasing connections too frequently is quite small in comparison to potential scalability problems or issues raised because EntityBeans are holding on to the connections for too long.

The "Quick Catch-and-Release Strategy" is the best default strategy to ensure good performance and scalability.

[The compiler concatenates strings where they are fully resolvable, so don't move these concatenations to runtime with StringBuffer.]

Where the compiler cannot resolve concatenated strings at compile time, the code should be converted to StringBuffer appends, and the StringBuffer should be appropriately sized rather than using the default size.

Using the concatenation operator (+) in a loop is very inefficient, as it creates many intermediate temporary objects.
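
[A sketch of the StringBuffer approach for a concatenation the compiler cannot resolve; the per-element size estimate is an assumption:]

    public class JoinSketch {
        public static String join(String[] words) {
            // presize to a rough estimate rather than the default 16 characters
            StringBuffer buf = new StringBuffer(words.length * 16);
            for (int i = 0; i < words.length; i++) {
                buf.append(words[i]);   // no intermediate String objects per iteration
                buf.append(' ');
            }
            return buf.toString();
        }
    }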

Presizing collections (like Vector) to the expected size is more efficient than using the default size and letting the collection grow.

Removing elements from a Vector will necessitate copying within the Vector if the element is removed from anywhere other than the end of the collection.

Cache the size of the collection in a local variable to use in a loop instead of repeatedly calling collection.size().

Unsynchronized methods are faster than synchronized ones.

[Article discusses applying these optimizations to a thread pool implementation.]

Creating and dereferencing too many objects can adversely impact performance.

Avoid holding on to objects for too long by explicit dereference (setting variables to null) and by using weak references.

Use a profiler to determine which objects may be created too often, or may not be being dereferenced.

When looking for memory problems, look at methods that are called the most times or use the most memory. Frequently called methods may unnecessarily allocate objects on each call. Methods that use a lot of memory may not need to use as much memory or they may be a source of memory leaks.

Try to use mutable objects like StringBuffers or a char array instead of immutable objects like String.

Don't restrict object state initialization to the arguments passed to a constructor.

Provide a zero-argument constructor that creates reasonable default values and include setter methods or an init method to allow objects of that class to be reused.

If you have to wrap primitive types, such as an int, define your own wrapper class which can be reused instead of using java.lang.Integer.
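
[A sketch of such a reusable wrapper; the class name is invented for illustration:]

    public final class MutableInt {
        private int value;

        public MutableInt() { this(0); }                 // zero-argument constructor, sensible default
        public MutableInt(int value) { this.value = value; }

        public int getValue() { return value; }
        public void setValue(int value) { this.value = value; }   // allows the instance to be reused
    }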

If you need to create many instances of a wrapper class like Integer, consider writing your algorithm to accept primitive types.

Use a factory class instead of directly calling the "new" operator, to allow easier reuse of objects.

Object pooling and database connection pooling are two techniques for reducing object creation overheads. Object pools can be sources of memory leaks and can themselves be inefficient.

A B-tree outperforms a binary tree when used for external sorting (for example, when the index is stored out on disk) because each binary tree node searched cuts the number of keys that still need searching in half, whereas each B-tree node searched cuts the remaining keys to approximately 1/n of their number, where n is the number of keys on a node.

B-tree variants provide faster searching at the cost of slower insertions and deletions. Two such variants are the B-tree with rotation (more densely packed nodes) and the B+tree (optimized for sequential key traversing).

[Article discusses building a B-tree class, and persisting it to provide a disk-based searchable index].

Writing every data block to disk when any part of it changes would be bad for system performance. Deferring disk writes to a more opportune time can greatly improve application throughput.

Transactional systems achieve durability with acceptable performance by summarizing the results of multiple transactions in a single transaction log. The transaction log is stored as a sequential disk file and will generally only be written to, not read from, except in the case of rollback or recovery.

Writing an update record to a transaction log requires less total data to be written to disk (only the data that has changed needs to be written) and fewer disk seeks.

Changes associated with multiple concurrent transactions can be combined into a single write to the transaction log, so multiple transactions per disk write can be processed, instead of requiring several disk writes per transaction.

It's nearly impossible to achieve good performance through optimizations alone, without considering performance in analysis and design stages.

Creating clear system and performance requirements is the key to evaluating the success of your project.

Use cases provide excellent specifications for building benchmarks.

Specify the limitations of the application: well-defined boundaries on the application scope can provide big optimization opportunities.

Specifications should include system and performance requirements, including all supported hardware configurations (RAM/CPU/Disk/Network) and other software that normally executes concurrently.

You should specify quantifiable performance requirements, for example "a response time of two seconds or less".

Scalability is more dependent on good design decisions than optimal coding techniques.

Encapsulation leads to slowdowns from increased levels of indirection, but is essential in large, scalable, high-performance systems. For example, using a java.util.List object may be slower than using a raw array, but allows you to change very easily from ArrayList to LinkedList when that is faster.

Meeting or exceeding your performance requirements should be part of the shipping criteria for your product.

Once you've determined that a performance problem exists, you need to begin profiling. Profilers are most useful for identifying computational performance and RAM footprint issues.

Performance tuning is an iterative process. Data gathered during profiling needs to be fed back into the development process.

Micro-benchmarks (repeatable sections of code) can be useful but may not represent real-world behavior. Factors that can skew micro-benchmark performance include Java virtual machine warm-up time, and global code interactions.

Macro-benchmarks (repeatable test sequences from the user point of view) test your system as actual end users will see it.

Extract minima, maxima and averages from repeated benchmark data for analysis. Use these to compare progress of benchmarks during tuning. [I like to add the 90th-centile value too].

Profilers help you find bottlenecks in applications, and should show: the methods called most often; the methods using the largest percentage of time; the methods calling the most-used methods; and the methods allocating a lot of memory.

The Sun JVM comes with the hprof profiler.
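
[For example, CPU sampling with hprof can be started roughly as follows; MyApp is a placeholder and the option syntax varies slightly between JDK versions:]

    java -Xrunhprof:cpu=samples,depth=8,file=java.hprof.txt MyApp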

Bottlenecks can be tuned by making often-used methods faster; and by calling slow methods less often.

Backtrace methods to understand the context of the bottleneck. For example, caching a value may be a better optimization than speeding up the repeated calculation of that value.

Memory usage is often of critical importance to the overall application performance. Excessive memory allocation is often one of the first things that an experienced developer looks for when tuning a Java program.

Examine bottlenecks for memory allocation. For example you may be able to replace a repeated object allocation in a loop with a reusable object allocated once outside the loop.

Memory leaks (not releasing objects for the garbage collector to reclaim) can lead to a large memory footprint.

You identify memory leaks by: determining that there is a leak; then identifying the objects that are not being garbage collected; then tracing the references to those leaking objects to determine what is holding them in memory.

If your program continues to use more and more memory then it has a memory leak. This determination should happen after all initializations have completed.

Identify memory leak objects by marking/listing the objects in some known state, then cycling through other states and back to that known state and seeing which extra objects are now present.

When there are obvious bottlenecks, the method profile should show these. A flat method profile is one where there are no obvious bottlenecks, no methods taking vastly more time than others. In this case you should look at cumulative method profiles, which show the relative times taken by a method and all the methods it calls (the call tree). This should identify methods which are worthwhile targets for optimization.

To avoid loading unnecessary classes (e.g. when the JIT compiles methods which refer to unused classes), use Class.forName() instead of directly naming the class in source. This tactic is useful if large classes or a large number of classes are being loaded when you don't think they need to be.
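
[A sketch of the difference; com.example.BigHelper is a hypothetical class:]

    public class LazyLoad {
        static Object createHelper() throws Exception {
            // A direct reference, "new BigHelper()", can cause the class to be loaded
            // even when this method never runs. The reflective form defers loading
            // until the code actually executes.
            return Class.forName("com.example.BigHelper").newInstance();
        }
    }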

Combine listener functionality into one class to avoid an explosion of generated inner classes. This technique increases maintenance costs.
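
[A sketch of the combined-listener technique: one listener instance dispatches on the event source instead of one anonymous class per button:]

    import java.awt.event.ActionEvent;
    import java.awt.event.ActionListener;
    import javax.swing.JButton;

    public class ControlPanel implements ActionListener {
        private final JButton open = new JButton("Open");
        private final JButton save = new JButton("Save");

        public ControlPanel() {
            open.addActionListener(this);   // one listener object for all buttons
            save.addActionListener(this);
        }

        public void actionPerformed(ActionEvent e) {
            Object source = e.getSource();
            if (source == open) {
                // ... open action ...
            } else if (source == save) {
                // ... save action ...
            }
        }
    }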

Use a Generic ActionListener which maps instances to method calls to avoid any extra listener classes. This has the drawback of losing compile-time checks. java.lang.reflect.Proxy objects can be used to generalize this technique to multiple interfaces.

Run multiple applications in the same JVM. [Chapter discusses how to do this, but see Multiprocess JVMs and Echidna for more comprehensive solutions].

Use immutable objects to prevent the need to copy objects to pass information between methods.

Object pooling small objects is often counterproductive. The overhead of managing the object pool is often greater than the small object penalty. Pooling can also increase a program's memory footprint.

Pooling large objects (e.g. large bitmaps or arrays) or objects that work with native resources (e.g. Threads or Graphics) can be efficient.

Choosing the best algorithm or data structure for a particular task is one of the keys to writing high-performance software.

The optimal algorithm for a task is highly dependent on the data and data size.

Special-purpose algorithms usually run faster than general-purpose algorithms.

Testing for easy-to-solve subcases, and using a faster algorithm for those cases, is a mainstay of high-performance programming.

Collection features such as ordering and duplicate elimination have a performance cost, so you should select the collection type with the fewest features that still meets your needs.

Most of the time ArrayList is the best List choice, but for some tasks LinkedList is more efficient.

HashSet is much faster than TreeSet.

Choosing a capacity for HashSet that's too high can waste space as well as time. Set the initial capacity to about twice the size that you expect the Set to grow to.

The default hash load factor (.75) offers a good trade-off between time and space costs. Higher values decrease the space overhead, but increase the time it takes to look up an entry. (When the number of entries exceeds the product of the load factor and the current capacity, the capacity is doubled).
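
[A sketch combining the two sizing tips above; the expected size is an arbitrary example:]

    import java.util.HashSet;
    import java.util.Set;

    public class PresizedSet {
        public static Set newKeySet(int expectedSize) {
            // roughly twice the expected size, with the default 0.75 load factor,
            // so the set should not need to rehash as it fills
            return new HashSet(2 * expectedSize, 0.75f);
        }
    }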

Programs pay the costs associated with thread synchronization even when they're used in a single-threaded environment.

The Collections.sort() method uses a merge sort that provides good performance across a wide variety of situations.

When dealing with collections of primitives, the overhead of allocating a wrapper for each primitive and then extracting the primitive value from the wrapper each time it's used is quite high. In performance-critical situations, a better solution is to work with plain array structures when you're dealing with collections of primitive types.

Random number generation can take time. If possible you can pre-generate the random number sequence into an array, and use the elements when required.

Swing's model-view architecture is critical for building scalable programs.

When changing data stored in models, perform the operations in bulk whenever possible. E.g. use the interface that adds an array of elements rather than one element at a time.

Use custom models to handle large datasets. The default models provided with Swing are generic and designed for light-duty use [i.e. are slow].

Custom renderers can sometimes be used to improve performance. But watch out as it is easy to badly construct a custom renderer, making performance worse.

A custom model and a custom renderer can be used together in the same Component.

When initializing or totally replacing the contents of a model, consider constructing a new one instead of reusing the existing one, as this avoids posting notifications to any listeners. [Or reuse the object but deregister the listeners first].

Test response times against average current data/user volumes, then repeat the same test against four times the volume you expect in three years' time. This defines your long-term target: getting the response times for the latter test down to those of the former.

If response times increase too much when the database is heavily populated, this probably indicates missing or inappropriate indexes on the database.

If response times increase exponentially as the load increases, you need to improve scalability by optimizing the application or adding resources.

Use the SQL EXPLAIN clause or similar (e.g. "Explain select * from table where tablefield = somevalue") to ensure that the database is doing an indexed search rather than a linear search of large datasets.

Use a profiler to determine object usage, garbage collection behaviour and method bottlenecks in the application.

Minimize network calls, especially database calls: make one large database call rather than many small ones; make sure ejbStore isn't storing anything for read-only operations; use Details Objects to get entity bean state rather than making many trips for each aspect of state.

Use caching where possible.

Use session beans as a façade to your entity beans to encapsulate the workflow of one entire use case in one network call to one method on a session bean (and one transaction).

Use container-managed persistence when you can. An efficient container can avoid database writes when no state has changed, and reduce reads by retrieving records at the same time as find() is called.

Minimize database access in ejbStores. Use a "dirty" flag to avoid writing the bean unless it has been changed.

Always cache references obtained from lookups and find calls. Always define these references as instance variables and look them up in setEntityContext (or setSessionContext for session beans).

Always prepare your SQL statements.

Close all database access/update statements properly.
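
[A sketch illustrating both of the preceding two tips; the SQL, table and column names are invented:]

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class CustomerDao {
        public static String findName(Connection con, int id) throws SQLException {
            PreparedStatement ps =
                con.prepareStatement("SELECT name FROM customer WHERE id = ?");
            try {
                ps.setInt(1, id);                  // same compiled statement, new parameter
                ResultSet rs = ps.executeQuery();
                try {
                    return rs.next() ? rs.getString(1) : null;
                } finally {
                    rs.close();                    // always close result sets
                }
            } finally {
                ps.close();                        // always close statements
            }
        }
    }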

Avoid deadlocks. Note that the sequence of ejbStore calls is not defined, so the developer has no control over the access/locking sequence to database records.

HotSpot Client VM (JVM 1.3) is optimized for quick startup time and low-memory footprint. The server VM (HotSpot 1.0/2.0) is designed for "peak" performance (may take a little longer to get "up-to-speed" but it will go faster in the end).

Always use System.arraycopy to copy arrays.
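
[For example:]

    public class CopySketch {
        static int[] copyOf(int[] source) {
            int[] copy = new int[source.length];
            System.arraycopy(source, 0, copy, 0, source.length);   // instead of an element-by-element loop
            return copy;
        }
    }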

Sticky applets, available with the 1.3 plugin, speed startup by persistently caching classes on clients. Also put resources together into a jar file to reduce download requests.

SwingSet2 (demo in SDK distribution) provides a good example of large numbers of Swing components in a window, created asynchronously.

Don't use finalizers for anything that must be done in a timely manner.

Use primitives and transients to speed up serialization.

Use a concentrator object to limit the repaint events to once every 100 milliseconds in heavily loaded systems and in multi-threaded swing environments. There is some overhead for context switching (using invokeLater) into the AWT-event thread, which you want to minimize.

The key to high performance code is organization and process. Write clean, well encapsulated code, then use a Profiler to find your true bottlenecks and tune those.

Some application servers can automatically pass parameters by reference if the communicating EJBs are in the same JVM. To ensure that this does not break the application, write EJB methods so that they don't modify the parameters passed to them.

In most current JVMs (prior to 1.4) GC starts off by locking out all other threads in the JVM. GC is a stop-the-world, synchronous operation. Non-generational GC requires scanning the stacks of every thread and the entire Java heap.

Calling System.gc() explicitly is not good for performance, as it can be called when GC is not necessary, but will still result in a long pause of all JVM operations.

JViewport.BLIT_SCROLL_MODE is the default scrolling mode for JViewport in SDK 1.3 (available since 1.2.2). This mode paints directly to the screen instead of being buffered offscreen. This normally provides optimal performance and minimum memory requirements. However complex images may display some intermediate paint operations if the painting is not fast enough, giving jerky or flashing images. If this is unacceptable, try the alternate modes: setScrollMode(JViewport.BACKINGSTORE_SCROLL_MODE) (intermediate performance, higher memory requirements); or setScrollMode(JViewport.SIMPLE_SCROLL_MODE) (slowest).

If you use JNI Get* calls (for example, GetStringCritical), you must always use the corresponding Release* call (for example, ReleaseStringCritical) when you have finished with the data, even if the isCopy parameter indicates that no copy was taken.

Target performance for processors that you will run on when the project is deployed.

Implementing the ImageProducer interface and setting an image's pixels directly eliminates one or two steps compared with the MemoryImageSource approach, and seems to be about 10 percent to 20 percent faster on average.

Raw frame rate display, without taking account of the time taken to draw an image, runs from 2 frames per second (fps) to 400 fps, depending on processor and JVM. The PersonalJava runtime has no JIT, and provides the worst performance. With a JIT it might be usable.

[Article includes references to a number of hardware based Java implementations, i.e. Java enabled CPUs.]

Multi-threaded programs can allow multiple activities to continue without blocking the user.

Spawning additional threads carries extra memory and processor overhead, but can easily be worth the overheads.

Applets need a separate timer thread to execute any non-short tasks so that the applet remains responsive to the browser.

The volatile modifier requests that the Java VM always access the shared copy of the variable so that its most current value is always read. If two or more threads access a member variable, AND one or more threads might change that variable's value, AND ALL of the threads do not use synchronization (methods or blocks) to read and/or write the value, then that member variable must be declared volatile to ensure all threads see the changed value.
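
[A minimal example of the situation described: a flag written by one thread and read by another with no synchronization, so it must be declared volatile:]

    public class Worker implements Runnable {
        private volatile boolean stopRequested = false;

        public void requestStop() {        // called from another thread
            stopRequested = true;
        }

        public void run() {
            while (!stopRequested) {       // always sees the most recent value
                // ... do one unit of work ...
            }
        }
    }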

For data that changes infrequently (i.e. rarely enough that a user session will not need that data updating during the session lifetime), avoid transactional access by using a cached Data Access Object rather than the transactional EJB (this is called the Fast Lane Reader pattern).

Don't transfer long lists of data to the user, transfer a page at a time (this is called the Page-by-Page Iterator pattern).

Instead of making lots of remote requests for data attributes of an object, combine the attributes into another object and send the object to the client. Then the attributes can be queried efficiently locally (this is called the Value Object pattern). Consider caching the value objects where appropriate.

The user interface must always be responsive to the user's interaction.

The application should respond to input no later than a tenth of a second after it occurs: longer delays are noticed by the user, and make the user interface seem unresponsive. So don't do more than about a tenth of a second's worth of work in the user-service thread in response to any user interface event.

Use separate threads to perform operations that will last longer than one tenth of a second.
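
[A sketch of handing long work to a separate thread and updating the UI afterwards; doSlowWork() is a placeholder for the real operation:]

    import javax.swing.JLabel;
    import javax.swing.SwingUtilities;

    public class SlowTaskLauncher {
        public static void launch(final JLabel status) {
            new Thread(new Runnable() {
                public void run() {
                    final String result = doSlowWork();       // runs off the event thread
                    SwingUtilities.invokeLater(new Runnable() {
                        public void run() {
                            status.setText(result);           // UI update back on the event thread
                        }
                    });
                }
            }).start();
        }

        private static String doSlowWork() {
            return "done";                                    // placeholder for the long operation
        }
    }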

Provide the user with the option to cancel the operation at any time.

[Article provides an example of making an HTTP connection following these suggestions].

Use bigger, better, faster hardware, but there is a limit to the scalability of a single server: most application performance does not scale linearly with increases in the hardware power.

Use more than one server in a cluster that services requests as if it were a single server, using: OS-level clustering (OS-level built-in failover mechanisms); software load balancing (using a load-balancing front-end dispatcher); hardware load balancing (e.g. DNS round-robin to different servers).

A basic load-balancing scheme is achievable by sending documents with different binding addresses (different URL hosts).

You need a scheduling mechanism to perform animation, scrolling, updating the display, etc.

The paint() method on the Canvas is called by the system only if it thinks that it needs to repaint it. So we need another timer to repaint the screen on a regular basis. Use a timer to periodically call repaint().
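
[One way to do this with a Swing timer; the 50 millisecond period is just an example:]

    import java.awt.Canvas;
    import java.awt.event.ActionEvent;
    import java.awt.event.ActionListener;
    import javax.swing.Timer;

    public class RepaintDriver {
        public static Timer start(final Canvas canvas) {
            Timer timer = new Timer(50, new ActionListener() {   // roughly 20 repaints per second
                public void actionPerformed(ActionEvent e) {
                    canvas.repaint();
                }
            });
            timer.start();
            return timer;
        }
    }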

Separate the UI controller logic from the servlet business logic, and let the controllers be mobile so they can execute on the client if possible.

Validate data as close to the data entry point as possible, preferably on the client. This reduces the network and server load. Business workflow rules should be on the server (or further back than the front-end).

You can use invisible applets in a browser to validate data on the client.

Prepared SQL statements get compiled in the database only once, future invocations do not recompile them. The result of this is a decrease in the database load, and an increase in performance of up to 5x.

If an applet's parameters [tags in the web page] are too long, the web page's responsiveness begins to bog down. Move all but the essential parameters from the APPLET tag to a dedicated HTTP link between the applet and the servlet. This allows page loading and applet initialization to occur at the same time over separate connections.

Generating XML produces a large amount of data during communications, but this does not mean that the communication will be the bottleneck.

Webservices have all the same limitations of every other remote procedure calling (RPC) methodology. Requiring synchronous communications across a WAN is a heavy overhead regardless of the protocol.

If "Web services" tend to be chatty, with lots of little round trips and a subtle statefulness between individual communications, they will be slow. That's a function of failing to realize that the API call model isn't well-suited to building communicating applications where caller and callee are separated by a medium (networks!) with variable and unconstrained performance characteristics/latency.

Establishing an initial connection is one of the most expensive database operations. Use a pool of connections that are ready and waiting for use to minimize the connection overhead.

Connection pooling is one of the largest performance improvements available for applications which are database intensive.

Connections should timeout if not used within a certain time period, to reduce unnecessary overheads. Initial and maximum pool sizes provide further mechanisms for fine-tuning the pool.

JDBC 2.0 supports connection pooling, though a particular driver may or may not use the support. If pooling is supported by the driver, it is probably more efficient than a proprietary pooling mechanism since it can leverage database specific features.

[Statistics useful for comparison if you are building a business enterprise site: The architecture can handle 8,000 concurrent user sessions; 85 dynamic page views a second; 250,000 unique daily visitors; 8 million hits a day; 1 to 2 second average response time].

Asynchronous messaging is a proven communication model for developing large-scale, distributed enterprise integration solutions. Messaging provides more scalability because senders and receivers of messages are decoupled and are no longer required to execute in lockstep.

With Statement, the same SQL statement with different parameters must be recompiled by the database each time. But PreparedStatements can be parametrized, and these do not need to be recompiled by the database for use with different parameters.

You might see a performance increase by using multiple connections to your mail server. You would need to get multiple Transport objects and call connect and sendMessage on each of them, using multiple threads (one per connection) in your application.

JavaMail 1.2 includes the ability to set timeouts for the initial connection attempt to the server.

JavaMail tries to allow you to make good and efficient use of the IMAP protocol. Fetch profiles are one technique to allow you to get batches of information from the server all at once, instead of single pieces on demand. Used properly, this can make quite a difference in your performance.

You can test whether a particular JIT is able to convert tail-recursive methods into loops with a dummy tail-recursive method which never terminates. If the JVM crashes with a stack overflow, no conversion is done (if the conversion is done, the JVM loops and never terminates).
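
[A dummy test along those lines:]

    public class TailCallTest {
        private static void recurse() {
            recurse();                     // tail-recursive call that never returns
        }

        public static void main(String[] args) {
            try {
                recurse();                 // loops forever if tail calls are converted to loops
            } catch (StackOverflowError e) {
                System.out.println("no tail-call conversion in this JVM");
            }
        }
    }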

The HotSpot JVM with the 1.3 release does not convert tail-recursive methods into loops. The IBM JVM with the 1.3 release does.

ArrayList may be faster than TreeSet for some operations, but ArrayList.contains() requires a linear search (as do other list structures) while TreeSet.contains() is a tree lookup taking time logarithmic in the collection size, so the latter is much faster for large collections.

Both auto mode (Session.AUTO_ACKNOWLEDGE) and duplicate delivery mode (Session.DUPS_OK_ACKNOWLEDGE) guarantee delivery of messages, but duplicate okay mode can have a higher throughput, at the cost of the occasionally duplicated message.

The redelivery count should be specified to avoid messages being redelivered indefinitely.

Sometimes output streams are buffered by the operating system for performance. The flush() method forces the data to be written whether or not the buffer is full. This is not the same as the buffering performed by a BufferedOutputStream: that buffering is handled by the Java runtime, whereas the other buffering is at the native OS level. However, a call to flush() should empty both buffers.

It's more efficient to read multiple bytes at a time, i.e. use read(byte[]) rather than read().

The best size for the buffer is highly platform dependent and generally related to the block size of the disk, at least for file streams. Less than 512 bytes is probably too little and more than 4096 bytes is probably too much. Ideally you want an integral multiple of the block size of the disk. However, you should use smaller buffer sizes for unreliable network connections.
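
[A sketch of chunked reading; the 4096-byte buffer follows the sizing advice above and the file name is an example:]

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class ChunkedReader {
        public static void main(String[] args) throws Exception {
            InputStream in = new FileInputStream("data.bin");
            byte[] buffer = new byte[4096];
            long total = 0;
            int count;
            while ((count = in.read(buffer)) != -1) {   // one call reads many bytes
                total += count;
            }
            in.close();
            System.out.println("read " + total + " bytes");
        }
    }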