Java is a modern, object-oriented, strongly typed programming language with an excellent class library, and it can be used to solve a wide variety of software problems. It runs on a virtual machine, allowing for cross-platform binary compatibility. People often cite this as an obstacle to obtaining good performance from a Java application. This blog post seeks to aid developers by demonstrating some basic approaches to tuning, and to counter negative performance perceptions about Java by demonstrating that it is easily possible to obtain good performance.

A recent paper titled ‘Loop Recognition in C++/Java/Scala/Go’ by Robert Hundt of Google looked at a benchmark that implemented a loop recognition algorithm originally described in a paper by Havlak [research.google.com/pubs/archive/37122.pdf]. The benchmark is available to the public at http://code.google.com/p in the project ‘multi-language-bench’, and therefore makes a suitable candidate for our comparison.

The paper by Hundt presents two versions of the benchmark for each language: the original benchmark, and a subsequent highly tuned version of the source code. For several reasons, the tuned version of the benchmark is unsuitable for this comparison. Firstly, the highly tuned version of the C++ benchmark is not available to the public. Secondly, some of the tunings involve algorithmic or data structure changes which, in general, can be made regardless of the language, and the focus of this blog post is not to compare the performance of specific algorithms or data structures. Instead, this blog post will make some suggestions on how to tune the JVM, along with some tips on writing Java code for performance.

There are many inherent differences between C++ and Java which make comparing performance between the languages difficult. C++ is statically compiled, whereas Java is dynamically compiled with a managed runtime consisting of an interpreter, garbage collector, class libraries and a Just-In-Time (JIT) compiler. The JIT compiler is invoked at run time to optimize and compile parts of a user’s Java application into native machine instructions for performance. Generally, this cost is paid during application startup; as such, many Java applications have a ‘ramp up’ period where performance is less than optimal. The benefit of this approach, however, is that the compiler can gather run-time information and perform optimizations specific to the run-time characteristics of the application.

Benchmarking methodology

Very often, the important measurement is the steady-state performance: the performance of the application once the warm-up period is complete. For others, the important metric is ‘how long does it take for my application to complete’, also known as ‘wall-clock time’. This can be a tricky measurement to perform properly, since wall-clock time can be affected by everything the computer is doing during the performance run. Unrelated processes, for example, can evict cache lines and TLB (Translation Lookaside Buffer) entries, making an application consume more clock cycles to perform the same work. If done naively, wall-clock time can even include time spent running antivirus software or rendering GUIs for unrelated applications. This blog post will use the latter, wall-clock time, in order to be consistent with the original benchmarking paper by Hundt.
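As a rough sketch of how wall-clock time can be captured around the workload itself (runBenchmark() is a hypothetical entry point standing in for the benchmark's main loop):

long start = System.nanoTime();
runBenchmark();                                          // the work being measured
long elapsedMs = (System.nanoTime() - start) / 1000000;  // convert nanoseconds to milliseconds
System.out.println("wall-clock time: " + elapsedMs + " ms");

Note that this only captures the in-process view; anything else the machine does during the run still affects the result.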

Machine Selection

While the original paper by Hundt used a Pentium IV machine for benchmarking purposes, this blog post used a 4-way Intel Nehalem, a fairly typical configuration for machines purchased at present. The effect of using a single core on Java performance can be seen by pinning the process with the ‘taskset’ command on GNU/Linux, as shown below. Hardware technology continues to advance, and one advantage of Java over C++ is that Java is quite successful in adapting itself to the features of each machine it runs on, particularly multiple cores (as will be explained below). This advantage vanishes if the benchmarks are run on older, simpler hardware.
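For example, a run pinned to a single CPU might be launched as follows (an illustrative command; LoopTesterApp stands in for the benchmark's actual main class):

taskset -c 0 java -Xss4096K LoopTesterApp

Here, ‘-c 0’ restricts the JVM, and all of its internal threads, to CPU 0.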

Finally, there are various JVMs available presently. This blog post uses IBM’s latest JVM implementation in various forms, with a focus on the 32-bit version. Since the original paper also included a 64-bit JVM using -XX:+UseCompressedOops, this blog post includes some scores using IBM’s equivalent option, -Xcompressedrefs, to demonstrate its performance. Since (as will be shown) there is little difference between 64-bit scores using compressed references and 32-bit scores, the 32-bit JVM will be used for the remainder of the blog post.

Initial scores

Benchmark                          Relative Score (lower is better)
C++ *                              1.0
C++ pinned to 1 CPU                1.0
Java 32-bit**                      3.0
Java 32-bit pinned to 1 CPU        8.7
Java 64-bit CR                     3.1

*C++ was compiled with gcc using the -O3 option.

**Java required a larger stack size than the default due to the recursive nature of the algorithm; -Xss4096K was used to increase the stack size to 4096K. (gcc compiles with a larger default stack size and hence did not experience this issue. However, it will be shown later that other C++ compilers also need their default stack size allocations increased.)

As seen from the table, C++ initially outperforms the Java implementation by a factor of 3: indeed not a good score for Java. One can also see that pinning the C++ program to a single CPU has no impact on its performance, but doing so significantly degrades Java performance. The program as provided is single-threaded, and as evidenced by the C++ runs, pinning it to a single CPU should make little difference. The explanation is that the JVM itself is multi-threaded: there are compilation threads which compile methods, GC threads that do garbage collection, and so on. By forcing all of these onto a single CPU, a user inhibits the progress of their program, because the single CPU must now perform all of these tasks.

The conclusion to be drawn here is that the JVM (and as a result, even a single-threaded Java program) is inherently multi-threaded, and a multi-core CPU provides performance benefits. Since modern machines are multi-core, it makes sense to exploit these cores.

The goal of this blog post is to obtain C++-like performance, not ‘factor of 3’ performance. To determine an area of focus, insight into what the JVM is doing during a run must be obtained. There are many tools available for helping Java developers analyze their applications, including IBM Health Center and various other tools that take advantage of the JVMTI interface. IBM Health Center is available at http://www.ibm.com/developerworks/java/jdk/tools/healthcenter/

Among various other data points, Health Center records and monitors garbage collection events, and provides the information in an easily consumable format. The initial heap size and maximum heap size have defaults that depend on the available memory in the system; to view the defaults on your system, issue the ‘java -verbose:sizes’ command. Often, one way to reduce time spent doing garbage collection is to reduce the frequency of garbage collection by increasing the size of the Java heap. Health Center can help in determining whether this is a good course of action.

Initial Look at Health Center – GC

Allocation failure count                                 287
Concurrent collection count                              57
GC mode                                                  default (gencon)
Minor collections - mean garbage collection pause        129 ms
Minor collections - mean interval between collections    256 ms
Minor collections - number of collections                288
Proportion of time spent unpaused                        66.60%
Time spent in garbage collection pauses                  33.40%

The first important piece of information is that time spent in garbage collection pauses was 33% of the application run time! Health Center calculates this number by taking the total time spent in GC pauses and dividing it by the total application run time. In any Java program, spending this much time doing garbage collection suggests there could be opportunities to tune, both at the JVM level and at the Java source code level. Another run was done with a heap size of 1024M (by setting -Xmx and -Xms) for comparison:

Allocation failure count                                 120
Concurrent collection count                              6
GC mode                                                  default (gencon)
Minor collections - mean garbage collection pause        273 ms
Minor collections - mean interval between collections    361 ms
Minor collections - number of collections                119
Proportion of time spent unpaused                        56.80%
Time spent in garbage collection pauses                  43.20%

For the run with a 1024 megabyte heap, the proportion of time in GC pauses increased. However, one must be careful before concluding that increasing the heap size was the wrong thing to do. In this case, the overall application time decreased by around 30%, a larger drop than the decrease in GC pause time, which is what produced the higher proportion of time spent in GC. The trends to focus on are the allocation failure count and the number of collections, both of which decreased by more than half. And although the mean garbage collection pause time increased, pauses occurred less frequently.

Following this trend, a heap size of 3072 megabytes was chosen as both the initial size and the maximum size (this is done via the -Xmx and -Xms options).
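For example, the runs below were launched along these lines (an illustrative command; LoopTesterApp again stands in for the benchmark's main class):

java -Xss4096K -Xms3072M -Xmx3072M LoopTesterApp

Setting -Xms equal to -Xmx fixes the heap at its maximum size up front, avoiding the cost of growing the heap during the run.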

Some people may note at this point that doing this greatly increases our memory footprint. This may indeed not be possible for all applications, but since the objective here is solely to maximize throughput performance, it is an appropriate thing to do. Reducing the footprint will be looked at later.

This single option radically improved our performance (by over a factor of 2), closing the gap to within 30% of C++ performance! For some applications, this alone may be all that is required to meet the desired performance characteristics. For those interested in maximizing throughput performance, there are several other options to consider, which are examined next.

Further GC Tuning

IBM’s latest Java release uses a generational garbage collector, which splits the Java heap into regions, each consisting of objects of similar age. Since it has been empirically observed that in many programs the youngest objects become unreachable first, it makes sense to split objects into regions this way. Once a region is full, the remaining live objects are promoted to an older region, and an incremental GC can then be done on the newer regions. The nursery is the region holding the newest objects, and its size can also be tuned (via the -Xmn option). Experimentation showed that setting it to 2048M resulted in a slight (2-5%) improvement in score.

Compaction is the process of moving live objects closer together in an effort to reduce heap fragmentation. For short-running applications, heap fragmentation may not be an issue, and turning off compaction with -Xnocompactgc saves time as well (2-5%). Note that this is not recommended for normal use.

Finally, another option to consider is using large pages. Large pages are an OS feature whereby memory is allocated in larger increments (2048 KB on GNU/Linux) than the default. The JVM supports this capability, which can be toggled via the -Xlp option. Large pages reduce TLB misses because more memory is tracked per TLB entry. (Note that a C++ program that was not written with large pages in mind would require a rewrite to take advantage of this OS feature; avoiding that is one of the advantages of having a managed runtime.)

With these options in effect, a new
best score was obtained:

Benchmark                       Relative Score (lower is better)
C++                             1.0
Java 32-bit - 3072M heap        1.13

Options: -Xss4096K -Xmx3072M -Xms3072M -Xmn2048M -Xlp -Xnocompactgc

With these selected options, Java’s performance gap dropped to within 15%. Yet a timer-based profile using the JVMTI interface showed that the compiler had not finished warming up: it was still compiling. What would happen if the benchmark were run for a longer period of time?

Increased benchmark length

Increasing the benchmark length amortizes the higher initial cost Java programs incur for dynamic compilation over a longer period of time. This cost is often offset by the benefit of being able to use run-time profiling information and speculative optimizations, but these benefits may only become visible if the program runs for a longer duration.

In both cases, the benchmark was modified by increasing the main work loop from a count of 50 to 150. For Java, the same options as the shorter-running version were used, except that heap compaction was re-enabled, since a longer-running program will benefit from reduced heap fragmentation.

Benchmark                       Relative Score (lower is better)
C++ long                        1.0
Java 32-bit* long               0.99

*options: -Xss4096K -Xmx3072M -Xms3072M -Xmn2048M -Xlp

In the longer-running version of the benchmark, Java was able to match C++ performance. This is because Java had not reached steady state in the shorter benchmark before it terminated. At steady state, the previously discussed advantages of a managed runtime begin to show.

Cross Platform Performance

Up to this point, the performance comparison has been restricted to the GNU/Linux x86 platform. However, the Windows x86 platform is also commonly used, and is therefore important to consider in the evaluation. How do the benchmark scores compare on this platform?

In this exercise, the original C++ code was recompiled using Visual Studio 2008, with a good selection of performance flags. Java was run with options similar to the GNU/Linux build, with the exception of heap size, as Windows is not able to allocate as large an amount of contiguous memory as GNU/Linux. An identically spec’d machine running Windows 2008 was used to obtain the performance numbers.

In this configuration, Java outperforms C++ by 16% with default options and by 35% with the tuned options. There may well be further options that would improve the C++ score on Windows, yet it is still safe to say that for this application, Java performs on the same scale as C++ in the Windows configuration.

Benchmark tuning

So far, all the Java tuning has been black-box. Looking at some benchmark internals may yield further ways to improve throughput performance. Since GC time was noted to be a big problem in this benchmark, focusing on minimizing allocations should yield some benefits.

Using an object profiler through the JVMTI interface, it was noted that many allocations were of iterator objects, for example java/util/ArrayList$Itr. This object is created when an ArrayList iterator is used. An example from the source might be:

for (UnionFindNode iter : nodeList)

While it is prudent to take advantage of Java iterators, as they simplify code and maintenance, one must be aware that doing so may result in a ‘hidden’ allocation of the iterator object. Generally speaking, the JIT compiler is very good at identifying these objects and stack-allocating them, essentially eliminating the allocation. In the rare cases where the compiler is not able to do so and the allocations have a significant negative performance impact, it may be better for throughput performance to use an old-fashioned indexed for loop to iterate:

for (int i = 0 ; i <
nodeList.size(); i++)

Using this technique on the Java source
code resulted in a new best score on GNU/Linux:

Benchmark                       Relative Score (lower is better)
C++ GNU/Linux                   1.0
Java 32-bit GNU/Linux*          0.99

*options: -Xss4096K -Xmx3072M -Xms3072M -Xmn2048M -Xlp -Xnocompactgc

Now with fewer allocations, the Java
heap size can be decreased with a minimal impact on performance:

Benchmark                       Relative Score (lower is better)
C++ GNU/Linux                   1.0
Java 32-bit GNU/Linux*          1.02

*options: -Xss4096K -Xmx1536M -Xms1536M -Xmn1024M -Xlp -Xnocompactgc

Note that this was only a single instance of reducing object allocations in the source file. There are many other places that could benefit from the same treatment to further reduce heap usage.

By being mindful of object allocations
and hidden allocations, the performance gap between Java and C++ has
been eliminated.

Conclusions

In a program that implements an
algorithm which can be used to fairly compare different languages,
Java can obtain C++ like performance and even outperform. It would
be unwise to over-generalize and conclude that there is an order of
magnitude difference between Java and C++.

The amount of Java tuning required to
dramatically decrease the performance gap is minimal. For this
benchmark, a single JVM option was all that was necessary. For those
who demand peak throughput performance, further tuning may be
necessary. However, some basic tuning is something all Java software
developers should consider as part of their development process.

Tuning tips:

1) JVM Tuning

Take advantage of modern multi-core hardware, as JVMs are inherently multi-threaded.

Look at the other GC options available to fine-tune the garbage collector.

2) Source Code Optimization

Be aware of hidden object allocations.

Take advantage of the many tools for analyzing Java applications, such as IBM Health Center, to assist in finding program hotspots, heavy object allocators, and other performance characteristics of the program.

The amount of memory available to the Java heap and native heap of a Java process is limited by the operating system and hardware. A 32-bit Java process has a 4 GB address space, shared by the Java heap, the native heap and the operating system. 32-bit Java processes therefore have a maximum possible heap size, which varies according to the operating system and platform used.

64-bit processes do not have this limit; their addressability is measured in terabytes. It is common for enterprise applications to have large Java heaps (we have seen applications with Java heap requirements of over 100 GB), and 64-bit Java allows massive heaps (benchmarks have been released with heaps of up to 200 GB).

However, the ability to use more memory is not “free”. 64-bit applications require more memory, since Java object references and internal pointers are larger; the same Java application running on a 64-bit Java runtime may have up to 70% more footprint than on a 32-bit runtime. 64-bit applications also run more slowly, as more data is manipulated and the processor cache becomes less effective (because the data is larger). 64-bit applications can be up to 20% slower.

Compressed references allow a 64-bit JVM to use pointers narrower than 64 bits to reference memory. The JVM applies pointer compression to reduce memory references to just 32 bits wide (i.e., it compresses all memory references from 64-bit to 32-bit). Using compressed references, WAS instances can allocate heap sizes of up to 25 GB without incurring any significant performance cost.

Compressed references can be enabled by using the -Xcompressedrefs option.

In WebSphere V7 and above, for 64-bit deployments, if WAS detects that the requested -Xmx (maximum Java heap) is less than 25 GB, it enables pointer compression by default (this can be disabled with the -Xnocompressedrefs option on the Java command line).

Java Week is a very informative event held in this community. Java Week 2 was held from 19th to 23rd September 2011. Every day, technical sessions were held on very interesting Java topics, giving everyone a chance to learn about areas like JEE 6, parallel programming, performance, Java in the Cloud, etc.

It was nice to be part of this event as a speaker in the unconference. I feel privileged to have been a speaker in the IBM Java Technology Community. It was nice to deliver the talk on Java with Groovy to an interested set of people, and as a coordinator of the J2EE community at MindTree, it was a great chance to go further in our collaboration with the IBM Java Technology Community.

All the sessions in Java Week 2 were full of information. The session on Java 7 was delivered very engagingly, and I came to know about a lot of Java 7 features which can actually be very useful. Garbage collection is one of the hottest and most interesting topics in the Java tech space; that session covered the different garbage collection policies provided by the IBM JDK, which contains some smart GC policies and hence delivers better performance.

The unconference brought another set of interesting discussions on topics like functional programming, XML schema design patterns, etc. It was nice to hear from different experts from the enterprise development space. The unconference was a brilliant idea introduced in Java Week 2: each was a small discussion on a technical topic over 10-15 minutes, a quick and heavy dose of information. I really liked it.

I expect Java Week 3 to come soon with more exciting topics and events. I am glad to have been part of Java Week 2 and will be waiting for the next one.

The general guidance is that an HTTP session should only be used to store the data necessary to maintain state between browser invocations, and that the amount of data stored should be as small as possible. However, we often see that over several iterations of additional development work, as new features are added to a web application, the session sizes grow as more and more data is stored. This means the corresponding session cache also grows over time, in some cases to the point that it is one of the major contributors to Java heap memory usage.

So, are there any ways to find out how big the sessions are?

Finding the session sizes using the Runtime Performance Advisor or the Tivoli Performance Viewer (TPV)

You can use either the Runtime Performance Advisor in the administrative console or the Tivoli Performance Viewer (TPV) to look at the session cache size and the average session size, which will give you an idea of whether you have a problem with the size of HTTP sessions and the corresponding size of the session cache. However, neither the Performance Advisor nor TPV gives a good idea of which application or which specific user sessions are large, or of what specific data is being held in the HTTP sessions. That information is useful if you want to understand whether particular applications or actions are causing large amounts of data to be stored, and what that data actually is.

Having information on the sizes of individual sessions, the user session IDs, and which application each session belongs to will allow you to understand whether the presence of large HTTP sessions is related to a specific application or a specific set of user actions. One way of getting this information is using Memory Analyzer.

Memory Analyzer can run using either a PHD-format heapdump, or a full operating system dump (e.g. a core file) that has been post-processed using the "jextract" utility present in the jre/bin directory of the IBM Java SDKs. In order to have the information relating to application name and session ID for the HTTP sessions, the jextracted system dump is required. You can generate a system dump from a running WebSphere instance by adding the following command-line option to the JVM runtime:

-Xdump:system:events=user

and then sending a "kill -3" to the process to cause the dump to be written (you can do this in Windows using a utility such as sendsignal.exe). Alternatively you can generate the system dump on an OutOfMemoryError using the following:

-Xdump:system:events=systhrow,filter=java/lang/OutOfMemoryError

Once you have the system dump, have processed it using jextract, and have loaded it into Memory Analyzer, you can begin profiling the HTTP sessions.

Profiling the HTTP session sizes using OQL in Memory Analyzer

You can use the Object Query Language (OQL) in Memory Analyzer to quickly produce a table of all of the sessions that were in the WebSphere instance when the system dump was generated, along with information on session size, application name, and session ID.

To do this you need to:

Select the OQL tab from the analysis panel: this opens the OQL dialog box, where you can enter and run a query over the session objects (see the sketch after these steps).

Sort by Retained Heap Size by clicking on the column header: this will show you the largest session data objects, along with the application name and session ID. The session ID for the user will either be available from the URL or in a cookie.
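As a hedged illustration, a query along the following lines lists each session object together with its retained size (MemorySessionData is the session class referenced in the drill-down below; the exact package prefix is an assumption and may differ between WebSphere versions):

SELECT s, s.@retainedHeapSize FROM com.ibm.ws.session.store.memory.MemorySessionData s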

This means that you now have a sortable table of sessions and how they relate to specific applications and session IDs.

Finding out what values are stored in specific HTTP sessions

If you identify specific sessions that are larger than expected, and you want to understand what data is being stored in that session, you have the option of "drilling down" into the session and looking at the key/value pairs for the data in the sessions.

To do this you need to:

Right-click on a row of interest and select "List objects -> with outgoing references": this will bring up the individual MemorySessionData object.

Expand the object down the reference path "mSwappableData -> table": this gives you the Hashtable of data associated with the session.

You can now browse through the Hashtable, looking at the keys and values to see what data is being stored inside the session, and what is causing the session to be so large!

Method trace is a powerful debug option for debugging Java applications. It provides comprehensive information on code flow in the application, and it can be enabled for methods in application code or in system classes.

Method trace records method entry and exit.

Method trace comes at a cost to application throughput, so we have to use it wisely, restricting it to the classes and methods of interest.

It is controlled using the command-line option -Xtrace:&lt;option&gt;.

To produce a method trace, you need to select the Java classes and methods you want to trace, along with the destination for the trace output. You must set the following two options:

1. Use -Xtrace:methods to select which Java classes and methods you want to trace.

2. Use either:

-Xtrace:print to route the trace to stderr, or

-Xtrace:maximal and -Xtrace:output to route the trace to a binary compressed file using memory buffers.

The methods parameter is formally defined as -Xtrace:methods=[[!]&lt;method_spec&gt;[,...]]

where &lt;method_spec&gt; is formally defined as {*|[*]&lt;classname&gt;[*]}.{*|[*]&lt;methodname&gt;[*]}[()]

For example, to trace all methods on the String class, set -Xtrace:methods=java/lang/String.*,print=mt

Let’s enable method trace for the equals method of the String class, routed to stderr, with a HelloWorld program.
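A minimal sketch (the HelloWorld class and its strings are illustrative, not from the original post):

public class HelloWorld {
    public static void main(String[] args) {
        String greeting = "Hello, World!";
        System.out.println(greeting.equals("Hello, World!"));  // triggers String.equals
    }
}

Running it with method trace enabled for String.equals only, printed to stderr:

java -Xtrace:methods=java/lang/String.equals,print=mt HelloWorld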

In this blog post, I am talking about deadlocks (while acquiring application resources) in Java code that result in application hangs, and the common detection approach using IBM diagnostics. Javacores are quite helpful in understanding the scenario that leads to a deadlock, and they can be triggered with a user signal.

Deadlocks can happen in one of the following scenarios:

1. Two threads are each waiting to acquire a resource which is held by the other, causing both threads to block.

When method foo is invoked by thread thr1, thr1 acquires a lock on obj1 (as foo is synchronized), and then tries to enter obj2 through a synchronized block. Similarly, thr2 acquires the obj2 lock before waiting to enter obj1. This is a classic deadlock, where each thread is trying to enter a lock held by the other. If a javacore is dumped, you see a section in it indicating the scenario.
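The original post included the source; the following is a minimal reconstruction consistent with the description (the second method, bar, is an assumed counterpart to foo):

public class DeadlockDemo {
    static final DeadlockDemo obj1 = new DeadlockDemo();
    static final DeadlockDemo obj2 = new DeadlockDemo();

    synchronized void foo() {        // thr1 locks obj1 on entry
        pause();                     // give thr2 time to lock obj2
        synchronized (obj2) {        // blocks: obj2 is held by thr2
            System.out.println("foo entered obj2");
        }
    }

    synchronized void bar() {        // thr2 locks obj2 on entry
        pause();
        synchronized (obj1) {        // blocks: obj1 is held by thr1
            System.out.println("bar entered obj1");
        }
    }

    static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException e) { }
    }

    public static void main(String[] args) {
        Thread thr1 = new Thread(new Runnable() { public void run() { obj1.foo(); } }, "thr1");
        Thread thr2 = new Thread(new Runnable() { public void run() { obj2.bar(); } }, "thr2");
        thr1.start();
        thr2.start();
    }
}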

2. Each of two classes (resources) calls a static method of the other class in its own static initializer. So when thread t1 initializes class first, class second is required to be initialized as well, and vice versa for thread t2.
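A minimal reconstruction of this scenario (class and thread roles follow the description; the touch() methods are assumed stand-ins for the static calls):

class first {
    static { second.touch(); }   // initializing first forces initialization of second
    static void touch() { }
}

class second {
    static { first.touch(); }    // initializing second forces initialization of first
    static void touch() { }
}

public class InitDeadlock {
    public static void main(String[] args) {
        Thread t1 = new Thread(new Runnable() { public void run() { first.touch(); } });
        Thread t2 = new Thread(new Runnable() { public void run() { second.touch(); } });
        t1.start();  // t1 triggers initialization of first
        t2.start();  // t2 triggers initialization of second
    }
}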

Java cores do not have any particular section describing this kind of deadlock, as the notifying thread need not necessarily be the thread which is initializing the resource. However, the Java stacks give adequate information to point to this condition.

For example, "Thread-7" is waiting on class second object while also initializing class first. At the same time, you see in the Thread-8 stack that class second is not yet initialized so far and initialization is under way.

Performance is a key factor in any application development. Especially while building distributed enterprise applications, performance is a key measure which we need to take into account. The overall performance of a web service application depends on how fast the service can respond to the requestor; the whole cycle includes the time required for processing, binding and responding to the request.

Processing involves XML binding and parsing, which play a significant role in the overall performance of web services. Performance can be analyzed from the perspectives of the service consumer, the service producer and the service process.

The performance of web services can be measured based on three factors:

1) How fast was the execution?

2) How much time did it take to complete the requested task?

3) How long does processing take under increased traffic?

These can be measured via Response Time, Throughput and Transaction Time.

i) Response Time: the time taken from the initiation of the request by the client, through processing (which involves SOAP message marshalling and un-marshalling), to performing the business task at the endpoint. To measure response time, we need to measure both endpoint performance and end-to-end performance.

Endpoint performance: the server-side performance alone.

End-to-end performance: includes both the server and client performance.

From the above, it is clear that the performance of web services depends on the network, the underlying hardware, binding performance and processing performance.

Processing involves marshalling and un-marshalling, that is, serialization and de-serialization, which can be improved while developing applications.

Here are some points on how we can improve binding performance and processing performance from the application point of view.

1) Use primitives appropriately:

Using primitive types where possible improves the performance of web services, because decimal values are more expensive to serialize and de-serialize. However, it is best to use the xsd:decimal/BigDecimal type for all monetary calculations, to avoid rounding errors on decimal values.
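As a small illustration of the mapping (a hedged sketch; the Price class is hypothetical), JAXB maps Java primitives to simple XSD types and BigDecimal to xsd:decimal by default:

import java.math.BigDecimal;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Price {
    public int quantity;       // primitive: maps to xsd:int and serializes cheaply
    public BigDecimal amount;  // maps to xsd:decimal: exact values for monetary calculations
}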

2) Binding performance:

Compared to DOM, JAXB has improved binding performance. JAXB provides mappings for Java types to and from XML definitions, and it offers better binding performance than DOM or any other non-schema binding model. As a result, at larger payloads the performance of JAX-WS is twice that of JAX-RPC.

3) Payload sent:

The more payload you send, the more processing is required for serialization and de-serialization, as well as for binding and parsing. This has a considerable effect on performance.

4) Use MTOM for transmitting binary data:

If the SOAP message body includes large binary objects (more than 1 MB), parsing takes more time and consumes more CPU. MTOM transmits binary data as attachments outside the message body, avoiding this parsing cost.

Loading dumps from a Java process with a large heap takes a long time when displaying the results in GUI mode. To ease the analysis of larger dumps, Memory Analyzer has the option of running in "batch" mode, which allows us to start processing dumps in non-GUI mode on higher-end boxes. It produces standard reports (Leak Suspects, System Overview and Top Components), and MA generates the "index" files automatically when it parses the dump. We can copy these reports and index files, along with the dump, to a local machine and then perform interactive analysis in GUI mode. This saves time on loading and processing larger dumps.

Below is a sample command with which we can invoke MAT in batch mode.
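A representative invocation (the script name and report IDs below are from the open source MAT distribution; paths and dump names are illustrative):

ParseHeapDump.sh heapdump.phd org.eclipse.mat.api:suspects org.eclipse.mat.api:overview org.eclipse.mat.api:top_components

This parses the dump, writes the index files next to it, and generates the three standard reports without opening the GUI.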

One of the key features of Memory Analyzer is the leak suspects view. Memory Analyzer has capabilities to find leak suspects: large trees or deeply nested trees that contribute heavily to Java heap usage.

How MAT suspects something as leaking?

To understand this, we need to become familiar with two terms:

1. The shallow size of an object.

2. The retained size of an object.

Shallow size is the size of the individual object alone, whereas retained size is the total size of the object tree, which also includes its children (the objects referenced by it). Consider A as the root object, with outgoing references to B and C (the children of A); B has an incoming reference from A and outgoing references to B1 and B2. Here the size of object A itself is 100, which is its shallow size, while the total tree size of A is 140, which is its retained size, as it includes the sizes of its children. Now consider the size of B1 to be 1000, making the total size of A 1135. Here A is the biggest consumer tree, and within it B1 is the suspect, because it is the biggest consuming child in the A tree.

A significant drop in the retained sizes shows the accumulation point, and we can view the chain of objects and references which keep the suspect alive. The largest drop in retained size provides a relative basis for finding the leak suspects. A significant drop can be due to very large objects in the subtree, or to too many accumulated objects, for example in collections.

When Memory Analyzer parses the dump file (a heapdump or a jextracted system dump), it compares the shallow and retained sizes of the objects in the tree and gives you the leak suspects report. This view is the default report generated by Memory Analyzer. The tool has built-in capabilities to look for probable leak suspects: large objects or collections of objects that contribute significantly to Java heap usage, displayed in the form of a pie chart. It reports memory leak suspects and checks for known anti-patterns. Below the pie chart we can find information about the suspects: the objects’ memory utilization, number of instances, total memory usage, and owning class. From the same view we can do more interactive analysis of the suspects provided by Memory Analyzer.

The Memory Analyzer is a fast and feature-rich Java heap analyzer that helps us find memory leaks and reduce memory consumption. It works with multi-GB heaps in various formats, such as IBM system dumps, IBM PHD (Portable Heap Dump) heapdumps, and Sun HPROF heapdumps. The Memory Analyzer provides efficient memory leak detection and footprint analysis with powerful reporting capabilities, including memory leak suspects, system overview, top components (by retained heap) and more. Memory Analyzer (MA) is based on the open source tool MAT with IBM value-add, and is available for free through the IBM Support Assistant.

Why do we need to use Memory Analyzer?

Because Java provides automatic memory management, a lot of developers believe that Java applications cannot have memory leaks. Garbage collection in Java does relieve us of the mundane duties of allocating, tracking and freeing objects in the Java heap. But garbage collection can only free objects that have no references from the active state of the program (stacks and registers) or from other objects; typical memory leaks are caused by objects that are no longer required but persist in the heap because something still references them.

Memory leak analysis looks into a few key entities:

– What is the object (e.g. a HashMap) holding all the leaking objects, i.e. the leak container?

– What are the objects being added to the leak container, i.e. the leak units?

– Who is holding the leak container in memory? What are the object types and package names of the objects on the chain of references from a root object to the leak container, i.e. the owner chain?

Debugging a memory leak is not an exact science; the analysis points out suspects. The likelihood of a leak depends on the size of the leaking data structure, the drop size, the number of leaking units, and the number of instances of objects in the ownership context graph nodes.

ForkJoin is one of the neat and cool features introduced in Java 7. Fork/join algorithms are highly effective for problems that follow a recursive, divide-and-conquer style, like Fibonacci or merge sort, where the subtasks are independent entities that can be processed without synchronization, except perhaps for sharing an output buffer. Even though the merge sort tasks, for example, operate on a single buffer, they access different indexes: when a task of 1000 entries is divided into 4 tasks, each task processes a different index range, so no synchronization is needed.

Note that a problem should have specific characteristics for the fork/join framework to be used effectively:

1. The problem data (list/file/map, ...) and the computation on that data should be divisible into smaller subtasks.

2. Processing each of these subtasks should be possible without requiring the results of the other chunks (except for waiting on a join to collate results).

Fork/join algorithms are best suited to this class of application.

If you go through the documentation for ForkJoinTask, the restrictions are clearly documented. I have tried to list most of them here:

1. ForkJoinTasks are intended to be used as computational tasks calculating pure functions on purely isolated objects.

2. Computations should avoid synchronized methods or blocks.

3. The only allowed synchronizers are modern ones, such as Phasers, that are advertised to cooperate with fork/join scheduling.

4. No other synchronization should be required, except synchronizing on the join itself.

5. Tasks should not perform blocking I/O, and should ideally access variables that are completely independent of those accessed by other running tasks.

Minor breaches of these restrictions may be tolerable in practice (not all of them can be enforced by rules or thrown exceptions), but frequent misuse will result in poor performance. A minimal example follows.
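As a small sketch of the framework under these restrictions (the array-summing task is my own illustration, not from the original post):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide-and-conquer sum: each subtask works on an independent index range,
// with no synchronization other than fork/join itself.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000;
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {          // small enough: compute directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;           // split the range in two
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                         // run the left half asynchronously
        long rightSum = right.compute();     // run the right half in this thread
        return left.join() + rightSum;       // collate on join
    }

    public static void main(String[] args) {
        long[] data = new long[1000000];
        java.util.Arrays.fill(data, 1L);
        ForkJoinPool pool = new ForkJoinPool();
        System.out.println(pool.invoke(new SumTask(data, 0, data.length)));  // prints 1000000
    }
}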

Every hardware architecture defines its own instruction set. Instructions, in the simplest sense, are commands to the processor to trigger a specific electronic operation on a definite number of operands, which in turn are memory locations or physical registers. Probably, the creators of Java were inspired by this functioning of physical machines: when they conceptualized a virtual version of the metallic machine (the hardware) and called it the Java Virtual Machine, a virtual machine instruction set also came into existence. They named these virtual machine instructions bytecodes.

We do not, however, have virtual registers! Execution is carried out on a stack machine, so the bytecodes were designed to operate on an operand stack.

Consider two simple Java methods written to find the area of a circle.
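The original post showed the methods alongside their bytecode; a plausible reconstruction of the source is:

public class Circle {
    static double areaFromRadius(double r) {
        return Math.PI * r * r;
    }

    static double areaFromDiameter(double d) {
        double r = d / 2.0;
        return Math.PI * r * r;
    }
}

Disassembling with ‘javap -c Circle’ shows the stack-based nature of the bytecodes: each method pushes its operands (the folded constant for Math.PI and the local variables) onto the operand stack, multiplies with dmul, and returns the top of stack with dreturn.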

In this final post, let us strive to figure out whether Java as a programming language is deficient in carrying out any computational task. We have shown a bunch of examples from Clojure in the previous two posts; now, let us see whether the functional programming feats that Clojure programmers accomplish can be performed equally well in Java.

Let us start with an example using Google's Guava library. Our main goal is to convert temperatures from Celsius to Fahrenheit in a functional way:
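The original post included the code here; the following is a hedged reconstruction using Guava's Function and Lists.transform (the class name and values are illustrative):

import com.google.common.base.Function;
import com.google.common.collect.Lists;
import java.util.List;

public class Temperatures {
    public static void main(String[] args) {
        List<Double> celsius = Lists.newArrayList(0.0, 20.0, 37.0, 100.0);
        // Apply the conversion to every element of the list as a lazy view.
        List<Double> fahrenheit = Lists.transform(celsius, new Function<Double, Double>() {
            public Double apply(Double c) {
                return c * 9.0 / 5.0 + 32.0;  // Celsius to Fahrenheit
            }
        });
        System.out.println(fahrenheit);  // [32.0, 68.0, 98.6, 212.0]
    }
}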

We could have achieved the above task with a simple for loop. Is there anything wrong with a for loop? Not really. The only issue is that for any slight modification, we need to manually code another for loop; Réaumur would have forced us to write one more! The point we would like to make is that everything is possible in Java. The question that remains is how elegantly we can accomplish our tasks with Java.

Consequently, there is a lot of discussion about bringing a bunch of functional programming aspects into Java. Check out the references at the end of this post.

Finally, let us look at one more strong aspect of Java: the JVM is a very well-tested platform. Languages like Clojure, Scala, Jython, JRuby, etc. run on this platform, and as a result they can all use the Java libraries very easily in their programs. Here is an example in Clojure:
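The original snippet is not preserved; a small sketch of the kind of Java interop the post describes is:

(import 'java.util.Date)

(defn now []
  (.toString (Date.)))  ;; calls java.util.Date#toString directly

(now)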

It does not take much time for us to understand the above code. The point is clear: the JVM and the Java libraries are extremely valuable. In other words, Java will not die any time in the near future, and none of these new JVM languages make our Java skills obsolete.

Let us put an end to our feasting on the functional binge for now. There is a lot to learn in this area, and we have barely scratched the surface. Please check out the references listed here and the ones mentioned in the previous posts.

You can collect Health Center data for offline analysis (without the graphical user interface) using headless mode, which is typically useful for monitoring applications behind firewalls. To enable headless mode, specify the headless option in the healthcenter.properties file. The agent starts collecting data immediately. When the application terminates, the agent creates a file called healthcenter.hcd, which is written to the current working directory unless a directory is specified explicitly.

To enable headless mode:

1. Open the file healthcenter.properties for editing.

2. Set the option com.ibm.java.diagnostics.healthcenter.data.collection.level=headless

Health Center uses a sampling method profiler to diagnose applications showing high CPU usage, giving full call stack information for all sampled methods. It works without recompilation or bytecode instrumentation, and shows where the application is spending its time.

In the Method profile view, Self (%) is the percentage of samples taken while a particular method was running at the top of the stack, which is a good indicator of how expensive a method is in terms of processing resource. Wider, redder bars in the graphical representation of Self (%) indicate hotter methods.

Tree (%) is the percentage of samples taken while a particular method was anywhere in the call stack. This value shows the percentage of time that the method, and the methods it called (its descendants), were being processed, and gives a good guide to the areas of the application where most processing time is spent.

The Invocation paths tab shows the methods that called the highlighted method. If more than one method calls the highlighted method, a weight is shown in parentheses; for any method, the percentages of its calling methods sum to 100%.

For example, the selected method testApplication.createLargeObjects() has the majority of its invocations (66.6% of samples) made through testApplicationSink.put(), while 33.3% of samples contain invocations made through the testApplicationSink.get() method.

The Called methods tab shows the methods that were called by the highlighted method; in other words, it shows where the highlighted method is doing its work.