Hi Rolf,
Thank you very much for the performance details!
I definitely think you made me a better Java programmer.
Mike Brenner
-----------------------------------------------
From: Rolf Lear [mailto:jdom at tuis.net]
Sent: Thursday, March 22, 2012 6:02 PM
To: Brenner, Mike
Cc: jdom-interest at jdom.org
Subject: Re: [jdom-interest] In the 11th Hour
On 22/03/2012 2:47 PM, Brenner, Mike wrote:
> Hi Rolf,
> I hope you decide to take the time to post how you accomplished this, to make us all better java programmers.
Hi Mike.
Thanks... I think.... but I have to admit, I am still learning things
too (and have been for years... which I think helps).
It helps that I have spent years doing exactly this sort of thing
(high-efficiency Java code) as a regular job. I typically focus more on
the memory side of things because memory typically is the first thing to
break with Java... when you run out of -Xmx it just dies, but if it runs
slow, well, there's always more days to process things in... ;-)
My Java work experience is very useful though... I typically work on
'big' machines, in the finance arena, where setting -Xmx64g is
'routine', and processing tens of millions of records in memory at one
time is also common. Having lots of memory creates bad habits in
programmers, and when you develop your applications with small scale
data, it is easy to set -Xmx2g, build your system, and then, when you
scale the system to big data, it all blows up because you have
non-linear scalability. That's sort of where I get involved... ;-)
So, if you take the specific example of the 'Scan' times, I can go
through the process I took to get it fast (again).
The first trick is to have realistic expectations of what results you
should see. With this code, when I started with JDOM 2.x I was able to
take a benchmark of the code, and I have been able to refer back to
that. That creates a 'minimum' expectation. If I had not done that then
I would not have known there was a problem. I am a believer that
performance should be 'regression tested' as much as the actual
functionality.
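The "regression test the performance" idea can be sketched as a plain timing check against a recorded baseline. Everything below (the class name, the baseline number, the stand-in workload) is illustrative, not JDOM's actual harness:

```java
// Hypothetical sketch: treat a recorded timing baseline as a regression test.
// Names (ScanBenchmark, BASELINE_MS) and numbers are illustrative only.
public class ScanBenchmark {
    static final double BASELINE_MS = 3.5; // recorded from a known-good build
    static final double TOLERANCE = 1.5;   // allow 50% jitter before failing

    // The work under test; here a stand-in loop instead of a real JDOM scan.
    static double timeOneScanMs() {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        // Use the result so the JIT cannot eliminate the loop entirely.
        if (sum < 0) {
            throw new IllegalStateException();
        }
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    // A later run "fails the regression" if it exceeds baseline * tolerance.
    static boolean withinBaseline(double actualMs) {
        return actualMs <= BASELINE_MS * TOLERANCE;
    }

    public static void main(String[] args) {
        double ms = timeOneScanMs();
        System.out.printf("scan took %.3f ms (baseline %.1f ms)%n",
                ms, BASELINE_MS);
    }
}
```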
So, having identified the scan time as a problem, I just took the scan
benchmark code and wrote a loop around it....
int cnt = 0;
long start = System.currentTimeMillis();
while (true) {
    cnt++;
    int sz = 0;
    for (Content e : doc.getDescendants()) {
        sz++;
    }
    if ((cnt % 1000) == 0) {
        long time = System.currentTimeMillis() - start;
        start = start + time;
        System.out.printf("Looped %d times @ %.3fms each (%d lines)\n",
                cnt, time / 1000.0, sz);
    }
}
The challenge is then simply to make the time-per-thousand as small as
possible.
Then, while that is running, you launch the trusty
$JAVA_HOME/bin/jvisualvm , and you profile the code.
With some tweaking of the settings, you should be looking for three
things: CPU hogs, memory hogs, and tight loops.
- Profile the CPU time, and once the profiling has taken 'some time' (20
seconds or so), you create a snapshot of the profiling. Look at the
profile (not the snapshot), and get an idea of the methods that took the
most time. Then, in the snapshot, dig down in to it to find out where
the code is being called from, why it is called, how often, and where it
is spending its time. This is a little bit of 'black magic' to identify
where the 'hot spot' is. You need to be able to look at the source code,
switch to the snapshot, and back again, and so on.
- while you are in this snapshot, look for items that are called very
often. Even a small improvement on 'tight' code will make a big
difference. (I have spent a lot of time working on the 'Verifier' for
example).
Then, once you have identified the 'hot' code you can look at the code,
and decide if things could be done differently. Even a small change can
make a difference. Once you make a small change, re-run the test loop
(no profiling). Check to see whether it is any faster (or slower).
When I start running out of things in the CPU profiles to 'tweak', I
start looking at the memory side of things. Often you see a lot of
memory allocation in strange patterns. Use the Memory profiler in
jvisualvm (make sure you change the setting to enable "Record
allocation's stack trace").
Getting back to the 'scan' loops, I tried a number of different changes
to get things better. Changing the whole iterator around (making
hasNext() do nothing, and putting all the logic in to 'next()') brought
it from 3.2ms to about 2.7ms... which was great, but not as good as JDOM
1.x. So, after looking at things some more, I decided to try replacing
the LinkedList that was used as a stack. I know this sort of thing makes
a difference (from experience). I was somewhat surprised by how much of
a difference it made, though....
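To illustrate the kind of change that was: below is a minimal sketch (not the actual DescendantIterator code) of an array-backed stack of the sort that can replace a LinkedList used as a stack. Pushing onto a growable array avoids allocating a list node per element and keeps the data contiguous, which matters in tight iteration loops:

```java
import java.util.Arrays;

// Illustrative sketch only: a minimal array-backed stack standing in for
// a LinkedList used as a stack. No per-push node allocation, and the
// backing array only grows (amortized) when it runs out of room.
public class IntStack {
    private int[] data = new int[16];
    private int size = 0;

    public void push(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // amortized growth
        }
        data[size++] = value;
    }

    public int pop() {
        return data[--size];
    }

    public boolean isEmpty() {
        return size == 0;
    }
}
```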
For the record, the reason I chose the DescendantIterator to work on is
because I had already gone through this sort of exercise with the
ContentList Iterators before (bringing the XPath code from 90ms to
30ms), so I did not think I would find much there. If I was starting
'fresh' I would have looked at the lower levels some more too.
Once I had made a good change, I reran the full 'benchmark' and checked
the results.
A more interesting fix was the one on the FormatStack. I introduced the
FormatStack itself to solve a performance problem - having to keep
re-calculating the 'state' of the output formats... But, when I ran the
XMLOutputter code through the Memory profiler I found a lot of time was
spent in the Format.escapeText() method. This is a very tight method
call, and I tried really hard to make it faster. I did a few things
which made a difference, but, when I changed from 'Raw' format to
'Pretty' format, it was still very slow. I could not find the slow-down
until I checked the memory profiler, and identified that a huge amount
of char[] memory was being created in the FormatStack.push() method.
Memory is a double-edged sword in Java (and other languages). For every
byte you allocate, you also need to collect it. Allocating is slow, and
collecting is slow (especially if it is a 'full' collection -
single-threaded).
So, as soon as I saw all that memory, and realized that it was coming
from 'build the indent string' type string concatenation, I found a way
to not do that unless 'things change', so, now the FormatStack reuses
the indents. This made a huge impact on performance. Saving the
allocation of memory pays for itself three times over: not having to do
the calculation of the new value, not having to allocate and store the
result, and not having to GC it later.
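The "reuse the indents" idea can be sketched like this; the class and method names are made up for illustration, and this is not the real FormatStack code:

```java
import java.util.Arrays;

// Hedged sketch of the fix: cache the concatenated indent string per
// depth, so the concatenation happens once per depth, not once per push.
// IndentCache and forDepth() are illustrative names, not JDOM's.
public class IndentCache {
    private final String indent;        // e.g. "  " per level
    private String[] levels = new String[8];

    public IndentCache(String indent) {
        this.indent = indent;
        levels[0] = "";
    }

    public String forDepth(int depth) {
        if (depth >= levels.length) {
            levels = Arrays.copyOf(levels, depth + 1);
        }
        if (levels[depth] == null) {
            // Built once per depth, then reused: no per-push char[] garbage.
            levels[depth] = forDepth(depth - 1) + indent;
        }
        return levels[depth];
    }
}
```

Because each depth's string is built once and handed back on every later push to that depth, the allocation (and the GC that would follow it) only happens when the document gets deeper than it has ever been before.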
Additionally, because JDOM has a comprehensive test harness, it is easy
to make reliable changes, even if they would otherwise be called 'big'.
It becomes 'easy' to play with things and see how it changes.
Yet another interesting one is the 'escapeText()' code in Format. It
used to look like:
public boolean shouldEscape(char ch) {
    if (bits == 16) {
        if (Verifier.isHighSurrogate(ch)) {
            // Safer this way per http://unicode.org/faq/utf_bom.html#utf8-4
            return true;
        }
        return false;
    }
    if (bits == 8) {
        if (ch > 255) {
            return true;
        }
        return false;
    }
    if (bits == 7) {
        if (ch > 127) {
            return true;
        }
        return false;
    }
    if (Verifier.isHighSurrogate(ch)) {
        // Safer this way per http://unicode.org/faq/utf_bom.html#utf8-4
        return true;
    }
    if (encoder != null) {
        try {
            return !encoder.canEncode(ch);
        }
        catch (Exception ignored) {
            // ignore problems.
        }
    }
    // Return false if we don't know. This risks not escaping
    // things which should be escaped, but also means people won't
    // start getting loads of unnecessary escapes.
    return false;
}
That is a 'big' method, especially when it is called for *every*
character that is output by JDOM. The thing is that the nature of the
output (7bit, 8bit, 16bit, Charset, or default) is known when the Format
instance is created. Having to check again for each character is
unnecessary. So, I changed the code to have different escape strategies
for each type, and replaced the 'Default' with one of a set of them.
Now, for example, if you are outputting to a UTF-8 based output, the
method is simply:
public final boolean shouldEscape(final char ch) {
    return Verifier.isHighSurrogate(ch);
}
And, since I have different instances of the class for different
'shouldEscape()' implementations, the decision is made once for each
Format, instead of once for each char.
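Here is a hedged sketch of that strategy idea; the interface and class names below are mine for illustration, not JDOM's actual types. The point is that one tiny shouldEscape() implementation is chosen when the Format is created, based on the encoding, and then called per character with no re-checking:

```java
// Illustrative sketch of per-encoding escape strategies. EscapeStrategy,
// EscapeStrategies, and forEncoding() are made-up names for this example.
interface EscapeStrategy {
    boolean shouldEscape(char ch);
}

public class EscapeStrategies {
    // Each strategy is a tiny, branch-light method body.
    static final EscapeStrategy UTF = ch -> Character.isHighSurrogate(ch);
    static final EscapeStrategy BITS8 = ch -> ch > 255;
    static final EscapeStrategy BITS7 = ch -> ch > 127;

    // Decided once per Format, instead of re-tested once per char.
    static EscapeStrategy forEncoding(String enc) {
        String e = enc.toUpperCase();
        if (e.startsWith("UTF")) {
            return UTF;
        }
        if (e.startsWith("ISO-8859")) {
            return BITS8;
        }
        // Conservative default for this sketch: escape anything non-ASCII.
        return BITS7;
    }
}
```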
Summary....
Here are some 'tips' I have come to learn, and they influence my coding
style, and how I look at performance problems.
Some of them have bad reputations... and people complain about
'optimizing too early', ... but, it works for me.
Use 'final' as liberally as possible. Classes, methods, variables,
parameters, everything.
I have learned, over the years, to be very wary of the Collections API
when it comes to memory and performance. They are all extreme memory
hogs, and as a result, they are slow... Be careful using them in
performance-critical code. Arrays of primitives are often just as easy
to use, and are faster.
while-loops with break/continue statements are faster than conditionals
in the loop.
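For example (illustrative only; the actual speed difference depends on the JIT and the data):

```java
// Illustrative contrast for the tip above: a flag re-tested on every
// iteration, versus a break that exits as soon as the answer is known.
public class LoopStyle {
    // Conditional inside the loop: 'pos' is re-tested on every pass,
    // even after the match has been found.
    static int findFlagged(int[] a, int target) {
        int pos = -1;
        for (int i = 0; i < a.length; i++) {
            if (pos == -1 && a[i] == target) {
                pos = i;
            }
        }
        return pos;
    }

    // break leaves the loop at the first match; no redundant tests.
    static int findBreak(int[] a, int target) {
        int i = 0;
        while (i < a.length) {
            if (a[i] == target) {
                break;
            }
            i++;
        }
        return i < a.length ? i : -1;
    }
}
```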
loops with a constant test are much faster than loops with
conditionals... for example:
for (int i = 0; i < list.size(); i++) {
    ....
}
the above is fine, but this is better:
final int len = list.size();
for (int i = 0; i < len; i++) {
    ...
}
When you can, do count-down loops instead:
int idx = list.size();
while (--idx >= 0) {
    ....
}