Concur.next & WF2 — Tuning Concurrent Clojure

I’ve been working on, and writing about, running Clojure Wide Finder code.
But I was never satisfied with the absolute performance numbers. This is a
write-up in some detail as to how I made the code faster and also slower,
including lessons that might be useful to those working on Clojure
specifically or on concurrency more generally, along with some interesting
data on Java garbage collection and JDK7.

[Update: My relationship with the JVM is improving, based on good
advice from the Net. This article has been partly re-written and is much more
cheerful.]

In
a
comment on my last piece,
Alexy Khrabrov noted “I saw WF2 has a
pretty good Scala result, apparently better than Clojure's — and that's a
year-old Scala.” Alexy is right; my best times were mostly over 30 minutes,
while
Ray Waldin’s
Scala code ran in under 15. There’s no obvious reason I can see why
Clojure should be significantly slower than Scala, and while I stand by my
Eleven Theses on
Clojure, I was puzzled and irritated.

The following are organized roughly chronologically; if the narrative isn’t
all that coherent, that would be an indicator of me having done considerable
thrashing about.

Low-Hanging Fruit ·
The information you can get from the JVM with the
-Xprof argument is not brilliant, but it’s a whole lot better
than nothing. What you want to see in profiling output is a big fat obvious
culprit, and I did, and it was the code fragment below, which breaks up
buffers into lines; text is the text we’re wrangling,
first-nl and last-nl are the first and last places
where a newline character (whose value, note, is
10) appears in the block. destination is the user-provided
payload function.
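It looked roughly like this (a from-memory sketch, so the function name and argument order are approximate, not the original code):

```clojure
;; Hedged reconstruction: walk the block from just past the first
;; newline to the last one, carving out lines and feeding each to the
;; payload function, threading the accumulator through.
(defn burn-lines [text first-nl last-nl destination so-far]
  (loop [start (inc first-nl) so-far so-far]
    (let [nl (. text indexOf 10 start)]   ; 10 is the newline character
      (if (or (neg? nl) (> nl last-nl))
        so-far
        (recur (inc nl)
               (destination (. text substring start nl) so-far))))))
```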

This sucker was burning well over half of all the CPU. There are a few
things wrong with it. First of all, since I was already using Java to split
up the text, why not do it all at once like this?

(let [chunks (. #"\n" split text)] ...)

Yep, that helped. (I’d actually had it that way in a previous iteration
but had unrolled it while fighting a memory leak that turned out to be
unrelated).

Lesson: Any time you can package up a bunch of work as a single
call into Javaland, that’s probably a good idea.

Reducing ·
The second thing that was wrong was that I was iterating at all. Lisps
want you to think in terms of lists not their members, and I wasn’t.
What I actually wanted to do was to call the payload function on each line
of text, passing the output of each invocation to the input of the next. And
of course Clojure, like any decent functional language, has a nice
reduce function. So let’s just add a line to that fragment:
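Something like this (a sketch; `destination` stands in for the wrapped payload function):

```clojure
;; Split the block in one Java call, then reduce the payload function
;; over the resulting lines, threading the accumulator through.
(defn process-block [destination so-far text]
  (let [chunks (. #"\n" split text)]
    (reduce destination so-far chunks)))
```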

Lispers are now nodding their heads in a despairing
of-course-the-moron-should-have-done-it-that-way way. Me, I like it when a profiler
shows me where I’m being stupid. Some non-Lispers are probably thinking
“That’s slick”. Most modern languages, not just Lisps, have some sort of a
reduce equivalent.

I’m a bit amused here: this code has (basically) a map/reduce architecture,
except that I’m using reduce in the map phase.

Lesson: Operate on lists instead of iterating.

Type Hinting ·
The code was now running visibly faster; time for another whack at the
profiler. I saw a whole lot of time being sucked up by
java.lang.Class.forName0. This suggested to me that the Clojure
runtime was spending a bunch of time trying to figure out
what type of thing was crossing some method-dispatch barrier. I already had a
wrapper around the user-provided payload function because it has three
arguments and reduce wants just two, and to protect myself from
Java
issue 4513622.
I seemed to remember from
somewhere that Clojure has “type hints”, so I looked that up and decorated the
wrapper:
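The wrapper came out something like this (a sketch; the argument order of the payload is my guess):

```clojure
;; Adapt the three-argument payload to the two arguments reduce wants;
;; the (new String ...) dodges the substring-retention behavior of
;; Java issue 4513622, and the #^String hint avoids reflective dispatch.
(defn wrap-payload [record]
  (fn [so-far #^String line]
    (record (new String line) so-far nil)))
```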

See that #^String goober? It’s a hint telling Clojure
to assume that the incoming line argument is a Java string.
It helped too, quite a bit. I was sort of irritated at having to hold
the system’s hand like this. But I think this may be a Release-1.0 symptom
rather than a Clojure design flaw. It feels to me like there’s quite a bit
more scope for type inferencing, maybe even JIT inferencing at run-time.

Lesson: A little static typing can go a long way.

Reading vs. Mapping ·
My first cut at this problem used Java’s NIO subsystem; I’d noted that
java.nio.channels.FileChannel advertises itself as being
thread-safe and furthermore had a map() method for mapping stuff
into memory. So I had this pretty slick (I thought) setup where multiple
threads would just get dealt out an offset and block size and map successive
regions of the file, then do the string-i-fying and splitting.

I was twiddling buffer sizes and thread counts and not observing any real
helpful patterns and wondered if I was maybe overthinking the problem. So I
made another version that uses a Clojure agent to synchronize
sucking in the data with good old-fashioned
java.io.FileInputStream.read() and fires off a new thread for
each block. Right off the bat, it was faster.
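The reading side might be sketched like so (names invented; the agent serializes the reads, and each block goes to a fresh thread):

```clojure
;; A Clojure agent serializes access to a plain java.io.FileInputStream;
;; each block that comes off it is handed to a brand-new thread, and the
;; action re-sends itself to fetch the next block.
(import '(java.io FileInputStream))

(def reader (agent nil))

(defn read-next-block [state stream buffer-size process]
  (let [buffer (byte-array buffer-size)
        n (.read stream buffer)]
    (when (pos? n)
      (.start (Thread. #(process buffer n)))   ; one thread per block
      (send reader read-next-block stream buffer-size process))
    state))
```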

Experimenting with this one seemed to suggest that it ran faster and faster
the smaller I made the buffers. I’d started with 64M and eventually went all
the way down to 2M; 1M wasn’t any faster. I hypothesize that with the big
buffers, the big String.split operation illustrated above was
creating too many short-lived transient objects that were overloading the
garbage collector.

Lesson: Disk really is the new tape.

Stabilizing the JVM ·
Early on, when I was fighting a memory leak, Rich Hickey suggested using a
smaller heap size. I did that, and was astounded.

I was now seeing throughput on the order of 50-60M/sec. But as I watched
the program run, I could see that as the program built up its heap to the
ceiling (set via java’s -Xmx argument), it was running much
slower, as slow as a tenth of that speed. When it hit the ceiling, absent
other pathologies, it would ramp up to production speed, I assume as the GC
got into balance with the rest of the program’s work.

Since it can easily take minutes for the JVM to expand to fill
a 10G-or-bigger heap, you really can shave a couple of minutes off your run
time by giving it a whole lot less. Yes, this is profoundly
counter-intuitive.

Lesson: Measure, don’t guess.

Happy Dance! ·
At this point, I was starting to see run-times down well below fifteen
minutes, and profiler output that looked like this (the primitive -Xprof
facility gives you a little dump every time a thread exits, which turns out to
work really well for me as I’m running through hundreds of ’em):

In other words, the program is spending most of its time using Java’s regex
machinery to grind away at all that text input; which is I think what you’d
like to see.

This was around 10PM last Friday, and I was
crowing exultantly
on Twitter, feeling like a rockin’ sockin’
functional-programmin’ concurrin’ homoiconic wizard. “Faster than Scala” I
uttered, incautiously.

Lesson: Read some Classics in translation from the archaic Greek and
learn to fear the punishment for hubris, especially public hubris.

Which Wide Finder Was That? ·
Because, you see, I was actually running a
Wide Finder 1 program,
just computing one little statistic, whereas Ray Waldin’s Scala code, and
the others on the
results page,
were working on
the
benchmark from
Wide Finder 2.
It’s a lot more complex and computes five different statistics which
require looking at a much higher proportion of the data.

Well, I thought, How hard can it be?

Lesson: [Never mind. We’ve all been there.]

Attitude Problem ·
I now have one, because it seemed like a lot more work than it should have
been to code this up.

The stats you want to compute could be called “hits”, “bytes”, “referers”,
“404s”, and “fetchers”. So I accumulated results in a map whose keys were
:hits, :bytes, :referers,
:404s, and :fetchers; the values were maps whose
keys were the URIs and values the in-progress statistics. So
if you want to increment the :hits value for
some URI, you say:
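With mutable Java maps, the increment I wanted would look something like this sketch (names invented; `counts` maps each stat keyword to a HashMap of URI to count):

```clojure
;; Plain destructive increment against nested java.util.HashMaps:
;; look up the table for the stat, bump the entry for the URI in place.
(import '(java.util HashMap))

(defn bump! [counts key uri]
  (let [#^java.util.Map table (. counts get key)]
    (. table put uri (inc (or (. table get uri) 0)))))
```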

...but then I’d be promiscuously mutating state and condemning
myself to single-threadedness. Which is what this series is all about
trying to avoid.
On the other hand, it says exactly what I want to say without any
ceremony or extra arm-waving. Is there a middle way?

Anyhow, here’s my gripe: I started sketching in the code to gather the five
Wide Finder 2 stats Sunday while I was watching football. I spent all day
Monday, until late in the evening, getting it all to work; admittedly,
I’m a relative Clojure newb, but
I’m not that terrible a programmer and I think I’m actually fairly
quick at learning new things. This is way too long.

The result-merging step in particular gave me heartburn. If you can merge
two maps with Clojure’s elegant merge-with and you can turn a
list-or-vector into another list-or-vector with map, why can’t
you run a map on a map and get a
map?!?

Now, the situation wasn’t helped by the fact that Clojure is after all 1.0,
so its tracebacks are not models of helpful transparency; also, the code is
running in dozens of threads in parallel, which made for extra work in
tracking down what provoked that NullPointerException or
ClassCastException.

Lesson: Some combination of faults in Clojure and faults in
your humble correspondent is interacting badly.

Performance ·
Improved, but not terribly satisfying.
The wrong collection of JVM options can drive this sucker
into Garbage-Collection hell. This is painfully obvious when it’s going on;
if you watch the output with a simple monitoring tool like
top(1), you
can see it trundling along processing 30 to 50 MB/sec and reporting a CPU
burn rate anywhere between 1200% and 3000%, which means you’re keeping all
the cores and threads pretty darn busy. Then all of a sudden the CPU drops
to around 100% and the output just stops. For like 90 seconds or more. If
you’ve picked a bad combination of options, this happens more and more
frequently until you’re spending way more time garbage-collecting than
actually computing.

[Update:] Someone who wishes to remain anonymous read the first
version of this piece, had a discussion with me about JVM options, and
suggested simply taking all of them out except for -Xms and -Xmx (to ensure
there’s enough heap), letting the JVM manage its own GC strategy.

This person’s advice turned out to be smarter than the tribal lore I’d picked
up over the years and around the Net, and I’ve managed to stay out of GC Hell
for the last couple of days.

On the other hand, after all the profiling and adjusting, my code
has grown tons of extrusions and decorations; here’s the bit that actually
does the work of processing a line of logfile:
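A sketch of its shape (the field positions and the `record` helper are stand-ins for the real thing; the original’s Lines 6 and 7 are marked in comments):

```clojure
;; Per-line worker: split the log line on spaces, fish the fields out
;; of the Java array with aget, type-hinting what comes back out.
(defn proc-line [record so-far #^String line]
  (let [fields (. line split " ")
        ;; Line 6, the commented-out destructuring alternative:
        ;;   [_ _ _ _ _ _ uri _ status bytes] fields
        uri (. #^String (aget fields 6) intern)  ; Line 7: intern, or (new String ...)
        status (aget fields 8)
        bytes (aget fields 9)]
    (record so-far uri status bytes)))
```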

Boy, that ain’t pretty. At one point, I was pulling out the fields with
fragments like (get fields 6) but that led to some
XxxArrayAccessor method bubbling to the top of the profile output.
I attempted to replace that with “destructuring”, as in the commented-out Line
6, but that turns out to be just a macro that apparently generates about the
same calls.

I settled on aget, apparently specialized for fishing around
in Java arrays. But you still have to type-hint what comes out of it.
Feaugh.

In Line 7, you can protect yourself from our old friend 4513622 by either
calling new String on the uri, or by interning it, as above.
The former causes more memory stress but, if you can spare the memory, runs
faster.

Lesson 1: I’m not sure there’s a clear one here, but it makes me nervous
when application code that should be simple is hard to get right, and so far
Clojure’s not making this developer happy the way Matz designed Ruby
to.

Lesson 2: The Concurrency Tax is too high.

Fallback hypothesis: I’m just Doing It Wrong.

GC Explosion ·
I accidentally ran an experiment that showed me where a lot of my GC pain
was coming from. Reminder: this works by having one Clojure agent read
blocks sequentially and handing them off to threads to process, and build
those tables of tables with the results. When each thread is done, it sends
its table off to another agent to merge.

One time, I had a bug that turned the merger into a no-op, the agent was
invoked all right but simply ignored the incoming table. This version ran
very fast; not as fast as the simple Wide Finder 1 code I’d been working
with earlier, but not bad at all, with a lot less GC overhead.
So that code above may be ugly but it actually runs OK.
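The merge step might be sketched like this (function name invented; the real code’s shape, not a copy of it):

```clojure
;; The outer merge-with walks the five top-level stat keys; the inner
;; merge-with adds the per-URI counts together.
(defn merge-tables [totals thread-results]
  (merge-with #(merge-with + %1 %2) totals thread-results))
```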

It’s iterating all right, but only over the keys of the top-level map, of
which, remember,
there are only 5: :hits, :bytes, and so on. Not
much to it, and reasonably idiomatic I think. But boy does it ever generate
garbage.

I note that in Ray Waldin’s
nicely-performant Wide Finder 2 code he relied on a
java.util.concurrent.ConcurrentHashMap. I suppose it’s
unsurprising that that would outperform the sort of pure functional/dynamic
code above, but it feels a bit like cheating.

Java Versions and Options ·
My best times for the full WF 2 run are now
around 30 minutes. 29:38 with JDK7, 35:10 with Java 6.
The results of tinkering with GC options and generation
sizes and so on seem insignificant-to-damaging. In particular, the new “G1”
garbage collector in JDK7 doesn’t seem appropriate for this problem.

What I Could Do Next ·

Use a ConcurrentHashMap like Ray did.

Bring
VisualVM to bear on the problem
and really do a deep-dive on what’s happening inside this code.

Try refactoring the app some more.

I dunno. I still like Clojure and stand by my 11 Theses, but that
impedance mismatch with my conventional procedural object-oriented
programmer’s mind grows fatiguing.

Updated: 2009/12/10

Contributions

One of the neat things Ray's solution does (and, indeed, I think most of the top performers) is not bother with line splitting. Once you're dealing with fairly optimized code, the performance is really driven by how many times you have to pass over the data. Java is automatically penalized in this case because it does one pass in order to turn bytes into strings, where languages that natively work with ASCII can just copy the bytes in. Being able to use the bytes directly rather than having a big heavy immutable string object probably helps a lot too (that's another pass: copying the file buffer memory into the string buffer memory).

Anyway, I suggest looking through your code and seeing how many times you'll end up scanning the bytes/chars during your processing. Squishing that down will squish your time.

Using java.util.concurrent.ConcurrentHashMap is *not* cheating. There are many problems that can't be solved with a ConcurrentHashMap, and the more powerful constructs of refs, maps, transactions, etc. are required. But if they're not, by all means use the more efficient features of java.util.concurrent.

Also, here's an alternate implementation of bump. I don't know if it's any faster or produces less memory churn, but at least it's a bit simpler:
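Something along these lines (a sketch, since I can't vouch for the exact original):

```clojure
;; A simpler bump: update-in walks to the nested counter in one call,
;; treating a missing entry as zero.
(defn bump [so-far key uri]
  (update-in so-far [key uri] #(inc (or % 0))))
```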

Perhaps you should post the entirety of your code now that you've banged on it a little while :) I'm sure the Clojure community would be able to give you not only optimizations but perhaps other ways of tackling the problem.

One issue that you're probably running into is in the creation of many ephemeral objects. I wonder if this could be avoided by using atoms to hold the counts for bumping and not using nested maps?
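The atom idea might look like this sketch (names invented): one atom per statistic, holding a flat map of URI to count, with swap! making the increment thread-safe without nesting maps.

```clojure
;; One atom per statistic; swap! applies the increment atomically.
(def hits (atom {}))

(defn bump-hits! [uri]
  (swap! hits update-in [uri] #(inc (or % 0))))
```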

While Clojure certainly encourages a functional style, Rich has introduced enough constructs where you can code in a concurrency safe yet some what imperative style if that is more natural.

I think it would be better for your readers and the Clojure community if you posted your code somewhere, e.g. GitHub, for them to take a look at and study. Posting benchmarks with incomplete code blocks makes it hard for us to analyze where the problems might be, and your blogged results will remain in the blogosphere for eons, giving people the impression that perhaps Clojure is fundamentally broken or hard to work with and tune. I understand if you feel your code is a work in progress, but hey, that doesn't stop anyone else from posting their stuff.

I think language implementers should think seriously about using disjoint heaps for each thread and copying in between. This is just such an easy way to give programmers a predictable performance model and avoid these voodoo frustrations.

I've been to a few research talks about Java GC and each time my take-away has been "wow, they made the problem much harder than the Erlang guys."

Here's the idiom, offered in the spirit of "let me share" not "why didn't you know":

(into {} (map (fn [[k v]] (do-something k v)) my-hash-map))

The (into {} ...) is the trick. The map function works on sequences, & will coerce anything it can into a sequence, which in the case of (hash) maps is a sequence of [key value] pairs. The (fn ...) operates on these pairs, returning another, modified pair for each, so the return value of the map function is the sequence

( [k v] [k v] [k v] [k v] ... )

So you've processed the original map as a sequence (here's the unifying sequence abstraction) and want to turn the thing back into a map: enter "into", which successively adds (with "conj") items from a sequence into a supplied data structure.
I love this project and can't wait to see what your final conclusions are.

Even though I don't completely understand everything that's going on in the proc-line function, I have a suggestion for how to clean it up. In your post "Idiomatic Clojure" you had a version of the batch-pmap-wide-finder code that you liked with functions tally, count-lines and find-widely. I am doing some log processing (not Wide Finder 2) and used this as a template to get started. Here are the good parts:

In tally I use the regular expression to both find what I am looking for and also do a destructuring bind to extract the fields. I then create a nested map that has the stats for a single line.

In count-lines and process I use deep-merge-with from clojure.contrib.map-utils to merge the maps into a single map. My process function is merging results from multiple files but you can do the same type of thing in merging the results from multiple threads.
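A sketch of this approach (the regex and stat names are illustrative, and deep-merge-with is given here in its usual form rather than pulled from clojure.contrib.map-utils):

```clojure
;; Recursively merge nested maps, combining leaf values with f.
(defn deep-merge-with [f & maps]
  (apply merge-with
         (fn [a b]
           (if (and (map? a) (map? b))
             (deep-merge-with f a b)
             (f a b)))
         maps))

;; One destructuring bind on re-matches both finds and extracts the
;; fields; each line becomes a small nested stats map.
(defn tally [line]
  (when-let [[_ uri status bytes]
             (re-matches #"\S+ \S+ \S+ \[[^\]]+\] \"\S+ (\S+) \S+\" (\d+) (\d+)" line)]
    {:hits  {uri 1}
     :bytes {uri (Long/parseLong bytes)}
     :404s  (if (= "404" status) {uri 1} {})}))
```

Merging the per-file (or per-thread) maps is then just `(deep-merge-with + map-1 map-2 ...)`.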