Friday, April 29, 2011

Largest JGroups cluster ever: 536 nodes !

I just returned from a trip to a customer who's working on creating a large scale JGroups cluster. The largest cluster I've ever created is 32 nodes, due to the fact that I don't have access to a larger lab...

I've heard of a customer who's running a 420 node cluster, but I haven't seen it with my own eyes.

However, this record was surpassed on Thursday April 28 2011: we managed to run a 536 node cluster !

The setup was 130 celeron based blades with 1GB of memory, each running 4 JVMs with 96MB of heap, plus 4 embedded devices with 4 JVMs running on each. Each blade had 2 1GB NICs setup with IP Bonding. Note that the 4 processes are competing for CPU time and network IO, so with more blades or more physical memory available, I'm convinced we could go to 1000+ nodes !

The configuration used was udp-largecluster.xml (with some modifications), recently created and shipped with JGroups 2.12.

We started the processes in batches of 130, then waited for 20 seconds, then launched the second batch and so on. The reason we staggered the startup was to reduce the number of merges, which would have increased the startup time.

Running this a couple of times (plus 50+ times over night), the cluster always formed fine, and most of the time we didn't have any merges at all.

It took around 150-200 seconds (including the 5 sleeps of 20 seconds each) to start the cluster; in the picture at the bottom we see a run that took 176 seconds.

Changes to JGroups

This large scale setup revealed that certain protocols need slight modifications to optimally support large clusters, a few of these changes are:

Discovery: the current view is sent back with every discovery response. This is not normally an issue, but if you have a 500+ view, then the size of a discovery response becomes huge. We'll fix this by returning only the coordinator's address and not the view. For discovery requests triggered by MERGE2, we'll return the ViewId instead of the entire view.

We're thinking about canonicalizing UUIDs with IDs, so nodes will be assigned unique (short) IDs instead of UUIDs. This means reducing the size for having 17 bytes (UUID) in memory in favor of 2 bytes (short).

STABLE messages: here, we return an array of members plus a digest (containing 3 longs) for *each* member. This also generates large messages (11K for 260 nodes).

The fix in general for these problems is to reduce the data sent, e.g. by compressing the view, or not sending it at all, if possible. For digests, we can also reduce the data sent by sending only diffs, by sending only 1 long and using shorts for diffs, by using bitsets representing offsets to a previously sent value, and so on.

Ideas are abundant, we now need to see which one is the most efficient.

For now, 536 nodes is an excellent number and - remember - we got to this number *without* the changes discussed above ! I'm convinced we can easily go higher, e.g. to 1000 nodes, without any changes. However, to reach 2000 nodes, the above changes will probably be required.

Anyway, I'm very happy to see this new record !

If anyone has created an even larger cluster, I'd be very interested in hearing about it !
Cheers, and happy clustering,

17 comments:

Can you describe what each node was doing after the group was formed? Was there messaging going on, etc.? How about state transfer? Was there any attempt to randomly kill/restart nodes or otherwise partition the cluster after the group was formed? Did this include FD, FD_SOCK, FD_ALL?

We tested cluster formation, so this is the Draw demo that we started.There is no messaging going on after startup; instead all instances are killed as soon as the cluster has formed successfully, and then the next round is started.The config is more or less udp-largecluster.xml, which includes FD_SOCK and FD_ALL (transport is UDP). I was a bit concerned about FD_SOCK, and the traffic generated to discover the neighbor's server socket port, but this turned out not to be an issue. We nevertheless commented FD_SOCK.The next step for this particular customer will be to send 5K messages, ca 30% of all nodes will do that. This is to mimic the application that'll run on the devices.Randomly killing nodes was not yet on the menu...

I am late but congratulations!One year has passed... did the company happen to open-source the "collector" tool?By the way, is there any documentation for the scripts in the "bin" directory? If not, I can fork on Github and insert in each file a small header describing what the script does (provided I understand).Cheers, keep up the great work!

Thanks Raoul !No, the tool hasn't been open sourced yet; one issue was that they were using Eclipe RCP which (IMO) is a PITA...

No, the scripts in bin are only partly documented in the manual, for instance probe.sh.Yes, if you'd like to add a small description to each script, be my guest: I'd be happy to apply a pull request.Cheers,