February 08, 2006

Distributed TestNG

And I naïvely thought that nobody would notice the mysterious
"Distributed TestNG" mention in the
TestNG 4.5 change log...
Well, quite a few people took notice and asked for details, so here goes.

First of all, the reason I didn't advertise this feature more is that
it's still a work in progress and it's missing at least one feature that I think
needs to be implemented before it can be released in beta form (see the bottom
of this post).

More and more people have asked me for this feature these past months and I
finally found some time to take a serious shot at it, and I was quite surprised
to see that it came along very well. Here is how it works.

Please keep in mind that there are many open issues (listed below), and I'm
emailing testng-dev first in order to
gather some feedback on these issues before announcing it on
testng-users, so a lot of this is
still in flux, but the good news is: the implementation works, and it's now
just a matter of packaging it nicely.

Overview

The idea is to be able to distribute your tests across several slave
machines in order to reduce the overall execution time. To achieve
this, you can now launch TestNG in
"slave" mode on a remote machine by specifying a port:

java org.testng.TestNG -slave 5150

At which point, TestNG will just sit
and wait for incoming connections.

Once you have launched all the slaves that you need, you declare them in a
properties file:

# hosts.properties
testng.hosts=terra:5150 arkonis:5151
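
For illustration, a file in this format can be read with java.util.Properties. This is only a sketch, assuming that the entries are separated by whitespace and written as host:port, as in the example above. The names HostFileSketch and parseHosts are made up for this example and are not TestNG's actual code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of TestNG: reads the testng.hosts property.
class HostFileSketch {

    // Parses the content of a hosts file into unresolved socket addresses.
    static List<InetSocketAddress> parseHosts(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            // Cannot realistically happen with a StringReader
            throw new IllegalStateException(e);
        }

        List<InetSocketAddress> hosts = new ArrayList<InetSocketAddress>();
        String value = props.getProperty("testng.hosts", "").trim();
        if (value.isEmpty()) {
            return hosts;
        }
        // Assumption: entries are separated by whitespace
        for (String entry : value.split("\\s+")) {
            int colon = entry.lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected host:port but got " + entry);
            }
            String host = entry.substring(0, colon);
            int port = Integer.parseInt(entry.substring(colon + 1));
            // createUnresolved avoids a DNS lookup while parsing
            hosts.add(InetSocketAddress.createUnresolved(host, port));
        }
        return hosts;
    }
}
```

Given the two-host file above, parseHosts returns one address for terra on port 5150 and one for arkonis on port 5151.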

And you launch the "master" version of
TestNG by passing it this host file:

The tests will then be dispatched randomly to the various hosts until they have
all run. All the results will be collected and presented in the usual HTML
format (with the addition that these results will include the remote host they
ran on).
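
To make the idea concrete, here is a rough sketch of random dispatch. It is not the actual TestNG code: RandomDispatchSketch and assign are hypothetical names, work units and hosts are plain strings, and a real master would also have to wait for each host to send its results back.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: hands every work unit to a randomly chosen host.
class RandomDispatchSketch {

    // Returns a plan mapping each host to the work units it should run.
    static Map<String, List<String>> assign(List<String> workUnits,
                                            List<String> hosts,
                                            Random random) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("At least one host is required");
        }
        Map<String, List<String>> plan = new LinkedHashMap<String, List<String>>();
        for (String host : hosts) {
            plan.put(host, new ArrayList<String>());
        }
        for (String unit : workUnits) {
            // Pick a host uniformly at random for this unit
            String host = hosts.get(random.nextInt(hosts.size()));
            plan.get(host).add(unit);
        }
        return plan;
    }
}
```

Every unit ends up on exactly one host, but nothing guarantees an even split; a smarter master could hand out the next unit to whichever host becomes idle first.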

Right now, two dispatch strategies are supported: per-test or per-suite (the
default).

By default, each suite (each testng.xml
file) will be sent to a remote host in its entirety. If you need a finer
granularity, you can add the following to your hosts.properties:

# hosts.properties
testng.strategy=test

In this case, TestNG will parse all
your testng.xml files,
collect all the <test> stanzas and send each of them individually to a remote
host. The results are returned once the test has run.
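
The difference between the two strategies boils down to what counts as one unit of work. Here is a rough sketch of that idea, with made-up names (StrategySketch, workUnits) and a simple map standing in for the parsed testng.xml files. The "file#test" naming of the units is also just an assumption for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: computes the units of work sent to remote hosts.
class StrategySketch {

    // testsBySuiteFile maps each testng.xml file to the names of its <test> stanzas.
    // perTest == false mirrors the default per-suite strategy.
    static List<String> workUnits(Map<String, List<String>> testsBySuiteFile,
                                  boolean perTest) {
        List<String> units = new ArrayList<String>();
        for (Map.Entry<String, List<String>> suite : testsBySuiteFile.entrySet()) {
            if (perTest) {
                // One unit per <test> stanza
                for (String testName : suite.getValue()) {
                    units.add(suite.getKey() + "#" + testName);
                }
            } else {
                // One unit for the whole suite file
                units.add(suite.getKey());
            }
        }
        return units;
    }
}
```

With two suite files containing three 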

Great, what do I need to do to use this?

If you want to take advantage of distribution, there is only one little gotcha
you need to be aware of: your tests need to be static-friendly.

Since a slave runs all the tests it receives in the same JVM (for maximum
performance), you need to make sure that the static part of your tests (if any)
is correctly initialized. For example, if you run this test twice in a row remotely:

public class T {
  public static int m_count = 0;

  @Test
  public void f() {
    m_count++;
    assertEquals(1, m_count);
  }
}

... the second time will fail because m_count will have kept its value after the
first run.

Instead, you need to move the initialization into a @Configuration method:
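
The effect can be reproduced without TestNG at all. In the plain-Java sketch below, reset() plays the role that the @Configuration method would play and runOnce() stands in for the framework invoking f() on a slave. Both names, and the class StaticFriendlySketch, are made up for the illustration.

```java
// Hypothetical, TestNG-free simulation of running the same test twice in one JVM.
class StaticFriendlySketch {
    static int m_count = 0;

    // Stand-in for a @Configuration method: re-initializes the static state.
    static void reset() {
        m_count = 0;
    }

    // Same logic as the test method f(), returning the outcome instead of asserting.
    static boolean f() {
        m_count++;
        return m_count == 1;
    }

    // Simulates one remote run; m_count survives between calls because
    // the JVM is not restarted.
    static boolean runOnce(boolean withReset) {
        if (withReset) {
            reset();
        }
        return f();
    }
}
```

Starting from a fresh JVM, calling runOnce(false) twice returns true and then false, which is the failure described above. Calling runOnce(true) returns true every time because the static field is re-initialized before each run.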

The current code in CVS implements everything explained above. You can find an
example hosts.properties in test/ and also master.bat and
slave.bat that are
convenience scripts to launch the various instances of
DistributedTestNG. Please try it and let me
know what you think.

Here are some of the open issues that I'd like to get some feedback on:

Terminology: master / slaves? Any other suggestions? (I'm not a big fan of
these names right now.)

Specify the list of slaves in a .properties file? In
testng.xml? In another XML
file? Allow for an ISlaveProvider so that hosts can be dynamically added
with a plug-in?

Slaves can only receive one connection right now, but I am thinking of
moving to java.nio and allowing multiple connections so they can be shared by
several developers. The idea is for an entire team to share the same
hosts.properties, so different masters could potentially hit
the same slave.

Follow-up question: should multi-connection slaves be sequential or
multi-threaded? Should this be configurable per-slave?

A master could be restricted to certain slaves. All masters would share the
same properties file but only access a subset of the slaves (this would minimize
slave-thrashing caused by developers on the same team running all the tests
at the same time). How could we specify this? Regular expression?
ISlaveFilter?
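
To give that last question something concrete to react to, here is one possible shape for a regular-expression-based filter. Nothing here exists yet: ISlaveFilter is only the name floated above, and its accept method, the RegexSlaveFilter class and the select helper are guesses at what it might look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical interface: decides whether this master may use a given slave.
interface ISlaveFilter {
    boolean accept(String hostAndPort);
}

// Hypothetical implementation: accepts slaves whose "host:port" matches a regex.
class RegexSlaveFilter implements ISlaveFilter {
    private final Pattern pattern;

    RegexSlaveFilter(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    public boolean accept(String hostAndPort) {
        // matches() requires the whole string to match the expression
        return pattern.matcher(hostAndPort).matches();
    }

    // Keeps only the slaves from the shared list that the filter accepts.
    static List<String> select(List<String> allSlaves, ISlaveFilter filter) {
        List<String> selected = new ArrayList<String>();
        for (String slave : allSlaves) {
            if (filter.accept(slave)) {
                selected.add(slave);
            }
        }
        return selected;
    }
}
```

With the shared list "terra:5150 arkonis:5151", a master configured with the expression terra:.* would only ever talk to terra.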

The current version of TestNG (4.5) contains a working implementation of
these features. It is still considered work in progress because I haven't
implemented classloading on the slaves, so the only way you can load a different
set of classes in the slaves (for example after a new check-in) is by killing
the process. A simple strategy could be that when a slave becomes idle for
more than a few minutes, it unloads all its classes.

Please let me know what you think and how useful you think this feature would
be to you...

Posted by cedric at February 8, 2006 12:48 PM

Comments

From two-phase commit terminology, "coordinator" and "participant"?

This is a good idea! Slow test suites discourage testing often.

I think sequential, unless classloaders are used to prevent concurrency issues on statics. I think tests are often written with the unconscious assumption of sequential running.

It strikes me that something similar could be achieved by setting up tests to be run from the command line, and distributing it with conventional grid technology. Those products would handle any resource pool management issues, one would only have to worry about breaking up the request and aggregating the results.

Looks great, but it is unclear how you marshal classes and other configuration to the remote runners.

If it is not there yet, then maybe it is worth looking at a continuous integration server. I believe Continuum has recently gained multi-server support, so it could be interesting to hook into it, so that all the code and required resources could be transparently sent to the remote server for execution, which is already running anyway.

Right now, the only serialization that happens is done by TestNG. The test developer doesn't need to worry about it (no need to make their tests serializable) since only the class names and their results are exchanged between master and slaves.

Posted by: Cedric at February 8, 2006 02:26 PM

As for your other remark, yes, the hard work (making TestNG run remotely) is done, so it should be fairly easy to integrate with any continuous build server.

Posted by: Cedric at February 8, 2006 02:27 PM

Just curious: why did you implement it so that the master has to know its slaves instead of the other way around?

Posted by: Christian at February 8, 2006 02:40 PM

Ah, I *knew* this would come up :-)

No particular reason, I weighed the pros and cons of each and couldn't determine that an approach was blatantly better than the other one, so I just picked one.

The downside of having the clients declare themselves to the master is the risk of creating a "ping storm" on the master, but the advantage of that approach is that the master needs to do a lot less bookkeeping...

I can still be convinced either way... (or even support both models).

Posted by: Cedric at February 8, 2006 02:45 PM

Cedric, are you saying that classes and other dependencies are not delivered to the slaves? How is that supposed to work for more or less complicated and actively changing code?

Posted by: eu at February 8, 2006 03:18 PM

Only the class names (actually, the testng.xml) are transmitted to the slaves. The slaves then simply load these classes and run the tests in them (it is assumed that they have these classes in their classpath).

Posted by: Cedric at February 8, 2006 03:21 PM

This seems like a great place to support zero configuration. Apple's Bonjour or JINI both solve the detection and discovery problems, which means your users don't have to shuffle IP addresses.

Posted by: Jesse Wilson at February 8, 2006 05:10 PM

"The slaves then simply load these classes and run the tests in them (it is assumed that they have these classes in their classpath)"
That doesn't seem to be very distributed, couldn't you load the classes from the server?

Posted by: Geoff at February 8, 2006 05:57 PM

Yes, eventually, I'd like to be able to distribute classes from the server, but for now, the goal is simply to be able to distribute the CPU load...

Posted by: Cedric at February 8, 2006 06:04 PM

"master / slaves" is just fine and very well established and accepted in the computer science jargon. Granted you live in the US, but do you *really* feel you need to be politically correct on this? :)
Merry... holidays ;)

Also, I would use UDP multicast for auto-discovery.
Each test server will register on specific range of multicast group and your test client when you run it will send ONLY one UDP multicast call and will wait for the reply from available test servers...

Posted by: Ruslan Zenin at February 9, 2006 08:03 AM

In addition, your test client might receive statistics from test server (e.g. current CPU load, number of connected clients, number of the tests to be executed on the queue, etc).

Based on this information you can make "smart" load-balancing decisions

Posted by: Ruslan Zenin at February 9, 2006 08:07 AM

I am surprised working in Google you didn't look up the Google File System design before you decided to see if the clients should identify themselves to Master or the other way around.

Another suggestion: explaining in your documentation that all the test slaves need the jar files and a JVM installed, with the correct version dependencies, would also help tremendously.

;-)

--Deva.

Posted by: Deva at February 9, 2006 11:48 AM

I have a "special" use case: I would like to have the same tests run on several machines, just to test several OSes and versions.

Example 2: In fact, I now run my python tests integrated seamlessly into TestNG... and it would be sssssoooooo great to be able to run the python tests on all the OSes I use (W$ / Solaris / Tru64 / Linux / HP / AIX).

Just my 2 cents...

Posted by: Laurent Ploix at February 9, 2006 02:25 PM

How are changed classes distributed to the "slaves"? Is this handled by TestNG or do you manually have to do it with scp or networked file system or something as such?

If it's handled by TestNG I assume there's some form of remote class loader scheme in place. That makes me slightly scared as class loader juggling usually end in tears.

Posted by: Jon Tirsen at February 9, 2006 02:41 PM

Oh, I missed your last paragraph. Sorry. :-)

As I said, if you implement remote class loading be sure to make it optional.

Posted by: Jon Tirsen at February 9, 2006 02:43 PM

"If it's handled by TestNG I assume there's some form of remote class loader scheme in place. That makes me slightly scared as class loader juggling usually end in tears."

I'm not sure I understand your reasoning. If you use the more manual approach with NFS or SCP or whatever, you'll be in the same hell I think?

You've delivered new classes which you need to get into the slave. How do you get it to load those new classes in favour of the old ones? You can restart the slave JVM of course but it's not entirely desirable - what if someone else is running a test at the time? If you don't want to restart the slave that brutally you'll be into classloader games I suspect?

A couple things - one, Laurent has a very good idea. Distributed testing can serve two purposes (1. distribute load over many CPUs and 2. to provide testing over a span of many different OS and hardware platforms). It looks like your approach solves 1. easily but I don't think 2 is supported. Since it seems the master is picking and choosing which tests to send to which slaves. You should have a mode that sends ALL tests to ALL slaves.

The second thing - perhaps use multicasting as someone suggested to help alleviate the prior knowledge of which slaves go to which master and/or vice versa. Slaves/masters need only know the addr/port of the multicast. You could use different addr/ports if you have different master/slave configurations, but in the typical case, both master and slave could default to a commonly agreed upon rendezvous point and therefore no pre-configuration of IPs/hostnames would be required. The master merely belts out a call "who wants to take this test" and listens for a slave to reply. First one wins, or whatever strategy you want to use. But that initial handshake is all you would need; once a slave calls in, it can send additional info (for example, which TCP/IP port the master can use to communicate directly with the slave).

Posted by: John at February 24, 2006 10:16 PM

I'd like to second John's comment on the need/value of being able to distribute tests to different platforms to validate software on different OS/hardware platforms. At the least you should include the ability to run ALL tests on all clients, not just divvy the tests up among them. An alternative system is STAF (http://staf.sourceforge.net/index.php) but it suffers greatly from difficult test development.

Regarding distributing the clients/slaves, this sounds like good application for JXTA (http://jxta.org), the Java P2P system.