Testing Python and C# Code

By Gigi Sayfan, February 06, 2013

Monkey patching and reflection are just a couple of ways to test complex systems.

Because RabbitMQ was a new third-party piece of software to be used as a critical component of our system, I wanted to test its integration thoroughly. That involved multiple tests against a local cluster of three nodes (all running on my local machine), as well as the same tests running against a remote RabbitMQ cluster. The tests involved tearing down, recreating, and configuring the cluster in different ways, and then stress-testing it. Setting up and configuring a remote RabbitMQ cluster involves multiple steps, each normally taking less than a second; on occasion, though, a step can take up to 30 seconds. Here is a typical list of the necessary steps for configuring a remote RabbitMQ cluster:

Shut down every node in the cluster

Reset the persistent metadata of every node

Launch every node in isolated mode

Cluster the nodes together

Start the application on each node

Configure virtual hosts, exchanges, queues, and bindings
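The sequence above can be sketched as a simple driver. Everything below is a hypothetical illustration: the node names, the run_on() helper (a stand-in for a Fabric remote call), and the exact rabbitmqctl invocations are assumptions, not Elmer's actual code:

```python
# Hypothetical sketch of the cluster (re)configuration sequence.
# run_on() stands in for a remote shell call (e.g., a Fabric run());
# here it just records what would be executed, in order.

NODES = ["rabbit@node1", "rabbit@node2", "rabbit@node3"]

def run_on(node, command, log=None):
    """Stand-in for a remote shell call; records the command."""
    if log is not None:
        log.append((node, command))

def configure_cluster(nodes, log=None):
    # 1. Shut down the application on every node in the cluster.
    for node in nodes:
        run_on(node, "rabbitmqctl stop_app", log)
    # 2. Reset the persistent metadata of every node.
    for node in nodes:
        run_on(node, "rabbitmqctl reset", log)
    # 3. Relaunch the first node, then cluster the others with it.
    first = nodes[0]
    run_on(first, "rabbitmqctl start_app", log)
    for node in nodes[1:]:
        run_on(node, "rabbitmqctl join_cluster %s" % first, log)
    # 4. Start the application on each remaining node.
    for node in nodes[1:]:
        run_on(node, "rabbitmqctl start_app", log)
    # 5. Configure virtual hosts, exchanges, queues, and bindings
    #    (vhosts via rabbitmqctl; the rest via a client library).
    run_on(first, "rabbitmqctl add_vhost /test", log)
```

In the real process, each step must complete on every node before the next step starts, which is exactly where the waiting (and the time variation) comes from.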

I created a Python program called Elmer that uses Fabric to remotely interact with the cluster. Due to the way RabbitMQ manages metadata across the cluster, you have to wait for each step to complete for every node in the cluster before you can execute the next step; and checking the result of each step requires parsing the console output of shell commands (yuck!). Couple that with node-specific issues and network hiccups and you get a process with high time variation. In my tests, in addition to graceful shutdown and restart of the whole cluster, I often want to violently kill or restart a node.
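As an illustration of the console scraping involved, a check that all nodes joined the cluster might scan the cluster_status output for each expected node name. The node names and the sample output below are hypothetical, not Elmer's actual code:

```python
def all_nodes_clustered(cluster_status_output, expected_nodes):
    """True if every expected node name appears in the console
    output of a remote `cluster_status` command."""
    return all(node in cluster_status_output for node in expected_nodes)

# Hypothetical fragment of rabbitmqctl cluster_status console output:
sample = ("Cluster status of node rabbit@node1 ...\n"
          "[{nodes,[{disc,[rabbit@node1,rabbit@node2,rabbit@node3]}]},\n"
          " {running_nodes,[rabbit@node3,rabbit@node2,rabbit@node1]}]")
```

Substring checks like this are brittle, which is part of why each step needs verification before the next one runs.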

From an operations point of view, this is not a problem. Launching a cluster or replacing a node is a rare event, and it's OK if it takes a few seconds. It is quite a different story for a developer who wants to run a few dozen cluster tests after each change. Another complication is that some use cases require testing unresponsive nodes, which can lead to a variant of the halting problem (is the node truly unresponsive or just slow?). After suffering through multiple test runs where each test was blocked for a long time waiting for the remote cluster, I ended up with the following approach:

Elmer (the Python/Fabric cluster remote control program) exposes every step of the process

A C# class called Runner can launch Python scripts and Fabric commands and capture the output

A C# class called RabbitMQ utilizes the Runner class to control the cluster

A C# class called Wait can dynamically wait for an arbitrary operation to complete

The key was the Wait class. It has a static method called Wait.For() that lets you wait for an arbitrary operation to complete, up to a certain timeout. If the operation completes quickly, Wait.For() bails out early rather than blocking for the full duration; if the operation doesn't complete in time, Wait.For() returns once the timeout expires. Wait.For() accepts a duration (either a TimeSpan or a number of milliseconds) and a function that returns bool. It also has a Nap member variable that defaults to 50 milliseconds. When you call Wait.For(), it calls your function in a loop, napping between calls, until the function returns true or the duration expires. If the function returns true, Wait.For() returns true; if the duration expires first, it returns false.
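The article describes Wait.For() in C# without showing its source; here is an equivalent sketch in Python that reproduces the described behavior. The names wait_for and nap_seconds are my own, and the 0.05-second default mirrors the 50-millisecond Nap default:

```python
import time

def wait_for(predicate, timeout_seconds, nap_seconds=0.05):
    """Call predicate() repeatedly until it returns True or the
    timeout expires, sleeping nap_seconds between attempts.
    Returns True on success, False on timeout."""
    deadline = time.time() + timeout_seconds
    while True:
        if predicate():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(nap_seconds)
```

If the predicate succeeds on the first call, the function returns immediately with no napping at all, which is what keeps the common case fast.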

I call Wait.For() with a duration of 10 seconds, a delay I wouldn't want to block on in full every time I check whether a node is down (which happens all the time). The anonymous function I pass in calls the rmq() method with the status command. The rmq() method runs the status command on the remote cluster and returns the command-line output as text; when the node is running, that output lists the rabbit and mnesia applications.

The function makes sure that the mnesia and rabbit components don't show up in the output. Note that if the node is still up, the function returns false and Wait.For() keeps calling it. Wait.For() decreases the sensitivity of my tests to occasional spikes in response time (I can Wait.For() longer without slowing down the test in the common case), and it reduced the runtime of the whole test suite from minutes to seconds.
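In Python terms, a predicate along these lines could be handed to the waiting loop. The string patterns and the sample outputs are assumptions about what the remote status command prints, not the article's exact code:

```python
def node_is_down(status_output):
    """True once neither the mnesia nor the rabbit application
    appears in the remote status command's console output."""
    return "{mnesia," not in status_output and "{rabbit," not in status_output

# Hypothetical console output while the node is still running:
up = "[{running_applications,[{rabbit,...},{mnesia,...},{os_mon,...}]}]"
# Hypothetical console output once the node has gone down:
down = "Error: unable to connect to node: nodedown"
```

While the node is up, the predicate keeps returning False, so the waiting loop naps and retries until the shutdown actually completes or the 10-second budget runs out.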

Conclusion

This series of articles has shown a variety of design principles and testing techniques for dealing with hard-to-test systems. Nontrivial code will always contain bugs, but deep testing is guaranteed to reduce the number of undiscovered issues.

Gigi Sayfan specializes in cross-platform object-oriented programming in C/C++/ C#/Python/Java with emphasis on large-scale distributed systems, and is a long-time contributor to Dr. Dobb's.

