Engineering your Spark application for testing

I'm sure you know all about the advantages of testing software. You may also know about Apache Spark, a distributed computing platform. Many of the tutorials and example programs on the Spark site are short and simple enough to verify by inspection. That's fine for short examples, but as your problems grow more complex, you need to take explicit steps to make sure that what you're writing can be tested.

Unit testing with Spark

The Spark docs allude to the possibility of testing your system. It's quite easy to create a SparkContext in a test and use it to exercise your system. The main thing to be careful of is that you can't reliably run many SparkContexts at once, so I instantiate a single TestContext, wrapping a local SparkContext, and share it amongst all my tests.
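A minimal sketch of that shared context might look like the following - the `sc` member, the master setting, and the app name are illustrative choices of mine, not the post's original code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One SparkContext for the whole test run, created lazily on first
// use, because running several contexts in one JVM is unreliable.
object TestContext {
  lazy val sc: SparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local[2]")    // two local worker threads
      .setAppName("unit-tests")
  )
}
```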

This approach works particularly well when you can perform the same operation on a plain collection as on the Spark RDD and compare the results. I tag these tests as "SparkTest" so I can exclude them when I want a quick run.
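As a sketch of what such a tagged test might look like with ScalaTest - the suite, the data, and the assumption that the shared TestContext exposes its SparkContext as `sc` are all illustrative:

```scala
import org.scalatest.Tag
import org.scalatest.funsuite.AnyFunSuite

// Tagging lets the Spark-backed tests be excluded for a quick run,
// e.g. sbt "testOnly -- -l SparkTest"
object SparkTest extends Tag("SparkTest")

class SumSuite extends AnyFunSuite {
  test("RDD sum matches the plain-collection sum", SparkTest) {
    val data = 1 to 100
    val rdd = TestContext.sc.parallelize(data)
    // The same operation runs on the collection and on the RDD
    assert(rdd.sum() == data.sum)
  }
}
```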

Separating Spark and your algorithm

No one decides they need to break out the map-reduce because they want to sum a lot of numbers or count the lines in their log files. In the end, you'll want to do something interesting, and when you reach that point you need to start building your program for testability. I try to keep my Spark code trivial - most of it is four or five lines long - and all the tricky stuff happens in separate, easily tested modules. I use the Cake Pattern for this.
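A sketch of the shape this takes - the trait names are the ones discussed below, but the two method signatures are placeholders of my own:

```scala
import org.apache.spark.SparkContext

// The tricky, Spark-free algorithm hides behind this trait. The
// method names and signatures are illustrative placeholders.
// Serializable because the methods get shipped to executors.
trait MyAwesomeMapReduceJobWork extends Serializable {
  def mapRecord(line: String): (String, Int)
  def reduceValues(a: Int, b: Int): Int
}

// The Spark side stays trivial: a few lines wiring the work methods
// into the RDD pipeline. The self-type annotation is the Cake
// Pattern - a concrete work implementation must be mixed in later.
trait MyAwesomeMapReduceJob extends Serializable {
  this: MyAwesomeMapReduceJobWork =>

  def run(sc: SparkContext, path: String): Map[String, Int] =
    sc.textFile(path)
      .map(mapRecord)
      .reduceByKey(reduceValues)
      .collect()
      .toMap
}
```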

The result is a separation between the Spark code, which lives only in MyAwesomeMapReduceJob, and the tricky algorithmic details, which live in MyAwesomeMapReduceJobWork.

The final step is to write a trait, MyAwesomeMapReduceJobWorkImpl, which provides the real implementation of the work trait's methods.
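A sketch of what that might look like - the work trait is restated so the snippet stands alone, and the word-count-style behaviour and method names are my own illustration:

```scala
trait MyAwesomeMapReduceJobWork extends Serializable {
  def mapRecord(line: String): (String, Int)
  def reduceValues(a: Int, b: Int): Int
}

// Plain Scala with no Spark imports: everything here can be unit
// tested by calling the methods directly.
trait MyAwesomeMapReduceJobWorkImpl extends MyAwesomeMapReduceJobWork {
  // Key each line by its first word
  def mapRecord(line: String): (String, Int) =
    (line.trim.split("\\s+").headOption.getOrElse(""), 1)

  def reduceValues(a: Int, b: Int): Int = a + b
}
```

Wiring the job together is then a one-liner along the lines of `new MyAwesomeMapReduceJob with MyAwesomeMapReduceJobWorkImpl {}`.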

Once I've got this set up, my tests divide in two as well. The Spark code is tested in one test class, with MyAwesomeMapReduceJobWork replaced by a very simple implementation, so that only the Spark code is under test. Often, that testing implementation includes assertions and checks to make sure it's receiving the expected data from MyAwesomeMapReduceJob.

The other tests exercise the real implementation, MyAwesomeMapReduceJobWorkImpl, and avoid the Spark context entirely.
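The first half might be sketched like this - the work trait is restated so the snippet stands alone, and the particular checks are illustrative:

```scala
trait MyAwesomeMapReduceJobWork extends Serializable {
  def mapRecord(line: String): (String, Int)
  def reduceValues(a: Int, b: Int): Int
}

// Used only in tests of the Spark code: a deliberately trivial
// implementation whose job is to check the data the job feeds it,
// not to compute anything clever.
trait CheckingWork extends MyAwesomeMapReduceJobWork {
  def mapRecord(line: String): (String, Int) = {
    assert(line.nonEmpty, "the job passed an empty line")
    (line, 1)
  }
  def reduceValues(a: Int, b: Int): Int = a + b
}
```

The implementation tests on the other side reduce to plain method calls and assertions on a directly constructed MyAwesomeMapReduceJobWorkImpl, so they never start a SparkContext.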

Why all the effort?

It's a lot faster - the Spark tests take a long time to run, and being able to test only the implementation means I iterate faster. All the bugs should be in the implementation class, because the Spark class should be trivial.

It's better software engineering - as long as the Spark code stays simple, I haven't tied myself to Spark and could switch it out easily.

I can test that the Spark code is working by putting assertions inside the mocked implementation. This is a lot easier than writing behavioural tests.

As a final note, I've always found unit testing to be incredibly important in data mining and machine learning. Bugs in these applications are frustrating to track down because often I don't know exactly what went wrong, only that I didn't get a good prediction. Aiming for good tests and good test coverage is one way I make sure the results I get are correct.