A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery

Thursday, March 11, 2010

Playing Director

Last weekend was the Academy Award presentations. Fittingly, just before that I had my first theatrical success -- with Actors.

Actors are a Scala abstraction for multiprocessing. I've only really played with multiprocessing once, back in my waning days at Harvard. I tried writing some multithreaded Java code, and the results were pretty ugly. The code soon became cluttered with locks and unlocks and synchronized keywords, but my programs still locked up consistently. Multiple processes can be a real headache.

But, there's also the benefit -- especially since I have a brand new smoking fast oligoprocessor box (I keep some mystery in the precise number). Tools such as bowtie and BWA are multithreaded, but it would be useful to have some of the downstream data crunching tools enabled as well.

Actors are a high-level abstraction which relies on message passing. Each Actor (or non-Actor process) communicates with other Actors by sending an object. The receiving Actor figures out what sort of object has been thrown its way and acts on it. A given Actor will execute its tasks in the order given, but across the cast there are no guarantees; everything is asynchronous. Each Actor behaves as if it has its own thread, though in reality a pool of worker threads manages the execution of the Actors. Threads tend to be heavyweight to start up, so this scheme minimizes that overhead and thereby encourages casts of millions -- but I won't emulate de Mille for some time.
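
As a rough illustration of the mailbox idea, here is a minimal sketch built only on the JDK's LinkedBlockingQueue. It is not the Scala Actors library itself, and it cheats by giving the one performer a dedicated thread instead of sharing a worker pool. The names MailboxSketch, Read, Stop and LengthSummer are all made up for the example.

```scala
import java.util.concurrent.LinkedBlockingQueue

object MailboxSketch {
  // Messages are immutable values; the receiver decides what to do by type.
  sealed trait Msg
  final case class Read(name: String, length: Int) extends Msg
  case object Stop extends Msg

  // Hypothetical stand-in for an Actor: one thread draining one mailbox.
  final class LengthSummer {
    private val mailbox = new LinkedBlockingQueue[Msg]()
    private var total = 0 // touched only by the worker until it exits

    private val worker = new Thread(new Runnable {
      def run(): Unit = {
        var running = true
        while (running) {
          // take() blocks until a message arrives; messages are handled in queue order
          mailbox.take() match {
            case Read(_, len) => total += len
            case Stop         => running = false
          }
        }
      }
    })
    worker.start()

    // Asynchronous send: drop the message in the mailbox and return at once.
    def !(m: Msg): Unit = mailbox.put(m)

    // Wait for the worker to exit, then read its result.
    def awaitTotal(): Int = { worker.join(); total }
  }

  def demo(): Int = {
    val summer = new LengthSummer
    summer ! Read("r1", 36)
    summer ! Read("r2", 50)
    summer ! Stop
    summer.awaitTotal()
  }
}
```

The sender never touches the performer's state directly; it only throws messages and, at the very end, waits for the thread to exit before looking at the total.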

My first round of experiments left me with new bruises -- but I did come out on top. Some lessons learned are below.

First, get the screenplay nailed down as much as possible before involving the Actors. Debugging multithreaded code brings on its own headaches; don't bring down that mess of trouble before you need to. For example, in the IDE I am using (Eclipse with the Scala plug-in, which some day I will rant about) if you hit a debugging breakpoint in one thread the others keep going. In my case, that meant a println statement from my master process saying "I'm waiting for an Actor to finish" -- which kept printing and thereby prevented me from examining variables (because in Eclipse, if Console is being written to it automatically pops to center stage).

A corollary to this is that after several iterations I had improved the algorithm so much that it probably didn't need Actors any more! I really should time it with 0, 1 and 2 Actors (and both permutations of 1 Actor -- the code runs in 3 stages and a single Actor can do either the last one or the last two) -- the code is about 2/3 of the way to enabling that. Actually, one reason I went through a final bit of algorithmic rethink was the fact that the Actor-enabled code was still a time pig -- the rethought version ran like a greased pig.

Second, remember the abstraction -- everything is passed as messages and these messages may be processed asynchronously. More importantly, always pretend that the messages are being passed by some degree of copying. An early version of my code ignored this and had code trying to change an object which had been thrown to an Actor. This is a serious no-no and leads to unpredictable results. You give stage commands to your Actors and then let them work!
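
One way to stay out of this trouble is to make the messages immutable, so there is nothing to change after the throw. The sketch below is only an illustration of that habit, with a hypothetical Batch message; "changing" a batch means building a new one and leaving the sent one alone.

```scala
object ImmutableMessageSketch {
  // Hypothetical message type: a case class holding an immutable Vector.
  final case class Batch(id: Int, reads: Vector[String])

  def demo(): (Batch, Batch) = {
    val sent = Batch(1, Vector("ACGT", "GGCA"))
    // Pretend `sent` has already been thrown to an Actor.
    // copy() builds a fresh Batch; `sent` itself cannot be modified.
    val next = sent.copy(id = 2, reads = sent.reads :+ "TTAG")
    (sent, next)
  }
}
```

Whatever the Actor is doing with the first batch, it still sees exactly the two reads it was handed.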

Third, make sure you let them finish reciting their lines! My first master thread didn't bother to check if all the Actors were done -- which led to all sorts of interesting run-to-run variation in output which was mystifying (until I realized it was a synchronization problem). Checking for being done isn't trivial either. One way is to have a flag variable in your Actor which is set when it runs out of things to do. That's good -- as long as you can easily figure out how to set it. You can also look to see if an Actor is done processing messages -- except checking for an empty mailbox doesn't guarantee it is done processing that last message, only that it has picked it up. One approach that worked for my problem, since it is a simple pipelining exercise, is to have Actors throw "NOP" messages at themselves prior to doing any long process -- especially when the master thread sends them a "FLUSH" command to mark the end of the input stream. Such No OPeration messages keep the mailbox non-empty until the Actor gets done with the real work.
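
For comparison, here is a sketch of a different way to wait for the final curtain. It is not the NOP trick described above: the worker counts down a java.util.concurrent.CountDownLatch only after it has handled the flush message, and the master blocks on that latch instead of peeking at the mailbox. As before it uses a plain thread and queue in place of a real Actor, and FlushSketch, Work, Flush and squareAll are invented names.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue}

object FlushSketch {
  sealed trait Msg
  final case class Work(n: Int) extends Msg
  case object Flush extends Msg // marks the end of the input stream

  def squareAll(inputs: Seq[Int]): Vector[Int] = {
    val mailbox = new LinkedBlockingQueue[Msg]()
    val done = new CountDownLatch(1)
    var results = Vector.empty[Int] // written only by the worker until the latch opens

    val worker = new Thread(new Runnable {
      def run(): Unit = {
        var running = true
        while (running) {
          mailbox.take() match {
            case Work(n) =>
              results = results :+ n * n
            case Flush =>
              // All earlier messages have been fully processed by now.
              running = false
              done.countDown()
          }
        }
      }
    })
    worker.start()

    inputs.foreach(n => mailbox.put(Work(n)))
    mailbox.put(Flush)
    done.await() // the master waits here rather than polling the mailbox
    results
  }
}
```

Because the queue is first-in, first-out and a single thread drains it, the flush is only seen after every earlier message has been finished, which sidesteps the "picked up but not finished" gap of the empty-mailbox check.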

So, I have a working production. I'll be judicious in how I use this, as I have discovered the challenges (in addition to the problems I solved above, there is a way to send synchronous messages to Actors -- which I could not get to behave). Still, I am already thinking of the next Actor-based addition to some of my code -- and my current treatment is pretty complicated. With a plot snip here and a script change there, I should be ready for tryouts!

About Me

Dr. Robison spent 10 years at Millennium Pharmaceuticals working with various genomics & proteomics technologies & working on multiple teams attempting to apply these throughout the drug discovery process. He spent 2 years at Codon Devices working on a variety of protein & metabolic engineering projects as well as monitoring a high-throughput gene synthesis facility. After a brief bit of consulting, he rejoined the cancer drug discovery field at Infinity Pharmaceuticals in May 2009. In September 2011 he joined Warp Drive Bio, a startup applying genomics to natural product drug discovery. Other recurring characters in this blog are his loyal Shih Tzu Amanda and his teenaged son alias TNG (The Next Generation).
Dr. Robison can be reached via his Gmail account, keith.e.robison@gmail.com
You can also follow him on Twitter as @OmicsOmicsBlog.