A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery

Tuesday, May 29, 2007

Taking out the garbage

Tonight is garbage & recycling night at the home office, which is a good way to free up some space. Lately, I've been thinking about the same subject in my work.

My first two serious programming languages, Basic & APL, both had garbage collection, but only for resizable arrays. The challenge in garbage collection is figuring out what isn't needed, and arrays make that simple.

Later, when I dealt with Pascal, C, and C++, all languages lacking garbage collection, I cheated and just allocated gross amounts of memory every time. The problems were simple enough, and the memory large enough, that I could get away with not managing it.

Then I moved to two garbage collecting languages, Perl and Java (plus a proprietary language called BTL). Since I was now writing object-oriented programs that threw off lots of big objects (to hold big genomic sequences and their features), I needed to start thinking about garbage collection.

Computer programs need memory, which they can slurp in two general fashions: static and dynamic. Static allocation is my old trick: ask for more than you think you will ever need, and hope that (a) you never need more and (b) being such a hog doesn't cause other problems. Dynamic allocation requests memory as needed, but you must also free that memory again, or it ends up indistinguishable from progressive static allocation. Some dynamic allocation is quite painless, such as the allocation of local variables in a recursive routine (say, a recursive implementation of Smith-Waterman). Variables are created as you descend the recursion, then freed again as you unwind it on your way back up.
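To make the painless case concrete, here's a toy recursive alignment scorer in Python (a deliberately naive stand-in for a real Smith-Waterman, invented for illustration): each call's local variables live only in that call's stack frame, so they appear as the recursion descends and vanish as it unwinds, with no garbage collector needed.

```python
def align(a, b):
    # Each call allocates its own locals in its own stack frame;
    # they are freed automatically as the recursion unwinds.
    if not a or not b:
        return -(len(a) + len(b))              # gap penalty of 1 per leftover base
    match = 1 if a[0] == b[0] else -1
    return max(align(a[1:], b[1:]) + match,    # match / mismatch
               align(a[1:], b) - 1,            # gap in b
               align(a, b[1:]) - 1)            # gap in a

print(align("ACGT", "ACGT"))                   # identical sequences score +1 per base
```

(A real implementation would use dynamic programming; this exponential-time version exists only to show stack-frame allocation.)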

Far more complex problems arise if your dynamic process throws off things which persist. For example, suppose each step of the recursive descent creates an object which may or may not be captured by some other object. Now we have a problem: the recursion may unwind, but the memory for those objects can't be reclaimed unless they are truly unused.

A very common form of garbage collection is reference counting, in which the language implementation keeps track of the number of times each object is pointed to by something else. If the reference count drops to zero, the object can be reclaimed. There are two catches. First, such systems are less efficient than might be obvious, because you do a lot of incrementing and decrementing of counters. My object might have a billion other objects pointing to it, but I must still track every change in who is referencing it. The bigger problem is cycles: if a set of objects forms a circular chain of references to each other, most reference counting garbage collectors will never collect them. The simplest case is an object which points to itself. Now, my favorite computer scientist (in part because he shares both my mitochondrial and Y-chromosome sequences, as well as quite a few of my SNPs and other genetic variants) points out that there are reference counting garbage collectors which can collect cycles, but these do not seem to have entered common usage.
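The cycle problem is easy to demonstrate in Python, whose CPython implementation uses reference counting plus a supplementary cycle detector. This sketch (illustrative only, and CPython-specific) switches the cycle detector off to mimic a pure reference counter like Perl's:

```python
import gc
import weakref

class Node:
    pass

gc.disable()                  # rely on reference counting alone, Perl-style

a = Node()
a.self_ref = a                # the simplest cycle: an object pointing at itself
probe = weakref.ref(a)        # a weak reference lets us watch whether it survives

del a                         # drop the only external reference...
assert probe() is not None    # ...but the refcount never hit zero: the cycle wins

gc.enable()
gc.collect()                  # CPython's cycle detector reclaims it
assert probe() is None
```

Under pure reference counting that first assertion is the leak: nothing in the program can reach the object any more, yet its count never reaches zero.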

Perl uses a simple reference counting garbage collector, so one must be careful to either avoid forming cycles or explicitly break them. You can avoid cycles, but it tends to be a pain -- lots of storing ids and running lookups on them instead of just having objects refer to one another directly. It's the equivalent of trying to converse with someone by saying "could you tell me your spouse's SSN so I can look her up?", instead of the person just pointing you directly at the spouse.
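One escape from the SSN game is a weak reference: a pointer that doesn't bump the reference count (Perl offers this via Scalar::Util's weaken). The sketch below shows the same idea in Python; the Person class and marry helper are invented for illustration:

```python
import gc
import weakref

class Person:
    def __init__(self, name):
        self.name = name
        self.spouse = None            # will hold a weak reference

def marry(a, b):
    # Weak references in both directions: each object still "points at"
    # the other, but no strong cycle is formed.
    a.spouse = weakref.ref(b)
    b.spouse = weakref.ref(a)

gc.disable()                          # reference counting only, as in Perl
alice, bob = Person("alice"), Person("bob")
marry(alice, bob)
assert alice.spouse().name == "bob"   # calling the weakref dereferences it

probe = weakref.ref(bob)
del bob                               # no strong cycle, so the refcount hits zero
assert probe() is None                # reclaimed immediately, no collector needed
gc.enable()
```

The cost is that a weak link can go stale: once bob is gone, alice.spouse() returns None, and the code must cope with that.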

Many languages sport better garbage collectors. Python defaults to reference counting but supplements it with a cycle-detecting collector, which is enabled by default. Java, whatever I assumed when I wrote it, has apparently relied on tracing collectors (mark-and-sweep and, later, generational schemes) rather than reference counting for a long time. C# & Ruby likewise use tracing approaches.

Of course, having garbage collection isn't a panacea. I tend to think in terms of rich networks of objects, and no garbage collector can collect such a network while any part of it remains reachable. So some manual management is inevitable: just as an overloaded ship might toss empty barrels overboard, I end up having to sever parts of the network when I am done with them. Some of my current code particularly needs these techniques, because it depth-first searches a rich object hierarchy to reach very large amounts of data at the leaves. Once a result is generated for a leaf, the pile of data for that leaf is superfluous and can be discarded. But since I'm in Perl, I need to be careful about the discards or all will be for naught: leave a circular reference in, and all that debris clings to the mother ship. So the flush() methods I write often hack mercilessly -- and, I hope, do no collateral damage.
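A flush() in this spirit might look like the following Python sketch (the Tree and Leaf classes are invented; the point is severing every strong link before discarding the leaf, so a pure reference counter can reclaim it):

```python
import gc
import weakref

class Leaf:
    def __init__(self, data):
        self.data = data
        self.parent = None

class Tree:
    def __init__(self):
        self.leaves = []
    def add(self, leaf):
        leaf.parent = self        # parent <-> child links, a tiny "rich network"
        self.leaves.append(leaf)
    def flush(self, leaf):
        # Sever both directions; any link left dangling would pin the
        # leaf (and its pile of data) in memory under reference counting.
        self.leaves.remove(leaf)
        leaf.parent = None

gc.disable()                      # mimic Perl: reference counting only
tree = Tree()
leaf = Leaf("x" * 10_000)         # stand-in for a big pile of per-leaf data
tree.add(leaf)
probe = weakref.ref(leaf)

result = len(leaf.data)           # "generate a result for the leaf"
tree.flush(leaf)                  # done with it: cut it loose
del leaf
assert probe() is None            # reclaimed the moment the last strong link broke
gc.enable()
```

Forget either line of flush() and the leaf's data clings to the mother ship until the process exits.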

Garbage collection is just one of the seismic forces that seems to be compressing my programming style in Perl. Will the rock fold or fracture? It still isn't clear, but it's clearly a subject for another post.

About Me

Dr. Robison spent 10 years at Millennium Pharmaceuticals working with various genomics & proteomics technologies & working on multiple teams attempting to apply these throughout the drug discovery process. He spent 2 years at Codon Devices working on a variety of protein & metabolic engineering projects as well as monitoring a high-throughput gene synthesis facility. After a brief bit of consulting, he rejoined the cancer drug discovery field at Infinity Pharmaceuticals in May 2009. In September 2011 he joined Warp Drive Bio, a startup applying genomics to natural product drug discovery. Other recurring characters in this blog are his loyal Shih Tzu Amanda and his teenaged son alias TNG (The Next Generation).
Dr. Robison can be reached via his Gmail account, keith.e.robison@gmail.com
You can also follow him on Twitter as @OmicsOmicsBlog.