Bioinformatics day-by-day

Fast interval selection in Python

Lately, I’ve been working on a program (bam2cov.py) that computes depth of coverage from a BAM file for a given set of genomic intervals, e.g. genes or exons. The performance of this task depends heavily on the time needed to select the intervals of interest.

At first, I stored the intervals as a list of tuples and used the filter function to select the intervals of interest, but the performance of this implementation wasn’t satisfactory (1m29s for the test set).
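A minimal sketch of what such a list-of-tuples selection might look like. The tuple layout (start, end, strand, id), the overlap criterion and the name select_overlapping are assumptions made for illustration, not the actual bam2cov.py code:

```python
# Hypothetical layout: (start, end, strand, interval_id)
intervals = [
    (100, 1000, 1, 0),
    (1500, 2000, 0, 1),
    (900, 1600, 1, 2),
]

def select_overlapping(intervals, start, end):
    """Return the intervals that overlap the query range [start, end)."""
    # filter() calls the lambda once per tuple, so every interval
    # costs a Python-level function call on every query.
    return list(filter(lambda iv: iv[0] < end and iv[1] > start, intervals))

selected = select_overlapping(intervals, 950, 1200)
print(selected)
```

Each query scans the whole list in pure Python, which is likely where most of the time goes when there are many intervals and many queries.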

Later, I switched to numpy.array, which turned out to be much faster (0m07s). Notably, the numpy implementation also uses less memory, as objects are stored more efficiently in an array (13 bytes per interval: 3 x 4 bytes for ‘uint32’ + 1 byte for ‘bool_’) than in a list of tuples (88 bytes per interval, as reported by sys.getsizeof((100,1000,1,100))).
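A rough sketch of the numpy variant, using a structured array with three ‘uint32’ fields and one ‘bool_’ field. The field names (start, end, strand, id), the overlap criterion and the name select_overlapping are assumptions for illustration; only the field types and the 13 bytes per record come from the numbers above:

```python
import numpy as np

# Hypothetical field names; 3 x uint32 + 1 x bool_ = 13 bytes per record
interval_dtype = np.dtype([
    ("start", np.uint32),
    ("end", np.uint32),
    ("strand", np.bool_),
    ("id", np.uint32),
])

intervals = np.array(
    [(100, 1000, True, 0), (1500, 2000, False, 1), (900, 1600, True, 2)],
    dtype=interval_dtype,
)

def select_overlapping(intervals, start, end):
    """Return the records that overlap the query range [start, end)."""
    # Both comparisons run over whole columns at once, producing a
    # boolean mask that is then used to index the array.
    mask = (intervals["start"] < end) & (intervals["end"] > start)
    return intervals[mask]

selected = select_overlapping(intervals, 950, 1200)
print(selected["id"])           # ids of the selected intervals
print(interval_dtype.itemsize)  # bytes per interval record
```

The comparisons are vectorised, so the per-interval work happens in compiled code rather than in a Python-level callback.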