Lately, I’ve been working on a program that computes depth of coverage from a BAM file given some genome intervals, e.g. genes or exons (bam2cov.py). The performance of this task is heavily affected by the time needed to select the intervals of interest.

At first, I stored the intervals as a list of tuples and used the filter function to select intervals of interest, but the performance of this implementation wasn’t satisfactory (1m29s for my test set).
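To illustrate, here is a minimal sketch of that first approach; the (chrom, start, end, strand) tuple layout and the select_overlapping helper are assumptions made for the example, not the actual bam2cov.py code:

```python
# Hypothetical interval layout: (chrom_id, start, end, strand)
intervals = [(1, 100, 1000, True), (1, 2000, 2500, False), (2, 50, 900, True)]

def select_overlapping(intervals, chrom_id, start, end):
    """Pick intervals on a chromosome that overlap [start, end) using filter()
    over a plain list of tuples -- a full Python-level scan for every query."""
    return list(filter(lambda iv: iv[0] == chrom_id
                       and iv[1] < end
                       and iv[2] > start,
                       intervals))
```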

Later, I switched to numpy.array, which turned out to be much faster (0m07s). Notably, the numpy implementation also uses less memory, as objects are stored more efficiently in an array (13 bytes per interval: 3x 4 bytes for ‘uint32’ + 1 byte for ‘bool_’) than in a list of tuples (88 bytes per interval; sys.getsizeof((100,1000,1,100))).
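Under the same assumed layout, the numpy variant could look like the sketch below: a packed structured array with three ‘uint32’ fields plus one ‘bool_’ is exactly 13 bytes per record, and selection becomes a vectorised boolean mask instead of a Python loop.

```python
import numpy as np

# 3x uint32 + 1x bool_ = 13 bytes per interval (packed structured dtype)
interval_dtype = np.dtype([("chrom", "uint32"), ("start", "uint32"),
                           ("end", "uint32"), ("strand", "bool_")])

intervals = np.array([(1, 100, 1000, True),
                      (1, 2000, 2500, False),
                      (2, 50, 900, True)], dtype=interval_dtype)

def select_overlapping(intervals, chrom_id, start, end):
    """Vectorised selection of intervals overlapping [start, end)."""
    mask = ((intervals["chrom"] == chrom_id)
            & (intervals["start"] < end)
            & (intervals["end"] > start))
    return intervals[mask]
```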

Unfortunately, my genome of interest is not hosted at UCSC, so I needed to alter bam_to_bigwig slightly. In addition, I’ve replaced the read-counting function mapped_read_count with my own implementation that relies on the BAM index and therefore returns the number of alignments nearly instantly. This saves several minutes for large BAMs.
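Counting via the index can be as simple as the sketch below (assuming pysam; this illustrates the idea rather than reproducing my exact replacement): pysam exposes the per-reference counts stored alongside the BAM index, so no pass over the alignments is needed.

```python
import pysam

def mapped_read_count(bam_path):
    """Return the number of mapped alignments from the BAM index statistics
    instead of iterating over the reads."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return bam.mapped  # requires an index (.bai) next to the BAM
```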

I was often challenged with accessing thousands or millions of files over a network file system (NFS). As I update some of the stored files only once in a while, I decided to store them in multiple TAR archives, which reduced the data complexity. Still, there was the issue of random access to the files within each archive.

First, I had a look at tar indexer. Its simplicity is brilliant. Yet, it stores the index in a raw text file and can handle only a single tar file. Therefore, I ended up writing my own tar_indexer tool, which uses sqlite3 to store the index and allows indexing of multiple tar archives. It can be easily incorporated into any Python project.
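The core idea is small enough to sketch here (this is an illustration, not the actual tar_indexer code): record each member’s data offset and size in an sqlite3 table, then read any file back with a single seek().

```python
import sqlite3
import tarfile

def index_tar(db_path, tar_path):
    """Store the data offset and size of every member of a raw tar archive."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS files
                    (archive TEXT, name TEXT, offset INTEGER, size INTEGER)""")
    with tarfile.open(tar_path, "r:") as tar:  # "r:" = uncompressed tar only
        rows = [(tar_path, m.name, m.offset_data, m.size)
                for m in tar if m.isfile()]
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()

def read_member(db_path, name):
    """Fetch one file from whichever indexed archive holds it."""
    conn = sqlite3.connect(db_path)
    archive, offset, size = conn.execute(
        "SELECT archive, offset, size FROM files WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    with open(archive, "rb") as fh:
        fh.seek(offset)
        return fh.read(size)
```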

Note that only raw (uncompressed) tar files are accepted, as a native tar.gz cannot be randomly accessed. But you can compress each file with zlib before adding it to the tar. At least, that’s what I do.
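For instance, something along these lines (the helper name is made up for the example): the archive itself stays uncompressed and seekable, while each member is zlib-compressed.

```python
import io
import tarfile
import zlib

def add_compressed(tar, path, arcname):
    """zlib-compress a file and add the compressed bytes to an open raw tar."""
    with open(path, "rb") as fh:
        data = zlib.compress(fh.read())
    info = tarfile.TarInfo(name=arcname)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Reading back: grab the raw bytes (e.g. via the sqlite3 index above)
# and zlib.decompress() them to recover the original file content.
```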