Vogon Poetry, Computers and (some) biology

With the MinHashcraze currently going on in the lab,
we started discussing how to calculate signatures efficiently,
how to index them for search and also how to distribute them.
As a proof of concept I started implementing a system to read public data available on the Sequence Read Archive,
as well as a variation of the Sequence Bloom Tree using Minhashes as leaves/datasets instead of the whole k-mer set (as Bloom Filters).

Since this is a PoC,
I also wanted to explore some solutions that allow maintaining the least amount of explicit servers:
I'm OK with offloading a queue system to Amazon SQS instead of maintaining a server running RabbitMQ,
for example.
Even with all the DevOps movement you still can't ignore the Ops part,
and if you have a team to run your infrastructure,
good for you!
But I'm a grad student and the last thing I want to be doing is babysitting servers =]

Going serverless: AWS Lambda

The first plan was to use AWS Lambda to calculate signatures.
Lambda is a service that exposes functions,
and it manages all the runtime details (server provisioning and so on),
while charging by the time and memory it takes to run the function.
Despite all the promises,
it is a bit annoying to balance everything to make an useful Lambda,
so I used the Gordon framework to structure it.
I was pretty happy with it,
until I added our MinHash package and,
since it is a C++ extension,
needed to compile and send the resulting package to Lambda.
I was using my local machine for that,
but Lambda packaging is pretty much 'put all the Python files in one directory,
compress and upload it to S3',
which of course didn't work because I don't have the same library versions that Amazon Linux runs.
I managed to hack a fix,
but it would be wonderful if Amazon adopted wheels and stayed more in line with the Python Package Authority solutions
(and hey, binary wheels even work on Linux now!).

Anyway,
after I deployed the Lambda function and tried to run it...
I fairly quickly realized that 5 minutes is far too short to calculate a signature.
This is not a CPU-bound problem,
it's just that we are downloading the data and network I/O is the bottleneck.
I think Lambda will still be a good solution together with API Gateway
for triggering calculations and providing other useful services despite the drawbacks,
but at this point I started looking for alternative architectures.

Back to the comfort zone: Snakemake

Focusing on computing signatures first and thinking about other issues later,
I wrote a quick Snakemake rules file and started calculating signatures
for all the transcriptomic datasets I could find on the SRA.
Totaling 671 TB,
it was way over my storage capacity,
but since both the SRA Toolkit and sourmash have streaming modes,
I piped the output of the first as the input for the second and... voila!
We have a duct-taped but working system.
Again,
the issue becomes network bottlenecks:
the SRA seems to limit each IP to ~100 Mbps,
it would take 621 days to calculate everything.
Classes were happening during these development,
so I just considered it good enough and started running it in a 32-core server hosted at Rackspace
to at least have some signatures to play with.

Offloading computation: Celery + Amazon SQS

With classes over,
we changed directions a bit:
instead of going through the transcriptomic dataset,
we decided to focus on microbial genomes,
especially all those unassembled ones on SRA.
(We didn't forget the transcriptomic dataset,
but microbial genomes are small-ish,
more manageable and we already have the microbial SBTs to search against).
There are 412k SRA IDs matching the new search,
totalling 28 TB of data.
We have storage to save it,
but since we want a scalable solution (something that would work with the 8 PB of data in the SRA,
for example),
I avoided downloading all the data beforehand and kept doing it in a streaming way.

I started to redesign the Snakemake solution:
first thing was to move the body of the rule to a Celery task
and use Snakemake to control what tasks to run and get the results,
but send the computation to a (local or remote) Celery worker.
I checked other work queue solutions,
but they were either too simple or required running specialized servers.
(and thanks to Gabriel Marcondes for enlightening me about how to best
use Celery!).
With Celery I managed to use Amazon SQS as a broker
(the queue of tasks to be executed,
in Celery parlance),
and celery-s3 as the results backend.
While not an official part of Celery,
using S3 to keep results allowed to avoid deploying another service
(usually Celery uses redis or RabbitMQ for result backend).
I didn't configure it properly tho,
and ended up racking up \$200 in charges because I was querying S3 too much,
but my advisor thought it was funny and mocked me on Twitter (I don't mind,
he is the one paying the bill =P).
For initial tests I just ran the workers locally on the 32-core server,
but... What if the worker was easy to deploy,
and other people wanted to run additional workers?

Just to keep track: we are posting Celery tasks from a Rackspace server
to Amazon SQS,
running workers inside Docker managed by Kubernetes on GCP,
putting results on Amazon S3
and finally reading the results on Rackspace and then posting it to IPFS.
IPFS is the Interplanetary File System,
a decentralized solution to share data.
But more about this later!

HPCC workers

Even with Docker workers running on GCP and the Rackspace server,
it was progressing slowly and,
while it wouldn't be terribly expensive to spin up more nodes on GCP,
I decided to go use the resources we already have:
the MSU HPCC.
I couldn't run Docker containers there (HPC is wary of Docker,
but we are trying to change that!),
so I used Conda to create a clean environment and used the requirements
file (coupled with some PATH magic) to replicate what I have inside the Docker container.
The Dockerfile was very useful,
because I mostly ran the same commands to recreate the environment.
Finally,
I wrote a submission script to start a job array with 40 jobs,
and after a bit of tuning I decided to use 12 Celery workers for each job,
totalling 480 workers.

This solution still requires a bit of babysitting,
especially when I was tuning how many workers to run per job,
but it achieved around 1600 signatures per hour,
leading to about 10 days to calculate for all 412k datasets.
Instead of downloading the whole dataset,
we are reading the first million reads and using our streaming error trimming
solution to calculate the signatures
(and also to test if it is the best solution for this case).

Clever algorithms are better than brute force?

While things were progressing,
Titus was using the Sequence Bloom Tree + Minhash code to categorize the new datasets into the 50k genomes in the [RefSeq] database,
but 99\% of the signatures didn't match anything.
After assembling a dataset that didn't match,
he found out it did match something,
so... The current approach is not so good.

(UPDATE: it was a bug in the search,
so this way of calculating signatures probably also work.
Anyway,
the next approach is faster and more reasonable,
so yay bug!)

Yesterday he came up with a new way to filter solid k-mers instead of doing
error trimming (and named it... syrah?
Oh, SyRAh...
So many puns in this lab).
I created a new Celery task and refactored the Snakemake rule,
and started running it again...
And wow is it faster!
It is currently doing around 4200 signatures per hour,
and it will end in less than five days.
The syrah approach probably works for the vast majority of the SRA,
but metagenomes and metatranscriptomes will probably fail because the minority members of the population will not be represented.
But hey,
we have people in the lab working on that too =]

Future

The solution works,
but several improvements can be made.
First,
I use Snakemake at both ends,
both to keep track of the work done and get the workers results.
I can make the workers a bit smarter and post the results to a S3 bucket,
and so I only need to use Snakemake to track what work needs to be done and post tasks to the queue.
This removes the need for celery-s3 and querying S3 all the time,
and opens the path to use Lambda again to trigger updates to IPFS.

I'm insisting on using IPFS to make the data available because...
Well, it is super cool!
I always wanted to have a system like bittorrent to distribute data,
but IPFS builds up on top of other very good ideas from bitcoin (bitswap),
and git (the DAG representation) to make a resilient system and,
even more important,
something that can be used in a scientific context to both increase bandwidth for important resources (like, well, the SRA)
and to make sure data can stay around if the centralized solution goes away.
The Cancer Gene Trust project is already using it,
and I do hope more projects show up and adopt IPFS as a first-class dependency.
And,
even crazier,
we can actually use IPFS to store our SBT implementation,
but more about this in part 2!

Two weeks ago,
during the CS graduate orientation at UC Davis,
the IT person was showing that we have access to 50 GB on Box and UNLIMITED storage on Google Drive
(I'm currently testing the limits of UNLIMITED, stay tuned).
I promptly forgot about Box because I've been using Syncthing for file syncing and it's been working quite well,
or at least way better than owncloud (which took infinite amounts of time to sync my stuff).

Anyway,
today I was checking PaperShip (wonderful app, by the way)
and noticed my Zotero free account was almost full (it is only 300 MB).
I knew Zotero had WebDAV support,
but I don't want to maintain a WebDAV server.
Turns out Box has WebDAV support,
and even found instructions on how to set everything up.

Since last month I'm a PhD student in MSU at Tituslab, and my
research is focused on building infrastructure for exploring and
merging multiple read types and multiples assemblies.

Titus and all the labbies are awesome mentors, and I'm making some
progress while learning how do deal with this brave new world.

One thing I'm doing is checking how good is a Gallus gallus Moleculo
sequencing dataset we have, which is being used for the
chicken genome sequence improvement project. The first question was:
How many of Moleculo raw reads align to the reference genome, and how much is
the coverage?

To answer these questions we are using bwa-mem to do the alignments,
samtools to work with the alignment data and a and a mix of Bash and
Python scripts to glue everything together and do analysis.

With the reference genome available, we need to prepare it for the BWA algorithms by constructing its FM-index. The command to do this is

In [ ]:

!bwa index galGal4.fa.masked.gz

SAMtools also require preprocessing of the original FASTA file:

In [ ]:

!samtools faidx galGal4.fa.masked.gz

I had 10 files with Moleculo reads, varying from 500 bp to 16 Kbp. In this example let's assume all reads are in the same file, reads.fastq, but in the original analysis I ran the next commands inside a Bash for-loop.

Let's align reference and reads using the BWA-MEM algorithm:

In [ ]:

!bwa mem galGal4.fa.masked.gz reads.fastq > reads.fastq.sam

Next we are going to optimize the reads.fastq.sam file, transforming it into the BAM format (a binary version of SAM). We also sort based on leftmost coordinates and index the file for faster access:

We can query our alignments using the view commands from samtools. How many reads didn't align with the reference?

In [ ]:

!samtools view -c -f 4 reads.fastq.sorted.bam

In my case I got 7,985 reads that didn't align to the reference, from a total of 1,579,060 reads.

In [ ]:

!samtools view -c reads.fastq.sorted.bam

There were 4,411,380 possible alignments.

But how good is the coverage of these alignments in the reference genome? To do this calculation I refactored calc-blast-cover, resulting in bam-coverage. The idea is to create arrays initialized to zero with the size of the sequence for each sequence in the reference. We go over the alignments and set the array position to 1 if there is an alignment
matching this position. After doing this we can calculate the coverage by summing the array and dividing by the sequence size.

To make it easier to use I've started a new project (oh no! Another bioinformatics scripts collection!)
with my modifications of the original script. This project is called bioinfo and it is available
in PyPI.
Basic dependencies are just docopt (which is awesome, BTW) and
tqdm (same). Additional dependencies are needed based on which command
you intend to run. For example, bam_coverage needs PySAM and screed. At first
this seems counter-intuitive, because the user need to explicitly install new
packages, but this way avoids another problem: installing all the packages in
the world just to run a subset of the program. I intend to give an informative
message when the user try to run a command and dependencies are missing.

If you've never used Python before, a good way to have a working environment is to use Anaconda.

Since bioinfo is available in PyPI, you can install it with

In [ ]:

!pip install bioinfo

To see available commands and options in bioinfo you can run

In [ ]:

!bioinfo -h

Running the bam_coverage command over the alignment generated by BWA-MEM:

Since 2009 I was a Mac user. I was working with iOS development, and it made
sense to have a MacBook for the SDK. I was curious too, because I've been
using Linux distros (Debian, then Ubuntu, then Gentoo when Ubuntu was getting
too heavy for my old laptop) for some time and was a bit tired of making
everything work. Losing control was discomfortable at first, but so many
things working out of the box (like sleep and hibernation!) was worth it. And
Mac apps were much more polished (oh, Garageband).

When I arrived at INPE I got a Linux workstation, the mighty Papera (all the
computers there have Tupi names, Tupi being a language spoken by native
indians here in Brazil). And I tested some new things, like using Awesome1
as a window manager, and love it. But it lasted just for some months, because
the machines were swapped for some iMacs and Papera was assigned for other
person. I missed a tiling manager, but I also found Homebrew2 and it helped
a lot setting up a dev environment in OSX (I know macports and fink existed,
but writing a Homebrew formula is pretty easy, I even contributed one back),
so no big problems in the transition.

But after some time I was getting uneasy. New OSX versions seemed to remove
features instead of adding then (sigh, matrix-organized Spaces...). Lack of
expansibility on new laptops (despite MacBook Air being an awesome computer)
was pushing me back too, because a maxed one would cost way more than I was
willing to pay. And I was spending most of my time in SSH sessions to other
computers or using web apps, so why not go back to Linux?

At the end of 2012 I bought a used ThinkPad X220 with the dock and everything.
When I was younger I always liked the visual, with its black and red look, and
the durability (MacBooks are pretty, but they are easy to scratch and bend).
And the X220 was cheap and in perfect state, and with a small upgrade when I
went to PyCon (ahem, 16 GB RAM and a 128 GB SSD) it is a BEAST now. And all
these benefits too:

When I need a new package it is as easy to write a PKGBUILD file as it was
to write a Homebrew formula. I wrote some Debian packages in the past and
they worked, but there were so many rules and parts that I don't think I
want to write one again. I recognize that a lot of the rules and parts make
sense with a project as big as Debian (and Ubuntu and everyone else), but
it could be simpler.

Last weekend I went with some friends to Maracanã watch Italy vs Mexico
at the first round of the Confederations Cup.
We got some cheap tickets (R$ 57, about US$ 25) on an area meant for
Brazilians only (and that's why they were so cheap, usually tickets cost
at least double). And it was so good to go again to a stadium, it's a very
different experience from watching a game on TV, where the camera give you
a limited perspective. Our seats were behind one of the goals, and we were
lucky: we saw two goals on our side, one from Mexico (a penalty kick) and
one from Italy, a beautiful goal by Balotelli, who was a bit of a diva, by
the way, always complaining and making drama.

It was the third time I went to a stadium. Previous ones were Grêmio matches,
one versus River Plate in 2002 and other versus Figueirense in 2008. These
games were on Grêmio's last stadium, Olímpico Monumental, which gave way to
a new one, the Arena, built at the entrance of the city (and far from the
old one). This is happening a lot around here, given that next year we have
the World Cup and there are at least ten new stadiums built or reformed for
the competition. At first they should have been prepared with private
funding, but as time passed and they were all late public funding came into
the picture, and only in Maracanã more than one billion brazilian reais were
spent.

They are now much closer to developed countries' stadiums, like those we used
to see on TV. By brazilian standards they aren't even stadiums anymore,
looking more like theaters, were you just sit and watch the game. It
shouldn't be bad, but I couldn't avoid a comparison between this weekend
game and my previous experience. OK, this time the crowd didn't have a
prefered side and so it wasn't cheering up as much as I saw before, but it
was worrisome because it was obvious that almost everyone at the match was
people that could pay for the expensive tickets, and sometimes didn't even
have a strong connection with football, something that always gave a match
that catarsis aura.

Last week I tried to go with my family to see the new Grêmio stadium and we
weren't so lucky. Cheapest tickets cost almost R$ 100 each, and for the four
of us this meant R$ 400 less, not out of league but way too expensive. This
high cost ticket means the usual fan can't go to the stadium anymore,
and is replaced by this new fan, a consumer above everything else. And
maybe it's just a romantic vision, and of course there were a lot of problems
before, but the price paid for comfort might be too expensive in the long run.