So today started out as a pretty frustrating morning due to a random (not really random)
failure in some analysis pipelines on some data I am trying to work on for a collaborator.
The analysis has already taken far longer than expected for various reasons, some of
which are my fault, and some of which aren’t. But given some of the issues that cropped
up, I was inspired to post a little bit of a vent concerning things that end up just annoying
me as a researcher in bioinformatics. Some of these are specific to today, some aren’t,
but I am putting them all here today anyway. In some cases I may call out specific
software. In all cases I appreciate the work put into the tool by its developers. Often
it is a tool I use a lot. Sometimes it is a specific case where it is symptomatic of
larger issues in Bioinformatics software design. After all, if I thought the tool was
complete garbage I wouldn’t care enough to vent about it (unless it was really widely
used and terrible, but that’s something for other posts). OK, so in no particular
order:

In brief, data was aligned to the human reference genome (GRCh37.75) with BWA[1] and
processed using Picard[2], the GATK[3], and VT[4]. Variants were called using six
different variant callers, each with different error profiles and performance
characteristics: MuTect[5], FreeBayes[6], VarDict[7], Pindel[8], Platypus[9],
and Scalpel[10]. The calls were then combined into a unified call set using bcbio-ensemble[11].
Variants were annotated with snpEff[12] and VCFAnno[13] from a variety of data
sources including dbSNP[14], 1000 Genomes[15], The Exome Sequencing Project’s EVS[16],
Ensembl[17], ClinVar[18], and COSMIC[19].
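
As a rough illustration of what combining calls from several callers means, here is a minimal Python sketch of majority-vote ensembling. It is not how bcbio-ensemble is implemented; the function name `ensemble_calls`, the `(chrom, pos, ref, alt)` tuple representation, the `min_callers` threshold, and the example variants are all assumptions made up for this sketch.

```python
from collections import Counter


def ensemble_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` callers.

    calls_by_caller maps a caller name to an iterable of
    (chrom, pos, ref, alt) tuples (a hypothetical representation).
    """
    support = Counter()
    for variants in calls_by_caller.values():
        # Use a set so a caller reporting the same variant twice counts once.
        for variant in set(variants):
            support[variant] += 1
    return sorted(v for v, n in support.items() if n >= min_callers)


# Made-up calls, for illustration only.
calls = {
    "mutect": [("1", 1000, "A", "T"), ("2", 500, "G", "C")],
    "freebayes": [("1", 1000, "A", "T")],
    "vardict": [("2", 500, "G", "C"), ("3", 42, "T", "TA")],
}
print(ensemble_calls(calls))
```

With these made-up inputs, only the two variants seen by two callers survive; the one reported by a single caller is dropped. A real ensemble step would also need to normalize variant representation before comparing calls across callers.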

Like many scientists I can be a bit ‘scatter-brained.’ Stereotypes are sometimes
true, after all. My brain is almost always ‘on’, thinking about a million different
things (or sometimes locked onto something specific), and I can easily lose track of
things. I started really working on my organization, time management, and productivity skills
a few years ago during my post-doc. It started out of necessity: I had a lot of
standing meetings to go to for our project because it was a large, multi-group,
Genome Canada-funded project and I was the sole Bioinformatician. So I had my subset
of Exome sequencing projects that I was following all the way through, but I was also
doing all of the initial data analysis for all projects before it got passed on
to another post-doc or graduate student for further study. As the only Bioinformatician
(even among PIs), I was also involved in lots of the higher-level planning and
meetings as well. Coupled with normal post-doc life I really needed to start living
by my calendar. I also needed to start learning some work-life balance skills in
terms of answering emails at 1am that could easily wait for morning. Later during
my post-doc I was also involved with some friends in getting a startup going. I’m
not involved with that anymore, except in an occasional advisory capacity, but
it definitely made organization even more important. I was doing a few hours in the
offices every morning before heading to the lab, a few hours in the evening at home,
and some random meetings either on Google Hangouts or in person on some days. My calendar
became even more important, but so did things like time tracking and task tracking tools.

I have been using Ubuntu for a long time, and while I don’t hate the Unity desktop
manager, I was growing increasingly disillusioned with it. I’ve also always been irritated
at resource usage. Especially with Compiz turned on, RAM usage is fairly substantial. My
workstation has 16GB of RAM, so I’m not that concerned for general usage, but I also tend
to do a lot of heavy computation on this system and testing for development. When your
processes use RAM in the GB range you want to keep as much free as possible so you don’t
run into any issues. Further, I’m usually running a VirtualBox instance of Windows, because
within the hospital the managed desktops are the only machines that can access
Clinical Applications and the Shared drive. I don’t NEED to run this all of the time,
as the most important thing (Outlook) also runs on my phone. But it is easier if it
is running in the background as much as possible. I give it 4 GB of RAM because otherwise
it tends to run pretty slowly and I hate any sort of lag in my program response.
Like I said, I can turn it off or reduce its RAM usage whenever I need. And if I have Cassandra
running for my testing database, it uses a good 4GB of RAM as well. Anyway, long story short,
I found Unity uses a fair number of resources as well so I decided to do some experimenting.
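
To put rough numbers on that, here is a minimal Python sketch of the RAM budget. The 16 GB total and the 4 GB each for the Windows VM and Cassandra come from the text above; the desktop footprints are hypothetical placeholders, not measurements, and the function name `free_ram_gb` is just for illustration.

```python
TOTAL_RAM_GB = 16

# Figures from the text above: the Windows VM and Cassandra each take about 4 GB.
RESERVED_GB = {"windows_vm": 4, "cassandra": 4}


def free_ram_gb(desktop_gb, total_gb=TOTAL_RAM_GB, reserved=RESERVED_GB):
    """RAM left for analysis jobs after fixed services and the desktop."""
    return total_gb - sum(reserved.values()) - desktop_gb


# Hypothetical desktop footprints, for illustration only; not measurements.
for name, desktop_gb in [("heavier desktop", 1.5), ("lighter desktop", 0.5)]:
    print(f"{name}: {free_ram_gb(desktop_gb):.1f} GB free")
```

With the VM and Cassandra both running, only about half of the machine is left before the desktop takes its share, which is why trimming the desktop’s footprint matters for jobs that use RAM in the GB range.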

It’s been a while and all of the cluster components are here and have been running
through burn-in at the datacentre for a few weeks. We had some minor hiccups
waiting on a power cable to come in for the Cisco 10G switch, because apparently
they had to use a power connection just different enough from all other computer equipment
that you need to buy theirs, and of course it was back-ordered.

Over the last few years I have been doing a lot of experimentation and development
work (mostly unpublished) surrounding things like genomic pipelines and ways of
managing and exploring genomic level data (focused on rare variant analysis in humans).
While there are plenty of exceptional programs and tools out there for this (particularly
pipelines), we bioinformaticians do like to tinker and re-invent the wheel a lot. Sometimes
this is bad (I’ve been guilty of this in the past), and sometimes it isn’t. We also all tend to approach
executing and configuring things (again, particularly pipelines) in our own particular ways,
so sometimes even the best software can be a chore for us to use because some early step
just doesn’t seem right to us.

Following on from my previous post about why I think moving towards microservice-style
designs will be beneficial to bioinformatics, I want to discuss a project I am
currently working on, how I envision the different microservices fitting together,
and why I’m going that route.

Last time I discussed a bit about the needs we had identified and outlined for supporting
next-generation sequencing in a clinical setting from a bioinformatics perspective.
I focused a bit at the end on the storage solution we are using, which is
based on the Storinator product. We have three of the 4U units in-house now with some
drives on the way. The 10 GbE switch is also now in the data centre and other than that
we are just waiting on our compute solution to be delivered. For our compute option
we went with the Dell FX2 platform. The FX2 is an example of the recent trend of moving towards converged
architectures to simplify operations and generally reduce costs. This platform comes in a variety of
configurations and densities; we opted for the 4-node compute option with I/O aggregators
to simplify our networking. The 4 compute nodes themselves actually communicate over the backplane of the FX2,
meaning that communication between nodes doesn’t need to go to the switch and back.
That is definitely one of the main advantages of the aggregator over the Ethernet pass-through module
they offer, and the cost difference is pretty minimal. With 10 GbE between the storage nodes and compute, overall
we should have a blazing fast cluster.

As part of my job as a Clinical Bioinformatician getting a Next-Generation
Sequencing-based Diagnostics test up and running I am designing and building the
small-scale computing cluster that we need to support this. Now we don’t need a
tremendous amount of computing power since we are engaging in targeted sequencing
coming from Illumina MiSeq benchtop sequencers. Of course while the primary purpose
of this computing resource is to support clinical diagnostic needs, by purchasing
these sequencers in the hospital we will also be supporting research using the
equipment and research by other Faculty members who are new to NGS. As the sole
Bioinformatician in the hospital, and one of the few working on human genomics at
the University I tend to collaborate on a lot of diverse projects. So the cluster
needs to support that as well, with all clinical work receiving priority tasking.