Sunday, 20 October 2013

I intended to keep this blog focused on technical topics, but in this post I'll extend the scope a bit and write about my current research interests in bioinformatics and how they relate to software engineering. These are not necessarily the topics I spend most of my working time on, as my first priority is to provide computational and statistical support to the ongoing research of other scientists. But they are closely related, and they are the ones that keep me busy when I find some time free of other commitments. There are two topics I am particularly interested in: the robustness and the reproducibility of bioinformatics analyses.

Robustness

Bioinformatics has experienced dramatic growth over the last 15 years. Rapidly evolving experimental data and the adoption of new technologies drive the rapid evolution of the computational methods. This fast pace of development often comes at the expense of software-engineering rigour. As a result, bugs creep in, and code that worked a day ago no longer does. The results of a bioinformatics analysis can change significantly from one version of a tool to the next. I have experienced this myself on a number of occasions while using third-party tools to analyze data from genomic experiments. The consequence is increased uncertainty about the validity of results.

I believe that this situation can be significantly improved if we borrow from software engineering the tools and techniques that have been developed over the last decade to maintain the quality and robustness of code over its entire life cycle (e.g. unit tests [1][2], static code analysis [3][4], continuous integration [5][6]). A fortunate circumstance is that over the same period the public repositories of biological data (e.g. at the EBI and the NCBI) have accumulated a vast amount of experimental data. This opens up exciting opportunities. As an example, data from a large number of independently produced studies can be re-used to populate unit tests for many of the popular bioinformatics tools. Executing such unit tests, either ad hoc or as part of an automated routine, would help us identify situations where different versions of the same tool produce discrepant results across a large number of studies. Such automated meta-analysis would increase the reliability of biological studies and would allow us to focus on the studies and results where re-interpretation may be necessary.
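
To make the idea concrete, below is a minimal sketch (in Python, with pytest) of what such a version-regression test could look like. The tool binaries, the file path, and the accession are hypothetical placeholders, not references to real tools or endpoints.

```python
# Minimal sketch of a version-regression test for a bioinformatics tool.
# "aligner-1.0"/"aligner-1.1" and the FASTQ path are hypothetical; in
# practice the input would come from a public repository such as ENA/SRA.
import hashlib
import subprocess

STUDY_FASTQ = "data/SRR000001.fastq"  # one study out of many in the suite

def run_tool(binary: str, fastq: str) -> str:
    """Run one version of the tool and return a digest of its output."""
    result = subprocess.run(
        [binary, "--input", fastq, "--output", "-"],
        capture_output=True,
        check=True,
    )
    return hashlib.sha256(result.stdout).hexdigest()

def test_versions_agree():
    # Flag cases where two releases of the same tool disagree on the same
    # public dataset; run across many studies, this becomes the automated
    # meta-analysis described above.
    old = run_tool("tools/aligner-1.0", STUDY_FASTQ)
    new = run_tool("tools/aligner-1.1", STUDY_FASTQ)
    assert old == new, "tool versions 1.0 and 1.1 produce discrepant output"
```

Run under continuous integration, a suite of such tests covering hundreds of public studies would surface discrepancies as soon as a new tool release appears.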

Reproducibility

Bioinformatics workflows are becoming increasingly sophisticated and varied. They tend to apply in succession a growing number of tools, with each tool requiring multiple input parameters, each of which can modify its behaviour. Over the course of its lifetime a tool may evolve into multiple versions, each producing some variation in its output. Each tool may also require reference data (e.g. genomic or proteomic sequences) which itself evolves through multiple releases and versions. Thus, in order to reproduce a given bioinformatics analysis one needs to capture all the associated metadata (tool versions and parameters, reference versions), which is not always provided in the corresponding publications. The field (and the EBI in particular) has worked on establishing minimum-information reporting standards for a number of experiment types (MIAME [7], MIAPE [8]), but we need to go further and cover entire bioinformatics workflows. What is needed, in short, is a publicly accessible system for executable workflows which captures all the relevant metadata and allows straightforward replication of a bioinformatics analysis.
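
As an illustration, here is a minimal sketch (in Python) of the kind of record such a system would need to capture for each analysis; the field names, tool, and version numbers are illustrative examples, not a proposed standard.

```python
# Sketch of the metadata needed to reproduce an analysis: every tool
# version, every output-affecting parameter, and the reference release.
# The concrete values below are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str         # e.g. the aligner used
    version: str      # the exact release that produced the output
    parameters: dict  # every parameter that can modify the behaviour

@dataclass
class WorkflowRecord:
    reference: str    # reference data release, e.g. a genome build
    steps: list = field(default_factory=list)

record = WorkflowRecord(
    reference="GRCh37 / Ensembl release 73",
    steps=[Step(tool="bowtie2", version="2.1.0",
                parameters={"--local": True, "-p": 4})],
)
```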

There has been extensive work on systems for composing re-usable workflows and capturing the associated metadata (e.g. Taverna [9], Galaxy [10]), but technical limitations in the corresponding implementations have so far restricted their adoption. In particular, it is non-trivial to set up such a system on local infrastructure. Furthermore, in the case of NGS the large data transfers required limit the usability of the most visible public service that supports creating and storing such workflows (the public Galaxy service [11]). Thus, a system is needed that allows both straightforward local deployment and public sharing and execution. It should make it possible to attach a persistent DOI [12] to a bioinformatics workflow and refer to it in a publication, so that other scientists can reproduce the results using either a local or a globally accessible computational resource. Fortunately, recent advances in system virtualisation [13] and in utility computing [14] make both goals feasible.
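
As a rough sketch of how the pieces could fit together, assume the DOI resolves to a versioned container image holding the complete workflow; the registry URL and image-naming convention below are invented for illustration.

```python
# Hypothetical sketch: a workflow DOI maps to a versioned container
# image, so anyone can re-run the published analysis locally. The
# registry and naming convention here are invented, not a real service.
import subprocess

def run_workflow(doi: str, workdir: str) -> None:
    image = "registry.example.org/workflows/" + doi.replace("/", "-")
    subprocess.check_call([
        "docker", "run", "--rm",
        "-v", workdir + ":/data",  # mount the local input/output directory
        image,
    ])

run_workflow("10.0000/example-workflow.v1", "/tmp/analysis")
```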

Plans

My immediate plans (when I have more time) are to focus on two ideas:

(1) Use data at the EBI and NCBI (e.g. ENA/SRA, EGA, ArrayExpress) to produce test suites for automated meta-analysis. I am particularly interested in functional genomics and epigenetics studies involving NGS data (e.g. RNA-Seq, ChIP-seq), but the tools developed are likely to be useful for other types of studies as well.

(2) Develop, in collaboration with other research institutions in the EU and elsewhere, a platform for composing and sharing executable workflows which builds upon the latest advances in utility computing and improves upon existing projects such as Taverna and Galaxy. The focus of this system would again be on NGS data, but the tools would likely be useful for other types of biomedical data.

The two ideas are, in fact, synergistic. The automated meta-analysis would require building a re-usable testware system which would exhibit the main features required by the platform for re-usable and shareable workflows. I imagine that both projects would build upon the same core software module.

About Me

In my day job I am the Head of Biocomputing at the MRC National Institute for Medical Research in London, UK. I am using this blog mainly as a place to put "notes to self" about various technical issues that I encounter. Needless to say, any opinion in this blog is my own and may not reflect that of my employer.