Thursday, December 27, 2012

We might be closer to killing off the "Just take my word for it - I'm pretty sure I did this right" methods section

There is no shortage of well-reasoned articles filled with persuasive arguments about the need for higher reproducible research standards in the scientific literature. For all the good posts about the virtues of reproducible research, they boil down to one overarching concept:

Why is this even an issue? Biologists in particular seem to be collectively and subconsciously reacting to those awful General Chemistry labs where they had you copy down pages of instructions verbatim into your lab notebook. It should come as no surprise that bioinformatics is ground zero for reproducibility activism.

It is unfortunate that reproducible research is tied up with all sorts of other holier-than-thou practices: open access, open source, open data, literate programming, blogging, functional programming. This all-encompassing evangelism tends to polarize people. While wonky über-programmers like C. Titus Brown lay out fundamental practices for reproducibility, most PIs have been publicly paying lip service to the idea of reproducible research while betraying an "I don't wanna eat my vegetables"-type disdain. There are now "consortia" and an "initiative" to compel scientists to actually write their shit down, preferably with door prizes. If you think this has a "posture pals" (video) feel to it, you're not alone. The number of pro-RR articles has steadily increased, but few take them to heart.

This head-against-wall bashing has been the pattern for many years: better tools are now available (RStudio, knitr, Galaxy, cloud computing, figshare, GitHub, Bitbucket) and there is more rah-rah from the blogosphere, but little enforcement from major journals. Now a recent development has raised my hopes, because it indicates editors have been tightening the screws enough to cause discomfort:

A precursor to the dissenting opinion article is Drummond's "Replicability is not Reproducibility: Nor is it Good Science". He draws a distinction between reproducibility and replicability: replicability, rerunning an analysis from the submitted data and code, is what the reproducible research movement actually demands, while reproducibility is the more generalizable, scientifically meaningful standard. Requiring researchers to submit their data and code, replicable research, is in this view a narrow concept really only useful for ferreting out scientific misconduct.

I would argue that ignorance of biological sequence analysis, and even more so of statistics, is a bigger threat than the outright fraud seen in the Duke case. Most bioinformatics manuscripts feature analysis that is not replicable, which is frightening to consider when GWAS and exome NGS variant papers implicate so many genes in disease, many of them sitting along a razor-thin p-value threshold tweaked by several incomprehensible, cherry-picked program parameters.

It is not clear that science can efficiently self-correct. So while replicability is not reproducibility, reproducibility is too slow to substitute for replicability. A manuscript that describes a real, reproducible biological phenomenon is essentially conjecture until it can be repeated. The greatest ferret-legger the world has ever known will live in obscurity until they buy a ferret. We have a culture of scientists who refuse to buy a ferret.

Accounting for Tastes

The other dissenting opinion (here) is from UCSC's Kevin Karplus, who replies to Iddo Friedberg's post recommending a panel of white coat mechanics to help biologists get their code ready for publication. Karplus raises two points:

It is difficult to make polished software for others to use and that is not the point of research.

Replicability is not reproducibility.

Regardless of Friedberg's proposal, railing against "polished software" is simply a straw man argument. Reproducible research in 2013 does not mean robust, extensible, or even well-documented code. Most sequence analysis papers feature very little compiled code; they rely on a series of executable programs glued together with scripting languages, producing intermediate data that is then digested into a report, often written in R.
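To make that concrete, here is a minimal sketch of what such glue might look like when driven from R. The tools, file names, and flags are placeholders rather than anyone's published pipeline; the point is only that every command and parameter ends up written down in one runnable file that also regenerates the report.

# A hedged sketch, not a real pipeline: the tools, files, and flags below are
# placeholders. The point is that the exact command lines and the report live
# in one script anyone can rerun.
library(knitr)

run <- function(cmd, args, out = "") {
  message(paste(c(cmd, args), collapse = " "))   # log the exact command line
  status <- system2(cmd, args, stdout = out)
  if (status != 0) stop("step failed: ", cmd)
}

# hypothetical alignment steps producing intermediate files
run("bwa",      c("mem", "-t", "4", "ref.fa", "reads_1.fq", "reads_2.fq"), out = "aln.sam")
run("samtools", c("sort", "-o", "aln.bam", "aln.sam"))
run("samtools", c("index", "aln.bam"))

# report.Rmd reads aln.bam and rebuilds every figure and table from scratch,
# so rerunning this one script reproduces the whole analysis
knit2html("report.Rmd")

A Makefile or plain shell script serves the same purpose; the medium matters less than the fact that the parameters are no longer trapped in someone's bash history.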

Getting these sequence analysis workflows to be reproducible will not require a highly skilled platoon of developers. Any willing researcher can submit a shell script or a build script of commands provided they avoid these common pitfalls:

As our toolset and research community matures, these excuses, er, obstacles will eventually disappear. But there is one scenario that will always be true in some of the more competitive arenas of bioinformatics programming (e.g. structure prediction, de novo assembly):

The researcher was perfectly capable of submitting code but decided to retain a competitive advantage.

"Over-CASPed" researchers who are unwilling to divulge their secret sauce should be relegated to appropriate sandboxes.

Replication does not prove a biological truth, but without it we often don't even have fleeting proof that a scientist did what they said they did.

Which brings us back to those damn chemistry labs. While many public-access talk shows find chemists willing to argue against evolution, you would be hard-pressed to find one who would argue against writing shit down.

In other words: Not writing shit down is an even worse idea than creationism.

Monday, February 20, 2012

Despite my current ranking of 15th on Biostar, the myriad page views of my BAS™ post (albeit mostly misdirected perverts), and the positive response to my celebrated campaign against more microarray papers, for some reason I was not "comped" an all-expenses-paid trip as honorary blog journalist to this year's Advances in Genome Biology and Technology, which is kind of like CES for sequencing people, except AGBT is still worth attending. Normally the oversight would not bother me, as bioinformatics itself is not the focus of this meeting, but the flood of #AGBT tweets would not let me forget this fact and I was forced to stew and blog in envy.

The first game changing disruptive revolutionary thing from England since 1964

Even from my distant perch it was obvious that all the scientific presentations at AGBT were overshadowed by a 17-minute showstopping demo from Clive Brown of Oxford Nanopore, a company that by all appearances would either die, focus on some minor stuff, or bring it. They chose the third option, and in so doing boosted the "Clive index" to unprecedented levels. OxN's recent decision to enlist famed geneticist and serial startup advisor George Church struck me as a huge gamble, as the string of Route 128 flameouts touting his name led me to assume long ago that Church had stowed away some cursed Tiki idol in his luggage like Bobby in that episode of the Brady Bunch. However, after reading up on OxN, I had to admit I was just bitter about Dr. Church's refusal to invest in my chain of Polonator-based paternity testing clinics, Yo Po'lonatizz!™

Two new sequencer platforms were announced:

A MinION. Forget to hit eject before removing this and you will instantly lose $900.

The MinION, a $900 "disposable" USB drive which detects minute changes in voltage incurred by the passage of DNA through a robust and delicious lipid bilayer. Finally, a device capable of sequencing filthy rabbit blood right on the spot!

The GridION system, a scalable rack-mounted sequencer which, despite some lack of pricing clarity, should produce an actual $1000, 15-minute human genome by 2013.

These exotic machines must be truly game-changing because they made properly expanding Albert Vilella's NGS sequencer spreadsheet quite difficult. The MinION, in particular, could be viewed as a free device with $900 of consumables. This effectively lowers the bar to getting high-throughput sequence in the doctor's office to a 100% unamortized billable transaction. These things also claim fucking unlimited read lengths.

Expression microarrays, SAGE, 454, ABI SOLiD, and now Pacific Biosciences have all left bad tastes of uncertainty and dissatisfaction in the mouths of scientists. It is easy to disappoint people on a grand scale with a $700,000 machine, but $900 worth of chemicals in a USB drive is a different animal, and it seems likely this invention will find a following if it delivers on even a fraction of what it promises.

More cringeworthy marketing from the West coast

The Oxford Nanopore machines are so jaw-dropping, in fact, that Jonathan Rothberg is already crying vaporware. His complaints do seem warranted, given the disappointments that followed past years' announcements and the lack of publicly available sequence from these devices.

Unfortunately Ion Torrent has spent all of its goodwill on an inane and ham-fisted advertising war against Illumina's MiSeq, an intentionally crippled opponent. Seemingly orchestrated by castoffs from the Celebrity Apprentice, this assault began with cringe-inducing derivations of Apple commercials and has expanded to include a sort of "feature combover." Through some convoluted logic involving consensus, a professional whiteboard artist attempts to convince the public that the homopolymer error rate is actually lower on the Ion Torrent PGM than on the MiSeq. This is the sequencing equivalent of having your mom try to convince you that two apples are better than one devil dog, or some such utter nonsense.

My response was predictably measured and cerebral.

This is not the first time I have tweet-confronted Ion Torrent over its odious approach. All this is rather unnecessary because overall, and despite the homopolymer issues, the utility of the PGM has been more or less within expectations. The MiSeq is also exactly within expectations, since it is basically a transparent, measly 1/50th slice of a HiSeq. The same cannot really be said for the RS, whose error rate is clearly far above what was expected at the outset. So if anyone requires an aggressive smokescreen-type marketing campaign (or a new machine) it is Pacific Biosciences.

Wednesday, January 18, 2012

With bonus R code

It came as a shock to learn from PubMed that almost 900 papers were published with the word "microarray" in their titles last year alone, just 12 shy of the 2010 count. More alarming, many of these papers were not of the innocuous "Microarray study of gene expression in dog scrotal tissue" variety, but dry rehashings along the lines of "Statistical approaches to normalizing microarrays to the reference brightness of Ursa Minor".

It's an ugly truth we must face: people aren't just using microarrays, they're still writing about them.

Reading another treatise on microarray normalization in 2012 would be just tragic. Who still reads these? Who still writes these papers? Can we stop them? If not, when can we expect NGS to wipe them off the map?

Starting the count at 1997 seems fair. Here I plot both microarray and next-generation sequencing papers (by title word) per year. We see kurtosis is working in our favor, and LOESS seems to agree!
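For the curious, here is a sketch of how one might pull those counts and draw the LOESS. The rentrez/ggplot2 choices and the exact search terms are illustrative guesses rather than the original bonus code, with 1997 as the starting year and mdf as the combined data frame.

# A rough sketch: count PubMed papers per year with "microarray" or
# "next generation sequencing" in the title, then plot with a LOESS smoother.
# Assumes the rentrez and ggplot2 packages are installed.
library(rentrez)
library(ggplot2)

count_titles <- function(term, years) {
  sapply(years, function(y) {
    Sys.sleep(0.4)  # be polite to NCBI's rate limits
    q <- sprintf('"%s"[Title] AND %d[PDAT]', term, y)
    entrez_search(db = "pubmed", term = q, retmax = 0)$count
  })
}

years <- 1997:2011   # 1997 is a fair start
mdf <- rbind(
  data.frame(year = years, papers = count_titles("microarray", years),
             platform = "microarray"),
  data.frame(year = years, papers = count_titles("next generation sequencing", years),
             platform = "NGS")
)

ggplot(mdf, aes(x = year, y = papers, color = platform)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(y = "papers with term in title")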

Thursday, October 6, 2011

Security Enhanced Linux (SELinux) is an extra, hidden layer of permissions that makes configuring things more difficult, without ever identifying itself as the culprit - kind of like ACLs, but more cryptic. Though it may be more secure, it is not an enhancing experience to deal with, and probably not worth it for the average user.

For example, to have Apache serve personal websites (e.g. http://server/~leipzig), it is no longer enough to alter httpd.conf; you will get mysterious 403 errors until you do this (as others have experienced):

chcon -R -t httpd_sys_content_t /home/leipzig   # recursively label the home directory with a type Apache is allowed to read

You forget about this change until xauth starts complaining about stuff for no apparent reason:

I have no idea what these things actually mean, nor any real interest in learning. I'm sure this stuff is great for sysadmin cocktail chat, but at least for private servers it is just another brake on the wheel of getting things done. For the time being I have set the level to "permissive", which means SELinux prints warnings but does not interfere, but I am leaning toward "disabled" or maybe something else:

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=excoriated
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted

Thursday, June 23, 2011

Spending $55k on a 512GB machine (Big-Ass Server™ or BAS™) can be a tough sell when a bioinformatics researcher pitches it to a department head.

Dell PowerEdge r900, available in orange and lemon-lime

Speaking as someone who keeps his copy of CLR safely stored in the basement, ready to help rebuild society after a nuclear holocaust, I am painfully aware of the importance of algorithm development in the history of computing, and the possibilities for parallel computing to make problems tractable.

Having recently spent 3 years in industry, however, I am now more inclined to just throw money at problems. In the case of hardware, I think this approach is more effective than clever programming for many of the current problems posed by NGS.

From an economic and productivity perspective, I believe most bioinformatics shops doing basic research would benefit more from having access to a BAS™ than a cluster. Here's why:

The development of multicore/multiprocessor machines and memory capacity has outpaced the speed of networks. NGS analyses tend to be memory-bound and I/O-bound rather than CPU-bound, so relying on a cluster of smaller machines can quickly overwhelm a network.

NGS has expanded the roster of high-performance applications from BLAST and protein structure prediction to dozens of different little analyses, with tools that change on a monthly basis or are homegrown to deal with special circumstances. There isn't time or ability to rewrite each of these for parallel architectures.

If those don't sound very convincing, here is my layman's guide to dealing with the myths you might encounter concerning NGS and clusters:

Myth: Google uses server farms. We should too.

Google has to focus on doing one thing very well: search.

Bioinformatics programmers have to explore a number of different questions for any given experiment. There is no time to develop a parallel solution for each of these questions, since many of them will lead to dead ends.

Many bioinformatic problems, de novo assembly being a prime example, are notoriously difficult to divide among several machines without being overwhelmed by messaging. Imagine trying to divide a jigsaw puzzle among friends sitting at several tables: you would spend more time talking about the pieces than fitting them together.

Myth: Our development setup should mimic our production setup

An experimental computing setup with a BAS™ allows researchers to freely explore big data without having to think about how to divide it efficiently. If an experiment is successful and there is a need to scale up to a clinical or industrial platform, that can happen later.

Myth: Clusters have been around a long time so there is a lot of shell-based infrastructure to distribute workflows

There are tools for queueing jobs, but they are often of little help in managing workflows composed of parallel and serial steps - for example, waiting for a set of steps to finish before merging their results.

Various programming languages have features to take advantage of clusters. For example, R has SNOW. But Rsamtools requires you to load BAM files into memory, so a BAS™ is not just preferable for NGS analysis with R, it's required.
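For flavor, here is a sketch (the BAM path and core count are made up) of the kind of single-box parallelism I mean: fan the chromosomes out across the cores of one fat machine with mclapply, each worker pulling its slice of the BAM through Rsamtools, with no scheduler and nothing crossing a network.

# A sketch, not a benchmark: count reads per chromosome from one big BAM by
# spreading chromosomes across the cores of a single large-memory machine.
library(Rsamtools)
library(GenomicRanges)
library(parallel)

bam <- "sample.bam"                        # placeholder for a large whole-genome BAM
chroms <- scanBamHeader(bam)[[1]]$targets  # named vector of chromosome lengths

count_chrom <- function(chr) {
  param <- ScanBamParam(which = GRanges(chr, IRanges(1, chroms[[chr]])),
                        what = "pos")
  length(scanBam(bam, param = param)[[1]]$pos)  # reads on this chromosome
}

# every worker shares the same disks and address space - no data crosses a network
counts <- mclapply(names(chroms), count_chrom, mc.cores = 16)
names(counts) <- names(chroms)

No MPI, no job scheduler, no rewriting the tool; just enough RAM and cores in one chassis.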

Myth: The rise of cloud computing and Hadoop means that homegrown clusters are irrelevant, but it also means we don't need a BAS™

The popularity of cloud computing in bioinformatics is also driven by the newfound ability to rent time on a BAS™. The main problem with cloud computing is the bottleneck posed by transferring gigabytes of data to the cloud.

Myth: Crossbow and Myrna are based on Hadoop, so we can develop similar tools

Ben Langmead, Cole Trapnell, and Michael Schatz, alums of Steven Salzberg's group at UMD, have developed NGS solutions using the Hadoop MapReduce framework.

Crossbow is a Hadoop-based alignment and SNP-calling pipeline built on Bowtie.

Myrna is an RNA-Seq pipeline.

Contrail is a de novo short read assembler.

These are difficult programs to develop, and these examples are also somewhat limited experimental proofs of concept, or are married to components that may be undesirable for certain analyses. The Bowtie stack (Bowtie, TopHat, Cufflinks), while revolutionary in its implementation of the Burrows-Wheeler transform, is itself built around the limitations of computers in the year 2008. For many it lacks the sensitivity to deal with, for example, 1000 Genomes data.

The dynamic scripting languages used by most bioinformatics programmers are not as well suited to Hadoop as Java. To imply we can all develop similar tools of this sophistication is unrealistic. Many bioinformatics programs are not even threaded, much less designed to work amongst several machines.

A server with 4 quad-core processors is often adequate for handling these embarrassingly parallel problems. Dividing the work across machines just tends to lead to further embarrassments.

Here is a particularly telling quote from Biohaskell developer Ketil Malde on Biostar:

In general, I think HPC are doing the wrong thing for bioinformatics. It's okay to spend six weeks to rewrite your meteorology program to take advantage of the latest supercomputer (all of which tend to be just a huge stack of small PCs these days) if the program is going to run continously for the next three years. It is not okay to spend six weeks on a script that's going to run for a couple of days.

In short, I keep asking for a big PC with a bunch of the latest Intel or AMD core, and as much RAM as we can afford.

Myth: We don't have money for a BAS™ because we need a new cluster to handle things like BLAST

IBM System x3850 X5 expandable to 1536GB, mouse not included

Even the BLAST setup we think of as the essence of parallelism (a segmented genome index - every node gets a part of the genome) is often not the one that many institutions have settled on. Many rely on farming out queries to a cluster in which every node holds the full genome index in memory.

Secondly, mpiBLAST appears to be better suited to dividing an index among older machines than among today's, which typically have >32GB of RAM. Here is a telling FAQ entry:

I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!

mpiBLAST only yields super-linear speedup when the database being searched is significantly larger than the core memory on an individual node. The super-linear speedup results published in the ClusterWorld 2003 paper describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB (compressed) database on a cluster where each node has 640MB of RAM. A single node search results in heavy disk I/O and a long search time.

http://www.mpiblast.org/Docs/FAQ#super-linear