A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery

Wednesday, March 08, 2017

MinION Leviathan Reads: An Update

Last week I posted a piece on some amazing new nanopore data, only to be red-faced to discover the next morning that I had misread the axes. So I re-posted the piece with the offending data and subsequent analysis in strike-thru font. After I did that, I was informed that the same dataset actually did have leviathan reads, bigger than my misinterpretation.

Another take at that: Josh Quick posted a summary of new nanopore data, I was quick to write it up based on a misunderstanding from a too quick reading of the data, someone was quick to point out my error and then Nick Loman was quick to post more data from their dataset that was consistent with my misreading. Still, errors like that cut me to the quick.

So now rather than re-mangle that other post, I'm going to do a short update here. Josh and Nick generated data by performing a gentle phenol-chloroform DNA isolation of E. coli and then performing the transposase-based rapid 1D nanopore prep on the phenol-chloroform purified DNA (somewhere I got the idea the rod was retained, but that appears to be wrong [correction on 03/08/17]). One takeaway from their data is a 780Kbp alignment to the reference. Another is that the "Robison rule" is still critical: the longest read in the called dataset appears to be over a megabase of line noise (the second longest is over 900Kb but mostly noise). But more importantly, the monster real read wasn't some crazy outlier; the dataset has many huge reads.

For perspective, this one large read is about 1/7th the size of the input genome. It is bigger by a large margin than the second bacterial genome ever sequenced, and about a third the size of the smallest known eukaryotic genome. Or the calculation below
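A quick back-of-the-envelope version of those comparisons, as a sketch only. The genome sizes are rounded figures I am assuming for illustration (E. coli K-12 MG1655 as the input strain, Mycoplasma genitalium as the second bacterial genome sequenced, Encephalitozoon intestinalis as the smallest known eukaryotic genome); none of them come from Josh and Nick's data, and the exact fractions shift with whichever reference sizes you plug in.

```python
# Length of the aligning read discussed in the post, in base pairs.
READ_BP = 780_000

# ASSUMED, rounded genome sizes in base pairs (illustrative, not from the dataset).
GENOMES_BP = {
    "E. coli K-12 MG1655 (assumed input strain)": 4_640_000,
    "Mycoplasma genitalium": 580_000,
    "Encephalitozoon intestinalis": 2_300_000,
}


def fraction_of_genome(read_bp, genome_bp):
    """Return the read length as a fraction of the genome length."""
    return read_bp / genome_bp


for name, size in GENOMES_BP.items():
    frac = fraction_of_genome(READ_BP, size)
    print(f"{name}: one read spans {frac:.2f}x the genome")
```

With these assumed sizes the single read is longer than the whole M. genitalium genome and roughly a third of the E. intestinalis genome; the fraction of the E. coli genome depends on the strain size assumed.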

Plus, this was all from a first attempt. The overall yield of data, 5Gb, was a bit low by current standards -- perhaps 50% lower than experienced nanopore artists such as Josh and Nick are routinely getting. Getting the right ratio of transposase to spooled DNA may take a while. Indeed, Nick has tweeted out teasers from a second attempt using human input and now has an 882Kb aligning read. Personally, I will be surprised if a megabase read isn't reported at or before the London Calling meeting at the beginning of May; too many hot shots are going to want to repeat and extend this work.

(Nick's cheeky "this week" remark was in response to a James Hadfield Twitter poll as to when a veracious megabase read would show up)

It will be most interesting to see how this plays out in utility. For haploid genomes, assembly becomes a bit trivial; Michael Schatz pointed out that these reads are approaching the length of Saccharomyces chromosomes. As another example, given the same diploid (or higher) sample prepared the conventional way versus these megapreps, how different will the final assemblies be, how much more sensitive will megaprep assemblies be for structural variant detection, and how much more contiguous will the assemblies be? For metagenomes, will the data quality issues and the challenge of differential DNA recovery blunt the impact of single reads of half a megabase or more from rare microbes?

To revive my previous comment: if the folks at mapping companies such as BioNano Genomics aren't thinking hard about the potential impact from this, they're sleeping on the job. Certainly the mapping platforms will have higher yield, since their DNA labeling chemistry doesn't involve breaking the sample as the nanopore prep does. But cutting down the number of platforms required for a platinum genome will be appealing for some groups, particularly since the purchase price of a mapping instrument can pay for an awful lot of MinION flowcells. And should Oxford boost output, whether by teaching the community whatever magic they use for whopping in-house yields, by extending flowcell life, by increasing pore speed or by delivering flowcells with even more active channels, then the advantage of mapping-only platforms will shrink further.

Ditto the growing number of linked read companies. Clearly with ONT's unresolved data quality issues, linked reads on Illumina will potentially have an advantage for SNP calling. But again, the gap is narrowed between their products and an all-MinION solution. Each time Oxford makes a substantial improvement in basecalling, as they claim their in-beta 1D^2 chemistry and Scrappie basecaller do, that gap will tighten further.

The uber-long-read capability of nanopore is an advantage over PacBio, though PacBio still retains a significant edge in terms of near-randomness of errors. I noted in the prior piece that the 10-12Gb seen on MinION by a number of labs is higher than the 5-7Gb that current Sequel flowcells are generating. That sparked some lively and interesting back-and-forth comments from Clive Brown and an anonymous commenter. Certainly there will be a lot of back-and-forth as both companies pursue higher throughput. Should Roche/Genia launch, it could provide a third horse in the long read race. Roche in the past tended to emphasize other applications, but now that they have uncoupled from PacBio that attitude could change.

On-rod preps are probably not the only approach worth exploring. Another means of preparing ultra high molecular weight DNA is lysing cells in agarose blocks. This is a slow and tedious process (according to colleagues who have run it), but Sage Science (the maker of the BluePippin) just launched an instrument, the SageHLS, which it claims can purify DNA up to two megabases in size.

Or perhaps microfluidic processing? Two recent items from Paul Blainey (a Nature Communications paper and a preprint co-authored with his postdoctoral adviser Steve Quake) describe soft lithography-based microfluidic devices for lysing cells within the devices. I need to do proper writeups on each, but such microfluidic processing might be gentle enough to preserve ultra-large DNA. Or perhaps something built atop Oxford's electrowetting VolTRAX device (though that isn't operational in the field yet) or perhaps droplet microfluidics.

So stay tuned. Particularly since ONT's Clive Brown is going to give a webcast next Tuesday with the teaser title "GridION X5: The Sequel", a reference to the GridION sequencing grid instrument which Oxford proposed back at AGBT in 2012. The concept of GridION was rackable sequencers that somehow could work together on one problem, then switch to new samples once questions were answered. If you loudly lament that Clive often makes extravagant promises, remember that an 882kbp read is pretty extravagant.

8 comments:

Question from someone who has never looked at ONT data... First plot - I think I understand the x-axis 'template_length', that's the number of base pairs sequenced but the y-axis 'count' makes no sense. When plotting a histogram 'count' means the number of 'events' in each bin - but I find it absurd that there were ~0.35 billion different reads in the bin width around the N50 value. Is 'count' the read length multiplied by the number of 'events'? And if so, isn't that massively deceiving... making it look like there are tons of reads when in fact there were next to none.

Which applications would this be most useful for? And what is the highest accuracy ONT can get by increasing coverage (10x, 20x, 30x)? Is it the 1 in 500 error rate you mentioned in another post? Or can we do even better with ONT?

ONT currently has an issue with systematic error, particularly at homopolymers of >=6. These cannot currently be solved with coverage. The in-beta Scrappie basecaller is touted as significantly solving this problem. That, plus an in-beta chemistry called 1D^2 that allows reading both strands of a fragment but doesn't use a hairpin (and therefore should be outside PacBio's patent), should get the error rate down to the 1 in 500 range, according to the data presented at AGBT.

Clive Brown has a webcast Tuesday; presumably updates on Scrappie and 1D^2 will be given then. I plan to summarize the highlights with a new post.

As far as applications, MinION has been applied to a wide range of problems; the publications list at ONT is a good place to start. Key pluses are low buy-in cost, portability, rapid results, direct eukaryotic mRNA sequencing and long reads. Error rate is the worst issue; bigger Illumina boxes have lower cost/bp if you use them enough.

When you are interested in the tails of a distribution, you should use log scales on the y-axis of the histogram. For the sorts of length distributions you see in long-read technologies, log-log plots are best for viewing the shape of the distribution, which is probably either lognormal or Gumbel.
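To make the two points in this thread concrete (a "count" axis that may really be bases per bin, and log-spaced bins for seeing the tail), here is a minimal sketch. It uses synthetic lognormal read lengths generated on the spot, not the actual dataset, and the helper names `n50` and `log_binned` are made up for illustration. It tabulates, per log-spaced bin, both the number of reads and the total bases in those reads; a histogram weighted by bases makes a handful of very long reads look far larger than their raw count would.

```python
import math
import random


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no reads supplied")


def log_binned(lengths, bins_per_decade=4):
    """Map each log10-spaced bin's lower edge to (read_count, total_bases)."""
    table = {}
    for length in lengths:
        # Index of the log-spaced bin this read falls into.
        idx = math.floor(math.log10(length) * bins_per_decade)
        edge = round(10 ** (idx / bins_per_decade))
        count, bases = table.get(edge, (0, 0))
        table[edge] = (count + 1, bases + length)
    return table


# SYNTHETIC data: lognormal read lengths with an assumed median of about 8 kb.
random.seed(0)
reads = [max(1, int(random.lognormvariate(math.log(8000), 1.0))) for _ in range(20000)]

for edge, (count, bases) in sorted(log_binned(reads).items()):
    print(f"{edge:>8} bp  reads={count:>6}  bases={bases:>12}")
print("N50:", n50(reads))
```

Comparing the `reads` and `bases` columns shows why a base-weighted plot peaks at longer lengths than a plain read-count plot of the same data.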

About Me

Dr. Robison spent 10 years at Millennium Pharmaceuticals working with various genomics & proteomics technologies & working on multiple teams attempting to apply these throughout the drug discovery process. He spent 2 years at Codon Devices working on a variety of protein & metabolic engineering projects as well as monitoring a high-throughput gene synthesis facility. After a brief bit of consulting, he rejoined the cancer drug discovery field at Infinity Pharmaceuticals in May 2009. In September 2011 he joined Warp Drive Bio, a startup applying genomics to natural product drug discovery. Other recurring characters in this blog are his loyal Shih Tzu Amanda and his teenaged son alias TNG (The Next Generation).
Dr. Robison can be reached via his Gmail account, keith.e.robison@gmail.com
You can also follow him on Twitter as @OmicsOmicsBlog.