A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery

Friday, February 17, 2012

Oxford Nanopore Doesn't Disappoint

Oxford Nanopore's AGBT presentation should have just finished up, so the embargo is off. Oxford was kind enough to chat with me last night and to share their press release in advance; on the call were CTO Clive Brown, SAB member Ewan Birney and Director of Communications Zoe McDougall. A real challenge posed by Oxford's news is trying to write about it without slipping into clichéd techspeak about what they will be releasing later this year ("second half").

For example, take a look at the MinION sequencer Oxford announced. Imagine a sequencer which requires minimal sample prep, generates just under a gigabase in less than a day's time for $1000 in consumable cost. Reads are 10s of kilobases long (if your input DNA is) and can read off base modifications, albeit with a 4% raw error rate distributed uniformly along the read. Sounds a bit like a PGM or MiSeq, though trading accuracy for very long reads. Except, that $1000 cost is the entire cost; there's no instrument to buy. And forget the desktop; not only does it fit in your pocket; you could stuff a load of them in there. Brown also says their margins are very healthy; if I actually get to next year's AGBT, will I get a MinION as a standard freebie?

Then there's big brother, the GridION. It also has the no-brainer sample prep, but sequences at 1.4Gb per hour for up to a few days per sample. It's also appropriate to talk in terms of bases per hour instead of bases per run, as the system can be configured to run until it gets a desired answer. I didn't get the cost nailed down as well, but the suggestion is that in large numbers the individual GridION processors might sell in the neighborhood of $30K apiece. So, for the price of an Ion Proton setup you could sequence a human genome to 30X in about half a day.
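As a rough sanity check on those figures, here is a minimal back-of-envelope sketch. The 3.1 Gb genome size and the assumption that throughput scales perfectly linearly across instruments are mine, not Oxford's; the throughput and price are the numbers quoted above.

```python
# Back-of-envelope check of the GridION numbers quoted in the post.
# Assumed (not from Oxford): a 3.1 Gb human genome and perfectly linear scaling.
import math

GENOME_GB = 3.1          # assumption: approximate haploid human genome size
RATE_GB_PER_HOUR = 1.4   # per-instrument throughput quoted in the post
PRICE_PER_NODE = 30_000  # suggested price per GridION processor, in dollars


def nodes_needed(coverage: float, hours: float) -> int:
    """Instruments required to reach `coverage` within `hours`."""
    total_gb = GENOME_GB * coverage
    return math.ceil(total_gb / (RATE_GB_PER_HOUR * hours))


nodes = nodes_needed(coverage=30, hours=12)
print(nodes, "nodes, roughly $", nodes * PRICE_PER_NODE)
```

Under these assumptions, 30X in twelve hours takes a handful of instruments rather than one, which is why the comparison is to the price of a full setup and not to a single processor.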

There's all sorts of whiz bang technology crossed with systematic searches. More than a thousand nanopores screened and about 300 studied in detail. Very sophisticated custom circuits embedded in each sensor to enable distinguishing 64 different signal levels. Each GridION chip carries 2K sensors, to be increased to 8K next year. Pores utilize proteins not previously described in the academic literature. Sample prep can be as simple as a partial restriction digest, or perhaps something slightly more elaborate such as adding a hairpin. Single stranded DNA reads well too, though it can be hard to keep single stranded. I forgot to ask about RNA, but reading between the lines in the press release suggests that direct RNA reading is off in the future. A wide range of input DNA concentrations is possible; at high concentrations the nanopores reload virtually instantly; at low input concentrations an additional trick can obtain similar speeds (by "tethering" the DNA to the membrane).

GridION also comes with a new pricing model. You can opt to pay a lot upfront for the machine and minimize your run costs or pay little (or perhaps even nothing) up front but pay more for your consumables. All promised to be open and transparent; no specialized discounts for this institution or that company.

Base calling was described to me in its outline, and is fascinating. Much effort has gone in the past into trying to achieve single base resolution in the nanopore, and much of the nay-saying in my comments and elsewhere has trumpeted the difficulty of this task. Oxford slices through this Gordian knot by reading bases in triples (who ever heard of anything useful coming from reading bases in threes?). This requires distinguishing 64 different types of signals, but their electronics can do so. Each base is read three times, and a Viterbi algorithm is used to trace the most probable basecall path through the series of signals. In a small way, it is reminiscent of colorspace encoding on SOLiD, as that too read each base multiple times. However, colorspace had a minimalist number of signals. Variations on the theme have been explored as well: pores that read 4 or 5 or even 11 bases at a time (echoes of Nigel Tufnel). The reading in triples also leads to the major error mode, which is small deletions. These appear to be caused by certain nucleotide contexts not moving smoothly through the nanopore.
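To make the triplet idea concrete, here is a toy Viterbi decoder over the 64 possible triplets. It is only a sketch of the general technique, not Oxford's basecaller: the signal level assigned to each triplet, the Gaussian noise model, the `SIGMA` value and the rule that the strand advances exactly one base per signal are all assumptions made up for illustration.

```python
import itertools
import math

BASES = "ACGT"
KMERS = ["".join(p) for p in itertools.product(BASES, repeat=3)]  # 64 states
# Hypothetical signal model: each triplet gets its own distinct mean level.
LEVEL = {kmer: float(i) for i, kmer in enumerate(KMERS)}
SIGMA = 0.5  # assumed noise standard deviation


def log_emission(observed: float, kmer: str) -> float:
    """Gaussian log-likelihood (up to a constant) of a signal given a triplet."""
    return -((observed - LEVEL[kmer]) ** 2) / (2 * SIGMA ** 2)


def viterbi(signals: list[float]) -> str:
    """Most probable base sequence, assuming one base of movement per signal."""
    # score[k] = best log-probability of any path ending in triplet k
    score = {k: log_emission(signals[0], k) for k in KMERS}
    backptrs = []
    for obs in signals[1:]:
        new_score, ptr = {}, {}
        for k in KMERS:
            # Valid predecessors overlap k by two bases: Xab -> abY
            preds = [b + k[:2] for b in BASES]
            best = max(preds, key=lambda p: score[p])
            new_score[k] = score[best] + math.log(0.25) + log_emission(obs, k)
            ptr[k] = best
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state
    state = max(score, key=score.get)
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    # First triplet in full, then the newly exposed base of each later triplet
    return path[0] + "".join(k[-1] for k in path[1:])


def simulate(seq: str) -> list[float]:
    """Signal level for each overlapping triplet of seq, with small fixed noise."""
    kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    return [LEVEL[k] + (0.2 if i % 2 else -0.2) for i, k in enumerate(kmers)]


truth = "GATTACAGGC"
called = viterbi(simulate(truth))
print(called)
```

Because consecutive triplets must overlap by two bases, each base contributes to three signals, and the path constraint lets neighboring signals correct one another. This toy has no way to represent a skipped triplet, so it cannot model the small deletions described above; a more realistic model would need extra transitions for stalls and skips.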

The other huge twist in the informatics is the real time analysis of data. Alas, I failed to get any real details on this, but it does offer both exciting and potentially scary informatics challenges. How, for example, do you go about designing a real-time de novo assembler or SNP caller? We've gotten used to 50X or 100X or 200X coverage of a genome for short reads; if your reads are tens of kilobases, what fold coverage do you really need to assemble a genome or confidently call SNPs? In general, Oxford's systems will present significant (and exciting) challenges to informatics developers, as the long reads (and ability to detect a myriad of base modifications) are quite different than the sorts of data common previously. The real time informatics also plays to the ability to link multiple GridION instruments on the same problem; with 20 of next year's 8K-sensor instruments the system could deliver a human genome in 15 minutes!
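One crude way to get a feel for the coverage question is the classic Lander-Waterman estimate of how many contigs to expect from randomly placed reads. The sketch below is idealized: it assumes error-free reads, uniform random sampling and no repeats, so it says nothing about the 4% error rate or about repeat resolution, which is where long reads really matter. The genome size and overlap thresholds are hypothetical values chosen for illustration.

```python
import math


def expected_contigs(genome_len: float, read_len: float,
                     coverage: float, min_overlap: float) -> float:
    """Lander-Waterman expected contig count: N * exp(-c * sigma)."""
    n_reads = coverage * genome_len / read_len
    sigma = 1 - min_overlap / read_len  # fraction of a read usable for joining
    return n_reads * math.exp(-coverage * sigma)


GENOME = 5_000_000  # hypothetical 5 Mb bacterial genome
for read_len, min_overlap in [(100, 30), (10_000, 1_000)]:
    for coverage in (5, 10, 20):
        contigs = expected_contigs(GENOME, read_len, coverage, min_overlap)
        print(f"read={read_len:>6} cov={coverage:>2}X contigs~{contigs:.2f}")
```

At the same fold coverage, longer reads mean far fewer reads and therefore far fewer gaps to fall into, so the coverage needed for contiguity alone drops sharply. How much coverage is needed to polish away the errors is a separate question.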

There's also the challenge of the data quality; Ewan emphasized a point I've pondered a bit in terms of other systems: do we need to go to much more sophisticated treatment of the data quality than just phred scores in FASTQ format? As I think I've noted before, phred scores are really only designed to estimate if a base was called correctly, not to estimate if a run of bases were either called correctly or not or to estimate whether a base should have been called at all (or whether a base should have been called that wasn't). For example, I could imagine reporting a sequence as an HMM or other probabilistic model that catches much of this complexity. But would downstream software actually utilize such complexity effectively, or would most programs simply take as their first step reducing that to a FASTQ file (or even more bluntly, to a consensus sequence)?
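For reference, a phred score is just a log-scaled per-base error probability, and FASTQ stores one such score per called base as a single character. A minimal sketch of the conversion, using the standard Q = -10 log10(p) definition and the common Phred+33 ASCII encoding (the function names are mine):

```python
import math


def phred_from_error(p: float) -> float:
    """Phred quality for a per-base error probability p."""
    return -10 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Per-base error probability implied by phred quality q."""
    return 10 ** (-q / 10)


def fastq_quality_char(q: int) -> str:
    """Phred+33 encoding: the ASCII character with code q + 33."""
    return chr(q + 33)


q = phred_from_error(0.04)  # the 4% raw error rate quoted for the platform
print(round(q, 1), fastq_quality_char(round(q)))
```

The limitation is visible in the data structure itself: there is exactly one score per base that was called, so a deleted base has no position to carry a score, and nothing links the confidence of one base to its neighbors.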

How will this all play out in the marketplace? I think Oxford is going to cause a lot of discomfort, though at first it will follow Clayton Christensen's path of thriving by creating new markets. Super-long reads, despite high error rates, will be invaluable for de novo genome assembly, haplotyping, structural variant analysis, oncogenomics and metagenomics. Imagine being able to routinely haplotype high quality tumor DNA to determine which mutations and variations are consistently found in cis or in trans. Instead of fighting with making mate pair libraries, imagine scaffolding high quality short reads with 40Kb or greater Oxford reads. In a similar vein, imagine pulling out large blocks of sequence from even very rare members of complex microbial communities. Clive suggested the MinION would be the ideal instrument for undergraduate education or for graduate students impatient with waiting for a core to get their samples done.

Longer term though, Oxford will drive into the existing markets. First, there will be applications in which the speed and convenience of nanopore sequencing will more than compensate for reduced quality. Second, the quality is likely to improve. Oxford has determined that much of the error is systematic, involving sequence motifs that don't play well in the current basecalling scheme. They are confident that improvements in pore engineering will improve performance. Alternatively, different pores have different error profiles, and so the possibility of multiple chip models with different error patterns exists. Clive indicated this might be most profitably applied to damaged DNA.

But going further, there are other ways to up the accuracy. The minimalist sample prep for the system is to generate an overhang of 4-25 bases (more than that works, but gives no advantage), but Brown commented that his favorite sample prep is to ligate a hairpin onto one end. This enables sequencing first one strand, then through the hairpin, and then back along the complementary strand. This has been performed for entire lambda phage DNAs, yielding a 100Kb read composed of two subreads. I could imagine taking this approach further: if a fragment were circularized and then replicated by rolling circle amplification, the product molecule would consist of many repeats of the original molecule, enabling a many-pass consensus. This is akin to Pacific Biosciences' circular consensus mode.
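The hairpin trick amounts to getting two looks at the same molecule in one read. A minimal sketch of how such a read could be split and checked against itself follows; the hairpin sequence, the example template and the function names are all hypothetical, and the comparison handles substitutions only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def split_hairpin_read(read: str, hairpin: str) -> tuple[str, str]:
    """Split a read at the hairpin; return both passes in template orientation."""
    idx = read.find(hairpin)
    if idx < 0:
        raise ValueError("hairpin not found in read")
    first_pass = read[:idx]
    second_pass = read[idx + len(hairpin):]
    return first_pass, reverse_complement(second_pass)


def disagreements(a: str, b: str) -> list[int]:
    """Positions where two equal-length subreads differ (substitutions only)."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


HAIRPIN = "TTTTGGGGTTTT"  # hypothetical adapter sequence, for illustration only
template = "ACGTTGCAAGGC"
# Simulated read: template strand, hairpin, then the complementary strand
# with a single miscalled base introduced at index 3 of the second pass.
second = reverse_complement(template)
second = second[:3] + "A" + second[4:]
read = template + HAIRPIN + second

first_pass, second_pass = split_hairpin_read(read, HAIRPIN)
print(disagreements(first_pass, second_pass))
```

Two passes can flag a disagreement but cannot say which pass is right, which is why a many-pass rolling circle product is attractive. Since the dominant errors are small deletions, a real implementation would need to align the two subreads rather than compare them position by position.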

But, there will definitely be some companies who felt they paved new paths in genomics who will now find themselves Belgian Block. Severe pressure will be first felt by others who offered long range information but not as much and for a high entry cost. Both PacBio and OpGen have amazing technologies (OpGen has a poster at AGBT showing human genome mapping), but for the cost of either instrument you could buy hundreds and hundreds of MinIONs to generate superior data in the same space. Not many companies get big bucks for mate pair reagents, but that's definitely a technology that isn't long for this world. Hard to compete against significant reductions in cost, labor and input DNA requirements, especially when the data is many fold more useful.

MinION and GridION will also pose a huge challenge to anyone else trying to develop a new sequencing technology. There is still room, particularly for low-cost, rapid, high quality sequencing in the clinical space. There might also be space for ultra-long read lengths, though Oxford appears to have not pushed their system hard in that direction. If extremely high molecular weight DNA is carefully prepared, would the processivity limits of their enzyme be reached at BAC scale? Whole bacterial chromosomes? Whole yeast chromosomes? Now that's a technology development paper just begging to be written! But in general, Oxford is setting the bar very high for any other new entrants in the sequencing arena, which may be real trouble for anyone who is significantly behind them or simply promising faster, cheaper short read sequencing.

MinION in particular also presents a serious challenge to service providers at the low end. For several thousand dollars I can get samples run in a few weeks at a number of high quality organizations. That's great, but hard to compete against $1K in a few hours. Yes, there will be applications which require the higher quality of other platforms, but an awful lot of quick-and-dirty experiments do not. Of course, MinION won't just steal business from the providers; there are a lot of experiments which just don't get done because of the turnaround or cost. Indeed, I'm currently knee-deep in such experiments.

For Illumina and Life/Ion, Oxford will probably represent a modest threat at first. As noted, there will be certain niches which are stolen quickly, such as mate pairs. In some areas, the majors will continue to have nearly complete market share, due to the higher quality. In others, such as de novo genome sequencing, hybrid approaches will likely be common: scaffold large numbers of short reads on a few very long reads. But, as improvements in sample prep, informatics or pore engineering drive accuracy higher, the contribution from the short reads will become critical in fewer and fewer examples.

One other challenge Oxford faces is actually rolling out the system. Given all these positive attributes, there is likely to be high initial demand for the systems. MinION in particular could be extremely popular: if you want to know how nanopore data looks for your favorite project, just try a few MinIONs as a cheap pilot. Manufacturing has not gone smoothly for many of their competitors. Stories abound of issues with each platform; PacBio has experienced a much slower rollout than originally promised. Illumina had a "lost quarter" due to a bad reagent batch. Close to home, I'm discovering that Life is having as bumpy a roll-out of the 318 chip as they did with the 316; in each case my service providers were stuck waiting for chips long after the official launch (or worse, my one provider has chips right now but not the reagents to run them!). Marrying sophisticated biochemistry to sophisticated chips is not a trivial manufacturing task, and scaling up production might throw some unwelcome surprises.

I'm sure there will be a lot of conversation on Oxford, both illuminating issues and further fleshing out details. For most of us, that's all we'll be able to get for a while. About a dozen early access sites have been selected, and while they were not named there are a number of obvious suspects in big genome centers. Clive spoke of a hope that others who want to try the system might have some means of testing a few samples in exchange for promising to publish rapidly (I'm already scheming a proposal in that department); he strongly believes that independent data is required to establish any platform's credibility. In the nearer term, some of the data the AGBT talk is based on may be released.

So, Oxford has unveiled an amazing pair of sequencers. Not one which completely clears the field of everyone else, but one which will offer a host of new opportunities for genomics. Now it is up to Oxford to deliver the instruments to the field, and for Oxford and its early access sites to start pumping out data for all to evaluate.

18 comments:

Nice post Keith...but I missed your usual critique of company claims. This read a little like a marketing piece...looking forward to a critical followup.

Did ONT publish the 48kb lambda read (I honestly could have missed it)? If it was halfway decent they most certainly would have. Was there any substantive data shown (to you or otherwise in their public unveiling?)

Having lived through hype cycles of sequencing companies from both the inside and the outside, I encourage the community to be more critical of these announcements.

That said, I'll be the first to beg ONT for a beta test if they can do a fraction of what they're promising.

That's a fair criticism of my piece; it's easy to accept the vendor's word, and while Brown has a track record (Illumina) there's nothing like seeing the data -- except some data in the wild. Which is why I fervently believe every vendor should make evaluation models available to those who run genomics blogs & bulletin boards. :-)

I agree with ECO, there seemed to be very little questioning of the company Kool-Aid on this one, uncharacteristically little. Not even a jab at ONT's marketing folks for selling minions? :-)

Overall, an impressive set of claims, but I remember the suspicion with which PacBio's 85% raw accuracy was met. And they had a tool to sell at the time. 95% is better, but for practical purposes not much better. And as far as I can tell the tool exists only in ONT videos on the web. If they are going to ship in a few months they needed to show more. Maybe they did behind closed doors.

Interesting point about the dilemma this poses to informatics software. With super-long noisy reads wouldn't using repeated scans to improve accuracy get expensive fast?

Lastly, I can just envision PacBio's lawyers checking their bank accounts and salivating. The whole hairpin ligation might as well be renamed to hairpin litigation.

Aside from any technical details and of course whether the companies can deliver what they promise: I completely agree with the future outlook of what the sequencing market will look like in the mid-range (3-5 year). On the one hand cheap long-reads with massive throughput and acceptable accuracy. That would satisfy most research settings. On the other hand, the clinical setting of sequencing 20-50 genes for a specific clinical question, which requires high accuracy, little hands-on-time and cost-scalability with sequence output (as you don't usually have many patients to fill a multiplexed run on the current large platforms). Expensive and laborious sequence enrichment and library prep is prohibitive in the clinical setting.

If you look at current mainstream platforms, I think it's quite clear that they don't totally satisfy any of the two above scenarios. So they will certainly be replaced as soon as other platforms fulfill these demands better. We just have to wait and see whether companies like ONT or GnuBIO actually deliver what they promise.....

Correct me if I'm wrong, but this is the first time I've heard a company talk openly about the error rate of their platform at announcement (4%, I believe). That, for me, lends credence to their claims.

Pretty interesting to read this post again now, after the London Calling announcements three+ years later. I'm starting to see a pattern in how Clive talks about things. But also interesting that some of the GridION plans then are still the plans now for the next sequencer.

About Me

Dr. Robison spent 10 years at Millennium Pharmaceuticals working with various genomics & proteomics technologies & working on multiple teams attempting to apply these throughout the drug discovery process. He spent 2 years at Codon Devices working on a variety of protein & metabolic engineering projects as well as monitoring a high-throughput gene synthesis facility. After a brief bit of consulting, he rejoined the cancer drug discovery field at Infinity Pharmaceuticals in May 2009. In September 2011 he joined Warp Drive Bio, a startup applying genomics to natural product drug discovery. Other recurring characters in this blog are his loyal Shih Tzu Amanda and his teenaged son alias TNG (The Next Generation).
Dr. Robison can be reached via his Gmail account, keith.e.robison@gmail.com
You can also follow him on Twitter as @OmicsOmicsBlog.