A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery

Wednesday, June 15, 2011

E.coli Outbreak Genomics

The E.coli outbreak in Germany continues to be a major news item. It is looking increasingly doubtful that the source of the infection will be conclusively traced, as the German authorities have already named and then backed off two suspects, Spanish cucumbers and German bean sprouts. These activities have not been without repercussions; Spanish agriculture has been hard hit and exports of European produce in general are reportedly hurting.

On the genomics front, the outbreak has demonstrated how quickly bacterial genomics can be run on the current class of instrumentation. BGI Europe knocked off the sequence in a series of Ion Torrent runs in 3 days, and a group at University of Muenster worked at similar speed with the Ion setup as well. Later, sequences have come in from the Illumina and 454 platforms. The public release of this data has engendered a number of public analysis projects.

What is surprising is who is missing from the pack: Pacific Biosciences. PacBio made a splash last winter with a quick run at the Haitian cholera bug, and talks from Eric Schadt have suggested biosurveillance as a market of great interest to the company. So it is quite surprising that they haven't made an announcement. Now, it is possible that they are holding back with a view towards publishing in a journal which frowns on advance publicity (let's hope PLoS makes sure to encourage the open analysis effort to publish within their pages!), but I suspect not. More likely is the problem that there just aren't enough PacBio RS instruments in the field, and not enough connections to make sure that PacBio headquarters got a sample quickly.

Contrast this with OpGen, which has now generated a physical map of the outbreak bug. OpGen has a cool single-molecule restriction mapping technology, but one which I think is in very great danger of being overtaken by sequencing-based mapping approaches. OpGen's big challenge is convincing researchers to buy an expensive instrument which does exactly one thing. In particular, PacBio could well start threatening OpGen's market if they can straighten out strobe sequencing, and other approaches (such as HAPPy mapping or colony-free large insert cloning) could also push it out of the way. To succeed, they need to claw out a foothold before those other technologies become common. The real nightmare for OpGen would be for one of the nanopore or electron microscopy sequencing companies to start generating long reads with some useful data.

The public deposition of the Ion Torrent runs also gives an opportunity to get more data on the current state of this platform. I've done a quick analysis of the 7 BGI runs and 8 from U Muenster/Ion by mapping them to one of the available assemblies. The number of mapped reads per run is an important estimate for a number of applications: how many usable reads can I expect to get? Both groups had averages a bit over the current spec: 109K for BGI and 143K for UM. But they also had standard deviations of over 40% of the mean. BGI actually had only two runs over the 100K spec, but one of these had 218K mapped reads. Their worst run was 72K mapped reads. UM, on the other hand, showed wider swings; 3 runs generated >190K mapped reads, but one run topped out at 41K.
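For anyone wanting to repeat the run-to-run comparison, here is a minimal sketch of the summary I have in mind: the mean of the per-run mapped read counts and the sample standard deviation expressed as a percentage of that mean. The function name `run_summary` and the counts in `example_runs` are made up for illustration; they are not the actual BGI or UM run totals.

```python
from statistics import mean, stdev


def run_summary(mapped_reads):
    """Return (mean, sample standard deviation as a percent of the mean)."""
    avg = mean(mapped_reads)
    rel_sd = 100.0 * stdev(mapped_reads) / avg
    return avg, rel_sd


# Placeholder per-run counts, for illustration only (NOT the real run data).
example_runs = [80_000, 120_000, 150_000, 60_000]

avg, rel_sd = run_summary(example_runs)
print(f"mean mapped reads: {avg:,.0f}; relative SD: {rel_sd:.0f}%")
```

With only 7 or 8 runs per site, a relative standard deviation like this is a rough indicator at best.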

Useful read lengths, estimated by summing the number of match positions in the CIGAR strings for mapped reads, varied quite a bit as well. BGI had between 40 and 68% of their mapped reads delivering 100+ alignable bases, whereas for UM this range was 27-41%. Neither saw a run where fewer than 60% of the mapped reads had at least 80 mappable bases.
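The CIGAR-based length estimate can be sketched roughly as follows. This is not my actual script, just a small illustration under a few assumptions: "match positions" is taken to mean the `M`, `=` and `X` operations, and the helper names `aligned_bases` and `fraction_at_least` are invented for the example.

```python
import re

# One CIGAR element: a run length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_bases(cigar):
    """Sum the lengths of match-type operations (M, =, X) in a CIGAR string."""
    if cigar == "*":  # "*" means no alignment is available
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")


def fraction_at_least(cigars, min_bases):
    """Fraction of reads whose aligned-base count is at least min_bases."""
    lengths = [aligned_bases(c) for c in cigars]
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n >= min_bases) / len(lengths)


# Toy CIGAR strings, not taken from the outbreak data.
toy = ["100M", "60M1D30M", "120M3S", "10S70M"]
print(fraction_at_least(toy, 100))
```

Note that soft-clipped (`S`) and inserted (`I`) bases are not counted here, so a read can be much longer than its alignable portion.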

I haven't tried to calculate error rates. Yes, that's a huge omission, but the problem is I'm not a specialist in E.coli any more (and haven't been for a long time) and am not sure what to trust in the assemblies. So I'll leave that to others.

Jonathan Rothberg apparently spoke at the Personal Genomics meeting here in Boston (I'm saving my conference time & fees for the big cancer genomics meeting next week, so I must rely on press reports for the Personal Genomics meeting) and sketched more grand vistas for Ion Torrent performance: 400-base reads by year end, $1000 human genomes early next year and other tantalizing dreams. But it's too late to dissect those now; perhaps tomorrow night.

[Corrected Jun 16th to fix the stupid name substitution I made in the original, as pointed out by Karol. Very embarrassing, especially with the high frequency that my own name is mutated]

I'm confused by the remark "I haven't tried to calculate error rates. Yes, that's a huge omission, but the problem is I'm not a specialist in E.coli any more (and haven't been for a long time) and am not sure what to trust in the assemblies."

Even a rather crappy assembly can provide very good estimates of error rates for individual reads. Trying to determine how good an assembly is may require knowledge of the organism, but error models for the reads only require an assembly that is substantially better than the reads.

It sounds like you have some pretty good assemblies (454 and Illumina data), so getting error estimates for the Ion Torrent is just a matter of mapping the reads.

One caveat: different mappers report statistics differently. An "unmappable" read for one mapper is a high-error-rate read for another.
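As a rough illustration of the mapping-based estimate described in this comment, the sketch below computes a per-read error rate from a SAM record as the `NM` edit distance divided by the number of alignment columns (`M`, `=`, `X`, `I`, `D`). It assumes the mapper writes an `NM:i:` tag, which is optional in SAM; the function names are hypothetical. Unmapped reads and soft-clipped bases are skipped, which is exactly where different mappers will give different answers.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_error_rate(cigar, nm):
    """Edit distance divided by alignment columns (M, =, X, I, D)."""
    columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XID")
    if columns == 0:
        return None
    return nm / columns


def sam_line_error_rate(line):
    """Error rate for one SAM record, or None if unmapped or NM is missing."""
    fields = line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4 or cigar == "*":  # 0x4 is the "read unmapped" flag bit
        return None
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NM:i:"):
            return read_error_rate(cigar, int(tag[5:]))
    return None


# Toy record, not real outbreak data.
record = "r1\t0\tcontig1\t1\t60\t48M1I50M1D\t*\t0\t0\t*\t*\tNM:i:3"
print(sam_line_error_rate(record))
```

Averaging these per-read values over a run gives one possible error estimate, but it only reflects the reads the mapper was willing to place.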

Keith, it is interesting to find your article and assess it retrospectively. You posted your article on June 15, 2011 and, as you stated, almost everyone and everything (BGI, U. of Muenster, HPA, Ion Torrent, OpGen, Illumina, 454) had generated and released data for the outbreak. We now know that PacBio has released their reads, and from their ZMW single read deposit in GenBank (http://www.ncbi.nlm.nih.gov/sra/92212) it is clear that they started data collection on June 10th with 32 closed circular ZMW read runs and June 18th with 24 ZMW long subread runs (http://www.ncbi.nlm.nih.gov/sra/87502) for just the outbreak isolate (not including all the other E. coli they sequenced). They must have had the outbreak isolate before June 10th in order to grow the organism, isolate DNA, shear the DNA, and make the SMRT-bell library for the closed circular reads. Considering BGI obtained a DNA sample somewhere around May 30th (3 days before they released the sequence) and OpGen received an isolate around June 6th (2 days before they released the whole-genome map), it doesn't sound like PacBio received the isolate as late as they publicly stated. One really has to read the reports, and between the lines, to get all the information and to draw conclusions about which platform and technology provides the best solution.

About Me

Dr. Robison spent 10 years at Millennium Pharmaceuticals working with various genomics & proteomics technologies & working on multiple teams attempting to apply these throughout the drug discovery process. He spent 2 years at Codon Devices working on a variety of protein & metabolic engineering projects as well as monitoring a high-throughput gene synthesis facility. After a brief bit of consulting, he rejoined the cancer drug discovery field at Infinity Pharmaceuticals in May 2009. In September 2011 he joined Warp Drive Bio, a startup applying genomics to natural product drug discovery. Other recurring characters in this blog are his loyal Shih Tzu Amanda and his teenaged son alias TNG (The Next Generation).
Dr. Robison can be reached via his Gmail account, keith.e.robison@gmail.com
You can also follow him on Twitter as @OmicsOmicsBlog.