Google announced the release of DeepVariant, a deep learning tool for reconstructing the true genome sequence with greater accuracy than classical methods. It only works on germline calls, but it's very interesting to see image recognition applied to genome reconstruction.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community.

2) Yes, it beat GATK, but only barely (don't quote me on the numbers, but it was something like 98% vs 98.5%)

3) The method is insane, in that they actually create millions of images, encoding read information as colors and alpha, and then use their image-processing neural network to do pattern recognition for calling (see the sketch after this list).

4) It is quite computationally expensive to run, not to mention training the NN.

5) It absolutely requires new training data for each platform you're going to run it on. Chemistry changed slightly? Got a new type of instrument? Doing targeted regions instead of WGS? You'll need a new gold-standard run, and you'll need to retrain the algorithm from scratch. They used the Genome in a Bottle dataset. That's limited to ~80% of the genome, and its true positives are only calls validated on at least two sequencing technologies.
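
To make point 3 concrete, here's a minimal sketch of the pileup-to-image idea in Python. The channel layout, the scaling, and the encode_pileup helper are all assumptions I've made for illustration; the real DeepVariant encoding defines its own specific channels and normalizations.

    import numpy as np

    # Toy pileup-to-image encoding in the spirit of DeepVariant's input.
    # The channel layout below is an assumption for illustration only; the
    # real tool defines its own channels and normalizations.
    BASE_INTENSITY = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00}

    def encode_pileup(reads, ref, width=15, max_reads=100):
        """Encode reads over a candidate site as a (max_reads, width, 4) image.

        reads: list of (sequence, base_qualities, is_reverse) tuples, each
               pre-aligned to the same width-bp window around the candidate.
        Channels: 0 = base identity, 1 = base quality, 2 = strand,
                  3 = base differs from reference.
        """
        img = np.zeros((max_reads, width, 4), dtype=np.float32)
        for row, (seq, quals, rev) in enumerate(reads[:max_reads]):
            for col in range(min(width, len(seq))):
                img[row, col, 0] = BASE_INTENSITY.get(seq[col], 0.0)
                img[row, col, 1] = min(quals[col], 40) / 40.0  # cap and scale
                img[row, col, 2] = 1.0 if rev else 0.0
                img[row, col, 3] = 1.0 if seq[col] != ref[col] else 0.0
        return img

    # One such image is generated per candidate site, and a CNN then
    # classifies each site's genotype (hom-ref / het / hom-alt).

Generating one of these images for every candidate site across a whole genome, plus a CNN forward pass for each, is also roughly where the cost in point 4 comes from.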

Don't get me wrong - it's cool to see someone enter the space with a really crazy orthogonal method, but it's not a panacea, and the hype about AI solving all of our variant calling problems is pretty clearly overblown. That doesn't mean that this won't be useful in the future, just that it's not there yet.

Thanks, Chris, for such a concise & informative review. I have a question/comment about your point 5, the need to retrain the model every time something changes (bear with me: I haven't read the DeepVariant method in any detail).

First, I wonder to what extent it is really necessary to retrain the parameters even for small changes in the library preparation. Presumably (and that's a big if), small changes in, say, chemistry should still give good results.

But most importantly, I don't think DeepVariant is conceptually different from other methods when it comes to using training and test data. DeepVariant just makes the need for training data explicit. Implicitly, other methods also need training data that in theory should be re-analysed every time something changes. For example, when we ("we" meaning us or the program we use) decide to filter out variants supported by fewer than 3 reads, effectively we are saying "given the training data I've seen until now, 3 is a good threshold".
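
As a trivial illustration of that implicit training (the threshold and function below are made up, not taken from any real pipeline):

    # A hypothetical hard filter of the kind most pipelines apply somewhere.
    # Nothing about "3" comes from first principles: it encodes experience
    # with previously seen data, i.e. an implicitly trained parameter.
    MIN_SUPPORTING_READS = 3

    def passes_depth_filter(variant):
        return variant["alt_depth"] >= MIN_SUPPORTING_READS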

My assumption is that whether you run GATK on data produced on a HiSeq 2500, a NovaSeq patterned flow cell, or an amplicon-based technology, you'll get reasonable results. This is thanks to lots of effort that went into making their model (and its heuristics) general. (In essence, yes, using all the training data we've seen up until now.)

The NN picks out artifact patterns automatically, which is impressive, but that also makes the model very sensitive to changes in the data. Given a large and diverse training corpus, there's no reason it couldn't learn general patterns too! My point is that those large, highly validated training sets don't exist yet, so if you hop to a new (or older) technology, you can't expect DeepVariant to just work. (Again, for now.)

It's also in contrast to current callers, where you can often look at your new type of data, notice "oh, it looks like I'm overcalling at homopolymer runs", and then tweak some parameters to fix the problem. A NN is a total black box and has to be retrained from scratch.
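
For instance (hypothetical parameter names, sketching the kind of knob a rule-based caller exposes and a NN doesn't):

    # After noticing overcalls at homopolymer runs in a new data type, a
    # rule-based caller lets you tighten one named, inspectable parameter
    # and re-run. A trained NN has no equivalent switch; fixing the same
    # artifact means building new truth data and retraining.
    caller_params = {
        "min_alt_depth": 3,
        "max_homopolymer_len": 8,  # was 12; tightened after inspecting calls
    }

    def longest_homopolymer(context):
        if not context:
            return 0
        best = run = 1
        for prev, cur in zip(context, context[1:]):
            run = run + 1 if cur == prev else 1
            best = max(best, run)
        return best

    def keep_call(variant, params=caller_params):
        return (variant["alt_depth"] >= params["min_alt_depth"]
                and longest_homopolymer(variant["ref_context"])
                    <= params["max_homopolymer_len"])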

So yeah, I absolutely think that NN-based variant calling will be useful (and probably better!) in the future. I'm just trying to inject some reality into the proceedings here. :)

Just want to be clear - the DV team deserves kudos. Variant calling is a hard problem, their method is interesting, and their performance is admirable. If I'm negative about anything, it's the breathless "Google AI has solved genomics!" press coverage, which you can't blame the authors for!

I totally agree about the hoopla over "Google AI solved genomics!". At the end of the day it's a product they're bringing to market, and I'm pretty sure the buzz will exceed what it actually delivers. Having said that, I feel it's worth taking a look at how the germline calls are made and improved, but how useful it turns out to be will be a matter of time. For somatic calls, I'm sure they'll bring out something soon. I still need to get an understanding of how they implemented the algorithm, though. But I'm happy that this kind of work pushes genomics one step closer to being a research service product, and I support that.