Venter says that he has sequenced the genomes of 500 people so far, and that volunteers are also starting to undergo a battery of tests measuring their strength, brain size, how much blood their hearts pump, and, says Venter, “just about everything that can be measured about a person, without cutting them open.” This information will be fed into a database that can be used to discover links between genes and these traits, as well as between genes and disease.

But that’s going to require some massive data crunching. To get the necessary skills, Venter recruited Franz Och, the machine-learning specialist leading Google Translate. Now Och will apply similar methods to studying genomes in a data science and software shop that Venter is establishing in Mountain View, California.

The hire comes just as Google itself has launched a similar-sounding effort to start collecting biomedical data (see “What’s a Moon Shot Worth These Days”). Venter calls Google’s plans for a biomedical database “a baby step, a much smaller version of what we are doing.”

What’s clear is that genome research and data science are coming together in new ways, and at a much larger scale than ever before. We asked Venter why.

How are we doing in genomics?

In my view there have not been a significant number of advances. One reason for that is that genomics follows a law of very big numbers. I’ve had my genome for 15 years, and there’s not much I can learn because there are not that many others to compare it to.

Why did you hire an expert in machine translation as your top data scientist?

Until now, there’s not been software for comparing my genome to your genome, much less to a million genomes. We want to get to a point where it takes a few seconds to compare your genome to all the others. It’s going to take a lot of work to do that.

Google Translate started as a slow algorithm that took hours or days to run and was not very accurate. But Franz [Och] built a machine-learning version that could go out on the Web and find every article translated from German to English or vice versa, and learn from those. And then it was optimized, so it works in milliseconds.

I convinced Franz, and he convinced himself, that understanding the human genome at the scale that we are trying to do it is going to be one of the greatest translation challenges in history.

How is discovering the connection between genes and disease like translating languages?

Everything in a cell derives from your DNA code: all the proteins, their structure, whether they last seconds or days. All that is preprogrammed in DNA language. Then it is translated into life. People are going to be very surprised about how much of a DNA software species we are.