The Cure for Cancer Is Data—Mountains of Data

Lola Dupre

A few years ago Eric Schadt met a woman who had cancer. It was an aggressive form of colon cancer that had come on quickly and metastasized to her liver. She was a young war widow from Mississippi, the mother of two girls she was raising alone, and she had only the health care that her husband’s death benefits afforded her—an overburdened oncologist at a military hospital, the lowest rung on the health care ladder. The polar opposite of cutting-edge medicine. To walk into such a facility with stage 4 metastatic disease is to walk back in time to the world of the unmapped human genome, when “colon cancer” was understood to have a single cause instead of millions of causes resulting in unique variations, when treatment was the same bag of poison, whether you were in Ocean Springs, Mississippi, or Timbuktu. A time without big data, machine learning, or hope.

Schadt had just started the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai Hospital, and when he heard about the woman in Mississippi, he said, simply, “That’s exactly the kind of patient we take.” By that he meant patients for whom the current standard of care would fail, for whom the future of medicine—one in which supercomputers sift through masses of genetic data for patterns that could lead to new treatments and cures—could not arrive fast enough.

Schadt isn’t a cancer specialist or even a medical doctor. He’s a mathematician and a specialist in molecular and computational biology, and he had never had a single patient in his life. Yet through his new lab at Sinai, Schadt would generate a terabyte of data on this woman’s cancer, thousands of times what she could have expected in a conventional medical setting, in the hope of finding new ways to combat it. Toward the end, Schadt would sit at her bedside, distraught. They had become close, and the scientist who had never had patients before was seeing the implications of scientific ambition and failure. She died last year.

Seated at his desk at Mount Sinai, Schadt is direct and disarming. At 51, he wears a short-sleeved polo shirt and shorts everywhere he goes, even to black-tie galas or in New York winters, which gives him the unassailable air of a true eccentric, or a high-school football coach. For any medical researcher, it’s easier to be bullish when you’re publishing papers or developing drugs, layers removed from the human impact of your work. But living the effect of your work and watching someone slowly die in front of you, well, “that’s a deeper humbling than I’d ever experienced before,” Schadt says today.

“We’re on this exponential growth curve, where your mind naturally projects all the way into the future, and you think: We’re going to figure this out,” he says. “In the end, we will know what all these cells are doing, what all these perturbations do. The humbling part is that as we are on this growth curve, we are continually struck by the increasing complexity that is revealed.”

For a decade we’ve been talking about the potential of gene sequencing and personalized medicine, how advances in computer processing power combined with an increasingly intimate understanding of our individual genomes have put us on the threshold of an age of miracles. With enough data, the theory goes, there’s not a disease that isn’t druggable. But as Schadt has learned, it’s not enough to plumb the depths of an individual’s DNA. It requires a universe of data, exabytes’ worth, to detect patterns in a population, apply machine learning, find the network of mutations responsible for disease, and do something about it. The bigger these data sets become, the more accurate and powerful the models and the predictors become.

The problem is getting these exabytes of genetic data. Turns out you can’t just walk up to people, millions of them, and say, “Your data, please.” You must first persuade them that you’ll only do good things with it and won’t let it fall into the wrong hands. (We do like our privacy.) You must then convince the medical centers and genetic companies that collect this data that, rather than hoard it for their own profit, they should share it so the entire research community can attain the economies of scale—the critical mass of data, individual sets eventually numbering in the millions—that Schadt and many others believe is necessary to understand the causes of diseases and engineer new treatments and cures.

Right now, that volume of information is simply not available. But companies ranging from tech behemoths to biomedical startups are racing to solve these issues of scale. And Schadt wants in.

If human biological complexity can be likened to an animated movie, then a hundred years ago we had about one pixel’s worth of understanding of that complexity. With a single pixel, you have no idea what the story is. But with more pixels, hundreds or thousands—or say, 1 percent of the whole in pixels—patterns and themes begin to emerge. The beginning of a narrative.

This was the thinking that compelled Schadt to set up the Icahn Institute in 2011 after a decade of developing drugs for Merck. (At one point, half of Merck’s metabolic drugs, which treat ailments like heart disease, diabetes, and obesity, were derived from Schadt’s research.) In the face of widely held assumptions based on the single-gene model of disease and drug development, he came to believe that genes worked not alone but in vast networks to enable disease to penetrate our natural defenses, and we could understand these networks only through deep bioinformatic spelunking. To explore his complexity model, Schadt arrived at Mount Sinai with $150 million of financier–philanthropist Carl Icahn’s money and built a supercomputer named Minerva in the basement to analyze the thousands of genomes collected at Mount Sinai each year. He hired other quants, including Jeffrey Hammerbacher, who had created Facebook’s first-ever data team. According to an esteemed oncologist at the medical school, “All of a sudden you had all these math nerds running around, people who looked like they should be programming videogames.”

It didn’t take long for Schadt to realize that he was going to need a bigger boat. In 2014 the Icahn Institute started a joint venture with Sage Bionetworks to try to cure rare childhood diseases—cystic fibrosis, sickle cell anemia, Tay-Sachs—170 in all. They called it the Resilience Project, and researchers set out to find individuals in the population who carried the DNA variants for those diseases but somehow, through some inoculating buffer, didn’t have the disease. In their search for these “resilient individuals,” Schadt and his team amassed a pool of genetic data from 600,000 people, then the largest such genetic study ever conducted, with data assembled from a dozen sources (23andMe, the Beijing Genomics Institute, and the Broad Institute of MIT and Harvard, most notably). But in searching the 600,000 genomes, the researchers found potentially resilient individuals for only eight of the 170 diseases they were targeting. The study size was too small. By calculating the frequency of the disease-causing mutations in the population, Schadt and his team came to believe that the number of subjects they’d need to be useful wasn’t 600,000—it was more on the order of 10 million. For all the computational power behind the Resilience Project and what seemed like a wealth of data, Schadt still lacked the quantity and quality of patient information required to crack the genetic code behind resilience.
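The arithmetic behind that jump from 600,000 to 10 million can be sketched in a few lines. What follows is a back-of-the-envelope illustration, not the Resilience Project's actual calculation: it assumes each person is independently a "resilient" carrier with some small probability, and the one-in-a-million frequency used here is a made-up figure chosen only to show the shape of the problem. The function names are likewise hypothetical.

```python
import math


def prob_at_least_one(p: float, n: int) -> float:
    """Chance that a cohort of n people contains at least one carrier,
    assuming each person is independently a carrier with probability p."""
    # 1 - (1 - p)**n, computed with log1p/expm1 so tiny p stays accurate.
    return -math.expm1(n * math.log1p(-p))


def cohort_size_needed(p: float, target: float = 0.95) -> int:
    """Smallest cohort size whose chance of containing at least one
    carrier reaches the target probability."""
    return math.ceil(math.log1p(-target) / math.log1p(-p))


# Hypothetical frequency: one resilient carrier per million people.
# This number is an assumption for illustration, not a figure from the study.
ASSUMED_FREQUENCY = 1e-6

if __name__ == "__main__":
    for cohort in (600_000, 10_000_000):
        chance = prob_at_least_one(ASSUMED_FREQUENCY, cohort)
        print(f"cohort of {cohort:,}: chance of at least one carrier = {chance:.4f}")
    print("cohort needed for a 95% chance:",
          f"{cohort_size_needed(ASSUMED_FREQUENCY):,}")
```

Under that assumed frequency, a pool of 600,000 has a bit less than an even chance of containing a single such person for a given disease, while a pool of 10 million all but guarantees one. The rarer the variant, the faster the required cohort grows.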

“We need 100 Mount Sinais to achieve the scale required to recognize the patterns in patient data that guide you to diagnoses and treatments,” Schadt says. “In the five years that I’ve been here, I’ve realized that’s just not going to happen within the medical centers. They’re too isolated from each other, too competitive, and they’re not woven together into a coherent framework that enables the kind of advancements we’re seeing in nearly all other industries.” Since the major medical centers hold an effective monopoly over their patients’ data and have little economic incentive to collaborate with one another in critical research areas, Schadt says, “the disruption is gonna happen outside the medical establishment.”

So that’s what Schadt is aiming to build by establishing his own genetic data company, Sema4. The New York–based venture will focus on acquiring and expanding companies that specialize in genetic testing (think cancer-carrier screenings and noninvasive prenatal tests) in order to collect and share millions of individual data sets. On Sema4’s searchable platform, doctors will have instant access to a world of genomes to help diagnose their patients. Pharmaceutical companies will pay to use the system to find patient populations for clinical trials. And scientists, their current analytic arsenals amplified through ever more powerful computers and machine-learning algorithms, will finally possess enough genetic data to fuel ambitious research.

Though a handful of tech giants are venturing into the life sciences (see “Big Bets on Biodata,” below) and the National Institutes of Health is asking for a million volunteers to create its own massive biobank, Schadt believes that Sema4 and other startups like it (Craig Venter’s Human Longevity and Patrick Soon-Shiong’s NantHealth chief among them) are the most committed to achieving the optimal scale of genetic data. While these companies will compete with one another to collect ever greater stores of high-quality biodata, Sema4 will stand out by making its genetic library accessible and free of charge to academic medical centers and nonprofit researchers around the world. Should any of Sema4’s competitors need to harvest information from a subset of Schadt’s data populations, he says, they could simply pay to access the Sema4 search platform. Or Sema4 and other companies could join forces to assemble large data sets for ambitious endeavors like the Resilience Project, only bigger.

Big Bets on Biodata

How four tech heavyweights are going all-in on life science.

—Gregory Barber

Alphabet

Using machine learning for their Baseline study, Alphabet’s Verily Life Sciences team will pore over genomic, clinical, and imaging data from thousands of healthy volunteers in the hope of better understanding what makes them healthy—knowledge that might help keep people from getting sick in the first place.

IBM

In the 1970s, the World Health Organization used IBM hardware to hunt down the last vestiges of smallpox. Today IBM is partnering with hospitals to funnel health data into Watson, its Jeopardy!-winning AI system. The goal is to predict disease, personalize treatment, and even power virtual medical assistants to sift through records and research.

The company is developing tiny sensors to be worn on the skin that can transmit biometric data to remote health monitors (and, potentially, large-scale data aggregators). Microsoft also just announced its plan to use machine learning and biological data to “solve” cancer.

Still, Schadt argues, the problem of scale can’t be solved by companies simply pooling their data. “It’s about getting the data from the patients themselves.” Based on his experience at Mount Sinai, he’s seen a leap in recent years in the number of people who are coming around to his belief that there is more upside than down to having a physician know their genetic predisposition to certain conditions. He says that when he got to Mount Sinai in 2011, the hospital was screening a few thousand genetic samples a year. This year, they could screen up to 150,000, most of them collected from patients in the New York region, and at Sema4, Schadt says, “we intend to scale that up to 500,000 to a million samples a year.”

That growth will occur by buying and expanding existing genetic testing companies all over the country, most of which are now independent from each other but under Sema4 will combine to create a massive network of genetic information governed by a uniform standard of security and consent. Schadt acknowledges that it’s no simple task to ask a person to give up their biodata to an anonymous corporation. Even though billions of public- and private-sector dollars have been spent to modernize and secure existing data networks, breaches and leaks remain a fact of life. At Sema4, patients will be told, in detail, how their data will be encrypted, anonymized, and scrubbed of identifying information (except for an encryption key). Even in the event of a breach, the chance of someone being identified and exposed is exceedingly low.

There is also the issue of informed consent—the patients’ understanding and approval of the whats, hows, whys, and how longs of whatever they’re asked to endure—which impacts both the quality and the quantity of the data being collected. “There are companies today that claim access to millions of patient records,” Schadt explains. “But from the standpoint of what we intend to do, the data is meaningless. It’s often inaccurate, incomplete, and not easily linked across systems. Plus, that data doesn’t typically include access to DNA or to the genomic data generated on their DNA.” To take the example of the Resilience Project, it wasn’t simply that the universe of data was too small—it was also that the 600,000 genomes were governed under a hash of various consenting arrangements. If something vital was discovered, hundreds of thousands of participants could not be recontacted or tracked, making the data useless from a practical research standpoint.

Today, most consent forms are designed to be as quick and uninformative as possible, but rather than make it easier for researchers to get high-quality data, this approach actually makes it harder. Studies have shown that the more informed the consent, the better the information, since patients are more willing to participate in follow-up exams and interviews when they appreciate the purpose of the research. (This also allows scientists to track health and wellness over time.) At Sema4, Schadt is adopting a multistage informational process—which includes a mandatory, must-pass quiz—so it will be clear that patients understand the full scope of what they’re consenting to. This will require more of a patient’s time, but Schadt is betting that as more patients understand, more of them will consent to sharing their genetic information.

With this digital infrastructure in place, Schadt envisions a future in which more and more patients share not only their genomes but also medical and lifestyle information collected by monitoring devices like glucometers, blood-pressure trackers, and inhalers. The hope is that, ultimately, these increasingly sophisticated, increasingly patient-friendly tests will be so comprehensive that a patient’s microbiome can be regularly sequenced, their RNA frequently examined, and their blood cells constantly monitored for signs of trouble.

The virtual monopoly that medical centers like Mount Sinai now exercise over patient data will be smashed, and researchers will finally have the masses of genetic data that the medical breakthroughs of the future require. “Can we do better for human well-being if information is more broadly accessible, where you’re leveraging the mindshare of the entire planet to evolve the models of disease?” Schadt asks. “Absolutely.” This is medicine as math, not guesswork, and every disease—even stage 4 cancer—might one day be druggable.