Houston, We Have a Data Problem

June 8, 2011 | “There’s too much data coming at us and too little time to be prepared to handle it,” sighs Krishna Sankhavaram, the director for Research Information Systems and Technology services at the prestigious University of Texas MD Anderson Cancer Center.

While working at three of the most prestigious medical centers in the United States, Sankhavaram has done a pretty good job of finding solutions for storing and managing molecular and clinical data. But the demands of translational medicine are such that he can never relax.

In the 1990s, he worked at St Jude’s Hospital with Clayton Naeve, building bioinformatics resources including the institute’s first laboratory information management system (LIMS). After a spell at the Moffitt Cancer Center in Florida, he joined MD Anderson in Houston in 2005, reporting to Lynn Vogel (VP and CIO of the information services division).

A nationally top-ranked teaching hospital, MD Anderson employs some 20,000 people (including 2,000 faculty and postdocs), conducts 3,000 clinical trials and cares for 80,000 patients per year. Sankhavaram’s team supplies the research informatics infrastructure, new research computing technology, and bioinformatics support for an army of physicians and scientists.

A priority is to “de-silo” the institution. “Researchers tend to operate independently,” he says. “Some faculty groups often have their own IT staff, or might build their own private datacenter. They don’t seem to talk to each other or know what each other is using. This causes several IT resources to be duplicated; it’s very hard to share data for analysis. We now have an institutional directive to centralize, share data and resources. It would also provide security to their data.”

All Together Now

Sankhavaram has developed a free (“no charge back”) centralized system to house everyone’s data, believing that if the data are stored centrally, there’s a chance that researchers will share them. It takes time, but signs are the approach is working. “They come to us now,” he says. “They want to store their data and use our apps.”

As datasets, particularly involving next-generation sequencing (NGS), are frequently duplicated and manipulated by various groups at different stages, Sankhavaram must wrestle with a major question: “What is the source of truth?” he says. “We control the unnecessary duplication of data, and stop wasting disk space. There’s always one copy and everything is backed up and mirrored across town.”

Sankhavaram told the senior researchers he would not touch their data. “What is uploaded in our environment stays right there—appropriately labeled in a tight structure. Only the creator of the data (and his proxies) can see it. This data owner decides who is allowed to see the data and the appropriate level of access.”
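The ownership rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MD Anderson's system: the class and method names (Dataset, AccessLevel, grant, can) and the three access levels are hypothetical, since the article does not say which levels exist.

```python
# Illustrative owner-controlled access model. Every name here is
# hypothetical; the article does not describe the real implementation.
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class AccessLevel(IntEnum):
    NONE = 0
    READ = 1
    FULL = 2


@dataclass
class Dataset:
    name: str
    owner: str
    proxies: set[str] = field(default_factory=set)
    grants: dict[str, AccessLevel] = field(default_factory=dict)

    def _is_owner_or_proxy(self, user: str) -> bool:
        return user == self.owner or user in self.proxies

    def grant(self, granted_by: str, user: str, level: AccessLevel) -> None:
        """Only the owner (or a proxy) may decide who else gets access."""
        if not self._is_owner_or_proxy(granted_by):
            raise PermissionError(f"{granted_by} cannot share {self.name}")
        self.grants[user] = level

    def level_for(self, user: str) -> AccessLevel:
        """Owner and proxies get full access; everyone else needs a grant."""
        if self._is_owner_or_proxy(user):
            return AccessLevel.FULL
        return self.grants.get(user, AccessLevel.NONE)

    def can(self, user: str, needed: AccessLevel) -> bool:
        return self.level_for(user) >= needed


# Usage: by default nobody but the owner and proxies can see the data.
ngs_run = Dataset(name="ngs_run_001", owner="pi_alice", proxies={"postdoc_bob"})
ngs_run.grant("pi_alice", "collab_carol", AccessLevel.READ)
print(ngs_run.can("collab_carol", AccessLevel.READ))
print(ngs_run.can("stranger_dave", AccessLevel.READ))
```

The default of no access for anyone without an explicit grant mirrors the promise that uploaded data stays visible only to its creator until the creator says otherwise.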

His group also serves as a core informatics facility, helping researchers analyze specific problems. “They’re inundated with NGS data and don’t know how to handle it.” Some of the blame lies with the instrument vendors, he says. “They sell instruments to faculty members who are often naïve about how much data they have to store on Day 1.”

Housing data in a central location sounded reasonable until the NGS deluge. “It’s gone through the roof,” says Sankhavaram. “We had to invest a lot of money in a hurry to support those labs.” The tools on offer to users include a homegrown web portal called Research Station, which bundles several applications and a LIMS for NGS that is being built by Ocimum Biosolutions. “Ocimum has an open architecture, and we embed it into our architecture.”

All told, MD Anderson has about 1.5 petabytes (PB) of non-research storage serving the entire campus. Thanks to NGS, Sankhavaram is already managing a further 1.6 PB devoted to research. (All of the research infrastructure is managed by just two systems administrators.) Much of that infrastructure is from Hewlett-Packard (see “HP Sauce”). In the next 12 months, Sankhavaram expects to double the research data storage. “There would be six months’ backlog for data analysis if we stopped right now.”

The problem is illustrated by MD Anderson’s three data centers. One is very new, just five years old. “We thought it would last for five years, but it filled up in a year,” he says. Millions of dollars were spent converting a disaster back-up site into a tier 1 production center, which is now full of servers for research computing, including a large compute cluster. A new data center for clinical data goes live in August, with plans afoot for another data center on campus.

That expansion attempts to solve another problem: electricity. “Our storage and computing solutions need enormous amounts of electricity to run them. Blades run cooler individually than before, so they don’t generate as much heat, but we have a lot of them, so they use much more power. We are running out of electrical power in the data center as our current datacenter is not designed to support such systems.” Meanwhile, more SOLiD and HiSeq 2000 sequencers are on the way, and Sankhavaram admits, “We are scrambling at this point.”

More Storage

Sankhavaram plans to buy more storage and five more servers because he is running out of servers with large memory. “Hopefully Dr. Vogel will get the approval,” he smiles. He is also talking with Larry Smarr’s group at the San Diego Supercomputer Center, hoping to join a high-bandwidth network.

The near-term challenges include handling, storing and replicating 1 terabyte of data per patient, for 10,000 patients. “You can’t just buy $100 Terabyte drives. The data has to be secure, backed-up, mirrored, and many other things the faculty doesn’t always see,” says Sankhavaram. “By MD Anderson clinical guidelines, you have to store everything from a patient. We store data forever.”
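As a rough back-of-the-envelope check of those figures, the sketch below multiplies 1 TB per patient by 10,000 patients. Two things are assumptions made for illustration rather than numbers from MD Anderson: the decimal convention of 1 PB = 1,000 TB, and a copy count of two (one primary plus one mirror).

```python
# Back-of-the-envelope storage estimate for the figures quoted above.
# Assumptions (not from the article): decimal units (1 PB = 1,000 TB)
# and two full copies of the data (primary plus one mirror).

TB_PER_PB = 1_000


def raw_capacity_pb(patients: int, tb_per_patient: float, copies: int = 1) -> float:
    """Return the raw capacity in petabytes needed for `copies` full copies."""
    return patients * tb_per_patient * copies / TB_PER_PB


single_copy_pb = raw_capacity_pb(patients=10_000, tb_per_patient=1.0)
mirrored_pb = raw_capacity_pb(patients=10_000, tb_per_patient=1.0, copies=2)

print(f"One copy: {single_copy_pb:.1f} PB")
print(f"Primary plus mirror: {mirrored_pb:.1f} PB")
```

Under these assumptions a single copy already comes to 10 PB, and a mirrored pair to 20 PB, before any backups are counted.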

Does the genome count as patient data? “Today I don’t know, but it seems to be going in that direction,” he admits.

Some of the MD Anderson faculty members are suggesting storing just analyzed data. “Some say just re-sequence [the samples], but what if we don’t have the sample left?” asks Sankhavaram reasonably.

“There is an initiative to sequence every patient that walks through the door.” That’s why the Institute for Personalized Cancer Therapy, under the direction of Gordon Mills, was created, with the goal of cataloguing mutations in 10,000 patients in the first year, reaching 30,000 in five years. An initiative called Project T9, now underway, aims to “test run” that with 1,500 patients. •

HP Sauce

Sankhavaram is a long-time Hewlett-Packard (HP) customer, and despite intense competition in the storage sector, sees little reason to change. “We buy products that fit our existing infrastructure the best; HP seems to work the best.”

During his early days at MD Anderson, Sankhavaram saw some atrocious habits in data storage. “I saw data being burned onto DVDs and kept on shelves in the lab; they did not have any IT support.” His storage choice back then was HP’s EVA (Enterprise Virtual Array) system. “We always found HP very reliable, inexpensive, small, and very fast,” he says. His team evaluated offerings from IBM and NetApp, but he says HP still proved the best, most manageable solution.

MD Anderson’s computing cluster is an oversubscribed system that has expanded to 8,000 cores. “We kept adding storage, and as we consolidated everyone’s data, it was easy to use the same HP storage software to manage all of it. Our users didn’t even know [we were adding storage]. We didn’t have to get rid of our old stuff even as we were adding new. Even if we had used IBM storage, we still could have used the same management software.”

But Sankhavaram was running out of space. “We had a problem: Our users did not like it at all.” Sankhavaram’s policy is that when researchers upload data, they’re not supposed to manipulate the original. “We noticed people moving things around, so we had to put a stop to that quickly. They now copy it to a scratch area, but do not edit the original. We manage all that from within ResearchStation.”
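The copy-to-scratch rule can be illustrated with a short Python sketch. It is not how ResearchStation works, which the article does not describe; the function names and directory layout are hypothetical, and clearing file permission bits is only an advisory safeguard standing in for whatever controls the real system uses.

```python
# Sketch of a "copy to scratch, never edit the original" workflow.
# Function names and directory layout are hypothetical.
import shutil
import stat
import tempfile
from pathlib import Path


def protect_original(path: Path) -> None:
    """Clear the write permission bits so the upload is not edited by accident."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def copy_to_scratch(original: Path, scratch_dir: Path) -> Path:
    """Copy the original into scratch and return the writable working copy."""
    scratch_dir.mkdir(parents=True, exist_ok=True)
    working = scratch_dir / original.name
    shutil.copyfile(original, working)  # copies contents only, not permissions
    return working


# Usage with a throwaway directory and a tiny made-up FASTQ record.
fastq = "@read1\nacgt\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "uploads" / "run001.fastq"
    original.parent.mkdir()
    original.write_text(fastq)
    protect_original(original)

    working = copy_to_scratch(original, root / "scratch")
    working.write_text(working.read_text().upper())  # edit only the copy

    original_unchanged = original.read_text() == fastq
    original_read_only = not (original.stat().st_mode & stat.S_IWUSR)
    working_changed = working.read_text() != fastq

print(original_unchanged, original_read_only, working_changed)
```

Using shutil.copyfile rather than shutil.copy means the working copy does not inherit the read-only bits, so analysis can modify it freely while the uploaded file stays as it arrived.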

By 2008, NGS data were flooding in. “You cannot move terabytes of data around just like that. We needed fast storage, something that could work with what we already had, and work on both InfiniBand (cluster compute nodes) and Gigabit Ethernet (GigE). We needed it to work with NFS and CIFS (Common Internet File System).”

Sankhavaram settled on a small company called Ibrix, which had “a nice algorithm that does it all,” including automatic tiering when needed. “We picked Ibrix, and within two weeks, HP acquired them! It’s now an HP product. It works really well. Other vendors could not provide that. They could provide InfiniBand and GigE, but not at the same time or from the same physical device.”

As Sankhavaram’s cluster uses InfiniBand, he says, “I want our entire storage to be presented to the compute nodes with fast interconnects. When you’re in the wet lab, you upload the data quickly, then turn around and launch a job on several hundred nodes. Researchers need not know what’s going on behind the scenes. It’s completely transparent to them. That’s what we wanted.” K.D.