September, 2011

The Microsoft Research Connections blog shares stories of collaborations with computer scientists at academic and scientific institutions to advance technical innovations in computing, as well as related events, scholarships, and fellowships.

Twenty years; two decades; a fifth of a century—we can phrase it several ways, but what does it mean? To a person, it’s the onset of adulthood (or maybe the point marking only 10 more years of living in Mom and Dad’s basement); to a dog, it’s senescence. But to us at Microsoft Research, it marks the lifetime of our organization, which has grown and evolved in a remarkable era of transformation and innovation in computer science and scientific research.

Yes, Microsoft Research turns 20 this September, and in keeping with the tradition of honoring base-10 birthdays, this seems like an appropriate time to look back on some significant accomplishments and take stock of our future. Over the next four weeks, we will highlight some particularly noteworthy research: from using computing to better understand the body’s immune response to HIV and AIDS, to measuring and modeling complex ecosystems and global environmental conditions, to tools that inspire and enable citizen-scientists around the world.

As you will see, the vast majority of these scientific advances were made possible because of joint efforts between Microsoft Research and academic, government, and industry scientists. Collaborative research is the sine qua non of my group, Microsoft Research Connections. We work with the world’s top academic and scientific researchers, institutions, and computer scientists to shape the future of computing in fields such as parallel programming, software engineering, natural user interfaces, and data-intensive scientific research. It is through the connection of dedicated researchers at Microsoft Research’s worldwide labs with the top minds in academia that we are able to push technology to tackle some of the world’s most pressing problems. Similarly, it is through our fellowships and grants that we are able to foster the next generation of world-class computer scientists.

As we look forward to our next 20 years, we do so with renewed vigor and a reaffirmed commitment to improve the world through basic and applied research in computer science and software engineering. Whether it’s the extension of the computer into people’s everyday lives through our research on natural user interfaces, or our ongoing efforts to create educational tools such as the WorldWide Telescope, or our quest to apply algorithms to solve the mysteries of disease, we will be guided by the words of Rick Rashid, who started Microsoft Research in September 1991 and today heads its worldwide operations:

"We are investing for the future, an insurance policy for the future. We’re doing things that, when we start, we don’t know if they are going to be successful. For us, it’s more about ideas and taking risks. Basic research is about agility. It’s about giving you the ability to change when you most need it."

The ability to change when you need it most… now there’s something to celebrate, for sure.

It’s long been known that many serious diseases—including heart disease, asthma, and many forms of cancer—run in families. Until fairly recently, however, medical researchers have had no easy way of identifying the particular genes that are associated with a given malady. Now genome-wide association studies, which take advantage of our ability to sequence a person’s DNA, have enabled medical researchers to statistically correlate specific genes to particular diseases.

Sounds great, right? Well, it is, except for this significant problem: to study the genetics of a particular condition, say heart disease, researchers need a large sample of people who have the disorder, which means that some of these people are likely to be related to one another, even if only distantly. This means that certain positive associations between specific genes and heart disease are false positives, the result of two people sharing a common ancestor rather than their sharing a common propensity for clogged coronaries. In other words, your sample is not truly random, and you must statistically correct for “confounding,” which is caused by the relatedness of your subjects.

This is not an insurmountable statistical problem: so-called linear mixed models (LMMs) can eliminate the confounding. Using them, however, poses a computational problem, because it takes an inordinately large amount of computer runtime and memory to run LMMs that account for the relatedness among thousands of people in your sample. In fact, the runtime and memory footprint that these models require scale as the cube and the square, respectively, of the number of individuals in the dataset. So, when you’re dealing with a 10,000-person sample, the cost of the computer time and memory can quickly become prohibitive. And it is precisely these large datasets that offer the most promise for finding the connections between genetics and disease.
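To make that scaling concrete, here is a minimal Python sketch. It rests on illustrative assumptions that do not come from the research itself: that the quadratic memory cost is dominated by one dense n-by-n matrix of 8-byte floating-point values, and that runtime is only meaningful here as a ratio against a baseline sample size. The function names are hypothetical.

```python
def dense_matrix_bytes(n_individuals: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for one dense n x n matrix.

    Assumption: 8-byte (float64) entries and no compression or sparsity.
    """
    return n_individuals * n_individuals * bytes_per_value


def relative_cost(n_individuals: int, baseline: int, exponent: int) -> float:
    """How many times more expensive n individuals are than the baseline,
    when cost grows as (number of individuals) ** exponent."""
    return (n_individuals / baseline) ** exponent


if __name__ == "__main__":
    baseline = 10_000
    for n in (10_000, 20_000, 120_000):
        gigabytes = dense_matrix_bytes(n) / 1e9
        runtime_ratio = relative_cost(n, baseline, exponent=3)  # cubic runtime
        memory_ratio = relative_cost(n, baseline, exponent=2)   # quadratic memory
        print(f"n={n}: matrix ~{gigabytes:.1f} GB, "
              f"runtime x{runtime_ratio:.0f}, memory x{memory_ratio:.0f}")
```

Under these assumptions, the matrix for a 10,000-person sample alone occupies 800 million bytes, and doubling the sample multiplies the runtime by 8 and the memory by 4.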

Enter Factored Spectrally Transformed Linear Mixed Model (FaST-LMM), which is an algorithm for genome-wide association studies that scales linearly in the number of individuals in both runtime and memory use (see “FaST linear mixed models for genome-wide association studies”). Developed by Microsoft Research, FaST-LMM can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail to run at all with even 20,000 individuals. This means that the large datasets that are indispensable to genome-wide association studies are now computationally manageable from a memory and runtime perspective.

With FaST-LMM, researchers will have the ability to analyze hundreds of thousands of individuals to look for relationships between our DNA and our traits, identifying not only what diseases we may get, but also which drugs will work well for a specific patient and which ones won’t. In short, it puts us one step closer to the day when physicians can provide each of us with a personalized assessment of our risk of developing certain diseases and can devise prevention and treatment protocols that are attuned to our unique hereditary makeup.

Nearly a million children die from pneumonia each year, making it a leading cause of death and the single most important health issue facing children under the age of five. The standard vaccination schedule calls for three doses of pneumonia vaccine given at six weeks, 10 weeks, and 14 weeks of age. Naturally, the intention is to protect children from this disease as early as possible—but administering the vaccine at such an early age also reduces how long the vaccine protects the child.

The Oxford Vaccine Group is conducting a program in Nepal to determine if shifting the vaccination schedule can extend childhood immunity until the critical five-year point. For the trial, the team is scheduling the first two doses to be given at six weeks and 14 weeks, but the third dose is given much later, at eight months of age. The team is hopeful that delaying the final vaccination will protect children for much longer, thus reducing the mortality rate from this serious disease.

Building Solutions with Everyday Technology

One of the biggest problems in medical informatics is keeping track of the data. Researchers must meticulously log who collected each piece of information, how it was collected, and any associated details. Manually inputting this data takes time away from actual research and is prone to error, while incomplete entries may cause problems for other researchers who refer to the material later. A team that is working on software support for medical informatics at the University of Oxford’s Department of Computer Science is seeking ways to simplify the process and reduce the risk of errors.

With support from Microsoft Research, this team developed CancerGrid, a system to manage all the diverse data that are associated with a clinical trial. Each data item to be collected is associated with a clearly defined semantic label so that the precise meaning will be clear to clinical staff, and researchers can be certain that any two trials that use the same semantic label for an item of data are recording exactly the same thing. This makes it possible to reuse and combine data, making each trial far more valuable to researchers. Windows Azure and Microsoft Excel, SharePoint, and InfoPath are used to collect and organize the data, providing easy and intuitive access to data and implementing rules to ensure that critical data are recorded consistently and accurately. Forms, databases, and the associated infrastructure for each new trial can be generated at the touch of a button, permitting the deployment of trial support infrastructure in a fraction of the time and a fraction of the cost of conventional methods.
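The idea of shared semantic labels can be sketched in a few lines of Python. This is only an illustration of the principle, not CancerGrid's actual data model: the class, the function, the label names, and the values are all hypothetical and made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """One recorded value, tagged with a semantic label (hypothetical model)."""
    semantic_label: str  # shared identifier that fixes the item's meaning
    value: float


def combine_by_label(*trials: list[DataItem]) -> dict[str, list[float]]:
    """Pool values from several trials, grouping them by semantic label.

    Items are pooled only when their labels match exactly, which mirrors the
    guarantee that an identical label means an identical measurement.
    """
    combined: dict[str, list[float]] = {}
    for trial in trials:
        for item in trial:
            combined.setdefault(item.semantic_label, []).append(item.value)
    return combined


if __name__ == "__main__":
    # Made-up records from two imaginary trials.
    trial_a = [DataItem("body_weight_kg", 4.2), DataItem("age_weeks", 6.0)]
    trial_b = [DataItem("body_weight_kg", 5.1), DataItem("head_circumference_cm", 38.0)]
    print(combine_by_label(trial_a, trial_b))
```

Because both imaginary trials use the label body_weight_kg, their weights end up in one pooled list, while items recorded under labels unique to one trial stay separate.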

It is this flexibility and automation that made it possible for CancerGrid to meet the needs of the Oxford Vaccine Group, rapidly generating full document management support for the Nepalese pneumonia vaccine trial. By using a secure Internet connection, researchers in Nepal now transmit data back to the University of Oxford, where it can be analyzed and the effectiveness of the new vaccination regime assessed. Working on CancerGrid has been a very satisfying collaboration for both the Oxford team and Microsoft Research. We are hopeful that it will prove to be a powerful tool in the fight against pneumonia and many other diseases.

—Simon Mercer, Director of Health and Wellbeing, Microsoft Research Connections