November, 2013

The Microsoft Research Connections blog shares stories of collaborations with computer scientists at academic and scientific institutions to advance technical innovations in computing, as well as related events, scholarships, and fellowships.

Starting in the tenth century, during the Medieval Warm Period, Greenland was a fraction of a degree warmer than today. Norse settlers raised livestock and cultivated small farms. Later, in the fifteenth century, a colder climate and conflicts with the Inuit caused them to abandon their settlements. Critics of anthropogenic global warming cite this earlier episode of warming as evidence that periods of rising temperatures are part of the natural cycle of climate change. By so doing, they attempt to dismiss the impact of human activities on today’s warming world, but without examining the causes of climate changes at various times in Earth’s history.

One of my challenges as a teacher at the University of California, Santa Barbara, is to help my students put such historical evidence in perspective. Every year, I teach a course about Earth system science to 80 to 100 incoming graduate students. It’s very important that these students understand how Earth functions as a planet, including how and why its climate changes. We know that Earth goes through periods in which the climate varies. And at different timescales, we understand why the climate changes.

Covering millions of years’ worth of warming trends in a single term is a challenge; managing the massive volumes of data, charts, videos, illustrations, and other support materials is even more daunting. I’ve struggled to pull together all these materials in an accessible—and manageable—manner.

Thanks to ChronoZoom, an award-winning, open-source community project that is dedicated to visualizing the history of everything, I now have an effective way of collecting, presenting, and managing all of these resources. A joint effort of the University of California, Berkeley; Moscow State University; the Outercurve Foundation; and Microsoft Research Connections, ChronoZoom lets my students navigate through time, from the Big Bang to recent historical events, stopping to study detailed information at any point in history. With the ability to compare events that occurred in the distant past with what’s going on in the present, my students can better understand how we know that today’s warming climate is driven by human activities. For example, by using data derived from ice cores and thermometers, they can examine how changes in temperature relate to events in human and pre-human history.

By using ChronoZoom, I’m developing a history of Earth that illustrates changes in climate from the beginning of the planet through modern day. I’m including images, diagrams, graphs, and time-lapse movies that illustrate changes in the environment, pulling these resources together to create tours: explanatory narratives that my students can explore at any speed and level of detail they want. They can skip over some details and dive deep into others. They can zoom rapidly from one time period to another, moving through history as quickly or slowly as they want.


Because ChronoZoom operates in the cloud and can be accessed from anywhere through any modern web browser, teachers and schools don’t have to invest in new equipment or software. That’s definitely a plus, especially in these days of tight budgets. But for my money, the best value lies in the volume and depth of material that can be packed into a single ChronoZoom timeline. Moreover, thanks to Windows Azure, the tool has the flexibility to scale up and down, so that even projects that focus on but a sliver of the history of everything—such as the history of humanity or maybe just the twentieth century or the last couple of weeks—still benefit from presentation in ChronoZoom.

What’s more, I’ll be able to share my tours with other teachers—and take advantage of theirs—because we can all upload our information to the cloud, making our data, images, and text available in any Internet-connected classroom. Our students can do this, too, creating and sharing their own tours.

I’m sure that ChronoZoom is going to change the way I teach and, more importantly, the way students learn, and I encourage anyone with an interest in understanding the interconnections of history to put their own content into ChronoZoom.

—Jeff Dozier, Professor of Environmental Science and Management, University of California, Santa Barbara

Big data: you can hardly pick up a newspaper without reading about some new scientific or business insight derived from mining some heretofore-untouched volume of digital information. Well, I’m happy to say that genome sequence data—which certainly qualifies as big, both in volume and velocity—is joining the party, and in a most meaningful way. When combined with information from medical records, genome data can be mined for new insights into treating disease.

Currently, however, genomics analyses are dominated by batch processing, which means that simple analytical questions can take days to answer. A more efficient way to exploit genomics data would be to run queries against a large genome database in the cloud, by using a platform like Windows Azure. In this scenario, the queries are answered across the network in seconds.

Toward this vision, I have been working with researchers at the University of California, San Diego (UCSD), and together we have invented the Genome Query Language (GQL), which features three operators that allow error-resilient manipulation of genome intervals. These operators, in turn, abstract a variety of existing genomic software tasks, such as variant calling (determining where a person’s genome differs from the reference genome) and haplotyping (ascribing genomic variation to inheritance from the mother or the father). GQL is inspired by the classic database query language SQL and has similar operators; however, GQL introduces a major new operator: the fault-tolerant union of genomic intervals.
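To give a feel for what a fault-tolerant union of intervals might do, here is a minimal Python sketch. It is illustrative only, not the actual GQL operator: the function name `fuzzy_union` and the `gap` tolerance parameter are assumptions made for exposition.

```python
def fuzzy_union(intervals, gap=10):
    """Merge genomic intervals, treating intervals whose ends lie
    within `gap` bases of each other as contiguous, so that small
    sequencing gaps do not fragment a region.

    Illustrative sketch only -- not the actual GQL operator.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + gap:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With `gap=10`, for example, `fuzzy_union([(100, 150), (155, 200), (400, 450)])` merges the first two intervals into `(100, 200)` while keeping `(400, 450)` separate.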

To understand how GQL could be used on the Windows Azure platform in the cloud, imagine that a biologist is working on the ApoE gene, which is responsible for forming lipoproteins in the body. Wondering how ApoE gene variations affect cardiovascular disease (CVD), the biologist types in a query with the parameters “ApoE, CVD” on a tablet computer, just as you might enter a search-engine query. The query is sent to the GQL implementation in the cloud, which returns the ApoE region of the genome in patients with cardiovascular disease. Because the ApoE gene is quite small, the data is processed quickly in the cloud and returned in seconds to the biologist’s tablet. The biologist can then use customized bioinformatics software to mine the data to identify variations.

We have implemented GQL on Windows Azure and used it to query genomic data expeditiously. We have shown, for example, how GQL can be used to query The Cancer Genome Atlas for large structural variations with only 5 to 10 lines of high-level code. Running on the Windows Azure application in the cloud, the code took approximately 60 seconds to execute on an 83-gigabyte input human genome file. GQL can also improve existing software by refactoring queries, significantly speeding up results, and it could enable browsing the UCSC genome browser by query, not just by genomic location.

Two optimizations were crucial to achieving interactive speeds in the GQL implementation: cached parsing and lazy joins. Combined, they sped up query processing by a factor of 100. I encourage interested readers to explore the details of our research—the GQL queries we used, the optimizations we implemented, and the experimental results we achieved—in the Microsoft Research Technical Report: Interactive Genomics: Rapidly Querying Genomes in the Cloud.
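The flavor of these two optimizations can be sketched in a few lines of Python. The names and details below are illustrative assumptions, not drawn from the actual implementation: cached parsing memoizes the conversion of raw records into typed intervals, and a lazy join yields overlapping pairs on demand rather than materializing the full result.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_record(line):
    """Cached parsing: each raw record is parsed into a typed
    (chrom, start, end) tuple once; repeated queries over the
    same data reuse the cached result instead of re-parsing."""
    chrom, start, end = line.split("\t")
    return chrom, int(start), int(end)

def lazy_join(left, right):
    """Lazy join: yield overlapping same-chromosome interval pairs
    one at a time, so a downstream operator can stop early instead
    of waiting for the full cross product to be materialized."""
    for l in left:
        for r in right:
            if l[0] == r[0] and l[1] < r[2] and r[1] < l[2]:
                yield l, r
```

Because `lazy_join` is a generator, a caller that needs only the first overlapping pair can call `next()` on it and avoid scanning the remainder of either input.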

Big data took center stage at the fifth Microsoft Research Asia Joint Labs Symposium, held on November 2, 2013, in Hefei, China. Gathering under the theme of “Research Collaborations in the Big Data Era,” more than 50 faculty and graduate students, representing 10 labs, joined more than 20 Microsoft researchers to discuss the future of data-intensive science.

The fifth Joint Labs Symposium featured a lively panel discussion about collaborations in the era of big data.

The symposium is just one of many activities of the Microsoft Research Asia Joint Lab Program (JLP). Since its founding in 1999, the JLP has facilitated comprehensive cooperation between Microsoft Research and faculty and students at leading Chinese research universities. The program promotes joint research, advances academic exchange, and fosters talent development. Microsoft Research Asia has established 10 joint labs, eight of which have been named “Key Laboratories” by the Chinese Ministry of Education, a designation that allows them to compete for government funds. To date, the JLP has completed more than 200 joint projects and given rise to over 1,000 academic papers. Equally important, more than 1,000 students have participated in the JLP, fueling a robust talent pipeline.

The fifth Joint Labs Symposium brought together key faculty and students from all 10 joint labs and provided a forum to showcase achievements, enhance scientific research, and cultivate high-caliber talent. The day’s events were broken into three segments. The first focused on urban informatics empowered by big data. The second centered on the role of cloud computing in the analysis of big data. The third featured a lively panel discussion about collaborations in the era of big data—a spirited dialogue that delved into a host of issues, including the potential of cloud services for research; the sharing of data, algorithms, tools, and even research stacks via virtual machines; and issues of data privacy.

The symposium highlighted the importance that Microsoft Research places on collaboration with major academic institutions. We look forward to another year of fruitful cooperation, as we advance together into the realm of data-intensive research.