Thought Leaders in the Cloud: Talking with Kate Keahey, Creator of Nimbus

Kate Keahey is a fellow at the Computation Institute at the University of Chicago and a scientist at Argonne National Laboratory. She is the creator of Nimbus, an open source toolkit for turning a cluster into an infrastructure-as-a-service (IaaS) cloud, primarily targeted at making IaaS available to researchers and scientists. Her past positions include technical staff member at Los Alamos National Laboratory. More information about Kate and her work is available at www.mcs.anl.gov/~keahey and www.nimbusproject.org.

In this interview, we discuss:

What Nimbus is

The beginnings of “science clouds”, such as TeraGrid and FutureGrid

The potential for clouds to dramatically enhance research

Opportunities for partnerships between organizations like Microsoft and researchers

Student interest in cloud computing

The future of Nimbus

Robert Duffner: Kate, can you take a minute to introduce yourself and the Nimbus project?

Kate Keahey: Sure. I’m a scientist at Argonne National Lab, and I’m a fellow at the Computation Institute at the University of Chicago. My background is grid computing, and many years ago, I realized that one barrier to computation for scientists using grids is that they cannot control the environment on grid resources.

For many people that was a deal breaker, because their code is very complex and hard to port. I started working on combining virtualization and distributed computing at King Lab with the idea of deploying virtual machines remotely.

After a few years of research on deploying remote resources, we released something called the Workspace Service, which right now is the infrastructure-as-a-service part of Nimbus. A year later, Amazon EC2 came online, which was a lot of fun because it enabled us to experiment with the concept at larger scales.

Nimbus today is a cloud computing toolkit, one part of which is just an open source infrastructure-as-a-service implementation. It includes rough equivalents of the compute and storage cloud components EC2 and S3. Another goal of Nimbus is to make it possible for people to take advantage of infrastructure-as-a-service platforms.

So for example, we have a tool called Context Broker that takes virtual machines deployed on Nimbus clouds, EC2 clouds, Rackspace clouds, or whatever, and brokers configuration and security context between those virtual machines. In other words, after the context broker is done, you get a virtual cluster, which most scientists are used to getting.

Another very important goal of Nimbus is to provide for open source implementation and community. By open source, I am not referring to something that we wrote and just put up a link so everybody can download it. We are committed to building the community and developing software in an extensible way.

That effort includes creating a thoughtful design that is extensible in various directions and providing a framework that allows people to easily test what they have in the larger context of the cloud. That part of Nimbus has been going particularly well. We have managed to attract significant contributions.

Right now, there are four developers funded by the University of Chicago and Argonne National Lab working on Nimbus. There are three other committers on Nimbus and there are a total of maybe 15 contributors.

There are other projects that are clearly finding this infrastructure useful and they are finding that it makes sense for them to experiment with it and extend it, which I think is particularly important at this time in cloud computing.

As a cloud computing evangelist for the scientific community, I have worked with many projects and tried to make it easier for them to use in particular infrastructure-as-a-service clouds and to figure out what their needs are and what the obstacles to adoption are.

Robert: One of the goals of Nimbus is to enable turnkey virtual clusters, so can you take a minute to break that down? What are the key challenges?

Kate: Say you deploy some virtual machines in Amazon’s cloud, for example, and several others in some Nimbus cloud, or some other infrastructure-as-a-service place. Those sets of virtual machines are unconnected, and they don’t know about each other. They don’t share a security context. If you look at a typical scientific cluster, you’ll see that it’s connected in some ways.

In other words, when you run let’s say MPI on that cluster, you can do a send from one node to another without having to type the password, because those nodes exchange host certificates.

But when you deploy virtual machines on somebody’s data center, how do they exchange those host certificates? If somebody from the Internet that you don’t know about comes to you and says, “Here’s my host certificate, can you put it in your office file?” it might not happen because you don’t have a trusted relationship with that person.

Therefore, there needs to be some root of trust, and somebody needs to broker that trusted relationship for the various members of the cluster. There are also configuration issues. For example, your typical scientific cluster based on MPI would be configured with something like NFS: Network File System.

The nodes that are clients to NFS need to know information about the NFS server, like the location of the volume. The server also needs to know about the clients. There’s a concept in MPI called MPI_COMM_WORLD that defines which nodes you’re going to be communicating with. If you’ve got a barrier, you need to contact all those nodes and synchronize. That information has to be exchanged between those nodes somehow.

That’s what the context broker does. It sets up a configuration exchange, provides a trust root, and establishes a trusted security exchange between the nodes of a cluster. The result is a cluster that shares a security context and a configuration context, so that your typical scientist can come in and say, “All right. This cluster has NFS, it has PBS, it has whatever else I need to do my job.” They can just treat that cluster as if it were a cluster in their computational center.
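The exchange described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Nimbus Context Broker or its API: the class name `ContextBroker`, the role labels, and the shape of the returned context are all hypothetical. Each virtual machine registers what it provides, and once the whole cluster has checked in, each member can ask for the peer host keys and configuration it needs.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBroker:
    """Toy in-memory broker (hypothetical; not the Nimbus Context Broker API)."""
    expected: int                      # number of VMs the cluster should contain
    nodes: dict = field(default_factory=dict)

    def register(self, hostname, role, host_key):
        # Each VM reports its identity and the role it plays in the cluster.
        self.nodes[hostname] = {"role": role, "host_key": host_key}

    def ready(self):
        # Context can only be handed out once every member has checked in.
        return len(self.nodes) >= self.expected

    def context_for(self, hostname):
        if not self.ready():
            raise RuntimeError("cluster is still assembling")
        if hostname not in self.nodes:
            raise KeyError(hostname)
        peers = {h: n for h, n in self.nodes.items() if h != hostname}
        nfs = sorted(h for h, n in self.nodes.items() if n["role"] == "nfs-server")
        return {
            # Security context: host keys of every other member.
            "known_hosts": {h: n["host_key"] for h, n in peers.items()},
            # Configuration context: where the shared volume lives.
            "nfs_server": nfs[0] if nfs else None,
            # The set of hosts an MPI job would span.
            "mpi_hosts": sorted(self.nodes),
        }


broker = ContextBroker(expected=3)
broker.register("head", "nfs-server", "key-head")
broker.register("w1", "worker", "key-w1")
broker.register("w2", "worker", "key-w2")
print(broker.context_for("w1"))
```

The broker plays the part of the trusted third party: nodes never have to accept a host key directly from an unknown machine on the Internet, only from the broker they were launched with.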

Robert: The University of Chicago and several other universities are offering science clouds. Can you talk a little bit about that?

Kate: That actually began as a grassroots movement. About three years after we released Nimbus, we did some experiments with various scientific applications and Amazon. At some point, the University of Chicago said, “All right, this is an interesting project. Let’s give them a little partition on one of our existing clusters.” Some other universities also thought it was a good idea, so the University of Florida and Purdue University configured clusters on their own resources.

The one at Purdue is particularly interesting, because they configured a cloud on TeraGrid resources. TeraGrid is a large national infrastructure that typically runs the traditional grid software, but Purdue decided to set up one as a cloud. That was the first cloud within TeraGrid, and they’ve been doing some interesting experiments with that lately.

This is a very loose federation of various universities. There is no obligation as to policies or anything like that, and there is no common governance between them. People just know each other, and if some users need to have accounts on multiple clouds for some reason, they get recommended and so forth.

People can use those clouds for any kind of scientific purpose. You can’t use them to run computations for your startup, but those clouds are open to anything that is academically viable. Actually, they are probably being made obsolete right now by a new project called FutureGrid.

It’s also a national infrastructure, TeraGrid type of project, but this one has been specifically set up to experiment with new technologies and new paradigms such as cloud computing. And some of the machines on FutureGrid are configured with Nimbus to form clouds. Traffic is slowly moving from the science clouds, which are very small, toward the clouds set up in FutureGrid.

Robert: My understanding of FutureGrid is that there’s a lot to it other than cloud computing. Could you speak to that a bit?

Kate: Sure. It was set up to provide an experimental environment. Infrastructures like TeraGrid and Open Science Grid cater to the needs of production science. In other words, if oceanographers, astrophysicists, or high energy physicists have some simulations or other computations they want to run, they can do so on those infrastructures.

That arrangement has one problem, though, which is that it’s very difficult for computer scientists to use, because computer science experiments typically inject some failures into the system or experiment with new technologies like cloud computing that create instabilities. Those factors are fundamentally incompatible with the needs of production environments.

There is a test bed in France called Grid’5000 that is specifically built for experimental computer science, and it has been running for many years now. They have worked out how the governance of a test bed like that should work, what usage patterns are required, and so forth.

At some point, it became clear that it would be good to have something like that in the US, and that’s essentially what FutureGrid is. It gives us an excellent opportunity to experiment with new paradigms, including cloud computing but also, for example, new networking paradigms. It also has a private network that you can inject delays and failures into, so it’s a very interesting experimental tool, and it’s coming online this fall.

It’s possible now to get accounts on FutureGrid. In fact, there are quite a few people who did, and they are running interesting projects on Nimbus and other setups on the FutureGrid. It’s not supporting all modes of usage yet, but within a few months, we’re hoping to support minimal users and usage modalities.

Kate: I think there is enormous potential. I alluded earlier to the problems that people were encountering with grids, when they couldn’t control an environment on a remote resource. That issue went away with virtualization, since if you can deploy a virtual machine that you have configured yourself, it’s going to support exactly the environment that you want. That is a very, very powerful thing, because often in science, you don’t need computation on an ongoing basis, but only on demand.

You mentioned STAR. They ran a simulation that’s fairly famous by now, where they really needed to get results in time for a paper deadline. There was one last simulation they needed to squeeze in, and we helped them run their experiment on Amazon using hundreds of nodes.

Of course, paper deadlines are fairly trivial in the scheme of things, but we have also worked with epidemiologists at the University of Utah. You can imagine that there could be a far more significant incentive to get something done very quickly in the case of monitoring an epidemic.

At the Ocean Observatories Initiative, they are monitoring earthquakes. That data sometimes gets delayed, and then it comes out in big bursts. It has to be processed as soon as possible, because if there are earthquakes happening somewhere, we want advance information on that as soon as possible.

There are a lot of events that benefit from timely processing. For example, in the case of an oil spill, you want to run simulations about projected movement of the spilled oil, effects on marine life, and so on. Hurricanes, tsunamis, and a wide spectrum of fairly sudden phenomena make it very powerful to be able to go and get resources at the drop of a hat and then come back to your normal processing needs.

Another factor is that in many branches of science, there are periodic rapid changes. For example, in bioinformatics, the cost of sequencing machines went down dramatically. It used to be that many centers simply could not afford those machines and therefore were not producing that data, but now there’s a sudden explosion of data, and people’s computational needs are growing very, very dramatically.

Given that those patterns are very difficult to predict, how do you accommodate that growth? Why not outsource the computation? There are many economic and convenience factors; scientists simply don’t want to run computational centers.

They want to do the climate or the physics, or whatever other discipline they are good at. They would prefer not to spend their resources acquiring expertise to run these computation labs on site, so they are happy to outsource the problem, if possible.

Another important issue has to do with democratization: creating the computing middle class, if you will. It used to be that if you started a research team, you needed some computational resources to back you up. You needed to buy a cluster. Well, not necessarily anymore.

Many of the same factors that make cloud computing compelling for business also make it compelling to academia. In academia, one huge usage pattern is coming to the fore right now, and that’s the ability to deal with unpredictable phenomena.

And this is what we’re doing in the Ocean Observatories Initiative. People are putting sensors in the ocean, and based on the information that those sensors return, we need to run simulations. We need to scale elastically and on demand, and sometimes we need to scale in very short amounts of time.

Implementing that pattern didn’t use to be possible. Now it is, and it’s changing things dramatically. We’ll have to see how the availability of that pattern will change things further and whether it will speed up this process of outsourcing computation. It’s certainly a very powerful catalyst.

Another pattern that we see emerging is that not all of these scientists have the skills to configure their own virtual machines to run on the cloud infrastructure. As a result, there is a new type of role emerging for specialists who take care of creating the right virtual machines for the community, validate them in special ways, and maintain them.

The stuff that sys-admins used to do for the communities is now being done not on a per-cluster basis, but on a per-community basis. Since it’s done on a per-community basis, they are putting all kinds of stuff on those virtual machines and performing all kinds of maintenance tasks that cluster administrators didn’t do before, because they were doing it on a per-cluster basis.

We collaborate with a bioinformatics project at the University of Maryland called CloVR that essentially developed a set of mechanisms that make cloud computing easier, customized to their specific community. I predict that there will be many more projects like that emerging, and it will lead to a new technical role that we didn’t really have before.

Robert: We recently ran an Azure academic pilot to start building relationships with researchers and better understand their cloud computing environments. We wound up striking a deal with the National Science Foundation to provide free cloud resources for NSF projects. What would be your message to an organization like Microsoft that’s interested in providing compute resources to researchers and students?

Kate: The only serious obstacle I can see is that the prevalent platforms in the scientific community are Linux or Unix-based. From the perspective of many researchers, particularly those that have been established for a long time, transitioning to Windows is a major obstacle. In order to go to Azure, they would not only have to port their code, but they would also have to take on maintenance responsibilities on a platform other than their primary one.

Working with newer communities that do not have those strong legacy requirements, bringing them in while they’re developing computational capabilities, and providing them with encouragement and help while showcasing Azure’s features is probably the way to go. It’s very hard to move a community that has already been developing things on a Unix-like platform for many years onto a Windows one.

Robert: It’s definitely a great opportunity for Microsoft, and I think you bring up some very good points. As we start to realize fully the benefits of offering a platform-as-a-service, where you’re just paying for access to the data center, I wonder whether these distinctions between various platforms start to go away. Do you have any thoughts on that?

Kate: From my perspective, infrastructure-as-a-service is the most flexible computing model, because it truly gives people the freedom of doing whatever they want to do. If you go further up the stack and provide a platform, which is what Azure is, on one hand you provide something that is potentially more convenient, but on the other hand, you restrict some degree of control that people have on the infrastructure-based platform.

And I think you’re perfectly right that if there is a useful platform and people really find the specific paradigm useful, it eventually becomes a utility. For the transitional period and for applications that do not easily lend themselves to specific computational patterns, and I personally think that there are many such applications, the infrastructure-as-a-service paradigm is going to be more interesting in the long run. And then the choice of an operating system platform is going to matter.

Robert: That’s a fascinating point of view. We talked a lot about researchers and scientists, but tell us about the student side. What aspects of cloud computing seem to be the most interesting to the students that you interact with?

Kate: Well, students are typically interested in their degrees. I think that cloud computing from their perspective is like a gold rush, because all of a sudden, the paradigm changed. And when a paradigm changes, that means that many pathways emerge that were not explored before. So from the perspective of somebody looking for a thesis topic, all of a sudden there are all these thesis topics there for the taking.

To give you an example, you very often just blast a lot of copies of virtual machine images to nodes. Then you also snapshot them so in a given time slot, you have to take those images and store them on your storage system.

In principle, this is nothing new. We have done data management and storage management and distributed computing in general before, but perhaps we focused on something else. Maybe we didn’t focus on the specific pattern when one image, let’s say a five- to 15-gigabyte image or file, goes out to so many nodes.

Now, by optimizing this pattern, we have some interesting issues there that we can explore. There are intellectually challenging problems that this change in paradigm has all of a sudden exposed. So for computer science students in particular, I think it’s wonderful. And we’ve been working with quite a few under various initiatives.

We’ve had great success working with Google Summer of Code, which sponsors students to do open source involvement in the summer. Would Microsoft consider doing something like that?

Robert: I think it’s a very interesting idea, and there have actually been conversations within Microsoft regarding a similar approach. I can’t really provide you with any specifics, but there’s definitely a lot of interest. There are certainly differences because of our historical approach to developing software, but we’ve been engaging with open source communities more and more. It’s absolutely not out of the realm of possibility for us to move rapidly in that direction.

Kate: It’s certainly wonderful for students, because they find a lot of interesting, challenging problems. Throwing a lot of young minds at specific problems could be very interesting.

Robert: From a strategy perspective, Microsoft wants the Windows Azure platform to be the most open platform out there. Whether you want to develop in .NET, C#, Java, PHP, Ruby on Rails, Python, or whatever, we want to be able to run all of those in Windows Azure. That’s clearly a stated direction for Microsoft, because ultimately we believe that whoever’s cloud is the easiest to leave will be the most successful cloud.

Kate: I totally agree with that.

Robert: To slightly shift gears, in a talk called “My Other Computer is a Data Center,” Robert Grossman of the Open Cloud Consortium said, “A programmer can develop a program to process a container full of data, in this case, a shipping container, with less than a day of training using MapReduce.” How much hardcore programming does a science student need to know to be able to take advantage of the cloud?

Kate: I think we’ve got two issues here. One is the issue of taking advantage of resources that are external to your lab. And then the other issue is the issue of a convenient paradigm. Certainly MapReduce has proved to be a very convenient paradigm, maybe not for physics, but for bioinformatics, certainly, or other biological sciences or any kind of problem that is in some way, shape or form similar to search.

So the paradigm is convenient, and then you get the easy access to external resources, in a sense as a bonus. Easy access to external resources has to be provided to you via a paradigm that is easy to use. Sometimes it will be a platform paradigm, and sometimes it will be an infrastructure paradigm.

There was a huge area where the cloud computing paradigm was hard and now it is easy. And that makes a very, very significant difference to application scientists.

Robert: What are some of the other uses of Nimbus that you’re seeing outside of the scientific domain?

Kate: Nimbus clouds are being used fairly extensively for education. I don’t think it has really been significantly adopted in industry. We talked earlier on to some commercial companies, but at this point, there are many other projects that cater to larger companies like that, in particular projects that provide a better business model for support and so forth. And our target is also primarily science and education.

Robert: So can you talk a little bit about the future of Nimbus? What’s on the road map?

Kate: We’ve got a couple of releases planned, one of which is going to include fast propagation, so we’ll have an improved tool to support the need to suddenly deploy hundreds of images, which I referred to earlier.

One of my team members developed something called LANTorrent, which works roughly like BitTorrent, but on a LAN, and it just streams images. A node becomes a source and streams images to other nodes, which significantly reduces the deployment time for large clusters.
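A rough back-of-the-envelope model shows why streaming through the nodes helps. This is not the actual protocol of the tool described above; it assumes a simple chain in which each node forwards chunks to the next node while still receiving, and the function names, link speed, image size, and chunk size are all illustrative assumptions rather than measurements.

```python
def sequential_seconds(image_gb, nodes, link_gbps):
    """One source sends a full copy to each node, one after another."""
    return nodes * image_gb * 8 / link_gbps


def chained_seconds(image_gb, nodes, link_gbps, chunk_mb=64):
    """Nodes form a chain; each forwards chunks onward while still receiving."""
    image_s = image_gb * 8 / link_gbps            # time to stream one full image
    chunk_s = (chunk_mb / 1024) * 8 / link_gbps   # time to forward one chunk
    # Every extra hop in the chain delays completion by one chunk time.
    return image_s + (nodes - 1) * chunk_s


# Illustrative numbers only: a 5 GB image, 100 nodes, 1 Gbit/s links.
print(sequential_seconds(5, 100, 1))  # 4000.0 seconds
print(chained_seconds(5, 100, 1))     # 89.5 seconds
```

Under these assumptions the single source stops being the bottleneck: adding a node costs one chunk of extra latency instead of one full image transfer.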

We tested something like 1,000 images in 10 minutes on the Magellan cloud. In scientific clouds, the prevalent model with infrastructure-as-a-service is to let users request service on demand. Conventionally, you have to significantly overprovision your cluster. You get something like 10 percent utilization, or if you have much higher utilization, you have to reject a large proportion of those requests because somebody else is already using the resources.

That’s not a very good trade-off, so we said, “All right, how about if while nobody is asking for on-demand resources, you just deploy a default virtual machine on those resources?” And that default virtual machine could join, for example, a Condor pool or some sort of SETI@home-style pool, or be used by some infrastructure that is very failure-tolerant.

Condor is a system for high-throughput computing that is designed to work in environments such as desktops. The owner of the desktop can come any time and interrupt it, and some of the computations are effectively lost and have to be rerun.
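The backfill idea can be sketched as a small bookkeeping model. This is an assumption-laden illustration rather than the Nimbus implementation: the class `BackfillCloud` and its methods are hypothetical. Idle nodes run a default, preemptible virtual machine, and an on-demand request simply evicts as many of those as it needs.

```python
class BackfillCloud:
    """Toy model of backfilling idle nodes (hypothetical names throughout)."""

    def __init__(self, total_nodes):
        self.total = total_nodes
        self.on_demand = 0
        self.backfill = total_nodes   # every idle node starts with the default VM

    def request(self, count):
        """Serve an on-demand request by preempting backfill VMs."""
        if count > self.total - self.on_demand:
            return False              # only rejected when on-demand use fills the cloud
        self.backfill -= count        # preempted work is lost and must be rerun
        self.on_demand += count
        return True

    def release(self, count):
        """Return nodes to the pool and redeploy the default VM on them."""
        count = min(count, self.on_demand)
        self.on_demand -= count
        self.backfill += count

    def utilization(self):
        return (self.on_demand + self.backfill) / self.total


cloud = BackfillCloud(10)
cloud.request(4)
print(cloud.on_demand, cloud.backfill, cloud.utilization())
```

Because the backfill work is failure-tolerant, preempting it costs only the lost computation, while every node stays busy whether or not on-demand users are present.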

We ran some studies on that, and you can have a 100 percent-utilized cloud. That’s coming out in 2.7, which we hope to release on Thanksgiving, because most of those features have been contributed by our wonderful open source community.

So we have lots to be thankful for. And then later on, we’re going to be releasing capabilities that we’ve worked on for several years now, and in particular, the capability to scale elastically. The ALICE project was our first venture into management of elastic scaling.

They had a queue of jobs that they run on a global test bed, and the jobs are managed by a group scheduler. And they said, well, we would like to monitor that queue, and if it’s very large, we would like to spin up some virtual machines on a cloud to pick up those jobs. And if the queue gets smaller, we’ll kill those virtual machines, because we won’t need those external resources.

In other words, it was elastic scaling to extend their test bed. We did that about three years ago, and we have done several projects in that vein since then. More recently, we’ve been working with the Ocean Observatories Initiative to provide a highly available elastic scaling service like that. We’re hoping to release that early next spring.
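The queue-driven scaling rule described above can be written down as a single decision function. This is a minimal sketch under stated assumptions, not the service Nimbus ships: the function name, the `jobs_per_vm` ratio, and the `max_vms` cap are all hypothetical parameters chosen for illustration.

```python
def scaling_decision(queued_jobs, running_vms, jobs_per_vm=4, max_vms=20):
    """Return how many VMs to launch (positive) or terminate (negative)."""
    # Ceiling division: enough VMs to cover the queue, capped by the budget.
    wanted = min(max_vms, -(-queued_jobs // jobs_per_vm))
    return wanted - running_vms


print(scaling_decision(10, 0))    # queue is growing: launch VMs
print(scaling_decision(0, 3))     # queue has drained: terminate VMs
```

A monitor would call this periodically with the current queue length and act on the result; a production version would also need some damping so that short-lived spikes do not cause machines to be launched and killed repeatedly.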

Robert: Those are all of the prepared questions I have. Is there anything else you’d like to talk about or something that you would like an opportunity to address to our Windows Azure community?

Kate: I see open source as an extremely important vehicle for progress, in particular in cloud computing. And from my vantage point, watching the community for quite a few years, many of the breakthroughs that have led to the development of cloud computing can be traced back to open source.

For example, we had VMware for many years, and it worked great. But at the beginning, when I was trying to convince users to use virtual machines with distributed computing, they would say, “Why would I buy a virtual machine if I could buy a real one for the same amount of money?” Because licenses were quite hefty.

And then when open source virtualization came out, people could improve it and adapt it to what they needed to do, and it was very fast. It really created a breakthrough, and eventually it led to the development of services like EC2, and in some ways to the whole cloud revolution.

When there is a paradigm change, it’s very important to invest in open source software. I’m not saying that all software should be open source, and I don’t think anybody would advocate that quite completely. But it’s important that there is software that people can experiment with, extend, and contribute to.

It’s extremely important to generate progress and allow people to experiment and do new things. I think that is a very important aspect of what we’re going to be seeing in the future.