February 08, 2013

This article echoes several themes that I speak to often.
In my conception, every researcher is an entrepreneur, and researchers, like entrepreneurs, should be able to run their (virtual) operations from coffee shops. Science as a service frees researchers to work when and where they want, while also saving them time and money.

For many researchers, tasks concerned with managing (collecting, storing, annotating, indexing, analyzing, sharing, archiving) data are among the most time-consuming. Thus, my colleagues and I established Globus Online (www.globusonline.org) to deliver research data management as a service. The data that Globus Online manages sits at sequencing centers and computer centers, in cloud storage services, and on laboratory computers; Globus Online services run on Amazon computers.

Our first Globus Online service focuses on file movement; several thousand people use it routinely to move large quantities of data (more than 10 petabytes to date) rapidly, reliably, and securely between hundreds of endpoints. During 2013, we'll be adding data sharing, cataloging, analysis, and other functions.

Globus Online is operated by the University of Chicago as a non-profit service for the research community. Please take a look and provide feedback if you can.

November 23, 2010

The Globus team showcased Globus Online, our new cloud-based managed file transfer service, at the SC conference in New Orleans last week. Hundreds of people came by our booth to sign up. I don't think it was just the free T-shirts: people seemed really interested. Thus, I outline here the rationale for Globus Online's development, and provide a few notes on its design and implementation.

Rationale

Growing up in New Zealand, I heard endless repeats of the Goon Show. In one episode, hero Neddy Seagoon is offered five pounds to move a piano from one room to another. It turns out that one room is in France and the other in England, so it is a more difficult task than Neddy anticipated. (He ends up sailing the piano across the Channel.)

Moving data has that flavor: it can sound trivial, but in practice is often tedious and difficult. Datasets may have complex nested structures, containing many files of varying sizes. Source and destination may have different authentication requirements and interfaces. End-to-end performance may require careful optimization. Failures must be recovered from. Perhaps only some files differ between source and destination. And so on.
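Even the "only some files differ" case conceals real work: something must walk the dataset and decide, file by file, what actually needs to be re-sent. As a naive sketch of that decision (my own illustration, not how Globus Online implements it):

    import os, hashlib

    def needs_transfer(src_path, dst_size, dst_checksum):
        """Decide whether a source file must be (re)sent, given the size and
        checksum recorded for the copy at the destination (None if absent)."""
        if dst_size is None:                        # not present at destination
            return True
        if os.path.getsize(src_path) != dst_size:   # sizes differ
            return True
        h = hashlib.md5()                           # sizes match: compare checksums
        with open(src_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest() != dst_checksum

Repeat that over a deeply nested directory tree, after a partial failure, across two systems with different authentication requirements and interfaces, and the "trivial" task stops being trivial.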

Many tools exist to manage data movement: RFT, FTS, Phedex, rsync, etc. However, all must be installed and run by the user, which can be challenging for all concerned. Globus Online uses software-as-a-service (SaaS) methods to overcome those problems. It's a cloud-hosted, managed service, meaning that you ask Globus Online to move data; Globus Online does its best to make that happen, and tells you if it fails.

Design

The Globus Online service can be accessed via different interfaces, depending on the user and their application:

A simple Web UI is designed to serve the needs of ad hoc and less technical users

A command line interface exposes more advanced capabilities and enables scripting for use in automated workflows

A REST interface facilitates integration for system builders who don't want to re-engineer file transfer solutions for their end users

All three access methods allow a client to:

establish and update a user profile, and specify the method(s) they want to use to authenticate to the service;

authenticate using various common methods, such as Google OpenID or MyProxy providers;

characterize endpoints to/from which transfers may be performed;

request transfers;

monitor the progress of transfers; and

cancel active transfers.

Having authenticated and requested a transfer, a client can disconnect, and return later to find out what happened. Globus Online tells you which transfer(s) succeeded and which, if any, failed. It notifies you if a deadline is not met, or if a transfer requires additional credentials.

The Globus Online Web frontend, in the best traditions of Web 2.0, uses JavaScript to provide a rich user experience, forwarding requests to the Globus Online service back-end.

Globus Online REST requests are of course simple HTTP GETs and POSTs, with the destination URL indicating the requested operation and the body of the message containing any arguments.
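To make that concrete, here is a minimal sketch of what such an interaction might look like from Python. The base URL, resource paths, JSON fields, and authorization header are illustrative assumptions, not the actual Globus Online REST API; the point is simply the pattern of a POST to request work and a GET to monitor it.

    import requests  # third-party HTTP client library

    BASE = "https://transfer.example.org/api"        # hypothetical base URL
    AUTH = {"Authorization": "Bearer <credential>"}  # placeholder credential

    # POST: the URL names the operation (a transfer request); the body
    # carries the arguments (source, destination, files to move).
    task = requests.post(
        BASE + "/transfer",
        headers=AUTH,
        json={
            "source_endpoint": "alcf#dtn",
            "destination_endpoint": "nersc#dtn",
            "items": [{"source": "~/myfile", "destination": "~/myfile"}],
        },
    ).json()

    # GET: check on the transfer later; the client can disconnect in between.
    status = requests.get(BASE + "/task/" + task["id"], headers=AUTH).json()
    print(status["status"])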

A command line interface (CLI) has long been valuable for client-side scripting, but requires installation of client-side libraries. What we call (somewhat tongue in cheek) CLI-2 supports client-side scripting with no client-side software installation. We achieve this behavior via a restricted shell, into which any user with a Globus Online account can ssh to execute commands. Thus, I can write something like

    ssh cli.globusonline.org scp alcf#dtn:~/myfile nersc#dtn:~/myfile

to copy myfile from source alcf#dtn to destination nersc#dtn. Two useful features are illustrated:

Endpoints define logical names for physical nodes. For example, alcf#dtn denotes the data transfer nodes associated with the Argonne Leadership Computing Facility. Sites can publish their endpoints, and users can define their own endpoint names.

The Globus Online scp command echoes the syntax of the popular scp (secure copy), thus facilitating access by scp users. It supports many regular scp options, plus some additional features--and is much faster because it is built on GridFTP.

There's more, including a powerful transfer command. I encourage you to browse the documentation.

Implementation

The two keys to successful SaaS are reliability and scalability. The service must behave appropriately as usage grows to 1,000, then 1,000,000, and perhaps more users. To this end, we run Globus Online on Amazon Web Services. User and transfer profile information is maintained in a database that is replicated, for reliability, across multiple geographical regions. Transfers are serviced by nodes in Amazon's Elastic Compute Cloud (EC2), which scale automatically as service demands increase.

Next steps

We will support InCommon credentials and other OpenID providers in addition to Google; support other transfer protocols, including HTTP and SRM; and continue to refine automated transfer optimization, for example by tuning endpoint configurations based on the number and size of files.
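On that last point, the flavor of the tuning can be conveyed with a toy heuristic (my own sketch, not Globus Online's actual policy): datasets made up of many small files benefit from more concurrent transfers and pipelining, while a handful of large files benefit from more parallel streams per transfer.

    def choose_tuning(num_files, total_bytes):
        """Toy heuristic for picking GridFTP-style transfer settings from the
        number and size of files to be moved; for illustration only."""
        avg = total_bytes / max(num_files, 1)
        if avg < 1 << 20:       # mostly small files (< 1 MB on average)
            return {"concurrency": 8, "parallel_streams": 1, "pipelining": True}
        elif avg < 1 << 30:     # medium-sized files (< 1 GB on average)
            return {"concurrency": 4, "parallel_streams": 4, "pipelining": False}
        else:                   # a few very large files
            return {"concurrency": 2, "parallel_streams": 8, "pipelining": False}

    # 100,000 files totalling 5 GB: lots of small files, so favor concurrency.
    print(choose_tuning(100000, 5 * (1 << 30)))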

November 02, 2010

Nah, I'm not going to tell you here ... that is the title of a talk I will give in Indianapolis on December 1st, at the CloudCom conference. But here's the abstract:

We've all heard about how on-demand computing and storage will transform scientific practice. But by focusing on resources alone, we're missing the real benefit of the large-scale outsourcing and consequent economies of scale that cloud is about. The biggest IT challenge facing science today is not volume but complexity. Sure, terabytes demand new storage and computing solutions. But they're cheap. It is establishing and operating the processes required to collect, manage, analyze, share, and archive that data that is taking all of our time and killing creativity. And that's where outsourcing can be transformative. An entrepreneur can run a small business from a coffee shop, outsourcing essentially every business function to a software-as-a-service provider--accounting, payroll, customer relationship management, the works. Why can't a young researcher run a research lab from a coffee shop? For that to happen, we need to make it easy for providers to develop "apps" that encapsulate useful capabilities and for researchers to discover, customize, and apply these "apps" in their work. The effect, I will argue, will be a dramatic acceleration of discovery.

August 05, 2009

Ed Walker wrote a nice article last year in which he used the well-known NAS parallel benchmarks to compare the performance of a commercial infrastructure-as-a-service offering (Amazon EC2) with that of a high-end supercomputer (the National Center for Supercomputing Applications "Abe" system). Not surprisingly, the supercomputer was faster. Indeed, it was a lot faster, due primarily to its superior interprocessor interconnect. (The NAS benchmarks, like many scientific applications, perform a lot of communication.)

However, before we conclude that EC2 is no good for science, I'd like to suggest that we consider the following question: what if I don't care how fast my programs run, I simply want to run them as soon as possible? In that case, the relevant metric is not execution time but elapsed time from submission to the completion of execution. (In other words, the time that we must wait before execution starts becomes significant.)

For example, let's say we want to run the LU benchmark, which (based on the numbers in Ed's paper) when run on 32 processors takes ~25 secs on the supercomputer and ~100 secs on EC2. Now let's add in queue and startup time:

On EC2, I am told that it may take ~5 minutes to start 32 nodes (depending on image size), so with high probability we will finish the LU benchmark within 100 + 300 = 400 secs.

On the supercomputer, we can use Rich Wolski's QBETS queue time estimation service to get a bound on the queue time. When I tried this in June, QBETS told me that if I wanted 32 nodes for 20 seconds, the probability of me getting those nodes within 400 secs was only 34%--not good odds.

So, based on the QBETS predictions, if I had to put money on which system my application would finish first, I would have to go for EC2.
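The arithmetic behind that bet is simple enough to spell out. Here is a small sketch using only the numbers quoted above:

    # Numbers quoted above (LU benchmark on 32 processors).
    t_run_ec2 = 100        # run time on EC2 (secs)
    t_start_ec2 = 300      # estimated time to start 32 EC2 instances (secs)
    t_run_hpc = 25         # run time on the supercomputer (secs)
    p_nodes_by_400 = 0.34  # QBETS: chance of getting 32 nodes within 400 secs

    # EC2: startup plus run time, assuming startup variability is small.
    t_total_ec2 = t_start_ec2 + t_run_ec2
    print("EC2 finishes in about %d secs" % t_total_ec2)   # ~400 secs

    # Supercomputer: to finish within 400 secs the job must *start* within
    # 400 - 25 = 375 secs, so its chance is at most the QBETS figure.
    print("P(supercomputer finishes within %d secs) <= %.2f"
          % (t_total_ec2, p_nodes_by_400))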

Here is a more detailed plot showing cumulative probability of completion (the Y-axis) as estimated by QBETS as a function of time since submission (the X-axis). We see that the likelihood of my application completing on EC2 is zero until around 400 seconds, when it rapidly rises to one. For the supercomputer, the probability rises more slowly, peaking at around 0.97. (I suspect that the supercomputer estimate does not reach 1 because of the limited data available to QBETS for long-duration predictions.)

Note that in creating this graph, I do not account for application-dependent startup time on the supercomputer or for any variability in the startup time of the EC2 instances. (Looking at Cloudstatus, the latter factor seems to be relatively minor.)

This result really just reflects the scheduling policies (and loads) that the two systems are subject to. Supercomputers are typically scheduled to maximize utilization. Infrastructure-as-a-service providers presumably optimize for response time.

Nevertheless, these data do provide another useful perspective on the relative capabilities of today's commercial infrastructure-as-a-service providers and supercomputer centers.

In my talk I discussed the impact of information technology on the practice of science:

Impressed with the telephone, Arthur Mee* predicted in 1898 that if videoconferencing could be developed, 'earth will be in truth a paradise.' Since his time, rapid technological change, in particular in telecommunications, has transformed the scientific playing field in ways that, while not entirely paradisiacal, certainly have profound implications for New Zealand scientists. The Internet has abolished distance, as Mee also predicted--a New Zealand scientist can participate as fully in online discussions as anyone else, and their blog can be every bit as influential. Exponential improvements in networks, computing, sensors, and data storage are also profoundly transforming the practice of science in many disciplines. But those seeking to leverage these advances become painfully familiar with the 'dirty underbelly' of exponentials: if you don't constantly innovate, you can fall behind exponentially fast. Such considerations pose big challenges for the individual scientist and for institutions, for researchers and educators, and for research funders. Some of the old ways of researching and educating need to be preserved; others need to be replaced to take advantage of new methods. But what should we preserve? What should we seek to change?

November 09, 2008

I'm looking forward to receiving my copy of Scientific Collaboration on the Internet. I have an article in it on lessons learned from the NEESgrid project (an earlier version is here; I think it's a good read, especially between the lines), but the other articles are probably far more interesting:

September 19, 2008

The Argonne Named Postdoctoral Fellowship Program is a great opportunity for a recent or imminent PhD looking to work at the cutting edge of computing. You also get a fancy title, like "Arthur Holly Compton Fellow" or similar. (There are a few to choose from.)

The application deadline is November 5. If you are interested, drop me a line. More details below:

The Director's Office initiated these special postdoctoral fellowships at Argonne, to be awarded internationally on an annual basis to outstanding doctoral scientists and engineers who are at early points in promising careers. The fellowships are named after scientific and technical luminaries who have been associated with Argonne and its predecessors, and the University of Chicago, since the 1940s.

Candidates for these fellowships must display superb ability in scientific or engineering research, and must show definite promise of becoming outstanding leaders in the research they pursue. Fellowships are awarded for a two-year term, with a possible renewal for a third year, and carry a stipend of $76,000 per annum with an additional allocation of up to $20,000 per annum for research support and travel.

Requirements for applying for an Argonne Named Postdoctoral Fellowship:

The following documents must be sent via e-mail to Named-Postdoc@anl.gov by November 5, 2008. In the subject line, please include the name of the candidate.

The sponsor could be someone who is already familiar with your research work and accomplishments through previous collaborations or professional societies. If you have not yet identified an ANL sponsor, visit the detailed websites of the various Research Programs and Research Divisions at www.anl.gov.

All correspondence should be addressed to Argonne Named Postdoctoral Fellowship Program. One application is sufficient to be considered for all named fellowships. For additional details, visit the Argonne web site at http://www.dep.anl.gov/postdocs/

1) "There is a level of agreement that computational Grids have not been able to deliver on the promise of better applications and usage scenarios."

It is fascinating to watch the Gartner hype cycle in action, if sad to see people stuck in the trough of disillusionment. But the fact is, fortunately, that there are substantial grid projects and applications that are enjoying considerable success. Ones that come immediately to mind are the Earth System Grid, the cancer Biomedical Informatics Grid, and the LIGO Scientific Collaboration, but as it was yesterday that the LHC was switched on, we should also recall the remarkable successes of the LHC Computing Grid and its partner projects such as Open Science Grid. At a different level, Globus people will be happy to talk about the millions of files moved via GridFTP every day, and Miron Livny will be happy to talk at length about how many millions of CPU hours are delivered every day via Condor.

2) To address this purported lack of success, "there is a need to expose less detail and provide functionality in a simplified way. If there is a lesson to be learned from Grids it is that the abstractions that Grids expose – to the end-user, to the deployers and to application developers – are inappropriate and they need to be higher level."

No evidence is provided for this assertion that complex interfaces are the reason for the difficulties people have with grids. I argue that the issues are more complex.

First, the interfaces themselves are not, in my view, a significant issue. We can argue whether we prefer REST or Web Services, or, say, Nimbus (a grid virtualization interface) versus EC2 (a cloud virtualization interface), but the differences among these alternatives are not great.

On the other hand, the economic systems that apply in the two cases are extremely different:

Amazon services are designed to support the masses: they have no political constraints on who they can provide service to, and their charging model provides strong return to scale; thus, Amazon can focus on, and succeed in providing, modest-scale, reliable, on-demand service to many.

TeraGrid (to use a US example) is designed to support a small number of extreme computing users, with a negative return to scale (the more users, the more work for a fixed budget); thus, they are not motivated to provide virtualization solutions or to operate highly reliable remote access interfaces.

The implications of these different foci for users are tremendous. On EC2, I give my credit card and start a VM--a few seconds. On TeraGrid, I request an allocation (which may not be granted!), get an account, submit a request to run a job (they won't allow me to start a VM), and wait in the queue--a many-week process. Furthermore, I sometimes find that the remote access interfaces fail because keeping them running is not a high priority.

This alternative perspective is, I think, more revealing about the sources of the differences and the ways we might address them. If we want on-demand, high-quality compute and storage services, then we need either to create an economic system in which academic providers are motivated to provide such services, or decide to outsource to industry.

The importance of higher-level interfaces is a separate issue. Yes, tools like Hadoop and Swift for data analysis, Introduce for service authoring, and Taverna for service composition are important and necessary. Yes, we should be hoping to leverage and influence work done in the far larger corporate market to our advantage. (A focus of the upcoming CCA workshop.)

3) "Grids as currently designed and implemented are difficult to interoperate." The authors make a big deal of this point, but it is not clear to what purpose.

It is true that interoperation is not automatic. [If only everyone used Globus software, then all would be well :) --although of course the policy issues would remain!] But I am not sure that this is a significant problem for users, or hard to achieve when it is needed. E.g., the caBIG team recently demonstrated a gateway to TeraGrid. The LHC Computing Grid integrates resources worldwide. Etc. Most users never ask about interoperability, in my experience.