July 18, 2008

I was invited to speak at the University of Chicago's Physics Colloquium. I foolishly agreed and then had to work out what I could possibly say that would be of interest to physicists. I figured that a talk on distributed computing wouldn't be too interesting. Instead, I gave a more general talk with the grandiose title "Computation and Knowledge":

I speak to the question of how computation can contribute to the generation of new knowledge by accelerating the work of distributed collaborative teams and enabling the extraction of knowledge from large quantities of information produced by many workers. I illustrate my presentation with examples of work being performed within the Computation Institute at the University of Chicago and Argonne National Laboratory.

July 16, 2008

We all know that the US patent system has its weaknesses: that every now and then, people get granted patents for things that are either well known or obvious. This probably explains why people keep filing dumb patents: you never know when you might hit the jackpot.

In any case, here's a doozy: following on some 40 or more years of distributed computing research, and 15+ years of commercial and academic grid computing, IBM has finally gotten around to applying for a patent on grid computing. For example:

a computer-implemented method of providing access to grid computing resources available to a plurality of users comprises receiving, from a requesting entity, a request to use a specific grid computing resource to perform a defined function; and routing the request to the specific grid computing resource in accordance with the request.

July 15, 2008

I gave a brief talk at HPC 2008 in Cetraro, Italy, on Grid Projects in the US. I tried to explain what I see as three complementary sets of activities:

Resource providers (RPs) focus on providing substantial communities with on-demand access to computing and storage. TeraGrid, Open Science Grid, campus grids, and the likes of Amazon EC2 and S3 fit in this space.

Service providers (SPs) use either dedicated resources or RP-provided resources to provide services to communities. Certificate authorities, the Globus Reliable File Transfer service, and the Amazon Simple Queue Service are examples of hosted services.

Content providers (CPs) deliver application-specific content (data, software, etc.) to communities, using either dedicated or RP-/SP-provided resources and services. A TeraGrid "science gateway" is an example of a content provider that builds on resources and services provided by a third party.

As with any attempt to categorize complex activities, the divisions are not clear-cut. For example, the cancer biomedical informatics grid (caBIG) project operates resources, hosts services, and provides content. But in recent work, caBIG has started making use of TeraGrid resources.

I (and my co-organizers Ioan Raicu and Yong Zhao) use the term "many task computing" to denote high-performance computations comprising multiple distinct activities, coupled via (for example) file system operations or message passing. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large.
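A minimal sketch of what such a computation might look like, using Python's concurrent.futures (the task names, file layout, and two-stage structure are invented for illustration, not drawn from any real application): a dynamic set of heterogeneous tasks coupled through the file system rather than message passing.

```python
import pathlib
import tempfile
from concurrent.futures import ThreadPoolExecutor

def simulate(workdir, i):
    # Stage 1: each independent task writes its result to a shared file system.
    out = workdir / f"sim_{i}.txt"
    out.write_text(str(i * i))
    return out

def analyze(paths):
    # Stage 2: a different program consumes the stage-1 outputs.
    return sum(int(p.read_text()) for p in paths)

workdir = pathlib.Path(tempfile.mkdtemp())
with ThreadPoolExecutor(max_workers=8) as pool:
    # The task set is dynamic: the stage-2 task is submitted only
    # after the stage-1 tasks have produced their files.
    sims = [pool.submit(simulate, workdir, i) for i in range(10)]
    paths = [f.result() for f in sims]
    total = pool.submit(analyze, paths).result()
print(total)  # 285
```

The two stages are distinct programs coupled only by files, so the whole computation is heterogeneous even though each stage alone is simple.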

Were we right to coin a new term, "many task computing," to denote such applications? There are certainly alternatives that we could have used instead. For example:

Multiple Program Multiple Data (MPMD): A variant of Flynn's original taxonomy, used to denote computations in which several different programs each operate on different data at the same time. (In contrast to SPMD, in which multiple instances of the same program each execute on different processors, operating on different data.) Not a bad term, really, although in our case, the set of tasks can vary dynamically. Maybe we should say dynamic MPMD?

Heterogeneous applications: A computation that involves multiple, different parts. Not a bad term, but rather unspecific. Perhaps a synonym for MPMD?

High throughput computing (HTC): A term coined by Miron Livny to contrast workloads for which the key metric is not floating point operations per second (as in high performance computing: HPC) but operations per month or year. We didn't use that term because the applications we work with are often just as concerned with performance as the most demanding HPC applications--they want to run in minutes or hours; they just don't happen to be SPMD programs.

Workflow: Surely one of the most abused terms in computing, workflow was first used to denote sequences of tasks in business processes, but is sometimes also used to denote any computation in which control passes from one "task" to another. I find its use to describe many task (or MPMD or heterogeneous or ...) computations an unwarranted perversion of the English language.

Capacity computing: A term used to denote a computing resource designed to support many small tasks--in contrast to a capability computing resource, on which a single large computation can run efficiently. I see the same problem here as with HTC: many task computations, while heterogeneous, can be extremely large and can place great demands on a computing system.

Embarrassingly (or happily) parallel: A delightful term used to denote parallel computations in which each individual (often identical) task can execute without any significant communication with other tasks or with a file system. Certainly some "many task applications" will be simple and happily parallel. But others will be bothersomely complex and communication intensive, interacting frequently with other tasks and/or a file system.
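To make the SPMD/MPMD contrast in the list above concrete, here is a toy sketch (purely illustrative; the function names and data are invented): SPMD runs one program body over different data partitions, while MPMD runs several different programs on different data at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(part):          # SPMD: the same program body on every partition
    return sum(part)

def transform(x):          # MPMD: distinct programs...
    return [v * 2 for v in x]

def summarize(x):          # ...operating on different data concurrently
    return max(x)

data = [[1, 2], [3, 4]]
with ThreadPoolExecutor() as pool:
    # SPMD style: identical code, different data partitions.
    spmd = [f.result() for f in [pool.submit(kernel, p) for p in data]]
    # MPMD style: different programs run side by side.
    mpmd = [pool.submit(transform, data[0]).result(),
            pool.submit(summarize, data[1]).result()]
print(spmd, mpmd)  # [3, 7] [[2, 4], 4]
```

A many-task computation resembles the MPMD case, except that the set of programs may also grow and shrink as the computation runs.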

Are we making a useful distinction in using the term "many task computing" rather than one of those above, or just engaging in unnecessary neologism? Tell me what you think!

Perhaps we could simply have said: applications that are communication-intensive but not naturally expressed in MPI. In that sense (and this is really the primary goal of the workshop) we are simply drawing attention to the many computations that are heterogeneous but not "happily parallel." Such computations can arise for a variety of reasons, such as:

Individual tasks are themselves parallel programs.

Many tasks operate on the same input data, and we can use the fast network to broadcast that data to all nodes, or to distribute references to data subsets if that is more efficient (or if the data cannot be replicated).

There is considerable communication between tasks.

There is a need for substantial distributed data reduction operations prior to output.
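The last two reasons above--inter-task communication and distributed data reduction--can be sketched with a tree-structured reduction over many task outputs (names and structure invented for illustration): the kind of coupling that is heterogeneous but not "happily parallel," and not naturally written as a single MPI program.

```python
from concurrent.futures import ThreadPoolExecutor

def produce(i):
    # Stand-in for an expensive, independent task.
    return i + 1

def combine(a, b):
    # Pairwise data reduction between two task outputs.
    return a + b

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(produce, i) for i in range(8)]
    values = [f.result() for f in futures]
    # Reduce pairwise until one value remains; each combine step
    # is itself a task, so tasks communicate across reduction levels.
    while len(values) > 1:
        pairs = zip(values[::2], values[1::2])
        values = [pool.submit(combine, a, b).result() for a, b in pairs]
total = values[0]
print(total)  # 36
```

Each level of the reduction depends on the outputs of the level below it, so the computation is a dag of communicating tasks rather than a flat bag of independent jobs.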

In any case, we hope to see submissions from people working on high throughput computing, data-intensive scalable computing, and any other sort of high-performance computing that isn't conventional SPMD.

July 05, 2008

I spent a wonderful week in Cetraro, Italy, at the 9th HPC Conference. A lovely location, exceptional colleagues, and fascinating presentations and discussions on various topics relating to HPC and Grid. Prof. Lucio Grandinetti organizes a wonderful event.

A focus this year on hardware--people showing off their impressive close-to-petascale computers and talking about plans for exascale. Lest people get too proud, Jack Dongarra reminded us that on average it takes 6-8 years for the number one system on the Top 500 list to fall off the bottom. Some discussion of GPGPUs (but does anyone believe that they won't be replaced by general-purpose multicores within a few years?). Also much discussion concerning power management--but not clear how serious anyone is about the topic. (E.g., no real discussion of what tradeoffs will be made to reduce power consumption, or mention of genuinely low-power systems, such as SiCortex.)

Considerable discussion of "clouds" (whatever they may be), much of it naive and ill-informed. Fascinating to hear the same expansive expectations expressed (without any apparent doubts) for "clouds" as we heard five years ago from the most fulsome proponents of grids. "Soon, applications of all sorts will be hosted in the cloud." Well maybe, but big legal, business model, and sociological barriers have so far hindered hosting of core business applications as services, and those barriers seem slow to fall. "Clouds offer arbitrarily scalable computing." Scalable to infinity, for any value of infinity less than a few hundred (at least at present). "Clouds offer much simpler interfaces than grids." There isn't much to choose between, say, EC2 and the Globus Workspace Service from the interface perspective.

Perhaps the most thoughtful remarks on these topics were from Ignacio Llorente, who is working with relevant technologies via his Globus GridWay and OpenNebula projects. He characterized clouds as "a paradigm for the on-demand provision of virtualized resources as a service" (correctly identifying virtualization rather than interfaces as the key advance) and grids as "the technology that will allow for cloud interoperability." It will be interesting to see whether cloud interoperability emerges as an important requirement. In the grid space, a lack of interoperability has so far proved to be more of an irritant than a real obstacle to progress.

Discussion of the European Grid Initiative, a proposal to create a pan-European grid linking national grids within each EU state. A curious lack of discussion concerning the application drivers for this new infrastructure, or their requirements. Is EGEE (focused on distributing jobs to federated clusters) really the right model? Why no discussion of data federation (a big driver for grid computing in the UK and US), services (at the heart of successful grids such as caBIG, BIRN, and Earth System Grid), or the role of supercomputer centers (surely major "powerplants"?). The software strategy was striking in its lack of vision--keep supporting three separate European middleware platforms (ARC, gLite, Unicore), and attempt to integrate them (to what end?), while ignoring the Condor and Globus software used by so many Europeans. Sounds more like "jobs for the boys" than a strategy for supporting European eScience. Let us hope for an outbreak of vision among European grid leaders.

Our goal in convening this workshop is to encourage discussion among those interested in escaping the "SPMD rut" that characterizes much work in parallel, if not grid, computing. We believe that new applications and more powerful computers are spurring increased interest in computations involving many separate tasks, loosely or tightly coupled, often linked via a global name/file space. These computations may be no less demanding of computing, file systems, and networks than SPMD computations, but have important differences that may require new tools. Topics of interest include:

July 02, 2008

Charles Bacon, GT integration lead, writes: "On behalf of the Globus Toolkit development team I am pleased to announce that a new stable release of the Globus Toolkit is now available. GT4.2.0 contains an upgrade to the web services specifications used by the toolkit [to the final WSRF and WS-Notification specifications] as well as new features in all services. New users are encouraged to use the 4.2.0 release. Existing users may wish to evaluate the new software while maintaining their existing installations; due to the specification upgrade, the web services are incompatible with the 4.0.x series. Details on the spec upgrade are available in the release notes."