Interesting

August 13, 2008

I'm fascinated by George Church's Personal Genome Project (the name recalls another PGP), in which one volunteers to have one's genome sequenced and made available for research. Personally, I think it's a great thing and I would be delighted to participate. I can't imagine caring if I (or the world) know that I have the gene for obsessive-compulsive disorder or whatever I may turn out to have. The one thing that gives me pause is wondering whether my children or siblings should have a say in whether I publish my genome, given that they share a fair bit of it. (I don't think my parents would mind.) The PGP web site doesn't discuss that issue, although it does point out other dangers that hadn't occurred to me, e.g.:

[someone might] make synthetic DNA corresponding to the participant and plant it at a crime scene

A quick search didn't reveal too many profound thoughts on this topic, just some recognition that it is an issue. E.g., from Baylor College of Medicine:

With [personal genomes] will come a host of legal and ethical issues, said Amy McGuire, J.D., Ph.D., Assistant Professor of Medicine in BCM's Center for Medical Ethics and Health Policy.

"Sequencing a personal genome possibly will reveal information about children, parents and siblings," she said. At present, there are no real standards as to what control family members can have over sequencing of an individual's genome or its release.

August 12, 2008

Alina Bejan just posted the following announcement:

MidWest Grid School (MWGS'08) -- Call for Participation

Please JOIN US for an exciting 3-day course in large-scale and high-performance grid computing to take place Sep 17-19, 2008, at the University of Chicago, Chicago, IL.

The Open Science Grid (OSG), a major national grid infrastructure, provides scientists with more than 70 production sites offering over 20,000 CPUs and 4 Petabytes of storage to advance their research. This organization includes members from particle and nuclear physics, astrophysics, bioinformatics, gravitational-wave science and computer science collaborations, all contributing to the development of the OSG and benefiting from advances in grid technology. Applications in other areas of science, such as mathematics, medical imaging and nanotechnology can also gain from the interactions with OSG through its partnership with local and regional grids or their communities’ use of the Virtual Data Toolkit software stack.

We invite you to learn more about grid and high throughput computing and its implications in various research areas through this intensive OSG course that introduces the techniques of grid and distributed computing for science and engineering with hands-on training in the use of large-scale grid computing resources.

The workshop will focus on enabling the use of OSG and TeraGrid cyberinfrastructure to perform large-scale computations and data-intensive processing in different application domains. Participants will learn how to use grids of thousands of processors and will be able to continue to use these resources for their research after the course completion.

The Computation Institute, a joint effort of the University of
Chicago and the U.S. Department of Energy's Argonne National
Laboratory, has received a grant for a computer system that will enable
researchers to store, access and analyze massive data sets.

The system is made possible through a $1.5 million National Science
Foundation grant, which includes cost-sharing support from the
University of Chicago. The new system, called the Petascale Active
Data Store (PADS), is optimized for rapid data transactions, both on
campus and around the globe.

The PADS design resulted from a study of the storage and analysis
requirements of groups in astronomy and astrophysics, computer science,
economics, evolutionary and organismal biology, geosciences,
high-energy physics, linguistics, materials science, neuroscience,
psychology and sociology.

For these groups, according to the PADS team, PADS represents a
significant opportunity to look at their data in new ways, enabling new
scientific insights and collaborations across disciplines. PADS also
will serve as a vehicle for computer science research into active data
storage systems and will provide rich data to investigate new
techniques.

Several nVidia Tesla graphics processing units (GPUs) will be
integrated with traditional CPUs in the PADS system. These GPUs are
capable of computing certain operations many times faster than
general-purpose personal computers.

PADS will be a hybrid system with many layers of storage. These
layers range from a large, tape-based system at Argonne to individual
computers on campus and elsewhere. The intermediate layer is a rack of
computer disks at Argonne containing duplicate data sets as insurance
against hard-drive failure.

To University of Chicago scientists, PADS represents a dramatic
improvement over current practice, which requires them to quickly
analyze data and then remove it from the system to make room for new
data sets. With the storage that PADS provides, groups will be able to
keep data active for longer periods of analysis.

August 07, 2008

For a while now, I have been watching the ever-broadening range of definitions for "cloud." My favorite is a recent piece that explains that Skype, BitTorrent, and SETI are all clouds. I now realize that a cloud can be whatever you want it to be, and that this is the best definition of the term, from dictionary.com:

cloud: 11. to make obscure or indistinct; confuse

That got me wondering whether the "grid" term has also been subject to similar expansion over the years. To some extent, yes--I've read about discovery grids, knowledge grids, data grids, and too many others to count. But I do think there has been a fair bit of conceptual clarity, albeit with some expansion over time. Specifically:

The term was initially used to refer to on-demand computing (basically the Amazon cloud definition, but without the benefits of virtual machines) -- e.g., see the first edition of The Grid

Then it was broadened to include resource federation within distributed virtual organizations (what you need to enable distributed teams to achieve on-demand access to their federated resources) -- e.g., see The Anatomy of the Grid

Then there is the use of the term to mean "any sort of parallel computing" -- e.g., see Sun Grid Engine, Oracle 10g, etc. But that is just marketing.

Having dissed the term, I should explain why we at Chicago/Argonne are using it. In brief, we see an opportunity to make connections with the extremely interesting developments in on-demand/utility/cloud computing that are emerging in industry. Thus:

Kate Keahey describes her work on Nimbus as a "cloud," to point out that her Globus virtual workspace service provides the same virtual machine provisioning capabilities that Amazon EC2 provides (and then some), but in a package that you can run on your own machines (if you so desire).

We are holding a workshop on Cloud Computing and Applications in Chicago in October to help establish connections between those working on data-intensive science and those developing tools and services for on-demand/utility/cloud computing.

July 18, 2008

I was invited to speak at the University of Chicago's Physics Colloquium. I foolishly agreed and then had to work out what I could possibly say that would be of interest to physicists. I figured that a talk on distributed computing wouldn't be too interesting. Instead, I gave a more general talk with the grandiose title "Computation and Knowledge":

I speak to the question of how computation can contribute to the generation of new knowledge by accelerating the work of distributed collaborative teams and enabling the extraction of knowledge from large quantities of information produced by many workers. I illustrate my presentation with examples of work being performed within the Computation Institute at the University of Chicago and Argonne National Laboratory.

July 16, 2008

We all know that the US patent system has its weaknesses: that every now and then, people get granted patents for things that are either well known or obvious. This probably explains why people keep filing dumb patents: you never know when you might hit the jackpot.

In any case, here's a doozy: following on some 40 or more years of distributed computing research, and 15+ years of commercial and academic grid computing, IBM has finally gotten around to applying for a patent on grid computing. For example:

a computer-implemented method of providing access to grid computing resources available to a plurality of users comprises receiving, from a requesting entity, a request to use a specific grid computing resource to perform a defined function; and routing the request to the specific grid computing resource in accordance with the request.
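Read literally, the claimed method amounts to a lookup followed by a dispatch. A minimal sketch in Python (all names here are my own invention, not taken from the patent) of "receiving a request to use a specific grid computing resource" and "routing the request to the specific grid computing resource":

```python
def route_request(resources, request):
    """Receive a request naming a specific grid resource and a
    function, and route the request to that resource."""
    resource = resources[request["resource"]]  # the "specific grid computing resource"
    return resource(request["function"])       # perform the "defined function"

# Toy registry of grid resources; each is just a callable here.
resources = {
    "cluster-a": lambda fn: f"cluster-a ran {fn}",
    "cluster-b": lambda fn: f"cluster-b ran {fn}",
}

print(route_request(resources, {"resource": "cluster-a", "function": "blast"}))
# prints "cluster-a ran blast"
```

Which is to say: the claim, as quoted, describes dictionary lookup plus function call, the bread and butter of every resource broker built over the past decades.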

July 15, 2008

I gave a brief talk at HPC 2008 in Cetraro, Italy, on Grid Projects in the US. I tried to explain what I see as three complementary sets of activities:

Resource providers (RPs) focus on providing substantial communities with on-demand access to computing and storage. TeraGrid, Open Science Grid, campus grids, and the likes of Amazon EC2 and S3 fit in this space.

Service providers (SPs) use either dedicated resources or RP-provided resources to provide services to communities. Certificate authorities, Globus Reliable File Transfer service, and Amazon Simple Queue service are examples of hosted services.

Content providers (CPs) deliver application-specific content (data, software, etc.) to communities, using either dedicated or RP-/SP-provided resources and services. A TeraGrid "science gateway" is an example of a content provider that builds on resources and services provided by a third party.
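The layering above can be sketched in code. This is a toy illustration (the class and method names are entirely my own, not any project's API) of how a CP builds on an SP, which in turn draws on an RP:

```python
class ResourceProvider:
    """RP: supplies on-demand compute/storage, e.g. a TeraGrid site or EC2."""
    def allocate(self, cpus):
        return f"{cpus} CPUs allocated"

class ServiceProvider:
    """SP: hosts a service (here, file transfer) on RP resources."""
    def __init__(self, rp):
        self.rp = rp
    def transfer_file(self, src, dst):
        self.rp.allocate(1)  # the service draws on RP capacity
        return f"transferred {src} -> {dst}"

class ContentProvider:
    """CP: a science gateway delivering domain content via an SP."""
    def __init__(self, sp):
        self.sp = sp
    def publish(self, dataset):
        return self.sp.transfer_file(dataset, "community-archive")

gateway = ContentProvider(ServiceProvider(ResourceProvider()))
print(gateway.publish("genome-data"))
# prints "transferred genome-data -> community-archive"
```

The point of the layering is that each role can be filled independently: a gateway need not run its own clusters, and a hosted service need not know what content is built on top of it.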

As with any attempt to categorize complex activities, the divisions are not entirely clean. For example, the cancer Biomedical Informatics Grid (caBIG) project operates resources, hosts services, and provides content. But in recent work, caBIG has started making use of TeraGrid resources.