The CI @ SC14: Discovery Engines, Exascale, Cloud Computing, and More

14 Nov 2014

Next week, the world's experts on high-performance computing, networking, storage, and analysis will gather in New Orleans for SC14, the 2014 edition of the international supercomputing conference. Several Computation Institute researchers will be there, presenting papers, participating in panels, leading workshops and tutorials, and speaking at the Department of Energy booth. Below are some of the highlighted CI events, covering exascale computing, data services for campuses and national laboratories, the mapping of microbial life, experimental cloud computing, and much more.

The 17 national laboratories of the U.S. Department of Energy are important hubs for science, studying everything from clean energy and battery technology to the universe and subatomic particles. The massive scale of these projects and the powerful instruments used to execute the research make the national labs important hubs for scientific data as well, creating, analyzing, importing, and exporting petabytes of information each day. To ensure that this growing data flood does not cause a traffic jam that slows the national pace of science, a project from Computation Institute (CI), Mathematics and Computer Science (MCS), and Argonne Leadership Computing Facility (ALCF) researchers is building and applying a new “data fabric” for seamless and shareable access to data.

Argonne’s Discovery Engines for Big Data project seeks to enable new research modalities based on the integration of advanced computing with experiments at DOE facilities. The infrastructure includes the Petrel online data store, Globus research data management services, the supercomputing resources of the ALCF, and the parallel scripting language Swift. The work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.

Early users of the system include Argonne’s Advanced Photon Source (APS), a powerful x-ray facility used by thousands of scientists around the world to examine the elementary structures of materials important for medicine, engineering, and other fields. With Petrel, data collected by a visiting scientist at the APS can easily be moved back to the scientist’s home institution or shared with outside collaborators. Researchers can also use Globus data publication and discovery services to permanently store datasets and, if desired, make them public and discoverable for the scientific community.

Microbial life on Earth is so diverse that there are an estimated 10^30 organisms and perhaps 10^34 proteins. Millions of protein families may exist, with a seemingly unlimited number of novel sequences. However, as few as 1% of these organisms can be cultured in a laboratory. Apart from what we know from cultivated microbes, our understanding of microorganisms comes from sequencing DNA extracted directly from environmental samples. Yet we often do not have direct knowledge of which sequence came from which organism.

In a collaboration of five U.S. national laboratories, we are building the tools to improve our ability to search and mine the collection of environmental proteins and to relate those proteins to ones we can study in the context of their full genome. We are building a complete mapping from the proteins seen in environmental samples to the proteins that occur in sequenced genomes. This enables us to identify proteins in environmental samples that are closely related to families of proteins of interest to DOE applications, and to calibrate the methods for building these maps. We also determine rules that govern the co-occurrence of protein families and protein clusters within sequenced organisms, so that patterns in environmental samples can be interpreted to gain biological insight into the nature of the organisms and their communities. Finally, we demonstrate how to scale data storage, data management, and computational analysis methods for a future that will contain many millions of isolate genomes and environmental samples, as well as the many billions or trillions of proteins that can be identified from these datasets.
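The mapping idea can be illustrated with a toy sketch. This is not the project's actual pipeline, which operates at far larger scale with real sequence-alignment tools; here, hypothetical protein sequences are compared by k-mer overlap, and each environmental sequence is assigned to its best-matching reference family:

```python
from collections import Counter

def kmers(seq, k=3):
    """Decompose a protein sequence into overlapping k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def similarity(a, b, k=3):
    """Jaccard-like similarity between the k-mer profiles of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    shared = sum((ka & kb).values())
    total = sum((ka | kb).values())
    return shared / total if total else 0.0

def map_to_families(env_seq, families, threshold=0.3):
    """Assign an environmental sequence to the best-matching known family,
    or return None if nothing is similar enough."""
    best = max(families, key=lambda name: similarity(env_seq, families[name]))
    return best if similarity(env_seq, families[best]) >= threshold else None

# Hypothetical reference families from sequenced isolate genomes
families = {
    "hydrogenase_like": "MKVAVLGAAGGIGQALALLLKTQLP",
    "cellulase_like":   "MSTNPKPQRKTKRNTNRRPQDVKFP",
}
env = "MKVAVLGAAGGIGQALSLLLKTQLA"  # a read from an environmental sample
print(map_to_families(env, families))
```

A real mapping effort would use homology search over curated family databases rather than raw k-mer overlap, but the shape of the problem — many environmental sequences, a reference set from isolate genomes, and a similarity threshold — is the same.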

By comparing the sequences from the environment to known sequences from isolate genomes, we can expand our knowledge about the millions of organisms that can’t yet be grown in a laboratory, and even improve our ability to culture some of them. Ultimately, this could enable the discovery of proteins in the environment that hold the key to science and engineering problems at DOE (e.g., biofuels, energy production, novel chemistry, novel structures).

The Swift parallel programming language allows users to perform large-scale simultaneous runs of simulations and data analyses more efficiently and with less user effort. Users write what look like ordinary serial scripts; Swift automatically spreads the work expressed in those scripts over as many parallel CPUs as the user has available. Swift efficiently automates critical functions that are hard, costly, and unproductive to do manually:

implicit parallelization using functional dataflow

data transport and distribution of work across diverse systems

failure recovery and error handling

Swift is both portable and fast. It provides a uniform way to run scientific and engineering application workflows over diverse parallel multicore PCs, clusters, and clouds. On supercomputers, Swift has achieved speeds of over a billion tasks per second. Swift is used in applications in materials, chemistry, biology, earth systems science, power grid modeling, and simulations in architectural design and urban planning.
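The dataflow idea behind Swift's implicit parallelization can be sketched in Python (this is an analogue using futures, not Swift syntax): independent calls run concurrently with no explicit threading code, and a dependency is expressed simply by consuming a result.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(param):
    """Stand-in for an expensive, independent simulation task."""
    return param * param

def analyze(results):
    """Stand-in for a downstream analysis that needs all simulation outputs."""
    return sum(results)

with ThreadPoolExecutor() as pool:
    # These "simulations" have no dependencies on one another,
    # so the runtime is free to execute them in parallel.
    futures = [pool.submit(simulate, p) for p in range(8)]
    # analyze() depends on every simulation; calling .result()
    # expresses that dataflow edge and waits only where needed.
    total = analyze(f.result() for f in futures)

print(total)  # sum of squares 0..7 = 140
```

In Swift the parallelism is fully implicit — the user writes plain function calls and the language runtime discovers the dataflow — whereas here the futures make the same structure explicit for illustration.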

Argo is a new exascale operating system and runtime designed to support extreme-scale scientific computation. It is built on an agile, modular architecture that supports both global optimization and local control. It aims to efficiently leverage new chip and interconnect technologies while addressing the new modalities, programming environments, and workflows expected at exascale. It is designed from the ground up to run future high-performance computing applications at extreme scales. At the heart of the project are four key innovations: dynamic reconfiguration of node resources in response to workload, allowance for massive concurrency, a hierarchical framework for power and fault management, and a “beacon” mechanism that allows resource managers and optimizers to communicate with and control the platform. These innovations will result in an open-source prototype system that runs on several architectures and is expected to form the basis of production exascale systems deployed in the 2018–2020 timeframe.
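To give a feel for the hierarchical power-management idea, here is a deliberately simplified sketch (the names, weights, and budget values are hypothetical, not Argo's actual interfaces): a global power budget is divided across groups of nodes, and each group's share is divided again across its nodes.

```python
def divide_budget(total_watts, weights):
    """Split a power budget proportionally to workload weights."""
    total_weight = sum(weights.values())
    return {name: total_watts * w / total_weight for name, w in weights.items()}

# Hypothetical machine: a 100 kW global budget split across node groups
# ("enclaves"), then across the nodes within each group.
enclaves = {"simulation": 3.0, "analysis": 1.0}
enclave_budget = divide_budget(100_000, enclaves)  # watts per enclave

nodes = {"node0": 1.0, "node1": 1.0}
node_budget = divide_budget(enclave_budget["simulation"], nodes)
print(node_budget)  # each simulation node gets 37500.0 W
```

The point of the hierarchy is that decisions cascade: a global manager only reasons about enclaves, and each enclave only reasons about its own nodes, which keeps control tractable at extreme scale.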

Existing campus data services are limited in their reach and utility due, in part, to unreliable tools and a wide variety of storage systems with sub-optimal user interfaces. An increasingly common approach to campus bridging pairs Globus with the Science DMZ, enabling reliable, secure file transfer and sharing while optimizing use of existing high-speed network connections and campus identity infrastructures. Attendees will be introduced to Globus and have the opportunity for hands-on interaction, installing and configuring the basic components of a campus data service. We will also describe how newly developed Globus services for public cloud storage integration and metadata management may be used as the basis for a campus data publication system that meets an increasingly common need at many campus libraries.

The tutorial will help participants answer these questions: What services can I offer to researchers for managing large datasets more efficiently? How can I integrate these services into existing campus computing infrastructure? What role can the public cloud play (and how does a service like Globus facilitate its integration)? How should such services be delivered to minimize the impact on my infrastructure? What issues should I expect to face (e.g., security), and how should I address them?