Data-intensive scientific workflows are at a pivotal stage in which traditional local computing resources can no longer meet the storage or computing demands of scientists. In the Earth System Sciences (ESS) community, we are facing an explosion of data volumes: new datasets, sourced from models, in-situ observations, and remote sensing platforms, are being made available at volumes too large to store even at medium to large High Performance Computing (HPC) centers. NASA has estimated that by 2025, it will be storing upwards of 250 Petabytes (PB) of its data using commercial cloud services (e.g. Amazon Web Services (AWS)). Availability of these data in cloud environments, co-located with a wide range of computing resources, will revolutionize how scientists use these datasets and provide opportunities for important scientific advancements. Fully leveraging these opportunities will require new approaches in the way the ESS community handles data access, processing, and analysis.

The technologies developed in this project will be deployable on the commercial cloud infrastructure where data from NASA's Earth Observing System Data and Information System (EOSDIS) are anticipated to be stored. At present, tools for working with these datasets consist of convenient interfaces for discovering and downloading data (e.g. Earthdata Search) from individual Distributed Active Archive Centers (DAACs). We anticipate that the transition to cloud storage for many of these DAACs will bring immense opportunities and specific challenges to researchers.

This project will facilitate the ESS community's transition into cloud computing by developing technologies that build on existing open-source tools (e.g., Python, Jupyter) and on the growing Pangeo ecosystem.

Our first task is to deploy a scalable cloud-based JupyterHub on AWS for community use. JupyterHub is a multi-user, multi-language interactive computing environment that facilitates open-ended, exploratory analysis and data visualization. Content ("notebooks") developed on JupyterHub is both functional and fluid, in the manner of an "executable paper" that combines data, processing, and interpretation. This is a necessary departure from traditional publication as a sequence of static artifacts.

Our second task is to integrate existing NASA data discovery tools with cloud-based data access protocols. Existing data discovery tools, such as the Common Metadata Repository (CMR) and Global Imagery Browse Services (GIBS), provide convenient access to dataset metadata, but navigating the access, retrieval, and processing steps for these datasets is left to individual users. We are developing an advanced Python application programming interface (API) that leverages high-level tools like Xarray and Dask, allowing scientists to accelerate their analysis. Integrating this API with the Pangeo ecosystem gives its users access to cutting-edge scientific tools for pre-processing, regridding, machine learning, and visualization.
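To make the discovery step concrete, the following is a minimal sketch of how a granule search against the public CMR search endpoint might be expressed in Python. It is not the project's API: the function name `build_granule_query`, its arguments, and the example collection short name are assumptions chosen for illustration. The sketch only builds the query URL and performs no network access.

```python
from urllib.parse import urlencode

# Public CMR granule search endpoint, JSON response format.
CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.json"


def build_granule_query(short_name, start, end, bounding_box=None, page_size=10):
    """Return a CMR granule search URL for one collection and time window.

    This helper is hypothetical and exists only for illustration.

    short_name   -- CMR collection short name
    start, end   -- ISO 8601 timestamps, e.g. "2019-06-17T00:00:00Z"
    bounding_box -- optional (west, south, east, north) in decimal degrees
    page_size    -- maximum number of granules returned per page
    """
    params = [
        ("short_name", short_name),
        ("temporal", f"{start},{end}"),
        ("page_size", str(page_size)),
    ]
    if bounding_box is not None:
        # CMR expects the box as "west,south,east,north".
        params.append(("bounding_box", ",".join(str(v) for v in bounding_box)))
    return f"{CMR_GRANULE_SEARCH}?{urlencode(params)}"


# Example: hourly NLDAS forcing data over the contiguous United States.
# The short name is an assumption used for illustration only.
example_url = build_granule_query(
    "NLDAS_FORA0125_H",
    "2019-06-17T00:00:00Z",
    "2019-06-21T23:59:59Z",
    bounding_box=(-125, 25, -67, 53),
)
print(example_url)
```

In a workflow of the kind described above, the data links returned by such a query could then be opened lazily with Xarray (for example via `xarray.open_mfdataset`) so that Dask can parallelize the subsequent analysis.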

We demonstrate the use of these tools with several datasets, including the North American Land Data Assimilation System (NLDAS), Gravity Recovery and Climate Experiment (GRACE), and Sentinel-1 synthetic aperture radar (SAR). The example applications serve as templates for the broader community and as real-world test cases for evaluating the cloud services and applications we develop.

The project will help accelerate a shift in the ESS culture toward cloud computing by providing short but intensive training opportunities that offer new ways for scientists to collaborate and make full use of NASA satellite datasets.

Update October 2019

The project has made steady progress towards the goal of facilitating the Geoscience community's transition into cloud computing by building on top of the growing Pangeo ecosystem.

Provided outreach and training, and participated in the ongoing outreach efforts of the Pangeo project. The Pangeo JupyterHub was deployed on AWS to support several hackweeks offered by the UW eScience Institute and the Applied Physics Laboratory. These included the Cryospheric Sciences with ICESat-2 hackweek on June 17-21, 2019, and Geohackweek on Sept 9-13, 2019. Both events were ideal environments to test the scalability of the JupyterHub infrastructure with more than 50 simultaneous users.

Developed documentation, interactive examples, and scientific use cases that combine NASA data stored in the cloud with the Pangeo software stack.