What’s needed for the Future of AOS Python? Tools for Automating AOS Data Analysis and Management

Preface: the future looks good

Python’s standing in the AOS community has never been stronger: its user base is passionate and growing, and AOS-relevant packages and functionality continue to proliferate. These trends seem poised to continue, with (among other things) the emergence of the xarray package for labeled N-dimensional arrays and the dask package for out-of-core computation.

In this post, we discuss one outstanding community need and our recent work in Python on a solution. Meeting it would further accelerate Python’s already impressive momentum in the AOS community.

Motivations: too many calculations, too little time

Climate and weather modeling centers around the world put significant effort into optimizing their models to run quickly on high-performance computing machines and to be straightforward for users to configure and run. While this focus on improving the user experience of running models is never-ending (and vital), less attention has been given to improving users’ workflows post-simulation. In particular, the systematic analysis of the results from a collection of runs from one or more models remains a challenge for a variety of reasons.

Climate and weather models and gridded observational products generally provide many useful quantities, but usually some fields you’re interested in must be computed from the provided outputs. Even for those stored directly in data files, further slicing and dicing in time and/or space is typically required. Moreover, these computations and spatiotemporal reductions need to be repeated across multiple models, simulations, date ranges, subsets of the annual cycle, and so on.

Manually permuting over all of these parameter combinations quickly becomes impractical. But even once this gets automated, the resulting plethora of data is of little value unless it is highly organized and imbued with ample metadata describing these parameter settings.
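To make the scale of the problem concrete, here is a minimal sketch in plain Python of how the combinations multiply. It is not aospy code; the model, run, variable, and season names are hypothetical placeholders chosen for illustration.

```python
from itertools import product

# Hypothetical parameter values, for illustration only.
models = ["model_a", "model_b"]
runs = ["control", "perturbed"]
variables = ["precip", "temp", "olr"]
date_ranges = [(1981, 2000), (2081, 2100)]
seasons = ["ann", "djf", "jja"]


def enumerate_calcs(models, runs, variables, date_ranges, seasons):
    """Return one metadata dict per parameter combination."""
    return [
        {"model": m, "run": r, "var": v, "start": d[0], "end": d[1], "season": s}
        for m, r, v, d, s in product(models, runs, variables, date_ranges, seasons)
    ]


calcs = enumerate_calcs(models, runs, variables, date_ranges, seasons)
print(len(calcs))  # 2 * 2 * 3 * 2 * 3 = 72
```

Even this toy setup yields 72 distinct calculations, and each added model, variable, or date range multiplies the total. Keeping the parameters together in a dictionary is one simple way to carry the metadata along with each result.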

Our solution: the “aospy” package for automated AOS data analysis

We have developed an open-source Python package, “aospy”, that provides a framework for automating calculations across datasets and managing the resulting output. Using aospy, it is possible to submit thousands of calculations spanning multiple models, simulations, variables, etc. with a single command. The results are saved as netCDF files in a highly organized and metadata-rich directory and filename structure, making it easy to access them for plotting or other analyses.
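As a rough illustration of what a metadata-rich directory and filename structure can look like, the sketch below builds an output path from the parameters of a single calculation. The layout and the `output_path` helper are hypothetical and are not the convention or API that aospy itself uses.

```python
from pathlib import PurePosixPath


def output_path(root, model, run, var, start, end, season):
    """Build a self-describing netCDF path from a calculation's parameters.

    Hypothetical layout for illustration; not the one aospy uses.
    """
    fname = f"{var}.{season}.{start:04d}-{end:04d}.nc"
    return PurePosixPath(root) / model / run / var / fname


path = output_path("results", "model_a", "control", "precip", 1981, 2000, "djf")
print(path)  # results/model_a/control/precip/precip.djf.1981-2000.nc
```

Because every parameter appears in the path, a result can be located (or identified after the fact) without consulting a separate log of what was run.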

aospy’s documentation includes high-level overviews of the aospy computational procedures, concrete examples of using aospy with code snippets and plots, and a full API reference. aospy also ships with example netCDF data and a sample “main” script that users can execute out of the box to get a feel for aospy’s workflow.

We also recommend that any MATLAB users check out the GOAT (Geophysical Observational Analysis Tool) package developed by Ori Adam and colleagues. It seeks to address similar needs and is more mature than aospy; among other nice features, it includes a GUI and directly integrates with CMIP and other online data archives.

The future: growing automation need

The disparity in sophistication between the software used to generate simulations (i.e. climate and weather models) and that used to analyze their outputs (often an unwieldy assemblage of very simple scripts) is already vast. This directly limits the quality and quantity of scientific insights gleaned from these datasets. The increasing prominence of large ensemble simulations in both weather and climate research (e.g. http://www.cesm.ucar.edu/projects/community-projects/LENS/) further heightens the need for tools that automate calculations across data sources.

aospy is our attempt at narrowing that gap, and we have many ideas for how to further improve it (see our GitHub Issues page). Regardless of aospy’s fate, we urge the AOS community to think harder about the tools we use to interact with our model and observational data once it has been generated, and about how we keep up with the ever-growing scale of that output. We believe that Python is uniquely well suited for these kinds of tools, although GOAT demonstrates that they are entirely feasible in other languages as well.

One encouraging development in this regard is the pangeo-data organization created by Ryan Abernathey. Pangeo-data seeks to foster development of scalable Python tools in the broadest sense, including with respect to the bottlenecks discussed here. It is our hope that pangeo-data (which is actively seeking additional members) can provide a forum for and help initiate the discussions on this topic needed to spur progress.

Postscript: potential benefits for Open Science

The ongoing rise of Python in AOS (and other disciplines) coincides with the rise of Open Science, i.e. the principle that scientific analyses should be made freely available and meticulously documented. aospy helps promote this in at least two ways. First, all calculations in aospy proceed through the same pipeline, making it straightforward to understand how each was completed. Second, the extensive automation features of aospy make large sets of calculations easily reproducible from a single Python object or script.