Downtime Budgets

Cory Lueninghoener, Los Alamos National Laboratory

Abstract:

The concept of the error budget is a great way to hack SLAs and make them into a positive tool for system engineers. But how can you take the same idea from a world that handles millions of transactions in a day to one that handles hundreds? High-performance computing (HPC) jobs run for hours, days, or weeks at a time, resulting in unique challenges related to system availability, maintenance, and experimentation. This talk will explore a way to adapt the error budget concept to an HPC environment by applying the same idea to cluster outages, both planned and unplanned, with the ultimate goal of giving customers the best computing environment possible.
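
As a rough illustration of the idea, a downtime budget could be computed the same way a conventional error budget is: take an availability target over some period, treat the remainder as allowable downtime, and charge every cluster outage against it, whether planned or unplanned. The sketch below is not taken from the talk. The 95% target, the 30-day period, the outage durations, and all names (`Outage`, `downtime_budget_hours`, `spent_hours`, `remaining_budget_hours`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Outage:
    """One cluster outage (hypothetical record format)."""
    hours: float
    planned: bool


def downtime_budget_hours(availability_target: float, period_hours: float) -> float:
    """Allowable downtime in the period: the fraction not covered by the target."""
    return (1.0 - availability_target) * period_hours


def spent_hours(outages: Iterable[Outage], planned: Optional[bool] = None) -> float:
    """Total outage hours; pass planned=True or planned=False to filter by kind."""
    return sum(o.hours for o in outages if planned is None or o.planned == planned)


def remaining_budget_hours(
    availability_target: float, period_hours: float, outages: Iterable[Outage]
) -> float:
    """Budget left after charging both planned and unplanned outages against it."""
    budget = downtime_budget_hours(availability_target, period_hours)
    return budget - spent_hours(outages)


if __name__ == "__main__":
    # Assumed numbers: 95% availability target over a 30-day (720-hour) period.
    outages = [Outage(hours=8.0, planned=True), Outage(hours=3.5, planned=False)]
    print("budget:", downtime_budget_hours(0.95, 720.0))
    print("remaining:", remaining_budget_hours(0.95, 720.0, outages))
```

Under this framing, a positive remaining budget is time that could be spent on maintenance or experimentation, while a negative one suggests holding off on further planned outages for the rest of the period.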

Cory Lueninghoener leads the HPC Design Group at Los Alamos National Laboratory. He has helped design, build, and manage some of the largest scientific computing resources in the world, including systems ranging in size from 100,000 to 900,000 processors. He is especially interested in turning large-scale system research into practice, and has worked on configuration management and system management tools in the past. Cory was co-chair of LISA 2015 and is active in the large-scale system engineering community.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.