TOKIO: Total Knowledge of I/O

The Total Knowledge of I/O (TOKIO) project is developing algorithms and a software framework to analyze I/O performance and workload data from production HPC resources at multiple system levels. This holistic I/O characterization framework gives application scientists, facility operators, and computer science researchers a clearer view of system behavior and of the causes of deleterious behavior. TOKIO is a collaboration between Lawrence Berkeley National Laboratory and Argonne National Laboratory and is funded by the DOE Office of Science through the Office of Advanced Scientific Computing Research. Its reference implementation is available for download and open to contributions on GitHub.

TOKIO Architecture

TOKIO is a software framework designed to encapsulate the mechanics of the various I/O monitoring and characterization tools used in HPC centers around the world. This reduces the burden on institutional I/O experts to maintain a complete understanding of how each tool works and what data it provides. It comprises three distinct layers:

TOKIO connectors are modules that interface directly with component-level monitoring tools such as Darshan, LMT, or mmperfmon. They simply convert the data emitted by a specific tool into an in-memory object that can be manipulated by the other layers of the TOKIO framework.

TOKIO tools combine site-specific knowledge with different connectors and expose interfaces that provide data from parts of the storage subsystem without requiring a deep understanding of the specific tools used by an HPC center. For example, a tool may answer the question "what was the I/O performance of job 5723433?" by understanding how to use that jobid to find any and all monitoring data that represent I/O performance.

TOKIO analysis apps and data services are more sophisticated analyses, visualization tools, and data management utilities that combine tools and connectors to provide a holistic view of all components in the I/O subsystem.

Overview of the TOKIO architecture
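The three layers described above can be illustrated with a minimal Python sketch. This is not the TOKIO reference implementation or its API. Every name here (Sample, parse_samples, job_bytes, job_report) is hypothetical, and the "timestamp,bytes" input format is invented for illustration rather than taken from Darshan, LMT, or mmperfmon.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    """One in-memory monitoring sample (hypothetical schema)."""
    timestamp: int    # seconds since an arbitrary epoch
    bytes_moved: int  # bytes read plus written in the sample interval


def parse_samples(raw: str) -> list[Sample]:
    """Connector layer: turn one tool's raw text into in-memory objects.

    Assumes an invented 'timestamp,bytes' line format, not the output
    of any real monitoring tool.
    """
    samples = []
    for line in raw.strip().splitlines():
        timestamp, nbytes = line.split(",")
        samples.append(Sample(int(timestamp), int(nbytes)))
    return samples


def job_bytes(jobid: str,
              job_times: dict[str, tuple[int, int]],
              samples: list[Sample]) -> int:
    """Tool layer: use a jobid to find the monitoring data for that job.

    The site-specific knowledge is the jobid -> (start, end) lookup.
    """
    start, end = job_times[jobid]
    return sum(s.bytes_moved for s in samples if start <= s.timestamp <= end)


def job_report(jobid: str,
               job_times: dict[str, tuple[int, int]],
               sources: dict[str, list[Sample]]) -> dict[str, int]:
    """Analysis layer: combine every data source into one view of the job."""
    return {name: job_bytes(jobid, job_times, samples)
            for name, samples in sources.items()}


# Made-up inputs: one job window and two monitoring sources.
job_times = {"5723433": (100, 200)}
sources = {
    "file_system": parse_samples("90,10\n100,20\n150,30\n210,40"),
    "application": parse_samples("120,25\n180,25"),
}
report = job_report("5723433", job_times, sources)
```

In this sketch the caller only needs a jobid. The knowledge of how each source is formatted and how a job maps onto a time window stays inside the connector and tool functions, which is the separation the layering is meant to provide.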

An example of a TOKIO analysis app is the Unified Monitoring and Metrics Interface (UMAMI), which provides a simple visualization of how different components of the I/O subsystem were performing over a time of interest.

Unified Monitoring and Metrics Interface (UMAMI) of an anomalously performing HACC job on Edison's scratch3 file system during 2017.
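One way to picture the kind of comparison such a view supports is to set the metrics recorded for a job of interest against the recent history of the same metrics. The sketch below is only an illustration under stated assumptions. The text does not say how UMAMI decides what is anomalous, so the quartile rule, the function names, the metric names, and all values here are hypothetical.

```python
import statistics


def classify(history, value):
    """Label a value relative to the quartiles of earlier measurements.

    The quartile thresholds are an assumption made for illustration.
    """
    q1, _median, q3 = statistics.quantiles(history, n=4)
    if value < q1:
        return "low"
    if value > q3:
        return "high"
    return "typical"


def summarize(metrics):
    """Classify the last value of each series against the values before it.

    Each series is ordered in time and its final entry belongs to the
    job of interest.
    """
    return {name: classify(series[:-1], series[-1])
            for name, series in metrics.items()}


# Made-up measurements for three components of the I/O subsystem.
metrics = {
    "job_bandwidth": [10, 11, 9, 10, 12, 10, 3],
    "server_cpu_load": [20, 22, 21, 19, 20, 21, 60],
    "fs_fullness": [50, 51, 52, 50, 51, 52, 51],
}
summary = summarize(metrics)
```

With these inputs the job's bandwidth falls below the first quartile of its history while the server load rises above the third. Seeing those two labels side by side is the sort of cross-component picture a unified view is meant to give.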