PROBLEM ANALYSIS

Overview

Automating problem analysis is crucial to achieving maintainable systems at the scales needed for tomorrow's high-end computing. Our research explores methodologies and algorithms for automating analysis of failures and performance degradations in large-scale systems, such as distributed storage. Problem analysis includes such crucial tasks as identifying which component(s) misbehaved and the likely root causes, diagnosing performance problems, and providing supporting evidence for any conclusions. Fingerpointing is one approach to problem diagnosis in which node-level (local) anomaly detection is followed by system-wide (global) detection.
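To make the local-then-global idea concrete, the following is a minimal sketch and not the project's actual algorithm. It assumes each node reports a single numeric metric (for example, request latency), uses a z-score against the node's own history as the local detector, and uses a median/MAD comparison against peers as the global step. The function names, thresholds, and sample data are all hypothetical.

```python
from statistics import mean, median, pstdev


def local_anomaly_score(history, current):
    """Local step (assumed detector): z-score of the current sample
    against the node's own past samples."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


def fingerpoint(histories, currents, local_threshold=3.0, peer_threshold=3.0):
    """Return nodes that are anomalous locally AND deviate from their peers.

    histories: dict of node name -> list of past metric values
    currents:  dict of node name -> latest metric value
    Thresholds are illustrative defaults, not tuned values.
    """
    # Local (node-level) detection: each node is judged against itself.
    locally_anomalous = {
        node
        for node in histories
        if local_anomaly_score(histories[node], currents[node]) > local_threshold
    }

    # Global (system-wide) detection: compare suspects against all peers
    # using the median and the median absolute deviation (MAD).
    values = list(currents.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])

    culprits = []
    for node in sorted(locally_anomalous):
        deviation = abs(currents[node] - med)
        if mad == 0:
            is_outlier = deviation > 0
        else:
            is_outlier = deviation / mad > peer_threshold
        if is_outlier:
            culprits.append(node)
    return culprits


# Hypothetical latency samples (ms) for four storage nodes.
histories = {
    "s1": [10, 11, 9, 10, 10],
    "s2": [10, 10, 11, 9, 10],
    "s3": [10, 9, 11, 10, 10],
    "s4": [9, 10, 10, 11, 10],
}

# Case 1: only s3 slows down, so it is flagged.
one_slow_node = fingerpoint(histories, {"s1": 10, "s2": 11, "s3": 50, "s4": 10})

# Case 2: every node slows down together (e.g., a workload shift),
# so no single node is blamed.
all_slow = fingerpoint(histories, {"s1": 50, "s2": 50, "s3": 50, "s4": 50})
```

The second case illustrates why the global step is useful in this sketch: a system-wide change makes every node look anomalous relative to its own history, but peer comparison shows that no single node stands out.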

By combining statistical tools with appropriate instrumentation, we hope to significantly reduce the difficulty of analyzing performance and reliability problems in deployed large-scale systems. Such tools, integrated with automated reaction logic, also provide an essential building block for the longer-term goal of self-healing. Obtaining meaningful results will involve understanding which statistical tools work, and how well, for problem detection and prediction. It will also involve quantifying how the level of instrumentation detail affects the effectiveness of those tools, so that the associated instrumentation costs can be justified. Explorations will be done primarily in the framework of the Ursa Minor/Major cluster-based storage systems, via fault injection and analysis of case studies observed in deployment.

This material is based on research sponsored in part by the National Science Foundation, via grants CCF-0621508 and CNS-0326453, by the Army Research Office, under agreement number DAAD19-02-1-0389, and by the Department of Energy under Award Number DE-FC02-06ER25767.