CS 395/495 Autonomic Computing Systems EECS, Northwestern University

Motivation

Internet services are ubiquitous, e.g., Google, Yahoo!, eBay, etc.
  – Expect 24 x 7 availability, but service outages still happen!
A significant number of outages in Internet services are the result of operator actions
  1: Architecture is complex
  2: Systems are constantly evolving
  3: Lack of tools for operators to reason about the impact of their actions
     Offline testing, emulation, simulation
Very little detail on operator mistakes
  – Details strongly guarded by companies and administrators

This work

Understanding: Gather detailed data on operators’ mistakes
  – What categories of mistakes?
  – What’s the impact on the service?
  – How do mistakes correlate with experience, impact?
  – Caveat: this is not a complete study of operator behavior
Approaches to deal with operator mistakes: prevention, recovery, automation
Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service
  – Like offline testing, but:
     Virtual environment (extension of online environment)
     Real workload
     Migration back and forth with minimal operator involvement