ABSTRACT: Failures and anomalies are inevitabilities rather than
exceptions in a large-scale cloud computing infrastructure. Its
multi-layered architecture and sheer scale induce a high fault
frequency, which overwhelms purely manual recovery approaches.
In our Science Cloud scenario, the criticality of the
scientific applications running on top of the cloud infrastructure
requires that, once a fault occurs, recovery solutions be
planned quickly to reduce the outage period and any other negative impact.
To address this challenge, we propose an automated fault recovery
architecture with an AI-based planning algorithm at its core.
As its main contributions, this poster presents an algorithm for
automated recovery plan composition, data models for the recovery
planning knowledge, and an architecture to facilitate planning
operations. Evaluation results of the initial prototype
implementation are also presented.