Abstract

Highly replicated cloud applications are deployed only when they are deemed to be functional.
That is, they generally perform their task and their failure rate is relatively low.
However, even though failure is rare, it does occur and is very difficult to diagnose. We
devise a tool for failure diagnosis which learns the normal behaviour of an application in
terms of the statistical properties of variables used throughout its execution, and then
monitors it for deviation from these statistical properties. Our study reveals that many
variables have unique statistical characteristics that amount to an invariant of the program.
Therefore, any significant deviation from these characteristics reflects abnormal behaviour
of the application, which may be caused by a program error.
It is difficult to derive such an invariant from static analysis of the application’s code
alone. For example, the name of a person usually does not include a semicolon; however, an
intruder may attempt an SQL injection (which will include a semicolon) through the ‘name’
field while entering their information, and succeed if the application does not check for
this case. This scenario can only be captured at runtime and may not be tested by the
application developer. The character range of the ‘name’ variable is one of its statistical
properties; by learning this range from the execution of the application, it is possible to
detect the abnormal input described above. Hence, monitoring the statistics of the values
taken by the different variables of an application is an effective way to detect anomalies
that can help to diagnose the failure of the application.
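The ‘name’ example can be illustrated with a minimal sketch. It assumes the character range is represented as the set of characters seen in normal runs; this representation, the function names `learn_charset` and `unseen_characters`, and the sample values are all hypothetical and are not taken from the tool itself.

```python
from typing import Iterable, Set


def learn_charset(observed_values: Iterable[str]) -> Set[str]:
    """Collect every character seen in values observed during normal runs."""
    charset: Set[str] = set()
    for value in observed_values:
        charset.update(value)
    return charset


def unseen_characters(value: str, charset: Set[str]) -> Set[str]:
    """Return the characters in `value` that never appeared during learning."""
    return set(value) - charset


# Hypothetical values of the 'name' variable observed while the application
# was behaving normally.
normal_names = ["Alice Smith", "Jean-Luc Picard", "O'Brien", "Maria Garcia"]
name_charset = learn_charset(normal_names)

benign = "Brian Lee"
suspicious = "x'; DROP TABLE users; --"

# A non-empty result marks the value as a deviation from the learned property.
benign_deviation = unseen_characters(benign, name_charset)
suspicious_deviation = unseen_characters(suspicious, name_charset)
```

Under these assumptions the benign name yields no unseen characters, whereas the injected input is flagged because the semicolon never occurred in the learned values. A set learned from so few samples would also flag many legitimate names, so a realistic model would need far more observations.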
We build a tool that collects frequent snapshots of the application’s heap and builds a
statistical model solely from the extensional knowledge of the application. This extensional
knowledge is obtained only from the application’s runtime data, without any description or
explanation of the application’s execution flow. The model characterizes the application’s
normal behaviour. Collecting snapshots in the form of memory dumps and determining the
application’s behaviour model from them without code instrumentation makes our tool
applicable in cases where instrumentation is computationally expensive.
Our approach allows a behaviour model to be automatically and efficiently built using
the monitoring data alone. We evaluate the utility of our approach by applying it to
an e-commerce application and an online bidding system, and then derive different statistical
properties of variables from their runtime-exhibited values. Our experimental results
demonstrate 96% accuracy in the generated statistical model with a maximum performance
overhead of 1%. This accuracy is measured in terms of how few false-positive alerts are
generated when the application runs without any anomaly. The high accuracy and low
performance overhead indicate that our tool can successfully determine the application’s
normal behaviour without affecting the performance of the application and can be used to
monitor it in production. Moreover, our tool also correctly detected two anomalous
conditions while monitoring the application with a small number of injected faults. In
addition to anomaly detection, our tool logs all the variables of the application that violate
the learned model. The log file can help to diagnose any failure caused by the variables
and gives our tool source-code granularity in fault localization.
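The learn, monitor, and log cycle described above can be sketched as follows. This is only an illustration under strong assumptions: each snapshot is reduced to a hypothetical mapping from variable name to numeric value, and the learned property is a simple minimum/maximum range. The statistical properties the tool actually derives and its heap-dump format are not specified here, and all names and values below are invented for the example.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, float]              # hypothetical: variable name -> value
Model = Dict[str, Tuple[float, float]]   # variable name -> (min, max) learned


def learn_ranges(snapshots: List[Snapshot]) -> Model:
    """Learn the observed value range of every variable from normal snapshots."""
    model: Model = {}
    for snap in snapshots:
        for name, value in snap.items():
            lo, hi = model.get(name, (value, value))
            model[name] = (min(lo, value), max(hi, value))
    return model


def violations(snapshot: Snapshot, model: Model) -> List[str]:
    """Return the names of variables whose value falls outside the learned range."""
    bad = []
    for name, value in snapshot.items():
        if name in model:
            lo, hi = model[name]
            if not lo <= value <= hi:
                bad.append(name)
    return sorted(bad)


def false_positive_rate(clean_snapshots: List[Snapshot], model: Model) -> float:
    """Fraction of anomaly-free snapshots that still raise an alert."""
    if not clean_snapshots:
        return 0.0
    flagged = sum(1 for snap in clean_snapshots if violations(snap, model))
    return flagged / len(clean_snapshots)


# Hypothetical snapshots from an e-commerce application running normally.
training = [
    {"cart_total": 20.0, "item_count": 2},
    {"cart_total": 75.5, "item_count": 5},
    {"cart_total": 10.0, "item_count": 1},
]
model = learn_ranges(training)

# A snapshot taken after a fault was injected: the violating variable is logged.
faulty = {"cart_total": -300.0, "item_count": 3}
logged = violations(faulty, model)

# Anomaly-free snapshots not used for learning; any alert here is a false positive.
clean_holdout = [
    {"cart_total": 42.0, "item_count": 3},
    {"cart_total": 80.0, "item_count": 4},
]
fp_rate = false_positive_rate(clean_holdout, model)
```

In this toy run the faulty snapshot logs `cart_total`, and one of the two clean snapshots is flagged because its total exceeds the largest value seen during learning. This shows why the amount of training data matters for keeping false-positive alerts low.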