Google's MapReduce framework enables distributed, data-intensive, parallel
applications by decomposing a massive job into smaller (Map and Reduce) tasks
and a massive data-set into smaller partitions, such that each task
processes a different partition in parallel. However, performance problems in
a distributed MapReduce system can be hard to diagnose and to localize to
a specific node or a set of nodes. On the other hand, the structure of a large
number of nodes performing similar tasks naturally affords us opportunities
for observing the system from multiple viewpoints.

We present a "Blind Men and the Elephant" (BliMeE) framework in which
we exploit this structure, and demonstrate how problems in a MapReduce system
can be diagnosed by corroborating the multiple viewpoints. More
specifically, we present algorithms within the BliMeE framework based on
OS-level performance counters, on white-box metrics extracted from logs, and on
application-level heartbeats. We show that our BliMeE algorithms are able to
capture a variety of faults including resource hogs and application hangs, and
to localize the fault to subsets of slave nodes in the MapReduce system.
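
The idea of corroborating viewpoints can be illustrated with a minimal sketch.
It is not the BliMeE algorithms themselves, whose details are not given here.
It assumes that each viewpoint yields one numeric metric per slave node, that a
faulty node shows up as a peer outlier (measured against the median absolute
deviation), and that a node is reported only when at least two viewpoints agree.
The function names, thresholds, and sample values are all hypothetical.

```python
from statistics import median


def outlier_nodes(metric_by_node, threshold=3.0):
    """Return nodes whose metric lies far from the peer median.

    Assumption: distance is measured in units of the median absolute
    deviation (MAD); this is an illustrative choice, not the paper's.
    """
    values = list(metric_by_node.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = mad if mad > 0 else 1e-9  # avoid division by zero when peers agree exactly
    return {
        node
        for node, value in metric_by_node.items()
        if abs(value - med) / scale > threshold
    }


def corroborate(viewpoints, min_votes=2):
    """Return nodes flagged as outliers by at least min_votes viewpoints."""
    votes = {}
    for metric_by_node in viewpoints.values():
        for node in outlier_nodes(metric_by_node):
            votes[node] = votes.get(node, 0) + 1
    return {node for node, count in votes.items() if count >= min_votes}


# Hypothetical per-node readings from three viewpoints.
viewpoints = {
    "os_cpu_percent": {"n1": 20, "n2": 22, "n3": 21, "n4": 95, "n5": 19},
    "log_task_seconds": {"n1": 30, "n2": 31, "n3": 29, "n4": 80, "n5": 30},
    "heartbeat_gap_seconds": {"n1": 3, "n2": 3, "n3": 3, "n4": 3, "n5": 9},
}
suspects = corroborate(viewpoints)
```

In this made-up data, node n4 is an outlier in two viewpoints and is reported,
while n5 is an outlier in only one and is not.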

In addition, we discuss how the diagnostic algorithms' outcomes can be
further synthesized in a repeated application of the BliMeE approach. We
present a simple supervised learning technique which allows us to identify a
fault if it has been previously observed.
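
As a rough illustration of identifying a previously observed fault, the sketch
below uses a nearest-neighbor lookup. The specific supervised learning technique
is not named here, so this choice is an assumption, as are the binary encoding
of the diagnostic algorithms' outcomes, the fault labels, and the distance
cutoff.

```python
def hamming(a, b):
    """Count positions at which two equal-length outcome vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def identify_fault(signature, known_faults, max_distance=1):
    """Return the label of the closest previously seen fault, or None.

    signature: tuple of 0/1 outcomes, one per diagnostic algorithm
               (hypothetical encoding).
    known_faults: list of (signature, label) pairs from earlier, labeled runs.
    max_distance: hypothetical cutoff beyond which the fault counts as unseen.
    """
    if not known_faults:
        return None
    best_signature, best_label = min(
        known_faults, key=lambda pair: hamming(signature, pair[0])
    )
    if hamming(signature, best_signature) > max_distance:
        return None
    return best_label


# Hypothetical training data: which of three algorithms raised an alarm.
known_faults = [
    ((1, 0, 0), "cpu_hog"),
    ((0, 1, 1), "application_hang"),
    ((1, 1, 0), "disk_hog"),
]
label = identify_fault((0, 0, 1), known_faults)
```

Returning None for distant signatures reflects the limitation stated above: only
faults that resemble earlier observations can be identified.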