Adaptive Dependability for Fault-Tolerant Middleware

I architected and implemented the
MEAD
(Middleware for Embedded Adaptive Dependability) system,
which served as a fault-tolerance platform in
the DARPA ARMS-II and PCES-II programs.
MEAD uses system-call interception and does not modify the applications or
the operating systems. In this manner, MEAD enhances legacy CORBA applications
with replication and recovery mechanisms, and it allows users to select the
appropriate trade-offs among resource usage, fault-tolerance and performance
[CC:PE 2005].

Instead, MEAD provides versatile dependability
[ADS III].
Low-level knobs control the internal fault-tolerant mechanisms of the
infrastructure (e.g., the degree and style of replication).
In contrast, high-level knobs regulate external properties (e.g.,
scalability, availability) that are relevant to the system's users, and
they hide the internal implementation details.
For example, a low-level knob allows switching between active and passive
replication on the fly, and a high-level knob tunes the system's scalability.

MEAD also helped us discover the magical 1% phenomenon
[Middleware 2005].
While MEAD's maximum response time is not predictable, the latency
outliers are limited to only 1% of the remote invocations. We showed that
this is a general phenomenon through a broad empirical study of
unpredictability in 15 additional systems, ranging from simple transport
protocols to fault-tolerant, middleware-based enterprise applications
[COMNET 2012].
The maximum latency is not influenced by the system's workload, cannot be
regulated through configuration parameters and is not correlated with the
system's resource consumption.
The high-latency outliers (up to three orders of magnitude
higher than the average latency) have multiple causes and may originate
in any component of the system. However, after selectively filtering
1% of the invocations with the highest recorded response-times, the
latency becomes bounded with high statistical confidence.
We verified this result on different operating systems (Linux 2.4,
Linux 2.6, Linux-rt, TimeSys), middleware platforms (CORBA and EJB),
programming languages (C, C++ and Java), replication styles (active
and warm passive) and applications (e-commerce and online gaming)
[Middleware 2007;
COMNET 2012].
Moreover, this phenomenon occurs at all the layers of middleware-based
systems, from the communication protocols to the business logic.
This suggests that, while the emergent behavior of middleware is not
strictly predictable, distributed systems could cope with the inherent
unpredictability by focusing on statistical measures, such as the
99th latency percentile.

My contributions to the MEAD project include:

I implemented the first version of the system, MEAD v1.0

I implemented MEAD's adaptation mechanisms, e.g.,
switching between active and passive replication on-the-fly and
tuning the system scalability