Commentary

For the majority of industry programmers, distributed programming is the new normal, and dealing with failure in application code has become an inevitable part of building cloud services and mobile applications.

However, even in the simplest case of a distributed counter application with a single client and server, dealing with failures is challenging. Developers must consider every possible failure condition: the client crashing before issuing a request; the client crashing after issuing a request but before receiving a response; the server crashing after receiving a request but before sending a response; and so on. The result is application logic littered with complex failure-handling code that is usually ad hoc and error prone.

At least in the world of data processing, systems such as Spark have spent significant effort hiding the complexities of failure from the application developer, with much success. Despite this, there is still no solution for general distributed programming that is both performant and isolates the developer from failure.

Cloud Services Today

“While it is possible to build fully resilient distributed applications with today’s cloud development tools, application writers don’t usually fully implement this guarantee, due to the impression that such an implementation would be difficult to code and perform badly.”

Many cloud service designs today rely on durable queues, such as Azure Event Hubs or Kafka. These queues are typically processed using serverless, stateless services such as Azure Functions or AWS Lambda (or services running in Docker containers managed by Kubernetes).

Requests are typically processed as follows:

Events are durably logged as they are received from clients.

Because functions are stateless, they must retrieve application state from storage, process the event, and write the updated state back to storage.

Sequence numbers are used to deduplicate messages and to recover from partial failures that leave the system in an inconsistent state.

This technique is expensive: it requires many round-trips to storage, plus code for deduplicating messages, processing them idempotently, and recovering from partial failure. The combination of at-least-once delivery with idempotent processing is what gives the application its “exactly-once” semantics, even under failure.
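The sequence-number pattern above can be sketched in a few lines. This is an illustrative toy, not any real serverless API: an in-memory dict stands in for durable storage, and the event shape and function names are invented for the example.

```python
# In-memory stand-in for durable storage: key -> per-key state.
STORE = {}

def handle_event(key, seq, amount):
    """Process a counter increment idempotently using sequence numbers."""
    # Round-trip 1: a stateless function must read its state from storage.
    state = STORE.get(key, {"last_seq": 0, "last_response": None, "count": 0})
    if seq <= state["last_seq"]:
        # Duplicate from at-least-once delivery: skip the effect,
        # replay the previously saved response instead.
        return state["last_response"]
    state["count"] += amount            # the actual business logic
    state["last_seq"] = seq
    state["last_response"] = state["count"]
    STORE[key] = state                  # round-trip 2: write state back
    return state["last_response"]
```

Redelivering the same sequence number returns the saved response without reapplying the increment, which is exactly the at-least-once-plus-idempotence combination described above.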

So, how can we achieve exactly-once processing without sacrificing performance when building our distributed applications?

Virtual Resiliency

The technique behind Ambrosia is called “virtual resiliency”.

“Virtual resiliency - a mechanism in a (possibly distributed) programming and execution environment, typically employing a log, which exploits the replayably deterministic nature and serializability of an application to automatically mask failure.”

The intuition is the following: all messages are durably logged as they are sent and received by participants in the system; these messages are the request and response pairs of RPCs made between processes in the distributed system. If the callee of an RPC fails, the request is replayed until the callee has processed it and acknowledged it to the caller.

The log must be durable, and replicated if the system is to remain highly available. This can be achieved with something like Azure Files, or any other high-performance durable distributed log; for example, Azure Premium Managed Disks (used in the experimental evaluation) offer extremely fast replicated, durable writes at up to 60MB/second.

To reduce the size of the log, a checkpoint – the log position and the application state together – is taken periodically, which reduces the amount of log that must be replayed after a failure and reclaims space. The log is also transmitted to active standby instances, which process it as well and are ready to take over on failure. Each standby races to lock the log to determine which becomes the leader – the active node.

In Ambrosia, these processes that recover transparently and logically never die are referred to as “immortals”.

The trick to enabling all of this is ensuring the deterministic replayability of the application code: when standby nodes take over after a failure and replay the log, the log may be replayed an arbitrary number of times before the system recovers.

Therefore, application code must be deterministic: under replay, the same requests must be made and the same responses generated.

Impulses

One of the challenges here is handling events that are inherently nondeterministic.

For instance, what happens if application code needs to talk to external storage, read the clock, or interact with the user?

Ambrosia provides a novel technique here called impulses: nondeterministic actions are “determinized” by first inserting their effect into the log (effectively, a self-RPC) before acting on them. In the case of retrieving the current time, Ambrosia logs a self-RPC that accesses the clock and stores the result. Under replay, if a value has already been logged, it is reused; if not (because of a crash before completion), at-least-once delivery replays the command and logs the resulting value, which is then used in future replays and replicated to active standbys.

Application Programming Interface

Ambrosia is designed to be language-agnostic.

Immortals in Ambrosia are composed of two components:

The Immortal Coordinator: infrastructure built on top of Microsoft’s open-source CRA framework, which virtualizes communication between nodes, ensuring that when a failed immortal recovers, all required connections are re-established at its new location.

A language-specific Ambrosia binding plus the application code, responsible for delivering messages to the application, checkpointing, recovery replay, and log writing. Application code that uses the language-specific binding need only adhere to the following contract: executing the same messages in the same order, from the same initial state, must yield the same final state and the same outgoing messages in the same order.
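A toy “immortal” honoring that contract might look like the sketch below. The shape is hypothetical – the real binding drives these methods and defines its own interfaces – but it shows what the contract demands of the application: purely deterministic state transitions plus checkpointable state.

```python
class CounterImmortal:
    """Deterministic application state machine: same messages in the same
    order from the same initial state => same final state, same outputs."""

    def __init__(self):
        self.total = 0

    def on_message(self, amount):
        # Must be deterministic: no clock reads, randomness, or direct I/O.
        self.total += amount
        return ("ack", self.total)      # outgoing message

    def checkpoint(self):
        return {"total": self.total}    # serializable snapshot of state

    def restore(self, snapshot):
        self.total = snapshot["total"]  # resume from a checkpoint on recovery
```

Any nondeterministic input the handler needs would arrive as a logged message (or impulse), never be computed inside `on_message` itself.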

High Performance

To achieve high performance, adaptive batching – which keeps latency low under both light and heavy workloads – is used for message transmission between nodes. Batches are accumulated in buffers that are concurrently written to and flushed using a shared buffer pool; both techniques appear in high-performance systems such as Trill and Quill. Batch commit is used for writes to both the log and the application, keeping latency low.
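The adaptive part of the batching can be illustrated with a sketch: no message is ever artificially delayed, but while one batch is in flight, newly submitted messages accumulate, so batch sizes grow naturally with load. This is a hypothetical single-threaded illustration, not Ambrosia's implementation:

```python
class AdaptiveBatcher:
    """Batches grow under load without ever delaying an idle message."""

    def __init__(self, send):
        self.send = send          # callable that transmits one batch
        self.buffer = []
        self.in_flight = False    # a batch is currently being transmitted

    def submit(self, msg):
        self.buffer.append(msg)
        if not self.in_flight:
            self._flush()         # light load: send immediately (batch of 1)

    def _flush(self):
        batch, self.buffer = self.buffer, []
        self.in_flight = True
        self.send(batch)

    def on_send_complete(self):
        """Called when the previous batch has been transmitted/acked."""
        self.in_flight = False
        if self.buffer:           # everything that arrived during the send
            self._flush()         # goes out together as one larger batch
```

Under light load each message ships alone (minimal latency); under heavy load the in-flight window naturally coalesces many messages into a single send (high throughput).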

We refer the interested reader to the paper for the full experimental evaluation, but we can say the results are impressive: a 12.7x improvement compared to gRPC, despite gRPC having no failure-protection mechanism, and a 100x improvement in cost per unit served compared to stateless compute using serverless functions and existing exactly-once strategies.