Eliminating Storage Failures in the Cloud

Since the advent of disk mirroring over 35 years ago, data redundancy has been the basic strategy against data loss. That redundancy was extended in the replicated state machine (RSM) clusters popularized by cloud vendors in the early aughts and now widely used in scale-out systems of all types.

The idea behind RSM is that the same deterministic state machine, running on many servers with the same initial state and the same sequence of inputs, will produce the same outputs on every server. That output remains correct and available as long as a majority of the servers are functional. A consensus algorithm, such as Paxos, keeps the state machine logs in sync.
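To make the idea concrete, here is a minimal sketch in Python. It is not taken from any real system: the names KVStateMachine and run_replicas are hypothetical, the "servers" are just in-process objects, and the log is assumed to have already been agreed on, which is the part a consensus algorithm such as Paxos would handle in practice.

```python
class KVStateMachine:
    """A deterministic state machine: a tiny key-value store (illustrative only)."""

    def __init__(self):
        self.state = {}  # every replica starts from the same initial state

    def apply(self, command):
        """Apply one logged command and return its output."""
        op, key, *rest = command
        if op == "set":
            self.state[key] = rest[0]
            return rest[0]
        if op == "get":
            return self.state.get(key)
        raise ValueError(f"unknown op: {op}")


def run_replicas(log, num_replicas=3, failed=()):
    """Feed the same log to every functional replica and return the outputs.

    `failed` holds the indices of replicas assumed to be down. The service
    stays available only while a majority of replicas is functional.
    """
    quorum = num_replicas // 2 + 1
    live = [KVStateMachine() for i in range(num_replicas) if i not in failed]
    if len(live) < quorum:
        raise RuntimeError("no majority of functional replicas")

    results = []
    for command in log:
        outputs = [replica.apply(command) for replica in live]
        # Same initial state + same inputs => identical outputs on every replica.
        if any(out != outputs[0] for out in outputs):
            raise RuntimeError("replicas diverged")
        results.append(outputs[0])
    return results


log = [("set", "x", 1), ("set", "y", 2), ("get", "x"), ("get", "z")]
print(run_replicas(log))               # [1, 2, 1, None]
print(run_replicas(log, failed={2}))   # [1, 2, 1, None], one of three replicas down
```

With two of the three replicas marked as failed, run_replicas raises an error instead of answering, which mirrors the majority requirement described above. Note that the sketch assumes a failed replica simply stops; it does not model a replica whose stored log has been silently corrupted.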

At the USENIX FAST ’18 conference, Ramnatthan Alagappan et al. presented the paper Protocol-Aware Recovery for Consensus-Based Storage, which introduced a new approach to correctly recovering from storage faults in RSM systems. They call it corruption-tolerant replication, or CTRL.