Explains how CRUSH, a pseudo-random but deterministic function, maps an input value to a list of weighted devices on which to store object replicas. CRUSH extends RUSH by introducing the straw bucket strategy (well described in this Chinese blog post: CRUSH straw).
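To make the straw idea concrete, here is a minimal sketch, not the CRUSH implementation. It assumes the simplified rule used by the later straw2 variant: every device draws a hash-derived "straw" scaled by its weight, and the longest straw wins. The names `straw_select` and `place`, the use of SHA-256, and the device names are all assumptions made for illustration.

```python
import hashlib
import math


def _unit_hash(*parts):
    """Hash the parts deterministically to a float in (0, 1]."""
    digest = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / 2**64


def straw_select(object_id, devices, replica=0):
    """Pick one device from a dict of name -> positive weight.

    Each device draws a straw ln(u) / weight, where u depends only on the
    object, the device, and the replica index. The longest straw wins, so
    heavier devices win proportionally more often.
    """
    best, best_len = None, -math.inf
    for name, weight in sorted(devices.items()):
        u = _unit_hash(object_id, name, replica)
        length = math.log(u) / weight
        if length > best_len:
            best, best_len = name, length
    return best


def place(object_id, devices, replicas):
    """Choose `replicas` distinct devices for one object."""
    if replicas > len(devices):
        raise ValueError("not enough devices for the requested replicas")
    remaining = dict(devices)
    chosen = []
    for r in range(replicas):
        pick = straw_select(object_id, remaining, r)
        chosen.append(pick)
        del remaining[pick]
    return chosen


if __name__ == "__main__":
    cluster = {"osd0": 1.0, "osd1": 2.0, "osd2": 1.0}
    print(place("object-42", cluster, 2))
```

Because each straw depends only on the object and the device itself, removing a device that did not win leaves the placement unchanged. That is the property that keeps data migration small when the cluster changes.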

Describes the dynamic metadata management system, which introduces adaptive subtree partitioning both to partition the hierarchical namespace tree according to the cluster workload and to balance load when a large number of clients access the same file in a “flash crowd”.

The original pseudo-random data distribution algorithm (RUSHp is one member of the RUSH family), on which CRUSH is built. It supports weighted devices and allows devices to be added or removed dynamically while keeping data migration close to optimal and the data distribution balanced. RUSHp relies on an advanced result from analytic number theory, the Prime Number Theorem for Arithmetic Progressions.

Consensus Theory

Recently, I studied a series of influential papers on consensus. Consensus under non-Byzantine failures is the theory behind distributed systems such as Chubby (Paxos), etcd (Raft) and ZooKeeper (Zab). These algorithms, together with their correctness arguments, are hard to digest but worth the effort.

This paper, written by Leslie Lamport, the 2013 Turing Award winner, won the ACM SIGOPS Hall of Fame Award (2007) and the Dijkstra Prize (2000). It introduced partial ordering and total ordering of events in a distributed environment, which greatly influenced the happens-before concept in multi-threaded programs. The logical clocks presented in the paper were later generalized into vector clocks, which in turn underpin data race detection in tools such as the Go race detector; see Vector clock and How Developers Use Data Race Detection Tools.
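The clock rules from the paper are small enough to sketch directly. The class below is an illustrative rendering, and the class and method names are my own: a process increments its counter on every local event, and on receiving a message it jumps past the timestamp carried by that message.

```python
class LamportClock:
    """Scalar logical clock for a single process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time):
        """Receiving: move past both the local time and the message timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


if __name__ == "__main__":
    a, b = LamportClock(), LamportClock()
    a.tick()                  # local event on process a
    stamp = a.send()          # a sends a message stamped with its clock
    print(b.receive(stamp))   # b's clock is now later than the send
```

If event x happens before event y, then x's timestamp is smaller than y's. Breaking ties by process id, for example by comparing `(time, process_id)` pairs, turns this partial order into a total order.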

This amazing paper, which won the Dijkstra Prize (2001), proves that no completely correct asynchronous consensus algorithm exists, even when only one faulty process has to be tolerated. Lemma 3 in the paper is mind-boggling and is easier to follow with the help of A Brief Tour of FLP Impossibility.

This is the paper Lamport wrote after his famously rejected paper The Part-Time Parliament. The anecdote can be found on Leslie Lamport Writings.
The paper clearly discusses single-decree Paxos, that is, deciding a single value among a set of processes, but it does not explain Multi-Paxos, that is, deciding a series of values, nearly as well.
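The two phases of single-decree Paxos can be sketched in memory as follows. This is a simplified illustration under strong assumptions: no network, no failures, and a proposer that contacts every acceptor synchronously. The names `Acceptor` and `propose` are my own.

```python
class Acceptor:
    """Remembers its highest promise and the last proposal it accepted."""

    def __init__(self):
        self.promised = -1          # highest proposal number promised so far
        self.accepted_n = -1        # number of the last accepted proposal
        self.accepted_value = None  # value of the last accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, adopt the one with the
    # highest proposal number instead of our own value.
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value
    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted >= majority else None


if __name__ == "__main__":
    group = [Acceptor() for _ in range(3)]
    print(propose(group, 1, "A"))  # first proposer gets its value chosen
    print(propose(group, 2, "B"))  # later proposer must adopt the chosen value
```

The second call shows the safety rule at work: once a value has been chosen, a proposer with a higher number learns it in phase 1 and re-proposes it rather than its own value.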

The Google Chubby team implemented Chubby using Multi-Paxos, which makes a replicated state machine possible. The paper discusses many engineering issues that arise when putting the algorithm into practical use. I found another article, by the Tencent Weixin team (in Chinese), that explains Multi-Paxos in a more approachable way: Weixin PhxPaxos. There is also a slide deck covering the history of consensus, the Paxos family and replicated state machines: Distributed Consensus: Making Impossible Possible.

Raft, a consensus algorithm for replicated state machines, was designed to be understood with less effort than the Paxos family. Nowadays, quite a lot of consensus systems, most notably etcd, are implemented using Raft.