groups of RMs can be configured to give different performance or reliability characteristics

–once R and W have been chosen for a set of RMs:

–the reliability and performance of write operations may be increased by decreasing W

–and similarly for reads by decreasing R

the performance of read operations is degraded by the need to collect a read consensus

examples from Gifford

–three examples show the range of properties that can be achieved by allocating weights to the various RMs in a group and assigning R and W appropriately

–weak representatives (on local disk) have zero votes; they get a read quorum from RMs with votes and then read from the local copy


Gifford’s quorum consensus examples (1979)

                                                Example 1   Example 2   Example 3
Latency (milliseconds)  Replica 1               75          75          75
                        Replica 2               65          100         750
                        Replica 3               65          750         750
Voting configuration    Replica 1               1           2           1
                        Replica 2               0           1           1
                        Replica 3               0           1           1
Quorum sizes            R                       1           2           1
                        W                       1           3           3

Derived performance of file suite:
Read                    Latency                 65          75          75
                        Blocking probability    0.01        0.0002      0.000001
Write                   Latency                 75          100         750
                        Blocking probability    0.01        0.0101      0.03

Example 1 is configured for a file with a high read to write ratio, with several weak representatives and a single RM.

Replication is used for performance, not reliability.

The RM can be accessed in 75 ms and the two clients can access their weak representatives in 65 ms, resulting in lower latency and less network traffic

Example 2 is configured for a file with a moderate read to write ratio which is accessed mainly from one local network. The local RM has 2 votes and the remote RMs 1 vote each.

Reads can be done at the local RM, but writes must access one local RM and one remote RM. If the local RM fails only reads are allowed

Example 3 is configured for a file with a very high read to write ratio.

Reads can be done at any RM and the probability of the file being unavailable is small. But writes must access all RMs.

Derived performance

–latency

–blocking probability: the probability that a quorum cannot be obtained, assuming a probability of 0.01 that any single RM is unavailable
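
The blocking probabilities in Gifford's examples can be reproduced by enumerating which replicas are up. The sketch below assumes that RMs fail independently with probability 0.01, as stated above. The names `blocking_probability` and `EXAMPLES` are illustrative and do not come from Gifford's paper.

```python
from itertools import product


def blocking_probability(votes, quorum, p_down=0.01):
    """Probability that the replicas that are up hold fewer than `quorum` votes."""
    blocked = 0.0
    # Enumerate every up/down combination of the replicas.
    for up in product([True, False], repeat=len(votes)):
        prob = 1.0
        for is_up in up:
            prob *= (1.0 - p_down) if is_up else p_down
        available = sum(v for v, is_up in zip(votes, up) if is_up)
        if available < quorum:
            blocked += prob
    return blocked


# Vote assignments and quorum sizes taken from the three examples.
EXAMPLES = {
    "Example 1": {"votes": (1, 0, 0), "R": 1, "W": 1},
    "Example 2": {"votes": (2, 1, 1), "R": 2, "W": 3},
    "Example 3": {"votes": (1, 1, 1), "R": 1, "W": 3},
}

for name, cfg in EXAMPLES.items():
    total = sum(cfg["votes"])
    # Quorum constraints: read and write quorums overlap, and so do two write quorums.
    assert cfg["R"] + cfg["W"] > total and 2 * cfg["W"] > total
    read_block = blocking_probability(cfg["votes"], cfg["R"])
    write_block = blocking_probability(cfg["votes"], cfg["W"])
    print(name, read_block, write_block)
```

The computed values agree with the blocking probabilities in the table after rounding.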


Distributed Replicated FIFO Queue

1. State Machine Approach (one copy of the queue on each replica)

2. Quorum Consensus

   1. Can we use the approach above?


Distributed Replicated FIFO Queue

1. State Machine Approach (one copy of the queue on each replica)

2. Quorum Consensus:

   1. Representing the queue as a state machine probably does not help here. Instead, represent the queue as a log of versioned entries:

      1. enq(x)

      2. enq(y)

      3. deq(x)


FIFO Queue

Can we use the log representation of the FIFO queue to build a distributed, highly available queue based on quorum consensus?

Enter enq or deq:

Read queue version

Compute new version

Write new version

Make sure that all quorums intersect!


FIFO Queue

Definition: To merge logs:

–Sort entries in version order

–Discard duplicates

Here is a new replication protocol:

–Merge logs from the initial read operation

–Reconstruct object’s state from log

–Apply operation and compute new entry

–Append new entry to log and write log to the final quorum

–Each replica merges logs
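
The protocol can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation. It assumes log entries are `(version, op, item)` tuples with unique integer versions, that replicas are in-memory objects, and that the caller supplies intersecting quorums. The names `merge`, `replay`, `Replica`, and `execute` are hypothetical.

```python
def merge(*logs):
    """Merge logs: sort entries in version order and discard duplicates."""
    return sorted({entry for log in logs for entry in log})


def replay(log):
    """Reconstruct the queue state by replaying a merged log."""
    items = []
    for _version, op, item in log:
        if op == "enq":
            items.append(item)
        else:  # "deq" removes the recorded item
            items.remove(item)
    return items


class Replica:
    """In-memory stand-in for a replica manager holding one copy of the log."""

    def __init__(self):
        self.log = []

    def read(self):
        return list(self.log)

    def write(self, log):
        # Each replica merges the incoming log with its own.
        self.log = merge(self.log, log)


def execute(op, item, initial_quorum, final_quorum):
    """Run one enq or deq against the given quorums and return the item."""
    log = merge(*(r.read() for r in initial_quorum))   # merge logs that were read
    state = replay(log)                                # reconstruct the state
    version = log[-1][0] + 1 if log else 1             # compute the new version
    if op == "deq":
        if not state:
            raise IndexError("deq on empty queue")
        item = state[0]                                # FIFO: oldest item first
    new_log = merge(log, [(version, op, item)])        # append the new entry
    for replica in final_quorum:                       # write to the final quorum
        replica.write(new_log)
    return item


a, b, c = Replica(), Replica(), Replica()
execute("enq", "x", [a, b], [a, b])
execute("enq", "y", [b, c], [b, c])
first = execute("deq", None, [a, c], [a, c])
```

Any two of the three replicas form a quorum here, so every read sees every earlier write even though no single replica takes part in every operation.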


Log Compaction

Here is a more compact queue representation:

–No deq records

–The event horizon: enq version of most recently dequeued item

–The sequence of undequeued enq entries

To merge:

–take latest event horizon

–Discard earlier enq entries

–Sort remaining enq entries, discard duplicates

Replicas can send event horizons in “gossip” messages. Page (21)
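
The compact representation and its merge rule can be sketched as follows. The sketch assumes a state is a pair `(horizon, entries)`, where `entries` is a sorted list of `(version, item)` pairs, versions are positive integers, and a horizon of 0 means nothing has been dequeued yet. The function names are illustrative.

```python
def merge_compact(*states):
    """Merge compact queue states, each of the form (horizon, entries)."""
    # Take the latest event horizon.
    horizon = max(h for h, _ in states)
    # Discard enq entries at or before the horizon, drop duplicates, then sort.
    entries = {e for _, es in states for e in es if e[0] > horizon}
    return horizon, sorted(entries)


def enq(state, version, item):
    """Add an enq entry; the horizon is unchanged."""
    horizon, entries = state
    return horizon, sorted(set(entries) | {(version, item)})


def deq(state):
    """Remove the oldest entry; its enq version becomes the new horizon."""
    _horizon, entries = state
    (version, item), rest = entries[0], entries[1:]
    return item, (version, rest)


# Two replicas with different knowledge: the second has already seen deq of "x".
a = (0, [(1, "x"), (2, "y")])
b = (1, [(2, "y"), (3, "z")])
merged = merge_compact(a, b)
```

The design relies on items being dequeued in version order, so a single horizon value is enough to tell which enq entries are already consumed.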


Log Compaction

Event horizons are type-specific, but many similar ideas can work

Garbage collection:

–Local: discard entries that can’t affect the future

–Non-local: use background “gossip” to discard entries.


Quorum Assignments

How are quorums chosen?

–deq needs to know about earlier enq operations

–deq needs to know about earlier deq operations

–enq does not need to know about other operations
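
One way to read these constraints: the initial (read) quorum of deq must intersect the final (write) quorums of both enq and deq, while enq needs no initial quorum at all. The sketch below assumes quorums are defined purely by size over `n` equally weighted replicas, in which case two quorums always intersect exactly when their sizes sum to more than `n`. The function names are hypothetical.

```python
from itertools import product


def quorums_intersect(size_a, size_b, n):
    """Any two subsets of these sizes drawn from n replicas must share a member."""
    return size_a + size_b > n


def valid_queue_assignment(n, enq_final, deq_initial, deq_final):
    """deq must see earlier enq and earlier deq operations; enq sees nothing."""
    return (quorums_intersect(deq_initial, enq_final, n)
            and quorums_intersect(deq_initial, deq_final, n))


def valid_assignments(n):
    """All (enq_final, deq_initial, deq_final) size triples that are valid."""
    sizes = range(1, n + 1)
    return [(ef, di, df) for ef, di, df in product(sizes, repeat=3)
            if valid_queue_assignment(n, ef, di, df)]


choices = valid_assignments(3)
```

With three replicas, `(1, 3, 1)` makes enq cheap at the cost of a deq that must read every replica, while `(2, 2, 2)` balances the two.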


Depends-On Relation

Let

–D be a relation on operations

–h any operation sequence

–and p any operation

A view of h to p is

•a subsequence g of h

•contains every q such that p D q

•If g contains q, then it contains any earlier r such that q D r

Definition: D is a depends-on relation if whenever g.p is legal, so is h.p
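
The view conditions can be checked mechanically. The sketch below assumes operations are `(name, item)` tuples, that a candidate view `g` is given as a set of indices into `h`, and that the relation for the queue is the one suggested by the quorum assignment discussion: deq depends on every enq and deq, enq depends on nothing. The names `is_view` and `queue_depends` are illustrative.

```python
def queue_depends(p, q):
    """p D q for the FIFO queue: deq depends on all operations, enq on none."""
    return p[0] == "deq"


def is_view(h, g_indices, p, depends):
    """Check whether the subsequence of h selected by g_indices is a view of h to p."""
    g = set(g_indices)
    # g must contain every q in h such that p D q.
    for i, q in enumerate(h):
        if depends(p, q) and i not in g:
            return False
    # If g contains q, it must contain every earlier r such that q D r.
    for i in g:
        for j in range(i):
            if depends(h[i], h[j]) and j not in g:
                return False
    return True


h = [("enq", "x"), ("enq", "y"), ("deq", "x")]
```

For example, the empty subsequence is a view of `h` to an enq, but a view of `h` to a deq must contain all of `h`.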


Depends-On relation

Quorum consensus replication is correct if and only if the quorum intersection relation is a depends-on relation


The passive (primary-backup) model for fault tolerance

There is at any time a single primary RM and one or more secondary (backup, slave) RMs

FEs communicate with the primary, which executes the operation and sends copies of the updated data to the backups

if the primary fails, one of the backups is promoted to act as the primary

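[Figure 14.4: the passive (primary-backup) model. Clients (C) send requests through front ends (FE) to the primary RM, which is connected to the backup RMs.]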


The FE has to find the primary, e.g. after it crashes and another takes over


Passive (primary-backup) replication. Five phases.

The five phases in performing a client request are as follows:

1. Request:

–a FE issues the request, containing a unique identifier, to the primary RM

2. Coordination:

–the primary performs each request atomically, in the order in which it receives it relative to other requests

–it checks the unique id; if it has already done the request it re-sends the response.

3. Execution:

–The primary executes the request and stores the response.

4. Agreement:

–If the request is an update the primary sends the updated state, the response and the unique identifier to all the backups. The backups send an acknowledgement.

5. Response:

–The primary responds to the FE, which hands the response back to the client.
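
The five phases can be sketched with in-memory objects. This is a minimal illustration under simplifying assumptions: the shared data is a key-value dictionary, requests are either `"write"` or `"read"`, and calls between primary and backups are ordinary method calls that never fail. The classes `Primary` and `Backup` and the method names are hypothetical.

```python
class Backup:
    """Backup RM: stores the state and responses pushed by the primary."""

    def __init__(self):
        self.state = {}
        self.responses = {}

    def update(self, request_id, state, response):
        self.state = dict(state)
        self.responses[request_id] = response
        return "ack"


class Primary:
    """Primary RM: handles requests one at a time, in arrival order."""

    def __init__(self, backups):
        self.state = {}
        self.responses = {}
        self.backups = backups

    def handle(self, request_id, op, key, value=None):
        # Phase 1 (request) is the FE calling this method with a unique id.
        # Phase 2 (coordination): a repeated id gets the stored response again.
        if request_id in self.responses:
            return self.responses[request_id]
        # Phase 3 (execution): execute the request and store the response.
        if op == "write":
            self.state[key] = value
            response = "ok"
        else:
            response = self.state.get(key)
        self.responses[request_id] = response
        # Phase 4 (agreement): updates are sent to all backups, which acknowledge.
        if op == "write":
            acks = [b.update(request_id, self.state, response) for b in self.backups]
            if any(ack != "ack" for ack in acks):
                raise RuntimeError("backup did not acknowledge")
        # Phase 5 (response): reply to the FE.
        return response


backups = [Backup(), Backup()]
primary = Primary(backups)
primary.handle("req-1", "write", "k", 1)
primary.handle("req-1", "write", "k", 2)  # retransmission: not executed again
```

Because the backups also record the response for each identifier, a promoted backup could answer a retransmitted request without executing it a second time.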


Passive (primary-backup) replication (discussion)

This system implements linearizability, since the primary sequences all the operations on the shared objects

If the primary fails, the system remains linearizable if a single backup takes over exactly where the primary left off, i.e.:

–the primary is replaced by a unique backup

–surviving RMs agree which operations had been performed at take over

view-synchronous group communication can achieve this

–when surviving backups receive a view without the primary, they use an agreed function to calculate which is the new primary.

–The new primary registers with the name service

–view synchrony also allows the processes to agree which operations were performed before the primary failed.

–E.g. when a FE does not get a response, it retransmits the request to the new primary

–The new primary continues from phase 2 (coordination) and uses the unique identifier to discover whether the request has already been performed.


View-synchronous Group Communication

Systems with dynamic groups extend this model by providing explicit join and leave operations to adapt the group membership over time. Moreover, such systems can exclude faulty servers automatically from the membership. Still, reaching agreement on the group membership in the presence of failures is not trivial.

Two approaches have been considered:

1. Run a consensus protocol among all previous group members to agree on the future group membership. This is the canonical approach; it tolerates further failures during the