While simple to implement, this approach is unreliable
because it assumes the writes are successful.
It could be that the other processes are down1,
but we would still dutifully inform the client that
their change had been successfully recorded.

To resolve this weakness, we'll make
reading and writing a one-phase commit (rather than the zero phase commit we have now, aka firing blindly)2.
After receiving a write request from a client, the receiver will wait for the writing
nodes to acknowledge back success before responding to the client.

However, once we decide to perform a more sophisticated commit protocol, a few questions come up:

Knowing that we need to wait before informing the client of a successful write, how many nodes should we wait for?

In the same vein, when we are reading values, how many nodes should we query before returning the value to the client?

Rather than providing fixed answers, Dynamo makes these two values configurable
(along with a third value, the number of nodes data is replicated to), and by doing
so makes each Dynamo cluster highly customizable to the needs of its consumer.

These configurable values are:

N is the number of nodes where each piece of data is replicated (for our current implementation N is
equal to the number of nodes in the system).

W is the number of nodes where each piece of data is written before notifying the user of success (our implementation is W := N, sort of 3).

R is the number of nodes which are consulted for their value of a key (currently in our implementation
R := 1).

Combining these values in different ways creates data-stores with extremely different properties.
A system with W := N, R:= 1 will write to every single node before notifying a user of success,
and is a low-availability write system as a single downed node will prevent further writes.
However, by guaranteeing updates are written to every node, it is possible to get the latest value
by reading from only a single node, making it the system highly available for reads.

On the opposite end of the spectrum, a W := 1, R := N system would be highly available for writes,
but with low availability for reads. If it's important to never reject a write, such inefficient reads
might be justifiable.

For most users, the values of W and R will be somewhere between 1 and N.
According to the Dynamo paper, Amazon finds N := 3, W := 2, R := 2 to be a reasonable
balance between performance and reliability for most applications.

Now let's take a look at extending our key-value store to respect configurable
values of W and R (we'll have to take on N later).

Records, Configuration & kvs.hrl

In the previous post I mentioned that -defines are usually kept in
separate files with the .hrl extension, and now it's time for us to start using
one. In this case we'll be calling it kvs.hrl.

First, we'll be adding constant values for W and R (as well as moving over the TIMEOUT
value),

-define(KVS_WRITES,3).-define(KVS_READS,3).-define(TIMEOUT,500).

but more importantly we'll be creating an Erlang record to manage the increasing
amounts of state maintained by kvs:store/1.

Erlang records are nothing more than tuples with some syntactic sugar to make it possible for a
human to access the contained values without remembering the elements' ordering.
For most purposes, the syntactic sugar really is the biggest gain from using records,
but there is a performance advantage versus using a property list4.

Our kvs_store record will start out looking like this:

-record(kvs_store,{data,pending_reads,pending_writes}).

We'll use data to store our key-value mappings,
pending_reads to store reads which are still waiting
for additional nodes' values before returning to user, and pending_writes
for writes which are waiting for confirmation from additional nodes before
notifying user of a successful write.

% a brief interlude for record syntaxStore=#kvs_store{data=[],pending_reads=[],pending_writes=[]},#kvs_store{data=D,pending_reads=R,pending_writes=W}=Store,Writes=Store#kvsstore.pending_writes,get_value(Key,S=#kvs_store{data=Data})->proplists:get_value(Key,Data).set_value(Key,Value,S=#kvs_store{data=Data})->S#{data=[{Key,Value}|proplists:delete(Key,Data)]}.

We'll need to add an -include at the top of kvs.erl to take advantage of
the record definition.

-include("kvs.hrl").

And we'll also need to update the kvs:start/1 function to use the record:

Durable Writes

The first step in our implementation of durable writes is to
update the interface of the store/1 function
(this change is necessary for the consistent read changes
we make below as well). The function's arity will remain
unchanged, but we'll be passing around our new kvs_store
record instead of the old property list.

% old definitionstore(Data)->term().% new definitionstore(Store=#kvs_store{data=Data,pending_reads=Reads,pending_writes=Writes})->term().

The skeptical reader may wonder why we went through the
bother of defining a record when we could have simply
changed the function definition to

store(Data,Reads,Writes)->term().

and been done with it.

A reason to prefer the record is that you only pull out the necessary
data, and don't have to juggle unused values. This becomes particularly
meaningful as a record grows from an arity of one to an arity of three,
from an arity of three to an arity of six, and so on.
In the end, however, it is a matter of style.

Now that we've updated the function definition,
it is time for us to start updating the receive
block. We'll examine the set, update and updated
messages independently, but keep in mind that
they are all patterns matching against store/1's mailbox.

set, update and updated

Receiving a set patterned messages is the beginning of
the update process, so we'll begin there.

Previously set would broadcast the updated key-value
pair to all nodes and then immediately notify the user
that the write was successful, but now we're going to
track the number of acks received from writing nodes
and wait until they pass the W threshold before
notifying the user of a successful write.

With these changes, all nodes maintain a property list in pending_writes
which contain all writes which have been initiated but still haven't surpassed
the W threshold for a durable write. The keys in pending_writes are a
2-tuple of the client process id (the pid which called the kvs:set/2 function)
and the key which is being updated; the values are counters which contain the number
of additional acknowledgements required before the write threshold is surpassed.

{Sender,update,Client,Key,Value}->% sent to all nodes by first receiving nodeSender!{self(),updated,Client,Key,Value},store(Store#kvs_store{data=[{Key,Value}|proplists:delete(Key,Data)]});

As nodes receive an update, they notify the node which received the original set message
that they have successfully updated the key's value.

As other nodes acknowledge a successful write
the corresponding write counter is decremented.
Once it has been full decremented the Client pid
is notified of a successful durable write, and
the write is deleted from the pending_writes property list.
Further acknowledgements for that update are ignored.

Consistent Reads

Implementing consistent reads follows a pattern similar
to durable writes, but with the added complexity of
reconciling inconsistent reads (i.e. what to do if
one node says the value if 5 and another says
the value is 16).

First we implement the read fanout, which sends reads
to the other nodes and then adds the new read to pending_reads.

{Sender,get,Key}->% client interface for retrieving valueslists:foreach(fun(Pid)->Pid!{self(),retrieve,Sender,Key}end,pg2:get_members(kvs)),% ?KVS_READS is required # of nodes to read from% [] is used to collect read valuesReads2=[{{Sender,Key},{?KVS_READS,[]}}|Reads],store(Store#kvs_store{pending_reads=Reads2});

Next, sending sending the local value of the key back to the
node which received the original get message.

Note that we're using an extremely weak algorithm for resolving inconsistent reads:
we pick the most common value (if there are multiple most common values, then
we indirectly pick one based on the order of responses from the reading nodes).
If we can bear with this unsightliness for the moment, the next entry will take
a look at using various kinds of clocks to great improvements in read consistency.

Failing Reads and Writes

In addition to a weak algorithm for resolving inconsistencies,
another flaw in our current implementation is the handling of
reads and writes which fail or otherwise cannot complete.

Consider a system whereW := 10, but there are
only five available nodes due to a node failure.
The write counter will never reach zero, and thus a
response will never be sent to the user.
That doesn't make for a very usable system.

A reasonable solution is to consider all pending operations older
than some timeout value to have failed, and then to notify the user
that the read/write was a failure.

Currently we're not storing when a set or get request
comes in, so we''ll need to extend the information recorded
in pending_reads and pending_writesto include a timestamp.

First we need to write a utility function to convert the output
of erlang:now/0 into seconds since epoch,

ts()->{Mega,Sec,_}=erlang:now(),(Mega*1000000)+Sec.

and next we need to update a number of the receive patterns,
starting with the set pattern, which needs to be updated to
create pending writes as a 2-tuple of the number required writes and time
the request was received (previously it was only the
number of required writes).

filter_writes/1 and filter_reads/1 detect pending operations which
are taking too long to complete, notify the client that their request failed,
and then removes the failed operation from the pending_writes and pend_reads
lists respectively.

It wasn't strictly necessary to implementing these functions outside of the receive block,
but it is becoming rather sizable, and it'll be easier to modify if we carve it up a bit.

(This is also the first time we've seen the Erlang if block, whose guards I tend to like less
than case's pattern matching. This is certainly a matter of style, but I prefer using case
in all situations, except those which devolve into the snippet case X of true -> A; false -> B end;,
which seems like trying a bit too hard to avoid if.)

Finally, the absolutely last modification, we'll also need to add the timeout value to kvs.hrl.

In the next article we'll take a more principled stance on
consistent reads by exploring the impact of using physical,
logical and vector clocks to version and reconcile values.

The post should be up in a couple of days.

It is possible for process groups to become contaminated with dead pids,
especially if you are being slack (like we are, for the time being) and using
spawned processes instead of gen_servers. Even with a gen_server, if you don't
trap_exit, then it is entirely possibly for a gen_server to terminate without
calling its termination cleanup function.

Finally, even if you properly set trap_exit on a gen_server, the termination
cleanup function can fail to fire if the process is killed brutally (kill -9, etc).

Rest assured, however, that it is possible to detect dead processes, using a combination
of erlang:node/1, rpc:call/3 and erlang:is_alive/1 (erlang:is_alive/1 must be executed on the node where
the process lives, which means you must first discover the node and then remotely check its liveliness).

In systems where you anticipate the possibility of processes being brutally killed,
it is necessary to design your system in such a way that it survives sending messages
to dead pids (with gen_server:call you'll receive some kind of error, although it
may not immediately lead you to understand the issue, but gen_server:cast and
using ! probably won't raise any issue, so you'll either need a timeout to detect
dropped messages, or to explicitly verify a process is alive before sending... with
the timeout being much less error prone).↩

This one-phase commit still suffers from a weakness, which is how to back out
a write where a portion of the nodes fail. We'll need something along the lines of a two-phase
commit to really address that adequately. This should come in a later entry. Which I guess I say
a lot these days.↩

This is only true to the extent that we can trust pg2:get_members/1 to contain all nodes,
including all crashed or downed nodes. In a true W := N condition, if any nodes are crashed then all
writes to the system would fail.↩

There is also a penalty in migrating running code from a previous
definition of a record to a new one, which requires hand-crafted transition code.
In general, it's easier to bring a system down and then bring it back up again
(starting with the new record definition) than attempting to migrate it, unless
you really can't bring your processes down.

In situations where you think you'll be frequently updating a record definition,
consider taking the performance hit and just using a property list instead.
No one ever woke up wanting to do a manual migration.↩

Hi folks. I'm Will, known as @lethain on Twitter.
I write about software and management topics,
and love email at lethain[at]gmail.
Get email from me by subscribing to
my weekly newsletter.