dynamo.js

This is a modified version of Amazon's Dynamo Paper, annotated with information about Riak and how its design compares to Dynamo.
Basho Technologies built Riak based on many (but not all) of the ideas and design decisions set forth in this paper. We
often get questions about how closely we adhered to the principles and design decisions
put forth in the paper. I thought it would be worthwhile to annotate it with
Riak specifics.

In the right column, you'll find the paper reprinted in its entirety, images and all.
In this, the left column, you have Riak specifics that relate to a given section of the paper;
anything from links to the Riak wiki, to code references, to explanations of why and how
we did what we did when we did it. There is also some work to do to make Riak more like Dynamo in some
ways. This is noted, too.

The goal for this resource is to simplify the Dynamo paper in the context of Riak and better introduce Riak's
design principles to developers and technologists. I hope you enjoy it and find it useful. If there's something
you believe needs changing, drop me a note or submit a pull request.

This paper was first released in ... and was popularized on the blog of Werner Vogels. Since then there
has been a large number of databases that were inspired (either entirely or partially) by this paper.
In addition to Riak, Cassandra and Voldemort come to mind. Some of you may also remember
Dynomite (which predates all of these). I'm sure there are more.

Basho Technologies started to develop Riak back in 2007 to solve an internal problem. We were,
at the time, building a web application that would require a database layer that afforded higher
availability and scale out properties than any technology we knew of. So, we (primarily Justin Sheehy,
Andy Gross, and Bryan Fink at the time) rolled our own.

After using Riak in production for several successful applications that generated revenue, we decided
to open source it and share our creation with the world.

Riak is a highly available, scalable, open source key/value database. These notes
describe where Riak's design decisions emulated and diverged from Dynamo's (as described in this paper).

Riak offers several query methods in addition to the standard key/value interface,
is made to be highly available, is efficient in its resource use, and has a simple scale out
story to accompany data and traffic growth.

Riak offers no traditional "ACID" semantics around transactions. Instead, it's built to
be "eventually consistent." We did this because we were of the opinion (and our users proved this)
that most applications don't require heavy transactions. (Even ATMs are eventually consistent.)

Much like Amazon built Dynamo to guarantee their applications were always available to retail shoppers,
the design decisions in Riak were taken to ensure that developers could sleep well knowing that their
database would always be available to serve requests.

Many of our clients and open source users have explicit uptime agreements related to their applications and
services built on Riak. This was not an accident.

Remember Eventual Consistency? We followed Dynamo's lead here and
made sure that Riak could withstand network, server and other failures
by sacrificing absolute consistency and building in mechanisms to rectify
object conflicts.

We refer to hosts as "nodes", too. Riak provides a simple set of commands to start
and join nodes to a running cluster. With proper capacity planning, this process should
be painless for the ops team and devs, and imperceptible to the client.

Again, we agree. Each storage node is the same as its neighbors. Any node can coordinate
a request and, in the event that a node goes down, its neighbors can cover for it until
it's restarted or decommissioned.

Whereas Dynamo only has the concept of keys, we added a higher level of organization called a "bucket."
Keys are stored in buckets and buckets are the level at which several Riak properties can be configured
(primarily the "N" value, or the replication value.) In addition to the bucket+key identifier and value, Riak
will also return the associated metadata for a given object with each get or put.

Riak concatenates the bucket with the key and runs it through the SHA-1 hash to generate a 160-bit identifier
which is then used to determine where in the database each datum is stored. Riak treats data as an opaque
binary, thus enabling users to store virtually anything.
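As a rough sketch (in Python rather than Riak's Erlang core), the bucket+key hashing described above might look like this; `riak_hash` is an illustrative name, not Riak's actual API:

```python
import hashlib

def riak_hash(bucket: str, key: str) -> int:
    """Hash a bucket/key pair to a 160-bit integer on the ring.

    A sketch of the idea only: Riak concatenates bucket and key and
    hashes the result with SHA-1, yielding a 160-bit ring position.
    """
    digest = hashlib.sha1((bucket + key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

h = riak_hash("users", "alice")
assert 0 <= h < 2 ** 160  # every datum lands somewhere on the 2^160 ring
```

Because the hash is deterministic, any node that knows the bucket and key can independently compute where the datum lives.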

As mentioned above, Riak uses consistent hashing to distribute data around the ring to the partitions responsible
for storing data. The ring has a maximum key space of 2^160. Each bucket+key (and its associated value)
is hashed to a location on the ring.

Riak also breaks the ring into a set number of partitions. This number is configured when a cluster is first built.
Each node will be responsible for storing the data hashed to a set number of partitions.
Each storage node will optimistically handle an equal number of partitions.
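A minimal sketch of how a 160-bit hash maps to a partition, assuming an illustrative partition count of 64 (the real count is configured when the cluster is built, as noted above):

```python
RING_SIZE = 2 ** 160      # the full SHA-1 key space
NUM_PARTITIONS = 64       # illustrative; set once per cluster in Riak

def partition_for(hash_value: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Return the index of the partition whose slice of the ring
    contains this hash value. Each partition owns a contiguous,
    equal-sized slice of the 2^160 space."""
    partition_width = RING_SIZE // num_partitions
    return hash_value // partition_width

partition_for(0)              # → 0   (start of the ring)
partition_for(RING_SIZE - 1)  # → 63  (end of the ring)
```

Because each node claims roughly `NUM_PARTITIONS / node_count` partitions, adding a node shifts partition ownership rather than rehashing every key.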

Replication in Riak, like in Dynamo, is fundamental and automatic. Remember above I introduced the
concept of a bucket? In Riak, the replication parameter, "N" (also called "n_val"),
is configurable at the bucket level. The default n_val in Riak is 3, meaning that
out of the box Riak will store three replicas of your data on three different partitions on the ring.
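Continuing the sketch, the N replicas can be modeled as the key's partition plus the next n_val - 1 partitions clockwise around the ring; `preference_list` is an illustrative helper, not Riak's implementation:

```python
def preference_list(partition: int, n_val: int = 3, num_partitions: int = 64):
    """Sketch: replicas for a key live on the key's own partition plus
    the next n_val - 1 partitions clockwise around the ring.
    n_val=3 mirrors Riak's default bucket-level replication value."""
    return [(partition + i) % num_partitions for i in range(n_val)]

preference_list(10)  # → [10, 11, 12]
preference_list(62)  # → [62, 63, 0]  wraps around the ring
```

The modulo makes the ring wrap: the partitions after the last one are the first ones again.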

This diagram is applicable to Riak and the manner in which it replicates data.
The preference list is present in Riak, too, and is the reason why any node in the ring
can coordinate a request. The node receives a request, consults the preference list,
and routes the request accordingly.

Riak is an "eventually consistent" database. All replication is done asynchronously, which,
as you might expect, could result in a datum being returned to the client that is out of
date. But don't worry. We built in some mechanisms to address this.

Much like Dynamo was suited to the design of the shopping cart, Riak and its tradeoffs
are appropriate for a certain set of use cases. We happen to feel that most use cases
can tolerate some level of eventual consistency.

The same holds true for Riak. If, by way of some failure and concurrent update
(rare but quite possible), there come to exist multiple versions of the same object,
Riak will push this decision down to the client (who are we to tell you which is the
authoritative object?). All that said, if your application doesn't need this level of
version control, we enable you to turn the usage of vector clocks on and off at the bucket
level.

Dynamo uses vector clocks [12] in order to capture causality between different versions of the same object.
A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every
version of every object. One can determine whether two versions of an object are on parallel branches or
have a causal ordering, by examining their vector clocks. If the counters on the first object's clock are
less-than-or-equal to all of the nodes in the second clock, then the first is an ancestor of the second and
can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation.

In Dynamo, when a client wishes to update an object, it must specify which version it is updating. This is
done by passing the context it obtained from an earlier read operation, which contains the vector clock
information. Upon processing a read request, if Dynamo has access to multiple branches that cannot be
syntactically reconciled, it will return all the objects at the leaves, with the corresponding version
information in the context. An update using this context is considered to have reconciled the divergent
versions and the branches are collapsed into a single new version.

Figure 3: Version evolution of an object over time.

To illustrate the use of vector clocks, let us consider the example shown in Figure 3. A client writes a
new object. The node (say Sx) that handles the write for this key increases its sequence number and uses
it to create the data's vector clock. The system now has the object D1 and its associated clock [(Sx, 1)].
The client updates the object. Assume the same node handles this request as well. The system now also has
object D2 and its associated clock [(Sx, 2)]. D2 descends from D1 and therefore over-writes D1, however
there may be replicas of D1 lingering at nodes that have not yet seen D2. Let us assume that the same
client updates the object again and a different server (say Sy) handles the request. The system now has
data D3 and its associated clock [(Sx, 2), (Sy, 1)]. Next assume a different client reads D2 and then
tries to update it, and another node (say Sz) does the write. The system now has D4 (descendant of D2)
whose version clock is [(Sx, 2), (Sz, 1)]. A node that is aware of D1 or D2 could determine, upon
receiving D4 and its clock, that D1 and D2 are overwritten by the new data and can be garbage collected.
A node that is aware of D3 and receives D4 will find that there is no causal relation between them. In
other words, there are changes in D3 and D4 that are not reflected in each other. Both versions of the
data must be kept and presented to a client (upon a read) for semantic reconciliation.

Now assume some client reads both D3 and D4 (the context will reflect that both values were found by the
read). The read's context is a summary of the clocks of D3 and D4, namely [(Sx, 2), (Sy, 1), (Sz, 1)].
If the client performs the reconciliation and node Sx coordinates the write, Sx will update its sequence
number in the clock. The new data D5 will have the following clock: [(Sx, 3), (Sy, 1), (Sz, 1)].
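The causal-ordering check the paper walks through can be sketched with clocks as plain dicts of node → counter; this is a toy model of the comparison rule, not Riak's vclock code:

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` descends from clock `b`: every counter in b
    is <= the matching counter in a, so b is an ancestor of a."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

# The clocks from the paper's Figure 3 example:
d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}

assert descends(d3, d2)      # D3 descends from D2 and overwrites it
assert not descends(d4, d3)  # D3 and D4 conflict: neither descends
assert not descends(d3, d4)  # from the other, so both must be kept

# Reconciliation: merge the clocks, then the coordinating node (Sx)
# bumps its own counter, yielding the paper's [(Sx, 3), (Sy, 1), (Sz, 1)].
d5 = {n: max(d3.get(n, 0), d4.get(n, 0)) for n in {**d3, **d4}}
d5["Sx"] += 1
assert d5 == {"Sx": 3, "Sy": 1, "Sz": 1}
```

The key property: the comparison is a partial order, which is exactly why concurrent updates can produce siblings that no node can resolve syntactically.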

Riak makes use of the same values. But, thanks to our concept of buckets, we made it a bit more
customizable. The default R and W values are set at the bucket level but can be configured at
the request level if the developer deems it necessary for certain data.
"Quorum" as described in Dynamo is the default setting in Riak.
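The quorum arithmetic behind these settings can be stated in a few lines; the values below are the common defaults, shown for illustration only:

```python
# Sketch of the quorum arithmetic only.
N = 3  # replicas per key (Riak's bucket-level n_val, default 3)
R = 2  # replica responses required for a successful read
W = 2  # replica acks required for a successful write

# A "quorum" configuration means R + W > N: every read set and every
# write set must overlap in at least one replica, so a read is
# guaranteed to see at least one copy of the most recent write.
assert R + W > N
```

Lowering R or W trades that overlap guarantee for lower latency, which is why Riak lets you override them per request.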

4.8 Membership and Failure Detection

4.8.1 Ring Membership

In Amazon's environment node outages (due to failures and maintenance tasks) are often transient but may
last for extended intervals. A node outage rarely signifies a permanent departure and therefore should not
result in rebalancing of the partition assignment or repair of the unreachable replicas. Similarly, manual
error could result in the unintentional startup of new Dynamo nodes. For these reasons, it was deemed
appropriate to use an explicit mechanism to initiate the addition and removal of nodes from a Dynamo ring.
An administrator uses a command line tool or a browser to connect to a Dynamo node and issue a membership
change to join a node to a ring or remove a node from a ring. The node that serves the request writes the
membership change and its time of issue to persistent store. The membership changes form a history because
nodes can be removed and added back multiple times. A gossip-based protocol propagates membership changes
and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random
every second and the two nodes efficiently reconcile their persisted membership change histories.

When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash
space) and maps nodes to their respective token sets. The mapping is persisted on disk and initially
contains only the local node and token set. The mappings stored at different Dynamo nodes are reconciled
during the same communication exchange that reconciles the membership change histories. Therefore,
partitioning and placement information also propagates via the gossip-based protocol and each storage
node is aware of the token ranges handled by its peers. This allows each node to forward a key's
read/write operations to the right set of nodes directly.

4.8.2 External Discovery

The mechanism described above could temporarily result in a logically partitioned Dynamo ring. For
example, the administrator could contact node A to join A to the ring, then contact node B to join B to
the ring. In this scenario, nodes A and B would each consider itself a member of the ring, yet neither
would be immediately aware of the other. To prevent logical partitions, some Dynamo nodes play the role
of seeds. Seeds are nodes that are discovered via an external mechanism and are known to all nodes.
Because all nodes eventually reconcile their membership with a seed, logical partitions are highly
unlikely. Seeds can be obtained either from static configuration or from a configuration service.
Typically seeds are fully functional nodes in the Dynamo ring.

4.8.3 Failure Detection

Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peers during get()
and put() operations and when transferring partitions and hinted replicas. For the purpose of avoiding
failed attempts at communication, a purely local notion of failure detection is entirely sufficient: node
A may consider node B failed if node B does not respond to node A's messages (even if B is responsive to
node C's messages). In the presence of a steady rate of client requests generating inter-node
communication in the Dynamo ring, a node A quickly discovers that a node B is unresponsive when B fails
to respond to a message; node A then uses alternate nodes to service requests that map to B's partitions;
A periodically retries B to check for the latter's recovery. In the absence of client requests to drive
traffic between two nodes, neither node really needs to know whether the other is reachable and
responsive.

Decentralized failure detection protocols use a simple gossip-style protocol that enable each node in the
system to learn about the arrival (or departure) of other nodes. For detailed information on
decentralized failure detectors and the parameters affecting their accuracy, the interested reader is
referred to [8]. Early designs of Dynamo used a decentralized failure detector to maintain a globally
consistent view of failure state. Later it was determined that the explicit node join and leave methods
obviates the need for a global view of failure state. This is because nodes are notified of permanent
node additions and removals by the explicit node join and leave methods and temporary node failures are
detected by the individual nodes when they fail to communicate with others (while forwarding requests).
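The purely local failure detection described in 4.8.3 can be sketched as a small piece of per-node state; the class name and parameters here are illustrative, not from Dynamo or Riak:

```python
import time

class LocalFailureDetector:
    """Toy model of Dynamo's local failure detection: node A marks node B
    unreachable when B stops answering, routes around it, and retries B
    periodically to notice its recovery. No global failure view is kept."""

    def __init__(self, retry_interval: float = 5.0):
        self.retry_interval = retry_interval
        self.last_failure = {}  # node name -> monotonic time of last failed contact

    def record_failure(self, node: str) -> None:
        """Called when a message to `node` times out."""
        self.last_failure[node] = time.monotonic()

    def record_success(self, node: str) -> None:
        """Called when `node` responds; it is no longer considered down."""
        self.last_failure.pop(node, None)

    def should_try(self, node: str) -> bool:
        """Skip nodes that recently failed, but retry them after
        retry_interval so recovery is eventually detected."""
        failed_at = self.last_failure.get(node)
        if failed_at is None:
            return True  # not known to be down
        return time.monotonic() - failed_at >= self.retry_interval
```

Note the purely local view: two detectors on different nodes may legitimately disagree about whether B is up, which is exactly the behavior the paper describes.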