Musings on Computing and Storage

It’s been ten months or so since I have been in Bangalore. My voracious appetite for news and analysis now includes (in addition to Nyt/Wapo etc.) – regular feeds of Hindu, Firstpost and links from many more local contacts on FB/Quora. There’s finally a point where one thinks that one understands what’s going on. The glasses finally fit – everything comes into focus. I think i reached that point in the last few weeks. It’s possible I am wrong. But i think the world economy is irretrievably screwed up.

This post is for me to look back and hopefully remember as an example of how the pendulum swings too far on the dark side. It’s also simply catharsis.

It’s rather simple. Globalization lead to investments and transfer of production to developing countries. The developed countries weren’t producing as much anymore. In a rational world – they would have had to decrease consumption to match. Instead they kept going at it. No democratic Govt. can tell it’s citizens to lower their standards of living. And there were funds available to borrow. So the Americans, the Europeans – they kept (and keep) living off money they don’t have.

Prices should have fallen much more than they did. After all – wages in developing countries are an order of magnitude less than the West. But the false demand from West, built on a sea of debt, kept them from getting totally depressed. The wages for workers in India, for example, are still just barely sustenance levels. So who made out?

The companies that produced cheaply and sold dearly had to have made out. Strong corporate profits and balance sheets over the last few years are testimony.

These developing countries are absolutely and totally corrupt. If I were to use India as an example – the politicians, the mafia, some business houses – a small handful of brutally corrupt folks made out big time. That investment that came into the country – a big chunk of that – from direct bribes, to taxation (most of which just ends up as black money) to subtle theft of savings (forced investments from Govt. run savings institutions like LIC and SBI, depreciation of currency, inflation) – all this just disappears into a few private pockets.

So think about it. An indebted middle class in the developed world trades with a penurious one in the developing countries. And a whole lot of money, that used to be in the pockets of hundreds of millions of relatively honest middle class people just disappeared into thin air. Into Swiss accounts, Gold – whatever can never be taken back.

The permanent loss of demand that is happening because of this altered equilibrium is the root cause of world economic ills. It cannot be fixed by developed country govt. stimulus. It cannot be fixed structurally. It cannot be fixed just by taxing the rich in the developed world (for the rich have a lot of places to hide wealth). In America – they dream that the developing world will one day become rich enough that all boats will again start rising. They never consider that not all outcomes are so sanguine. Maybe – the boats will all end up lower.

So the global economy is now a robber baron one. With no global institutions and mechanisms to pull it back. Maybe we are simply reverting to mean – maybe the last half of the 20th century was an aberration. Enough for one day – we will end on a cheerful note by not bringing up global warming.

I studied during my high school days in a small town called Ranipur - on the outskirts of the Hindu holy town of Hardwar. It was beautiful – but remote and parochial. The girls and the boys would sit on separate aisles in the most well known school in town (that i used to attend). When we entered high school – some of my best friends left town. Ranipur was too small a place to get into IIT from – they left to join schools in Delhi (DPS RK-Puram most prominently). The best that anyone had ever ranked in JEE from Ranipur was a few hundreds. Like many other middle class families in pre-reform India – we had little spare change. Getting into a good engineering school was a must – our future financial security pretty much depended on it.

For two years – I studied like crazy for the JEE. It was inhumane – and I wouldn’t be able to do it again. I would doze off solving problems and wake up and then go at them again. I was scrawny, rarely played and was almost constantly sick. Life rotated around solving problems from Irodov - every starred problem solved was like a mini-achievement. Amongst the more bizarre things I would keep track of was the number of notebooks and ball-pens I would run through every week solving problems. School – when I went there – was for fun (ie. gawking at girls). Visiting friends from Delhi would drive me nuts – it seemed like they were much further along in their preparation than I ever was. We went to Delhi a couple of times to attend training camps run by Brilliants and Agarwals. One time I saw this dude in a blue shirt sitting on the front bench (I was, obviously, a back bencher) – he solved problems as soon as the instructor wrote them out – and completely psyched me out. (I learnt later his name was Ashish Thusoo). Another time we saw this guy named Basu. He was Class Xth topper – had appeared on the cover of a famous Science Magazine. After one of the training exams – everyone would cluster around him – how did Basu solve it?

The JEE gave me role models – the guy in the blue shirt, Basu. The guy from my old school in Delhi where I had studied until the 8th grade – Pankaj Gupta - who came 15th in ’91 JEE. Deepankar Aron – who aced ’91 JEE as well from Ranipur – coming in an amazing 193! The pictures of JEE toppers from Agarwals and Brilliants tutorials would constantly swim across my eyes – Ashish Goel, Alok Mittal, Vineet Buch. I wanted to join them on those brochures. If they could do it – maybe so could I.

The JEE made me a better student. I was not a genius – but I practiced and became better. It was the 10,000 hour rule in action – I solved enough difficult Physics problems that I almost looked and felt (to others) like a genius. Chemistry was horrible – the only subject for which I got a private tutor . When I took the CBSE board exams (that are proposed, in part, to replace the JEE) – the difficulty level was like taking exams from a couple of grades below. I scored 99/100 in all the science/math exams. It didn’t matter, it was too easy – and I was laughing about the results. The teachers were impressed, maybe even the girls – hey – maybe the Board exams were good for something!

For a small towner like me – the JEE gave me an equal opportunity to prove myself. I had felt disadvantaged – but it turned out I was not (quite the converse actually). I was able to focus for two years, undistracted by cinemas, big town action (and prettier girls presumably) to make myself into a better student. The folks from DPS RK-Puram had no advantage on me – they were richer, better connected and had more resources – but it didn’t matter. Connections and money didn’t buy JEE ranks. Talent and hard-work did. People couldn’t put me down because I didn’t look right, or wear the right clothes and come to school in a nice car. I was the king of my study.

I ranked 18th in ’92 JEE. It was scandalous – it wasn’t supposed to happen. I remember vividly my father’s smile when I got off the phone and told him about it. I think he lifted me up. I don’t think I have ever – before or since – seen a smile like that. He’s no more – but he lives in my heart wearing that smile. In moments like this – I still grieve for him.

Like the giants before me – I became a role model for the people who followed me from Ranipur. High rankers exploded. If I could do it – so could they – and even better. While my family left Ranipur a long long time back – I know my name lived on there as a myth – egging others on. I had played my little role in the Circle of Life. And yeah – I had made my way into the Agarwals brochure.

In the years since – the JEE also reminded me constantly of who I could be at my best. At my down moments (and there were many) – there was this ultimate fallback. All I knew I had to do was throw myself, heart and soul, at something – and I would come off OK.

So goodbye JEE. You made me, in good part, who I am. And you will live on, living though you never were, in my heart and in the hearts of many others like me. My child and others in the next generation may not be able to aspire for you anymore. And for that, we will all be worse off.

Backwards march!

The Real Heroes

The responses have been overwhelming – the comments are better than the post. Here are some amazing people who have left their stories behind:

Board Exam Ridiculousness

I haven’t blogged for a couple of years now. It seemed little fun to rage against stuff – or to put expression to ideas that didn’t seem to have a chance to materialize. Ideas need a successful business plan to make a positive difference.

Or like Zuck said: “we don’t build services to make money; we make money to build better services”

So we started this journey – to build a company that would build great data products, serve it’s customers famously – and then build even better products out of that fuel.

I am, unfortunately, also old enough to not feel like beating my head against a wall. But these are exciting times – the cloud is the PC of the 80s. Like crap today – but with obvious potential. Surfers ride waves. Software guys ride technology trends. Lucky to get a chance to ride this one.

So hopefully – the next couple of years will not be like the last two. There will be a lot to discuss while we build some cool stuff. Most of my personal technical work – direct or indirect – will show up at the Qubole Blog . We are also pimping ourselves at all the social media pitstops du jour (Linkedin, Facebook or Twitter) – follow along for the ride – it will be fun. I guarantee it.

HBase + Map-Reduce is a really awesome combination. In all the back and forths about NoSQL – one of the things that’s often missed out is how convenient it is to be able to do scalable data analysis directly against large online data sets (that new distributed databases like HBase allow). This is a particularly fun and feasible prospect now that frameworks like Hive allow users to express such analysis in plain old SQL.

The BigTable paper mentioned this aspect only in passing. But looking back – the successor to the LSM Trees paper (from which BigTable’s on-disk structures are inspired) – called LHAM – had a great investigation into the possibilities in this area. One of the best ideas from here is that analytic queries can have a different I/O path than real-time read/write traffic – even when accessing the same logical data set. Extrapolating this to HBase:

However Applications that do not care about the very latest updates can directly access compacted files from HDFS

Applications in the second category are surprisingly common. Many analytical and reporting applications are perfectly fine with data that’s about a day old or so. Issuing large IO against HDFS will always be more performant than streaming data out of region servers. And it allows one to size the HBase clusters based on the (more steady) real-time traffic – not based on large bursty batch jobs (alternately put – partially isolates the online traffic from resource contention by batch jobs).

Many more possibilities arise. Data can be duplicated (for batch workloads) by compaction processes in HBase:

To a separate HDFS cluster – this would provide complete I/O path isolation between online and offline workloads

Historical versions for use by analytic workloads can be physically organized in (columnar) formats that are even more friendly for analytics (and more compressible as well).

This also eliminates cumbersome Extraction processes (from the (in)famous trio of ETL) that usually convey data from online databases to back-end data warehouses.

A slightly more ambitious possibility is to exploit HBase data structures for the purposes of data analytics – without requiring the use of active HBase components. Hive, for example, was designed for immutable data sets. Row level updates are difficult without overwriting entire data partitions. While this is a perfect model for processing web logs, this doesn’t fit all real-world scenarios. In some applications – there are very large dimension data sets (for example – those capturing the status of micro-transactions) that are mutable – but that can have a scale approaching those of web logs. They are also naturally time partitioned – where mutations in old partitions happen rarely/infrequently. Organizing Hive partitions storing such data sets as a LSM tree is an intriguing possibility. It would allow the rare row level updates to be a cheap append operation to the partition, while largely preserving the sequential data layout that provides high read bandwidth for analytic queries. As a bonus, writers and readers would’t contend with each other.

Many of (harder) pieces of this puzzle are already in place. As an example – Hadoop already has multiple columnar file formats – including Hive’s RCFile. HBase’s HFile format was recently re-written and organized in a very modular fashion (should be easy to plug into Hive directly for example). A lot of interesting possibilities, it seems, are just round the corner.

Thanks to everyone who responded (in different forums). i will try to summarize the responses made on the comments to my initial post as well as some of the substantive discussions at Dave’s blog and news.ycombinator.

The things i have heard are roughly in this order:

Data loss scenario under the Eventual Consistency section is not possible with Vector Clocks: agreed. I screwed up. I was thinking about Cassandra that does not have Vector clocks – only client timestamps – where this is a definite possibility. I have updated the section with this information.

However, i remain convinced that one should not force clients to deal with stale reads in environments where they can be avoided. As i have mentioned in the updated initial post – there are simple examples where stale reads cause havoc. One may not be able to do conflict resolution or the reads can affect other keys in ways that are hard to fix later.

The other point that i would re-emphasize is that there is no bound on how ‘stale’ the reads are. Nodes can be down for significant amounts of time or they may rejoin the cluster after having lost some disks. It’s hard to imagine writing applications where the data returned is that much out of date.

About Vector Clocks and multiple versions – it’s not a surprise that they were not implemented in Cassandra. In Cassandra – the cost of having to retrieve many versions of a key increases the disk seek costs reads multi-fold. Due to the usage of LSM trees, a disk seek may be required for each file that has a version of the key. Even though the versions may not require reconciliation, one still has to read them.

Analysis of quorum protocols is wrong: I don’t think so. Consider W=3, R=1, N=3. In this case, one node disappearing for a while and coming back and re-joining cluster is clearly problematic. That node may serve the single copy of data required by a read operation. Depending on how much data that node does not have – the read may be very stale. Dynamo paper says this setting is used for persistent caching – but i am surprised that Amazon can afford to read cached data that is potentially grossly out of date (we can’t).

Consider W=2, R=2, N=3. This has somewhat better properties. I need to take out two nodes in same shard and re-insert them after a while into the cluster to get stale reads. Writes in the interim succeed due to quorums provided by hinted handoffs. So stale reads are still possible – if considerably more improbable. Let’s do the math: we want a system that’s tolerant of 2 faults (which is why we have 3 replicas). Let’s say in a cluster of 100 nodes – that the mean number of times two nodes are simultaneously down is roughly once per day (taking a totally wild guess). Under the assumption that a single node serves a single key range – the odds of these two nodes belonging to the same quorum group is about 0.06 (100*(3C2)/100C2). That means that roughly every 16 days my cluster may get into a condition where there can be stale reads happen even with W=2 and R=2. Bad enough. Now if we throw virtual nodes into the equation – the odds will go up. The exact math sounds tough – but at a low ratio of virtual nodes per physical node – the odds will likely go up in direct proportion to the ratio. So with a virtual/physical ratio of 2 – i could see stale reads every week or so.

Consider also the downsides of this scheme: twice the read is a very high penalty in a system. In Cassandra – which is write optimized – read overheads are worse that traditional btrees. Note also that although writes can be made highly available by hinted handoffs – there’s no such savior for reads. If the 3 replicas span a WAN – then one of the data centers only has one copy of the data. R=2 means one must read from both the sides of the WAN when reading from this data center. That sounds pretty scary and highly partition intolerant and unavailable to me !

Replication schemes with point in time consistency also don’t prevent stale reads: Let me clarify – i simply wanted to correct the assertion in the paper that commercial databases update replicas across a WAN synchronously. They mostly don’t. They also aren’t typically deployed to perform transactions concurrently from more than one site. So there’s no comparison to Dynamo – except to point out that async replication with point in time consistency is pretty darned good for disaster recovery. That’s it.

Pesky resync-barriers and quorums again
One more problem with Dynamo: i completely overlooked that resynchronization barriers, in the way i was thinking are impossible to implement (to fix the consistency issues).

The problem is this – how does a node know it’s rejoining the cluster and it’s out of date? Of course – if a node is rebooting – then this is simple to guess. However consider a more tricky failure condition – the node’s network card (or the switch port) keeps going up and down. The software on the node thinks everything is healthy – but in effect it’s leaving and re-joining the cluster (every once in a while).

In this case – even if Merkel trees are totally in-expensive (as Dave claims in his post) – i still wouldn’t know when exactly to invoke them in such a way as to not serve stale reads. (surely i can’t invoke them before every read – i might as well read from all the replicas then!)

So, unfortunately, i am repeating this yet again – Dynamo’s quorum consensus protocol seems fundamentally broken. How can one write outside the quorum group and claim a write quorum? And when one does so – how can one get consistent reads without reading every freaking replica all the time? (well – the answer is – one doesn’t – which is why Dynamo is eventually consistent. I just hope that users/developers of Dynamo clones realize this now).

Symmetry
While i pointed out the inherent contradiction in Dynamo’s goal of symmetry and the notion of seeds – i did not sufficiently point out the downside of Symmetry as a design principle.

One aspect of this is that server hardware configurations are inherently asymmetric. The way one configures a highly available stateful centralized server is very different from the way one configures a cheap/stateless web server. By choosing symmetry as a design principle – one effectively rules out using different hardware for different components in a complex software system. IMHO – this is a mistake.

Another aspect of this is that network connectivity is not symmetric. Connectivity between nodes inside a data center has properties (no partitions) that connectivity across across data centers does not have. Dynamo’s single ring topology does not reflect this inherent asymmetry in network connectivity. It’s design treats all the inter-node links the same – where they clearly aren’t.

Lastly, symmetry prevents us from decomposing complex software into separate services that can be built and deployed independently (and on different machines even). This is how most software gets built these days (so Dynamo is in some sense an anachronism). Zookeeper is a good example of a primitive that is useful to build a system like Dynamo that can be run on dedicated nodes.

Fault Isolation and Recovery

As a matter of detail – Dynamo paper also does not talk about how it addresses data corruptions (say a particular disk block had corrupt data) or disk failures (on a multi-disk machine – what is the recovery protocol for a single disk outage). In general the issue is fault isolation and recovery from the same. This is not a flaw per se – rather I am just pointing out that one cannot build any kind of storage system without addressing these issues. Cassandra also doesn’t have any solutions for these issues (but some solutions may be in the works).

Revisiting and Summarizing

A lot of things have been mentioned by now – let me try to summarize my main points of contention so far:

Stale Reads are bad. We should do our utmost to not have them if they can be avoided.

Unbounded Stale Reads are pure evil and unacceptable. Even under disaster scenarios – applications expect finite/bounded data loss. In most cases – admins will prefer to bring down a service (or parts of it) rather than take unbounded data loss/corruption.

Network Partitions within a data center can (and are) avoided by redundant network connectivity (they are usually intolerable). Designing for partition tolerance within a data center does not make sense.

Centralization does not mean low availability. Very high availability central services can be built – although scalability can be a concern.

The notion of Symmetry as a deployment and design principle does not model well the asymmetry that is inherent in hardware configurations and networking

Consistency, high availability and scalability are simultaneously achievable in a data center environment (that does not have partitions). BigTable+GFS, HBase+HDFS (perhaps even an Oracle RAC database) are good examples of such systems. Strong Consistency means that these systems do not suffer from stale reads

Dynamo’s read/write protocols can cause stale reads even when deployed inside a single data center

No bound can be put on the degree of staleness of such reads (which is, of course, why the system is described as eventually consistent).

When deployed across data centers, there is no way in Dynamo to track how many pending updates have not been reflected globally. When trying to recover from a disaster (by potentially changing quorum votes) – the admin will have no knowledge of just how much data has been lost (and will be possibly corrupted forever).

Enough said. Hopefully, in part II (when i get a chance), i can try to list some design alternatives in this space (I have already hinted at the broad principles that i would follow: don’t shirk centralization or asymmetry, model the asymmetry in the real world, don’t build for partition tolerance where partitions don’t/shouldn’t exist, think about what the admin would need in a disaster scenario).

(The opinions in this article are entirely mine – although i owe my education in part to other people)

(Update: Please do read the followup. Conversations around the web on this topic are well tracked here)

Background

Recently i have had to look at Dynamo in close detail as part of a project. Unlike my previous casual perusals of the paper/architecture – this time i (along with some other colleagues) spent a good amount of time looking at the details of the architecture. Given the amount of buzz around this paper, the number of clones it has spawned, and the numerous users using those clones now – our takeaways were shockingly negative.

Before posting this note – I debated whether Dynamo was simply inappropriate for our application (and whether calling it ‘flawed’ would therefore be a mis-statement). But it is clear in my mind, that the paper is flawed – it tries to lead readers to believe things that are largely untrue and it makes architectural choices that are questionable. It’s design contradicts some of it’s own stated philosophies. Finally, i believe that the problems it proposes to solve – are solvable by other means that do not suffer from Dynamo’s design flaws. I will try to cover these points in a series of posts.

Eventual Consistency

First – i will start with the notion of ‘eventual’ consistency which Dynamo adhers to. What does this mean to the application programmer? Here are some practical implications:

committed writes don’t show up in subsequent reads

committed writes may show up in some subsequent reads and go missing thereafter

that there is no SLA for when the writes are globally committed (and hence show up in all subsequent reads)

At a high level – not having SLA is not a practical proposition. ‘Eventually’ we are all dead. A system that returns what I wrote only after I am dead is no good for me (or anyone else). At a more ground level – returning stale data (the logical outcome of eventual consistency) leads to severe data loss in many practical applications. Let’s say that one is storing key-value pairs in Dynamo – where the value encodes a ‘list’. If Dynamo returns a stale read for a key and claims the key is missing, the application will create a new empty list and store it back in Dynamo. This will cause the existing key to be wiped out. Depending on how ‘stale’ the read was – the data loss (due to truncation of the list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded data loss – not even in the case of a Disaster.

(Update: Several people have called out that this data loss scenario is not possible in Dynamo due to Vector Clocks. That sounds correct. Some followups:

The scenario is possible in Cassandra that does not use vector clocks but only client timestamps

Dynamo depends on conflict resolution to solve this problem. Such conflicts are very difficult to resolve – in particular if deletion from the list is a valid operation – then how would one reconcile after mistaken truncation?

In another simple scenario – a stale read may end up affecting writes to other keys – and this would be even harder to fix

The general points here are that returning stale reads is best avoided and where cannot be avoided – at least having some bounds on the staleness allows one to write applications with reasonable behavior. Dynamo puts no bounds on how stale a read can be and returns stale reads in single data-center environments where it can be entirely avoided)

Quorum Consensus

Dynamo starts by saying it’s eventually consistent – but then in Section 4.5. it claims a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting the number of reads (R) and number of writes (W) to be more than the total number of replicas (N) (ie. R+W>N) – one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many many updates and then rejoin the cluster – but are admitted back to the quorum group without any resynchronization barrier. As a result, reading from R copies is not sufficient to give up-to-date data. This is partly the reason why the system is only ‘eventually’ consistent. Of course – in a large cluster – nodes will leave and re-join the cluster all the time – so stale reads will be a matter of course.

This leads to the obvious question – why can one simply not put into place a resynchronization barrier when nodes re-join the cluster? The answer to this is troublesome: no one knows whether a node is in sync or not when it rejoins the cluster. No one can tell how much data it doesn’t have. There is no central commit log in the system. The only way to figure out the answer to these questions is to do a full diff of the data (Dynamo uses something called Merkel Merkle trees to do this) against all other members of the quorum group. This will of course be remarkably expensive and practically infeasible to do synchronously. Hence it is (at best) performed in the background.

The other way to provide strong consistency is to read from all the replicas all the time. There are two problems with this:

This clearly fails under the case where the only surviving member of the quorum group is out of date (because it had failed previously and hasn’t been bought up to date)

This imposes a super-high penalty for reads that is otherwise not implied by a lower setting for the parameter R. In fact it renders the parameter R moot – wondering why Dynamo would talk about quorum consensus in the first place?

I am not the first one to observe these problems. See for example Cassandra-225 which was an attempt to solve this problem with a centralized commit log in Cassandra (a Dynamo clone).

WAN considerations

It’s worth pointing out that when Dynamo clusters span a WAN – the odds of nodes re-joining the cluster and remaining out of date are significantly increased. If a node goes down, ‘hinted handoff’ sends updates to the next node in the ring. Since nodes of the two data centers alternate – the updates are sent to the remote data center. When the node re-joins the cluster, if the network is partitioned (which happen all the time), the node will not catch up on pending updates for a long time (until the network partitioning is healed).

Disaster Recovery

The effect of eventual consistency and replication design is felt most acutely when one considers the case of disaster recovery. If one data center fails, there is absolutely nothing one can say about the state of the surviving data center. One cannot quantify exactly how much data has been lost. With standard log-shipping based replication and disaster recovery, one can at least keep track of replication lag and have some idea of how far behind the surviving cluster is.

Lack of point in time consistency at the surviving replica (that is evident in this scenario) is very problematic for most applications. In cases where one transaction (B) populates entites that refer to entities populated in previous transactions (A), the effect of B being applied to the remote replica without A being applied leads to inconsistencies that applications are typically ill equipped to handle (and doing so would make most applications complicated).

Contradictions

At the beginning of the paper, the paper stresses the principle of Symmetry. To quote:

Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers

By the time we get to Section 4.8.2, this principle is gone, to quote:

To prevent logical partitions, some Dynamo nodes play the role of seeds.

Again, in section 2, one reads:

Dynamo has a simple key/value interface, is highly available with a clearly defined consistency window

and by the time one gets to Section 2.3, one reads:

Dynamo is designed to be an eventually consistent data store

where of course – the term ‘eventual’ is not quantified with any ‘window’!

As an engineer who has worked extensively on production commercial replication and disaster recovery systems – I can vouch this claim is incorrect. Most database and storage systems are deployed in asynchronous (but serial) replication mode. Replicas are always point-in-time consistent. Commercial recovery technologies (backups, snapshots, replication) all typically rely on point-in-time consistency. The combination of high availability and point in time consistency at remote data centers is relatively easy to achieve.

It is possible that the paper is trying to refer to distributed databases that are distributed globally and update-able concurrently. However, these are extremely tiny minority of commercial database deployments (if they exist at all) and it’s worth noting that this statement is largely untrue in practice.

Consistency versus Availability

Several outside observers have noted that Dynamo is chooses the AP of the CAP theorem – while other systems (notably BigTable) choose CA. Unfortunately, Dynamo does not distinguish between ‘Availability’ and ‘Partition Tolerance’ in the SOSP paper.

The reader is left with the impression that there is always a tradeoff between Consistency and Availability of all kinds. This is, of course, untrue. One can achieve strong Consistency and High-Availability within a single data center – and this is on par for most commercial databases – as well as for systems like HBase/HDFS.

The unfortunate outcome of this is that people who are looking for scalable storage systems even within a single data center may conclude (incorrectly) that Dynamo is a better architecture (than BigTable+GFS).

In the past, centralized control has resulted in outages and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system.

Again – as an engineer who worked on in data-center high availability for years – i find this general line of thought questionable. Centralized control typically causes scalability bottlenecks. But by no means are they necessarily of low availability. The entire storage industry churns out highly available storage servers – typically by arranging for no single points of failure (SPOF). This means dual everything (dual motherboards, dual nics, dual switches, RAID, multipathing etc. etc.) and synchronously mirrored write-back caches (across motherboards). A storage server is most often a central entity that is highly available – and as such i am willing to bet that the bank accounts of all the Dynamo authors are stored and retrieved via some such server sitting in a vault somewhere. Such servers typically have 5 9′s availability (way more than anything Amazon offers for any of it’s services). The principles employed in building such highly available servers are well understood and easily applied to other centralized entities (for example the HDFS namenode).

Of course – having highly available centralized services needs discipline. One needs to have redundancy at every layer in the service (including, for example, power supplies for a rack and network uplinks). Many IT operations do not apply such discipline – but the resultant lack of availability is hardly the fault of the centralized architecture itself.

The irony in this claim is that a Dynamo cluster is likely itself to be centralized in some shape or form. One would likely have some racks of nodes in a Dynamo cluster all hanging off one set of core-switches. As such it’s clients (application/web-servers) would be in different network segment connected to different core-switches. Network partitioning between these two sets of core-switches would make Dynamo unavailable to the clients. (While this example shows how futile Dynamo’s goal of avoiding centralization is – it also shows how data centers need to be architected (with dual uplinks and switches at every network hop) to prevent network partitioning from happening and isolating centralized services).

Finally (and with extreme irony) we have already seen that the lack of centralization (of commit logs for instance) is the reason behind many of the consistency issues affecting Dynamo.

What next?

There are many other issues regarding fault isolation from data corruptions that are also worth discussing. And as promised, I will try to cover simpler schemes to cover some of the design goals of Dynamo as well. If all is well – in a subsequent post.

Mark Callaghan started a discussion on mysql replication lag today on the MySql@Facebook page. This happens to be one of my favorite topics – thanks to some related work i did at Netapp. I was happy to know that there’s already a bunch of stuff happening in this area for MySQL – and thought it would be good to document some of my experiences and how they correlate to the MySQL efforts.

Somewhere back in 2003/2004 – i started looking at file system replay at Netapp (primarily because it was the coolest/most important problem i could find that didn’t have half a dozen people looking at it already). Netapp maintains an operation log in NVRAM (that is mirrored over to a cluster partner in HA configurations). A filer restart or failover requires a log replay before the file system can be bought online. The log may have transactions dependent on each other (a file delete depends on a prior file create for example) – so the simplest way to replay the log is serial. Because replay happens when the file system cache is cold – most log replay operations typically block waiting for data/metadata to be paged in from disk. No wonder that basic serial log replay is dog slow and this was one of the aspects of failover that we struggled to put a bound on. To make things worse – NVRAM sizes (and hence accumulated operation log on failure) were becoming bigger and bigger (whereas disk random iops were not getting any better) – so this was a problem that was becoming worse with time.

When I started working on this problem – there were some possible ideas floating around – but nothing that we knew would work for sure. Some of the obvious approaches were not good enough in the worst case (parallelize log replay per file system) or were too complicated to fathom (analyze the log and break into independent streams that could be replayed in parallel). The latter particularly because there was no precise list of transactions and their dependent resources – the log format was one hairy ball (a common characteristic of closed source systems). At the beginning – I started with an approach where the failed filer’s memory could be addressed by filer taking over (ie. instead of going to disk – it could use the partner filer’s memory for filling in required blocks). But this was problematic since buffers could be dirty – and those weren’t usable for replay. It was, of course, also problematic since it would have never worked in the simple restart case. At some point playing around with this approach – I started having to maintain a list of buffers required for each log replay (don’t precisely remember why) – and soon enough I had this Eureka moment where the realization dawned that this was all was ever needed to speed up log replay.

The technique is documented in one of the few worthy patents i have ever filed (USPTO link) – but long story short – it works like this:

The filer keeps track of file system blocks required to perform each modifying transaction (in a separate log called the WAFL Catalog)

This log is also written to NVRAM (so it’s available at the time of log replay) and mirrored to partner filers

At replay time – the first thing that happens is that the filer issues a huge boatload of read requests using the Catalog. It doesn’t wait for these requests to complete and doesn’t really care whether the requests succeed or fail – the hope is that most of them succeed and warm up the cache.

Rest of replay happens as before (serially) – but most of the operations being replayed find the required blocks in memory already (thanks to the IO requests issued in the previous step)

This implementation has been shipping on filers since at least 2005. It resulted in pretty ginormous speedups in log replay (4-5x was common). I was especially fond of this work because of the way it materialized (basically hacking around) and the way it worked (a pure optimization that was non-intrusive to the basic architecture of the system) and because of the generality of the principles involved (it was apparent that such form of cataloging and prefetching could be used in any log replay environment). It was a simple and elegant method – and I have been dying ever since to find other places to apply this to!

As a side note – disks are more efficient at random io operations if they have more of them to do at once. Disk head movement can be minimized to serve a chain of random IO requests. (Disk drivers also sort pending IO requests). During measurements with this implementation – i was able to drive disks to around 200 iops (where the common wisdom is no more than 100 iops per disk) during the prefetch phase of replay (thanks to the long list of IOs generated from the catalog)

Roll forward to 2006 at Yahoo and 2008 at Facebook. Both these organizations perennially complained about replication lag in MySQL. Nothing surprising – operation log replay in a database has the same problem as operation log replay in a filer. So i have been dying to find the time (and frankly motivation) to dive into another enormous codebase and try and implement the same techniques. In bantering with colleagues earlier this year at Facebook – it became apparent that the technique described above was much easier to implement for a database. All transactions in a database are declarative and have a nice grammer (unlike the hair ball nvram log format in Netapp). So one could generate a read query (select * where x=’foo’;) for each DML (update y where x=’foo’;) that would essentially pre-fill blocks required by the DML statement. And that should be it.

So I was pleasantly surprised to learn today that this technique is already implemented in MySQL (and that it has a name too!). It’s called the oracle algorithm and is available as part of the maatkit toolset. Wonderful. I can’t wait to see how it performs on Facebook’s workloads.

MySql also has more comprehensive work on parallelizing replay going on. Harrison Fisk posted a couple of links on design docs and code for the same. It’s not surprising that these efforts take the route i had avoided (detecting conflicts in logs and replaying independent operations in parallel) – the declarative nature of the mysql log makes such efforts much more tractable i would imagine.

Kudos to the MySQL community (and Mark for his work at Facebook) – seems like there’s a lot of good stuff happening.

I often find myself asking the question on what would be the most obvious/big things happening now – if we were looking back five years forward from now. After reading the review on the Intel X25 – there’s no doubt that the emergence of flash technology would be one of those big things.

As a computing professional trained for years to think about hard drives – i found the unique architectural constraints of the Flash chip architecture (as presented in the Birell paper for example) refreshing and thought provoking. For starters – while the naiive assumption of most people is that Flash gives very high random read and write performance – it turns out that from a write perspective they are really like disks – only worse. Not only is one much better off writing sequentially for performance reasons, writing randomly also causes reduced life (because blocks containing randomly over-written pages will need to be erased at some point – and flash chips only support limited number of erasures). The other interesting aspect that Gray’s paper reports is that sequential read bandwidth does depend on contiguity – with maximum bandwidth being obtained at read sizes of around 128KB. The new generation of flash drives (including the X25) also seem to be close in implementation to the Birell paper – implemented more like log structured file systems rather than traditional block devices.

All of which implies that these drives solve some of the old problems (random read performance) but create new ones instead. The problems are entirely predictable and well exemplified by this long term test of the X25. Log Structured file systems cause internal fragmentation – small random overwrites causing a single file’s blocks to be spread randomly – causing terrible sequential read performance (and as Gray’s paper shows – one needs contiguity for sequential read performance even for flash drives). The other obvious aspect is that the efficacy of the lazy garbage cleaning approach depends a lot on free space. The more the free space, the more the overwrites that can be combined into a single erasure and the less the number of extra writes per writes (so called Write Amplification Factor). Conversely, handing over an entire flash disk to an OLTP database seems like a recipe for trouble – write amplification will increase greatly over time (if things work at all). It also seems that there are ATA/SCSI commands (UNMAP) in the works so that applications can inform the disks about free space – however this seems like another can of worms. How does a user level application like mysql/innodb invoke this command? (and how can it do so without a corresponding file system api in case it is using a regular file?)

All of which make me believe that at some point the most prominent database engines are going to sit up and write their own space management over flash drives. For example – instead of a global log structured allocation policy – a database maintaining tables clustered by primary key (for example) is much better served by allocation policies where a range of primary keys are kept close together (this would have much better characteristics when a table is being scanned (either in full or in part).

One of the relatively late lessons I have received in operating a Hadoop cluster has been the (almost overwhelming) importance of compression in storage, computation and network transmission.

One of the architectural questions is whether compression belongs to the file-system (and similarly the networking sub-system) or whether it is something that the application layer (map-reduce and higher layers) are responsible for. While native compression is supported by many file systems (for example ZFS) – the arguments in favor of the former are less well made (or at least less commonly made). On the other hand, columnar, dictionary and other forms of compression at the application level (that exploit data schema and distribution) are common parlance and there’s a whole industry of sorts around these.

After some thought though – i have become more and more convinced that functionality such as compression should be moved as much down the stack as possible. The arguments for this (at least in the context of HDFS and Hadoop) are fairly obvious:

FileSystems can manage data through it’s life cycle applying different forms of compression. One could, for example, keep new data uncompressed, recent data compressed via LZO and then finally migrate to Bzip.

Where multiple replicas of data are maintained – some of the replicas can be maintained as compressed transparently – providing redundancy while saving space. Applications can run against uncompressed data whenever possible

The former may be especially appropriate when data is maintained for disaster recovery. Data can be compressed before transmission to remote site and can be kept in compressed format there

The filesystem may also be able to push compression to external hardware (co-processors)

Fairly similar arguments can be made to move wire compression to the networking sub-system. It can dynamically recognize whether a compute cluster is currently CPU or network bound and come up with appropriate compression level for data transmission. It is impossible for application level compression to achieve these things.

This leaves the tricky question of how to get the same levels of compression that an application may be able to achieve (given knowledge of data schema etc.). The challenge here points to a missing abstraction in the traditional split of file and database systems (and other applications). The traditional ‘byte-stream’ abstraction that filesystems (and networking systems) present means that they don’t have any knowledge of data inside the byte stream. However – if a data stream can be tagged with it’s schema – and optionally even a pointer to the right codec for this stream – then the file-system can easily perform optimal compression (while preserving the benefits above).

Traditionally – these kinds of proposals would also have encountered the standard opposition to running user-land (application) code in kernel space. But with user level file systems like HDFS gaining more and more tractions – and with user level processing becoming more common in operating systems as well (witness FUSE) – that argument is fairly moot. One of the opportunities with systems like Hadoop, i think, is that we can fix these historic anomalies. The open nature of this software and the fact that files are used as a record-stream (and not as a database image) – lends itself to the kind of schemes suggested above.

It’s hard to imagine that i won’t run into him anymore walking University Avenue or that he’s no longer an email away to discuss (and tear down) the latest startup idea at some cafe nearby. My heart goes out to his family.