Thursday, 22 September 2011

Single Writer Principle

When trying to build a highly scalable system, the single biggest limitation on scalability is having multiple writers contend for any item of data or resource. Sure, algorithms can be bad, but let’s assume they have reasonable Big O complexity, so we'll focus on the scalability limitations of the system's design.

I keep seeing people just accept having multiple writers as the norm. There is a lot of research in computer science on managing this contention, and it boils down to two basic approaches. One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes if the underlying resource has not changed while you created the new copy.

Mutual Exclusion

Mutual exclusion is the means by which only one writer can have access to a protected resource at a time, and is usually implemented with a locking strategy. Locking strategies require an arbitrator, usually the operating system kernel, to get involved when contention occurs to decide who gains access and in what order. This can be a very expensive process, often requiring many more CPU cycles than the actual transaction to be applied by the business logic would use. Those waiting to enter the critical section, in advance of performing the mutation, must queue, and this queuing effect (Little's Law) causes latency to become unpredictable and ultimately restricts throughput.
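As a minimal sketch of the mutual exclusion approach, here is a counter protected by a lock in Java (the class name is mine, not from the post):

```java
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of mutual exclusion around a shared counter.
public class MutexCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private long value;

    public void increment() {
        lock.lock(); // contending threads queue here; under contention the kernel arbitrates
        try {
            value++; // the critical section: only one writer at a time
        } finally {
            lock.unlock();
        }
    }

    public long get() {
        lock.lock();
        try {
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

Every increment must pass through the lock, so writers serialise at this point; the queue of waiting threads is exactly the queuing effect described above.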

Optimistic Concurrency Control

Optimistic strategies involve taking a copy of the data, modifying it, then copying back the changes if the data has not mutated in the meantime. If a change has happened in the meantime, you repeat the process until successful. The rate of repetition increases with contention and therefore causes a queuing effect just like mutual exclusion. If you work with a source code control system, such as Subversion or CVS, then you are using this algorithm every day. Optimistic strategies can work with data but do not work so well with resources such as hardware, because you cannot take a copy of the hardware! The ability to perform the changes to data atomically is made possible by the CAS instructions offered by the hardware.
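The optimistic read-modify-write loop described above can be sketched with Java's AtomicLong, whose compareAndSet wraps the hardware CAS instruction (a hypothetical example, not code from the post):

```java
import java.util.concurrent.atomic.AtomicLong;

public class OptimisticCounter {
    private final AtomicLong value = new AtomicLong();

    // Optimistic update: read, compute, then CAS the result back in.
    // If another writer got in first, the CAS fails and we retry.
    public long increment() {
        long current, next;
        do {
            current = value.get();  // take a "copy" of the state
            next = current + 1;     // modify the copy
        } while (!value.compareAndSet(current, next)); // swap in if unchanged
        return next;
    }

    public long get() {
        return value.get();
    }
}
```

Under contention the loop spins, retrying until its CAS wins, which is the queuing effect the paragraph above describes.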

Most locking strategies are composed from optimistic strategies for changing the lock state or mutual exclusion primitive.

Managing Contention vs. Doing Real Work

CPUs can typically process one or more instructions per cycle. For example, modern Intel CPU cores each have 6 execution units that can be doing a combination of arithmetic, branch logic, word manipulation and memory loads/stores in parallel. If, while doing work, a CPU core incurs a cache miss and has to go to main memory, it will stall for hundreds of cycles until the result of that memory request returns. To try to improve things, the CPU will make speculative guesses as to what a memory request will return so it can continue processing. If a second miss occurs the CPU will no longer speculate and will simply wait for the memory request to return, because it cannot typically keep the state for speculative execution beyond 2 cache misses. Managing cache misses is the single largest limitation on scaling the performance of our current generation of CPUs.

Now what does this have to do with managing contention? Well, if two or more threads are using locks to provide mutual exclusion, at best they will be going to the L3 cache, or over a socket interconnect, to access the shared state of the lock using CAS operations. These lock/CAS instructions cost tens of cycles in the best case, when un-contended, plus they cause the CPU's out-of-order execution to be suspended and its load/store buffers to be flushed. At worst, collisions occur and the kernel will need to get involved and put one or more of the threads to sleep until the lock is released. This rescheduling of the blocked thread results in cache pollution. The situation can be even worse when the thread is re-scheduled on another core with a cold cache, resulting in many cache misses.

For highly contended data it is very easy to get into a situation whereby the system spends significantly more time managing contention than doing real work. The table below gives an idea of basic costs for managing contention when the program state is very small and easy to reload from the L2/L3 cache, never mind main memory.

Method                          | Time (ms)
--------------------------------+----------
One Thread                      |       300
One Thread with Memory Barrier  |     4,700
One Thread with CAS             |     5,700
Two Threads with CAS            |    18,000
One Thread with Lock            |    10,000
Two Threads with Lock           |   118,000

This table illustrates the costs of incrementing a 64-bit counter 500 million times using a variety of techniques on a 2.4GHz Westmere processor. I can hear people coming back with "but this is a trivial example and real-world applications are not that contended". This is true, but remember real-world applications have far more state, and what do you think happens to all that state which is warm in cache when the context switch occurs? By measuring the basic cost of contention it is possible to extrapolate the scalability limits of a system which has contention points. As multi-core becomes ever more significant, another approach is required. My last post illustrates the micro level effects of CAS operations on modern CPUs, whereby Sandy Bridge can be worse for CAS and locks.

Single Writer Designs

Now, what if you could design a system whereby any item of data, or resource, is only mutated by a single writer/thread? In my experience it is actually easier than you think. It is OK if multiple threads, or other execution contexts, read the same data. CPUs can broadcast read-only copies of data to other cores via the cache coherency sub-system. This has a cost but it scales very well.

If you have a system that honours this single writer principle then each execution context can spend all its time and resources processing the logic for its purpose, and not waste cycles and resources dealing with the contention problem. You can also scale up without limitation until the hardware is saturated. There is also a really nice benefit when working on architectures, such as x86/x64, whose hardware memory model preserves the order of load/store memory operations: memory barriers are not required if you adhere strictly to the single writer principle. On x86/x64 "loads can be re-ordered with older stores" according to the memory model so memory barriers are required when multiple threads mutate the same data across cores. The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and currently in the store buffer of another core.
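A minimal sketch of what this looks like in practice, assuming one designated writer thread (the class name is mine):

```java
// Sketch of a single-writer counter: exactly one designated thread ever
// calls increment(), so the read-modify-write below needs no lock or CAS.
public class SingleWriterCounter {
    // volatile gives reader threads visibility of the latest value.
    private volatile long value;

    // Must only ever be called from the single owning writer thread.
    public void increment() {
        value = value + 1; // plain load then store: safe only because there is one writer
    }

    // Any number of reader threads may call this concurrently.
    public long get() {
        return value;
    }
}
```

The read-modify-write in increment() is a race if two threads call it, but with exactly one writer it is always correct, and no CAS or lock instruction is ever issued.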

So how can we drive towards single writer designs? I’ve found it is a very natural thing. Consider how humans, or any other autonomous creatures of nature, operate with their model of the world. We all have our own model of the world contained in our own heads, i.e. we have a copy of the world state for our own use. We mutate the state in our heads based on inputs (events/messages) we receive via our senses. As we process these inputs and apply them to our model, we may take action that produces outputs, which others can take as their own inputs. None of us reach directly into each other’s heads and mess with the neurons. If we did, it would be a serious breach of encapsulation! Originally, Object Oriented (OO) design was all about message passing, and somehow along the way we bastardised message passing into method calls and even allowed direct field manipulation – yuk! Whose bright idea was it to allow public access to the fields of an object? You deserve your own special hell.

At university I studied transputers and interesting languages like Occam. I thought very elegant designs appeared by having the nodes collaborate via message passing rather than mutating shared state. I’m sure some of this has inspired the Disruptor. My experience with the Disruptor has shown that it is possible to build systems with one or more orders of magnitude better throughput than locking or contended state based approaches. It also gives much more predictable latency, which stays constant until the hardware is saturated, rather than the traditional J-curve latency profile.

It is interesting to see the emergence of numerous approaches that lend themselves to single writer solutions such as Node.js, Erlang, Actor patterns, and SEDA to name a few. Unfortunately most use queue based implementations underneath, which breaks the single writer principle, whereas the Disruptor strives to separate the concerns so that the single writer principle can be preserved for the common cases.

Now I’m not saying locks and optimistic strategies are bad and should not be used. They are excellent for many problems, for example, bootstrapping a concurrent system or making major state changes to configuration or reference data. However, if the main flow of transactions acts on contended data, and locks or optimistic strategies have to be employed, then scalability is fundamentally limited.

The Principle at Scale

This principle works at all levels of scale. Mandelbrot got this so right. CPU cores are just nodes of execution and the cache system provides message passing for communication. The same patterns apply if the processing node is a server and the communication system is a local network. If a service, in SOA parlance, is the only service that can write to its data store, it can be made to scale and perform much better. Say the underlying data is stored in a database and other services can go directly to that data, without sending a message to the service that owns it: the data is then contended and the database is required to manage the contention and coherence of that data. This prevents the service from caching copies of the data for faster response to clients, and restricts how the data can be sharded. Encapsulation has just been broken at a more macro level when multiple different services write to the same data store.

Summary

If a system is decomposed into components that keep their own relevant state model, without a central shared model, and all communication is achieved via message passing, then you naturally have a system without contention. Such a system obeys the single writer principle if the message passing sub-system is not implemented as queues. If you cannot move straight to a model like this, but are finding scalability issues related to contention, then start by asking the question, “How do I change this code to preserve the Single Writer Principle and thus avoid the contention?”

The Single Writer Principle is that any item of data, or resource, should be owned by a single execution context for all mutations.

About 30-35 minutes into the following presentation I touch on this subject for queues.

http://www.infoq.com/presentations/LMAX

The summary is that the head, tail, and possibly size, have to be concurrently modified. The act of having to add something to a queue is a write operation, as is the act of removing something. Therefore you have multiple writers thus breaking the principle. This is why you need locks or CAS operations for the implementation. Just look at the source of ArrayBlockingQueue or ConcurrentLinkedQueue.
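To make the multiple-writers point concrete, here is a deliberately naive bounded queue (my own sketch, not the actual code of ArrayBlockingQueue). Both offer() and poll() mutate shared state, so producers and consumers are all writers to the same fields, and some lock or CAS is unavoidable in this design:

```java
// A naive bounded queue: note that BOTH offer() and poll() mutate shared
// state (head, tail, count), so producer and consumer threads are all
// writers, breaking the single writer principle.
public class NaiveQueue<E> {
    private final Object[] buffer;
    private int head, tail, count; // mutated by producers AND consumers

    public NaiveQueue(int capacity) {
        buffer = new Object[capacity];
    }

    public synchronized boolean offer(E e) {
        if (count == buffer.length) return false; // full
        buffer[tail] = e;
        tail = (tail + 1) % buffer.length;
        count++;   // the consumer also writes count -> contention
        return true;
    }

    @SuppressWarnings("unchecked")
    public synchronized E poll() {
        if (count == 0) return null; // empty
        E e = (E) buffer[head];
        buffer[head] = null;
        head = (head + 1) % buffer.length;
        count--;   // the producer also writes count -> contention
        return e;
    }
}
```

The synchronized keyword here stands in for whatever lock or CAS scheme a real queue uses; the point is that adding and removing are both write operations on the same data.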

I'm just pointing out that since you're defining the "Single Writer Principle" in this post (which I like), you should probably elaborate on why it should not use queues. Is it even part of the "Single Writer Principle" or just a recommended design? You're alluding to it, but there's no clear explanation.

I actually think the Single Writer Principle has its place even with queues, especially in a distributed system where it eliminates contention on shared state when there are multiple writers updating it. The (relatively) minor overhead of queuing may be an acceptable trade-off in place of, say, distributed locking on a distributed cache.

Sorry if I was unclear here. As you point out, building a design around queues does not necessarily break the single writer principle at the system level. What I'm trying to say is that queues themselves, internally, break the single writer principle. If, for example, you use the Disruptor instead of queues then you can have a full design that avoids having multiple writers to any resource :-)

While I get your points and agree with what you say (which is no news to me), I have no idea *how* you can have an actor/thread receive messages from multiple other ones without a queue. If you have no shared state, you can scale infinitely; no problem here. If you do have a shared state, and a single writer to it, then you need a funnel architecture where N front-ends (because you don't want a single-point-of-failure front-end, and you can't call it "scalable" anyway if you can have only one front-end) send changes/events to this one writer.

I don't remember ever hearing about a pattern that allows multiple senders to one receiver *without* a queue.

Your post feels like "I know the magic solution to this very important problem and it's actually sooo obvious to me that I'm not going to tell you what it is". In other words, it's not a very useful post for someone who is already aware of the problem itself, and actually needs a solution.

So, could you please actually *explain* how you solve the "don't use queues" point, so we don't have to read your source code to grasp the pattern?

The Disruptor is an alternative to queues. It can replace a whole graph of dependencies that could be represented by queues. For some background, rather than reading the code, you can check out the following links:

Technical Paper: http://code.google.com/p/disruptor/downloads/list

Blogs: http://code.google.com/p/disruptor/wiki/BlogsAndArticles

Martin Fowler overview of the Disruptor in context: http://martinfowler.com/articles/lmax.html

Video with Q&A: http://www.infoq.com/presentations/LMAX

To answer your "multiple front end" question. At least three approaches can be taken:

1. Put the front ends on separate machines. This can be good for protocol translation and border security anyway. Then forward requests to an HA cluster of machines with the single thread for the state mutation. This needs to be asynchronous to scale.

2. If the threads are on the same machine, configure the Disruptor with the MultiThreadedClaimStrategy, which minimises contention and can be an order of magnitude faster than queue based alternatives.

3. If the threads are on the same machine with massive contention, use one Disruptor instance for each producer/publisher thread and then have a multiplexer thread combining their traffic and publishing it on to the single business logic thread via another Disruptor instance. This solution can be extended and federated.

Thank you! I wasn't expecting such a quick response! I went and found the article of Martin Fowler myself after reading your post.

In short, what I missed from your post is that a Disruptor is a "fixed-size ring-buffer where each entry field can be written by a single thread" (or at least that is what I understood). Just one more little sentence would have made things much clearer, since most people don't know what a Disruptor is.

Internally, performance relies on:

- One writer for any location at any time
- Pre-allocation of slots

Would you agree that the disruptor implements a queue in the sense that there is some buffering of work items between producer and consumer(s)? In that sense a disruptor between a single producer and single consumer looks very much like a queue from the outside.

Further, this pattern seems similar to some queue designs where producers write a queue tail pointer and a consumer writes a queue head pointer. The extension here is to allow multiple consumers without contending on head writes, by each consumer maintaining its own head pointer. The interesting part is where the producer can only move forward (making space at the tail) if all the consumers' head pointers have already moved forward. Perhaps it's non-intuitive that having the producer continuously perform a global-min over the consumers' head pointers is more efficient than maintaining a contended tail pointer - do you have any numbers to quantify the cost to the producer of maintaining the 'safe' queue head with N consumers?
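The head/tail design described in the comment above can be sketched as a single-producer/single-consumer ring buffer in which each index has exactly one writer (an illustrative sketch of the idea, not the Disruptor's actual code; capacity is assumed to be a power of two):

```java
import java.util.concurrent.atomic.AtomicLong;

// SPSC ring buffer: the producer is the only writer of tail and the
// consumer is the only writer of head, so each index honours the
// single writer principle; neither operation needs a lock or CAS.
public class SpscRingBuffer<E> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // written by consumer only
    private final AtomicLong tail = new AtomicLong(); // written by producer only

    public SpscRingBuffer(int capacityPowerOfTwo) {
        buffer = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Called only by the producer thread.
    public boolean offer(E e) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = e;
        tail.lazySet(t + 1); // publish: ordered store, no full fence needed
        return true;
    }

    // Called only by the consumer thread.
    @SuppressWarnings("unchecked")
    public E poll() {
        long h = head.get();
        if (h == tail.get()) return null; // empty
        E e = (E) buffer[(int) (h & mask)];
        head.lazySet(h + 1); // free the slot for the producer
        return e;
    }
}
```

Because each counter is mutated by exactly one thread, lazySet (a StoreStore-ordered write) is sufficient for publication; the only cross-thread traffic is reading the other side's counter.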

I definitely agree about the benefits of avoiding locks, batching requests and keeping cache-warm-threads busy. I think one of the main problems with understanding these benefits are the lack of numbers.

I agree that in some ways you could say it is a better queue. Many network cards have similar internal implementations, so this is not new. Hardware tends to be better designed than software in my experience. One of the other benefits is the organisation of dependent consumers into a graph, which can be useful. This dependency graph has nothing to do with the Single Writer Principle.

It should also be noted that the concerns are well separated in the Disruptor thus allowing multiple or single producers, and the best performance possible given each choice. This is independent of how you organise the graph of consumers.

As to lack of numbers. The Disruptor project comes with performance tests which illustrate how it performs vs traditional queue approaches. Check out the source and run them for yourself.

The Disruptor does offer FIFO semantics but it also allows the configuration of a dependency graph that works like static actors. Please check out the performance tests for examples.

A spin lock provides mutual exclusion by spinning while waiting to enter the critical section. The Disruptor has an optional multi-threaded claim strategy for a slot that uses a CAS, but not for a critical section. Slots, once claimed, can be used in parallel without being in a mutually exclusive critical section. This is similar to many lock-free algorithms that do not use locks (spinning or not).

"On x86/x64 "loads can be re-ordered with older stores" according to the memory model so memory barriers are required when multiple threads mutate the same data across cores. The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and currently in the store buffer of another core."

Even with a single writer, because "loads can be re-ordered with older stores", could a reader see out-of-date state?

Other threads do not see stale data because this rule only applies to the thread issuing the loads and stores. If it reads what it just stored, then it sees what is in its store buffer. The re-ordering only applies to stores to other locations.

The rule of the memory model is only an issue for algorithms like the Dekker or Peterson locks. This is where intent is expressed with a store then immediately followed by a load to read another intent. For this scenario it can appear that the subsequent load is re-ordered with an older store to a different location, thus breaking sequential consistency. For such operations a CAS approach is a better technique on modern processors.

In the second case Thread B might not see A's store due to timing issues. This would be the same regardless of store buffer implementation.

The more interesting, but seldom required, version is:

Thread A: store $address_one
          [fence]
          load  $address_two

Thread B: store $address_two
          [fence]
          load  $address_one

For each thread it could appear that the load got re-ordered with the older store, due to the stores working their way out of the store buffer. The fences are required if you desire sequential consistency in this case.

There is no additional delay to when the value stored by B becomes visible, whether a fence is used or not. The store buffer will drain as fast as it can. A soft fence, e.g. lazySet/putOrdered in Java, is required to instruct the compiler to ensure StoreStore ordering.
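As an illustration of lazySet for ordered publication (my own sketch, names are hypothetical): the StoreStore ordering guarantees the data store becomes visible no later than the sequence store, so a reader that sees the sequence can safely read the data.

```java
import java.util.concurrent.atomic.AtomicLong;

// lazySet provides StoreStore ordering: the data store in publish()
// cannot be re-ordered after the sequence store, yet no expensive
// full fence is issued, unlike a volatile set().
public class OrderedPublisher {
    private final long[] data = new long[1024];
    private final AtomicLong published = new AtomicLong(-1); // single writer

    // Called only from the single writer thread.
    public void publish(int index, long value) {
        data[index] = value;      // 1. write the data
        published.lazySet(index); // 2. then publish the index (ordered store)
    }

    // Readers spin until the index is published, then read the data.
    public long read(int index) {
        while (published.get() < index) {
            Thread.onSpinWait(); // Java 9+; a plain spin works on older JVMs
        }
        return data[index];
    }
}
```

Because there is only one writer of both the array slot and the sequence, no CAS is needed anywhere in the publish path.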

I appreciate this good post. I have a question though. I don't understand the SOA example. If a service is the owner of the data and all modification of the data should be done through the service, isn't it just passing the problem to the service? Don't we still have the same problem? A DB would do the same: the first request is served first in a "queue" model, which then invalidates any cached result (the DB can cache the result of a query). A service would have to implement something like that. We would be applying the Single Writer Principle because the DB is the only process that can touch the actual data on disk. I know that having one service as the owner of the data is an encapsulation principle with certain benefits, but I don't see the "concurrency" benefit in it, only the maintainability and flexibility benefits.

There are many concurrency benefits to fronting a database with a service, beyond caching. The service can deal with higher level aggregate operations, rather than row level operations, in its transactions. If the service is asynchronous then these aggregates can be smart batched into the database without competing transactions.

The database is just the raw data. The service provides the higher level semantics, just as a component provides higher level semantics over raw memory access to bytes. All concurrency has a granularity. We want to queue up updates to a single mutator and not have a free-for-all from any code wanting to mutate the raw data.

How far can the definition of a "single writer principle" be taken before you consider the principle is broken?

Would the use of a ConcurrentHashMap in a scenario where you have a single writer, always the same thread, but multiple readers be sufficient for compliance?

The map uses locks internally on writes, but is that enough to break the principle? If it's always the same thread writing, I would think the JVM would do a good job of deflating those locks, but again, it's not clear from the article whether you consider controlled CAS operations out of bounds.

Would the CAS operations used by the map on read be bad enough to again break the principle?