Transitioning to Event Sourcing, part 9: additional tools and patterns

I will pass briefly on topics of interest and additional features a CQRS / Event Sourcing infrastructure could implement. Each of those topics would probably deserve their own post or even a book. If you need more information, there are a lot of material on theses subjects in discussion groups. Greg Young will also soon release an entire book on these subjects. So here are brief introductions:

Snapshots

Obviously, pure event sourcing like we implemented in part 8 has its limitations. It is fine for aggregates with few thousands events, but after that, we need to implement snapshoting.
The idea is to allow the aggregate to create a snapshot. When loading an aggregate, we only fetch a snapshot and the events that occurred since the snapshotting. Snapshots are stored in a table that is containing both the latest version of the aggregate, the snapshot data, and the snapshot version.
You should delegate the snapshoting process to a standalone process. The application itself do not have to manage snapshot creation. The event store will just “consume” snapshots created by the snapshotter. The snapshotter just need to query the snapshot table for aggregates that had, say, 100 commits or more since their last snapshot. It can simply load the aggregate in question and ask it for a new snapshot.

Versioning

What happen when you need to change an event? In most cases, you don’t want to change the events already committed in the event store. You simply create a new event version (for example CustomerAddressCorrectedv2), and make the aggregate consume this event instead of the old one. You then inserts a converter at the output of the event store that will convert the old ones to the new ones before passing it to the aggregate constructor.
How do you convert an old event to a new one? Let’s say you have a new field in the event. You will have to take the same kind of decision as when you insert a new column in a table: you have to calculate a default value for that new field.
What happen when this is completely impossible? As explained by Greg Young, this probably means that the new event is not a new version of the same type of event, but actually a new event type. In that case, no upconversion is possible.

View model

Now that you are not tied to the transactional model for your views, you have much more liberty. In particular, why not store directly the object you will need to display in a page instead of trying to do a cumbersome paradigm shift from and to a relational database? Databases that are storing unrelated object trees are called document databases. On .NET, I would highly recommend RavenDb. On other platforms, there are numerous choices like MongoDb. Some of them are even producing JSON output, so you could transfer the object directly to your AJAX application, without any transformation taking place in your application layer.

Smart merging strategies

So far, our concurrency strategy has been quite dumb: we retry the entire command when we detect a conflict. And this is fine for most of the applications. But if we need, we could also get the events that have been committed in between and look at them. We may be able to commit the events even if the command has been raced by an other one. That is, if the other events are not incompatible with our commit. Since events are capturing a detailed and very high meaning of what happened, this is usually simple to creates compatibility rules between pair of events. A good starting rule is, if there are one instance of the same kind of event in the 2 commits, we consider that as a real concurrency issue. But we can add smarter rules that actually look at the content of events.

This is particularly useful for highly collaborative applications. But it also opens to additional possibilities in deconnected systems. This is the case in the domain of mining prospection or in the police. For those cases, the client can embark a copy of the event store, and work from it. When the client reconnects to the server later, the commits are merged. Only conflicting events are merged manually by the user. By the way, this is how most of the relational database systems are implementing replication, through their transaction log.

Service integration

This is a big topic. I will here only scratch the surface. An immense amount of money is spent worldwide integrating services within companies and across companies. There are multiple reason why integration is hard. Most of the time, applications are not designed to be integrated, and are implemented as if there we an isolated component. Fortunately, with the trend in SOA, integration becomes more of a concern now. An other problem are that most of today’s integration technologies are synchronous by default: REST, web services, CORBA, RMI, WCF and so on. All this synchronous communication does not scale. First, network latency kicks in. So if a service needs to call an other service that needs to call an other service that… You get the idea. And then, the more components you add to the chain, the more likely you will get a failure along the way. And I am not talking about the maintenance nightmare that such a web of interconnected services are producing.

Event sourcing is allowing for a very clean approach. Combined with publish / subscribe, applications becomes disconnected in time, and decoupled in responsibilities. If other applications need to integrate with yours, just ask them to subscribe to your event stream. You can publish your event stream with components called Service Buses, that precisely allow that. Once you are publishing to a service bus, you don’t even need to be aware that a new application is now integrated with yours. All that application has to do is to download the history of your event stream, subscribes to the stream, and then derive whatever information they need from it, or react to events in your application in very little time:

This is solving the “always up” issue. If one or the other application goes down, the events can be queued in the service bus until the consumer system goes back up.
If you choose a serialization format carefully, it also solves most of the maintainability issues. The source application can create new types of event without impacting the consumer applications. Since the applications are responsible for projecting what they need from the event stream, they will be responsible for taking into account (or not) the new types of event. Finally, consumer applications will not need new features or data from your application, since the event stream already includes everything that ever happened in your system.

Eventual consistency

Eventual consistency is when there is a delay between the commit of the aggregate and the corresponding refresh of your view model. Or more precisely, when those 2 processes happen in different transactions. Concretely, you only commit events at commit time, and a different process is picking them up to execute the view model handlers. If you already have a service bus (see “Service integration”), you can use it. And indeed, the view model can be viewed as a separate service, almost a different bounded context if you will.

However, most of the applications don’t need such a separation. Until you need it, you should commit the changes to your view model along with the events.

When do you need to split things that way? One reason is scalability, and we will talk about this one later. One other factor could be performance. Your view model might be huge, or have parts of it that are slow to update. You may want to isolate the update of those parts in a separate process, in order to keep a good performance when executing commands.

Scaling the query side

Once you have eventual consistency, it is quite trivial to scale the read side. This is good news, because most applications have usually a lot more reads than writes. All you have to do is to put a copy of your view model on each web node. Each of those view models would subscribe to the event stream:

That way, you have an almost linear scalability of your read side: just add more nodes. The other benefit is the gained performance due to the local access to the view model. You don’t need a network call anymore because your web layer sits on the same node as its view model.

Now, if you want to scale at very large levels, your problem is reduced to scaling the publish / subscribe middle ware.

Scaling the write side

For scaling this one, you will first need to process commands asynchronously. The UI would need to send the command and not wait for the result. This usually triggers a very different kind of user experience. Think of it as ordering something on Amazon: what you get when you are clicking on the “submit” button is not a success or failure statement, but something more along the lines of “Your order has been taken into account. You will receive a confirmation email shortly”.

Once you are sending commands to a queue, you can have competing nodes treating those commands. But that is only scaling the business logic treatment, not the event storage. Luckily, by its structure, it is also quite simple to shard events by their aggregate id. Since commands contains the aggregate id, you can route your commands to the right queue:

Great series of post, but I think you sort of missing one part. Transitioning to event sourcing is not that hard if you don’t consider the existing data. What would have changed if the existing database was a data with lots of data? How would that data be considered when transitioning to event sourcing since that data is the base of many of the aggregates. Do you keep the old database as a baseline for the event store? Or do you have a migration phase where you migrate the existing database creating events from it and inject in the existing database, not that those events are probably events that will only be used in this scenario.

Thank you for your comment. This is a good point. You will need some data migration at deployment time. And nothing magical here, this will require effort, but not more than other kinds of data migrations. It however do not need to happen all at once for the entire database. You can migrate one kind of aggregate at a time.You will need to generate events for this kind of aggregate. It is easier if you can have a down time so you can migrate offline, but there are other ways. For example, you can deploy a version of the application that will generate events on the fly from the legacy DB each time it is accessing an aggregate instance for the first time.
As for the offline migration scenario, generating events is usually quite fast. On a single event store machine, I used to see 3000 event commits written per second with naive implementations, but some products are claiming to be faster, for example Greg Young’s Event Store. Since you will want to group events for an aggregate instance in a single commit, you can migrate approximately 10M aggregates per hour on a single machine with the naive implementation.
There are a lot of situations where migrating to event sourcing might be an option worth considering, given its tremendous benefits. But of course, careful planing and cost analysis should always be performed upfront.

Just some comments about the framework if you ever want to improve it.

I post this for others that might be tempted to use the framework here.

1 ) Castle Windsor implementation : you need to readjust the lifecyle of the components properly. Having components on demand and hacking via do not track is not the best option (although it still works in principle). The newer versions of Windsor have better helpers to help you out.
2 ) DTOMapper : There are some special cases where nullref would fire, based on the nVentive.Umbrella dependency. You might want to remove it and implement the thing with classic reflection. I did the change and it took me 5 minutes with no issue at all. There are no so many references in the whole solution.
3 ) ControllerController factory : The getControllerInstance does not handle the 404s properly. Here is my implementation, which stills inject the dependencies on your controller classes in a MVC project (sorry for the bad formatting). This this implementation, you get a 404 instead of a 500 :

Hi, I know that’a a quite old post but I have just a question about the read side scaling :
– how to be sure that the different nodes remains synchronized?
=> May be some view model could be malfomed for any reason and then , we should be notified of the error and the nodes have to be re-synchronized …

Hi Dypso, great question. There are many ways to ensure this. First of all, you can do periodic drop then rebuild of the views. You can optimize this process by having a process doing it in the background, and dropping a snapshot somewhere to be used by reconstructing views. When you do reconstruct a view depends. As said above, it can be periodic, or you can have anti-entropy reports comparing the view instances, and triggering the process if discrepancies are detected.