As a reminder of what I’m talking about, here’s the picture from Mark:

What I’m going to be talking about is section #1 from Mark’s picture, and in particular, I’m going to go over a number of concepts, including:

The Reporting Store

Eventual Consistency

You don’t need your Domain

Once you have one, why not more?

“Reporting” doesn’t just mean Reporting

One of the first things that I found difficult when learning about CQRS was the use of the term “Reporting.” Because I come from a SQL background, when I hear the term “Reporting” in IT contexts, I think about reports, e.g. last month’s sales report. Given how the term is normally used, I’m not sure whether there is a different word that should be used here. However, especially after all of the stuff I just wrote about traditional reporting, a couple of things are starting to make more sense to me, since it turns out that some of the concepts of CQRS are simply refinements of things we’ve been doing already.

The first thing to keep in mind is this:

The Reporting Store (as separated from the Event Store, which is up in section #3 of the picture) is a logical concept, and as such does not have to be physically separated. Having said that, however, it probably will be.

Most Microsoft shops are already familiar with the idea of a Reporting Store, as they probably already have one in one form or another, be it a replicated version of their main database, or an OLAP store using SSAS (or perhaps some other tool). In a traditional shop, this is how traditional reporting tends to be done. You generate/run your reports off of the Reporting Store, which means you don’t tax your main database to do so.

There is nothing particularly CQRS-y about this, but once you accept that you have a separate Reporting Store, with enough ingenuity sparked by genuine need, you start to think about different ways of using it. Back in the dot-com heyday, a common problem involved how to generate/cache the storefront, so that you didn’t have to hit the database on every page. This is, obviously, still a common task, but there are a lot more ways to implement it now than there were then.

We chose to generate our site, and generate it off of the replica database. Basically, we would create the HTML once on special generation servers for every page, and then use MSMQ to push them out to the web farm (there were products that implemented this and caching, but some of them ran six figures IIRC). There’s nothing magic or special about this, of course, but what I’m hoping to convey is that the notion of a Reporting Store within CQRS isn’t magical or special either. In fact, if I had been able to tie my previous experience with replication with the idea of what a Reporting Store was, I think it would have been easier to learn.

Does this mean that using SQL Server Replication is the same thing as implementing CQRS? Of course not. For one thing, it doesn’t really matter to set up a litmus test of what counts as ‘really’ implementing CQRS, but if there were one, there would be differences. As I’ll try to explain in a bit, the theory behind CQRS provides a general benefit, a theoretical construct, that is of value in itself, and goes beyond a particular technical implementation like Replication.

Let me emphasize:

I am *not* saying that SQL Server Replication is the same thing as CQRS. It is a technology that *could* be used as part of an implementation of CQRS, but it is a separate thing.

Some of the *concepts* of CQRS are similar to concepts we have been using for quite a while, such as pulling data from a Reporting Store that is separate from the main database/Event Store.

When is it okay to use something like Replication? Finding an answer to that can provide us with the beneficial theoretical construct I just mentioned, and answering that question depends on understanding and applying the notion of Eventual Consistency.

It will get there eventually

If you go to a business user who works with, e.g., the rolling last 30 day sales report, and ask them if it is okay to use stale data, they probably won’t give you an affirmative answer. But if you ask them essentially the same question in a different way, they probably will.

Suppose they have a morning status meeting with the head of the Marketing Department to go over the rolling last 30 day sales report, and, as always happens, they print this out in multiple copies (the myth of the paperless office is why I don’t believe in any ‘revolutionary’ movements like NoSQL, but I digress) and take it to the meeting. Suppose replication failed at 10 PM EST the previous night, so whatever sales occurred in that small timeframe are missed. Does this invalidate the report?

The answer you will get is something along the lines of “Not really.” In most cases, there aren’t enough sales in that small overnight timeframe for it to matter much, though you may hear “let me know when replication has caught up so that I can regenerate the report, just to be sure.”

That is the gist of Eventual Consistency. It is acceptable for there to be a gap between the ‘real time’ data, and the data that is viewed in some other context. Once you find out that it is acceptable for there to be a gap, then the next step is to find out how big of a gap is acceptable.

Suppose the business user is looking at the current day sales report. If he is looking at it at 2 PM EST, and replication has been down for 4 hours, that might then be unacceptable. ‘Eventual’ doesn’t mean that next week is acceptable. But suppose the business user prints out his report for his daily afternoon status meeting. Between the time he prints out his report and when it is viewed by the head of Marketing, there may have been additional sales. That is acceptable. It is accepted that from the time the report is printed and when it is looked at 15 minutes later, it might be slightly outdated.

Once you have established that it is okay for there to be a gap between the actual “this is the value at this exact moment” data and what is viewed by the end ‘user’ (the ‘user’ could actually be another system), then you can start to think of ways of using your Reporting Store for other things.

Whether these ways are valid will depend on the context. What does your business do, who are your end users, what data will they commonly be seeing, and how will they be acting on it?

In my mind, CQRS offers a theoretical construct to help us here:

Anything that doesn’t involve a command is a prime candidate for acceptable Eventual Consistency. Anything that involves a command may be a candidate, if the result of the command doesn’t need to be immediately evident.

Now, this should be considered only as a starting point, and it certainly doesn’t answer the contextual questions that it needs to answer, but it can help.

For instance, consider the difference between an order review page and a product list page. When a customer presses submit on the order review page, it is probably the case that you don’t want to immediately show an order confirmation page without knowing the order went through (though you *might* want to). On the other hand, when a customer goes to a product list page, it probably doesn’t matter if the page is a few minutes old (though it *might* matter).

You don’t absolutely *have* to implement Eventual Consistency to consider having a separate Reporting Store to be beneficial.

This second point will become more evident when talking about section #4 of Mark’s picture (though I will touch on some of it below), but a brief note is important here. In a thread on the DDD mailing list, Greg has emphasized that you should start off without Eventual Consistency, and then work your way towards it as the need arises. This is common sense, and could be considered a simple application of YAGNI (though YAGNI is unfortunately too often just used as an excuse). Once you appreciate the concept of Eventual Consistency, it’s an easy temptation to think of all the places where you could possibly implement it without a clear understanding of the drawbacks (and there are always drawbacks).

When you query your Reporting Store, you can ignore your domain

More specifically, if you need to query your reporting store, you don’t need to go through your domain model, and as a matter of fact, it would probably be a bad idea to do so. To paraphrase a comment from Udi, why should data come across 5 layers through 3 model transformations when all you need to do is populate a screen?

Typically, our end user screens will contain information from multiple entities. A typical pattern is to find the parent entity you need, load all of the relevant child entities, and then pass that entity into a mapper, which then produces a DTO with a flattened representation, which is then passed back to your screen and bound to it somehow.

Skip it. Query your reporting store and get a DTO with that flattened representation immediately.

Does this mean you should start rooting around in your code and eliminate any reference you have to AutoMapper? Of course not. But once you start to think of how you can skip going through your Domain Model for queries, some other options open up:

Put a stored procedure on top of your Reporting Store to return a ViewModel per query.

Transform the data from your main database/Event Store to your Reporting Store so that you have a table per ViewModel that you can do a simple select from.

Query off of an OLAP store to do the same.

And so on and so forth. The possibilities aren’t endless, and none of them should be pursued without thought, but it does open up a different avenue for you.
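To make the ‘table per ViewModel’ option concrete, here is a minimal sketch using SQLite as a stand-in Reporting Store. All table and column names here are hypothetical, and a background process (not shown) would keep the denormalized table in sync with the main database/Event Store:

```python
import sqlite3

# A flattened table shaped exactly like the screen that reads it:
# no joins, no domain model, no mapper.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row

conn.execute("""
    CREATE TABLE product_list_view (
        product_id   INTEGER PRIMARY KEY,
        display_name TEXT,
        price        REAL,
        in_stock     INTEGER
    )
""")
conn.execute("INSERT INTO product_list_view VALUES (1, 'Widget', 9.99, 1)")

def get_product_list_view(conn):
    """The whole query layer for this screen: one simple select."""
    rows = conn.execute("SELECT * FROM product_list_view ORDER BY display_name")
    return [dict(r) for r in rows]

dtos = get_product_list_view(conn)
```

The point isn’t the particular storage engine; it’s that the query side returns a screen-shaped DTO directly, with nothing in between.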

Can you have more than one Reporting Store?

Once you start to think about how to use a Reporting Store with an eye towards Eventual Consistency, even more possibilities open up.

To go back to the dot.com example I gave previously, we used MSMQ to push individual page updates across to our entire web farm. It was, given the day and our abilities, a bit crude. At times, a particular server might process individual pages more slowly than others. From an operational perspective, it worked well enough that we lived with it. A monitoring server could notice that a particular web server was slow, and pull it out of active duty. But for the most part, on most days, almost any updated page would hit each server at about the same time.

To think of a possible CQRS implementation of the same idea, why not have a Reporting Store on each web server that subscribes to events being published out of your domain? Going back to the simplistic product list page example I mentioned previously, imagine having a SQL Server Express instance on each web server which could process those events. If it is acceptable in the context of your environment to have Eventual Consistency here, and if you have a robust enough environment to be able to process these events evenly (and if it was robust enough for 1998, it surely can be today with more advanced service bus technology), then this opens up an avenue for immediate horizontal scalability of your ‘query-facing’ infrastructure. As your traffic increases, add another web server with its own Reporting Store. If you have a limited number of processes that utilize Eventual Consistency (think back to Greg’s emphasis on starting slow here), then you have a limited number of events that are subscribed to by a larger and larger number of machines.
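As a toy illustration of that idea (all class and event names are hypothetical, with in-memory dicts standing in for the SQL Server Express instances and a real service bus), each web server’s local read store simply subscribes to events published out of the domain:

```python
class Bus:
    """Stand-in for a service bus with publish/subscribe semantics."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        for handler in self.subscribers:
            handler(event)

class LocalReportingStore:
    """Stand-in for the Reporting Store living on one web server."""
    def __init__(self):
        self.products = {}

    def handle(self, event):
        if event["type"] == "ProductPriceChanged":
            self.products[event["product_id"]] = event["new_price"]

bus = Bus()
web_farm = [LocalReportingStore() for _ in range(3)]
for store in web_farm:
    bus.subscribe(store.handle)

# The domain publishes one event; every server's read store catches up.
bus.publish({"type": "ProductPriceChanged", "product_id": 42, "new_price": 19.99})
```

Adding capacity is then just adding another `LocalReportingStore` subscriber; no query ever touches the main database.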

If you think about it, this is a different way of achieving the caching that you might already be doing today, but from a central architectural perspective. Once again, I don’t think anyone who has either used or considered CQRS is suggesting you start to rip and replace memcached or Velocity here. But you might think about ways to fit memcached or Velocity into CQRS, because it offers a general, scalable architectural set of patterns.

When and why might you not want to do any of this?

I personally find the notion of Eventual Consistency, and that of a separate Query layer that skips your Domain Model, to be compelling. It ties together concepts that I was already familiar with into a general architectural model. Having said that, those concepts had drawbacks, and CQRS doesn’t magically solve them.

From previous posts that I’ve made, a familiar reader will know that I have pretty extensive experience with Operations, and all that might entail. In particular, I believe pretty strongly in what I’m going to vaguely call here ‘planning for expected catastrophe.’

Start with SQL Server Replication as a base technology: it sometimes fails. Sometimes it fails easily (an agent stops processing transactions, which merely requires a ‘right-click restart,’ and might only take a few minutes to fix if your monitoring is good), and sometimes it fails hard (the entire replica has become invalid, and must be recreated from scratch, which can take hours to accomplish).

Even though the technology of today is light-years ahead of what we had even 10 years ago, planning for ‘failing hard’ is still something that I think has to be central to planning software. If your Reporting Store is suddenly unavailable, what can you do? What we did with our relatively crude system was build in a switch (more or less) that let us immediately go back to processing off of our main database. We would still generate the site if we could, but even there, we had an emergency ‘oh, good Lord’ switch that would allow our site to skip generation altogether, and hope we had enough hardware to weather the load until we could fix the Reporting Store. Obviously, if both the Reporting Store and the main database went down and our off-site log shipping failed… well, at that point we might be polishing off resumes anyway. Some catastrophes can’t be recovered from.

Another more basic reason why you might not want to do any of this is because it does require a certain amount of sophistication and a probably larger amount of faith that it will work. I don’t think a lot of advanced developers will be turned off here, and a good case study of how this works out in actual practice is in the story of MySpace. The architecture there was built under certain assumptions, and then once the limit was hit, the architecture was rebuilt. Something like CQRS, in my opinion, gives you a built-in scalability potential, but it isn’t a panacea.

Even if you choose to embrace Eventual Consistency and building a Query layer, there is another thing to keep in mind. Look at the picture again:

Pay attention to that little line from the thin data layer in section #1 that points back to the services box in section #4. When push comes to shove, sometimes it is okay to default back to calling into your Domain. If you took Greg’s cautionary message to heart, you could start off by building a Query layer that does almost the opposite of what I’ve been describing. All queries ignore the Reporting Store unless and until the Reporting Store ‘proves itself’ within your context, and then you start pointing them there accordingly. Given my experience, you should probably never need to go to this extreme, but it is there for you if you need it.
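That ‘start conservative, then point queries at the Reporting Store as it proves itself’ approach can be sketched as a simple routing switch. This is a toy illustration with hypothetical names, where two callables stand in for the domain path and the Reporting Store path:

```python
class QueryRouter:
    """Routes reads to the domain by default; flips to the Reporting
    Store once it has 'proved itself', and can flip back if it fails hard."""

    def __init__(self, domain_query, reporting_query):
        self.domain_query = domain_query
        self.reporting_query = reporting_query
        self.use_reporting_store = False  # start conservative

    def query(self, *args):
        if self.use_reporting_store:
            return self.reporting_query(*args)
        return self.domain_query(*args)

router = QueryRouter(
    domain_query=lambda pid: {"id": pid, "source": "domain"},
    reporting_query=lambda pid: {"id": pid, "source": "reporting"},
)
before = router.query(1)            # hits the domain
router.use_reporting_store = True   # flipped once the store proves itself
after = router.query(1)             # now hits the Reporting Store
```

The same switch, pointed the other way, is the emergency fallback described above.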

Why you should consider doing at least some of this anyway

Suppose you have an application that you hope needs to scale, but you don’t know that you need it today. What we did in the past, and what e.g. MySpace did, was build to the scalability you knew you needed today, and then when you hit that limit, you punted. Though I don’t think I’ve done as good a job as I could have, by any stretch of the imagination, I think that CQRS offers an architecture that lets you build to the scalability you have today, and then easily expand it. Your query layer can hit a Reporting Store that, as implemented, simply is your main database/Event Store. You can code your code and architect your architecture as if all of these pieces were physically separate, since you only need to worry about the separation at the logical level.

At a fundamental level, building a CQRS-style query layer allows you to logically segment your code between queries and commands. Which leads to the command layer, the topic of the next post in this series.

One question I would like to get a good take on is how to handle the situation of issuing a command that needs an immediate response back to our user. For instance, say we have an update screen and our user attempts to save. This generates a command, but after the command completes or fails we need a way to update the status for our user.

If the UI is built off the reporting store, and seeing that most CQRS examples are using a bus for pushing events to the reporting store, we don't really know for certain when the event generated by the domain has been accounted for. Our user wouldn't really expect their screen to eventually be correct in this case, because they just updated it.

It seems that in this case you could instead build the screen to be "reporting" off of the domain, but that feels like compromising the basic architectural goal of CQRS in the first place.

I followed a simplified CQRS model on my last project, but using one database. My viewModels were mapped to database views over my domain tables, with an eye to eventually moving the views to tables in another database. I found the greatest benefit of CQRS is separating out how one thinks about the paths through the system (write-only/read-only). They can be worked out separately.

@Justin

Udi has offered one way to handle the case where you need to update the screen to show the user the consequence of their command. He thinks that in many cases it's ok to, um, tell the user what he/she wants to hear. Basically, you update the screen with the changes to satisfy the user. If another user is reporting on that data and it hasn't made it into their reporting store, then the other user will still see the old data. In most cases, this isn't a problem. In the cases where it is a problem...well, I'm not sure -- maybe you'd have to go against the domain directly.

It's also not a problem because the success of the command is transactional. The possibility of failure is minimized by front-loading command validation, so you can pretty much guarantee the message's validity. Then, your command is executed against the business rules, which are contextual. If the system state makes your command invalid (e.g., because of a deleted entity), it is caught by the rules, in the transaction, and you can notify the user. So if the command is successful, you can slip in the predicted view for just the user that issued it. If that feels weird, I agree. What happens if they refresh their view???
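The flow I'm describing might be sketched like this (toy code, hypothetical names): validate the command up front, run the business rules in the 'transaction', and hand the predicted view back only to the issuing user:

```python
class CommandRejected(Exception):
    pass

def validate(command):
    # Front-loaded validation: catch malformed commands before dispatch.
    if not command.get("customer_id"):
        raise CommandRejected("missing customer_id")

def execute(command, state):
    # Contextual business rules, checked inside the 'transaction'.
    if command["customer_id"] not in state["customers"]:
        raise CommandRejected("customer no longer exists")
    state["orders"].append(command)

def handle(command, state):
    validate(command)
    execute(command, state)
    # Success: slip the predicted view back to just this user, even
    # though the shared reporting store may not have caught up yet.
    return {"status": "accepted", "order": command}

state = {"customers": {7}, "orders": []}
result = handle({"customer_id": 7, "sku": "ABC"}, state)
```

If either check raises, you notify the user instead of showing the predicted view; only on success do they see the optimistic result.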

The example Udi used was a ticket selling system, I'll call it TicketMistress. The objections to eventual consistency were that users would be seeing tickets that weren't really available (because their reporting stores hadn't been updated). TicketMistress would provide a bad user experience because you'd try to book a seat and, BAM!, no dice, try again. Udi's solution was to design around the problem so that TicketMistress would offer selection criteria (price, number in party, etc...), then find you the seats that were available. It's good to step back and ask, "what are we really trying to accomplish here?" If we accept eventual consistency, we get very cost-effective scaling built in.

I was thinking something along the lines you suggested. If the command works, you show the user what will eventually show up in the reporting store. Or maybe it's conceptually an issue, but the user of the system really doesn't have that big a problem with it as long as they understand that the refresh isn't instantaneous.

For example, I know my banking website uses these concepts, because I can transfer funds between accounts and the balances on the summary page will be incorrect for a little bit. If I go directly to an account, though, I am getting an operational type of feed, because everything is correct there. They also give me an "Update Balances" button on the summary page to fix the problem if I care about it. Most of the time I don't, just because I know it's not really an issue.

Still, I'm interested in the real world examples of people solving or working around this idea we've taken for granted as required.

The point you raise is a good one. One of the things that I am still unclear on is a seeming paradox with CQRS: The times when I might need the benefits of horizontal scalability are precisely the times when Eventual Consistency seems most problematic.

I'll talk about this more in the next post, but a quick real-world example that I'll probably repeat in the post: at one point, I worked on a team that supported the eCommerce store for the NBA. So, besides the Christmas shopping rush, our biggest scalability moment was when a team clinched the Championship (the year the Lakers beat the Nets is the one I am thinking of in particular).

At all other moments, Eventual Consistency wasn't an issue, because the Reporting Store would get updated 'immediately enough', if you know what I mean. But when the title was clinched, that was the moment when we needed the scalability most of all, and it was also the same moment that the Reporting Store was most likely to be noticeably lagging behind the Event Store. And it was obviously also the moment when we would least want to revert to querying the Domain to eliminate the lag.

I have thoughts about what a good answer to this 'paradox' might be, but I have to think more about it.

@Peter

Thank you for your comment. I had not heard of this example from Udi before. It is a very good explanation of the 'tell the client what they want to hear' idea, which I have read about but which never quite clicked for me.