Saturday, November 22, 2008

A fully self-contained message is a pure and complete representation of a specific event and can be published and archived as such. The message can be interpreted as that event, immediately and in the future, without relying on additional data stores that would have to be kept in time-sync with the event during message processing.

Some people disagree with me that it is good practice to strive for fully contained messages in an Event-Driven Architecture. They advocate passing references to data that is stored elsewhere as being strong design. Let me explain why passing references is not suitable as an architectural principle and should even be regarded as an anti-pattern in EDA.

First of all, I think everyone agrees with me that SOA and EDA strive for loose coupling. Striving for loose coupling by definition means minimizing dependencies. In SOA the services layer acts as an abstraction layer of implementation technologies. In EDA loose coupling is pulled further upwards to the functional level of interacting business processes.

Passing reference data in a message makes the message-consuming systems dependent on the knowledge and availability of actual persistent data that is stored “somewhere”. This data must be accessed separately in order to understand the event that is represented by the message. Even more: this data must represent the state at the time the event took place, which is not (exactly) the time the message is being processed. The longer the processing is deferred, the harder it is to achieve this time-sync. E.g. think of processing archives on behalf of business intelligence or compliance reports. How would you manage to keep the referenced data available in the state (and structure) of the moment the event occurred?

Fully self-contained messages relieve the consuming systems of this dependency; the event can be fully understood through the content of the message. Consuming systems can process fully self-contained messages without depending on any additional data about the event. Newly implemented consumers don’t need to be made aware of the need for additional data access and so don’t create new requirements on connectivity to these data.
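The time-sync problem can be illustrated with a minimal sketch. Everything in it is hypothetical (the `customer_store` dictionary, the two message classes and the order scenario are assumptions made for illustration, not part of any real system): one message carries only a customer id, the other carries the customer data as it was when the event occurred.

```python
from dataclasses import dataclass

# Hypothetical mutable data store standing in for the data stored "somewhere".
customer_store = {42: {"name": "Alice", "city": "Utrecht"}}


@dataclass(frozen=True)
class OrderPlacedByReference:
    order_id: int
    customer_id: int  # the consumer must look this up at processing time


@dataclass(frozen=True)
class OrderPlacedSelfContained:
    order_id: int
    customer_name: str  # copied into the message when the event occurs
    customer_city: str


def publish_by_reference(order_id, customer_id):
    return OrderPlacedByReference(order_id, customer_id)


def publish_self_contained(order_id, customer_id):
    customer = customer_store[customer_id]
    return OrderPlacedSelfContained(order_id, customer["name"], customer["city"])


# The event occurs: both message styles are published at the same moment.
ref_msg = publish_by_reference(1, 42)
full_msg = publish_self_contained(1, 42)

# Later, before the messages are processed, the customer moves.
customer_store[42]["city"] = "Amsterdam"

# Deferred processing: the reference resolves to the current state,
# the self-contained message still holds the state at event time.
city_seen_via_reference = customer_store[ref_msg.customer_id]["city"]
city_seen_via_message = full_msg.customer_city
```

In this sketch the consumer of the reference message sees "Amsterdam", although the order was placed while the customer lived in "Utrecht"; the self-contained message is unaffected by the later change.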

In architectural approaches that strongly focus on loose coupling (such as SOA and EDA) the principle of fully self-contained messages should be advocated as good practice. Advocating the passing of reference data, which happens far too often, leads in the opposite direction of the main goal of the architectural approach and so can be regarded as an anti-pattern.

However… Architectural principles must never be rigidly enforced. Architectural principles are guidelines toward a goal, in this case toward loose coupling and independence. Real-life situations may prevent us from implementing architectural principles. For a certain use case it may be too expensive, or it may severely degrade performance and efficiency. Or for a specific use case it may be technically impossible to adhere to the principle. Architectural principles are always subject to negotiation with regard to cost, performance, efficiency and technical feasibility trade-offs. This also applies to the principle of fully self-contained messages.

Passing reference data in a message may be the best solution in some (or many) cases. But still it is an anti-pattern for the SOA and EDA architectural approaches as it simply drives you away from the architectural goal of minimizing dependencies.

16 comments:

Anonymous
said...

Great post!

Regarding the difference between a fully self-contained message and one that is not...maybe an example could help? Is this more or less what you are talking about?

A Person has Employment History. When a new Employment record is added to a Person, an event is created. Would a fully contained message include the current state of the Person in addition to the new Employment information, rather than merely a reference Person Id? Is this what you had in mind?

It depends on the overall system semantics. If the previous Employee state is embodied in a previous event-message, then adding a Person ID in the message is relevant to serve as a correlation-id to the previous event-message(s).

If the previous state is relevant for understanding and processing the event-message, you probably would add the person's history to the message. You should try to avoid the need for adding a Person ID as a key to access separate data stores just to be able to understand the event and process the message.
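As a rough sketch of the Person/Employment example discussed above: the class and field names below are hypothetical and chosen only for illustration. The message carries the person id purely as a correlation id, and includes the employment history that a consumer would need to interpret the event.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Employment:
    employer: str
    start_year: int


@dataclass(frozen=True)
class EmploymentAdded:
    person_id: int                   # correlation id only, not a lookup key
    added: Employment                # the change that triggered the event
    history: Tuple[Employment, ...]  # prior state as it was at event time


def years_since_previous_job(event: EmploymentAdded) -> Optional[int]:
    """Example consumer logic that needs no external data store."""
    if not event.history:
        return None
    return event.added.start_year - event.history[-1].start_year


event = EmploymentAdded(
    person_id=7,
    added=Employment("Acme", 2008),
    history=(Employment("Initech", 2003),),
)
gap = years_since_previous_job(event)
```

How much history to include remains a trade-off; the sketch only shows that the consumer can do its work from the message alone.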

But of course you must challenge trade-offs with regard to performance, costs, feasibility etcetera.

Nice post. Two comments:

A) Self-containment is not about references only. It is also about other issues, e.g. identification of the changes which triggered the event. The goal is semantics defined by the message.

B) In reality you cannot send the whole database as the context of an event, and you need to cut references somewhere. It is about modeling associations and aggregations in a canonical data model... The question is: which data shall be included in the event? Shall we include all data required by all event subscribers? Shall these data be driven by the subscribers' logic? Certainly NOT!!! We want to achieve loose coupling, not to make any assumptions about event subscribers...

You state it very well, Peter. Although you leave me with questions, your comment leads to the discussion that should take place with regard to understanding EDA. You take quite a different approach from those who blindly advocate adding references to the message and those who equate CEP with EDA.

Jack, thank you so much for this series of great posts on EDA. I really enjoy reading them.

On the subject: the reference you are talking about is a reference to some kind of object that had some event change its state. And then you're absolutely right that time will erase that state, so referring to the object itself is of no use. However, it is not the object we're interested in, it is the event and its context. What we can do is pass the event around in some summarised form, and include a reference to the detailed information.

There is one little challenge: the summary that the event contains should be sufficient to do some basic filtering on. That means you'll need to be able to determine upfront what attributes other parties will want to filter on.

The message contains momentary and unchangeable information about the event. All reference data is (in general) changeable, unless you guarantee historical records that will ALWAYS represent the structure and the content at the time the event took place. Moreover, this data must be available to all current and future consumers of the respective message. And these consumers might all be unknown from the publisher's perspective, and even to you as the designer of the system.

You can compare the modeling of the message with modeling a data warehouse (to a certain extent). Data normalization (creating references) in relational databases serves efficient and consistent change in order to maintain integrity. As the data in the message, just like in a data warehouse, does not change, the data is modeled to serve availability and not to serve integrity during change.

The data in the message is denormalized, just like you denormalize the data in a data warehouse. And just like you may want to avoid references from a data warehouse to external data stores, the same applies to event messages.

"E.g. think of processing archives on behalf of business intelligence or compliance reports. How would you manage to keep the referenced data available in the state (and structure) of the moment the event occurred?"

Another approach, tell me if it is in any way sensible.

Take a business intelligence service, it consumes messages containing references but it has also subscribed to the events for all the referenced entities that it is interested in (excuse my terminology).

With this approach your business intelligence service can ensure that it keeps all the information it needs in its own data store, and that it stores it in a temporal manner. When it then gets a message containing a reference it will have access to the up-to-date information for that referenced resource.
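A minimal sketch of such a temporal store might look as follows. The `TemporalStore` class, the entity ids and the integer timestamps are all hypothetical; the sketch also assumes the service has received every entity event from the start, which the approach depends on.

```python
import bisect


class TemporalStore:
    """Keeps every version of an entity, keyed by the time it became valid."""

    def __init__(self):
        # entity_id -> (sorted list of timestamps, list of states in same order)
        self._versions = {}

    def record(self, entity_id, timestamp, state):
        times, states = self._versions.setdefault(entity_id, ([], []))
        idx = bisect.bisect_right(times, timestamp)
        times.insert(idx, timestamp)
        states.insert(idx, state)

    def as_of(self, entity_id, timestamp):
        """Return the state that was valid at `timestamp`, or None if unknown."""
        times, states = self._versions.get(entity_id, ([], []))
        idx = bisect.bisect_right(times, timestamp)
        return states[idx - 1] if idx else None


store = TemporalStore()
# The service has subscribed to changes of the referenced entity.
store.record("cust-42", 100, {"city": "Utrecht"})
store.record("cust-42", 200, {"city": "Amsterdam"})

# A message arrives that only carries a reference plus the event time.
order_event = {"order_id": 1, "customer_ref": "cust-42", "occurred_at": 150}
state_at_event = store.as_of(order_event["customer_ref"], order_event["occurred_at"])
```

Here the reference is resolved against the state at the time of the event rather than the latest state, which is what archive processing would need.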

I'm not sure whether this is preferable to having self contained messages and storing them but I'd be interested in your views?

I agree that the event should contain all information, but I don't agree that all information needs to be sent while notifying the consumers of the event.

You can use some kind of event cache to store the complete unchangeable event, that is including all the information you would include in the message. That cache can be centralised in some broker or decentralised in the producers' domains.

You can decide to specify and communicate a time to live for that event - or you can decide to guarantee the event will be available forever. You can measure the interest consumers take in the complete event, and you can decide never to keep the complete event again, or base your time to live on the patterns you will see there.

You can do all that, and it is worth investigating if the number of bytes passed around for event messages is perceived to be a problem. If it's not, because no one cares about bandwidth, or just because all events turn out to be small messages anyway (not unlikely), you can send the entire information in one go. And of course, if it turns out that no one ever requests the complete event information, or if it turns out that nearly every consumer of an event will nearly always request for it, then the cache does not make much sense. But you just don't know that, upfront.
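The event cache idea could be sketched roughly like this. The `EventCache` class, the `publish` helper and the notification fields are hypothetical names used for illustration, and time is passed in explicitly to keep the example deterministic.

```python
class EventCache:
    """Holds complete, immutable event payloads for a limited time to live."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._items = {}  # event_id -> (stored_at, payload)

    def put(self, event_id, payload, now):
        self._items[event_id] = (now, payload)

    def get(self, event_id, now):
        entry = self._items.get(event_id)
        if entry is None:
            return None
        stored_at, payload = entry
        if now - stored_at > self.ttl:
            del self._items[event_id]  # expired: the full event is gone
            return None
        return payload


def publish(cache, event_id, payload, now):
    """Store the full event and return the small notification that is sent."""
    cache.put(event_id, payload, now)
    return {
        "event_id": event_id,
        "type": payload["type"],         # summary field usable for filtering
        "expires_at": now + cache.ttl,   # communicated time to live
    }


cache = EventCache(ttl_seconds=60)
notification = publish(cache, "e1", {"type": "OrderPlaced", "total": 99}, now=0)

early_read = cache.get(notification["event_id"], now=30)  # within the TTL
late_read = cache.get(notification["event_id"], now=61)   # after the TTL
```

The late read returns nothing, which shows the cost of a finite time to live: a consumer that processes the notification after expiry can no longer reconstruct the event.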

Good post. Like Peter, I think the CDM (common data model) defines what makes a message self-contained. Somewhere we need to cut references, and I don't see a better driver than what's defined in the CDM as business references between Business Objects at enterprise level. Of course this is an enterprise architecture guideline; the solution architecture should derive from there and take solution specifics/requirements into play.

I beg to differ with the notion that an event should not simply contain a reference. But I agree wholeheartedly with the notion that the complete state that the message would have contained be made available at the time the event is "handled" by one of its interested subscribers.

What real difference is there in completely containing the content, and containing a reference to the identical content?

For sure there is the worry about duration. But we have that with durable messages anyway. There is the issue of time semantics - but actually it may be even better to store the content separately. Especially if the subscriber is somewhat unreliable and prone to miss messages. You have just the same catch-up to do if you get references as you do if you get the full content.

So for me, the bottom line is that as long as the content is exactly what you would have transmitted with the event, you have better opportunities if you store that content externally and fetch on demand. Pull semantics tend to be more resilient than pure push semantics, so I like to see a push of the lightweight event and a pull of the content so that I can properly scale the infrastructure and insulate the systems from flooding dangers in case of rogue publishers.
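A rough sketch of this push-notify, pull-content idea follows. The `ContentStore` and `Subscriber` classes are hypothetical and assume an append-only store in which the offset serves as the reference; the subscriber treats a notification only as a hint and pulls at its own pace.

```python
class ContentStore:
    """Append-only store; the returned offset is the reference to the content."""

    def __init__(self):
        self._log = []

    def append(self, content):
        self._log.append(content)
        return len(self._log) - 1

    def read_from(self, offset, limit):
        return self._log[offset:offset + limit]


class Subscriber:
    def __init__(self, store, batch_size):
        self.store = store
        self.batch_size = batch_size  # caps the work done per notification
        self.next_offset = 0
        self.handled = []

    def on_notify(self, offset):
        if offset < self.next_offset:
            return  # content at this offset was already pulled
        # Pull from where we left off, so missed notifications do no harm.
        batch = self.store.read_from(self.next_offset, self.batch_size)
        self.handled.extend(batch)
        self.next_offset += len(batch)


store = ContentStore()
sub = Subscriber(store, batch_size=2)

# A publisher stores five events; only the last notification gets through.
offsets = [store.append(f"event-{i}") for i in range(5)]
sub.on_notify(offsets[-1])
first_batch = list(sub.handled)

# Repeated hints let the subscriber catch up in bounded batches.
while sub.next_offset <= offsets[-1]:
    sub.on_notify(offsets[-1])
```

The subscriber recovers all five events despite the missed notifications, but only because the store still holds the content unchanged, which is the guarantee the post argues is hard to keep over time.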

I see no direct benefit to having the data completely contained in the message. In fact, I think that is the design anti-pattern for resiliency.