Thursday, April 19, 2007

Systems that pass data to each other share commonly understood semantics. Explicit data semantics is the key to success in an EDA (and any other messaging system). In striving for loose coupling, data semantics is the ultimate level; when systems are decoupled at the semantic level - i.e. they don't share semantics - connecting them becomes useless, because the systems will not be able to communicate at a logical level. Shared semantics is a prerequisite for connecting distinct systems, no matter whether it concerns EDA, SOA or any other form of EAI (Enterprise Application Integration). It should be obvious to anyone that analysis of data semantics will always be the first activity of any integration project.

In contrast to sharing semantics, distinct systems do not share formats that express these semantics. Think of different date formats or amounts (semantics: balance on a bank account) expressed in different currencies. Or think of different identifiers: CustomerName versus Custnm. The same semantics is expressed in different formats.

Mechanisms must be in place to harmonize between these different formats that carry semantically the same data from one environment to another environment. This pattern describes such a mechanism based on intermediate canonical formats for semantics representation.

Canonical Data Model (CDM)

In an EDA a business event is represented in a canonical format (presentation) with unambiguous semantics. This format and semantics are defined as canonical message types in the enterprise's Canonical Data Model. These messages are the core of the event-driven architecture and are valuable business assets that must be treated as such with regard to protection.

A message type may be invoked by several source systems in several environments. Several target systems in several environments may consume the same message. The environments that send and receive these messages don't need to know the canonical format. Every environment communicates in its own local format with the messaging system (typically the Global Dataspace implemented by an Enterprise Service Bus). A prerequisite is that every environment concerned has defined its local data formats and semantics in the CDM. The messaging system will provide services to transform the local format to the canonical format and vice versa. These services depend on the CDM.

There will always be a transformation from the local format at the sending side to the canonical format and there will always be a transformation from the canonical format to the local format at the receiving side. Even if the local format is identical to the canonical format a transformation will still be implemented. Such a null-transformation makes the mechanism generic and more agile when changes occur.
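The two-step mechanism, including the null-transformation, can be sketched as follows. This is only a minimal illustration in Python, not a description of any particular ESB product: the system names (crm, portal, billing, archive), the field name customer_name and the function names are all hypothetical. Only CustomerName and Custnm are taken from the text above.

```python
from typing import Callable, Dict

Message = Dict[str, str]
Transform = Callable[[Message], Message]


def null_transform(msg: Message) -> Message:
    """Null-transformation: the local format equals the canonical format."""
    return dict(msg)


def crm_to_canonical(msg: Message) -> Message:
    # Hypothetical sending system that uses the abbreviated field name.
    return {"CustomerName": msg["Custnm"]}


def canonical_to_billing(msg: Message) -> Message:
    # Hypothetical receiving system that uses its own field name.
    return {"customer_name": msg["CustomerName"]}


# One inbound transformation per sender, one outbound per receiver.
TO_CANONICAL: Dict[str, Transform] = {
    "crm": crm_to_canonical,
    "portal": null_transform,  # already speaks the canonical format
}
FROM_CANONICAL: Dict[str, Transform] = {
    "billing": canonical_to_billing,
    "archive": null_transform,  # stores the canonical format as-is
}


def deliver(sender: str, receiver: str, msg: Message) -> Message:
    """Always go local -> canonical -> local, even if a step is a no-op."""
    canonical = TO_CANONICAL[sender](msg)
    return FROM_CANONICAL[receiver](canonical)
```

Because the null-transformation is registered like any other, a later change to the portal's local format only requires replacing one table entry; the delivery path itself stays the same.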

At design time, the definition of format transformation is not the only thing that must be accomplished. First of all correctly mapping the corresponding semantics from the local formats to the canonical formats is of utmost importance. Format transformation is the second step. Semantics mapping is vital to the success of the system, so in consequence defining semantics and recording these descriptions in the CDM is not an option, but a must if you want to succeed with EDA, SOA or EAI.

CDM, not a common datamodel

The CDM is not a storage component, but a metadata component. The CDM holds the definitions of the local formats and semantics of the participating systems, as well as the definitions of the canonical formats and semantics. No persistent processing data in local or canonical format is stored in the CDM. Also the CDM doesn't provide a common datamodel that everyone has to adhere to. Such a common model is no longer appropriate since we buy systems from the marketplace with their own datamodels and since we connect many systems from a variety of environments, old and new, sometimes in a B2B context or inherited from merging with other companies, each with their own datamodels, formats and semantics. The pattern described here doesn't bother the system owners with constraints on datamodels, formats and semantics. Everybody can use their own models, formats and semantics. Transformation services support the transformations of the shared semantics to and from canonical formats, using the definitions in the CDM. The sending and receiving systems are completely unaware of this; they talk in their own language.

Enrichment and translation algorithms may be part of the transformation services. This applies to different data representations with the same semantics, but it also applies to conversions of different but deducible semantics.

Example of data representation transformation

Two systems share the semantics for "railway station"; they both interpret the meaning of this entity type in the same way: a railway station involves a location and platforms and is owned by Dutch Railways; the rails are not part of it. However, one system uses alphabetic characters to identify a railway station. The other system uses numeric characters to identify the same set of railway stations. So one system identifies railway station Oudenbosch by "A" and the other system by "01". The canonical format uses even another set of identifying characters: alphanumeric. The transformation services must have knowledge of all railway station identifying sets and how they correlate. A persistent data set (e.g. a database) lies at the basis of the resolving algorithm of the transformation service.

In practice the case of this example may be rather complicated; think of how to keep the intermediate data set up-to-date if the connected systems may autonomously add new railway stations (or worse: change the railway station IDs).

On the other hand there are also very easy translations, like translations between date formats or between miles and kilometers.

Note that all of these data representation translations can be bi-directional.
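The railway station case might be sketched like this. It is an illustration under assumptions: the canonical identifier "ODB1" and the system names "alpha" and "numeric" are made up for this example, and the in-memory list stands in for the persistent data set that the transformation service would really use. Only the identifiers "A" and "01" for Oudenbosch come from the example above.

```python
from typing import Dict, List, Tuple

# Hypothetical cross-reference rows: (canonical id, alphabetic id, numeric id).
# In a real deployment this would be a persistent data set, e.g. a database.
STATION_XREF: List[Tuple[str, str, str]] = [
    ("ODB1", "A", "01"),  # railway station Oudenbosch
]

# Lookup tables in both directions, one pair per connected system.
CANONICAL_BY_LOCAL: Dict[str, Dict[str, str]] = {"alpha": {}, "numeric": {}}
LOCAL_BY_CANONICAL: Dict[str, Dict[str, str]] = {"alpha": {}, "numeric": {}}

for canonical_id, alpha_id, numeric_id in STATION_XREF:
    CANONICAL_BY_LOCAL["alpha"][alpha_id] = canonical_id
    CANONICAL_BY_LOCAL["numeric"][numeric_id] = canonical_id
    LOCAL_BY_CANONICAL["alpha"][canonical_id] = alpha_id
    LOCAL_BY_CANONICAL["numeric"][canonical_id] = numeric_id


def to_canonical(system: str, local_id: str) -> str:
    """Resolve a system's local station id to the canonical id."""
    return CANONICAL_BY_LOCAL[system][local_id]


def from_canonical(system: str, canonical_id: str) -> str:
    """Resolve a canonical station id to a system's local id."""
    return LOCAL_BY_CANONICAL[system][canonical_id]


def translate(source: str, target: str, local_id: str) -> str:
    """Translate between two local id sets via the canonical id."""
    return from_canonical(target, to_canonical(source, local_id))
```

Since every row maps one-to-one, the same tables serve both directions: translate("alpha", "numeric", "A") and translate("numeric", "alpha", "01") are each other's inverse.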

Example of correlating semantics conversion

In some cases it is possible to convert one semantics to another. Of course this is only possible if one semantics embodies the other one in some deducible way. Let's look at a strongly simplified example of a purchase order.

The canonical format of a purchase order consists of an order number with a set of order lines, each with a part number, a quantity and the price of the part on that line. The consuming system understands a purchase order as an order number and a total order amount. The transformation service multiplies the quantities by the prices and sums the resulting amounts.

You might argue that this example doesn't mention two semantics, but a different representation of only one semantics. You are right, it is ambiguous. On the other hand, the canonical format holds more data of the order than the consuming system does. So how can the semantics be the same?

This is a simple example. In practice you may come across very complicated situations, where multiple complex data structures and complex algorithms are involved.

Note that the conversion can only take place in one direction.
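A minimal sketch of the purchase order conversion, assuming the price on an order line is a unit price. The class and field names are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class OrderLine:
    part_number: str
    quantity: int
    unit_price: Decimal  # assumption: the price on a line is per unit


@dataclass(frozen=True)
class CanonicalOrder:
    order_number: str
    lines: List[OrderLine]


@dataclass(frozen=True)
class ConsumerOrder:
    order_number: str
    total_amount: Decimal


def to_consumer_order(order: CanonicalOrder) -> ConsumerOrder:
    """Multiply quantity by price per line and sum the resulting amounts."""
    total = sum(
        (line.quantity * line.unit_price for line in order.lines),
        Decimal("0"),
    )
    return ConsumerOrder(order.order_number, total)
```

The function has no inverse: many different sets of order lines produce the same total, so the order lines cannot be reconstructed from a ConsumerOrder. That is why this conversion works in one direction only.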

Why?

Using canonical message types decouples systems at the level of message formats. Systems don't have to make assumptions about, or rely on, other systems' data formats. This is an important aspect in striving for loose coupling.

Defining canonical message formats creates the opportunity to supply the company with an unambiguous catalog of available messages about business events, representing valuable business assets. The business events in this catalog are independent from the sources that generate these messages. Based on this catalog, policies can be implemented with regard to ownership and degree of free availability of data that is exchanged between domains. New business models may pop up with regard to data exchange. The catalog may contain rates associated with messages about business events. Publishing data about business events may be marketed: suppliers get paid for the published data by consumers. The IT department delivers the marketplace (infrastructure) and may play a role as business events broker.

In a technical sense this pattern has a benefit in that at the endpoints only one transformation service per message type has to be configured. A subscriber needs to subscribe to only one message type, regardless whether there are multiple sources or not.

If transformations took place directly between local formats (skipping the intermediate canonical format), transformation services would have to be created for every source-target combination. This would lead to a higher management and maintenance effort. Consumers would have to subscribe separately to every source of a particular message type and would consequently need to know of the existence of these distinct sources.
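The difference in the number of transformation services can be made concrete with a small calculation for a single message type. This is a back-of-the-envelope sketch with hypothetical function names; it counts the null-transformations too, since they are still configured.

```python
def direct_transformations(sources: int, targets: int) -> int:
    """Point-to-point: one transformation per source-target combination."""
    return sources * targets


def canonical_transformations(sources: int, targets: int) -> int:
    """Via a canonical format: one transformation per endpoint."""
    return sources + targets


# Example: one message type with 4 publishing and 5 subscribing systems.
direct = direct_transformations(4, 5)        # 20 transformations
canonical = canonical_transformations(4, 5)  # 9 transformations
```

With a single source and a single target the canonical route actually needs one transformation more than the direct one (2 versus 1); the advantage grows as endpoints are added.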

Without an intermediate canonical format, a format change at the publisher's side must be followed by changing all the transformations to the subscribers. Using an intermediate canonical format makes the transformations to the subscribers independent of changes at the publisher's side.

Without canonical formats for semantics representation semantics would be represented in multiple equivalent formats. This obstructs the possibility to supply the company with an unambiguous catalog of business events independent from their sources. Also the lack of canonical formats will consequently cause system designs and resulting systems to be more complex and harder to change.

This is indeed true and more organisations are adopting this approach. Also the standards bodies are producing market-vertical common data models that can be used as boilerplates for an organisation to implement this approach without the need to create its own model. Examples of the uptake are standards such as ACORD or the TMF SID.

Furthermore there are commercial products that allow exactly the implementation of the solution you mention in IDE-type environments and then instantiate all of this, including the CDM and metadata, at runtime, either with or without the ESB. It is not only the development time that is reduced but also the maintenance, as using the metadata makes it possible to do impact analysis of what a change to one of the interfaces will do to the whole ecosystem. One organisation that has taken this approach is seeing real-world reductions of 50% in development time where a large number of systems need to be integrated but, tellingly, a 90% reduction in the ongoing cost of maintenance. Not the sort of savings that are to be sniffed at.

Another major benefit is in allowing new COTS packages to be brought into the existing ecosystem alongside the existing legacy application and have parallel execution of the two systems without the rest of the ecosystem being aware of the fact. If you apply semantic data rules to the data flowing through the CDM it is possible to create semantic routing to particular endpoints and even cut over from the legacy application to the COTS solution gradually, based on data types or anything that can be described by a semantic data rule, and in a much more controlled manner than if direct integrations are used. Enrichment allows the two systems to operate together to support upstream requests and to build the coherent view of the entity during the cutover process.

I really enjoy your articles, and hopefully my observation will add something useful by way of a discussion point.

The CDM I am developing and using would take your first example (of the train station) and create an entity that represents the actual train station. All of the known facts about it: its name, lat/long coordinates, street address, owner, operator, etc. are added as associations to give the entity context. The model then declares that the entity is unique and makes that single point of data the reference for all other representations.

So, if one system uses '01' and another 'A1' as identifiers they can both point to the single entity as the 'real' entity. This technique grew out of a large localization effort where multiple strings can refer to the same 'thing'. Rather than make English the 'parent' language and all the other languages 'children' of English I created a 'non-lingual' entity (NLE) and made all the languages both peers to each other and children of the NLE.

Any future reference to that particular station - from any system - can then connect to any other application from which it requires services using that common reference. This approach has the added benefit of beginning to clean up data integrity issues, but the primary benefit is that once created, the 'real' entity can be referenced and reused anywhere.

The second example seemed to me to be more about structure than about semantics. A related example might help me understand. If I have a cell phone, I may only be interested in its features, but an electronics engineer may be interested in the component parts. Same phone, two different 'dialects'.

This is a great article and one of the few I have found that talks about how to actually design the canonical data model. I have a couple of questions:

1. What are your thoughts on situations where the same type of data may be generated at different levels of granularity due to regional variations in the process? Are there any patterns on how to design the CDM in such situations?

2. Can you point to any other literature that clarifies how CDM design should be done?

1. I would say, try to define the most comprehensive canonical format and translate from/to that format. If this is not possible, you will have to design for the most feasible, which I cannot define in generic terms without more detailed knowledge of the specific situation. Here it comes down to the craftsmanship of the designer.

2. Until now I haven't come across good material. If any of my blog readers know of such literature, feel welcome to share it here.

I also have not seen any material specifically on how to design a CDM. After much study it is my opinion that the best way to understand how to design a CDM is to understand ontologies and the ontology matching problem. At the end of the day a CDM is just an ontological commitment.

This is a great article. I'm a data modeller, not a SOA architect, so I don't know how the technology works, but I do know of a large organisation that uses their 'enterprise data model' as the basis of their service schemas (XSDs). Their 'enterprise data model' is a third-normal form Logical Data Model, which has a corresponding Physical Data Model (PDM), managed in the same data modelling tool. The PDM contains a submodel ("view") for each of the XSDs, and they use the tool's automation facilities to generate the XSDs when required. They can also use the tool to conduct impact analysis, and their schemas use the same semantics as their logical data models. Win, win, win!

Whereas I believe in general this is a good article and describes a very useful pattern (especially when many commercial products are available in the architectural landscape), there are some scenarios where introducing a mediation / translation tier might actually result in a huge amount of extra work that could be avoided. An example of this scenario is when a bespoke solution adopts the SOA architectural style to deliver a large and complex system where one of the key requirements is to deliver a “domain agnostic” flexible solution. The system may have n tiers and could even have competing technologies utilized for the presentation and the services and orchestration tiers. In this scenario what would be the best approach to model the data across the tiers? Should the presentation tier be developed to consume the services as they are defined? Or should the ESB deliver additional mediation services to tailor the information and semantics as required by the presentation tier in order to support a particular UI? In my opinion this should not be the case. I believe that in the above scenario the presentation tier could consume the SOA services as they are presented / described by the contract (which would probably be in a canonical, generic form). The presentation tier (which would itself probably consist of n tiers implementing the MVC pattern) would then provide the mapping required to retrieve/display information from/to the view. The benefit of this approach is that a considerable portion of the code from the presentation tier can be reused if the business decides to reuse the solution in another business domain.

What Anonymous said... (on October 27, 2009 1:30 AM) is what you have to consider in practice. And although you are not using a CDM but canonical messages, this still means you try to share some knowledge globally. This is seldom necessary. In practice you should know the message provider and the consumer(s). The provider defines the data and you will have mediation / translation, probably via a common format, but this is just demand-driven and tactical.

Hi, this is a great post and I know this is an old one but I still think that it's valid to date.

I have a couple of questions about this pattern.

When you are building a brand new service and come up with a canonical model depending on the consumer needs, do we still need consumers to do a local format conversion, or do we force them to adhere to the canonical model? If we come up with a common canonical model that is comprehensive and filtering is needed because a consumer doesn't want all the data, do you do it at the consumer level? Doesn't that make the consumer tightly coupled to the data structures? If we have to implement the filtering for the consumer, what is the best way to support most if not all of the consumer needs from an ad hoc query perspective?