Data In Motion

Service components in this zone are tasked with moving data between services, applications, or solutions. These applications could be within our own business or belong to third-party organizations.

Typically, our Data at Rest repository will contain our own solution data, and our Data in Motion services act as an Entity Synchronization layer with other services, applications or solutions.

Synchronous Patterns

Synchronous communication patterns are implemented using Remote Procedure Call (RPC) models. The simplest implementation model is to deploy Web APIs to pull or push data in real time. The primary characteristic of a synchronous API call is that the client waits for the server to respond.

Acting as a Client

When our solution is acting as a client, we need to ensure that we can connect to the remote API within a reasonable time frame. This means that our requests need to time out if the remote API is running slowly or becomes unresponsive.

We should incorporate retry logic and circuit breaker patterns into our client components.
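The timeout, retry, and circuit breaker ideas above can be sketched as follows. This is a minimal illustration rather than a production client: the names CircuitBreaker, call_with_retry, and fetch, and the default thresholds and delays, are all assumptions made for the example.

```python
import time
import urllib.request


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows one
    trial call once `reset_after` seconds have passed (half-open state)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("remote API marked unavailable")
            # Cool-down has elapsed: let a single trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker, func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry `func` through the breaker with exponential backoff."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # do not keep retrying while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def fetch(url, timeout=5.0):
    """Hypothetical remote call; the timeout stops us waiting on a slow API."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

A caller would wrap each remote request, for example call_with_retry(breaker, lambda: fetch(url)), so that a slow API fails fast and a failing API stops receiving traffic until the cool-down has passed.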

For push requests, we should consider use of queuing techniques to buffer data flow when the remote API is running slowly or is unresponsive, so that we don’t affect the performance of our solution.

Acting as a Server

When our solution is acting as a server, we need to ensure that we can respond to the request as quickly as possible.

For pull requests, this means that we need to have our data packaged ready to go with minimal processing.

For push requests, this means that we need to ingest the data quickly and acknowledge receipt once we have the data safely stored.
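One way to ingest quickly and acknowledge only after the data is safely stored is to write the raw payload to a durable inbox and defer all other processing. The sketch below assumes a SQLite table named inbox and helper names open_inbox, ingest, and pending; none of these come from the solution itself.

```python
import sqlite3
import uuid


def open_inbox(path=":memory:"):
    """Open (or create) the inbox store. Use a file path for durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " receipt_id TEXT PRIMARY KEY,"
        " payload BLOB NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def ingest(conn, payload: bytes) -> str:
    """Store the payload untouched and return a receipt identifier.

    The receipt is only returned after the insert has been committed,
    so the acknowledgement means the data is stored.
    """
    receipt_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO inbox (receipt_id, payload) VALUES (?, ?)",
            (receipt_id, payload),
        )
    return receipt_id


def pending(conn):
    """Payloads that have been acknowledged but not yet processed."""
    return conn.execute(
        "SELECT receipt_id, payload FROM inbox WHERE processed = 0"
    ).fetchall()
```

Validation, transformation, and correlation can then run in the background against the pending rows, outside the request path.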

Asynchronous Patterns

Asynchronous communication patterns are implemented by passing messages between solutions. The message could be:

A file being passed using FTP/sFTP or a file share

A block of data being passed through a message queue, service bus, email, or SMS

When communicating with external resources, we must deal with slow running or unresponsive requests. It makes sense to implement an internal message buffering mechanism within our solution to decouple our solution's need to send/receive data from the external dependencies. This means that our solution will continue to perform well even if the target message queues are temporarily offline. When the target message queue comes back online, we can clear the backlog of messages.
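The internal buffering mechanism described above might look like the following sketch. It is held in memory for brevity, whereas a real buffer would normally be persisted; the class name OutboundBuffer and its methods are assumptions made for illustration.

```python
from collections import deque


class OutboundBuffer:
    """Decouples our solution from a slow or offline target queue."""

    def __init__(self, send):
        # `send` delivers one message to the target and raises on failure.
        self._send = send
        self._backlog = deque()

    def publish(self, message):
        """Queue the message and return at once; the caller never waits."""
        self._backlog.append(message)

    def flush(self):
        """Deliver queued messages in order, stopping at the first failure.

        Returns the number of messages delivered on this attempt.
        """
        delivered = 0
        while self._backlog:
            try:
                self._send(self._backlog[0])
            except Exception:
                break  # target still unavailable; keep the backlog intact
            self._backlog.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._backlog)
```

Calling flush on a timer or from a background worker clears the backlog once the target comes back online, while publish stays fast regardless of the target's state.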

Compression

When sending larger data sets, we should be thinking about using compression techniques to reduce the size of data being transmitted.

There is a trade-off between the performance overhead of compressing the data and the time taken to transmit the data. A good rule of thumb is to consider the latency between the source and destination, and the transport protocol used. If you are using TCP/IP, then you have a maximum packet size of 64 KB, so if your data is smaller than that, there is little point in compressing it.

Consider the following example: your data is 10 MB, so you need to break it down into roughly 156 packets of 64 KB for transmission. If the latency between source and target is 50 ms, and each packet incurs that latency, your transmission time is about 156 × 50 ms = 7.8 seconds, or roughly 8 seconds. If you can compress the data 10:1, then you only need to send 16 packets, and your transmission time drops to 0.8 seconds.
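The arithmetic in this example can be reproduced with a short sketch. It uses the same simplified model, in which every 64 KB packet costs one full latency period, and assumes decimal units (10 MB = 10,000,000 bytes). The function names are illustrative only.

```python
import math
import zlib

PACKET_BYTES = 64 * 1000  # 64 KB packet size, as assumed in the text
LATENCY_S = 0.05          # 50 ms between source and target


def packets_needed(size_bytes, packet_bytes=PACKET_BYTES):
    """Number of packets required, rounding up any partial packet."""
    return math.ceil(size_bytes / packet_bytes)


def transfer_seconds(size_bytes, latency_s=LATENCY_S, packet_bytes=PACKET_BYTES):
    """Simplified model: each packet costs one full latency period."""
    return packets_needed(size_bytes, packet_bytes) * latency_s


def compressed_size(data: bytes, level=6):
    """Size in bytes of the data after zlib compression."""
    return len(zlib.compress(data, level))
```

Because partial packets are rounded up, 10,000,000 bytes needs 157 packets (about 7.85 seconds) rather than 156, and 1,000,000 bytes needs 16 packets (0.8 seconds). The conclusion is unchanged. The real compression ratio depends heavily on the data, so it is worth measuring compressed_size on a representative sample before relying on a 10:1 figure.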

If you are sending a lot of data between remote targets through a single channel, then latency can severely throttle your effective bandwidth. Compression and multiple channels are both design options that should be considered.

Encryption

As with compression, the use of encryption on data traffic does impose a processing overhead. The benefit of using encryption is data security and privacy. You should be encrypting any traffic over an external network, and it's good practice to encrypt all traffic over internal networks as well.

Encrypting data to and from external solutions means that you are sure that the security of the data is maintained, up to the point that the data reaches or leaves the external solution. What the other solution does is beyond our control! We can however demonstrate that we have done everything we could to ensure that our solution protects the security of our data and the privacy of our users. This is a strong risk management position to take against the possibility of data breaches.

Even encrypting data travelling across our own internal networks is a good idea. We cannot guarantee that our networks have not been penetrated or compromised. One of our key concerns in developing robust enterprise architecture is to create defense in depth. This means that we secure as much as we can in as many ways as we can, without compromising our ability to provide functionality to the business user.
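For traffic carried over TLS, a client-side configuration might be sketched as below. The function name make_client_context and the choice of TLS 1.2 as the minimum version are assumptions for this example, not requirements stated by the solution.

```python
import ssl


def make_client_context(ca_file=None):
    """Build a TLS context that verifies the server and rejects old protocols.

    `ca_file` is an optional path to a private CA bundle, which is useful
    for internal services; when omitted, the system trust store is used.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    # create_default_context already enables certificate and host name
    # verification; we additionally refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

The same context can be used for internal and external connections, for example by passing it to urllib.request.urlopen through its context argument, or by calling ctx.wrap_socket on a raw socket with server_hostname set.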

Entity Synchronization Layer

The key objective of our Data in Motion services is to create an Entity Synchronization layer above our Data at Rest repository.

Our Data at Rest repository is going to contain all the data required to run our own solution. But it might also have to utilize data from other internal or external applications or services.

For example, we might have legacy applications within our own business that need to provide daily extracts of data through an ETL batch process.

The Data in Motion layer decouples all of the data sources we need to deal with that are outside the immediate scope of our solution. There are a couple of interesting challenges to overcome.

Security

Our data within the Data at Rest repository is secured by our Instance Access table and locked down to individual Users and Groups. However, when sending/receiving data to/from other applications, we also need to respect their security model. In some situations, this just requires us to connect to their underlying data provider. In others, we will need to map our user identities to theirs – essentially, we will also need to share administrative data.

Our Data in Motion components act as integration points that know enough about the applications they connect to that they can ensure the appropriate security data is used to secure the data being shared.

Correlation

Our data within the Data at Rest repository is uniquely identified using Instance GUIDs. The Instance GUIDs we create are unlikely to be the same as the primary keys for the corresponding data in other applications.

For example, we might be sharing Policy information with multiple applications, each of which will have a different definition of what a Policy is and how it is uniquely identified.

When using the DaaS solution as an Entity Synchronization Layer, we can correlate the various identifiers used in each application against our internal Instance GUIDs. We can then receive updates from any application, and map the inbound Policy identifiers against our internal Instance GUIDs. We can then generate outbound notifications of changes in the Policy data using the application identifiers for each target application.

Our Data in Motion components record correlations between the data each application shares with the DaaS solution. Since each application synchronizes individually with the DaaS, the DaaS is ultimately responsible for synchronizing data across all of the applications integrated with it, and hence acts as the Entity Synchronization Layer.
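The correlation between application identifiers and Instance GUIDs can be pictured as a two-way lookup. The sketch below keeps it in memory; the class name CorrelationStore, its methods, and the application names in the usage note are hypothetical.

```python
import uuid


class CorrelationStore:
    """Maps each application's own identifier to our Instance GUID."""

    def __init__(self):
        self._to_guid = {}      # (application, external_id) -> Instance GUID
        self._to_external = {}  # (application, Instance GUID) -> external_id

    def correlate(self, application, external_id, instance_guid=None):
        """Record a correlation, creating a new Instance GUID if needed."""
        key = (application, external_id)
        if key in self._to_guid:
            return self._to_guid[key]
        guid = instance_guid or uuid.uuid4()
        self._to_guid[key] = guid
        self._to_external[(application, guid)] = external_id
        return guid

    def inbound(self, application, external_id):
        """Resolve an inbound application identifier to our Instance GUID."""
        return self._to_guid.get((application, external_id))

    def outbound(self, instance_guid):
        """Return {application: external_id} for every correlated application."""
        return {
            app: ext
            for (app, guid), ext in self._to_external.items()
            if guid == instance_guid
        }
```

For instance, after correlating a Policy known as "POL-001" in a CRM application and as 7731 in a billing application to the same Instance GUID, an update arriving from either one resolves through inbound, and outbound supplies the identifier to use when notifying each of the others.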

Since our DaaS is data agnostic, it doesn’t matter that each application may have different representations for Policy data. The DaaS is just going to process the data as XML, so if we need a few extra fields for one representation and a few fewer for another, the DaaS is really not that concerned.

The Data in Motion components that integrate the applications need to understand how to transform Policy data into the internal DaaS representation. This can be done using an appropriate mix of code and XSLT, with XSLT preferred.
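As an illustration of the code-based option, the sketch below renames an application's fields to an internal layout using Python's standard XML library. XSLT remains the preferred route; Python's standard library simply has no XSLT processor. The element names and the FIELD_MAP contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one application's field names to internal names.
FIELD_MAP = {"PolNo": "PolicyNumber", "Holder": "PolicyHolder"}


def to_internal(source_xml: str) -> str:
    """Rewrite an application's Policy record into the internal layout.

    Mapped fields are renamed; unmapped fields are copied through
    unchanged, since the DaaS tolerates extra fields.
    """
    source = ET.fromstring(source_xml)
    policy = ET.Element("Policy")
    for child in source:
        target_name = FIELD_MAP.get(child.tag, child.tag)
        ET.SubElement(policy, target_name).text = child.text
    return ET.tostring(policy, encoding="unicode")
```

Each integrated application would have its own mapping, so adding a new application means adding a mapping rather than changing the DaaS.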

Legacy Applications

Many legacy applications don’t support real-time integration, and for these older applications we generally use batch ETL processes to synchronize data at predefined times during the day.

We start by wrapping a Data in Motion component around the legacy application to begin the integration process. Over time, we might replace the legacy application with a new application, or simply absorb the legacy application data into the DaaS, and extend the user experience and Data in Action services to replace the application completely.

Modern and SaaS Applications

Many modern and SaaS applications come with a set of adapters that can be used to ease the integration process. In these cases, the Data in Motion components can use the native APIs provided to create real-time data synchronization with the DaaS.

Microsoft supports a range of SaaS applications that use Azure Active Directory (AAD) for single sign-on. The DaaS could also be implemented using AAD for authentication to create a single sign-on integration platform.

Data Warehouses and Data Lakes

The DaaS Data in Motion components can be used to prepare and send data to external Data Warehouses and Data Lakes. Synchronization can be configured using either scheduled batch ETL or real-time updates.

Streaming Analytics

The DaaS Data in Motion components can also be used to perform live analysis on data as it flows through them. This can be done using custom-built services that extract BI metadata, or using cloud services such as Azure Stream Analytics.

Sustainability

Each data flow into or out of the solution is encapsulated by its own Data in Motion service or channel, which is isolated from all others. There are minimal dependencies, and those that still exist are buffered and compartmentalized.

Onboarding new resources is easier, as there are distinct integration patterns that specify how to build new flows, and it is not necessary to understand more than a single flow at a time when modifying existing flows.

The hub-and-spoke model used by the Data in Motion services simplifies the typical integration hairballs we see in many legacy solutions. This reduces overall solution complexity and enhances our ability to build federated workflows across multiple solution components.