Blog

Abstract

An open source project called Apache Atlas was accepted as an Apache incubator project on 5th May 2015. This project aims to provide an open metadata repository and information governance framework.

The appearance of metadata management in open source offers an exciting opportunity to rethink the way we manage data in data science projects, creating the trust needed to both share and consume data.

This blog describes the role of Apache Atlas in changing the availability and quality of metadata, which will in turn improve both the agility of the data scientist and the transparency of the results they produce.

Lies, damned lies and statistics

Whether this is a real quote or not, the phrase “lies, damned lies and statistics” comes to many people’s minds when you ask them what they think about statistics.

So many examples of contradictory facts and figures have been published about our health, the economy, the environment and our society that many people have become suspicious of numerically based evidence.

This is not unreasonable. Even data that has been meticulously captured has some form of bias in it that reflects the reason and context in which it was collected. As a result, two data sets purporting to be about the same situation may legitimately contain conflicting information.

Why does this matter?

Data science has an increasing role beyond bringing additional insight to policy makers. Our world is driven by software; in business, transport, our homes, health and many aspects of our daily lives. Part of this software is analytics – the output of data science. So data – and the services that use it – will be embedded in the infrastructure that drives and supports our society.

What happens if the analytic model is wrong, or there is a problem with the data feeding it, or the way that the results are used? How will we know before disaster strikes? And if disaster does strike through unforeseen circumstances, then how do we restore trust?

This blog is about enabling transparency in the origin and meaning of data, the analytic models that operate on it and the IT systems, processes and people that support it.

Transparency will not prevent inappropriate use of algorithms and data, nor mistakes in interpretation, but it will allow others to verify the usefulness of data and the results of data science. This way it is possible to make an informed decision as to whether to trust a piece of data or not.

A simple example

An ice cream vendor is looking for a reliable estimate of how many ice creams she will sell each day. She observes that the hotter it is, the more ice creams she sells.

The ice cream vendor looks for a data service to help. She discovers one that offers an “ice cream sales” predictive service based on the weather.

The analytic model behind this service was trained with data that records the historical trends of ice cream sales against temperature.

The ice cream seller starts using the service to predict her demand. The service is helpful, but not always accurate, which means she sometimes overstocks and at other times runs out of basic ingredients.

Analytics-driven data services may produce the wrong results for a number of reasons:

The data that is fed into the service is just wrong. For example, if the weather forecast used in the ice cream example is not very accurate.

The situation that the analytic model is being used for is not appropriate, typically because it is inconsistent with the training data used to select and configure the algorithm. For example, if the model was trained with data from Italy and the ice cream seller was using it for the UK. The analytic model in this case could be inaccurate if the ice cream buying habits in Italy are different from those in the UK.

The model is too simplistic. For example, the ice cream sales may also be affected by humidity, or whether it is the school holidays.
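To make the "too simplistic" failure concrete, here is a minimal sketch (with hypothetical numbers) of the kind of model that might sit behind the ice cream service: a least-squares fit of sales against temperature alone. Any effect of humidity or school holidays is invisible to such a model.

```python
# A least-squares fit of ice cream sales against temperature only.
# The data below is hypothetical, purely for illustration.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical training data: daily temperature (°C) and ice creams sold.
temps = [14, 18, 21, 24, 28, 31]
sales = [40, 55, 70, 88, 110, 130]

a, b = fit_line(temps, sales)

def predict(temperature):
    return a + b * temperature

print(round(predict(25)))  # forecast for a 25 °C day
```

The model captures the temperature trend, but any variable it was never shown – humidity, holidays, a festival in town – simply becomes unexplained error in its forecasts.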

How does the ice cream vendor know that she is using a good data service for her purpose? With a little more information about how the analytic model behind the service works, she, or maybe a data scientist advising her, could understand the behaviour of the service more deeply.

We call this data transparency.

Achieving data transparency

To assess the behaviour of a data service, particularly one that is backed by an analytic model, we need to understand its past, present and future.

The past describes the origin of any data and processing used to create the service; the present is the current way the service is deployed and being used; and its future is a view on the activities that will maintain the accuracy of the analytics over time since the world it is modelling is continuously changing.

When we think about the origin of data, it is rarely sufficient to just know which system supplied the data.

For example, during analytics development, data is typically supplied to the data scientists by a specialist system that maintains historical copies of data from the operational systems. The operational systems are the systems that actually drive the business. Their data is copied and maintained externally for analytics for two reasons: the workload generated by data science experiments is uneven and disruptive to the operational systems, and operational systems rarely keep enough historical data for analytical exploration.

Operational systems are highly connected and share data. So it may be necessary to trace back through several systems to find the real source of the data.

The flow of data through IT systems is called its information supply chain. Data from different subject areas (aka domains or topics) typically has its own information supply chain. The process of tracking information supply chains is called lineage.

Lineage creates a record of how the systems that process data are linked together and exchange data, making it possible to track down the context and purpose of the original data collection – a critical piece of knowledge for determining whether the data service is going to provide useful insight.

To understand the usefulness of an analytical data service, we need to understand the lineage of the data used to train the analytic model behind the service, as well as the lineage used to supply any data to the data service once it is deployed. With this lineage data, and knowledge of the processing that occurred along the information supply chain, it is possible to determine if the resulting model is compatible with the situation (and corresponding data) that the service is to be used for.
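As an illustration, lineage can be thought of as a graph of copy/derive relationships that can be walked back to the original point of capture. This is a minimal sketch using hypothetical system names:

```python
# A minimal sketch of lineage tracing. Each entry records which system a
# data set was copied or derived from; the system names are hypothetical.

flows = {
    "analytics-sandbox": "historical-data-store",
    "historical-data-store": "sales-system",
    "sales-system": None,  # original point of capture
}

def trace_lineage(system, flows):
    """Return the information supply chain from `system` back to its source."""
    chain = [system]
    while flows.get(system) is not None:
        system = flows[system]
        chain.append(system)
    return chain

# The last entry in the chain is where the data was originally collected,
# and hence where its context and purpose were set.
print(trace_lineage("analytics-sandbox", flows))
```

Real lineage repositories record far richer detail (transformations, timestamps, owners), but the traceback operation at their core looks like this walk.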

Metadata is the name given to information such as lineage that describes data and processing. Metadata literally means “data about data”.

Metadata comes in many forms and delivers a wide variety of information. However, fundamentally it helps to trace the data’s origin and intent.

Self-service and agility

Metadata has value beyond transparency of analytics. A large part of a data scientist's work is locating good data sources.

Metadata describes data in all its forms. This includes where the data is located, how it is stored, how frequently it is changing, what it represents, how it is organized, who owns it and how accurate it is.

Without this type of metadata presented in a searchable form, data-oriented projects are delayed while the team hunts for the data they need. In many data science projects, this process can consume over 70% of the project's resources.

Data scientists also need the detailed lineage and knowledge of the source system that created the data in order to correctly align it with a new analytic model.

Adding governance

Metadata does more than describe data. It also encodes the requirements for how it is managed. Typically this is in the form of classification. The classification is attached to the metadata description of the data. This identifies which governance rules apply to this data.

For example, some data might be classified as personally identifiable information (PII). This label then restricts where this data can be used and by whom.

By encoding the classifications and the rules in machine-readable formats, it is possible for a data platform to automate the execution of many governance requirements, offering both a cost-effective and reliable governance implementation for many legal and ethical requirements.

Data privacy, rights management, an organization's own views on its brand image and the industry regulations it must support can all be handled in this way, provided reliable classification metadata is available and the rules are machine-readable and executable.
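A minimal sketch of how machine-readable classifications and rules might drive automated governance. The classification names, rule and record below are illustrative, not an actual product's format:

```python
# Classification-driven governance: a rule attached to the "PII"
# classification masks values before they reach an unauthorized consumer.
# All names and data here are hypothetical.

def mask(value):
    return "*" * len(str(value))

# machine-readable rules keyed by classification
rules = {"PII": mask}

# metadata: classifications attached to each column description
column_classifications = {"name": ["PII"], "region": [], "sales": []}

def govern(record, user_is_authorized):
    """Apply governance actions to a record before returning it."""
    out = {}
    for column, value in record.items():
        action = None
        if not user_is_authorized:
            for classification in column_classifications.get(column, []):
                action = rules.get(classification)
        out[column] = action(value) if action else value
    return out

record = {"name": "Fred", "region": "UK", "sales": 42}
print(govern(record, user_is_authorized=False))
```

Because the rule is attached to the classification rather than to any individual data set, newly catalogued data classified as PII is governed automatically, with no per-data-set configuration.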

Missing metadata

If metadata is so important to all forms of data-driven processing, why is it rarely available to the data scientist? This comes from the way that metadata is created and managed.

The metadata that most people are familiar with is the metadata associated with a photograph. Most digital camera manufacturers capture information about the light conditions, time, location and camera settings when the photo was taken. This metadata is embedded within the photo and accompanies it wherever it is copied.

The standards associated with photographic metadata help to ensure metadata from different camera manufacturers is available and transferable between different software packages. However, in practice, each metadata provider implements the standards in subtly (and not so subtly) different ways, whilst still conforming to the standard. Software packages also tend to wipe out metadata attributes created by another package that they do not recognize.

What is needed is a single implementation that all manufacturers can use, creating a consistent approach. This single implementation needs to allow innovation by the manufacturers – but not at the expense of metadata fidelity. We also need metadata support that covers all types of data.

Most data that drives systems does not flow with its metadata attached as a photo does. Once the data is extracted from a system, it quickly becomes detached from any context information that describes what the data is about. Data integration software that copies data between systems tries to maintain metadata about the systems it connects to and the data flows between them to enable lineage, but the coverage is patchy.

A single implementation of the best practices associated with metadata management that is used across software packages would improve the fidelity in which metadata values are handled.

Traditionally, metadata management has been provided by specialist tools. Despite the many metadata standards that exist, most metadata capability is specific to particular tool vendors. The metadata capability is added after the systems producing data have been running for a while, and the process for populating the metadata repository is often laborious and error-prone.

Changing the game

In May 2015, a new open source project called Apache Atlas was started to create an open source metadata and governance capability.

The initial focus of the project was on the Hadoop platform, but IBM has been investing in it to broaden its scope, both in the types of metadata it can support, and in the ability to run on different platforms, particularly cloud platforms.

The philosophy of Apache Atlas is that the metadata repository is embedded in the data environment. This means all data activity is captured continuously by default, so there is no need for an expensive and error-prone process to populate the metadata repository after the fact.

Already we see the benefit of having an embedded metadata capability in the Hadoop platform, as different components are being extended to log their data assets and activity in Apache Atlas, enabling the capture of lineage flows through multiple processing engines running on the platform.

Can we repeat this success across the majority of data-processing platforms? One priority would be for cloud-based platforms to embed Apache Atlas, since it is often hard to keep track of data in a cloud service. In addition, we need a focus on systems where data is being accumulated from multiple sources for ongoing reuse.

Expanding the possibilities

Apache Atlas is in its infancy. How would it need to evolve to support a broader ecosystem?

To ensure Apache Atlas is embeddable in as many technologies as possible, it needs plug points to connect it into the hosting platform’s specific security, storage and network infrastructure.

The continuous capture of metadata within a platform creates two new challenges. First, it generates a huge volume of metadata that needs to be continuously and automatically organized, pruned and managed so that it remains useful. Second, the local metadata repository is an island of information, so it is necessary to connect it to other metadata repositories to build up the complete picture of the lineage and data sources available. Both of these aspects must be addressed to create the broader metadata ecosystem.

Finally we need mechanisms for automating governance and recording the action taken. This is necessary to enforce standards and legal obligations in the data creation, management and usage.

Thus I am recommending the addition of three new frameworks in Apache Atlas:

The open connector framework provides a common mechanism to access data and its related metadata. The connector is able to blend the metadata stored with the data, the metadata managed in an Apache Atlas metadata repository and the context information from the execution environment, creating a simple interface for tools and developers to make use of the metadata. The connectors also call the governance action framework.

The governance action framework is called to execute automated governance actions, such as masking when particular conditions are met. Typically the governance action framework is called from the connector framework when data is created, updated, deleted or accessed. However, it can be called at other times, for example, when new processes are deployed, or to select between different connectors.

The discovery framework orchestrates pipelines of discovery analytics to enhance and maintain metadata automatically as the data stored in the platform changes.

These capabilities are added as frameworks rather than closed components to enable future innovation by all types of organizations.

The open source project will offer basic functions by default. These can be augmented through open interfaces with advanced functions sold commercially or developed in house.
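To make the proposal a little more concrete, here is a hedged sketch of how an open connector might present data together with its blended metadata and call a governance action on each access. The class names and method signatures are illustrative assumptions, not actual Apache Atlas APIs:

```python
# Illustrative sketch only: a connector that blends metadata with data
# access and invokes the governance action framework on each read.
# Names and interfaces are assumptions, not real Apache Atlas classes.

class GovernanceActionFramework:
    def on_access(self, asset_name, metadata):
        # In a real system this might record the access, apply masking
        # rules, or block the request; here it just notes the check.
        print(f"governance check: {asset_name} classified {metadata['classifications']}")

class Connector:
    def __init__(self, asset_name, data, metadata, governance):
        self.asset_name = asset_name
        self._data = data
        self.metadata = metadata      # blended: stored + repository + environment
        self._governance = governance

    def read(self):
        """Return the data, triggering governance actions first."""
        self._governance.on_access(self.asset_name, self.metadata)
        return self._data

connector = Connector(
    asset_name="daily-sales",
    data=[{"region": "UK", "sales": 42}],
    metadata={"classifications": ["commercial"], "origin": "sales-system"},
    governance=GovernanceActionFramework(),
)
print(connector.read())
```

The design point this illustrates is that tools never fetch data without the governance hook firing, which is what makes automated, reliable governance possible.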

Building the ecosystem

For Apache Atlas to be sustainable, it needs a broad community of contributors and consumers from across commercial, government and academic organizations.

Through ongoing activity, the project will grow both in capability and in the confidence that people will have to invest in it.

The broader the adoption, the more metadata will be stored in open formats and managed through well-defined and open governance processes.

As a result, vendors, consuming organizations and researchers that work within the ecosystem will benefit from working with more discoverable and assured data, creating a self-sustaining network effect that increases the data available for advanced data-driven services.

Getting involved

If the Apache Atlas project seems of interest, here are some suggestions on how to get involved:

Direct code contribution to the Apache Atlas project. There are many features that still need to be coded.

Research into automation around the identification, capture and maintenance of metadata. Automation keeps the cost of metadata management to a minimum and often improves its accuracy.

New standards for exchanging governance and lineage metadata between metadata repositories, and ways to encode metadata into data flows.

Encouraging vendors/partners and projects internal to your organization to embrace Apache Atlas and its standards to grow the ecosystem of data and processing that is assured by metadata and governance capability.

Data is too important to allow metadata management and governance to be an optional extra for a computing platform. This is an opportunity to make an important step forward in the usefulness, safety and value associated with data driven processes and decisions.

Additional information

The following URL links to a blog series that provides additional information about open metadata and governance and the Apache Atlas project

When a person shares data with an online service, how far should it be shared? On one hand, inappropriate data sharing can break down trust in a service. On the other hand, no-one likes to type in the same information every time they make a request of the service, so some saving and storing of information is certainly desirable.

I started looking at the options for data sharing within a digital service as part of my research into privacy by design. The resulting model shows that data sharing is occurring in a complex technical environment where the needs of different parties (people and/or organizations) need to be kept in balance.

Figure 1 is the start of the data sharing journey. The end user has entered some data on their mobile device that is connected to a digital service.

Figure 1: data sharing – to the digital service

Data sharing scope labelled (1) represents the case where the end user's data does not leave their mobile device or PC. The digital service processes the data locally and potentially shares anonymized results, or requests for actions, with the connected digital service. This sort of pattern is rare because many digital services want to capture the raw data about the user for future processing. However, this approach is very useful where highly sensitive data is involved – such as when biometric information is being used to confirm identity. The digital service does not need the biometric information of the end user – it only needs to know that he/she is an authorized user. This pattern is an example of the data minimization recommended by privacy-by-design practices. It can be used as a mechanism to gain permission to process data that a person would not otherwise want to share.

In the more typical case, data is sent from the end user's mobile device to the digital service. The act of sending the data brings in data sharing scope (2). Between the end user's mobile device and the digital service is a multitude of network service providers, each able to see the packets of information flowing. These service providers can see how much data is flowing, how often and between which devices and services. Sometimes that is enough to guess what is going on. If the data exchange between the mobile device and the digital service is not secured, then the contents of the information packets are also being shared.

Figure 2 considers what happens inside the digital service. There are three data sharing scopes shown, labelled (3), (4) and (5). Each represents data that is seen and used only by the digital service, but for different lengths of time.

Figure 2: data sharing – inside the digital service

Data sharing scope (3) represents data that is used for a single transaction. You can think of a transaction as a business exchange – such as selecting items to purchase and then confirming the order and paying for it. Within the transaction are a number of exchanges of data. The digital service may refer to other data it has stored from earlier transactions.

Some of the data from each of the end user's transactions is often kept for future use. Data sharing scope (4) is data kept for the exclusive use of the digital service when working on behalf of this specific end user. It includes commonly used information such as their name, contact details, preferences and historical information about their transactions. This end user data is typically removed when the end user closes their account or removes their profile. Data sharing scope (5) covers data that is used by the digital service for any user. For example, a navigation system may use the data from all users to detect where congestion is occurring and then use that insight to guide an individual user.

Often a digital service provider supports multiple digital services that the end user can sign up to either incrementally or as a package. The end user is encouraged to use the broader range of services because information they have already entered is pre-populated in the other digital services. This data sharing is shown in Figure 3.

Figure 3: data sharing – digital service packages

Data sharing scope (6) shows the sharing of data between the digital services during a transaction. This may be to offer the end user additional capabilities as they use the service.

Data sharing scope (7) covers the sharing of end user data between the digital services for the specific user and data sharing scope (8) covers the shared data that is made available to all digital services in the package for all users.

Some digital service providers have many digital services that are grouped in packages. Figure 4 shows data sharing within a digital service provider with multiple digital service packages.

Figure 4: data sharing – digital service platforms

The data sharing between multiple digital service packages follows a similar pattern to that within a single digital service package. There is sharing during a transaction – data sharing scope (9) – and sharing of user data across digital services from different packages whenever the specific user is being served – data sharing scope (10). Data sharing scope (11) covers data shared with all digital services from the digital service provider, irrespective of the package they are in.

I have called these out as a separate set of sharing scopes because the packages typically represent different lines of business. So there may be a package of digital services for banking, another for insurance and another for loans. From the end user's perspective, just because these packages are owned and operated by the same organization does not mean that the sharing of information between them is always acceptable.

Figure 5 adds the complexity brought in by the use of external cloud platforms. Cloud platforms are particularly complex environments for understanding data sharing because there are often multiple organizations involved in supporting a cloud-based digital service. The result is that deep in the technical implementation, out of the sight or control of the end user, their data is being processed and stored on computer systems owned by organizations unknown to them.

The cloud platform provider is the organization that provides the data centers and the infrastructure (computers, operating systems and basic services) for running a digital service.

Figure 5: data sharing – cloud platforms

The cloud platform provider can see the number and types of requests being received by the digital services they host. If the digital service provider does not properly secure and encrypt the data stored with the cloud provider using their own private encryption keys, they are inadvertently sharing their data with the cloud provider.

Digital service providers are not restricted to using a single cloud platform either. Figure 5 shows a digital service package that spans two cloud platforms. This means that data shared with or inferred by the cloud platform provider – data sharing scope (12) – may be received by multiple organizations.

Figure 6 shows the sharing of data with third parties. Data sharing scope (13) indicates the sharing of data within a transaction – for example, a call to a payment service during a purchase. Data sharing scope (14) is where accumulated information about the end user is passed to a third party. This typically occurs when the end user needs an account on the third party’s digital service platform for their digital services to operate properly. Data sharing scope (15) covers more general sharing of data with third parties by the digital service provider. This may include personal data about the end users but is more likely to be aggregated information about the digital service’s operation and the volumes of different types of requests it is processing.

Figure 6: data sharing – third parties

Once the data passes to a third party, the digital service provider loses control of the data and must rely on contracts and other legal obligations to control its use. This is why no details are shown on the diagram as to what happens to the data once it is received from the digital service provider.

Finally, Figure 7 shows the sharing of data with everyone – or at least, anyone who signs up to a service or open data site. At this point there is a total loss of control over how this data will be used, combined and shared going forward.

Figure 7: data sharing – public

Data sharing scope (16) shows the digital service provider making data public; (17) is the cloud provider publishing data and (18) is a third party that received data from the digital service provider that is making the data available for public use.

Public data sharing may of course be under the control of the individual – for example, when they send a message to social media to publicize that they have purchased something or have achieved a goal. It may also be an intentional behaviour of the service. However, for many digital service providers and cloud service providers, the presence of a particular data set in the public domain may be the first indication they have that they have had a data breach.

So what do these data sharing scopes teach us, apart from the fact that this is a complex topic :)? This model is firstly an analysis tool. It provides a scheme for a digital service provider to characterize the sharing scopes for each of the data sets they manage. This analysis may identify additional opportunities to share data, and places where additional security may be required. Secondly, this model provides a framework to explain to an end user how their personal data is being shared.
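As a sketch of the model being used as an analysis tool, a digital service provider could tag each data set it manages with the sharing scope it believes applies, making anomalies and over-sharing easy to spot. The scope numbers follow the figures above; the data set names and scope assignments are hypothetical:

```python
# Characterizing data sets by sharing scope. Scope numbers follow the
# figures in the text; data set names here are purely illustrative.

SCOPES = {
    1: "on end user's device only",
    4: "digital service, this user only",
    5: "digital service, all users",
    15: "shared with third parties",
    16: "public",
}

# A hypothetical provider's inventory of data sets and their scopes.
datasets = {
    "biometric-template": 1,
    "order-history": 4,
    "congestion-model": 5,
    "aggregate-request-volumes": 15,
}

# Listing the inventory from most to least private shows at a glance
# which data sets carry the widest exposure.
for name, scope in sorted(datasets.items(), key=lambda kv: kv[1]):
    print(f"{name}: scope ({scope}) – {SCOPES[scope]}")
```

A review like this could then ask, for each data set, whether the recorded scope matches both the security controls in place and what end users have been told.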

In summary, understanding the data sharing scopes helps digital service providers design the optimal use of the data they are capturing and where to secure it in order to balance the needs of their end users’ privacy and the breadth of services that could be offered.

Abstract:

Almost all data that is generated for our digital economy is about individuals, and data-driven services typically provide targeted, customised services to people as they go about their daily lives. How does an organization whose business is built around data ensure that their work is ethical?

Privacy is a very personal perspective and it is contextual. We tend to have fewer concerns about our privacy when we deal with an organization that we trust. More importantly, when trust is present, we are more likely to grant access to our data and allow the organization greater license to process it. Maintaining this trust is an essential part of a digital economy.

To most people, today’s digital technology is baffling. They use the technology and see its benefit – but when they hear about cyber attacks; identity theft; the buying and selling of their data; phishing, ransomware and other scams, they have no foundation on which to judge the size or seriousness of the problem.

Emerging Regulation

In answer to these threats, countries around the world are passing legislation that is aimed at protecting their citizens from the inappropriate and careless processing of their data.

For the European Union we have the General Data Protection Regulation (GDPR). The GDPR defines a comprehensive set of requirements for the processing and protection of personal data. It seeks to address this issue of trust by creating high standards for data security and transparency in the way this data is processed. The GDPR makes no specific recommendations on how this is to be achieved – just what the effect must be – which makes sense in such a fast moving technical landscape.

The breadth of the GDPR is also interesting – it covers all data that could potentially be connected with an individual. So it covers monitoring of assets, devices and activity at specific locations, since this data may be used to understand and target an individual. It is effectively scoped to our entire digital economy and will have a significant impact on all commercial activity in this space.

So what are the implications of this type of legislation? How will digital businesses thrive in an open and transparent way, protecting their investment whilst creating a level of choice and control in people’s lives?

As with all technology, our digital technology is essentially ethics agnostic but it pushes the art of the possible to new limits:

The availability of a wide range of data from many sources.

The ability to cheaply process and link this data together to understand a bigger picture.

The accuracy with which an individual can be identified and targeted.

The ability to pinpoint location for contextual insight and surveillance.

The application of this new insight to a wide range of activities and actions.

The operation of this insight in real-time or near real-time.

It is the use of technology that determines its impact – in terms of how it consumes people’s time, their ability to move about and the information they see when making a decision.

The digital economy begins to fail when individuals become overwhelmed with interruptions as they go about their daily lives. Imagine getting an unsolicited text message offering a new service every minute – how long would it be before you switched your phone off?

It also fails if people are reluctant to share their information with a digital service because of unknown and imagined negative consequences.

In many respects we are already living in a virtual reality where our perceptions of the world and the opportunities we see are shaped by the algorithms that funnel information and offers to us. If these algorithms are trained on data that reflects the prejudices and inequalities of our society, they optimise and amplify discrimination in a way that is both illegal and unethical.

So how does the digital economy respect an individual's rights, privacy and freedom whilst providing personalised digital services?

Building Trust

One of the first lessons I learnt when I started looking at ethical issues around big data and analytics is that there are no universal definitions of concepts such as privacy, freedom and ethics. Each of these is measured against a personal perspective based on experience, education, religion, culture, upbringing and family – and this perspective is not static. It can change as people become more familiar with a situation and see its consequences and benefits.

So you can imagine a sign-on sequence to a service as:

Computer:
Who are you?
Person:
I am Fred, here is my password, fingerprint etc.
Computer:
OK I can see that you are Fred, which means you can do a,b and d.
Person:
Thank you computer but before we go on I would like to set some
ground rules for our interaction. This is what you can do for me;
this is what you can store about me to support our interaction and
this is what you can share.
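The "ground rules" in this exchange could be captured as a machine-readable consent record that the service stores and honours on every interaction. This is an illustrative sketch; the field names and values are assumptions:

```python
# A hypothetical machine-readable record of the ground rules Fred set
# during sign-on. Field names are illustrative, not a real standard.

ground_rules = {
    "user": "Fred",
    "may_do_for_me": ["recommendations", "order-tracking"],
    "may_store": ["contact-details", "order-history"],
    "may_share": [],  # Fred permits no sharing with third parties
}

def is_permitted(rules, action, item):
    """Check an action/item pair against the user's ground rules."""
    return item in rules.get(action, [])

print(is_permitted(ground_rules, "may_store", "order-history"))  # True
print(is_permitted(ground_rules, "may_share", "order-history"))  # False
```

Expressing consent this way means the service can enforce it automatically and, just as importantly, show the user exactly what they have agreed to.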

Understanding what permissions an individual is likely to grant is where we need to bring in the expertise of social science and psychology.

From their work we know that individuals are not a single persona. Their behaviour is influenced by the context in which they find themselves. This also influences what information they are willing to share. We can think of this as spheres of trust. So we are more open with our friends than with strangers.

Figure 1 – spheres of trust

We also have relationships with organizations and we extend different levels of trust and data sharing accordingly.

The spheres of trust begin to break down when we use digital technology because the same device is used in multiple spheres.

Figure 2 – mobile use

In addition, it is a complex world of interlinking services behind the logo. Data can be shared with multiple organizations without the awareness of the individual.

How do we design systems that respect our spheres of trust when data is collected from our devices and shared with other organizations?

Privacy by Design

Privacy by design is an emerging discipline that recognizes this complexity and seeks to design systems that are cautious in their use of personal data: they assume they cannot know which sphere of trust they are operating in, they minimize what they collect, process and share, and they purposefully avoid identifying the individual in the data they collect.

It also encourages transparency in processing, so the individual understands what data is collected, how it is maintained, what it is used for, how long it is kept and where and with whom it is shared.

A digital business that respects an individual's choices, gives them control over the processing that occurs and how data is shared, and operates its services in an open and transparent way will be given time to create familiarity, demonstrate value and build up trust in its operation. As trust grows, so does loyalty and the broader use of the organization's digital services.

Summary

Establishing trust and transparency will become as necessary to a digital business as cyber-security.

There is huge scope for differentiation around the ethical processing of data since a higher level of trust leads to a more open sharing of data for a broader set of services.

Everyone has their own notions of privacy and ethical behaviour. It is important to offer choice and gather consent so individuals can customise the behaviour of their digital services to a level they are comfortable with. Over time this builds trust and encourages them to open up to a broader use of their data by digital services.

Many analytical and reporting systems use data that originated in other systems. In fact, the data has often already flowed through multiple systems, receiving additional values whilst being transformed, combined and aggregated in multiple ways along its journey.

How does an organization understand the true meaning of data served up in a report, or used in an analytical calculation, when its origin and the transformations it has undergone are not clear?

Without this transparency, it is not possible to ascertain that the appropriate data is being used, that it is complete and timely, and that it has not been corrupted along the way.

Achieving transparency around the origins of data is a difficult task because there are typically many technologies involved that have been deployed into the landscape over the years in a piecemeal manner, each focused on the needs of a particular project or functional requirement. There is rarely a well-architected coherent end-to-end implementation that delivers the data. As such, additional techniques need to be employed to piece together the flow of data and the processing that affected it.

There are multiple techniques that can be used to deliver insight into the data flows, but each has different costs associated with it. The choice of technique depends on the type of question you wish to answer and the level of confidence that you need in the answers.

In brief, the techniques are:

Design lineage – providing a view on how the systems and the code that copies data between them are linked together to create an end-to-end data flow (also known as an information supply chain). With this type of lineage it is possible to understand the implementation of the data flows to ensure they are appropriate, complete and efficient.

Business lineage – providing a customized, filtered view of the design lineage focused on specific types of processing and/or key systems. A specialist uses business lineage to verify that the processing is correct.

Operational lineage – providing operational information that shows when data was copied, how much and the types of errors that occurred. This type of lineage is used to demonstrate that the data flow is operating correctly. It is useful to identify where data has been lost or missed out in the processing, and where errors are occurring in the information supply chain.

Provenance – provides value-level logging for tracing the origin of a specific data value from original source to destination. This type of traceability is used to verify that an individual data value is correct, whereas the lineage mechanisms prove that the process is correct.
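One way to see how the techniques differ is by the records each one keeps. The sketch below contrasts the three record types; the field names are illustrative, not from any particular lineage tool:

```python
from dataclasses import dataclass

@dataclass
class DesignLineageLink:        # static: how systems and jobs connect
    source: str
    process: str
    target: str

@dataclass
class OperationalLineageEvent:  # per run: how much flowed, when, with what errors
    process: str
    run_time: str
    rows_copied: int
    errors: int

@dataclass
class ProvenanceRecord:         # per value: where one specific value came from
    value: object
    derived_from: list
    process: str

# Business lineage is not a separate record type: it is a filtered,
# simplified view over the DesignLineageLink records.
link = DesignLineageLink("EMEA sales tracking", "daily ETL job", "data mart")
run = OperationalLineageEvent("daily ETL job", "2015-06-01T02:00", 12840, 0)
assert run.errors == 0
```

Notice how the cost grows down the list: design lineage is one record per connection, operational lineage is one record per run, and provenance is one record per value.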

What follows is a description of how these different mechanisms can be implemented. It uses a simple example of a sales report to illustrate the different levels of detail that can be captured and the resulting insight.

The sales report

The sales report example is a monthly report showing the sales made in each country (see figure 1).

Figure 1: Sales report layout

Figure 2 shows the systems involved in providing data to the sales report. The sales made in the organization’s stores are recorded in regional sales tracking systems (1) and there is also a digital sales channel system (2) for sales through their website. These digital sales are attributed to the country where the order is delivered.

Information from these sales tracking systems is gathered together into a landing area (3) file system and then picked up and compiled into the monthly totals within a data mart (4). The final report (5) is assembled by combining data from the data mart with the active targets maintained in the sales management system (6).

Figure 2: Sales tracking and reporting systems

The data for the report is distributed throughout these systems and is gathered together through a number of different processes. Figure 3 shows the origin of different parts of the report.

Figure 3: Data origins for the sales report

On the left hand side are the sales tracking systems. They have a record of every transaction and the country where it occurred. These sales records are copied unchanged into the landing area and then picked up and aggregated into the data mart. There are two different processes running to create the data mart.

There is one ETL job that runs every day and aggregates the sales from the Americas sales tracking system, EMEA sales tracking system and the AFE sales tracking system.

There is a message-based process that takes messages from the digital channel sales system and adds the sales transactions they refer to into the data mart totals.

The reason there are two processes is that the digital channel sales system was added many years after the other sales tracking systems, and no-one wanted to change the original ETL job to add the data from the digital channel, particularly since the processing required is different.

The resulting data mart is queried by the report generation process and combined with the query results from the sales management system that returns the targets for each country. The report generation process calculates the percentage of the target attained for each country and whether they are on target given how far through the year they are.

Even with this simple example report, it is possible to see some of the challenges associated with understanding the origin of data and the processes that copy and transform it. For example:

Some data values are not stored – they are calculated by code embedded in one of the systems that deliver the data. For example, “% of Target”.

Some data values are derived from other data values. For example, “Sales this Month” is calculated by adding up the “Sales Transaction” values.

Some data values occur in every system and are used to correlate data together. (For example, Country Name). An error in these values results in errors in correlating and aggregating data.

Different systems use different names for the same type of data; or use the same name for different types of data. There are no common standards.

There are many different types of technology to implement the systems and the processes that copy data between them.

Different systems use different data formats for the same type of data. The information supply chain has to correctly transform this data to support its use in a downstream system.

So it is a detective job to piece the information supply chain together.

Building an understanding of the information supply chain

The detective work begins with identifying the relevant systems and data values that make up the information supply chain. Typically an information supply chain focuses on:

The flow of data to a specific report, analytic or application.

The flow of data related to significant type of data, such as customer or product data.

In this example we will focus on the information supply chain for the sales report. The information supply chain is described in a metadata repository. Typically it is built up starting at the report end of the information supply chain and working backwards, identifying the data values of interest and describing them in the glossary of the metadata repository: one glossary term for each type of data and one term for the overall report.

Then looking at the implementation of the report, it is necessary to identify the data schemas and the processes (functions) that build the report.

These data schemas and processes are also defined in the metadata repository and linked to the glossary terms. See figure 4.

Figure 4: glossary entries for the sales report

When the functions combine different types of data to create the report data values, new glossary terms are created for these types of data and they are linked to the appropriate schema.

This process iterates backwards along the information supply chain by examining the processes that populate the schemas that are identified as part of the information supply chain, documenting them and the schemas they draw data from; then repeating the process until the original sources are encountered.

The result is a glossary with a term for each kind of data requiring traceability and an entry for each type of schema managing that kind of data. There are links between the two.
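The glossary-to-schema linkage described above can be sketched as a simple data structure. The term names and schema element paths below are illustrative, loosely based on the sales report example:

```python
# A glossary term for each kind of data, linked to the schema
# elements (tables/columns) that hold it.
glossary = {
    "Sales this Month": {
        "description": "Total value of sales recorded in the month",
        "bound_to": ["data_mart.monthly_totals.sales_value",
                     "sales_report.sales_this_month"],
    },
    "Country Name": {
        "description": "Name of the country where the sale occurred",
        "bound_to": ["landing_area.sales_file.country",
                     "data_mart.monthly_totals.country",
                     "sales_report.country"],
    },
}

def schemas_holding(term: str) -> list:
    """Which schema elements store this kind of data?"""
    return glossary[term]["bound_to"]

assert len(schemas_holding("Country Name")) == 3
```

With these links in place, a question such as "where is country data stored?" becomes a simple lookup rather than detective work.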

Design lineage

The process described above provides the foundation definitions for traceability called design lineage. The glossary identifies the schemas where a particular kind of data is stored.

Design lineage shows the tree of processes and data flows that provide data to the report. See figure 5.

Figure 5: Design lineage for the sales report

Design lineage is often documented by hand. However some tools, typically ETL tools, provide support for design lineage for the parts of the processing that are modeled for their engines. When these tools are in place, it is possible to drill down into their processes to understand the detail of the transformations. (Figure 6).

Figure 6: Drill down of design lineage

The documentation needs to be in a machine-readable form connected with the glossary and schema definitions. Ideally it is stored with them in the same metadata repository so that the flow can be connected to the definitions. This makes it possible to dynamically generate reports of different perspectives of the lineage.

Design lineage is a static definition of the implementation. The size and effort associated with the metadata for design lineage is proportional to the complexity (number of systems, schemas and processes) of the implementation rather than the volume of data or the frequency with which data flows. Thus the ongoing cost of maintaining design lineage is determined by how frequently this implementation changes. There needs to be a lineage maintenance step introduced into the change management processes to ensure it remains current.
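Once design lineage is held in machine-readable form, questions such as "which systems feed this report?" can be answered by traversing the flow graph. A minimal sketch, using the systems from figure 2 (the edge labels and system names are paraphrased from the example):

```python
from collections import deque

# Design lineage as (source, target, process) edges.
edges = [
    ("regional sales tracking", "landing area", "file copy"),
    ("digital channel sales",   "data mart",    "message process"),
    ("landing area",            "data mart",    "daily ETL job"),
    ("data mart",               "sales report", "report generation"),
    ("sales management",        "sales report", "report generation"),
]

def upstream(system: str) -> set:
    """All systems that feed data, directly or indirectly, into `system`."""
    feeds = {}
    for src, tgt, _ in edges:
        feeds.setdefault(tgt, []).append(src)
    seen, queue = set(), deque([system])
    while queue:
        for src in feeds.get(queue.popleft(), []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

assert upstream("sales report") == {
    "data mart", "sales management", "landing area",
    "regional sales tracking", "digital channel sales",
}
```

The same graph, traversed in the opposite direction, supports impact analysis: which reports are affected if a source system changes.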

Business lineage

Design lineage is useful for architects to understand the implementation of how data flows. However, subject matter experts in the business who wish to audit the processing on the data can find it complex to navigate. Business lineage provides simplified views over the design lineage to support different types of analysis by the business. A business lineage report may, for example, only show the major systems, or may eliminate the systems and job structures to only show the transformation. Figure 7 shows a business lineage report for the sales report that focuses only on the functions that create the report data.

Figure 7: Business lineage for the sale report

Since business lineage is derived from the design lineage, no additional metadata management is required to support it.

Operational lineage

The design and business lineage show that the implementation of the information supply chain is correct. Operational lineage reveals problems in the ongoing execution of the information supply chain by supplementing the design lineage with logs from the operational environment. These can show how many data items were copied, when, and whether there were errors. (See figure 8.)

Figure 8: Operational lineage overlay

Since the operational lineage is gathered each time the processes run, its volume is proportional to the level of activity in the information supply chain.
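Overlaying run logs on the design lineage can be as simple as comparing the processes that were expected to run against those that appear in the logs. A sketch, with an illustrative log format (not taken from any specific tool):

```python
# Processes from the design lineage that should run every day.
expected_daily = {"daily ETL job", "report generation"}

# Illustrative operational log entries for one day.
run_log = [
    {"process": "daily ETL job",   "rows": 12840, "errors": 0},
    {"process": "message process", "rows": 311,   "errors": 2},
]

ran = {entry["process"] for entry in run_log}
missed = expected_daily - ran
errored = {e["process"] for e in run_log if e["errors"] > 0}

assert missed == {"report generation"}   # never ran today
assert errored == {"message process"}    # ran, but with errors
```

This is exactly the kind of check that identifies where data has been lost or missed out in the processing.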

Provenance

The operational lineage demonstrates whether the expected amount of data is flowing through the information supply chain. However, it does not show how a particular value was derived. The process of recording the exact values that were used in each of the functions and transformations on every data value is called provenance.

Provenance requires the logging of all input values and results that flow through the information supply chain in a way that can be correlated with the design lineage.

Provenance can generate a huge amount of logging data that, since it contains the actual data values used, needs the same security protection as the actual data. This provenance data grows in proportion to the amount of data flowing through the information supply chain and the complexity of the processing. Often, provenance data is only gathered around critical transformation processes where value level inspection of the information supply chain is needed. Otherwise the provenance data can become overwhelming.
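A value-level provenance log can be produced by wrapping each derivation so that its exact inputs are recorded alongside the result. The sketch below is illustrative; the `derive` helper and log format are hypothetical:

```python
# Value-level provenance: record the inputs behind each derived value
# so that a single report figure can be traced back to its sources.
provenance_log = []

def derive(name, func, inputs, process):
    """Compute a value and log the exact inputs used to derive it."""
    value = func(inputs)
    provenance_log.append(
        {"value_name": name, "value": value,
         "inputs": list(inputs), "process": process})
    return value

# "Sales this Month" derived from individual sales transaction values.
total = derive("Sales this Month", sum, [120, 75, 310],
               process="daily ETL job")

assert total == 505
# The log holds the actual data values, so it needs the same
# security protection as the data itself.
assert provenance_log[0]["inputs"] == [120, 75, 310]
```

Even in this toy example the log is larger than the data it describes, which is why provenance is usually reserved for critical transformation points.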

Making lineage work

The sales report used to illustrate the different types of lineage and provenance in this paper has been chosen to be as simple as possible, whilst still illustrating the different mechanisms. In reality, reports, analytics and the systems that provide data to them are many times more complicated, and the landscape is changing continuously.

From the description above, it is easy to see that for a modern enterprise, the process of gathering and managing the metadata needed to provide traceability for data is significant. It needs to be managed as a fundamental IT infrastructure service.

Whatever technology is used to manage lineage and/or provenance, the usefulness of the results will depend on the effectiveness of the processes that surround the lineage tooling. For example:

Ensuring the design lineage is kept up-to-date as changes are made to the implementation of the information supply chain

Gathering and preserving the logging data used for operational lineage and provenance.

Many organizations find it impractical to manage lineage and provenance for all of their systems. As a result, they focus on the information supply chains that are core to their business. Ultimately, the keys to successful traceability and transparency of data flows are simplification of the system landscape and information supply chains, standardization of data structures and definitions, and planning for lineage/provenance data gathering as part of the core function of new capabilities.

Factors for a successful organization-wide shared information service

Abstract:

The need for a high quality information delivery service is critical in real-time applications supporting safety and cost optimization, such as air traffic management.

How can a large, complex organization deliver a shared information service to satisfy its core operations and analytic use cases, particularly when it already has a legacy of IT systems and a service that must operate without downtime during the period of transition?

This paper summarizes the experiences of 25 organizations as they struggle with the physical realities of information management and the cultural challenges of becoming a data-driven organization with a shared information service at its heart.

It characterizes the factors that contributed to the success or failure of their shared information service. It considers how the projects were structured and led, the way information is governed, the standards used in the service and the way the organization was prepared for the transition as well as the technologies employed.

I. Introduction

Over the last three to four years, the term “big data” has become recognized as a key challenge within both commercial and public sector organizations. The term is misleading because many of these organizations had already been processing large volumes of data in their operational and data warehouse systems and, as a result, had considerable skill in managing their data.

However, big data does not just refer to volume of data. It also encompasses a high variety of data sources and consumers along with the need to support a range of velocities for both data capture and responses, stretching from hard real-time to batch.

It is when the variety and the velocity of data are added to the large volume that the problem is characterized as big data. New approaches are needed to process this data since a big data solution must be able to blend all types of structured and unstructured data whilst also supporting both operational and analytical (batch, historical) workloads.

Despite the claims of various technology advocates, there is no single technology that can support this range of requirements. In addition, even within a single organization there are many views of what concepts and data values mean, reflecting the different perspectives of each part of the organization.

So any shared information service must be designed holistically, taking the organizational dynamics into account as much as the technology.

II. Background of the Study

Over the last three years, I have had the privilege of working with a wide variety of organizations on their information strategy, architecture and governance.

This paper distills the observations I have made whilst working with 25 of these organizations who were embarking on the development and operation of an enterprise-wide shared information service.

The organizations come from the banking, insurance, industrial, telecommunication, distribution and public sectors. Each of them is complex, with multiple independent operating units (or silos) that need to share information. The majority of these organizations are dispersed across multiple countries, each of which has its own regulations and data sovereignty laws.

Although the organizations are from different industries (affecting the subject areas of the data they focus on and the regulations they must support) there is surprising commonality in the problems they are trying to address and the approaches they use. Most of them need their shared information service to cover core operational data such as people, organizations, business transactions, reference data and data from external organizations. They have deep skills in managing traditional structured data, but want to extend their capabilities to extract value from log files, documents, emails, audio recordings and video [1].

Many of these organizations wish to become data-driven and use analytics to support their business operations and decision-making [2].

They want their employees to be self-sufficient in data. This means they can locate and use the information they need either for routine operations, ad hoc decision making or during an unexpected incident.

They want their operational processes to work with authoritative data. This is data that contains the official, best available values that the organization has.

They want their data management processes to be as timely and cost-effective as possible.

They want to comply with necessary regulations and also deliver their brand value and ethics through their use of data.

Finally, they want to continuously improve their operation through the analysis of data generated by their operational processes (digital exhaust) and through the use of analytical decision models in their automated processes.

These are all ambitious goals and the resulting projects are often multi-year. The study supporting this paper comes from my observations of the project dynamics and approaches these organizations take at key stages of the journey of rolling out their shared information service.

At the time of writing this paper, two of the 25 projects have failed and been stopped. The rest are still ongoing, and their statuses range from highly successful to struggling.

The reasons for this are complex, but I see similar factors in the organizations that are successful. By contrast, the organizations that are unsuccessful have typically neglected to implement a key initiative or design principle that the successful organizations put in place, rather than making outright mistakes or poor technology choices.

III. Observations

The observations from the study cover the technology, architecture, standards and organizational aspects of the shared information service.

A. Use of technology

Organizations that put all of their faith in a single technology to deliver all of the use cases for a shared information service typically hit problems in performance, cost or capability. For example:

Big data is typically associated with specific technologies such as Apache Hadoop. However, experience has taught many organizations that although Apache Hadoop can deliver a cost effective batch-oriented approach to managing data at scale, it is not on its own sufficient to deliver advanced information services to an organization that needs data to drive its business.

Technologies for supporting a service-oriented architecture (SOA), such as messaging, connectors, adapters, an enterprise service bus and a service directory, are insufficient to support a shared information service. SOA technology focuses on defining the format and structure of services. This is necessary, but a shared information service must also focus on managing the data values that pass through the service interface [3].

Data integration technology provides the capability to copy and synchronize data between different systems. Data virtualization technology can blend data from different systems on demand, without taking a copy. Each approach has their uses, depending on how rapidly the values are changing, how consistent the data sources are, how often the data is queried and how much transformation is needed to deliver appropriate data to the destination.

Using multiple technologies adds complexity to the shared information service. Often its end-to-end operation goes beyond the knowledge and understanding of a single person. A team of people with different specialisms is needed to integrate the technology together so that the resulting service operates consistently and collectively meets its service level agreements (SLAs).

Our second principle states that designing a multi-technology service requires a pattern-based design process. The design patterns offer a common language to discuss design options that are independent of the technology and create a clear definition of the intended function [5].

This technological complexity cannot be exposed to the users of the shared information service. They need simple, consistent interfaces to locate and extract data. However, the consumers of data from the service need transparency on where data came from and the transformations that were made to it by the information service in order to determine that the data they are using is fit for purpose. Similarly, information owners need to understand how their data will be used before they are willing to connect their systems to it.

Our third principle for a successful shared information service states that it must deliver three types of data:

The organization’s data used for its operations and analytics.

Metadata that describes where this data comes from, where it flows to and how it is being managed, transformed and processed inside the shared information service.

Technical blueprints and operational logs that enable the technical team to understand how the shared information service is constructed and how it is operating. Without this, it quickly becomes impossible to evolve the service over time.

B. Anatomy of a shared information service

The metadata, technical blueprints and operational logs collectively explain how the different parts of the shared information service are delivered. Organizations working on the shared information service focus these definitions on how systems connect to the shared information service and exchange data; how data flows through the information service; the lifecycles of the data and the processes using the data; and the special management necessary for different types of data.

1) Services and endpoints

The fourth design principle of the shared information service is that it offers well-defined interfaces to access and update the shared information. The more successful organizations include in this definition:

The structure of the interface in terms of the operations that can be called and the parameters passed.

Descriptive information (metadata) about the data passed in the parameters covering the meaning of the data elements, the valid values and the profile of these values.

Description of the context information from the caller that must accompany each request. This context information typically includes authentication tokens along with identifiers and type information for the process that is using the service. It may also include location or other information about the caller. The context information is used to drive governance functions in the underlying service.

Description of the expected interaction patterns of use of related operations offered by the service.

Systems connecting to the shared information service often use different data identifiers and structures and will need to map between their data representations and those of the shared information service. So published best practices for implementing these “adapters” can speed up the adoption of the shared information service.
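The context information described above can be modelled as a small structure that accompanies every request and drives governance rules inside the service. A sketch under stated assumptions; the field names and the sample governance rule are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RequestContext:
    """Caller context that accompanies every service request."""
    auth_token: str     # who is calling
    process_id: str     # which business process is using the service
    process_type: str   # used to drive governance rules in the service
    location: str = ""  # optional information about the caller

def authorize(ctx: RequestContext, operation: str) -> bool:
    # Illustrative governance rule: only reporting processes may
    # call bulk-extract operations.
    if operation == "bulk_extract":
        return ctx.process_type == "reporting"
    return True

ctx = RequestContext("token-123", "sales-report-42", "reporting")
assert authorize(ctx, "bulk_extract")
assert not authorize(RequestContext("t", "p", "operational"), "bulk_extract")
```

Because the context travels with each call, the governance behaviour can vary by caller without the caller's code having to change.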

2) Information supply chains

The flow of a particular type of information between systems is called its information supply chain [5]. Along the information supply chain, different processes transform, filter, combine and deliver data. Where a shared information service is in operation, the information supply chains pass from systems pushing new data values into the shared information service, through the stores and processes within the shared information service and to the systems consuming the data. The information supply chain will follow a different path for different types of data.

The design of an information supply chain is captured in metadata and is used to:

Understand the impact of change to the shared information services.

Understand the impact of an outage within the shared information service.

Demonstrate the integrity of data provided through the shared information service.

Track the correct operation of the shared information service by comparing the operational logs against the information supply chain design.

The information supply chains provide the core definition of the behavior of the shared information service. Thus designing with information supply chains is the fifth design principle of a successful shared information service.

3) Key lifecycles

The data and processes supported by the shared information service are operating a number of different lifecycles. These lifecycles determine how the data changes over time.

Software development lifecycle – this lifecycle controls code changes to the shared information service. New data types, interfaces and internal capability are introduced through the software development lifecycle.

Analytics development lifecycle – this lifecycle controls how analytics are developed and deployed using the data from the shared information service. Analytics enables an organization to continuously improve its operation, innovate and understand risks. The shared information service needs to include the ability to retain historical values for the data flowing through it, and its operational logs so that comparisons can be made over time [6].

Information lifecycles – not all data is the same. Some data is long lived and used in many systems. Other data is maintained within a small number of systems and then distributed through the shared information service. Other information never changes because it represents an event that occurred. Each type of data has a natural lifecycle that determines how it is typically created, maintained and deleted. The services of the shared information service need to support these lifecycles in the interfaces and internal processes.

These lifecycles define many of the interaction patterns with the shared information service. Design principle six recognizes that the more successful organizations design these lifecycles into the shared information service early in its lifetime so that the appropriate organizational adjustments can be made to support any new roles they create.

4) Reference data

Every shared information service needs reference data to define the valid values for particular types of data. This reference data may reside outside of the shared information service, but it is more coherent if the shared information service team is responsible for managing it, and also provides mapping services for connected systems to maintain the mappings between the values they use internally and the values used in the shared information service.

Design principle seven states that a shared information service should maintain its own reference data and provide mapping services to allow a connected system to maintain relationships between its entity identities, code table values and other reference data values and those used by the shared information service. The benefits of this are:

It is cost-effective since this mapping service only has to be implemented once – rather than in each connected system.

It speeds up the development of the adapter for a system that is to be connected to the shared information service.

It provides a useful debugging and analysis resource for end-to-end diagnosis and optimization.
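The mapping service of design principle seven can be sketched as a lookup from a connected system's local code values to the shared reference values. The system names and country codes below are illustrative:

```python
# Each connected system's local values mapped to the shared
# information service's reference values.
mappings = {
    ("digital_channel", "country"): {"UK": "GBR", "USA": "USA", "Ger": "DEU"},
    ("emea_tracking",   "country"): {"826": "GBR", "276": "DEU"},
}

def to_shared(system: str, code_table: str, local_value: str) -> str:
    """Map one system's local value to the shared reference value."""
    return mappings[(system, code_table)][local_value]

# Two systems using different local codes resolve to the same shared
# country code, so their data can be correlated in the shared service.
assert to_shared("digital_channel", "country", "UK") == "GBR"
assert to_shared("emea_tracking", "country", "826") == "GBR"
```

Implemented once inside the shared information service, this lookup replaces a bespoke mapping in every connected system's adapter.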

C. Common information model(s)

Design principle eight states that the shared information service should offer a harmonized definition of the information it shares. This definition is known as a common information model [7]. The common information model is misnamed because it is actually a collection of models offering different views or perspectives of the data. These models are linked together through metadata to show how the corresponding perspectives relate to one another. The common information model for a shared service should include:

A glossary of terms describing each of the data attributes and related concepts in prose. This glossary is used by everyone involved in the project to understand the meaning of data supported by the shared information service.

Subject areas and spine objects that group the glossary terms into topics and flattened business objects. This structuring is useful when assigning data subject matter experts and owners to the shared information service’s data landscape and as a starting point for data modelling.

Logical data element models capturing different levels of detail about a particular entity. Data element models are used as definitions for the parameters on the service interfaces. Typically there would be a data element model for each significant entity that describes the structure for identifying the entity, another for a summary of its principal attributes and then additional definitions covering increasing levels of detail.

Logical data models for any data stores maintained by the shared information service.

Physical models of service interfaces and data stores maintained by the shared information service.
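The linkage between these layers can be sketched in a few lines. This is an illustrative data-structure sketch only; the class names, the "identity"/"summary" levels, and the airport attributes are assumptions chosen to match the paper's description, not part of any standard model.

```python
# Sketch of a common information model as linked layers: glossary terms
# carry the prose meaning, subject areas group them, and data element
# models reuse the same terms to define service payloads at different
# levels of detail.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GlossaryTerm:
    name: str
    description: str

@dataclass
class SubjectArea:
    name: str
    terms: List[GlossaryTerm] = field(default_factory=list)

@dataclass
class DataElementModel:
    entity: str
    level: str                                      # e.g. "identity", "summary"
    attributes: Dict[str, GlossaryTerm] = field(default_factory=dict)

iata = GlossaryTerm("airportCode", "IATA code identifying an airport.")
name = GlossaryTerm("airportName", "Official name of the airport.")

airports = SubjectArea("Airports", [iata, name])

airport_identity = DataElementModel("Airport", "identity", {"code": iata})
airport_summary = DataElementModel("Airport", "summary",
                                   {"code": iata, "name": name})

# Because both models reference the same GlossaryTerm objects, a change to
# a term's description is visible in every model and subject area using it.
print(airport_summary.attributes["code"].description)
```

The design choice being illustrated is that the models share term objects rather than copying definitions, which is what keeps the perspectives consistent with one another.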

D. Metadata management

The term metadata has been mentioned a number of times already. It describes the types of data supported by the shared information service, the valid values of this data, where it comes from (lineage), how it is structured in the different interfaces and stores, and the processes that shape it.

Support for metadata as a first class capability of a shared information service is our next design principle (number nine). The focus it creates ensures a clean and clear definition of the data types and services. It also improves the usability of the shared information service.

1) Metadata catalogs

An organization-wide shared information service typically supports a wide range of services. The metadata of the shared information service can provide a data-centric catalog of the interfaces that helps to guide people to the appropriate service for their need. For example, they may be interested in understanding the refueling facilities at an airport. The metadata catalog would offer a search facility that would use the links from the glossary terms to the parameters on the service interfaces to locate the relevant service interfaces.
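The refueling example above can be sketched as a small catalog search. The sketch assumes a simple in-memory structure; the class, the `fuelCapacity` term and the `getAirportFacilities` interface name are all hypothetical.

```python
# Sketch of a data-centric metadata catalog: glossary terms are linked to
# the service interfaces whose parameters use them, so a keyword search
# over the terms leads the user to the right service.

class MetadataCatalog:
    def __init__(self):
        self._term_interfaces = {}    # glossary term -> set of interface names
        self._term_descriptions = {}  # glossary term -> prose description

    def add_term(self, term, description):
        self._term_descriptions[term] = description
        self._term_interfaces.setdefault(term, set())

    def link(self, term, interface_name):
        # record that this interface has a parameter defined by this term
        self._term_interfaces[term].add(interface_name)

    def search(self, keyword):
        keyword = keyword.lower()
        hits = set()
        for term, description in self._term_descriptions.items():
            if keyword in term.lower() or keyword in description.lower():
                hits.update(self._term_interfaces[term])
        return sorted(hits)


catalog = MetadataCatalog()
catalog.add_term("fuelCapacity", "Refueling capacity available at an airport.")
catalog.link("fuelCapacity", "getAirportFacilities")
print(catalog.search("refuel"))   # ['getAirportFacilities']
```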

2) Metadata use and operation

Inside the shared information service, metadata drives the behavior of the internal processes and the access interfaces. For example, it may be used to control how data is routed, who has access to it and how long it is retained.

Many organizations implementing a shared information service quickly recognize the value of a comprehensive metadata capability. However, some make the mistake of underestimating the effort to maintain the metadata. Since technology from multiple vendors is likely to be employed in the shared information service, it is important that metadata management is open and automated within that technology and related tools. Otherwise the shared information service project team must plan for the effort taken to maintain and manage this metadata [8].

E. Business-led governance

Since data drives an organization, it makes sense that the business leads the governance program that controls it. To make this actionable, the shared information service needs governance functions that are driven by metadata and reference data settings that appropriate business leaders control through well-defined processes. This is design principle ten.

1) The role of classification

Classification of the different types of data is enabled in the metadata capability. Classification allows the appropriate business owner for a subject area to identify where data that requires special care is located. This includes different levels of confidentiality, retention requirements, confidence and integrity in the data supplied through the service. Linked to the different levels of classification are the rules that must automatically execute within the shared information service whenever data with that classification is encountered [9].
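The automatic execution of rules linked to classifications can be sketched as a small rule registry. This is an illustrative pattern only; the classification names ("confidential", "short-retention") and the rule behaviors are invented for the example.

```python
# Sketch of classification-driven governance: each classification is bound
# to rule functions that the shared information service runs automatically
# whenever it encounters data carrying that classification.

rules = {}   # classification -> list of rule functions

def rule(classification):
    """Decorator registering a function as a rule for a classification."""
    def register(fn):
        rules.setdefault(classification, []).append(fn)
        return fn
    return register

@rule("confidential")
def mask_payload(record):
    record["payload"] = "***MASKED***"
    return record

@rule("short-retention")
def tag_for_deletion(record):
    record["retain_days"] = 30
    return record

def process(record):
    # run every rule attached to every classification on the record
    for classification in record.get("classifications", []):
        for fn in rules.get(classification, []):
            record = fn(record)
    return record

result = process({"payload": "salary data",
                  "classifications": ["confidential", "short-retention"]})
print(result)   # payload masked and retain_days set to 30
```

The point of the pattern is that the business owner changes behavior by changing the classification assigned in the metadata, not by changing code in each connected system.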

2) Owners, custodians and consumers

A key role of information governance is to assign roles and responsibilities to people and systems, and to monitor and measure compliance. The metadata capability in the shared information service should manage the allocation of people to those roles that relate to data management and use, and provide the processes they need to perform these roles.

3) Self-service

Self-service describes a set of services for people to locate and extract sets of data values from the shared information service. These extracted values can be used for ad hoc analysis/investigation, analytics development and test data generation.
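A self-service extract governed by the classifications above can be sketched in a few lines. The data set, user and permission names are hypothetical; the point is that the extract respects the classifications the user has been granted.

```python
# Sketch of a governed self-service extract: a user requests a data set by
# name and receives only the records their granted classifications allow.

DATA_SETS = {
    "flights": [
        {"flight": "BA123", "classification": "public"},
        {"flight": "XX999", "classification": "restricted"},
    ],
}

PERMISSIONS = {"analyst-1": {"public"}}   # user -> allowed classifications

def extract(user, data_set_name):
    allowed = PERMISSIONS.get(user, set())
    return [row for row in DATA_SETS[data_set_name]
            if row["classification"] in allowed]

print(extract("analyst-1", "flights"))   # only the public record
```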

F. Iterative development

Given the size of an organization-wide information service, it needs to be delivered as a series of iterations – typically every three months or so. Some organizations iterate faster than that, but discover that the connected systems find it hard to keep up with a faster schedule. The pace of delivery needs to match the ability of the consuming systems to accommodate each iteration.

Due to the shared nature of the service, new developments need to be made openly with plenty of opportunity for review and feedback. The more advanced organizations have three staggered parallel tracks operating:

Data definition track – the most advanced track creates the common definitions for data passing through the shared information service.

Architecture track – the architecture track designs the technical implementation of new capability within the shared information service. They use the definitions from the data definition track in the design of the interfaces and stores within the shared information service.

Delivery track – this team drives the rollout of the shared information service and assists the teams connecting their systems to it.

The advanced work of the data definition and architecture tracks reduces the rework needed in the delivery track as it has clarity on the data and interfaces it is delivering. Using staggered parallel tracks for data definition, architecture and delivery is design principle eleven.

G. Leadership and management

Many organizations are hierarchical in nature. In a shared information service, data flows laterally between an organization's internal silos, and there must be consistency in the way information is maintained and used within each silo.

The result of an organization-wide shared information service is typically deeply disruptive to any organization, particularly those that have been highly decentralized and/or focused on functional, hierarchical command and control. This is because the way resources are organized and allocated changes, resulting in an equivalent change in the positions of authority and influence. This change can create huge resistance and inertia in the organization as people fear for their positions and future.

As such, the shared information services that had clear, visible support from the top of the organization met the least resistance. This leadership included participation in studies and decisions about the service, communications to all employees, investment in and recognition for initiatives that created greater information synergy, and a clear path for individuals involved in the transition showing how they can adapt to the change. Inevitably, there will be people who cannot or will not adapt; they should be identified and accommodated early in the program. Design principle twelve is therefore that a shared information service needs both visible leadership and funding from the top of the organization.

H. Adaptability

As already stated, enabling a shared information service in a complex organization is typically a multi-year project involving many aspects of the organization's strategic and operational functions. These projects have such a broad scope that they rarely run smoothly, and need to constantly adapt as priorities change and new technologies, legislation, competitors and partners emerge. In fact, even once the shared information service is operational, it must adapt: the structure of the service interfaces and payloads and the terminology used will change; the size of the supported workloads may change; new workloads will be added and others removed; the subject areas and types of data supported will evolve; and new legislation and governance requirements will be defined. The scope of the organization will change, as will its internal structure and the individuals supporting it.

Thus adaptability is design principle thirteen in the shared information service. Much of the adaptability can be achieved by using metadata to define how the services are constructed and the functions that are called to transform and govern data. The rollout needs to be organized into a series of iterations. Each iteration should demonstrate some business value or a key principle of the shared information service.

IV. Conclusions

Whatever an organization’s starting point, it must eventually address six main cross-organizational initiatives to successfully operate a shared information service. These are:

A data driven culture, where individuals and teams expect data to be easily available and actively use it to make decisions and solve problems. Tied to this is a sense of responsibility towards data, where individuals see consuming, maintaining and improving the data they use as part of their role in the organization.

A business driven information governance program where there are business owners driving the definition of how data is to be used and managed in the organization. The business owners set the controls, encoded in metadata, that drive how data is managed in the IT systems.

A comprehensive, open metadata management service that details where the data in the organization is located, where it came from, what it means, how it is used today, how to get access to it, who is responsible for it, how it is governed and what it can be used for.

A common information model, or more accurately, set of models that define the standards related to data. This includes the terminology used, how data should be grouped and organized and how it can be structured for display and exchange by IT services.

An open, shared information service that acquires, manages and delivers data wherever needed in the organization along well defined information supply chains that support the appropriate business, analytical and software development lifecycles. This service is pattern-based and blends technology to meet different workload requirements, both for operational needs and the analysis necessary to continually improve the operation of the organization. It supports its own reference data and provides mapping services for connected systems.

An IT operations and project management perspective that ensures all systems participate with the metadata service and that data is accessible and exchanged.

Partly because of the review of existing operations from a different perspective, and partly because of the value of a data-centric approach to an organization's operation, a pattern-based, metadata driven approach to a shared information service, accompanied by a comprehensive transformation program, leads to a more adaptable and transparent shared information service and to a flatter, simpler and more effective organization where data is deployed as an asset and used as a driver for success.

Acknowledgment

I wish to thank my IBM colleagues for their support in developing the information architecture concepts described in this paper. In particular, Tim Vincent, Dan Wolfson, and Bill O’Connell, who have each made a unique contribution to the art of information architecture.

Governance is a practice that you apply to "something". Just like James Watt's fly-ball governor for the steam engine, a governance program seeks to keep an engine in balance so it works effectively. This engine may be a process, organization, or flow of information. The important point is that the target of what you are governing is clearly defined.

Approaches to governance, particularly around a data lake, vary widely due to the different choices that organizations make in their definition of the engine being managed. For example, the IT department may see the data lake engine as a collection of technology working together. The business may see the data lake as part of an innovation engine helping them to create new value from data. So which is the right engine to govern? It depends on the objective for the data lake.

A good starting point in defining the governance program for the data lake is to consider the perspective of each of the principal groups of data lake users, define the engine that each of them sees, and think about what mechanisms it would take to create a balance between effort and value in each of these perspectives.

So for example, the owner of a system that is supplying data to the data lake is required to maintain the catalog entry for the data coming from their system, and in return, they could get analysis on the quality or consistency of this data that helps them provide a better service to their users.

A data scientist may be restricted in how they work with sensitive data, but in return they get a rich catalog of data to choose from and easy processes to get permission to use the data sets they need. They may also be given the ability to contribute data and content for the catalog. The more they contribute, the easier the discovery process becomes.

By balancing the needs of the suppliers with the needs of the consumers, the balance of effort and value is achieved, creating a sustainable ecosystem.

In addition to designing the governance program around the perspectives of the users, it is also necessary to decide who is in control of the data lake – IT or the business – because that affects how the data lake is governed.

When IT is in control, then normal IT governance can manage many of the aspects of the data lake. However, when the business is in control, the mechanisms that operate the data lake, and the classifications that identify the different types of data, need to be abstracted through services and metadata to create a view of the data lake that makes sense to the business and can be modified by them as needed. This view is then mapped to the actual data and technology through the metadata in the catalog, and the metadata settings are used by the data lake services to drive the behavior of the data lake.
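This abstraction can be sketched as two layers of metadata: a business-facing catalog of named collections and business-controlled classification settings, mapped onto physical locations and service behavior. The catalog entries, classification names and the path string are all hypothetical.

```python
# Sketch of a business-controlled data lake view: the business edits
# classifications and their settings; the catalog's metadata maps each
# business name to its physical store; lake services read both to decide
# how to behave.

catalog = {
    "customer profiles": {
        "physical_store": "lake/zone2/crm_extract",   # hypothetical path
        "classification": "sensitive",
    },
}

# business-controlled settings: classification -> behavior of lake services
settings = {
    "sensitive": {"encrypt": True, "retention_days": 365},
    "open":      {"encrypt": False, "retention_days": 90},
}

def lake_policy(business_name):
    """Resolve a business-facing name to its store and governance policy."""
    entry = catalog[business_name]
    return entry["physical_store"], settings[entry["classification"]]

store, policy = lake_policy("customer profiles")
print(store, policy)
```

The business changes behavior by editing the classification or its settings; no change to the lake services or the physical layout is needed.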

Once the engines have been defined, the governance program is designed in the normal way:

Setting standards for the metadata, formats and best practices for the data lake.

Measuring and monitoring adherence to these standards.

Taking action as appropriate, such as managing exceptions, answering compliance questions and modifying the program based on feedback.

I would like to end by emphasizing the importance of feedback in achieving balance and value. Governance programs must be dynamic and must demonstrate the value that they deliver. The feedback mechanisms should not be forgotten, as they enable the governance program to stay relevant to the changing needs of the business, which in turn change the nature of the engines we need to govern.