
I fairly consistently get questions on the value of machine learning in data management and governance. Sometimes the question is framed at a high level in a very “buzzwordy” way: the person asking may not know what machine learning (ML) is; they have just heard the words so many times that they know it is good and should be part of the discussion. At other times, the person asking knows about ML and various other analytical techniques, but has never really thought of ML in the context of a data management tool. The challenge is that IoT data, Customer 360 programs, and emerging best practices that focus on sharing semantically tagged data all contribute to a fundamental need to do things differently. Machine learning is one of the tools in the toolbox for addressing the challenges related to scale, change velocity, and the constant evolution of users and their use cases.

This post focuses on how we can automate the process of identifying data, classifying it, and linking it to internal and external references to provide semantic meaning. The goal of this post is simply to describe what machine learning is for the data manager, and what tasks it performs in the context of a standards-based operational perspective.

From an operational perspective, the figure below presents the evolution of data from the “raw” transactional state to a highly labelled or curated state that can be shared between purchaser and vendor – or indeed any producer or consumer of data. Machine learning plays a role in automating how data is curated and enriched across this lifecycle.

Figure 1: The Curation from raw data to sharable Information

If we drill down on the curation lifecycle, we can identify the various repositories that would be required, and a few of the key supporting standards. These standards and their roles are discussed more completely in a follow-on post.

The database symbols outlined in blue (solid lines) represent data at rest. The rectangular items outlined in green (dashed lines) represent tasks that automate how data is augmented as it moves along this path. The focus of this discussion is on these green boxes.

Activities within the Data Quality Rules and MDM Rules tasks can be broken down into a number of functional capabilities as detailed below. Some of these capabilities are traditional data operations tasks; namely, persisting metadata in a database, and exposing the data through some sort of cataloging and publishing capability. The other items (outlined in blue) are those where machine learning approaches can be applied.

Figure 3: Functional capabilities supported by Machine Learning

First, let’s start with a definition, as machine learning has multiple definitions within the popular literature. The website TechEmergence provides a comprehensive one:

“Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.”

Machine learning techniques play a major role in automating the process detailed above, especially over unknown or new data sets.

For data management practitioners, it is important to understand that no one machine learning technique is going to apply. In all likelihood, multiple approaches will be chained together and invariably executed recursively to ensure that the data can be identified, classified, and then linked to the appropriate unique identifier. In the ideal world, the algorithms will change or learn to accommodate changes in the data being classified. The figure below lists some of the machine learning techniques that may be applied.
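To make the chaining idea concrete, here is a minimal pure-Python sketch (not any specific product) of an "identify, classify, link" pipeline. The rules and the reference concept IDs are invented for illustration; a real pipeline would substitute trained models for these toy rules.

```python
# Illustrative sketch: chaining "identify -> classify -> link" steps.
# All rules and concept IDs below are hypothetical stand-ins.

def identify(record):
    # Flag fields that look like candidate data elements of interest.
    return {k: v for k, v in record.items() if isinstance(v, str) and v}

def classify(fields):
    # Toy rule-based classifier standing in for a trained model.
    labels = {}
    for name, value in fields.items():
        if "@" in value:
            labels[name] = "email"
        elif value.replace("-", "").isdigit():
            labels[name] = "identifier"
        else:
            labels[name] = "text"
    return labels

def link(labels, reference):
    # Attach a reference-concept ID where one exists.
    return {name: (label, reference.get(label)) for name, label in labels.items()}

reference = {"email": "CONCEPT:0001", "identifier": "CONCEPT:0002"}
record = {"contact": "jane@example.com", "vendor_no": "123-456", "notes": "rush order"}

linked = link(classify(identify(record)), reference)
print(linked)
```

In practice the chain would be re-run as the data or the models change, which is where the recursive, self-correcting character of the approach comes in.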

Machine Learning Techniques

Unstructured Data

· Entity Tagging / Extraction
· Categorize
· Cluster
· Summarize
· Tag
· Linking

Structured Data

· Associate
· Characterize
· Classify
· Predict
· Cluster
· Pattern Discovery
· Exception Analysis

Note that these invariably interact with one another. If I tag person entities within unstructured text, I may wish to characterize them using structured techniques: count of male names, frequency per document, frequency across documents, etc. This speaks to the layered and recursive nature of machine learning, and the richness of the metadata that the data team will need to manage. For a more technical view of ML techniques, see this summary.
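The "tag, then characterize" pattern above can be sketched in a few lines. Here a hypothetical name list stands in for an entity tagger (a real system would use an NER model), and the structured characterization is a simple per-document and cross-document count.

```python
# Sketch of layering: tag person entities, then characterize the tags
# with structured counts. KNOWN_NAMES is a hypothetical stand-in for
# a trained entity tagger.
from collections import Counter

KNOWN_NAMES = {"Alice", "Bob", "Carol"}

docs = {
    "doc1": "Alice met Bob. Alice signed the contract.",
    "doc2": "Bob emailed Carol about the invoice.",
}

per_doc = {
    doc_id: Counter(tok.strip(".,") for tok in text.split()
                    if tok.strip(".,") in KNOWN_NAMES)
    for doc_id, text in docs.items()
}
across_docs = sum(per_doc.values(), Counter())

print(per_doc["doc1"]["Alice"])  # frequency within a document: 2
print(across_docs["Bob"])        # frequency across documents: 2
```

The structured counts themselves become metadata the data team must manage, which is exactly the richness the paragraph above describes.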

Each capability is detailed below, along with considerations for program managers.

Identify

Machine learning approaches support the identification of instance data in order to classify it. Is this personal information? Does it look like a financial number? Does it reside in a financial statement?

For organizations with a significant installed legacy, it will be important to have algorithms that identify data of interest. The identification of personal information is a current area of interest, driven by the GDPR regulation.
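A hedged illustration of the "identify" capability: profiling column values with regular expressions to flag columns that likely contain personal data. The patterns and the match threshold are assumptions for the example only; production profilers use far richer detectors.

```python
# Toy data profiler: flag columns whose values mostly match a pattern
# associated with personal information. Patterns/threshold are
# illustrative assumptions, not a complete PII ruleset.
import re

PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s\-()]{7,}$"),
}

def profile_column(values, threshold=0.8):
    """Return the pattern name that matches enough of the values, if any."""
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            return name
    return None

emails = ["a@b.com", "c@d.org", "not-an-email", "e@f.net", "g@h.io"]
print(profile_column(emails))  # 4/5 values match -> "email"
```

Running such a profiler over legacy schemas is one way to surface GDPR-relevant columns that nobody documented.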

Classify

Once data is identified, ML approaches support classifying the data within the data dictionary: the data is in the finance domain; it is in the “Deliver” phase of the Supply Chain Operations Reference (SCOR) lifecycle; etc.

Classification algorithms must exist that tag the data with the appropriate classifier. Capabilities must quantify and resolve those instances where there is uncertainty as to the accuracy of the classification. For example, are we 100% certain that this is a vendor and not a customer?
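One way to operationalize that uncertainty is a confidence threshold: records the classifier is unsure about are routed to human review rather than auto-tagged. The scoring rule below is invented purely to illustrate the flow.

```python
# Sketch of quantifying classification uncertainty. The score()
# function is a hypothetical stand-in for a trained model's class
# probabilities.
def score(record):
    p_vendor = 0.9 if "supplier_id" in record else 0.55
    return {"vendor": p_vendor, "customer": 1 - p_vendor}

def classify_with_review(record, threshold=0.8):
    probs = score(record)
    label = max(probs, key=probs.get)
    if probs[label] < threshold:
        # Not certain enough: flag for a data steward instead.
        return ("needs-review", probs)
    return (label, probs)

print(classify_with_review({"supplier_id": "S-17"})[0])  # vendor
print(classify_with_review({"name": "Acme"})[0])          # needs-review
```

The threshold becomes a governance knob: lowering it increases automation, raising it increases steward workload but reduces misclassification risk.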

Resolve

The completed data dictionary will support entity resolution by providing a richer feature set against which MDM machine learning algorithms can be run.

Resolving the identity of the master data element may require that a multi-tiered approach be run iteratively: apply Algorithm #1; for those that do not resolve with Algorithm #1, apply Algorithm #2; etc.

For example, now that I have classified the data item as vendor master data (previous step), can I resolve its identity with certainty to identify which vendor it is?
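The tiered "Algorithm #1, then Algorithm #2" flow might be sketched as follows: exact matching first, then fuzzy matching (here via the standard library's difflib) only for records that did not resolve. The master list and cutoff are illustrative assumptions.

```python
# Two-tier entity resolution sketch: exact match, then fuzzy match
# for the unresolved remainder. Masters and cutoff are illustrative.
import difflib

masters = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def resolve(name):
    # Tier 1: exact match.
    if name in masters:
        return name
    # Tier 2: fuzzy match, applied only when tier 1 fails.
    close = difflib.get_close_matches(name, masters, n=1, cutoff=0.8)
    return close[0] if close else None

print(resolve("Globex Inc"))       # exact hit
print(resolve("Acme Corporaton"))  # fuzzy hit despite the typo
print(resolve("Umbrella Co"))      # unresolved -> None
```

Unresolved records would feed further tiers (phonetic matching, ML matchers) or a steward queue, mirroring the iterative flow described above.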

Link

The resolved entity must be linked to internal and external reference sources. Machine Learning techniques may be used to identify and resolve link candidates and specify link type / strength.

The analytical details of this may be addressed in the above “Resolve” capability. However, the focus here should be on identifying the correct link (or links) where there are multiple candidate reference sets where links could be established.

This is a critical step, as the linkage to the internal reference “Concept System” is what describes the data element from a semantic perspective. It is also what links the data being described to a publicly available set of definitions that external parties can reference (see “Sharable Information” in the figure above). These linkages crosswalk an industry-accepted definition between supply chain partners.

Example:

Suppose a supply chain manager seeks to communicate the nature of a product requirement to a vendor – a machine screw, for example. The ability to specify the length of the screw versus the length of its “shoulder”; the thread size (metric or imperial?); and the type of head (hex, square, pan head, etc.) is critical. The internal labels for these attributes are linked to the industry-agreed labels available to the vendor community.

As long as the vendor is using the same reference concept system, both buyer and vendor can be assured that they are talking about the same machine screw.
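A minimal sketch of the machine-screw example: each party maps its internal field names to IDs in a shared concept system, so buyer and vendor land on the same definitions. The concept IDs here are hypothetical, not a real industry vocabulary.

```python
# Hypothetical shared concept system linking attribute terms to IDs.
CONCEPT_SYSTEM = {
    "thread size": "SCREW:THREAD-SIZE",
    "head type": "SCREW:HEAD-TYPE",
    "shoulder length": "SCREW:SHOULDER-LEN",
}

# Different internal labels on each side of the transaction.
buyer_labels = {"thd_size": "thread size", "head": "head type"}
vendor_labels = {"threading": "thread size", "head_style": "head type"}

def to_concepts(labels):
    return {internal: CONCEPT_SYSTEM[term] for internal, term in labels.items()}

# Both parties resolve to the same concept IDs despite different
# internal field names.
buyer = set(to_concepts(buyer_labels).values())
vendor = set(to_concepts(vendor_labels).values())
print(buyer == vendor)  # True
```

The crosswalk lives in the concept system, not in either party's schema, which is what lets the internal labels evolve independently.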

Once these activities have been completed, the results need to be persisted in a metadata repository and published in a Data Catalog that will allow users to understand what data is available and how it can be accessed.

Some Closing Thoughts: It’s all about Ecosystem Maturity!

The above discussion, and the content of the two posts in the works on MDM standards and data quality, identify a set of standards and techniques that seek to streamline and automate the process of Master Data Management. However, these exist within the context of the organization’s data ecosystem. Data practitioners seeking to evolve master data management must ask some core questions regarding information architecture and data management maturity within their ecosystem:

How do these standards support my data strategy?

Do I have a business case?

Executive sponsorship?

Funding?

Does my information architecture support the capabilities that I need to manage Master Data as envisioned by the standards?

Will legacy systems impact how this gets executed?

Does the architecture support a “Service Oriented” metadata registry or catalog concept?

Do I have a metadata catalog?

What are the architectural boundaries and how do I share data across those boundaries?

Do I have the data management maturity to execute?

Identified and scalable processes?

Processes applied consistently across business units?

A governance operating model that can accommodate new functions and the change management overhead?

What controls and metrics exist? Need to be created?

Understanding how standards and machine learning fit within the information architecture and the organization’s capability maturity will enable the data team to define the right strategy and build out a realistic roadmap. For organizations with an established and mature governance function, many of the above questions will already be resolved – or the mechanism to resolve them exists. However, for organizations with less capability maturity, the strategy and roadmap will need to be explicit in identifying the business units where foundational capabilities can be created and later adopted across the organization as need and maturity evolve.

Every once in a while, I get asked how to select between different types of databases. Generally, this question comes up as a result of a product vendor or consultant recommending an evolution towards a Big Data solution. The issue is twofold: companies seek to understand what the next-generation data platform looks like, and how – or if – their current environment can evolve. This involves understanding the pros and cons of the current product set and to what degree it can coexist with newer approaches – Hadoop being the platform people currently talk about.

The following is a list of data persistence approaches that helps at least define the options. This was done some time ago, so I am sure the vendors shown have evolved. However, think of it as a starting point to frame the discussion.

In general, one wants to anchor these discussions in defined criteria that frame them within the context of business drivers. In the following figure, the goal is to show that as the sources and consumers of your data expand to include increasingly complex data structures and “contexts,” there is a need to evolve beyond traditional relational database (RDBMS) approaches. Different organizations will have different criteria. I provide this as a rubric that has worked before – you will need to create an approach that works for your organization or client.

A number of data persistence approaches support the functional components as defined. These are described below.

Relational Databases (RDBMS)

Relational structures are best used when the data structures are known and change infrequently. Relational designs often present challenges for analysts when queries and joins are executed that are incompatible with the design schema and/or indexing approach. This incompatibility creates processing bottlenecks and resource challenges, resulting in delays for data management teams. The approach is also challenged when dealing with complex semantic data where multiple levels of parent / child relationships exist.

Advantages: This approach is best for transactional data where the relationships between the data, and the use cases driving how data is accessed and used, are stable. In uses where relational integrity is important and must be enforced in a consistent manner, this approach can work well. In a row-based approach, contention on record locking is easier to manage than with other methods.

Disadvantages: As the relationships between data and relational integrity are enforced through the application of a rigid data model, this approach is inflexible, and changes can be hard to implement.

All major database vendors: IBM – DB2; Oracle; MS SQL and others

Columnar Databases

(Column Oriented)

Data organized or indexed around columns; can be implemented in SQL or NoSQL environments.

Advantages: Columnar data designs lend themselves to analytical tasking involving large data sets, where rapid search, retrieval, and aggregation-type queries are performed on large tables. A columnar approach inherently creates vertical partitioning across the datasets stored this way. It is efficient and scalable.

Disadvantages: Efficiencies can be offset by the need to perform many joins to obtain the desired result.

•Sybase IQ

•InfoBright

•Vertica (HP)

•ParAccel

•MS SQL 2012
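To make the column-oriented idea concrete, here is a toy comparison of row and column layouts in plain Python. The point is that an aggregate over one column touches only that column's data, which is what makes the vertical partitioning efficient.

```python
# Row-oriented layout: every record carries every field.
rows = [
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 250},
    {"region": "east", "sales": 175},
]

# Column-oriented layout of the same table: one array per column.
columns = {
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}

# Aggregation reads a single contiguous column...
total = sum(columns["sales"])
# ...instead of touching every field of every row.
total_rowwise = sum(r["sales"] for r in rows)

print(total)  # 525 either way; the columnar scan touches less data
```

Real columnar engines add compression per column and vectorized scans, but the access-pattern advantage is the same as in this sketch.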


RDF Triple Stores / Databases

Data stored and organized around RDF triples (Actor-action-object or Subject-predicate-object); can be implemented in SQL or NoSQL environments.

Advantages: A semantic organization of data lends itself to analytical and knowledge management tasks where the understanding of complex and evolving relationships is key. This is especially the case where ontologies or SKOS (1) type relationships are required to organize entities and their relationships to one another: corporate hierarchies / networks, or insider trading analysis, for example. This approach to organizing data is often represented in the context of the “semantic web,” whose organizing constructs are RDF and OWL. When dealing with complex semantic data where multiple levels of parent / child relationships exist, this approach is more efficient than an RDBMS.

Disadvantages: This approach to storing data is often not as efficient as relational approaches. It can be complicated to write queries that traverse complex networks – although such queries are often no easier in relational databases.

Note: these can be implemented with XML formatting or in some other form.

Native XML / RDF Databases

•Marklogic (COTS)

•OpenLink Virtuoso (COTS)

•Stardog (o/s, COTS)

•BaseX (o/s)

•eXist (o/s)

•Sedna (o/s)

XML Enabled Databases

•IBM DB2

•MS SQL

•Oracle

•PostgreSQL

XML-enabled databases deal with XML either as a CLOB in a table or organized into tables based on a schema.
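A pure-Python sketch can make the subject-predicate-object model above concrete (a real deployment would of course use one of the triple stores listed). Triples are plain tuples, and a tiny pattern matcher stands in for a SPARQL engine.

```python
# Toy triple store: tuples plus a wildcard pattern matcher. The
# entities and predicates are invented for illustration.
triples = {
    ("acme", "isSubsidiaryOf", "globex"),
    ("globex", "hasCEO", "alice"),
    ("acme", "hasCEO", "bob"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return {(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)}

# "Who are the CEOs?" -- analogous to SELECT ?s ?o WHERE { ?s :hasCEO ?o }
print(sorted(match(p="hasCEO")))
```

Because relationships are first-class data rather than schema, adding a new predicate requires no model change, which is the flexibility the advantages above describe.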

Graph Databases

A database that uses graph structures to store data.

See XML / RDF Stores / Databases. Graph Databases are a variant on this theme.

Advantages: Used primarily to store information on networks. Optimized for iterative joins, often in a recursive process (2).

Disadvantages: Storage challenges – these are large datasets that build through iterative joins, which is very processor intensive.

Key Value Pair (KVP) Stores

This approach has many of the characteristics of the XML, Columnar and Graph approaches. In this instance, the data is loaded, and key value pair (KVP) files are created external to the data. Think of the KVP as an index with a pointer back to the source data. This approach is generally associated with Hadoop / MapReduce capabilities, and the definition here assumes that KVP files are queried using the capabilities available in the Hadoop ecosystem.

Disadvantages: Share-nothing architecture creates complexity in uses where the sequencing of transactions or writes is important – especially when multiple nodes are involved; complex metadata requirements; few tool “packages” available to support production environments; relatively immature product set.
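The KVP / MapReduce idea in miniature, in plain Python: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. Hadoop distributes these same phases across nodes; the single-process sketch below is only meant to show the shape of the computation.

```python
# Miniature map / shuffle / reduce over toy records.
from collections import defaultdict

records = ["east 100", "west 250", "east 175"]

# Map: emit (key, value) pairs from each input record.
pairs = [(line.split()[0], int(line.split()[1])) for line in records]

# Shuffle: group values by key (Hadoop does this across the cluster).
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: aggregate each group independently.
totals = {key: sum(values) for key, values in groups.items()}
print(totals)  # {'east': 275, 'west': 250}
```

Because each reduce group is independent, the work parallelizes naturally, which is why this pattern suits the share-nothing clusters described above.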

The classes of tools below are presented because they provide alternatives for capabilities that are likely to be required. Many of these capabilities are resident in some of the tool sets already discussed.

Data Virtualization

The ability to produce tables or views without going through an ETL process

Data virtualization is a capability built into other products. Any in-memory product inherently virtualizes data. Likewise, a number of the enterprise BI tools allow data – generally in the form of “cubes” – to be virtualized. Denodo Technologies is the major pure-play vendor; the other vendors generally provide products that are part of larger suites of tools.

•Composite Software (Cisco)

•Denodo Technologies

•Informatica

•IBM

•MS

•SAP

•Oracle
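One way to make "tables or views without going through an ETL process" concrete: a database view exposes a reshaped, always-current result without copying data. This uses the standard library's sqlite3 purely as an illustration of the idea, not any of the vendor products listed above.

```python
# A view as a minimal example of virtualized data: the summary is
# defined once and computed on demand, with no load step.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("east", 100), ("west", 250), ("east", 175)])

# A virtual, always-current summary -- no extract/transform/load run.
con.execute("""CREATE VIEW sales_by_region AS
               SELECT region, SUM(amount) AS total
               FROM orders GROUP BY region""")

print(con.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
# [('east', 275), ('west', 250)]
```

Data virtualization products extend this same idea across heterogeneous sources, presenting federated views as if they were local tables.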

Search Engines

Data management components that are used to search structured and unstructured data

Search engines and appliances perform functions as simple as indexing data, and as complex as Natural Language Processing (NLP) and entity extraction. They are referenced here as the functionality can be implemented as a stand-alone capability and may be considered part of the overall capability stack.

•Google Search Appliance

•Elasticsearch


Hybrid Approaches

Data products that implement both SQL and NoSQL approaches

These are traditional SQL database approaches that have been partnered with one or more of the approaches defined above. Teradata acquired Aster to create a “bolt-on” to a traditional SQL database; IBM has DB2 / Netezza / BigInsights. SAS uses a file-based storage system and has created “Access Modules” that work through Apache Hive to apply analytics within either an HDFS environment or the SAS environment.

Another hybrid approach is exemplified by Cassandra, which incorporates elements of a data model within an HDFS-like distributed storage system.

One also sees organizations implementing HDFS / RDBMS solutions for different functions – for example, acquiring, landing, and staging data using an HDFS approach, and then, once requirements and the business use are known, creating structured data models to facilitate and control delivery.

Advantages: Integrated solutions; ability to leverage legacy; more developed toolkits to support production operations. Compared to open source, production ready solutions require less configuration and code development.

Disadvantages: Tend to be costly; architecture tends to be inflexible – all or nothing mindset.

What is Enterprise Data Management? In some organizations, this is an easy question to answer. However, in others – especially those with an analytical mission – it is much harder. Often the function is put under the Enterprise Architecture team. One often hears that “the Enterprise data folks just do not get it”. As one executive in a large financial organization put it: “EA is where rubber hits the air”. So how do we define a role for the data function within an Enterprise Architecture team?

This post is not about how to organize effectively in order to align with the business units to show business impact – although a worthy topic. This post is about suggesting a role for the enterprise data management team when that team is organized under the CIO within the Enterprise Architecture function of the organization. In order for the data function to be deemed valuable to the business stakeholders it must be understandable, actionable and tied to the business objectives.

Data is everywhere. When we talk about “Enterprise Data Management” the temptation is for managers to say – well, that means we manage all data in all locations. As enterprise data managers, we must know all about everything! Really? Have you ever seen this work? This leads to the top-down mandate of the “canonical” approach, where the objective is a single standard, a single canonical model – a single ring to rule them all! This rarely works well (if at all). Business requirements, analytical activity, market trends, and evolutions in technology all lead to a core business requirement for flexibility. Additionally, there is a fundamental need to recognize that the “ground truth” almost always sits with the business side of the house – and “truth” is often a shifting concept in the real world. This is part of the reason why the Enterprise Data Warehouse (EDW) “single version of the truth” is problematic for analytical and BI staff, and part of the reason for the rise of Hadoop as a more flexible environment. As an analytical or BI person, my version of the truth – or the right data – depends on the context of a particular decision.

So how do we focus the EDM team on what makes sense? The graphic in this excellent article on risk architecture caught my eye. I have modified it a bit to identify some core activities that I see as foundational for organizations seeking to mature their data management in general, and specifically, to integrate the enterprise architecture team into the data management process. The original graphic is attributed to Naomi Clarke, currently at Credit Suisse.

Based on the above, the role of EDM is simple: manage data assets to expose those attributes that are needed to answer key business questions:

What are they?

Where are they?

What has happened to them?

What are they related to?

To do this, one needs a data management “hub”. I call it a hub as this provides flexibility for discussion purposes. Some would call it a Managed Metadata Environment (MME); others, perhaps, a Metadata Registry. Regardless, the goal is a metadata ecosystem that can support key functions related to governance, curation, quality, usability, and discoverability. This view suggests the following regarding the roles of the EDM team – especially when it is organized within the EA team:

The team needs to manage only three inputs: lineage metadata, definitions, and physical location (what, where, and change). The way the organization creates those three inputs is part of an overall data strategy, but not something the EDM team drives – these are driven by the business. By focusing in this way, the enterprise team leaves it up to the business or operational components to determine the optimal approaches.

Definitions are aligned to business terms and to “Concept Systems” (as defined in the ISO 11179 Specs). This enables discoverability and complex search approaches based on an understanding of semantic equivalence.

Data Assets can be classified within the context of an enterprise data reference model (DRM). In most organizations, this supports the governance process. However, in government organizations, the DRM is also used to align policy and strategy objectives to IT activities. See Federal Enterprise Architecture framework for how this works in the US.

Capabilities to support governance functions must be provided: vocabulary management tools with the capability to curate and link ontologies, taxonomies, controlled vocabularies, etc.; data quality tools; and governance tools.

If one can limit the INITIAL scope of the EDM team to these items, it is much easier to tie enterprise activities to the business needs, and provide a set of capabilities that address challenges of high value to the organization: search, discoverability and integration. Evolving the role once these benefits are established is a much easier task.
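A sketch of the three-input idea: a minimal hub record capturing only definition, physical location, and lineage for each data asset. The field names and sample values are illustrative, not a proposed schema.

```python
# Minimal metadata "hub" record for the what / where / change inputs.
# Field names and sample values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class AssetRecord:
    name: str
    definition: str          # what it is (linked to business terms)
    location: str            # where it is (physical system / path)
    lineage: list = field(default_factory=list)  # what happened to it

hub = {}

def register(record):
    hub[record.name] = record

register(AssetRecord(
    name="vendor_master",
    definition="Authoritative list of approved vendors",
    location="erp.prod.vendors",
    lineage=["extracted from ERP", "deduplicated"],
))

print(hub["vendor_master"].location)  # erp.prod.vendors
```

Keeping the record this small is the point: anything beyond these three inputs belongs to the business or operational teams, not the enterprise hub.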

This short article provides an interesting perspective on how NoSQL differs from a data storage perspective, and why that is important. The article also points out that storing data on large clusters is very efficient from a storage perspective, but NOT if the data is relational in nature. In order to look at data across clusters efficiently, one needs to reorganize it – this is where MapReduce comes in. MapReduce is great at reorganizing data to feed a particular task – from my perspective, a critical need for the analytical communities.

This links to the notion of “Polyglot Persistence,” which accepts that data will be stored in multiple mediums as new ways of persisting data evolve. I find this interesting, as it mirrors what we are seeing today. Customers have Operational Data Stores – usually relational – and yet seek to perform tasks that are complicated by: 1) the size of the data, and 2) the constraints placed on how the data can be evaluated or analyzed by the data model or architecture. This motivates an exploration of new approaches; hence the discussions industry is having on NoSQL (or, to use the buzzwords: Hadoop, MapReduce, Big Data).

I may have simplified this a bit – apologies. At the end of the day, we are seeing a sea change in how organizations deal with data in order to apply it more effectively to the diverse needs of the business side of the house. Explaining how organizations must change – but do so in a controlled, risk-reduced manner – is the challenge.