Enterprises that can gain a unified view of their customer data open up new business and customer opportunities. Capturing customer data, however, can be a difficult task, as most systems rely on traditional “top-down” approaches to standardizing data. In a recent O’Reilly webcast, Integrating Customer Data at Scale, Tamr field engineer Alan Wagner hosts a Q&A session with Matt Stevens, the general manager at Toyota Motor Europe, to demonstrate how a leading enterprise uses a third-generation system like Tamr to simplify the process of unifying customer data.

In the webcast, Stevens explains how Toyota Motor Europe has gained a 360-degree view of its customers through the Tamr Data Unification Platform, which takes a machine learning and expert-sourcing, “human-guided workflow” approach to data unification. Wagner provides a demo of the Tamr platform, applied within a Salesforce application, to demonstrate the ability to capture and unify customer data.

In particular, this webcast explores how to:

Combine machine learning with expert-sourcing to ensure a high level of scalability and accuracy

Bring together disparate data sources within one system

Quickly integrate new data with existing data sets

Utilize open APIs to integrate with a variety of existing systems

Using machines and people to unify data

Stevens notes in the webcast that Toyota Motor Europe’s customer data is organized at the retailer level. This manner of organization has resulted in a massive amount of segmented customer data being generated across various countries. As the auto industry has become digitized, this segmented data could not keep pace with advancing industry standards. Rather than applying a traditional top-down approach to standardizing data, Toyota chose to deploy Tamr to catalog, connect, and consume all of its customer data.

Tamr takes a “bottom-up” approach to data unification — using automated machine learning algorithms to provide the scalability required to ingest large data sets. To ensure accuracy, Tamr asks data experts (professionals who serve as the current owners of the data) to provide additional context about the information being processed.
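To make the bottom-up idea concrete, here is a minimal, generic sketch of a human-guided matching workflow: a classifier scores candidate record pairs, confident decisions are made automatically, and ambiguous pairs are routed to a data expert. This illustrates the general pattern only, not Tamr’s actual implementation; the pairwise features, thresholds, and ask_expert helper are hypothetical.

```python
# Generic human-in-the-loop record matching sketch (not Tamr's actual code).
# A classifier scores candidate pairs; uncertain pairs go to a data expert.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pairwise features: [name_similarity, address_similarity, email_match]
labeled_pairs = np.array([
    [0.95, 0.90, 1.0],   # known duplicate customers
    [0.20, 0.10, 0.0],   # known distinct customers
    [0.88, 0.75, 1.0],
    [0.30, 0.40, 0.0],
])
labels = np.array([1, 0, 1, 0])  # 1 = match, 0 = non-match

model = LogisticRegression().fit(labeled_pairs, labels)

def ask_expert(pair):
    """Placeholder for routing a pair to a data expert (e.g., a review queue)."""
    print(f"Expert review requested for pair features: {pair}")
    return "pending expert review"

def resolve(pair, accept=0.9, reject=0.1):
    """Auto-accept confident matches, auto-reject confident non-matches,
    and escalate everything in between to a human expert."""
    p_match = model.predict_proba([pair])[0][1]
    if p_match >= accept:
        return "match"
    if p_match <= reject:
        return "non-match"
    return ask_expert(pair)

print(resolve([0.91, 0.85, 1.0]))  # a clear-looking pair
print(resolve([0.55, 0.50, 0.0]))  # an ambiguous pair
```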

Bringing together disparate data sources

Toyota Motor Europe’s challenge was bringing together disparate data sources, each managed at a national level, into a single European view. Individual countries had their own methods for integrating data, yet these approaches became problematic as the complexity and speed of data increased with the rise of new digitized processes. As an example, Toyota’s customers expect to receive relevant information when they use Toyota applications, visit retailer websites, or go to a dealership. A seamless handover in the digital-to-physical customer journey is crucial, yet the company was unable to provide this type of innovation because its data platforms were managed separately in each country.

By consolidating millions of records from these disparate sources, the Tamr platform was able to ingest Toyota’s existing European customer data and integrate it within one system. This unified view allows Toyota to meet user needs and provide customer service during both digital and physical interactions.

Integrate new data — no restructuring needed

An important factor in integrating customer data at scale is being able to quickly match new data sets with existing information. In the webcast, Stevens explains: “Introducing new sources of data was becoming a real issue for us. We wanted to start to tackle data integration at the European level.” Tamr uses machine learning algorithms to analyze and detect similar patterns in new data sources, and then merges the new sets with existing data. This allowed Toyota to bridge the gap between new and existing data sources.
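One way to picture how a new source can be matched against existing data is to compare attributes by the values they contain and propose likely alignments. The toy sketch below does this with simple Jaccard similarity over column values; it is only meant to illustrate the concept, and the column names and data are invented, not drawn from Toyota’s systems or Tamr’s algorithms.

```python
# Toy illustration of aligning a new data source with an existing data set
# by comparing column value overlap (not Tamr's actual algorithm).

def jaccard(a, b):
    """Jaccard similarity between two sets of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

existing = {
    "customer_name": ["Anna Muller", "Jan Novak", "Marie Dubois"],
    "country":       ["DE", "CZ", "FR"],
}
new_source = {
    "full_name": ["Jan Novak", "Pierre Martin", "Anna Muller"],
    "cntry":     ["CZ", "FR", "DE"],
}

# Propose a mapping from each new column to the most similar existing column.
for new_col, new_vals in new_source.items():
    best = max(existing, key=lambda col: jaccard(new_vals, existing[col]))
    score = jaccard(new_vals, existing[best])
    print(f"{new_col} -> {best} (similarity {score:.2f})")
```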

Easy implementation was something Toyota took seriously when choosing Tamr. Thanks to open APIs, the deployment process didn’t require restructuring the systems already in place at Toyota. Stevens adds that because the Tamr platform understands the entropy of data, Toyota can continue to gain value from the data unification process.
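Integration over open APIs generally means that existing systems keep their own structure and simply exchange records over HTTP. The sketch below shows that pattern with a hypothetical REST endpoint; the URL, payload shape, and authentication are placeholders and are not Tamr’s published API.

```python
# Hypothetical example of pushing a record to a data unification service
# over a REST API; the URL, payload shape, and token are illustrative only.
import requests

API_URL = "https://unification.example.com/api/records"  # placeholder endpoint
API_TOKEN = "replace-with-real-token"                    # placeholder credential

record = {
    "source_system": "dealer_crm_fr",
    "customer_name": "Marie Dubois",
    "email": "marie.dubois@example.com",
}

response = requests.post(
    API_URL,
    json=record,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
print("Record submitted:", response.json())
```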

Watch a live demo

For a better understanding of the process and benefits of using machine learning to unify customer data at scale, watch the live demo featured in the free webcast, Integrating Customer Data at Scale.

As the amount of data continues to double every two years, organizations are struggling more than ever to manage, ingest, store, process, transform, and analyze massive data sets. It has become clear that getting started on the road to using data successfully can be a difficult task, especially with a growing number of new data sources, demands for fresher data, and the need for increased processing capacity. In order to advance operational efficiencies and drive business growth, however, organizations must address and overcome these challenges.

In recent years, many organizations have invested heavily in enterprise data warehouses (EDWs) to serve as the central system for reporting, extract/transform/load (ETL) processes, and data ingestion from diverse databases and other sources both inside and outside the enterprise. Yet, as the volume, velocity, and variety of data continue to increase, already expensive and cumbersome EDWs are becoming overloaded. Furthermore, traditional ETL tools are unable to handle all the data being generated, creating bottlenecks in the EDW that result in major processing burdens.

As a result of this overload, organizations are now turning to open source tools like Hadoop as cost-effective ways to offload processing functions from the EDW. While Hadoop, used as a complement to the data warehouse, can help organizations lower costs and increase efficiency, most businesses still lack the skill sets required to deploy it.

Where to begin?

Organizations challenged with overburdened EDWs need solutions that can offload the heavy lifting of ETL processing from the data warehouse to an alternative environment capable of managing today’s data sets. The first question is always, “How can this be done in a simple, cost-effective manner that doesn’t require specialized skill sets?”

Let’s start with Hadoop. As previously mentioned, many organizations deploy Hadoop to offload their data warehouse processing functions. After all, Hadoop is a cost-effective, highly scalable platform that can store large volumes of structured, semi-structured, and unstructured data. Hadoop can also help accelerate the ETL process, while significantly reducing costs in comparison to running ETL jobs in a traditional data warehouse. However, while the benefits of Hadoop are appealing, the complexity of the platform continues to hinder adoption at many organizations. It has been our goal to find a better solution.
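To make the idea of running ETL in Hadoop rather than in the warehouse concrete, here is a small PySpark sketch (Spark being one common engine available on Hadoop clusters) that reads raw files from HDFS, cleanses and aggregates them, and writes a compact extract back for the EDW to ingest. The paths and column names are illustrative, and the reference architecture discussed below relies on Syncsort tooling rather than hand-written Spark code.

```python
# Illustrative ETL job offloaded to a Hadoop cluster with PySpark.
# Paths and column names are placeholders, not part of any vendor's architecture.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("edw-etl-offload").getOrCreate()

# Extract: read raw transaction files landed in HDFS instead of staging them in the EDW.
raw = spark.read.option("header", True).csv("hdfs:///landing/sales/*.csv")

# Transform: cleanse and aggregate on the cluster, where compute is cheap and scalable.
cleaned = (
    raw.dropDuplicates(["transaction_id"])
       .filter(F.col("amount").isNotNull())
       .withColumn("amount", F.col("amount").cast("double"))
)
daily_totals = cleaned.groupBy("store_id", "transaction_date").agg(
    F.sum("amount").alias("daily_revenue"),
    F.count("transaction_id").alias("transaction_count"),
)

# Load: write a compact, query-ready extract that the EDW can ingest directly.
daily_totals.write.mode("overwrite").parquet("hdfs:///curated/sales_daily/")

spark.stop()
```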

Using tools to offload ETL workloads

One option to solve this problem comes from a combined effort between Dell, Intel, Cloudera, and Syncsort. Together they have developed a pre-configured ETL offload solution that enables businesses to capitalize on the technical and cost advantages Hadoop offers. It delivers a use-case-driven Hadoop reference architecture that augments the traditional EDW, enabling customers to shift ETL workloads to Hadoop, increase performance, and optimize EDW utilization by freeing up cycles in the EDW for analysis.

The new solution combines the Hadoop distribution from Cloudera with a framework and tool set for ETL offload from Syncsort. These technologies are powered by Dell networking components and Dell PowerEdge R series servers with Intel Xeon processors.

The technology behind the ETL offload solution simplifies data processing by providing an architecture to help users optimize an existing data warehouse. So, how does the technology behind all of this actually work?

The ETL offload solution provides the Hadoop environment through Cloudera Enterprise software. The Cloudera Distribution of Hadoop (CDH) delivers the core elements of Hadoop, such as scalable storage and distributed computing, and together with the software from Syncsort, allows users to cut Hadoop deployment time to weeks, develop Hadoop ETL jobs in a matter of hours, and become fully productive in days. Additionally, CDH provides security, high availability, and integration with a large set of ecosystem tools.

Syncsort DMX-h software is a key component of the solution’s reference architecture. Designed from the ground up to run efficiently in Hadoop, Syncsort DMX-h removes barriers to mainstream Hadoop adoption by delivering an end-to-end approach for shifting heavy ETL workloads into Hadoop, and it provides the connectivity required to build an enterprise data hub. For even tighter integration and accessibility, DMX-h has monitoring capabilities integrated directly into Cloudera Manager.

With Syncsort DMX-h, organizations no longer need MapReduce skills or mountains of hand-written code to take advantage of Hadoop. This is made possible through intelligent execution that allows users to graphically design data transformations and focus on business rules rather than underlying platforms or execution frameworks. Furthermore, users no longer have to change their applications to deploy the same data flows on or off Hadoop, on premises, or in the cloud. This future-proofing concept provides a consistent user experience during the process of collecting, blending, transforming, and distributing data.
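The “design once, deploy anywhere” idea can be illustrated with a small example in which the business rules live in a single function that is unaware of the execution platform. The sketch below runs the same rules on a local pandas DataFrame and, in commented-out form, on a distributed pandas-on-Spark DataFrame; it is a generic illustration of the concept, not how DMX-h is implemented, and the column names and paths are invented.

```python
# Generic sketch of platform-independent business rules (not DMX-h itself):
# the same function runs locally or on a cluster without application changes.

def apply_business_rules(df):
    """Keep only completed orders and derive a net_amount column."""
    df = df[df["status"] == "completed"].copy()
    df["net_amount"] = df["gross_amount"] - df["discount"]
    return df

# Local execution with pandas:
import pandas as pd

local_orders = pd.DataFrame({
    "status": ["completed", "cancelled", "completed"],
    "gross_amount": [100.0, 50.0, 80.0],
    "discount": [10.0, 0.0, 5.0],
})
print(apply_business_rules(local_orders))

# Distributed execution with the pandas API on Spark (requires a Spark cluster):
# import pyspark.pandas as ps
# cluster_orders = ps.read_parquet("hdfs:///curated/orders/")
# result = apply_business_rules(cluster_orders)
```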

Additionally, Syncsort has developed SILQ, a tool that facilitates understanding, documenting, and converting massive amounts of SQL code to Hadoop. SILQ takes an SQL script as input and produces a detailed flow chart of the entire data stream, mitigating the need for specialized skills and greatly accelerating the process, thereby removing another roadblock to offloading the data warehouse into Hadoop.
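As a rough picture of the kind of analysis such a tool automates, the toy sketch below scans an SQL script with regular expressions and reports which source tables feed which target table. A real tool like SILQ parses the SQL properly and produces a full flow chart; this fragment, with its invented table names, only illustrates the input and the kind of lineage output involved.

```python
# Toy sketch of SQL lineage analysis (illustration only, not SILQ):
# find which source tables feed which target tables in a script.
import re

sql_script = """
INSERT INTO sales_summary
SELECT s.store_id, SUM(s.amount)
FROM staging_sales s
JOIN store_dim d ON s.store_id = d.store_id
GROUP BY s.store_id;
"""

targets = re.findall(r"INSERT\s+INTO\s+(\w+)", sql_script, re.IGNORECASE)
sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql_script, re.IGNORECASE)

for target in targets:
    print(f"{', '.join(sorted(set(sources)))} -> {target}")
```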

In the reference architecture, Dell PowerEdge R730 servers are used for infrastructure nodes, and Dell PowerEdge R730xd servers are used for data nodes.

The path forward

Offloading massive data sets from an EDW can seem like a daunting task for organizations looking for more effective ways to manage their ever-increasing data sets. Fortunately, businesses can now capitalize on ETL offload opportunities with the right software and hardware for shifting expensive workloads and associated data from overloaded enterprise data warehouses to Hadoop.

By picking the right tools, organizations can make better use of existing EDW investments by reducing the costs and resource requirements for ETL.