The Evolution of Data Lakes

In this special guest feature, Glenn Graney, Director – Industrial & High Tech for QAD, suggests that manufacturers should be planning for all enterprise data sets to be part of the greater data lake. The transition to a data lake emphasizes flexible access to analysis tools and is less centered on data preparation. By definition, the data lake will be made up of a variety of data sources, and the accessibility requirements and effort will only be defined at the time of the query. Glenn has 35 years of leadership experience in advanced manufacturing and the associated supporting automation, business and information systems. He has contributed to the successful design and implementation of solutions across the entire range of the most demanding industrial environments. Most recently, Glenn has been engaging with manufacturers on their journey with the advanced technologies associated with Industry 4.0. He has both a BS and MS in Industrial & Manufacturing Engineering from Penn State.

The Internet of Things (IoT) and Machine Learning are key aspects of Industry 4.0. Both of these technologies will result in the unprecedented collection and analysis of data to drive new insights and benefits for manufacturers. Interestingly, there is nothing particularly new about wanting to use manufacturing data to drive improvements. What is new is the transition away from the heavy data preparation effort that consumed a large portion of Data Warehouse and even Big Data initiatives. Data from disparate systems often went through multiple levels of aggregation and indexing to prepare it for answering traditional questions. That type of aggregation is no longer necessary.

What Does This Mean for Manufacturers?

Manufacturers should be planning for all enterprise data sets to be part of the greater data lake. A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured and unstructured data. The data structure and requirements are not defined until the data is needed (unlike a traditional database). The transition to a data lake emphasizes flexible access to analysis tools and makes the process less centered on data preparation. By definition, the data lake will be made up of a variety of data sources and the accessibility requirements and effort will only be defined at the time of the query.
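The "structure defined only when the data is needed" idea, often called schema-on-read, can be sketched in a few lines of Python. The file contents, field names and asset identifiers below are all hypothetical; real lakes would hold files in object storage, but the principle is the same:

```python
import json
from io import StringIO

# Hypothetical raw sensor events, landed in the lake in their native JSON
# form. No schema was imposed at ingest time; fields vary by source.
raw_lake = StringIO("\n".join([
    '{"asset": "extruder-1", "temp_c": 210.5, "ts": "2023-05-01T08:00:00"}',
    '{"asset": "extruder-1", "pressure_bar": 92.1, "ts": "2023-05-01T08:00:01"}',
    '{"asset": "press-4", "temp_c": 95.0, "ts": "2023-05-01T08:00:02"}',
]))

# Schema-on-read: the "temperatures by asset" structure is defined only now,
# at query time, by the code asking the question.
temps = {}
for line in raw_lake:
    event = json.loads(line)
    if "temp_c" in event:  # skip events that lack the field this query needs
        temps.setdefault(event["asset"], []).append(event["temp_c"])

print(temps)  # {'extruder-1': [210.5], 'press-4': [95.0]}
```

A different question tomorrow (say, pressure trends) would simply read the same raw files with a different query-time structure; nothing had to be re-ingested or re-indexed.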

Information housed in traditional ERP systems can be a core part of the Industry 4.0 effort. For example, flexible manufacturing assets can be used to produce many different SKUs, and there is a difference in the wear and tear on the asset while producing the respective SKUs. A plastic extrusion process may be making a tube with an outside diameter of 1” in the morning and then be changed over to make a similar tube with a 2” diameter. The process parameters for the two die sets, the stress on the associated hydraulics and the rate of flow for proper cooling are all contextual to the respective product SKU being produced. There can also be a difference between individual operators and how they run the asset. This data on variation between production orders and assigned operators often resides in the ERP system.

The onslaught of IoT-based, asset-centric process variables must be balanced with traditional business data. ERP-based asset data is an essential element in developing a comprehensive view of asset performance. The form and format of this production data obviously look completely different from the time-streamed scalar values representing the feeds and speeds that come directly from the asset over IoT.

The difference in data format does not diminish the importance of considering all aspects of production to deliver complete understanding. All of this disparate but relevant data must be considered as manufacturers turn to machine learning technology to drive true predictive maintenance. The sensor data for any given time period can only truly be evaluated in the context of the production order that is being processed during that time. Together, advanced data analytics and machine learning can embrace this “lumpy” data set.
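That contextualization step can be sketched in plain Python. The order records, SKU names and sensor values below are invented for illustration; at scale this would be a time-interval join in a query engine, but the logic is the same: tag each sensor reading with the production order active at that moment.

```python
from datetime import datetime

# Hypothetical ERP production orders (business data) and IoT sensor
# readings (time-streamed scalars from the asset).
orders = [
    {"order": "PO-100", "sku": "TUBE-1IN",
     "start": datetime(2023, 5, 1, 6, 0), "end": datetime(2023, 5, 1, 12, 0)},
    {"order": "PO-101", "sku": "TUBE-2IN",
     "start": datetime(2023, 5, 1, 12, 0), "end": datetime(2023, 5, 1, 18, 0)},
]

readings = [
    {"ts": datetime(2023, 5, 1, 8, 30), "hydraulic_psi": 1450},
    {"ts": datetime(2023, 5, 1, 14, 10), "hydraulic_psi": 2100},
]

def order_at(ts):
    """Return the production order whose time window covers this timestamp."""
    for o in orders:
        if o["start"] <= ts < o["end"]:
            return o
    return None

# Enrich each reading with its SKU context: the same hydraulic pressure can
# be normal for one die set and a warning sign for another.
enriched = []
for r in readings:
    o = order_at(r["ts"])
    enriched.append({**r, "sku": o["sku"] if o else None})

print(enriched[0]["sku"], enriched[1]["sku"])  # TUBE-1IN TUBE-2IN
```

Once every reading carries its order and SKU context, a machine learning model can learn per-SKU baselines rather than averaging away the very variation that predicts failure.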

Data lakes can also alter the query process to allow for temporary or short-term inclusion of data sets without heavy preparation. For example, if you go to a search engine and type your first and last name, you will most likely get a result that finds you on LinkedIn or some other social media application. What you did not have to do was pre-define that the entry was a person's first and last name. The entry itself dynamically aligned with a number of data sets that might hold appropriate responses.
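The search-engine analogy can be made concrete with a toy sketch: several data sets of different shapes, none of which declared a "name" field in advance, are matched against a query only at the moment it is asked. The data set names and records below are hypothetical:

```python
# Hypothetical heterogeneous data sets: a dict-shaped CRM, dict-shaped
# work orders, and free-text log lines. No shared schema was prepared.
datasets = {
    "crm_contacts": [{"name": "Glenn Graney", "role": "Director"}],
    "work_orders":  [{"assigned_to": "Pat Smith", "order": "WO-7"}],
    "asset_log":    ["2023-05-01 extruder-1 serviced by Glenn Graney"],
}

query = "glenn graney"

def matches(record, q):
    # Flatten whatever shape the record has into text, then test the query.
    values = record.values() if isinstance(record, dict) else [record]
    return q.lower() in " ".join(str(v) for v in values).lower()

# Query-time alignment: each data set is scanned as-is, with no pre-built
# index telling us which fields count as a first or last name.
hits = {name: [r for r in rows if matches(r, query)]
        for name, rows in datasets.items()}

print({k: len(v) for k, v in hits.items()})
# {'crm_contacts': 1, 'work_orders': 0, 'asset_log': 1}
```

A new data set dropped into `datasets` tomorrow would be searchable immediately, which is the "short-term inclusion without heavy preparation" the paragraph describes.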

Data is the foundation! It continues to be the critical foundation of understanding and potential improvement. Legacy techniques based on batch export and import into a centralized, heavily indexed data warehouse need to be reevaluated. Data storage and evaluation tools need to evolve to match the dynamic data collection and query requirements brought on by Industry 4.0.

Industry 4.0 is still in its early stages and will intersect multiple advanced technologies as it grows — data lakes are only one of them.
