A data lake pairs raw data storage with an attendant data management system that provides analytics about the data -- a capability that is typically stripped from other analytics environments, like a data warehouse or data mart, as part of the data cleansing process.

For example, a data warehouse's extract, transform and load (ETL) preprocessors eliminate the logs that tell when a record arrived in or was inserted into an operational data store.

But in the industry today, data lakes seem to have at least two definitions. One, which originated from storage companies, is that a data lake is a disk-storage infrastructure that allows for metadata storage. The other, which is primarily marketing-driven, is a lake mixing multiple data stores that aren't typically mixed. By my definition, there is no vendor that sells a full-scale data lake -- rather, people cobble them together using Hadoop and homegrown tools to access the data.

As the initial vendor hype gave way to real-world experimentation, users have found that best practices for data marts don't apply to data lakes. To avoid the mistakes of early users, approach a data lake implementation modestly rather than at large scale.

Remember that data lakes are exploratory

A data lake implementation should allow organizations to extend existing analytics in an ad hoc, exploratory fashion. Grow the data types in the data lake from a core of highly current data -- for example, customer transaction logs -- that current analytic systems don't surface in a timely fashion. Most existing analytics aren't sufficient for a true picture of how your applications behave: data warehouses, "pure" Hadoop and other data management schemes lose important data.

On his blog, James Dixon, CTO at Pentaho Corp., a provider of big data analytics systems, cites an example: systems such as data warehouses don't capture each step a customer takes in the buying process, but the transaction logs do. The buying process may seem straightforward to the typical data architect, but there can be minutes or even hours of infuriating lag at each step.

By uncovering these lags, users can start the data lake implementation with customer-facing, buying-related transactions. The analytics should be exploratory yet tied to the enterprise's overall analytics effort, because it's unclear what else will be uncovered once users analyze the customer log timestamps more thoroughly.
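The lag analysis described above can be sketched in a few lines. This is a minimal, hypothetical example -- the log fields, step names, and timestamp format are illustrative assumptions, not any vendor's actual schema -- showing how raw transaction-log timestamps, which a warehouse's ETL step would normally discard, reveal per-step delays in a buying process:

```python
from datetime import datetime

# Hypothetical raw transaction-log entries: (customer_id, step, timestamp).
# Field names, step names and format are illustrative assumptions.
log = [
    ("c1", "view_item",   "2016-03-01 10:00:05"),
    ("c1", "add_to_cart", "2016-03-01 10:02:40"),
    ("c1", "checkout",    "2016-03-01 11:15:10"),  # over an hour of lag here
    ("c1", "payment",     "2016-03-01 11:15:55"),
]

def step_lags(entries):
    """Return (from_step, to_step, lag_in_seconds) for consecutive steps."""
    times = [(step, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
             for _, step, ts in entries]
    return [(a, b, (tb - ta).total_seconds())
            for (a, ta), (b, tb) in zip(times, times[1:])]

for frm, to, lag in step_lags(log):
    print(f"{frm} -> {to}: {lag:.0f}s")
```

Running this over one customer's session immediately flags the add_to_cart-to-checkout step as taking over an hour -- exactly the kind of "gotcha" that an aggregated, cleansed warehouse view would never show.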

What's the difference between data marts, lakes and warehouses?

Data marts are variants of data warehouses. A data warehouse stores slightly older data from across the organization for reporting and analytics. Multiple data marts are a rough equivalent of a data warehouse, each typically serving a subsidiary within its own IT environment. You can have multiple data marts feeding into a data warehouse, or just loosely coupled data marts.

Integration is key for data lake implementation

It's also important to fully integrate the data lake with the rest of your enterprise data architecture, including data governance and master data management. Understand which data types matter to the data warehouse or data marts and whether the data in raw form is correct and consistent. Implement data governance practices to avoid analyzing flawed data.
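Checking whether raw lake data is correct and consistent before it feeds downstream systems can start very simply. The sketch below is a hypothetical example of such a sanity check -- the record layout, field names and rules are assumptions for illustration, not a governance product's API:

```python
from datetime import datetime

# Minimal sketch of raw-data sanity checks for records landing in a lake.
# The record layout and required fields are hypothetical examples.
REQUIRED_FIELDS = {"customer_id", "step", "timestamp"}

def validate_record(record):
    """Return a list of problems found in one raw log record."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    ts = record.get("timestamp")
    if ts is not None:
        try:
            datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            problems.append(f"unparseable timestamp: {ts!r}")
    return problems

records = [
    {"customer_id": "c1", "step": "checkout", "timestamp": "2016-03-01 11:15:10"},
    {"customer_id": "c2", "step": "payment", "timestamp": "not-a-time"},
]
for i, rec in enumerate(records):
    for problem in validate_record(rec):
        print(f"record {i}: {problem}")
```

Even a lightweight check like this, run before lake data reaches the warehouse or data marts, helps avoid analyzing flawed data -- the raw form stays in the lake, but governance knows what's wrong with it.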

Data lakes in the long run

Data lakes have potential. But they're likely to be just a fad unless we get a much better idea of what they can deliver in the long term -- and unless their benefits prove much broader than what has been concretely shown so far.

Dixon's example of data warehousing's problems when incorporating time sequencing and spacing is only one instance of how today's analytics continue to rely on simple statistics without considering what "bad" data can tell us. Since a data lake implementation can unearth key "gotchas" in analytics, it's worth exploring for any enterprise. In the long run, however, this requires both experimentation and careful balancing of the data lake and your overall information architecture.
