Best Practices for Creating and Operationalizing Data Lakes

August 30, 2018

By: Farnaz Erfan

Data lakes have a wide range of use cases. For some, a data lake is an augmentation of an enterprise data warehouse. For others, it is a staging area accessible to technical teams for data science and machine learning. And for most, it has become a storage area for archiving data with the intention of unlocking its value at a later time. As Michelle Goetz of Forrester puts it, “ultimately, data lakes are a mechanism to rationalize data ecosystems, scale and democratize data, and serve a wider number of business use cases than previous data repositories.”1

In implementing data lakes, it's best to think of a data lake as an application investment. Start with an evaluation phase, in which all stakeholders have a chance to set their expectations and the team can assess requirements. Often, data lakes begin as innovation grounds where ideation for new data products or process optimization takes place. Regardless, during the evaluation phase it is important to define the data lake's success criteria in terms of its primary and secondary goals, and to show results continually. For starters, these questions can guide the implementation plans:

Who will the data lake serve and for which business use cases?

Will the primary users be data analysts, data scientists, data engineers, or a combination of these roles?

Do the users have the right skillsets? Are there technologies that can capitalize on the existing skillset of the team?

What discovery and exploration tools can help unlock the value of the data lake quickly and continuously?

What ultimately defines the return on investment in the data lake? What are the short-term and long-term goals?

While the responsibility for implementing the data lake often lies with technical teams, the success and longevity of the project depend on adoption, and on whether business teams and executive stakeholders can see value in each phase of the implementation. This is often a challenge, as the tools and interfaces used to interact with data lakes are technical – too hard for business teams to use. To overcome this, perhaps the best piece of advice comes from Gartner: “Often, IT doesn’t understand what the data means, and some companies do not allow IT to know what it means. IT should architect the data lake with the focus on self-service capabilities so that the business can derive value from this data.”2 This is where self-service data preparation tools such as Paxata can accelerate the value realization of data lake projects.