Should I Use a Data Lake or a Data Warehouse? Six Questions to Ask

Data warehouse versus data lake – when should you use which for analytical insights? Compared to the venerable data warehouse, data lakes are relative upstarts in the business intelligence world. While similar in concept to warehouses, data lakes are quite different in structure and use. The terminology conjures up the appropriate imagery: brick-and-mortar warehouses are structured, efficient – and relatively inflexible. In contrast, lakes are catchalls that can hold just about anything, but like their real-world counterparts they can have many “unknowns” under the surface.

Both platforms have a place in the enterprise, and both can be applied to business intelligence applications, but their strengths differ. Data warehouses are fast and scalable, but expensive. They are best used when both the source data and outputs are well understood. Data lakes can handle a much greater variety of data types than warehouses and are less expensive because they use off-the-shelf hardware and storage. They scale well, but require a new way of thinking to support such a wide variety of inputs and the exploratory value they offer.

Which should you choose to support your analytics and BI needs? Here are six factors to consider.

1. What kind of data will you be using?

Data warehouses are essentially giant relational databases, and although these underlying RDBMSs have evolved, they were designed primarily around clean, structured data. Under the right conditions, they deliver excellent performance, but getting data into the right format can be challenging. Data warehouses are often ideal for running queries on recent data (though not necessarily immediate or real-time data), and for end users who need sub-second responses on all of their queries.

Data lakes can handle nearly anything you throw at them, including unstructured formats like log files, emails, text, audio (which can be converted to text and analyzed), and even video. The big advantage is flexibility: end users get quicker access to the data in a faster analytic lifecycle, rather than waiting for extensive data modeling by the IT team just to get the data loaded. Part of the core value proposition of Apache Hadoop when it rose to popularity was “schema on read” instead of “schema on write”. So if you’re looking to analyze unstructured data formats, a data lake is the practical choice, as the effort to use those formats in a data warehouse is cost-prohibitive.
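To make the “schema on read” idea concrete, here is a minimal sketch in plain Python: raw log lines land untouched, and each analysis projects only the fields it cares about at read time. The record structure and field names here are hypothetical, stand-ins for whatever your machines actually emit.

```python
import json

# Raw, untouched log lines as they might land in a data lake
# (hypothetical records; no schema was enforced when they were written).
raw_lines = [
    '{"ts": "2023-01-05T10:00:00", "user": "alice", "action": "login"}',
    '{"ts": "2023-01-05T10:02:00", "user": "bob", "action": "purchase", "amount": 42.5}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project only the fields this
    analysis needs, defaulting missing ones to None."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two different "schemas" over the same raw data, chosen at query time.
logins = list(read_with_schema(raw_lines, ["ts", "user", "action"]))
purchases = [r for r in read_with_schema(raw_lines, ["user", "amount"]) if r["amount"]]
```

Note that no transformation ran when the data was written; each consumer imposed its own view when it read. A warehouse would instead force one fixed schema on write, before any record could be loaded.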

Interestingly, structured data formats do not preclude the use of data lakes. The concept of data warehouse optimization, for example, is precisely about moving structured formats from an expensive data warehouse to a much more economical data lake. The trade-off for data lakes is performance degradation on certain types of queries, as the low-cost hardware used for data lakes typically doesn’t compare to the horsepower of data warehouse hardware (though Arcadia Data would like to show you how our technology mitigates that trade-off).

2. How clear are your objectives?

Are you interested in data exploration, and potentially learning more about what you can gain from your data? Or do you tend to stick to the basics? If the output you’re looking for tends to be well-known and fixed, then a data warehouse works well. For example, you might ask for shoe sales by size by region over the past five years. If your objectives are less clear-cut, or if you want to incorporate non-standard data sources – like regional rainfall totals and predictions based on fine-grained readings from field sensors – into your calculations, a data lake is a better platform.
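A well-known, fixed question like the shoe-sales example maps directly onto warehouse-style SQL. Here is a minimal sketch using an in-memory SQLite table; the table and column names are hypothetical, and the year cutoff is hardcoded only for illustration.

```python
import sqlite3

# A hypothetical fact table, as a warehouse might model it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shoe_sales (region TEXT, size INTEGER, year INTEGER, units INTEGER)")
conn.executemany(
    "INSERT INTO shoe_sales VALUES (?, ?, ?, ?)",
    [("West", 9, 2021, 120), ("West", 9, 2022, 150), ("East", 10, 2021, 80)],
)

# "Shoe sales by size by region over the past five years" as one query.
rows = conn.execute(
    """SELECT region, size, SUM(units)
       FROM shoe_sales
       WHERE year >= 2019
       GROUP BY region, size
       ORDER BY region, size"""
).fetchall()
```

The query is easy precisely because the question was known before the table was designed; the schema was built to answer it. Exploratory questions that the schema never anticipated are where this model strains.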

A plan to add more non-standard data sources should be top-of-mind for all organizations. You will find that as you incorporate more data sources into your analytics environment, you will uncover more questions to ask and more insights to act upon. You cannot necessarily follow the same strategy with data warehouses: data in data lakes is often the “data exhaust” from machines, which typically has dynamic structures that change over time and impose a huge tax in terms of data preparation and modeling. Other data sources not ideal for inclusion in data warehouses include set-top boxes, web servers, and even log files from Kubernetes jobs, all of which are more easily explored and analyzed in a data lake.
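The structural drift in that “data exhaust” can be absorbed at read time rather than by remodeling a warehouse schema for every change. A minimal sketch, with hypothetical device records whose fields drift across versions:

```python
# Machine "data exhaust" whose structure drifts over time (hypothetical):
# early records have only "temp"; later ones add "humidity";
# the newest rename "temp" to "temp_c".
events = [
    {"device": "d1", "temp": 21.0},
    {"device": "d2", "temp": 22.5, "humidity": 0.4},
    {"device": "d3", "temp_c": 19.8, "humidity": 0.5},
]

def normalize(event):
    """Reconcile drifting field names at read time instead of
    reworking a fixed warehouse schema for every change."""
    return {
        "device": event["device"],
        "temp_c": event.get("temp_c", event.get("temp")),
        "humidity": event.get("humidity"),  # absent in early records
    }

normalized = [normalize(e) for e in events]
```

In a warehouse, each of those format changes would typically mean a schema migration and a reload; in a lake, the raw records stay as they arrived and the reconciliation lives in the read path.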

3. How technical are you?

Many business users and analysts are comfortable with the mature and battle-hardened SQL language for retrieving data, particularly through a business intelligence (BI) tool that generates SQL from a drag-and-drop dashboard environment, shielding end users from the complexity of the underlying data. SQL is a good general-purpose tool for most common business queries where the data is already well-structured in a relational format for easy querying. Working with Hadoop- or object store-based data lakes shouldn’t be viewed in the same way. The promise of Hadoop and data lakes was to store data in its original format and provide “schema on read” rather than “schema on write,” thereby reducing the up-front time and cost of data preparation and transformation. Legacy BI tools generally were not designed to deal with “unstructured” (i.e., non-relational) data types in their native state; they treat the data lake just like a data warehouse and require up-front data preparation, modeling, and sampling to provide adequate performance and usability. Therefore, until recently, unlocking value in a data lake required specialized analytical knowledge, as well as tools and languages like Python, Scala, and R that demand a good deal of training and experience to master. That’s one reason data lakes are often considered the domain of data scientists while warehouses are adapted more for mainstream business use.

However, modern BI and visual analytics technologies like Arcadia Data can read modern, more complex data formats with nested schemas, such as Parquet, allowing querying and analysis without as much up-front work. This makes data lakes much more valuable to a larger user audience of power users who want to explore data in a data lake without heavy data engineering or IT support. Modern visual analytics and BI tools can even be used by casual business users who just want to consume visual dashboards and analytics. When deployed with the right tools, data lakes are a great architecture for reducing the load on IT personnel and providing more self-service to end users.
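To illustrate what querying a nested schema without up-front flattening looks like, here is a sketch in plain Python. The nested dicts stand in for records as a Parquet file with a nested schema might hold them; all record and field names are hypothetical.

```python
# Nested records, as a file with a nested schema might hold them
# (shown as plain dicts; field names are hypothetical).
orders = [
    {"order_id": 1, "customer": {"name": "Ann", "region": "West"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
    {"order_id": 2, "customer": {"name": "Raj", "region": "East"},
     "items": [{"sku": "A1", "qty": 5}]},
]

# Drill into the nested structure at query time: total quantity of
# SKU "A1" per customer region, with no up-front flattening step.
totals = {}
for order in orders:
    region = order["customer"]["region"]
    for item in order["items"]:
        if item["sku"] == "A1":
            totals[region] = totals.get(region, 0) + item["qty"]
```

A legacy relational pipeline would first explode these records into flat customer, order, and line-item tables before any query could run; tools that read nested formats natively skip that modeling step.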

4. How much money do you have?

Data lakes are typically built on commodity hardware and storage, along with open-source big data software platforms like Hadoop and Spark. In contrast, data warehouses are usually built on specialized infrastructure like massively parallel hardware combined with proprietary data management software. As a result, the cost per terabyte of a data warehouse can easily be 10 times that of a data lake. Unless your budget is infinitely large, that economic advantage makes data lakes the more desirable choice for the many use cases that are well-suited to either environment.

5. How important is security?

In the early days of data lakes, security was seen as a big gap. Organizations felt they could only manage non-sensitive data in their data lakes. However, these days more robust security systems are available, especially on Hadoop, to ensure protection of sensitive data. So if data security is important to your organization (and it should be important to every organization), you can rest assured that data in your data lake can be protected in ways similar to your data warehouse. Many forms of protection are available, including role- and policy-based access controls, masking, tokenization, and encryption for both data-at-rest and data-in-motion.
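To make two of those protections concrete, here is a minimal stdlib sketch of masking and tokenization, not the API of any particular security product. The key, field values, and token length are all hypothetical; real deployments would use a key-management service and a dedicated tokenization vault.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # hypothetical; in practice held by a key-management service

def mask_card(card_number: str) -> str:
    """Masking: hide all but the last four digits for display."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

def tokenize(value: str) -> str:
    """Tokenization (sketch): replace a sensitive value with a keyed,
    deterministic token so analysts can group and join on it
    without ever seeing the raw value."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

masked = mask_card("4111111111111111")
token_a = tokenize("alice@example.com")
token_b = tokenize("alice@example.com")  # same input, same token
```

Because the token is deterministic under the key, the same customer appears as the same token across datasets, which preserves analytic joins while keeping the identifier itself out of the lake.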

It’s worth noting that quite often, organizations overlook the fact that security is more important from a process standpoint than it is from a technology standpoint. This means that even if you have advanced security technologies, their usefulness is diminished if you don’t use them properly. A popular example of this is the phenomenon of writing passwords on sticky notes attached to your desktop monitor. The “process” tells us not to share passwords with anyone, but the sticky notes violate the process, and thus defeat the value of the security. Another example pertaining to data lakes involves moving data from the data lake to dedicated BI servers to run fast, production dashboards. This data movement creates a security and governance challenge because the copied data now requires a separate security effort to ensure its protection. This duplication of effort is not only an unnecessary administrative burden, but it also introduces opportunities for error, potentially exposing sensitive data to unauthorized users. Here’s where technology can help: a native BI platform like Arcadia Data obviates the need to move data from the data lake, ensuring it stays protected within the data lake’s security framework.

6. Is self-service BI important to you?

Topics like self-service BI, data agility, and information sharing have been thrown around for years. Terms like “democratization of data” and “consumerization of data” represent a more end user-oriented approach to analytics with reduced IT requirements. But these concepts aren’t always key tenets of an organization’s data strategy. Organizations may believe that these approaches are too hard to implement effectively, or that they introduce downsides that must be avoided. For example, do you give end users the freedom to download data at will to their desktops to perform analysis via Excel? Many organizations see this as a data governance risk, and would want to limit such freedom. Some organizations believe they already provide self-service BI, but their implementation is actually quite limited because IT teams are still deeply involved in the analytical lifecycle for preparing and delivering the data.

So if you want to stick to a more traditional approach to data analytics with highly controlled and limited access to data, then the data warehouse might be the better approach. For organizations seeking to implement new paradigms that reduce the overhead and delays of IT intervention, the flexibility of the data lake will be appropriate.

Keep in mind that implementing a data lake doesn’t automatically come with the downsides of a self-service environment; you just need to make sure your processes and technology will help mitigate the risks. That said, using the wrong technology can limit your ability to properly deploy these new approaches. Legacy BI tools evolved at the same time as data warehouses, and most are well-suited to running relational queries on large data sets. However, legacy BI software can be quite rigid because it was never designed to work with unstructured data, much less for exploration in such an environment. That means data must be extracted and prepared prior to analysis, a step that partially defeats the purpose of self-service BI. Turn instead to a data lake with modern BI tools to give your end users more freedom with their analytics, greater opportunity to make faster decisions, and more knowledge sharing in a more open environment.

Conclusion

As you can tell, data warehouses and data lakes are not mutually exclusive technologies for your analytical environment. Both play a role in delivering critical business insights, and which technology you choose for your next use case depends on the factors described above. Hopefully the discussion above gives you a better sense of which platform to choose as you continue making valuable use of your data.

If you plan to add a data lake to your analytics arsenal, look at refreshing your BI toolset at the same time. Architecture matters when it saves you time and money, which is why we see many organizations choosing two BI standards – one for their data warehouse and one for their data lake. You can read more about this hot topic or watch the popular presentation (A Tale of Two BI Standards: Data Warehouses and Data Lakes) from O’Reilly Strata Data in New York, which was standing room only – a second session had to be added so as not to create a fire hazard at the event.