To Schema On Read or to Schema On Write, That is the Hadoop Data Lake Question

Posted on: July 2, 2015

By Paige Roberts

Whether ‘tis nobler to suffer the slings and arrows of outrageous data structures, or to structure, to query, perchance to find answers from specific data more quickly. That is the quandary of the Hadoop data lake.

The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data, if you have no idea what lies under the shiny surface.

At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write.

Before we dig deeper into this Hadoop data lake conundrum, let’s start with some definitions.

What is Schema on Write?

Schema on write has been the standard for many years in relational databases. Before any data is written in the database, the structure of that data is strictly defined, and that metadata stored and tracked. Irrelevant data is discarded, data types, lengths and positions are all delineated. The schema; the columns, rows, tables and relationships are all defined first for the specific purpose that database will serve. Then the data is filled into its pre-defined positions. The data must all be cleansed, transformed and made to fit in that structure before it can be stored in a process generally referred to as ETL (Extract Transform Load).

That is why it is called “schema on write” because the data structure is already defined when the data is written and stored. For a very long time, it was believed that this was the only right way to manage data.

But there are more things on heaven and earth than are dreamt of in that philosophy.

What is Schema on Read?

Schema on read is the revolutionary concept that you don’t have to know what you’re going to do with your data before you store it. Data of many types, sizes, shapes and structures can all be thrown willy nilly into the Hadoop Distributed File System, and other Hadoop data storage systems. While some metadata, data about that data, needs to be stored, so that you know what’s in there, you don’t yet know how it will be structured. It is entirely possible that data stored for one purpose might even be used for a completely different purpose than originally intended.

The data is stored without first deciding what piece of information will be important, what should be used as a unique identifier, or what part of the data needs to be summed and aggregated to be useful. Therefore, the data is stored in its original granular form, with nothing thrown away because it is unimportant, nothing consolidated into a composite, and nothing defined as key information.

In fact, no structural information is defined at all when the data is stored.

When someone is ready to use that data, then, at that time, they define what pieces are essential to their purpose. They define where to find those pieces of information that matter for that purpose, and which pieces of the data set to ignore.

This is why it is called “schema on read” since the schema is defined at the time the data is read and used, not at the time that it is written and stored.

Advantages of Schema on Write

The main advantages of schema on write are precision and query speed.

Because you define your data structure ahead of time, when you query, you know exactly where your data is. The structure is generally optimized for the fastest possible return of data for the types of questions the data store was designed to answer. This means you write very simple SQL and get back very fast answers.

In addition, before data is stored in a database, the data must go through a rigorous process to make sure it matches the structure exactly, and will serve the purpose of the database as it is meant to. The data’s quality is checked and enhanced or scrubbed. Duplicates are found and resolved. The data is checked against business rules to make certain it is valid and useful for the purpose defined. This means that the answers you get from querying this data are sharply defined, precise and trustworthy, with little margin for error if your ETL processes and your validation checking have done their job.

Advantages of Schema on Read

The main advantages of schema on read are flexibility in purpose and query power.

Because your data is stored in its original form, nothing is discarded, or altered for a specific purpose. This means that your query capabilities are very flexible. You can ask any question that the original data set might hold answers for, not just the type of questions a data store was originally created to answer. You have the flexibility to ask things you hadn’t even thought of when the data was stored.

Also, different types of data generated by different sources can be stored in the same place. This allows you to query multiple data stores and types at once. If the answer you need isn’t in the data you originally thought it would be in, perhaps it could be found if you combined it with other data sources. This power of this ability cannot be underestimated. This is what makes the Hadoop data lake concept which puts all your available data sets in their original form in a single location such a potent one.

Disadvantages of Schema on Write

The main disadvantages of schema on write are query limitations and inflexible purpose.

The dark side of the tightly controlled precision of a schema on write data store is that the data has been altered and structured specifically to serve a specific purpose. Chances are high that, if another purpose is found for that data, the data store will not suit it well. All the speed that you got from customizing the data structure to match a specific problem set will cost you if you try to use it for a different problem set. And there’s no guarantee that the altered version of the data will even be useful at all for the new, unanticipated need. There’s no ability to query the data in its original form, and certainly no ability to query any other data set that isn’t in the structured format.

Also, to fit the data into the structure, ETL processes and validation rules needed to clean, de-dupe, check and transform that data. Those processes take time to build, time to execute, and time to alter if you need to change it to suit a different purpose.

There is always a time cost to imposing a schema on data. In schema on write strategies, that time cost is paid in the data loading stage.

Disadvantages of Schema on Read

The main disadvantages of schema on read are inaccuracies and slow query speed.

Since the data is not subjected to rigorous ETL and data cleansing processes, nor does it pass through any validation, that data may be riddled with missing or invalid data, duplicates and a bunch of other problems that may lead to inaccurate or incomplete query results.

In addition, since the structure must be defined when the data is queried, the SQL queries tend to be very complex. They take time to write, and even more time to execute.

As I said before, there is always a time cost to imposing schema. In schema on read strategies, that time cost is paid when you query the data.

So, Which Should I Choose, To Schema on Read or To Schema on Write?

That is the question. In my next blog post, I’ll look at some examples of schema on read and schema on write Hadoop SQL technologies, and discuss what criteria a person should use to choose between the two. Until then, remember this above all, to thine own business goals be true.

————–

Paige Roberts has spent a lot of her life stuffing her brain full of information about big data, data integration, data quality, and analytics software, markets, and systems, and is likely to tell you about it in great detail if you don’t run away fast enough. Find her on Twitter or LinkedIn at RobertsPaige, or check out her blog at bigdatapage.com.