View from the 'Xu

dataxu's thoughts on data, analytics and the industry

Technical Tuesday: Separation of metadata and data

For today’s edition of dataxu’s Technical Tuesday series, Dong Jiang, a Software Architect, dives into an in-depth exploration of the separation of metadata and data.

Separation of metadata and data: Building a cloud native warehouse, Part 3

This is part 3 of a series of blog posts on dataxu’s efforts to build out a cloud-native data warehouse and our learnings in that process. You can find part 1 here and part 2 here.

There have been many discussions in the industry on the separation of compute and storage. In this post, we will explore the less discussed separation of metadata and data in depth.

Here at dataxu, to say we handle a lot of data is an understatement. In our daily dealings with data and its metadata, we began to explore the topic of separating metadata from data, and its potential advantages.

Below, we will review how metadata is handled in a traditional database vs dataxu’s cloud warehouse, discuss design patterns and high-level implementation, and take an in-depth look at the benefits of decoupled metadata. But before we dig in, let’s start at the beginning—what exactly is metadata?

What is metadata?

In the most basic sense, metadata describes the structure of the data: the names of tables, columns, and partitions, their data types, and their ordering.

In almost all traditional databases, the metadata exists as a catalog. In Oracle, there are catalog views, such as USER, ALL, and DBA views. In Postgres, there are information_schema and pg_catalog. With such catalogs, it is always possible to find answers to the following questions:

What are the schemas in the database?

What are the tables and views in each individual schema?

What are the columns in each individual table (data_type, ordinal_position, nullability, etc.)?
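For example, in Postgres all three questions can be answered directly from information_schema (the schema and table names below are placeholders):

```sql
-- list the schemas in the database
select schema_name from information_schema.schemata;

-- list the tables and views in one schema
select table_name, table_type
from information_schema.tables
where table_schema = 'public';

-- list the columns of one table, with type, position, and nullability
select column_name, data_type, ordinal_position, is_nullable
from information_schema.columns
where table_schema = 'public' and table_name = 'test'
order by ordinal_position;
```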

In a cloud-based warehouse, like ours here at dataxu, the Hive Metastore provides similar functionality. Whether installed as a Hive component backed by an Aurora database or provided as a Glue Data Catalog, we can manipulate the catalog via Hive DDL. Beyond the obvious similarities and limitations (for example, Glue Data Catalog does not support views or UDFs yet), there is a significant distinction.
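As a quick sketch, the same catalog questions can be asked through Hive DDL (the table name is a placeholder):

```sql
-- list the tables registered in the catalog
show tables;

-- inspect columns, types, and the backing storage location of one table
describe formatted test;
```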

In traditional databases, the metadata and data are tightly coupled. You cannot have data without metadata. The pattern is “create-validate-insert”. Consider a typical database development process:

-- create metadata first
create table test (id int);
-- the following will succeed: the db engine will validate the incoming data
-- as integer and then insert, after which we have data
insert into test select 1;
-- the next statement fails: the db engine will validate the incoming data
-- and reject the string type
insert into test select 'foo';

However, in the cloud warehouse, the metadata and data are separated. Data can exist independently on S3. This means:

It is possible to have data but no metadata, i.e., no associated table definition in a data catalog. You might not think this is a very interesting use case: if you don’t have a definition, how can you query? However, this point proves useful in cases of data corruption handling. We will explore this use case in more depth below.
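To make this concrete, here is a hedged sketch (the bucket, table, and column names are hypothetical) of how metadata can be attached to, and detached from, data that already lives on S3:

```sql
-- the files already exist under s3://example-bucket/events/;
-- this statement only writes catalog metadata, it does not touch the data
create external table events (
  event_id   bigint,
  event_time timestamp
)
stored as parquet
location 's3://example-bucket/events/';

-- dropping an external table removes only the catalog entry;
-- the S3 data survives, once again as data without metadata
drop table events;
```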

It is possible to have data associated with one or more tables. This is interesting: a dataset can be defined as a table while at the same time being incorporated as a partition in another table. It all depends on the use case.
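A sketch of that pattern, again with hypothetical names: the same S3 prefix is registered both as a standalone table and as a single partition of a larger table.

```sql
-- define one day's dataset as a standalone table
create external table events_20180101 (
  event_id   bigint,
  event_time timestamp
)
stored as parquet
location 's3://example-bucket/events/dt=2018-01-01/';

-- and also attach the same files as one partition of a larger table
create external table events (
  event_id   bigint,
  event_time timestamp
)
partitioned by (dt string)
stored as parquet
location 's3://example-bucket/events/';

alter table events add partition (dt = '2018-01-01')
  location 's3://example-bucket/events/dt=2018-01-01/';
```

Both definitions are pure metadata; the Parquet files on S3 are never copied or moved.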

More commonly, data can be generated first, validated, and then made available in a data catalog to support additional processing.
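That “generate-validate-publish” flow can be sketched as follows, assuming a partitioned table like the hypothetical events table above already exists in the catalog:

```sql
-- 1. a job writes new files under s3://example-bucket/events/dt=2018-01-02/
-- 2. a validation step checks those files (row counts, schema, checksums)
-- 3. only after validation passes is the data published to the catalog:
alter table events add partition (dt = '2018-01-02')
  location 's3://example-bucket/events/dt=2018-01-02/';
```

Until step 3, the data exists but is invisible to catalog-driven consumers, which is exactly the inversion of the traditional “create-validate-insert” pattern.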