Why is Data Ingestion Difficult?

February 18, 2019

Author: Andras Nemeth

Building an automated data ingestion system seems like a very simple task. You just read the data from some source system and write it to the destination system. You run this same process every day. And voila, you are done.

Better yet, there must exist some good frameworks which make this even simpler, without even writing any code. You just make a few clicks, enter the name of the source, tell the system where to write and you can move on with your life.

Unfortunately, the reality is very different. I have never seen a well-performing, real-life data ingestion system that didn’t require a significant amount of expert effort to build.

But why? What makes ingestion so hard?

The aim of this document is to explain some aspects of building a robust ingestion system that are genuinely, and understandably, hard.

Problem 1. You are just like everyone else: different

One of the biggest misconceptions about ingestion is that it’s always the same: you just take the data from A and move it to B. It’s almost never that simple. Every time, there is a significant deviation from this simple default case. Here are some examples of “innocent” requirements from previous clients:

Some fields need to be masked for data protection before ingestion.

Some fields need to be hashed in an identity preserving way.

Oh, look, there are JSON values in one of the columns. Could we quickly expand the contents into separate (hierarchical) columns?

We can only ingest the data about the users listed in this whitelist table. Oh and wait, by the way the id space is different in the whitelist than the id space of some of the tables we need ingested. But don’t worry. Just use these other three tables joined in the right way, and you have your mapping!

Of course, the usage pattern for the destination will be dramatically different from that of the source (otherwise why would we want to ingest in the first place?), so let’s rethink partitioning, indexing, necessary freshness, consistency, etc.

Now that I think about it, with these big data technologies a de-normalized structure would work so much better than the normalized one that is standard in RDBMS stacks.

As you are ingesting this data anyways, would you mind adding these 100 derived fields to make data science work simpler/faster?

A few aggregated tables here and there would speed up certain queries a lot. Could you?

Do you want to read data in a huge batch from this source system? Are you crazy? This will totally break the DB (and kill our operations), break down the network, supercharge our routers or something else. No way. Please only:

read my DB between 1 am and 5 am at night

only read at most 10 MB/sec

only run at most 3 queries concurrently

…

And these are just the requirements I could recall after 10 minutes of thinking, all of which I have actually seen requested by real-life clients.

And don’t assume these are just some crazy whims. All of these are actually super reasonable, critical requirements that need to be met, otherwise the whole exercise makes no sense, as the result is unusable for one reason or another.
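To give a feel for how quickly even the first three “innocent” requests turn into real logic, here is a minimal Python sketch of masking, identity-preserving hashing and JSON expansion. Everything in it is an assumption made for illustration: the column names (`phone`, `user_id`, `attrs`), the key, and the choice of a keyed SHA-256 hash are hypothetical and do not describe any particular client system.

```python
import hashlib
import hmac
import json

# Hypothetical key for illustration; a real one would come from a secrets store.
SECRET_KEY = b"example-key-not-for-production"


def mask(value: str, visible: int = 2) -> str:
    """Hide all but the last `visible` characters of a sensitive value."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def stable_hash(value: str) -> str:
    """Keyed hash: equal inputs give equal outputs, so joins on the field still work."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def expand_json(row: dict, column: str) -> dict:
    """Replace a JSON string column with flattened `column.key` columns."""
    out = {k: v for k, v in row.items() if k != column}

    def walk(prefix, node):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(f"{prefix}.{key}", child)
        else:
            out[prefix] = node

    walk(column, json.loads(row[column]))
    return out


def transform(row: dict) -> dict:
    """Apply all three rules to one record (column names are assumptions)."""
    row = dict(row)  # do not mutate the caller's record
    row["phone"] = mask(row["phone"])
    row["user_id"] = stable_hash(row["user_id"])
    return expand_json(row, "attrs")
```

Each rule is a few lines, but each one also hides decisions (how many characters stay visible, who owns the key, how nested keys are named) that a generic click-through tool would have to guess.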

There can be no single framework which can cater to all these different needs in a simple and foolproof way. So any ingestion framework is necessarily either:

Powerless. While it may impress in demos with its simplicity and speed of setup, it will fail miserably when applied to a real-world problem. OR

Complex. If a framework is powerful enough to handle all the above then it will necessarily be very complex, needing an expert to navigate all the possible options and landmines that come with being less restrictive.

Problem 2. The only constant is change

There is no such thing as a “ready” ingestion system. There is always a need for a few more tables; data sources change, complete subsystems migrate to new vendors, and use cases evolve.

A good ingestion system is one designed for change.

And change is hard. Interestingly, not just for people, but also for computers. When you change one corner of an IT system you can always be sure of the existence of a fantastically subtle side effect which destroys the opposite corner of your system. Downstream systems will crumble due to some unexpected schema differences. Upstream systems will die from a butterfly effect from a small change in load patterns.
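One way to catch unexpected schema differences before a downstream system does is an explicit check that runs on every ingestion. The sketch below is only an illustration: the function name `schema_drift` and the representation of a schema as a plain column-to-type dictionary are assumptions, not part of any specific tool.

```python
from typing import Dict, List


def schema_drift(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """List human-readable differences between the expected and the actual schema."""
    problems = []
    for column, column_type in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != column_type:
            problems.append(f"type changed: {column} {column_type} -> {actual[column]}")
    for column in actual:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems
```

A pipeline could refuse to write to the destination whenever this list is non-empty, turning a silent downstream failure into a loud, early one.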

To manage change in IT well, you need at least the following things:

Good, comprehensive testing, so most problems surface during development time

Strong ability to know exactly what version of the system was running successfully and which version failed

Being able to precisely and comprehensively understand the differences between two versions of the system

Ability to know who made which changes and for what reason

Any system dependent on configuration-by-mouse-clicking on a UI is notoriously bad at all these points. Someone will go to the UI, change something innocent and you won’t even know what hit you when the whole system melts.
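By contrast, when the configuration lives in text, comparing two versions is a solved problem. The following sketch uses Python's standard `difflib` and `json` modules; the config keys shown in the usage note are hypothetical, and in practice a version control system would produce the same kind of diff for you.

```python
import difflib
import json
from typing import List


def render(config: dict) -> List[str]:
    """Serialize a config deterministically so that diffs are stable."""
    return json.dumps(config, indent=2, sort_keys=True).splitlines()


def config_diff(old: dict, new: dict) -> List[str]:
    """Return a unified diff between two versions of an ingestion config."""
    return list(
        difflib.unified_diff(
            render(old), render(new), fromfile="v1", tofile="v2", lineterm=""
        )
    )
```

Calling `config_diff({"table": "orders", "max_concurrent_queries": 3}, {"table": "orders", "max_concurrent_queries": 5})` yields a diff whose changed lines point at exactly the one setting that differs, which is the kind of answer a click-configured system rarely gives.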

Problem 3. Don’t ever do the same thing twice!

It may seem that doing a few tens of clicks on a UI is so much easier than writing some code to do the same job.

But the equation changes completely when you have to do almost the same tens of clicks a hundred times for a hundred different tables. If you have code, then, with a bit of oversimplification, all it takes is a loop calling that same code with different parameters for all 100 tables. On a UI, there is no escaping: you will have to do the same thing over and over again. It’s not only frustrating, it’s also very error-prone. No human can do the exact same steps exactly the same way a hundred times. A computer, on the other hand, excels at this. 🙂

But say you have some superhuman employee with an unusually high tolerance for monotony and he manages to set up the 100 tables for you without an error. The real mayhem is yet to come now! Try innocently telling your guy that you would like to change the ingestion logic just a tiny bit – the same way for all the 100 tables. This is a trivial issue if you use code. It is almost unmanageable on a UI.
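The loop-over-tables idea can be sketched in a few lines of Python. All names here (`TableJob`, `make_jobs`, the `crm.` and `lake.` prefixes, the `ingest_date` partition column) are made up for illustration, and the actual copy operation is passed in as a function because it depends entirely on the source and destination systems.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TableJob:
    source: str
    destination: str
    partition_by: str


def make_jobs(tables: List[str]) -> List[TableJob]:
    """Derive one job per table from a naming convention (hypothetical names)."""
    return [
        TableJob(source=f"crm.{t}", destination=f"lake.{t}", partition_by="ingest_date")
        for t in tables
    ]


def ingest(job: TableJob, copy: Callable[[str, str, str], None]) -> None:
    """The single place where the per-table ingestion logic lives."""
    copy(job.source, job.destination, job.partition_by)


def ingest_all(jobs: List[TableJob], copy: Callable[[str, str, str], None]) -> None:
    """Run the same logic for every table."""
    for job in jobs:
        ingest(job, copy)
```

Changing the logic “just a tiny bit” for all 100 tables then means editing `ingest` once; the loop takes care of the rest.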

To be fair, it is possible that a UI supports some sort of templating mechanism to handle such issues. But most of the time they don’t: that would hurt their claim to fame, simplicity. And even if they do, the power of any custom-built templating system can never be a match for the well-established abstraction mechanisms included in any decent modern programming language.

Problem 4. Anything that can go wrong will go wrong

Ingestion systems are no different. They break.

The question is not whether your ingestion system will break. The question is what happens next.

Imagine you have a monolithic, opaque ingestion framework optimized exactly to hide as much of its complexity from its users as possible. Now imagine that you have to figure out what’s going on when (not if!) it breaks down. It is like trying to diagnose a blue screen of death as an ordinary Windows user.

On the other hand, if you build your system using well understood primitive steps combined in well understood ways (i.e., a computer program written in a programming language) then you can dig down into the details and zero in on what went wrong much more easily.
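As a rough illustration of what “well understood primitive steps” can buy you, here is a small Python sketch of a pipeline made of named steps, where any failure is reported together with the step that caused it. The names `run_pipeline` and `StepFailed` are invented for this example; a real system would likely add retries, metrics and so on.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("ingestion")

# A step is a (name, function) pair; each function takes and returns the data.
Step = Tuple[str, Callable[[Any], Any]]


class StepFailed(Exception):
    """Raised when a step fails; records which step it was and why."""

    def __init__(self, step_name: str, cause: Exception):
        super().__init__(f"step '{step_name}' failed: {cause!r}")
        self.step_name = step_name
        self.cause = cause


def run_pipeline(steps: List[Step], data: Any) -> Any:
    """Run the steps in order, passing each step's output to the next one."""
    for name, fn in steps:
        logger.info("starting step %s", name)
        try:
            data = fn(data)
        except Exception as exc:
            # Log the full traceback, then re-raise with the step name attached.
            logger.exception("step %s failed", name)
            raise StepFailed(name, exc) from exc
    return data
```

When something breaks, the error names the failing step and keeps the original exception, so the investigation starts at the right place instead of at a generic “job failed” message.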

The solution: good code by good engineers

The good news is that all the above problems have well understood solutions, or at least mitigations, in the discipline of computer programming.

Programming languages are extremely expressive and flexible: you can express whatever crazy requirement you may come across.

Software development processes are pretty refined with lots of well established best practices for managing change. Rigorous version control and testing are great ways to solve issues mentioned above. You can follow what changed exactly, when and by whom. The change log also gives you an idea about the motivation of the change. But all this requires a well defined but human readable textual representation of your ingestion system: its code.

Programming languages also provide very rich abstraction techniques. You can make sure that no code (and effort) is duplicated and that sweeping changes are easy to do.

Finally, there are well established ways for debugging programs. A good engineer can see fairly quickly which part of the program failed and can test hypotheses on what went wrong.

But of course, there is a downside. Writing high quality code is not easy. It’s not just a few clicks with a mouse. To reap the above benefits, you need skilled and experienced engineers.

One final note. I am not arguing for re-implementing each new ingestion system fully from scratch. One can (and should) reuse various existing libraries and tools for particular sub-tasks of an ingestion system. There is a plethora of good quality open source tools out there which can solve some well defined parts of the problem. But they need to be well understood, transparent technologies, and they should be glued together by some custom code which has the ultimate control over what happens in the system.

So what are your experiences and challenges with data ingestion? Let us know in the comment section below.

(This article expresses the author’s personal opinion and does not represent the views held by Lynx Analytics Pte Ltd.)