Under Air - Data Processing In-The-Box, an "ELT" Journey

For a brief history of why ELT (that is, in-box-data-processing) is even a topic of discussion, we must recognize that the high-powered appliances such as Netezza have not only made such implementation viable, but even desirable.

Just so we level-set on what one of these looks like, it's a SQL statement. Usually an insert/select but can also include updates and the like. Many of you recognize these as multi-stage operations inside a stored procedure. The sentiment of course is that such an animal can perform better inside-the-machine than to take it out of the machine (through an ETL tool), process it, and put it back. This may be true for smaller data sets, but you aren't reading this because you have smaller data sets, are you? Netezza users are big-game hunters. Pea-shooters are for small animals, but if we want big game, we'd better bring a bigger game with us, put on a bigger game face, and bring bigger equipment.

But is it really the size of the equipment? Netezza users know that the size and the architecture are keys to success. Dog-piling a bunch of CPUs together does not a power-frame make.

Okay, back to the storyline here - in-the-box SQL-transforms, in a traditional RDBMS platform, are the realm of small game. Once the game gets bigger, these transforms degrade in performance, and rather rapidly. Watch as a swarm of engineers tries to reconstruct and rebuild the procedures, the SQL, the tables and even upgrade the hardware in a futile attempt to achieve incrementally more power. Emphasis on incrementally-more. Not linearly more.

As they grow weary of this battle, the ETL tools start looking better and better. We reach a critical mass, and the ELT-in-the-box is summarily forsaken as we stand up the ETL tool.

Sometime after this transition, the data volumes once again overwhelm the environment. One thing leads to another, and one day the Netezza machine arrives on the floor. Hopes are high.

But notice the transition above - ELT was forsaken for ETL, likely never to return. But wait, now we really have the power to do the ELT SQL-transforms, but we've mentally and perhaps emotionally (yeah, verily even politically) moved away from SQL-transforms.

Some reading this might scratch their heads and think, What's he talking about? We've been doing ELT in the machines using (PLSQL, etc, name your approach here) for many years. Why would we shy away from it?

Several reasons:

Hand-crafted SQL statements. They are easy to deploy but difficult to maintain and re-deploy. The moment we have to change the data model, this becomes painfully obvious.

Proliferation of one-off operations, in cut-and-paste form, increasing inefficiency and complexity of the overall implementation

Lack of overarching control - stored procs run and go "black" for an hour or more without any clue as to their status

They don't scale in a traditional (SMP-based) RDBMS - the larger the data gets, the more the implementation will suffer

ETL tools don't support them in the quantity and logistical quality we really need, but they are getting there.

Why not use an ETL Tool? I mean, they handle push-down SQL-generation right?

I can perhaps summarize this situation in a single conversation I had with a ETL tool vendor who was hawking the capability for his own tool, and after showing me the mechanics of making one of these little jewels operate with aplomb, I asked him, "So what about doing a few hundred of these in a sequence, or branching them into multiple sequences?" The vendor rep looked almost hurt. "Why would you want to do that?"

Well, if we're really talking about migrating transforms into the machine, this can grow into hundreds of operations rather quickly. This situation apparently overwhelms the logistical capability of even the most powerful ETL tool. But I have hope that they will solve this situation. Eventually.

I am not holding out hope that it will happen soon, or voluntarily. These tool vendors have invested millions in the performance boosting of their own products and will not likely toss this investment on the flames of the appliance movement, even if said flames are the exhaust flames of the appliance's rocket engines. This is why I say "eventually". They don't really have a marketable reason to embrace this approach.

Another problem of the ETL tool is that it is so divorced from the appliance's infrastructure that it cannot control the cross-environment logistics. This is especially true of the "virtual transaction" - that is - multiple flows arriving in multiple tables that are each in context of one another, yet are individually shared-nothing operations. If one of the flows should fail, how do we back out of this? Can we do a rollback of the tables where their flows succeeded? No we cannot. We could certainly implement an approach, but this is neither inherent nor intrinsic to the ETL tool. We need a shared-nothing virtual transaction that will control all of the flow in a common context, commit them in that context and faithfully rollback the results in that same context. ETL tools don't go there. Unless we implement the tool that way. As an application. Once implemented, how do we reuse this for the next application, and the next? We can see that it's not part of the tool itself.

If next-generation "ELT" scenarios are to be successful, they need several very important capabilities that are simply non-optional and non-trivial:

Catalog-aware - The entire application must be coupled to the catalog definitons so that changes in definitions will automatically drive the resolution of their impact. We don't want to go on a submarine hunt to find every place where a particular column is used. The tool should already know.

Logistical control - the ELT management needs to be able to support hundreds of transforms with arbitrary branches and synchronization

SQL-generation - from templates that are tied to the catalog - If the table changes, we can propagate the changed automatically into the transforms.

Proper handling of shared-nothing operations into the target, including virtual transactions that we can replicate to other databases, or rollback completely.

Treat internal appliance databases as "waypoints" rather than "endpoints". That is, multiple databases are in play and used for various roles, such as intake, intermediate workspace, and persistence. These resources should be configurable and relatively transparent to the tool.

Support of major patterns such as CDC, Dedupe, column-level scrubs and transforms, masking, column remapping, soft-delete, slowly-changing-dimensions and of course, analytic functions etc.

I am sure the visitors of this blog have even more aspects of a "wish list" that they have implemented (perhaps painfully so) and want more control over the data, its processing logistics and error control and recovery. Feel free to add your own comments and suggestions here.