ETL: The Most Important Acronym You've Never Heard Of

"Data-Driven Thinking" is written by members of the media community and contains fresh ideas on the digital revolution in media.

Today’s column is written by Mike Driscoll, CEO and founder of Metamarkets.

Data is the fuel and the exhaust of programmatic advertising. It informs every transaction, and every transaction generates more of it. As impression volumes rise into the trillions across all manner of devices, the focus of many ad tech engineering teams isn't on ethereal machine learning algorithms, but on something far less glamorous.

The process is called ETL — the critical, painstaking work of cleansing and consolidating disparate datasets. As the worlds of marketing and enterprise software collide, ETL could be the most important acronym you’ve never heard of.

ETL stands for extract, transform and load — and it’s a truism among data scientists that it takes up about 80% of our time, leaving just 20% for analysis. Having built big data platforms in pharma, banking and now in digital media, I believe this ratio is near universal.

Underinvestment in, and misunderstanding of, ETL is responsible for a huge amount of organizational pain and inefficiency. It's why data is so often delayed, why so many executives are unhappy with the quality of reporting and why more than 50% of corporate business intelligence initiatives fail.

ETL is hard because data is messy. There is no such thing as clean data, and even the most common attributes have a dizzying array of acceptable formats: "Sat Jan 22 10:37:13 PST," "2014-01-22T18:37:13.0+0000" and "1323599850" all denote the same time. Add to this a growing variety of data, such as geocoordinates, buyer names, seller URLs, device IDs, campaign strings, country codes and currencies. Each new source adds a layer of bricks to our collective tower of Babel.
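To see what the transform step actually looks like, here is a minimal Python sketch of my own (an illustration, not a production pipeline) that normalizes a few of these timestamp formats to UTC ISO 8601. Named zones such as "PST" are side-stepped here, since handling them reliably needs an explicit timezone table or a library like dateutil.

```python
# A minimal sketch (not production code) of the "T" in ETL: coercing a few
# common timestamp formats into a single canonical form, UTC ISO 8601.
from datetime import datetime, timezone

def normalize_timestamp(raw: str) -> str:
    """Return the timestamp as a UTC ISO 8601 string."""
    # Unix epoch seconds, e.g. "1323599850"
    if raw.isdigit():
        return datetime.fromtimestamp(int(raw), tz=timezone.utc).isoformat()

    candidate_formats = [
        "%Y-%m-%dT%H:%M:%S.%f%z",   # e.g. "2014-01-22T18:37:13.0+0000"
        "%a %b %d %H:%M:%S %Y %z",  # e.g. "Sat Jan 22 10:37:13 2014 -0800"
    ]
    for fmt in candidate_formats:
        try:
            return datetime.strptime(raw, fmt).astimezone(timezone.utc).isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")

print(normalize_timestamp("1323599850"))
print(normalize_timestamp("2014-01-22T18:37:13.0+0000"))
```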

It's no wonder that an agency CIO recently confessed to me that he'd spent tens of millions of dollars a year on the reliable, repeatable transformation of data. Having spent much of my own career wrestling ETL's demons, I can offer five ways to keep them at bay:

1. Stay Close To The Source

Journalists know that when it comes to getting the facts, it's best to go directly to the primary source and it's best to break news first. The same is true for ETL. The closer you are to the data source, the fewer the transformations and steps, and the lower the likelihood that something will break. The best ETL pipelines resemble tributaries feeding rivers, not bridges connecting islands. Also, the closer you are to the source, the faster you can optimize your approach, which in this space can pay huge dividends.

2. Avoid Processed Data

Just like food, data is best when it's minimally processed. In order to handle huge quantities of data, one common approach for ETL pipelines is to downsample it indiscriminately. Many programmatic buyers will examine, for example, a 1% feed of bid requests coming off of a particular marketplace.

In an era when bandwidth is cheap and computing resources are vast, sampling data is a throwback to the punchcard era — and worse, it waters down insights. Audience metrics like frequency and reach can become impossible to recover once a data stream has been put through the shredder. Sampling is why audience segments can resemble sausage — no one knows what’s inside.
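A toy simulation makes the point concrete (the user counts and impression distribution below are purely illustrative): scale up the unique users seen in a 1% impression sample and you badly misstate both reach and frequency.

```python
# Toy simulation (illustrative numbers only): why per-user reach and
# frequency are hard to recover from a 1% impression sample.
import random

random.seed(42)

# 100,000 users, each generating a small, skewed number of impressions.
impressions = []
for user_id in range(100_000):
    for _ in range(random.choice([1, 1, 1, 2, 3, 8])):
        impressions.append(user_id)

sample = [u for u in impressions if random.random() < 0.01]  # a 1% feed

true_reach = len(set(impressions))
naive_reach = len(set(sample)) * 100  # naive scale-up of sampled uniques

print(f"true reach:              {true_reach}")
print(f"naive 1% reach estimate: {naive_reach}")
print(f"true avg frequency:      {len(impressions) / true_reach:.2f}")
print(f"sampled avg frequency:   {len(sample) / len(set(sample)):.2f}")
```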

3. Embrace (And Enforce) Standards

In the early days of the railroads, as many as a dozen distinct track gauges, ranging in width between the inside rails from 2 feet to nearly 10 feet, had proliferated across North America, Europe, Africa and Asia. Because non-interoperable trains and carriages made continuous transport across regions so difficult, a standard width was eventually adopted at the suggestion of a British civil engineer named George Stephenson. Today, approximately 60% of the world's lines use this gauge.

Our programmatic vertical has its own George Stephensons: CTOs and chief scientists like Jim Butler and Neal Richter, whom you can find late at night debating specifications for OpenRTB protocols on developer lists. Just as with the railroads two centuries before, embracing and enforcing standards will catalyze faster growth in programmatic advertising through increased interoperability.
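For readers who haven't seen those specs, here is a deliberately simplified, OpenRTB-style bid request (an illustrative sketch only; the actual OpenRTB documents define the required and optional fields, and the IDs below are made up):

```python
# An illustrative, pared-down OpenRTB-style bid request expressed as a
# Python dict. Field names follow the OpenRTB 2.x convention, but this is
# a sketch, not a spec-complete request.
import json

bid_request = {
    "id": "80ce30c53c16e6ede735f123ef6e3236",   # hypothetical auction ID
    "at": 2,                                     # second-price auction
    "tmax": 120,                                 # max response time, in ms
    "imp": [
        {
            "id": "1",
            "banner": {"w": 300, "h": 250},
            "bidfloor": 0.05,
        }
    ],
    "site": {"domain": "example.com", "page": "https://example.com/article"},
    "device": {"ua": "Mozilla/5.0 ...", "ip": "203.0.113.42"},
    "user": {"id": "55816b39711f9b5acf3b90e3"},  # hypothetical user ID
}

print(json.dumps(bid_request, indent=2))
```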

4. Put Business Questions First (Don’t Let Data Wag The Dog)

Too many organizations, upon recognizing that they've got data challenges, decide to undertake a grand data-unification project. Noble in their intentions, cheered by vendors and engineers alike, these efforts seek to funnel every source of data in the organization into a massive central platform. The implicit assumption is that "once we have all the data, we can answer any question we'd like." This approach is doomed to fail because there is always more data than one realizes, and the choices around what data to collect and how to structure it can only be made by putting business questions first.

ETL is hard, and building pipelines laborious, so avoid building bridges to places that no business inquiry will ever visit.

5. Avoid ETL Where You Can

While for some organizational processes there's no avoiding working with the nuts and bolts of data, for others it may be possible to get out of the data handling business entirely. Take, for example, the handling of email or digital documents: For years, IT departments suffered through the management and occasional migration of these assets. Today, however, cloud offerings, such as those from Google and Box, make this someone else's problem, freeing up our businesses to specialize in what we do best.

7 Comments

Interesting article, thanks for publishing it. I'd be interested in your views of ETL on Hadoop. I've found that, while Hadoop users don't call it ETL or even know they are doing ETL, the number one use case for Hadoop is ETL. Aggregating web logs, joining large data sets, cleansing data, etc. All ETL, but the users will call it data refinement, data processing, etc. Is this your experience as well?

Keith - Hadoop and ETL are not only related; I would argue that Hadoop's raison d'être is performing ETL at scale.

Most commonly, data pipelines begin with unstructured log data, collected from a variety of sources, from which features are extracted and transformed (the ET in ETL) with Hadoop, then loaded into a data warehouse such as Vertica. These transformations typically dramatically reduce the size of the data -- through aggregation, selective filtering, and/or sampling.
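As a rough sketch of that batch ET step (the paths, field names and aggregation below are all hypothetical, and PySpark is used for brevity rather than raw MapReduce):

```python
# Hypothetical batch ET step: parse raw JSON impression logs, extract a few
# fields, aggregate, and write a much smaller summary for loading into a
# warehouse such as Vertica. All paths and field names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("impression-etl-sketch").getOrCreate()

raw = spark.read.json("hdfs:///logs/impressions/2014-01-22/*.json.gz")

summary = (
    raw.select("campaign_id", "country", "price")         # extract
       .withColumn("price_usd", F.col("price") / 1000.0)  # transform (hypothetical unit conversion)
       .groupBy("campaign_id", "country")
       .agg(F.count("*").alias("impressions"),
            F.sum("price_usd").alias("spend_usd"))        # aggregate
)

# "Load": write the compact summary; a real pipeline would bulk-load it
# into the warehouse from here.
summary.write.mode("overwrite").parquet("hdfs:///warehouse/impression_summary/")
```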

Hadoop is powerful, but because its batch ETL cycles typically refresh only a few times per day, many companies are investing in real-time ETL. Spark and Storm are just a couple of the new open-source frameworks that developers are turning to -- at places like Yahoo and Twitter -- to power always-on, continuously up-to-date data pipelines. (At Metamarkets, we use both Hadoop and Storm for ETL pipelines that process over a trillion events a month.)

Mike-
I disagree with your sampling comment. I understand that computing and storage are way cheaper than they used to be, but it can still be very challenging to examine large sets of data. If sampling will produce the same results in seconds vs. hours, that is highly preferable to exact numbers. A good implementation of this is http://blinkdb.org/. If I want to know the bid request count by geography, why do I need exact numbers? There are a vast number of use cases where sampling would be preferred over exact numbers.

You're right that I may have come down a bit harsh on sampling: it certainly has its place in the toolkit of data scientists; Knuth's reservoir sampling algorithm is both beautiful and useful. Sam Madden and Ion Stoica, two of the academics behind the BlinkDB project you cite, are some of the smartest folks in database theory. But in my real-world experience, I've rarely found a sampling approach that didn't backfire on me at some point.
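For the curious, a short sketch of the reservoir sampling idea (Algorithm R): keep a uniform random sample of size k from a stream without knowing its length in advance.

```python
# Reservoir sampling (Algorithm R): maintain a uniform random sample of
# size k from a stream of unknown length, in a single pass.
import random

def reservoir_sample(stream, k, rng=random):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```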

That said, there's another approach that I didn't mention, one that offers the speed advantages of sampling while avoiding its pitfalls: summarization. In your example, it's true that there's no need to scan all the data if you're only interested in per-country summaries. Many of the leading analytical databases -- like Vertica, ParAccel, and our own Druid -- store materialized, but precise, summaries of data that can be retrieved quickly.
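A toy version of that summarization idea (my sketch, not how Druid actually implements it): roll raw events up into an exact per-country summary once, then answer per-country questions from the summary instead of re-scanning raw data.

```python
# Toy summarization: pre-aggregate bid requests by country once, then answer
# per-country questions from the exact, much smaller summary. The events
# below are made up; a real system would materialize this at ingest time.
from collections import Counter

bid_requests = [
    {"country": "US", "price": 1.20},
    {"country": "US", "price": 0.80},
    {"country": "DE", "price": 0.95},
    {"country": "BR", "price": 0.40},
]

# Materialize the summary once.
requests_by_country = Counter(r["country"] for r in bid_requests)

# Queries hit the summary, not the raw events: exact counts, no sampling.
print(requests_by_country["US"])  # 2
```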

Thank you for pointing out the huge and little-discussed "gotcha" in the fascination with so-called Big Data and First Party Data. For my own company's dynamic ad platform, a huge share of resources in terms of human effort, software, and processing power goes towards ETL (upwards of 50%, I'd say).

One of our favorite data sources is Google Product Feeds (powered by Google Merchant Center). When retail advertisers are using Google's spec as their product XML feed "template," it's much easier to migrate and interpret data than when advertisers have their own home-grown XML formats, which tend to be unwieldy and not very well thought out.
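As a rough illustration (a sketch of mine; the tag names follow the Merchant Center RSS convention, but real feeds should be checked against the current feed specification), pulling fields out of a well-formed feed takes a few lines, whereas every home-grown XML format needs its own bespoke mapping.

```python
# Sketch: pulling a few fields out of a Google Merchant Center style product
# feed (RSS 2.0 with the "g:" namespace). The sample feed below is made up.
import xml.etree.ElementTree as ET

G = "{http://base.google.com/ns/1.0}"  # Google base namespace

feed = """<?xml version="1.0"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
  <channel>
    <item>
      <g:id>SKU-123</g:id>
      <title>Blue Running Shoe</title>
      <g:price>59.99 USD</g:price>
      <g:availability>in stock</g:availability>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(feed)
for item in root.iter("item"):
    print(item.findtext(f"{G}id"),
          item.findtext("title"),
          item.findtext(f"{G}price"))
```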

I would imagine that in the quest to unlock First Party Data and make it actionable, poor data formatting is perhaps the top barrier, since most ad vendors have neither the ability nor the inclination to provide robust ETL services and untangle an advertiser's messy data. It shuts down campaigns in the planning stages.

Myles - I couldn't agree with you more regarding the unseemly mess that awaits advertisers seeking to integrate their own first party data into programmatic campaigns. CMOs are in for a world of hurt as they begin to automate data collection from a hairball of enterprise software systems.

Per your own experience, though, I predict that rather than embarking on a losing battle to reform old systems and make them produce clean data exhaust, smart CMOs will aim to replace them. Increasingly, a selling point for software won't be the quality of functions it provides (in an e-commerce server, for example), but the interoperability of data feeds it produces.

This is the first time since the Big Data hype started that I've seen Hadoop, Storm et al categorised into what they really are - ETL systems. Your article will make most enterprise data administrators happy, since I'm sure they have been struggling to understand what the difference between ETL and Hadoop et al is. This article deserves to receive wider coverage.

In my machine learning projects, I can agree that approximately 80% of the effort goes to ETL.