Big Data Debate: End Near For ETL?

Extract, transform and load processes are the backbone of data warehousing, but with Hadoop on the rise, some see a new way to transform data. Two experts share opposing views.

Extract, transform and load (ETL) processes have been the way to move and prepare data for analysis within data warehouses, but will the rise of Hadoop bring the end of ETL?

Many Hadoop advocates argue that this data-processing platform is an ideal place to handle data transformation, as it offers scalability and cost advantages over conventional ETL software and server infrastructure. Defenders of ETL argue that handling the transformation step on Hadoop does not do away with the need for extract and load; nor does it address data-quality and data-governance requirements information management professionals have been working on for decades.

In our debate, Phil Shelley, the chief technology officer at Sears Holdings and the CEO of its big data consulting and services offshoot, MetaScale, says we're witnessing the end of ETL. James Markarian, chief technology officer at information management vendor Informatica, says ETL is changing but will live on.

What's your view on this raging debate? Use the commenting tool below the article to challenge these experts and share your view.

For The Motion

Phil Shelley
CTO, Sears Holdings; CEO, MetaScale

ETL's Days Are Numbered

The foundation of any IT system is the data. Nothing of value can be done without generating, manipulating and consuming data. When we lived in the world of monolithic mainframe systems that were disconnected, data mostly stayed within that system and was consumed via screen or paper printout. Since that time ended, we live in a world of separate systems and interconnects between them. ETL (extract-transform-load) was born, and we began to copy and reuse data. Reuse of data rarely happens without some form of aggregation, transformation and re-loading into another system.

The growth of ETL has been alarming, as data volumes escalate year after year. Companies have significant investments in people, skills, software and hardware devoted to nothing but ETL. Some consider ETL a bottleneck in IT operations: by definition, data has to be moved, and reading from one system, copying over a network and writing to another all take time -- ever-growing blocks of time that add latency before the data can be used. ETL is expensive in terms of people, software licensing and hardware. It is also a non-value-added activity, as the data is unusable until it lands in the destination system.

So why do we still do ETL? Mostly because systems that generate data are not the ones that transform or consume data. How about changing all that, as it seems to make no sense that we spend time and money on non-value-added activities?

Well, historical systems were not large enough to cost-effectively store, transform, analyze and report or consume data in a single place. Times and technology change, of course, and since Hadoop came to the enterprise, we are beginning to see the end of ETL as we know it. This is not just an idea or a desire; it is really possible, and the evolution is underway.

With Hadoop as a data hub in an enterprise data architecture, we now have a cost-effective, extreme-performance environment to store, transform and consume data, without traditional ETL.

Here is how it works:

Systems generate data, just as they always have.

As near to real-time as possible, data is loaded into Hadoop -- yes, this is still "E" from traditional ETL, but that is where the similarity ends.

Now we can aggregate, sort, transform and analyze the data inside Hadoop. This is the "T" and the "L" from traditional ETL.

Data latency is reduced to minutes instead of hours because the data never leaves Hadoop. There is no network copying time, no licenses for ETL software and no additional ETL hardware.

Now the data can be consumed in place. A number of graphical analytics and reporting options can work against the data without moving large volumes out of Hadoop.

Some subsets of data do have to be moved out of Hadoop into other systems, for specific purposes. However, with a strong and coherent enterprise data architecture, this can be managed to be the exception.
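The transform step described above can be sketched as a Hadoop Streaming-style map and reduce. The sketch below simulates the shuffle-and-sort phase locally in Python; the store/sales record layout is a purely hypothetical example, not Sears' actual schema:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # "E" already happened: raw records were loaded into the cluster.
    # Emit (store_id, sales) pairs from a CSV record; the
    # store_id,item,sales layout is a hypothetical example.
    store_id, _item, sales = line.split(",")
    yield store_id, float(sales)

def reducer(store_id, values):
    # "T": aggregate in place -- total sales per store, no data movement.
    yield store_id, sum(values)

def run_job(lines):
    # Local stand-in for MapReduce's shuffle-and-sort phase.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return {key: next(reducer(key, [v for _, v in grp]))[1]
            for key, grp in groupby(pairs, key=itemgetter(0))}

raw = ["s1,widget,10.0", "s2,gadget,5.5", "s1,widget,2.5"]
totals = run_job(raw)  # {"s1": 12.5, "s2": 5.5}
```

On a real cluster the same mapper and reducer would run as streaming tasks over data already resident in HDFS, which is the point: the transformation happens where the data lives.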

So, ETL as we know it is gradually becoming the exception rather than the norm. This is a journey, not a binary change. But at Sears, and case by case at other companies, gradually but certainly, ETL is becoming history.

Phil Shelley is CTO at Sears Holdings, leading IT operations. He is also CEO of MetaScale, a Sears Holdings subsidiary that designs, delivers and operates Hadoop-based solutions for analytics, mainframe migration and massive-scale processing.

Against The Motion

James Markarian
CTO, Informatica

Don't Be Naive About Data Integrity

The stunning thing about the current buzz and questions heralding the end of ETL and even data warehousing is the lack of pushback and analysis of some of the outlandish comments made. The typical assertion is that "Hadoop eliminates the need for ETL."

What no one seems to question in response to these sorts of comments is the naive assumptions these statements are based on. Is it realistic for most companies to move all of their data into Hadoop? Given the need to continue to use information that currently exists in legacy environments, probably not. Even if you did move everything into Hadoop, a path that will take years, if not decades, for most companies with existing databases, you still have to manipulate the data once it is there.

So is writing ETL scripts in MapReduce code still ETL? Sure it is. Is running ETL faster (in some cases, and slower in other cases) on Hadoop eliminating ETL? No. Or is the introduction of Hadoop changing when, where and how ETL happens? Here the answer is definitely yes.

So the question isn't really whether we are eliminating ETL, but where ETL takes place and how we are extending or changing its definition. The "E" represents the ability to consistently and reliably extract data with high performance and minimal impact to the source system. The "T" represents the ability to transform one or more data sets in batch or real-time into a consumable format. The "L" stands for loading data into a persistent or virtual data store.
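As a minimal sketch of those three stages, here implemented against an in-memory SQLite database with a hypothetical orders schema (an illustration of the pattern, not any vendor's API):

```python
import sqlite3

def extract(conn):
    # "E": read from the source with minimal impact (a read-only query).
    return conn.execute("SELECT name, amount FROM orders").fetchall()

def transform(rows):
    # "T": standardize semantics and format -- trim and upper-case names,
    # round amounts to two decimals.
    return [(name.strip().upper(), round(amount, 2)) for name, amount in rows]

def load(conn, rows):
    # "L": persist the consumable result into the target store.
    conn.execute("CREATE TABLE IF NOT EXISTS orders_clean (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?)", rows)
    conn.commit()

# In-memory source and target databases with a hypothetical schema.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (name TEXT, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(" alice ", 10.25), ("bob", 3.5)])

tgt = sqlite3.connect(":memory:")
load(tgt, transform(extract(src)))
result = tgt.execute("SELECT name, amount FROM orders_clean ORDER BY name").fetchall()
# result: [("ALICE", 10.25), ("BOB", 3.5)]
```

Whether these stages run in an ETL server, inside the database, or as jobs on Hadoop, the three responsibilities remain.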

Wherever those steps run, information needs to be standardized, with regard to semantics, format and lexicon, for accurate analysis.

Operational results need to be consistent and repeatable.

Operational results need to be verifiable and transparent -- where did information come from, who touched it, who viewed it, what transformations and calculations were performed on it, what does it mean, etc.?
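One way to make results verifiable and transparent is to attach a provenance record to every transformation run. This minimal Python sketch shows the idea; the audit-log shape and the cents-to-dollars rule are illustrative assumptions, not a description of any product:

```python
import datetime

def transform_with_lineage(rows, fn, audit_log, user):
    # Make the "T" step verifiable: record who ran which operation,
    # on how many rows, and when (a minimal provenance entry).
    out = [fn(row) for row in rows]
    audit_log.append({
        "user": user,
        "operation": fn.__name__,
        "rows_in": len(rows),
        "rows_out": len(out),
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return out

def cents_to_dollars(row):
    # Hypothetical standardization rule: amounts arrive in cents.
    name, cents = row
    return (name, cents / 100)

log = []
clean = transform_with_lineage([("a", 1250), ("b", 99)],
                               cents_to_dollars, log, "etl_svc")
# clean: [("a", 12.5), ("b", 0.99)]; log now holds one provenance entry
```

The same record could answer the questions above: where did the information come from, who touched it, and what calculations were performed on it.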

What we ordinarily hear regarding new big data environments is that the data appears by some form of osmosis. We want every last bit of it for new insights, and don't worry about semantics and terminology -- those discrepancies just make the results more interesting. These kinds of dreamy aspirations are seductive but deceptive. They are also just the start of a path toward relearning all the reasons why data practitioners developed best practices around accessing data, profiling data, discovering relationships, handling metadata, explaining context, transforming data, cleansing data, governing data for compliance and delivering information at various latencies using current-generation integration technologies.

Modern data integration tools and platforms ensure timely, trusted, relevant, secure and authoritative data. Modern integration technologies use optimizers to process information in both scale-up and scale-out architectures, push processing into database management systems, and push processing -- not just data -- into Hadoop. They broker and publish a data layer that abstracts processing such that multiple applications can consume and benefit from secure and curated datasets.

ETL no doubt needs to continue to evolve and adapt to developer preferences and the performance, scale and latency needs of modern applications. Hadoop is just another engine upon which ETL and its associated technologies (like data quality and data profiling) can run. Renaming what is commonly referred to as ETL, or worse, ignorantly dismissing data challenges and enterprise-wide data needs, is just irresponsible.

James Markarian is executive VP and CTO at Informatica with responsibility for the strategic direction of Informatica products and platforms. He also runs the corporate development group, including acquisitions.

Further to this thread, and positioned between/before Hadoop and the traditional DW, is IRI CoSort (www.iri.com), which doesn't rely on more machines or big ETL package costs. It combines transforms in the file system on huge file and relational sources, and addresses semi-structured and unstructured data. Using an Eclipse front end, it also addresses the issues Informatica's CTO correctly identifies. ELT isn't the way to go either, since big transforms tax the DB (and thus query response) or require a costly appliance (which, like Hadoop, throws hardware at a software problem). The benefits of a paradigm shift to a new IT fabric are not always worth the risk.

In "Hadoop is not a Data Integration Solution," I will describe the gaps between Hadoop and proper data integration. To be sure, there are many, many gaps in Hadoop when compared to a traditional data integration solution. But what is it about the Hadoop infrastructure that is attracting such interest despite these significant gaps? There is a reason Sears has made the decisions it has. There is a reason why many more organizations are aggressively pushing forward to integrate data in Hadoop despite Hadoop's functional gaps.

In the era of Big Data, Hadoop's architecture is fundamentally superior for supporting many of the most commonly deployed data integration functions. First and foremost, it can deliver the scale and compute capabilities required to support the information the business demands at a cost that is sustainable. For this reason, organizations are flocking to Hadoop even if key functional capabilities must be written by hand today. Hadoop makes it easy to scale computing power horizontally with low-cost components. This architectural benefit is absolutely core to successfully performing the large-scale ETL required for processing Big Data. Hadoop's ability to persist data -- lots of it, in any format -- is a new architectural component long missing from traditional data integration platforms. More importantly, this architecture looks like it will also support a broader range of data integration functions.

The compute and analysis capabilities of the Hadoop architecture support the requirements of data profiling and data quality. In many ways, data profiling and quality are Big Data problems, particularly with today's growing data sets. We are testing this in our ETL solution at NSFAS: why profile a sample when you have the entire dataset? The ability to support metadata seems obvious, and while HCatalog is immature, it is evolving. Witness the introduction of Navigator 1.0 in Cloudera's 4.2 release, which provides basic data governance capabilities. Not only does the core architecture support advanced data integration functionality, but it also offers a superior framework to do so, enabling vendors to deliver these features at a rapid pace.

The main problem Big Data creates is an architectural one, not a functional one. Perhaps it is fair to say that today, Hadoop is not a data integration solution.

Most IT teams have their conventional databases covered in terms of security and business continuity. But as we enter the era of big data, Hadoop, and NoSQL, protection schemes need to evolve. In fact, big data could drive the next big security strategy shift.

Why should big data be more difficult to secure? In a word, variety. But the business won't wait: it wants to use big data to predict customer behavior, find correlations across disparate data sources, predict fraud or financial risk, and more.