Thursday, November 15, 2012

Why Sears Is Going All-In On Hadoop

"Why Sears Is Going All-In On Hadoop" is an interesting, if rose-coloured, view of Hadoop from Phil
Shelley, CTO at Sears. Note that he also leads a Sears subsidiary called
MetaScale, which offers Big Data architecture, consulting and services
to companies outside the retail space.

A few choice quotes:

Moving up the stack, Sears is consolidating its
databases to MySQL, InfoBright, and Teradata--EMC Greenplum, Microsoft SQL
Server, and Oracle (including four Exadata boxes) are on their way out, Shelley
says.

"The Holy Grail in data warehousing has
always been to have all your data in one place so you can do big models on
large data sets, but that hasn't been feasible either economically or in terms
of technical capabilities," Shelley says, noting that Sears previously
kept data anywhere from 90 days to two years. "With Hadoop we can keep
everything, which is crucial because we don't want to archive or delete
meaningful data."

"ETL is an antiquated technique, and for
large companies it's inefficient and wasteful because you create multiple
copies of data," he says. "Everybody used ETL because they couldn't
put everything in one place, but that has changed with Hadoop, and now we copy
data, as a matter of principle, only when we absolutely have to copy."

Shelley sees Hadoop as part of a larger IT
ecosystem, too, and says systems such as Teradata will continue to have an
important, focused role at Sears. But he's on the far end of the spectrum in
terms of how much of the legacy environment Hadoop might replace. Countering
Shelley's sometimes sweeping predictions of legacy system replacement, Mike
Olson, CEO of Cloudera, says: "It's unlikely that a brand-new entrant to
the market [like Hadoop] is going to displace tools for established workloads."

MetaScale also offers data architecture,
modeling, and management services and consulting. The big idea behind Hadoop is
to bring in as much data as possible while keeping data structures simple.
"People want to overcomplicate things by representing data and dividing
things up into separate files," says Scott LaCosse, director of data
management at Sears and MetaScale. "The object is not to save space, it's
to eliminate joins, denormalize the data, and put it all in one big file where
you can analyze it." It's an approach that's counterintuitive for a
SQL veteran, so a big part of MetaScale's work is to help customers change
their thinking: You apply schema as you pull data out to use it, rather
than take the relational database approach of imposing a schema on data before
it's loaded onto the platform. Hadoop holds data in its raw form, giving users
the flexibility to combine and examine the data in many ways over time.
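
To make the "apply schema as you pull data out" idea concrete, here is a minimal sketch in plain Python. It is not Sears' or MetaScale's code, and it does not use Hadoop itself; the sample records, the field names, and the helper functions read_with_schema and revenue_by are all hypothetical, chosen only to illustrate one denormalized file being read through a schema at query time.

```python
import csv
import io

# Hypothetical raw, denormalized sales records, stored exactly as they arrived.
# Store and product details are repeated on every line, so no joins are needed.
RAW = """\
2012-11-01,store_12,Chicago,sku_9,Hammer,2,19.98
2012-11-01,store_12,Chicago,sku_4,Drill,1,89.00
2012-11-02,store_07,Dallas,sku_9,Hammer,1,9.99
"""

# The schema lives with the reader, not with the stored data (an assumption
# made for this sketch): each entry is (field name, type conversion).
SALES_SCHEMA = [
    ("date", str),
    ("store", str),
    ("city", str),
    ("sku", str),
    ("product", str),
    ("qty", int),
    ("amount", float),
]


def read_with_schema(raw, schema):
    """Yield one dict per raw line, applying the schema only at read time."""
    for row in csv.reader(io.StringIO(raw)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


def revenue_by(raw, key):
    """Total the 'amount' field grouped by any field named in the schema."""
    totals = {}
    for record in read_with_schema(raw, SALES_SCHEMA):
        totals[record[key]] = totals.get(record[key], 0.0) + record["amount"]
    return totals


if __name__ == "__main__":
    # The same raw file answers two different questions without being reloaded.
    print(revenue_by(RAW, "city"))
    print(revenue_by(RAW, "product"))
```

Because the raw text is never reshaped on the way in, a later question that needs a different grouping, or a different interpretation of a column, only requires a new schema or a new reader, not a new load.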