When Data Hoarding Makes Sense

To hoard or not to hoard? EMC's Bill Schmarzo defends the practice of saving all the data you can, even when its value is uncertain.



Which data should you keep? Which should you toss? And how can you determine which data will deliver actionable insights at some point down the road, and which is merely taking up valuable storage space?

There's no easy answer, of course, and the solution will almost certainly vary by organization or industry. But two camps are emerging in the what-to-keep debate. One professes that big data's real value comes from near-real-time analysis of information and that archived data doesn't deliver a lot of bang for the buck. Another, however, argues that it's good business sense to store all the data you can.

Bill Schmarzo, chief technical officer of EMC Global Services, is a proud member of the data hoarder camp. "I'm a hoarder; I want it all," Schmarzo told InformationWeek in a phone interview. "And even if I don't know yet how I'll use that data, I want it because I can store it so cheaply. My data science team might find a use for it."

Perhaps it's not surprising that an executive of EMC, a major player in the data storage industry, would be a strong proponent of the save-it-all philosophy. But Schmarzo does make a compelling argument backed by a key technological trend: The cost of data storage continues to plummet dramatically.

One reason is that the Hadoop Distributed File System (HDFS) stores data at a much lower cost than a traditional relational database management system (RDBMS). Schmarzo passed along an anecdote about a friend of his, an executive in charge of analytics for a national insurance company: "He found that it cost the same to store four terabytes on his enterprise data warehouse as it did to put 200 terabytes on Hadoop -- that's a 50x improvement."
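The 50x figure follows from simple unit-cost arithmetic: if the same total spend buys 4 TB on the warehouse or 200 TB on Hadoop, the cost per terabyte differs by 200/4. A minimal sketch (only the 4 TB and 200 TB figures come from the article; everything else is illustrative):

```python
# Same total spend buys 4 TB on the enterprise data warehouse
# or 200 TB on Hadoop, per the anecdote in the article.
edw_tb = 4        # terabytes purchasable on the warehouse
hadoop_tb = 200   # terabytes purchasable on Hadoop for the same spend

# Equal spend means the cost-per-TB ratio is just the capacity ratio.
improvement = hadoop_tb / edw_tb
print(f"Cost-per-TB improvement: {improvement:.0f}x")  # → 50x
```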

Greatly reduced storage costs allow you to think differently about how you approach and monetize data, Schmarzo added. "We need to have a data-abundance mentality. We want it all. We want to share [data], grab it, play with it, and figure out what's there. And if it's not useful, shove it back into its bin and go onto the next data source."

Schmarzo provided an example of how a large grocery chain mined 15 years' worth of data on its customers' buying habits. Thirteen months of this "loyalty card" data was stored in the company's data warehouse; the rest was archived on tape drives. "Their key business initiative is around personalized marketing offers," said Schmarzo. "They wanted to leverage their [mobile] app to deliver personalized offers based on all this customer data they have."

An analysis of the data revealed an interesting, and actionable, insight: The 15-year time period included two recessions. This allowed the grocery chain to determine when shoppers were first impacted by the economic slowdowns, and when they started to recover from them.

"We identified three things that people do when they start struggling," Schmarzo said. "First off, they stop buying higher-quality products and start buying lower-quality products. Two, they start buying private-label products. Tissue paper is the first one to go, for some odd reason. Three, they start using coupons more often."

In retrospect, this behavior makes sense. "That's exactly what the data told us," said Schmarzo. "You can extrapolate based on demographics, geography, and behavioral groups -- all different ways to slice and dice the data, once you have it at that very low level of granularity."

He added, "We would not have had [those insights] if we had not had access to 15 years of data. So I'm a big believer in 'I want it all.' "


Jeff Bertolucci is a technology journalist in Los Angeles who writes mostly for Kiplinger's Personal Finance, The Saturday Evening Post, and InformationWeek.

Great article and topic, Jeff. I agree with Bill – this is one case where it's good to be a hoarder. In my opinion, big data requires access to as much data as possible, retained for as long as possible – it's no longer acceptable to simply toss what you can't use right away. With the meteoric rise of data volumes, companies can't expect to get by with the bare minimum. That said, it's important that companies focus on placing data in the appropriate tiers. I have been watching the growth of "cold data" storage, in which companies keep less-frequently accessed data at a lower price point with only moderately reduced access times – a tier created to tackle this exact problem. If you are interested, we recently conducted research on this exact topic: http://storiant.com/resources/Storiant-CIO-Survey-Report.pdf

The long-standing argument against data hoarding, going back to the early days of data warehousing, rested on two points: 1) the excessive up-front business analysis needed to properly categorize, organize, cleanse, move, and then store operational data in some form of analytical database; and 2) the cost and complexity of the storage and databases needed to do so. However, as Mr. Schmarzo points out, big data technology effectively neutralizes both arguments, opening up new possibilities for insights drawn from "broad and deep" retained data. Not every modern data repository will yield outsized results from its cache of "hoarded" data, but that doesn't matter; the opportunity cost of not hoarding should be enough to prompt business and technology strategists to "think like a hoarder" and leave those long-standing objections behind.

The negative data-hoarding behaviors in business are more about business units hoarding data as a power play – keeping it to themselves and refusing to share it with others who could make productive use of it. What you're talking about here is less selfish and more generous: data accumulation for the (potential) good of the organization.

Most IT teams have their conventional databases covered in terms of security and business continuity. But as we enter the era of big data, Hadoop, and NoSQL, protection schemes need to evolve. In fact, big data could drive the next big security strategy shift.

Why should big data be more difficult to secure? In a word: variety. But the business won't wait to use it to predict customer behavior, find correlations across disparate data sources, predict fraud or financial risk, and more.