3/23/2009 @ 6:00AM

Taming The Data Beast

The amount of data inside large corporations is exploding, driving up the cost of equipment in the data center, the power needed to run it and the real estate needed to house this ever-expanding infrastructure.

Around the globe, CIOs are wrestling with the same problem: how to reduce the amount of data they store and process each day. Driven by the digitization of documents, images and even video, as well as by technology inefficiencies that allow the same data to be copied and stored many times over, the problem is growing exponentially.

Forbes caught up with Stephen Brobst, chief technology officer at Teradata, to talk about the rising floodwaters of data and what needs to be done.

Forbes: What’s the largest problem facing CIOs?

Stephen Brobst: Problem No. 1 is the explosion of data. It’s happening across all industries. Telecommunications is a good example. Ten years ago they were keeping the billing history. Now they’re keeping the billing history and the call records, and they’re starting to add signaling data, which is the low-level network data. Each one of those steps is an order of magnitude increase in the amount of data that’s being analyzed to understand customer behavior and quality of service delivery. We’re seeing that kind of problem in all industries.

Why is it so hard to solve?

Along with the volume of data, there's a proliferation of islands of data, or data marts. Because of a lack of strong governance, or for political reasons, organizations have started to see data proliferate, aligned to specific departments or analytic applications within the business. The redundancy causes confusion because there's no single source of truth for doing analytics in the organization. CIOs are working really hard to eliminate these redundant repositories and consolidate the different data marts. Having the right technology to do that is important, but a lot of it is getting the right enterprise information management strategy and strong governance in place. It's the organizational will to get things done.

So the warnings years ago from centralized IT gurus came true?

There is an analogy to that. When end-user computing first came out, it was unmanaged and in some cases without standards, and that led to cost problems. You don't want end-user computing to go away. But with analytics you want an overall enterprise strategy so you can be more cost-effective.

Where is the data being created today?

It’s being created in the old data-processing systems, and then it’s being extracted into analytic repositories. The problem is everyone is extracting it into their own repository. The network management people extract into their repository, the finance people extract into theirs and the marketing people extract into theirs. That’s where you get these data marts. Everyone has their own copy of the data in a slightly different format, versus an enterprise approach that extracts it into one repository and then reuses that content across many different knowledge-worker communities.
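
To make the contrast concrete, here is a minimal Python sketch of the two patterns. Everything in it, from the record layout to the department extract functions, is a hypothetical illustration rather than anything Brobst describes:

```python
from datetime import date

# Pretend source system: the operational transaction records.
SOURCE_TRANSACTIONS = [
    {"id": 1, "customer": "A", "amount": 120.0, "date": date(2009, 3, 1)},
    {"id": 2, "customer": "B", "amount": 75.5, "date": date(2009, 3, 2)},
]

# Data-mart pattern: each department runs its own extract, producing
# a private copy in a slightly different format.
def finance_extract():
    return [{"txn": r["id"], "amt": r["amount"]} for r in SOURCE_TRANSACTIONS]

def marketing_extract():
    return [{"cust": r["customer"], "when": r["date"]} for r in SOURCE_TRANSACTIONS]

# Enterprise pattern: extract once into a single shared repository...
SHARED_REPOSITORY = list(SOURCE_TRANSACTIONS)  # one canonical copy

# ...then give each knowledge-worker community a view over the same
# rows instead of a redundant physical copy.
def finance_view(repo):
    return [(r["id"], r["amount"]) for r in repo]

def marketing_view(repo):
    return [(r["customer"], r["date"]) for r in repo]

if __name__ == "__main__":
    print(finance_view(SHARED_REPOSITORY))
    print(marketing_view(SHARED_REPOSITORY))
```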

How much redundancy is out there?

There have been studies done, and in some cases it’s scary. There are very large, well-known financial institutions that have as many as a dozen copies of their highest-volume transaction data for different purposes like risk and marketing. The cost implications of that are very large. And that doesn’t include the confusion caused by different rules and timing for extracting that data.

Are companies setting up effective policies for getting rid of data?

There’s certainly a range of how far companies have gone with that. But the current economic situation has put an emphasis on finding inefficiencies with redundant data and ferreting them out. That’s happening more now than at any time over the past five years. It’s not that everyone is done with the consolidation, but there certainly are a lot of people working on it.

How long does it take to fix this problem?

Normally you don’t do it in one fell swoop. You have to migrate any analytics or reporting that were sitting on top of the redundant data, and you have to move those to the centralized, shared copy of the data. All of that has to be done before you get the cost, space and power savings. It’s six months before you see the initial benefit, and then you deliver new iterations every 90 to 180 days. In total, it can take a couple of years before the majority of the data marts are consolidated.

Is it ever fully under control, given all the new data being generated?

In addition to a consolidation strategy, you have to have a governance strategy in place so the problem doesn’t continue. The approach is to clean up the sins of the past and at the same time put in stronger governance to prevent those sins from cropping up in the future.

What do you have to watch out for if you’re shutting off servers?

There is a lot of data in the enterprise. What you don’t know is who the users of that data are and what the dependencies are. It’s not uncommon to find that even though you’ve built a supposedly non-critical repository, the analysis being done on it is so vital that you can’t live without it, even though it was never designed for that level of availability or mission-critical value.

If you consolidate data, is it more secure?

Yes. The more copies of the data you have floating around the organization, the more vulnerable it is. If you have an enterprise strategy for managing the data, that should include security and the right controls and policies for privacy management as well. Departments or individuals probably don’t have the same level of training that an enterprise has.

If you were starting a company from scratch, what would be the ideal setup? Would you even have the data in-house?

On the organizational side, I’d want to make sure I had an enterprise information management strategy. That includes proper governance at the architectural level within IT and separate data governance in the business, not in IT. You need checks and balances between the architecture and the data in the business. That’s the first step. Then, on the technology side, you need a blueprint for how the data is going to be reused. That requires appropriate incentives and controls to prevent duplication. And you need security policies in place, which can be enforced with the appropriate technologies like encryption, Active Directory services and LDAP (Lightweight Directory Access Protocol) standards. And then I would build a semantic layer that allows knowledge workers to look at the data they are authorized to see, in the form they want to see it, but with one single core so those businesses do not feel a need to duplicate it.
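
As one way to picture the semantic-layer idea, here is a minimal Python sketch in which a single core dataset is projected down, at query time, to the fields each role is authorized to see. The roles, fields and records are all hypothetical, and a real deployment would enforce the policy in the database rather than in application code:

```python
# One core dataset; no department ever holds its own physical copy.
CORE_DATA = [
    {"customer": "A", "revenue": 120.0, "account_no": "12-3456"},
    {"customer": "B", "revenue": 75.5, "account_no": "98-7654"},
]

# The access policy: which columns each role is authorized to see.
AUTHORIZED_FIELDS = {
    "marketing": {"customer", "revenue"},
    "compliance": {"customer", "revenue", "account_no"},
}

def semantic_view(role):
    """Project the single core dataset down to the fields this role
    may see. The data itself is never duplicated per department."""
    allowed = AUTHORIZED_FIELDS[role]
    return [{k: v for k, v in row.items() if k in allowed}
            for row in CORE_DATA]

if __name__ == "__main__":
    print(semantic_view("marketing"))   # sensitive field filtered out
    print(semantic_view("compliance"))  # full record
```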

But do companies have to keep records forever?

The cost of storing data is out of control, so most companies don’t do that. But there’s a strong demand to keep some data for defined periods of time. In the E.U., there is a data retention law that requires telecommunications companies to keep two years of call detail records for later use in law enforcement. That creates demands to store the data efficiently. You only want one copy of that data, and you want multi-temperature data management techniques. Even though you have two years of data, you don’t need to store all of it at the same price/performance level. You might only need ready access to the last 90 days of data.
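
Here is a minimal Python sketch of that multi-temperature placement: the most recent 90 days of records land on fast "hot" storage and everything older on cheaper "cold" storage, while a single logical copy is retained. The record layout and tier names are hypothetical illustrations:

```python
from datetime import date, timedelta

# Per the 90-day figure in the interview, recent records stay "hot".
HOT_WINDOW = timedelta(days=90)

def assign_tier(record_date, today):
    """Pick a storage tier by age: recent data on fast storage,
    older data on cheaper storage, still one logical copy."""
    return "hot" if today - record_date <= HOT_WINDOW else "cold"

# Hypothetical call detail records kept under the retention rule.
records = [
    {"call_id": 1, "date": date(2009, 3, 20)},  # recent: stays hot
    {"call_id": 2, "date": date(2008, 6, 1)},   # old: moves to cold
]

today = date(2009, 3, 23)
for r in records:
    print(r["call_id"], assign_tier(r["date"], today))
```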