3 keys to keep your data lake from becoming a data swamp

Data lakes can store all your business data with ease, but beware: That massive repository can get bogged down, choking off your users. Here’s how to prevent your data lake from turning into a data swamp.

"I think the way to avoid it is actually to turn the spigot way, way down," Perez says. "Work on the presumption that just because it's cheap to collect the data does not necessarily make it cheap to use it. It could actually be quite expensive. So don't collect information from everywhere and all the time. Keep it focused with a data set where you have a specific plan as to how you're going to mine it."

2. Adopt a machine learning strategy

Even with a focused data set, gleaning insight from data at scale requires automation.

"You need an automated system to clean it up," Perez says. "AI, machine learning, deep learning, whatever term you want to use, it's the magical solution for wading your way through your information. I maintain that the easiest way to get the value out of your huge 5PB data lake is to start with having a technique for how you're going to learn from it."

To start, Perez says, pick a data set you know and select a machine learning technique for going through it. You will likely have to acquire new skills to do it effectively, either through training or hiring.

"Machine learning is a black art," he says. "It's not easy to do. You need very specific skills."

3. Determine the business issue you're trying to address

Here's where everything comes full circle: You need to start with a clear vision of the business problem you're trying to solve. With an objective in mind, it should be relatively easy to zero in on the data you need to collect and the best machine learning technique for gleaning insight from that data.

For instance, Perez says, imagine you're a big-box retailer. You might decide that you want to understand what kind of customers are coming into your stores. You could capture photographs of customers entering your stores and then use a convolutional neural network (CNN), a type of deep learning neural network that excels at computer vision problems, to process the images. The CNN could determine whether the person in any individual image is male or female, a child or an adult, or whether the image shows a child with an adult, a young person with an old person, etc.

"Once you've got all of that done, tie it up with a business initiative and give that capability to your business users," Perez says. "It could help you determine, 'We need to market more to men because we're not getting enough men.' You really need to have a clear strategy in advance. If you don't, just the mere collection of things becomes a huge negative to the process."
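To make that hand-off to business users concrete, here is a minimal sketch in Python of how per-visitor labels might be rolled up into a figure a marketing team could act on. It assumes the image classifier already exists and emits one label per visitor. The label names, the target shares, the tolerance, and the functions `visitor_shares` and `underrepresented` are all hypothetical, invented for illustration; they are not from Perez or from any particular product.

```python
from collections import Counter


def visitor_shares(labels):
    """Return each label's share of all classified visitors."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def underrepresented(shares, targets, tolerance=0.05):
    """List labels whose observed share falls short of the target share
    by more than the tolerance (all values are fractions of 1)."""
    return sorted(
        label
        for label, target in targets.items()
        if shares.get(label, 0.0) < target - tolerance
    )


# Hypothetical classifier output for ten store visitors.
labels = ["adult_female"] * 6 + ["adult_male"] * 2 + ["child"] * 2

# Hypothetical target mix set by the business, not by the model.
targets = {"adult_female": 0.45, "adult_male": 0.45, "child": 0.10}

shares = visitor_shares(labels)
print(underrepresented(shares, targets))  # prints ['adult_male']
```

In this made-up sample, adult men account for 20 percent of visitors against a 45 percent target, so they are the only group flagged. The point of the sketch is that the target mix comes from the business initiative, which is why the objective has to be set before the data is collected.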

Once you've built a capability with a business initiative in mind, it's often possible to iterate on that capability to provide the business with even more targeted solutions. For instance, once you can identify who's coming into your stores, you can apply that same capability to determining who's walking past your cosmetics counters.