Eating the Elephant - Variety

Author: Jeremiah Evans, Senior Applications Developer

So far in this series we have introduced the “3 V’s” of Big Data: Volume, Velocity, and Variety, and focused on Volume and Velocity. Recall that we encouraged you to ask yourself, “What data am I missing because it’s not tracked when it doesn’t fit into my existing data models?”

Traditional data warehouses are good for traditional data. But as we discussed in the previous article on Velocity, the sources of data are increasing, and along with that, the types of data are also increasing. New research into predictive analytics is finding value in data that previously didn’t fit into standard models, or wasn’t included because it was deemed unimportant.

As I mentioned back in the article on Volume, a modern data architecture offers numerous techniques and tools to store, manage and query your data. But it’s not just the volume of data that is changing; it’s also the way that data looks. In this article we will discuss the importance of not filtering out data just because it doesn’t look like any other data we have, and how to quickly leverage multiple sources to derive business insights.

It’s a Jungle Out There

What are you missing in your data right now because you can’t search through both a database and all your stored PDFs? What insight is just around the corner, if only you could connect all of your sources? Much like a jungle, your data can look like an overgrown tangle, no two inputs looking quite the same. Traditional data warehouses required their designers to go in with a machete and hack out the bits that were “in the way.” While this works in an environment where most of your data looks alike, that is less and less the landscape we are living in.

We’ve all heard the saying, “you don’t know what you don’t know,” that there is always some blind spot that you’re unaware of. With data, it’s often the case that “you don’t know what you do know.” You’re not getting the most out of the data you already have, either because you can’t analyze it all together, or because you’ve had to manipulate it to fit into your existing data models.

In a traditional data warehouse, not only do you have to figure out the question to ask, but the data you need might not be there. Today, to onboard your data you probably have to go through a lengthy process of analyzing the data and its structure, and then either defining a new location for it in your warehouse or determining the proper transformation to fit it into an existing bucket. This “schema on write” data classification makes sense when the shape of your data dictates its usage.
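To make the cost of “schema on write” concrete, here is a minimal sketch in plain Python. It is not taken from any warehouse product: the `SaleRow` table, the `load_on_write` function and the sample record are all hypothetical names invented for illustration. The point is only that whatever does not fit the predefined shape is lost at load time.

```python
from dataclasses import dataclass, fields
from typing import Dict, Tuple


@dataclass
class SaleRow:
    # Hypothetical warehouse table: only these two columns exist.
    order_id: int
    amount: float


def load_on_write(record: Dict[str, str]) -> Tuple[SaleRow, Dict[str, str]]:
    """Fit a raw record into the fixed schema.

    Returns the stored row and, for illustration, the fields that had
    nowhere to go. A real load job would usually just discard them.
    """
    known = {f.name for f in fields(SaleRow)}
    row = SaleRow(order_id=int(record["order_id"]), amount=float(record["amount"]))
    dropped = {k: v for k, v in record.items() if k not in known}
    return row, dropped


# Hypothetical incoming record with two fields the schema never planned for.
raw = {"order_id": "17", "amount": "42.50", "coupon": "SPRING", "device": "mobile"}
row, dropped = load_on_write(raw)
print(row)      # what the warehouse keeps
print(dropped)  # what nobody can query later
```

In this sketch the coupon and device fields never reach storage, so any later question about them cannot be answered without reworking the schema and reloading the data.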

In a modern data lake, technologies like Apache Hive and Apache Drill allow you to run “schema on read” queries, and Drill can even auto-discover the structure of a raw file. Bringing data into the platform may involve some transformation and enrichment, but it doesn’t have to. New data sources can be landed as is and queried dynamically. Not only does this make room for new data, it also lets you stay flexible when the format of your data changes. What questions could you be asking today if you could quickly add a new, completely unique data source?
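The idea behind “schema on read” can be sketched in a few lines of plain Python. This is an illustration of the concept only, not how Hive or Drill are implemented, and the sample data and function names (`read_records`, `discover_schema`, `total_by`) are assumptions made up for the example. Records are landed untouched, and structure is worked out when a question is asked.

```python
import json

# Raw events landed as-is, one JSON object per line (hypothetical sample data).
# Note that the shape changes from line to line.
LANDED = """\
{"order_id": 17, "amount": 42.5}
{"order_id": 18, "amount": 10.0, "coupon": "SPRING"}
{"order_id": 19, "amount": 7.25, "coupon": "SPRING", "device": "mobile"}
"""


def read_records(raw_text):
    """Parse each non-empty line at query time; nothing is dropped on the way in."""
    return [json.loads(line) for line in raw_text.splitlines() if line.strip()]


def discover_schema(records):
    """Infer field names, and the value types seen for each, from the data itself."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def total_by(records, field):
    """Sum `amount` grouped by any field; records missing the field group under None."""
    totals = {}
    for rec in records:
        group = rec.get(field)
        totals[group] = totals.get(group, 0.0) + rec["amount"]
    return totals


records = read_records(LANDED)
schema = discover_schema(records)
by_coupon = total_by(records, "coupon")
print(schema)
print(by_coupon)
```

Because the coupon field was never filtered out on the way in, grouping by it is a one-line question rather than a schema change, and records that predate the field still participate.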

Liven up the Party

Think of the data you’re collecting like guests at a party. Suppose you threw a party with exactly three rooms, completely separate and walled off from each other. Each room has a very specific feel, and isn’t very inviting for guests who don’t exactly fit in. The party’s OK, but it fizzles out by 9 P.M.

On the other hand, what if you set everything up in a more open space where, instead of being walled off, each area blended into the next? Guests who want to listen to the music, have a conversation, or do both at once are good to go. They flow more easily from one space to the other, mingling and interacting. You find out the next day that a brand new business venture started at your party, because two of your guests were in the right place at the right time. The space provided just the right environment to bring together two people who otherwise might never have met.

What questions could you be answering right now if your analysts were free to blend data beyond the scope of a traditional data warehouse?

What’s Next

We’ve talked in this series about how data is changing, and how companies are going to need to change to keep up with it. It seems like a lot to change, but our phased approach helps to eliminate the barriers to entry, letting you eat the Big Data elephant one bite at a time.