Why Big Data is Like Seawater

The truth is there is nothing inherently good or bad about seawater. If you want to have a place where octopus and starfish thrive, then seawater is a pretty good place. If you want a place where you can transport goods from San Francisco to Hong Kong, then seawater serves your purpose. If you want a place where great white sharks do what great white sharks do, then seawater is your place.

But if you are thirsty, then you want to stay away from seawater. Seawater won’t slake your thirst and might even make you sick. So beware of seawater.

Why are we talking about seawater? We are talking about seawater because in many respects seawater is like “big data.” Every vendor around is enamored with big data. It is undoubtedly the buzzword of the year; and if you believe the pundits, big data is going to replace everything under the sun.

(Have we heard this before? How many times have we heard that the technology du jour is going to replace everything that came before it? You would think that the vendors and the venture capitalists would have a longer memory than they seem to have.)

Why is big data like seawater? First, like the seawater in the ocean, there is a lot of it. Once upon a time we measured things in gigabytes, but with big data, terabytes and petabytes are common.

There is another important way in which seawater is like big data. Like seawater, big data is impure. Big data contains every imaginable type of data. And all (or at least nearly all) of the data in big data is unstructured.

But what is the problem? The problem arises when you start to feed seawater to thirsty people. Human beings can’t drink seawater. Human beings need their water to be refined. There is too much salt in the seawater to be useful as drinking water, and there are other little creatures that live in seawater that you do not want crawling around your body. (You know – those microorganisms that Nova shows you in color.)

Here’s my advice:

Drinking seawater? BEWARE!

Doing analysis on text found in big data? BEWARE!

Just what are the challenges of using the text found in big data for analysis? There are many. First, there is the plain old physical irregularity of the text. Standard databases like to operate on data that is physically repetitive, with the same fields in the same layout, record after record. And the text found in big data is anything but physically repetitive.

But even if you get past the physical ugliness of the data found in big data, there is the issue of disambiguation of text. Raw text is unusable unless you establish its context. By itself, raw text is almost worthless. In order to use raw text as a basis for analysis, you need to disambiguate it.

As a simple example of disambiguation, consider this case: Two men are on a street corner and a lady walks past. One man says to the other, “She’s hot.” What is being said here? It could be that the lady is young, slender, and attractive. Or it could be that this scenario takes place in Houston, Texas, and it is 98 degrees with 100 percent humidity, and the lady is sweating profusely. Or it could be that the lady has just gotten a parking ticket and is angry. It could be any of those circumstances. Merely looking at the words “she’s hot” doesn’t tell you anything. Those words need to be disambiguated before they mean anything.
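
To make the idea concrete, here is a minimal sketch in Python of what disambiguation by context can look like. It is an illustration only, not a real text-analytics method: the cue words, the three sense labels, and the function name disambiguate_hot are all assumptions invented for this example. The point is simply that the surrounding words, not the phrase itself, decide the meaning.

```python
import re

# Hypothetical cue words for each reading of "hot" (illustrative only).
SENSE_CUES = {
    "attractive": {"gorgeous", "stunning", "whistled"},
    "overheated": {"degrees", "humidity", "sweating"},
    "angry": {"ticket", "furious", "yelled"},
}


def disambiguate_hot(context):
    """Guess which sense of "hot" is meant from the words around it."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    # Count how many cue words for each sense appear in the context.
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no supporting cue at all, the phrase stays ambiguous.
    return best if scores[best] > 0 else "ambiguous"


print(disambiguate_hot("She's hot"))
print(disambiguate_hot("It is 98 degrees with 100 percent humidity in Houston."))
print(disambiguate_hot("She has just gotten a parking ticket."))
```

Given only the words “she’s hot,” the sketch answers "ambiguous." Given the Houston weather or the parking ticket as context, it can settle on a reading.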

Disambiguation is not the only issue. When you look at a document, it often happens that the document has a structure that needs to be accounted for. A cookbook holds recipes. A contract has sections and exhibits. A book has chapters, and so forth. In most cases, the logical structure of a document is an important piece of information.
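
As a rough sketch of what accounting for structure can mean, the Python below splits a contract-like text into its sections and exhibits. The heading pattern (lines that begin with something like "Section 1" or "Exhibit A") and the function name split_by_structure are assumptions made for illustration; real documents mark their structure in many different ways.

```python
import re
from typing import Dict, List

# Assumed heading convention: "Section <number>" or "Exhibit <letter>".
HEADING = re.compile(r"^(Section \d+|Exhibit [A-Z])\b")


def split_by_structure(text: str) -> Dict[str, str]:
    """Group the lines of a document under the heading they follow."""
    parts: Dict[str, List[str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            # A new heading starts a new logical unit.
            current = match.group(1)
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {name: " ".join(lines) for name, lines in parts.items()}


contract = """Section 1: Term
This agreement lasts one year.
Section 2: Payment
Fees are listed in Exhibit A.
Exhibit A
Price list."""

for name, body in split_by_structure(contract).items():
    print(name, "->", body)
```

Once the text is grouped this way, a question such as "what does the payment section say?" can be answered without reading the whole document.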

If you want to drink seawater, you need to refine it. Similarly, while looking at the data found in big data is the first step in trying to make sense of it, if you want to analyze the text found in big data, you need to refine it.

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.