How Does Big Data Relate to Text, and What Is It Exactly?

You Don't Need to Cross the Barrier into Big Data to Analyze Text

Mentioning the term 'big data' whisks us into a terrain full of confusion. For a time, this topic attracted a storm of contradictions, hype and impossible-to-fulfill promises. While this thankfully seems to be quieting, many attempts to define the term do not appear immediately helpful. For instance, an often-quoted definition (sometimes given without quotation marks or attribution) goes like this:

From this, we would conclude that if you just had an enormously large dataset, you still would not have big data. And apparently, if you did not use “innovative forms of information processing” (whatever those might be), then you similarly did not pass the threshold. And then, if the datasets are huge, and arriving quickly, but not high in variety . . . presumably, you get the idea.

Most authors agree on size, speed and variety as key, though. One article holds up the self-driving car as a prime example of how big data can be put into action. The car is sampling and processing many types of data many thousands of times a second, so we indeed have the three needed ingredients. Yet the car is deciding, not you (which is the whole point of its driving itself), and cars are generally not credited with much insight.

Down to a conclusion

We can say with some confidence that, when we need to apply big data to our decisions, it will be hard to manage. In part this arises because there will be more of it than you can handle comfortably. The definition of high quantity seems to be shifting. A few years ago, terabytes of data seemed to suffice. Now that we can process at least a few terabytes with a powerful enough PC, perhaps we need to have petabytes. (A petabyte is 1000 terabytes. Next up is the exabyte, 1000 petabytes.)

One slightly facetious conclusion is that big data is whatever requires new big storage, new big software, and big expenses. Also, this means big data is something that many friendly vendors are very interested in your getting—immediately.

Keep focused amid the plenty

You can get lost in the amount of text available for analysis. Quantities are staggering. If you care to dig into Twitter, for instance, you can wallow in some 500 million tweets per day, which sums to roughly 180 billion per year. This is almost unimaginable.
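The yearly figure follows from simple multiplication. As a rough back-of-envelope check, assuming a 365-day year and a constant daily rate (both simplifications, not claims from the original), the arithmetic can be sketched as follows. The result is of the same order as the rounded yearly figure quoted above.

```python
# Back-of-envelope check of the yearly tweet volume.
# Assumptions (illustrative only): a 365-day year and a constant daily rate.
TWEETS_PER_DAY = 500_000_000  # daily figure quoted in the text
DAYS_PER_YEAR = 365

tweets_per_year = TWEETS_PER_DAY * DAYS_PER_YEAR

print(f"{tweets_per_year:,} tweets per year")
print(f"about {tweets_per_year / 1e9:.1f} billion")
```

Under these assumptions the total comes to 182.5 billion tweets per year.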

One last point about big data: it is best to exercise a great deal of caution. Much of the data being stockpiled at such a staggering rate has not been gathered with any form of analysis in mind. And in fact, much of it keeps track of places, things and events—a thorough status and history. Yet, in theory, decisions based on analysis are forward-looking.

Current events can do well at forecasting what will happen in the short term, and in settings where the situations likely to arise are well known, as with the self-driving car. However, collected data does not provide good guidance for responding to changed conditions. Many claim that it does just this, though. After all, the thinking goes, with so much raw material, there must be a multitude of ways to apply it.

Attempts to “just do something” with heaps of data may underlie the steady appearance of articles with titles like, “Why most big data projects fail.” Using Google searches as a yardstick, the results in this arena are not terribly reassuring. The terms “big data” and “success stories” together do return some 1.4 million results. However, “big data” and “failure stories” bring back over 77 million.

Text and big data

You do not need to cross the barrier into big data to analyze text. The methods work admirably with smaller datasets, and can be scaled up to handle as much data as your computing equipment (and budget) will allow. Practical Text Analytics has real examples ranging from just a few hundred text comments to about 270,000, all within the smallish to more-or-less-large range.

Text can and should have a role in focused analyses that direct new actions. In Practical Text Analytics, we show how text analytics worked with other forms of data to give better guidance for decisions. These successful applications were based on data collected with specific analyses in mind. The answers using text combined with other information were better than those with the other information alone or with the text alone. This is not big data—but it is big.

About the Author: Dr. Steven Struhl has been involved in marketing science, statistics and psychology for over 25 years. Before founding Converge Analytic, he was Sr. Vice President at Total Research/Harris Interactive for 15 years, and earlier served as director of market analytics and new product development at statistical software maker SPSS. He brings a wealth of practical experience at the forefront of marketing science and data analytics. Steven is also known for academic work, having written books on market segmentation and text analytics, as well as over 25 articles for academic and trade journals. He has taught both at the graduate level and in a research certification program. He is a regular speaker at trade conventions and seminars.