Defining the term: Big data

Big data is difficult to define because everybody seems to look at things differently.

Cox and Ellsworth (1997) made the distinction between big data collections and big data objects. The latter are data sets that were too large to be processed by standard algorithms and software on the hardware available at the time. Under this definition, several sites may have to be used to get the job done.

Cox and Ellsworth defined big data collections as aggregates of several data sets that are, for example, multi-source, multi-disciplinary, or stored at different sites in disparate types of data repositories. While a single data object or data set might be manageable by itself, aggregating several of them makes data analysis a challenge.

More than 12 years later, Jacobs (2009) pointed out that any definition of big data is a moving target. Thanks to ever faster memory chips, what was not easily processable only a year ago might well be today. In other words, big data for a mainframe computer in 1981 might be analysed and processed with ease using a MacBook Pro in 2012.

Much of the data we collect accrues during a transaction, such as a user logging into an e-commerce site. Here, account data is retrieved and session information is added to a log. This allows the user to search for a product and possibly purchase something. If a purchase is made, payment details are added and the user's data are updated. Such databases have been maintained for years through customer loyalty programs.

Jacobs pointed out that the challenge is neither transaction processing nor data storage. His reasoning was that few companies acquire such data in volumes that would make processing and storing them a challenge. The challenge starts when we want to get answers to all kinds of questions from these data within seconds or minutes.

Based on the above (see also below), it is probably safe to define big data using three main features:

Velocity: Data is produced at high speed, such as by the thousands of closed-circuit television (CCTV) cameras across London, UK monitoring public spaces to help prevent crime.

Volume: All those CCTV cameras across London produce vast amounts of visual data that must be interpreted, for instance to help solve crime (see y-axis below).

Variety: The biggest challenge is to match structured, machine-readable data with unstructured text or images (e.g., video feeds). Examples include pictures that were not tagged with keywords.

This challenge is not new, however. Security services have tried for ages to match different data sources (e.g., telephone conversations and email) to gain valuable insights. All this is done to better manage the risks society faces from terrorist threats (e.g., the Boston Marathon bombings) or hacker attacks.

For instance, forecasting influenza trends using various data points, including Google Flu Trends, is interesting. But how much Google’s search data as a whole adds to our ability to forecast trends for next winter is unclear. If the forecast fails, how will this help health policy makers put the right strategy in place? What about helping us reduce the risk of a flu epidemic next winter (see below)?

Customer loyalty programs, such as those offered by Safeway, Tesco or Migros, may give us lots of data, but unfortunately they fail to give us the insights we need for marketing. This is especially true if a client does 60 percent or less of their total household shopping at your store. Giving them a toothpaste discount coupon on the back of the cashier receipt sounds great. But what if they just purchased enough supplies for the next 12 months? Worse, what if they did so by taking advantage of a special at the store down the road last week? Of course, it is impossible for you to know, but the result is that your marketing efforts may come across as simply a nuisance.

Customer loyalty to shops, airlines, hotels and so forth is low these days. Big data may not give us a true picture of what is happening because, for starters, our data may not be accurate (see the example above; our records show they shop for this brand and are ready for a refill – NOT). In fact, big data predictions about people’s needs (e.g., the toothpaste example) will be incorrect. Likewise, predictions about their behavior (e.g., the risk of joining a terrorist cell) will make inaccurate inferences about people and punish them before they have acted. Besides being a nuisance, this negates ideas of fairness and justice.

Big data can be as fickle or useless as trying to predict which fashion trends will get consumers excited in stores next spring and which are sure to flop. Unless we gain more insight (are these data accurate and valid?), we cannot manage this risk any better.

The fact is, gaining such insights may not require tons of data. Talking to a few clients to better understand why they do certain things a certain way may, however, be critical for a thoughtful analysis. And no, a telephone marketing survey will not get answers from the customers you most want replies from (e.g., successful professionals).

Do you agree with my definition of big data?
Do you know about a big data fail?
What is a great big data case study that benefitted you? Thanks again for sharing your thoughts and insight – I appreciate it, as always.

The author: This post was written by social media marketing and strategy expert Urs E. Gattiker, who also writes about issues that connect social media with compliance, and thrives on the challenge of measuring how it all affects your bottom line.

Great post. I particularly like your review of some research papers and how they define big data.

The cube would suggest that this is a moving target, as you say. And if @Viktor_MS:twitter is right, this requires that we have reliable and valid data. But as your example from the store shows, it ain’t accurate most of the time.

Of concern is also that we often do not have a clue what algorithms are used to make predictions or calculate certain rankings (e.g., Klout, Google, etc.).

What do you think will happen – a trend?

http://blogrank.cytrap.eu/ig/4yt/*/*/*/CEO/top100 Urs E. Gattiker

Thanks Roberto for your comment.
I agree with you as far as your concern regarding how much we should trust algorithms (i.e., rules). Their use is ever more prevalent for producing ratings that rank individuals, as the Klout score suggests.

Klout dangles a classic seductive feedback loop, almost making sense… but not quite. In that way you are influenced by a phantasm such as a person’s Klout score, even though these data make no sense, do they?

The trend will be that ever more people realize that the saying “garbage in, garbage out” applies to big data as well. In other words, if these data are inaccurate or give you a distorted picture, how much trust should we put into these numbers?

Finally, when asked at the check-out whether they have a customer loyalty card the cashier should scan (i.e., to get bonus points or discounts in return), I see ever more younger folks replying: “I do not have one.”

Again, there seems to be a growing group of shoppers who just do not want to participate in such schemes. These will probably be the same individuals who will not want in-store tracking devices, or their own smartphones, collecting data about their shopping habits either.

If people continue to be concerned about privacy and data security, they will resist any such attempts to gain more information about their behavior as consumers. The recent theft of credit card data files at Target in the US suggests they are well advised to stay vigilant about their privacy.

Who writes here?

With two decades of application inside blue chips and FT Global 500s, his pioneering work in the fields of corporate blog benchmarking and the social media audit (see his books) is now a recognized gold standard.

He is known for a high energy style that combines humor, street smarts, and board room wisdom.
Read more...

ROI: Manage bottom line

Case studies, tips and tools you can use. Pre-order now and save 25% plus shipping!