Big Data is THE biggest buzzword around at the moment, and I guess it makes sense to start my new ‘The Big Data Guru’ column with a post that goes back to basics and establishes what big data really is, what it isn’t and why it matters to everyone.

One thing is certain: big data will impact everyone's life. Having said that, I also think that the term 'big data' is not very well defined and is, in fact, not well chosen. It is also completely over-hyped, but this just comes with the territory (software vendors and consulting companies need these buzzwords to generate interest and sell new products and services). Let me use this article to explain what's behind the massive 'big data' buzz and hopefully demystify some of the hype.

Introduction to Big Data

Basically, big data refers to our ability to collect and analyze the vast amounts of data we are now generating in the world. Harnessing these ever-expanding volumes of data is completely transforming our ability to understand the world and everything within it. Advances in capturing and analyzing big data allow us, for example, to decode human DNA in minutes, find cures for cancer, accurately predict human behavior, foil terrorist attacks, pinpoint marketing efforts, prevent diseases and so much more.

You might ask: So what is new here? Haven’t companies and organizations captured and analyzed data for a long time? Yes, but there are three things that are changing at the moment and are making the phenomenon of ‘big data’ real:

The rate at which we are generating new data is frightening - I call this the ‘datafication’ of our world.

We generate more complex forms of data.

Our ability to analyze data has been transformed in recent years.

The Complete Datafication of Our World

Day after day our world is filled with more and more data and the pace of the data growth is accelerating week by week. Data on every aspect of our life is now being generated. Here are just some examples that illustrate what I mean by the datafication of our world:

We increasingly leave digital records of our conversations: Emails are stored in corporate systems, our social media updates are filed and phone conversations are digitized and stored.

More and more of our activities are digitally recorded: Most things we do in a digital world leave a data trail. For example, our browser logs what we are searching for and what websites we visit, and websites log how we click through them, as well as what and when we buy, share or like something. When we read digital books or listen to digital music, the devices will collect (and share) data on what we are reading and listening to and how often we do so. And when we make payments with credit cards, for example, the transactions are logged.

A lot of photos and videos are now digitally captured and stored. Just think of the millions of hours of CCTV footage captured every day. In addition, we take more videos on our smart phones and digital cameras, leading to around 100 hours of video being uploaded to YouTube every minute and something like 200,000 photos added to Facebook every 60 seconds.

Companies and organisations are creating vast repositories of data, keeping a digital record of everything that is going on: Just think of all the data generated daily in our financial systems, stock control systems, ordering systems, sales transaction systems and HR systems. These data repositories are growing by the minute.

We generate data using an ever-growing number of smart devices and sensors: Our smart phones track where we are and how fast we are moving, there are sensors in our oceans tracking temperatures and currents, sensors in our cars monitoring our driving, and sensors on packaging and pallets tracking goods as they move along supply chains. Smart watches, Google Glass and pedometers collect data. For example, I wear an Up band that tells me how many steps I have taken and the calories I have burnt each day, as well as how well I have slept each night. Many devices are now internet-enabled so that they self-generate and share data. Smart TVs and set-top boxes, for example, can track what you are watching and for how long, and even detect how many people are sitting in front of the TV.

I am sure you are getting the point. The volume of data is growing at a frightening rate. Google’s executive chairman Eric Schmidt puts it succinctly: “From the dawn of civilization until 2003, humankind generated five exabytes of data. Now we produce five exabytes every two days…and the pace is accelerating.”

Not Only More Data But More Complex Data

So yes, we are generating unimaginable amounts of data. The other thing that has changed is that we now generate new and more complex types of data, such as digital records of phone conversations, video and photo images, and conversations in social media speak (hashtags, LOL, etc.). In the world of ‘Big Data’ we talk about the 4 Vs that characterize big data:

Volume – the vast amounts of data generated every second

Velocity – the speed at which new data is generated and moves around (credit card fraud detection is a good example where millions of transactions are checked for unusual patterns in almost real time)

Variety – the increasingly different types of data (from financial data to social media feeds, from photos to sensor data, from video capture to voice recordings)

Veracity – the messiness of the data (just think of Twitter posts with hash tags, abbreviations, typos and colloquial speech)

So, we have a lot more data than ever before, in more complex formats, often fast moving and of varying quality – why would that change the world? The reason is that we now have the technology to bring all of this data together and analyze it – something we could never do before.

We Can Now Analyze and Make Sense of ‘Big Data’

In the past, traditional database and analytics tools couldn’t deal with extremely large, messy, unstructured and fast-moving data. We now have tools that can analyze vast amounts of data by breaking the work into small tasks that run in parallel across a large cluster of computers, with each machine processing one small piece of the overall analysis and the partial results then being combined. In my next post I will go into more detail and discuss why ‘Big Data Analytics’, and not ‘Big Data’ as such, is the real game changer. In the meantime, let me leave you with some real-life examples of how big data is used today:

The FBI is combining data from social media, CCTV cameras, phone calls and texts to track down criminals and predict the next terrorist attack.

Facebook is using face recognition tools to compare the photos you have uploaded with those of others to find potential friends of yours (see my post on how Facebook is exploiting your private information using big data tools).

Politicians are using social media analytics to determine where they have to campaign the hardest to win the next election.

Video analytics and sensor data from baseball or football games are used to improve the performance of players and teams. For example, you can now buy a baseball with over 200 sensors in it that will give you detailed feedback on how to improve your game.

Artists like Lady Gaga are using data on our listening preferences and sequences to determine the most popular playlists for their live gigs.

Google’s self-driving car analyzes a gigantic amount of data from sensors and cameras in real time to stay on the road safely.

The GPS information on where our phone is and how fast it is moving is now used to provide live traffic updates.

Companies are using sentiment analysis of Facebook and Twitter posts to determine and predict sales volume and brand equity.

Supermarkets are combining their loyalty card data with social media information to detect and leverage changing buying patterns. For example, it is easy for retailers to predict that a woman is pregnant simply based on her changing buying patterns, which allows them to target pregnant women with promotions for baby-related goods.

A hospital unit that looks after premature and sick babies generates a live stream of every heartbeat and then analyzes the data to identify patterns. Based on this analysis, the system can now detect infections 24 hours before a baby shows any visible symptoms, which allows early intervention and treatment.
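To make the earlier point about parallel analysis a little more concrete, here is a minimal sketch in Python of the divide-and-conquer pattern behind these systems: split the data into chunks, analyze each chunk on a separate processor, then merge the partial results. It is a toy word count (the data and function names are made up for illustration); real big data platforms apply the same map-and-combine idea across clusters of thousands of machines.

```python
# Toy illustration of parallel "divide and conquer" data analysis:
# split the data into chunks, analyze each chunk in a separate process,
# then merge the partial results into one overall answer.
from collections import Counter
from multiprocessing import Pool


def count_words(chunk):
    # Each worker handles only its own small piece of the data.
    return Counter(chunk.split())


def word_count(chunks, workers=2):
    with Pool(workers) as pool:
        partial_counts = pool.map(count_words, chunks)  # runs in parallel
    total = Counter()
    for partial in partial_counts:
        total += partial  # merge step: combine the partial results
    return total


if __name__ == "__main__":
    chunks = ["big data is big", "data moves fast", "big analytics"]
    print(word_count(chunks).most_common(2))  # the two most frequent words
```

The interesting thing is that neither step cares how big the whole dataset is: adding more data just means adding more chunks and more workers, which is exactly what makes the approach scale.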

Final Thought

Finally, no discussion about big data would be complete without mentioning the growing concerns about privacy. Many concerns have been expressed about how retailers, credit card companies, search engine providers and mail or social media companies use our private information. However, the privacy concerns around big data really exploded with Edward Snowden’s revelations about how the U.S. National Security Agency (NSA) collects and analyzes big data, including the phone records and social media activities of millions of Americans. Because this is another massive issue in its own right, I will address it in my third post. In the meantime, please follow me to make sure you receive the future posts in my Big Data Guru column, and feel free to also connect via Twitter, Facebook and The Advanced Performance Institute.

Bernard Marr is a globally recognized big data and analytics expert. He is a best-selling business author, keynote speaker and consultant in strategy, performance management, analytics, KPIs and big data. He helps companies to better manage, measure, report and analyse performance. His leading-edge work with major companies, organisations and governments across the globe makes him a globally ...
