Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity

We have all heard of the the 3Vs of big data which are Volume, Variety and Velocity. Yet, Inderpal Bhandar, Chief Data Officer at Express Scripts noted in his presentation at the Big Data Innovation Summit in Boston that there are additional Vs that IT, business and data scientists need to be concerned with, most notably big data Veracity. Other big data V’s getting attention at the summit are: validity and volatility. Here is an overview the 6V’s of big data.

Volume

Big data implies enormous volumes of data. It used to be employees created data. Now that data is generated by machines, networks and human interaction on systems like social media the volume of data to be analyzed is massive. Yet, Inderpal states that the volume of data is not as much the problem as other V’s like veracity.

Variety

Variety refers to the many sources and types of data both structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storage, mining and analyzing data. Jeff Veis, VP Solutions at HP Autonomy presented how HP is helping organizations deal with big challenges including data variety.

Velocity

Big Data Velocity deals with the pace at which data flows in from sources like business processes, machines, networks and human interaction with things like social media sites, mobile devices, etc. The flow of data is massive and continuous. This real-time data can help researchers and businesses make valuable decisions that provide strategic competitive advantages and ROI if you are able to handle the velocity. Inderpal suggest that sampling data can help deal with issues like volume and velocity.

Veracity

Big Data Veracity refers to the biases, noise and abnormality in data. Is the data that is being stored, and mined meaningful to the problem being analyzed. Inderpal feel veracity in data analysis is the biggest challenge when compares to things like volume and velocity. In scoping out your big data strategy you need to have your team and partners work to help keep your data clean and processes to keep ‘dirty data’ from accumulating in your systems.

Validity

Like big data veracity is the issue of validity meaning is the data correct and accurate for the intended use. Clearly valid data is key to making the right decisions. Phil Francisco, VP of Product Management from IBM spoke about IBM’s big data strategy and tools they offer to help with data veracity and validity.

Volatility

Big data volatility refers to how long is data valid and how long should it be stored. In this world of real time data you need to determine at what point is data no longer relevant to the current analysis.

Big data clearly deals with issues beyond volume, variety and velocity to other concerns like veracity, validity and volatility. To hear about other big data trends and presentation follow the Big Data Innovation Summit on twitter #BIGDBN.

Validity: also inversely related to “bigness”. So can’t be a defining characteristic.

Volatility: a characteristic of any data. No specific relation to Big Data.

Yes they’re all important qualities of ALL data, but don’t let articles like this confuse you into thinking you have Big Data only if you have any other “Vs” people have suggested beyond volume, velocity and variety.

Glad to see others in the industry finally catching on to the phenomenon of the “3Vs” that I first wrote about at Gartner over 12 years ago. For proper citation, here’s a link to my original piece: http://goo.gl/ybP6S. Other have cleverly(?) added other “Vs” but fail to recognize that while they may be important characteristics of all data, they ARE NOT definitional characteristics of big data. Adding them to the mix, as Seth Grimes recently pointed out in his piece on “Wanna Vs” is just adds to the confusion. –Doug Laney, VP Research, Gartner, @doug_laney

Resource Links:

Industry Perspectives

In this special guest feature, Eric Frenkiel of MemSQL champions the use of Apache Spark in the enterprise coupled with in-memory database technology to achieve the promise of real-time analytics. [Read More…]