The Fourth V For Big Data

Researchers at Lawrence Livermore National Laboratory couldn’t pinpoint why they experienced rapid variations in electrical load when bringing up Sequoia, one of the world’s most powerful supercomputers. Power would drop from 9 megawatts to a few hundred kilowatts around the same time every week.

By cross-checking different data streams, they found the source of the problem: the droop coincided with scheduled maintenance for the massive chilling plant.
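As a rough illustration of that kind of cross-check, the sketch below lines up a stream of load readings against a list of maintenance windows and reports which low readings fall inside one. Everything in it is hypothetical: the function name, the 1 MW threshold, the timestamps and the sample values are made up for the example and are not the lab's actual data or tooling.

```python
from datetime import datetime, timedelta


def drops_during_windows(readings, windows, threshold_mw):
    """Return (timestamp, load_mw, label) for each low reading inside a window.

    readings: list of (timestamp, load in megawatts)
    windows:  list of (label, start, end) from a second data stream
    """
    matches = []
    for ts, load_mw in readings:
        if load_mw >= threshold_mw:
            continue  # load looks normal, nothing to explain
        for label, start, end in windows:
            if start <= ts <= end:
                matches.append((ts, load_mw, label))
    return matches


# Hypothetical sample data: hourly readings near 9 MW with one sharp drop.
t0 = datetime(2024, 1, 1, 0, 0)
readings = [(t0 + timedelta(hours=h), 9.0) for h in range(6)]
readings[3] = (readings[3][0], 0.3)

# Hypothetical maintenance schedule taken from a separate system.
windows = [
    ("chiller maintenance",
     t0 + timedelta(hours=2, minutes=30),
     t0 + timedelta(hours=3, minutes=30)),
]

matches = drops_during_windows(readings, windows, threshold_mw=1.0)
for ts, load_mw, label in matches:
    print(f"{ts:%Y-%m-%d %H:%M}  {load_mw} MW  coincides with {label}")
```

The logic is trivial once both streams sit side by side. The hard part, and the point of this column, is getting them into the same view in the first place.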

The incident brings up a critical but often overlooked problem with Big Data. Big Data software platforms and applications are supposed to deliver on the three Vs of volume, variety and velocity.

But they also have to deliver on a fourth V: visibility. Information needs to be delivered and served up in a way that makes sense to humans. If employees can’t “see” it to put A and B together, what’s the point? Visibility is the linchpin for Fast Data.

Visibility is already becoming a major drag for some organizations. Data scientists, some of the most highly recruited employees in the world, regularly spend 50% to 80% of their time performing “data janitor” work, i.e. the mundane tasks of filtering and cleaning records before people can start swimming through them.

The IDC digital data chart. Because it's tradition.

“It is something that is not appreciated by data civilians,” said Monica Rogati, vice president of data science at Jawbone. “At times, it feels like everything we do.”

One of the big barriers to visibility is the sheer size of the task. Vibration analysis systems can detect when parts on an assembly line are out of alignment or when something might be ready to implode. But they also might soak up hundreds of thousands of signals a second.

A smart mining site might generate two petabytes a day in a 16-hour shift. Some wind developers are monitoring over 300,000 data streams. Self-driving cars can produce 1 GB a second.

Or think of home energy consumption. Software that pings your meter every 15 minutes can generally track total consumption, according to a study titled “Got Data? The Value of Energy Data Access to Consumers.” Software that looks every second can uniquely identify major appliances. At one millionth of a second, you can identify individual lightbulbs. The more granular the data, the more you can massively and invisibly cut emissions and costs.

A great opportunity, no? The volumes, however, are astronomical. A single smart meter report can generate 50 to 100 kilobits. New York City homes and apartments reading every second would generate 1.2 exabytes a year, a surge of data that could stress even advanced networks.
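The arithmetic behind a figure like that is easy to sketch. The household count below is an assumption chosen purely for illustration (the article does not give one), and a kilobit is taken as 1,000 bits and an exabyte as 10^18 bytes.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
BITS_PER_BYTE = 8
BYTES_PER_EXABYTE = 10**18


def yearly_exabytes(households, kilobits_per_report, reports_per_second=1):
    """Estimate yearly smart meter data volume in exabytes."""
    bytes_per_report = kilobits_per_report * 1000 / BITS_PER_BYTE
    total_bytes = (households * bytes_per_report
                   * reports_per_second * SECONDS_PER_YEAR)
    return total_bytes / BYTES_PER_EXABYTE


# ASSUMPTION: a round number of metered households, not a sourced figure.
ASSUMED_HOUSEHOLDS = 3_000_000

low = yearly_exabytes(ASSUMED_HOUSEHOLDS, 50)    # 50 kilobits per report
high = yearly_exabytes(ASSUMED_HOUSEHOLDS, 100)  # 100 kilobits per report
print(f"{low:.2f} to {high:.2f} exabytes per year")
```

Under those assumptions, the top of the range comes out just under 1.2 exabytes a year, which is in line with the estimate quoted above.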

The more information you get, the deeper the insights you can, in theory, achieve. But more information can mean a longer voyage to get to the truth.

You also need to address multiple audiences. Engineers in a control room need to see exactly what pieces of equipment are out, which ones might go out, and where the maintenance crews are located. The volume of data and the number of variables are among the reasons control room screens cover entire walls.

To executives, or to the people conducting public outreach, however, that wall of data becomes white noise. They need a subset of the information, usually color coded, served up on a smartphone. What do you take out without dumbing it down too much?

Is visibility easy? No. It sits on the border of computer programming and human psychology. Companies will find themselves assembling teams of graphic artists, mathematicians and anthropologists.

Giving people better insight will also pave the way to tackle the next problem: how do you know if what you’re looking at is an accurate assessment of the situation? That would be the fifth V: veracity.

I have been writing about the intersection of science and ambition for over 20 years. (Disclosure: I am currently a technology analyst and head of communications at OSIsoft, but this has nothing to do with that and I rigorously avoid conflicts of interest. It's my promise to y...