When big data becomes vast, what’s your data dropping strategy?

When big data technology is pitched, it’s often said to be the best way to store all your data. The idea is that instead of an expensive database, you use commodity hardware that can be expanded at low cost and runs open source software. You then create a data lake to store all your data, and that lake will serve all of your access needs in the future.

But it is simply not true that we can store all the data we generate. A fundamental truth is that the price for generating data will always be lower than the price for storing that data. This assertion stems from simple physics: any physical device that generates data needs to assume only a temporary state, often lasting only a few milliseconds, after which it is ready to generate new data. Microphones, cameras, and even radar are all great examples of this. By contrast, any physical device that stores data has to keep a given state for as long as that data needs to be stored, which could mean years or even decades.

Sensor versus storage pressure ratio

This years-versus-milliseconds contrast poses a problem. For example, if we need to store data for 10 years, and a sensor generates one new measurement per second, each unit of storage must hold its state for 10 years while the sensor holds its state for only one second. The demands on storage versus sensor hardware therefore form the following ratio:

(10 x 365 x 24 x 60 x 60):1

or:

3 x 10^8:1

In other words, this problem places roughly 10 to the power of 8 times more pressure per physical unit on the storage hardware than on the sensor hardware. As a consequence, it will always be far cheaper to create data than to store data.
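
The arithmetic behind this ratio can be checked in a few lines. The following is a minimal sketch, assuming 365-day years and a sensor state that lasts one second; the function and constant names are illustrative and not taken from any library. It prints the ratio for one year and for ten years of retention.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def storage_to_sensor_ratio(retention_years: float,
                            sensor_state_seconds: float = 1.0) -> float:
    """Return how many times longer storage must hold a state than the sensor."""
    return retention_years * SECONDS_PER_YEAR / sensor_state_seconds


one_year = storage_to_sensor_ratio(1)
ten_years = storage_to_sensor_ratio(10)

print(f"1 year of retention:   {one_year:.2e} : 1")
print(f"10 years of retention: {ten_years:.2e} : 1")
```

Shortening the sensor state from one second to a few milliseconds, as with a microphone or camera, would push the ratio up by a further two to three orders of magnitude.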

A real-life example comes from self-driving cars. The sensors they need to operate effectively generate something like four terabytes of data per day, per car. If the average life expectancy of a driverless car is assumed to be similar to that of the average car today (eight years), a fleet of just 100 cars would generate more than an exabyte of data over its lifetime, or over one million terabytes, all of which requires storage. That is a lot of data storage.
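
The fleet estimate can be reproduced as follows. This is a rough sketch using the figures quoted above, assuming 365-day years and decimal units (one exabyte equals one million terabytes); the variable names are illustrative.

```python
TB_PER_CAR_PER_DAY = 4      # figure quoted in the text
CARS = 100                  # fleet size quoted in the text
LIFETIME_YEARS = 8          # assumed car lifetime from the text
DAYS_PER_YEAR = 365         # assumption: non-leap years
TB_PER_EB = 1_000_000       # assumption: decimal (SI) units

fleet_total_tb = TB_PER_CAR_PER_DAY * DAYS_PER_YEAR * LIFETIME_YEARS * CARS
fleet_total_eb = fleet_total_tb / TB_PER_EB

print(f"Fleet lifetime data: {fleet_total_tb:,} TB ({fleet_total_eb:.3f} EB)")
```

Under these assumptions the total comes to 1,168,000 TB, which is consistent with the "more than an exabyte" figure.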

If these estimates are even close to the truth, it’s likely that much of the data would need to be abandoned.

Bright future?

Don’t think that future improvements in technology will solve this problem. As memory capacities grow, so too do the capabilities of sensors to generate data.

It is naïve for businesses to assume they will always be able to store all their data. One can’t just create a data lake and go on a sensor shopping spree, expecting everything to be fine forever. It won’t be.

Even an entry-level 4K camera at a cost of only $10 will create around 375MB each minute. That equates to 197.1 terabytes of continuous video over one year. Expanding the capacity of your data lake to match could cost around $5000 for the extra hard disk space alone, at today’s prices. You see the discrepancy?
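
The camera figures can be verified the same way. The sketch below assumes continuous recording for a 365-day year and decimal units (one terabyte equals one million megabytes); the prices are the ones quoted above, and the variable names are illustrative.

```python
MB_PER_MINUTE = 375                  # data rate quoted in the text
MINUTES_PER_YEAR = 60 * 24 * 365     # assumption: continuous recording, non-leap year
MB_PER_TB = 1_000_000                # assumption: decimal (SI) units

CAMERA_COST_USD = 10                 # price quoted in the text
STORAGE_COST_USD = 5000              # one-year storage cost quoted in the text

yearly_tb = MB_PER_MINUTE * MINUTES_PER_YEAR / MB_PER_TB
implied_price_per_tb = STORAGE_COST_USD / yearly_tb
cost_ratio = STORAGE_COST_USD / CAMERA_COST_USD

print(f"Video per year: {yearly_tb:.1f} TB")
print(f"Implied storage price: ${implied_price_per_tb:.2f} per TB")
print(f"Storage costs {cost_ratio:.0f}x the camera, for one year of footage")
```

On these numbers, a single year of footage costs about 500 times as much to store as the camera cost to buy, and the gap widens with every additional year of recording.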

Big data becomes vast

When you get to this point, and your data lake can no longer keep pace with the data you generate, you have moved beyond big data and into the realm of vast data.

I introduce here the following definition: vast data is data that cannot be reasonably handled by big data technologies. Vast data has already become a reality for businesses today.

By necessity, some data will need to be dropped. In fact, the physics considerations above tell us that in many cases, the majority of data will need to be dropped. It is true that vast data may be processed in some way before being dropped, but this regrettable dropping step will inevitably come.
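
One way to picture "processing before dropping" is to reduce a stream of raw readings to a small summary and then discard the readings themselves. The following is only an illustrative sketch under that assumption; the `RunningSummary` class and `summarise_and_drop` function are hypothetical names, and a real strategy would depend on what the business later needs from the data.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningSummary:
    """Constant-size summary that survives after the raw readings are dropped."""
    count: int = 0
    mean: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        self.count += 1
        # Incremental mean: no need to keep earlier readings in memory.
        self.mean += (value - self.mean) / self.count
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


def summarise_and_drop(readings: Iterable[float]) -> RunningSummary:
    """Consume readings one at a time; each raw value is dropped after use."""
    summary = RunningSummary()
    for value in readings:
        summary.update(value)
    return summary


summary = summarise_and_drop(iter([2.0, 4.0, 6.0]))
print(summary)
```

The storage needed for the summary stays fixed no matter how many readings pass through, which is the trade being made: detail is lost in exchange for bounded storage.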

Therefore, here follows a piece of obvious wisdom: it will always be preferable to have a well-thought-out strategy for dropping data than not to have one. If one ignores the problem, things cannot possibly improve.

Danko Nikolic is a brain and mind scientist, as well as an AI practitioner and visionary. His work as a senior data scientist at Teradata focuses on helping customers with AI and data science problems. In his free time, he continues working on closing the mind-body explanatory gap, and using that knowledge to improve machine learning and artificial intelligence.