Thursday, December 30, 2010

Machine vs. human generated data

Curt Monash has recently been discussing the differences between machine-generated data and human-generated data, and trying to define these terms on his blog. I think this is a good subject to dive into, since I frequently use the existence of machine-generated data to justify to myself why 90% of my research cycles are spent on scalability problems in database systems. Rather than try to fit a response as a comment on his post, I thought I would devote a post to this subject here.

In short, the following are the main reasons why machine-generated data is important:

(1) Machines are capable of producing data at very high rates. In the time it took you to read this sentence, my three-year-old laptop could have produced the entire works of Shakespeare.

(2) The human population is not growing anywhere near as fast as Moore’s law. In the last decade, the world’s population has increased by about 20%. Meanwhile, transistor counts (and also hard-disk capacity, since it increases at roughly the same rate) have increased by over 2000%. If all data were closely tied to human actions, then the “Big Data” research area would be a dying field, as technological advancements would eventually render today’s “Big Data” minuscule, and there would be no new “Big Data” to take its place. (All this assumes that women don’t start to routinely give birth to 15 children, and nobody figures out how to perform human cloning in a scalable fashion.) No researcher dreams of writing papers that make only a temporary impact. With machine-generated data, we have the potential for data generation to increase at the same rate as machines are getting faster, which means that “Big Data” today will still be “Big Data” tomorrow (even though the definition of “Big” will be adjusted).

(3) The predicted demise of the magnetic hard disk in favor of solid-state alternatives will not come as fast as some people think. As long as hard disk capacity keeps pace with the rate at which machine-generated data is produced, the hard disk will remain the most cost-efficient option for machine-generated “Big Data” (at least until race-track memory becomes a viable candidate). Yes, I/O bandwidth does not increase at the same rate as capacity, but if the machine-generated data is to be kept around, the biggest of “Big Data” databases will need the high capacity of hard disks, at least at a low tier of storage. This means that we must remain conscious of disk-speed limitations when it comes to complete data scans.

Curt attempts to define “machine-generated data” in his post as the following:

Machine-generated is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.

He then goes on to include Web log data (including user clickstream logs), and social media and gaming records data as examples of machine-generated data.

If you agree with the three reasons listed above on why machine-generated data is important, then there is a problem with both the above definition of machine-generated data and the examples. Clickstream data and social media/gaming data are fundamentally different from environmental sensor data that has no human involvement whatsoever. Certainly the scale of clickstream and gaming datasets is much larger than the scale of other human-generated datasets such as point of sale data (humans can make clicks on the Internet or in a computer game at a much faster rate than they can buy things, or write things down). And certainly, for every human click, there might be 5X more network log data (as Monash writes about in his post). But ultimately, without humans making clicks, there would be no data, and as long as the additional machine-generated data is linearly related to each human action (e.g. this 5X number remains relatively constant over time) then these datasets are not always going to be “Big Data”, for the reasons described in point (2) above.

The basic source of confusion here is that click-stream datasets and social gaming data sets are some of the biggest datasets known to exist (eBay, Facebook, and Yahoo’s multi-petabyte clickstream data warehouses are known to be amongst the largest data warehouses in the world). Since machines are well-known to have the ability to produce data at a faster rate than humans, it is easy to fall into the trap of thinking that these huge datasets are machine generated.

However, these datasets are not increasing at the same rate that machines are getting faster. It might seem that way, since the companies that broadcast the size of their datasets are getting larger and gaining users at a rapid pace, and these companies are deciding to throw away less data. Over the long term, though, the rate of increase of these datasets must slow down due to the human limitation. This makes them less interesting for the future of “Big Data” research.
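As a rough back-of-the-envelope sketch of this argument, the Python snippet below compares data that is tied to human actions against hardware capacity over several decades. The growth factors are the approximate per-decade figures quoted earlier in this post (about 20% for population, over 2000% for transistor counts and disk capacity). Everything else is an assumption made purely for illustration: the function names are hypothetical, activity per person is held constant, and the "5X" log multiplier is treated as a fixed constant.

```python
# Illustrative model only; the numbers are the rough figures quoted in the post.
POPULATION_GROWTH_PER_DECADE = 1.20  # "about 20%" population increase per decade
CAPACITY_GROWTH_PER_DECADE = 21.0    # "over 2000%" increase, i.e. at least 21x
LOG_MULTIPLIER = 5.0                 # the "5X" log-data figure, assumed constant


def human_tied_volume(decades, log_multiplier=LOG_MULTIPLIER):
    """Relative data volume when every record traces back to a human action.

    Assumes (hypothetically) that activity per person stays constant, so the
    number of actions grows only as fast as the population does.
    """
    actions = POPULATION_GROWTH_PER_DECADE ** decades
    # Each action yields one record plus log_multiplier records of log data.
    return actions * (1.0 + log_multiplier)


def hardware_capacity(decades):
    """Relative hardware capacity, normalized to 1.0 at decade 0."""
    return CAPACITY_GROWTH_PER_DECADE ** decades


def relative_burden(decades, log_multiplier=LOG_MULTIPLIER):
    """Data volume divided by capacity, normalized to 1.0 at decade 0."""
    start = human_tied_volume(0, log_multiplier) / hardware_capacity(0)
    now = human_tied_volume(decades, log_multiplier) / hardware_capacity(decades)
    return now / start


if __name__ == "__main__":
    for decade in range(4):
        print(f"decade {decade}: relative burden = {relative_burden(decade):.6f}")
```

Under these assumptions the burden shrinks by a factor of 1.2 / 21 every decade, and the size of the log multiplier cancels out entirely. A constant multiplier makes the dataset bigger today, but it does not change how fast the dataset grows relative to the hardware.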

I don’t necessarily have a better way to define machine-generated data, but I’ll end this blog post with my best attempt:

Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action.

Machine-generated “Big Data” is machine-generated data whose rate of generation increases with the speed of the underlying hardware of the machines that generate it.

Under this definition, stock trade data (independent computational agents), environmental sensor data, RFID data, and satellite data all fall under the category of machine-generated data. An interesting debate could form over whether genomic sequencing data is machine-generated or not. To the extent that DNA and mRNA are being produced outside of humans, I think it is fair to put genomic sequencing data under the machine-generated category as well.

11 comments:

I wonder if another way to look at it would be the distance from a "conscious" decision to the generation of the data. This would handle the case of data about biological systems more easily, since they would be generated quite far from a conscious decision, while click data or game data is generated much closer to conscious decisions. Stock data would be somewhere in there, too. I suppose the real question is how well this distance can be quantified and how well it can then classify the nature of the datasets involved.

Your comment seems to imply that while the human conscious mind is limited, the subconscious is infinite. This could take us down an interesting philosophical road :)

Seriously though, human biological systems would only seem to fall under the machine-generated category if we can measure them with an increased precision at the rate of Moore's law. Otherwise, we still have the long-term human limitation.

What you've omitted are data sets that are considered too large to be treated as data just yet. For example, the set of all surveillance video. Most of this is stored on a tape loop which is automatically aged away.

I think that trying to split data into "machine-generated" versus "human-generated" is pointless.

It's such a fuzzy distinction. Is stock trade data human generated? It is if a person enters the order, right? But the majority of NYSE orders are from algorithmic trading. Same for the trades that appear in the books -- human or high-frequency black-box?

I agree that big data is a tough problem. I talked with Jim Gray about it many years ago when I looked at it. I was worried about the problems with 100s of GBs. A trifling amount today (buy one more 2010-era disk!) but the principles are the same.

Adding a time dimension does help capture the flavor of "machine generated". But that opens more confusion. Where's the line for "faster than a human"? Anything less than once a minute? Once a second? Or is that dependent on the task? Secondly, does that measure one person's input or a crowd's?

Maybe the solution is to use some comparative measurements. N * Library of Congresses.

There's also a dimension of "worth" to the big data. Is each piece of data important? Can it be deleted without consequences? It seems that human-generated data is more likely to be worth more than machine-generated. But this distinction is very loosey-goosey too.

In my personal opinion, Machine-Generated data is always under the influence of humans. Satellite Telemetry Data was the byproduct of human coding for events relevant to humans. What differentiates "machine-generated" from "non-machine generated" is the intervention/requirement for a human to supply/update data to complete the process.

For example, adding info to a twitter account or facebook account cannot be considered machine-generated. However, the apache log lines created for that event can be considered Machine Generated since their values are collected as a byproduct of the occurring event.

Long story short, Machine-Generated data is the end result of code creating information as a response to an event without requiring human oversight and intervention. By oversight and intervention, I mean the human doesn't easily or readily modify the created information once the information has been submitted.

Jeff, Hegemonkey, Wayne, thanks for sharing your opinions. I'm not sure that I agree that apache log entries that are generated as a direct result of a human action should be classified as machine-generated, but differing opinions are welcome in this forum.

Apache logs are the simplest case of web logs. While humans' data consumption and decision making rates aren't tracking Moore's law (or maybe they are, but with a really small constant), the fidelity of computers recording them is. For example, clickstream logs now contain information about what the user was served. They can also contain information about the underlying services used to serve the request and the performance of each of those services. Rich client-side UIs can report information on what parts of the page the human interacted with, and the number of "page loads" an Ajax UI makes can be very large. Much like the resolution of surveillance video is increasing with Moore's law, the resolution of clickstream data and other observations of humans is increasing.

I like Curt's definition. Data about observing humans with increasing fidelity is going to be bigger and more valuable than data about machines doing things for their own sake. At least until machines start voting and buying Beanie Babies.

If the resolution of click-stream data is indeed increasing with Moore's law, then I would agree that it should be classified as machine-generated data. Where we disagree is the precise rate of increase. I agree that it is increasing rapidly, but not at the same rate that computers are getting faster.

I also agree that human-tracking data will be more valuable. I just don't agree that it will be bigger.

Daniel Abadi

About Me

Daniel Abadi is an Associate Professor at Yale University, doing research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, which was commercialized by VoltDB), and Hadoop (the HadoopDB project). Abadi has been a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, and the 2007 VLDB best paper award. His research on HadoopDB is currently being commercialized by Hadapt, where Abadi also serves as chief scientist. He blogs at http://dbmsmusings.blogspot.com and tweets at http://twitter.com/#!/daniel_abadi.