Thursday, November 20, 2014

springer | It is estimated that humanity accumulated 180 EB of
data between the invention of writing and 2006. Between 2006 and 2011,
the total grew roughly tenfold and reached 1,600 EB. This figure is now
expected to grow fourfold approximately every 3 years. Every day, enough
new data are being generated to fill all US libraries eight times over.
As a result, there is much talk about “big data”. This special issue on
“Evolution, Genetic Engineering and Human Enhancement”, for example,
would have been inconceivable in an age of “small data”, simply because
genetics is one of the data-greediest sciences around. This is why, in
the USA, the National Institutes of Health (NIH)
and the National Science Foundation (NSF) have identified big data as a
programme focus. One of the main NSF–NIH interagency initiatives
addresses the need for core techniques and technologies for advancing
big data science and engineering (see NSF-12-499).

Despite the importance of the phenomenon, it is unclear what exactly the term
“big data” means and hence refers to. The aforementioned document
specifies that: “The phrase ‘big data’ in this solicitation refers to
large, diverse, complex, longitudinal, and/or distributed data sets
generated from instruments, sensors, Internet transactions, email,
video, click streams, and/or all other digital sources available today
and in the future.” You do not need to be an analytic philosopher to
find this both obscure and vague. Wikipedia, for once, is also
unhelpful. Not because the relevant entry is unreliable, but because it
reports the common definition, which is unsatisfactory: “data sets so
large and complex that they become awkward to work with using on-hand
database management tools”. Apart from the circular problem of defining
“big” with “large”, the definition suggests that data are too big or
large only in relation to our current computational power. This is
misleading. Of course, “big”, like many other terms, is a relational
predicate: a pair of shoes is too big for you, but fine for me. It is
also trivial to acknowledge that we tend to evaluate things
non-relationally, in this case as absolutely big, whenever the frame of
reference is obvious enough to be left implicit. A horse is a big
animal, no matter what whales may think. Yet, these two simple points
may give the impression that there is no real trouble with “big data”
being a loosely defined term referring to the fact that our current
computers cannot handle so many gazillions of data efficiently. And this
is where two confusions seem to creep in. The first is that the
epistemological problem with big data is that there are too many of them
(the ethical problem concerns how we use them; see below). The second is
that the solution to the epistemological problem is technological: more
and better techniques and technologies, which will “shrink” big data back
to a manageable size.
The epistemological problem is different, and it requires an equally
epistemological solution.