March 23, 2007

Factual information, especially information organized for analysis or used to reason or make decisions. (Answer.com) (Webster is similar)

Many versions of this definition describe data using the word “fact” or “factual information”. This is unacceptable, because data itself carries no assertion about its own quality. As far as the definition of data is concerned, whether it is fact or not is decided after the fact.

Using “information” to define data is not proper either. Whether data is information depends on the users and on how they understand the data: data is more “primitive” than “information”, not the other way around.

I like the following better, although I am not perfectly happy with it: data is a structured form consisting of datums. I like it because it does not imply any relationship between its explicit forms and the external world. It does not limit its structure to any particular kind: table, row, collection, independent observations and so on are all artificial frames, not general enough to be considered in the definition. It also does not imply the presence of any external knowledge, preprocessing routines or specialized observers.

Give me some examples, you ask. First of all, the simplest example of data is a single datum, the atom as far as data is concerned. Because it is a datum, it should not have any sub-structure, which means it can carry nothing but a name or a label. The form of a datum really does not matter, as long as it is regarded as the simplest data. A datum can have a name and a value.

Next, data can be a collection of observations (datums).

Next, a datum can have attributes which “describe” the observation. Examples are “continuous”, “discrete”, “ordered” etc. An attribute may have a name and a value as well.
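To make the three steps above concrete, here is a minimal sketch in Python. The class names `Datum` and `Data`, the field names, and the sample values are all my own assumptions for illustration; nothing in the definition requires this particular shape.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Datum:
    """Hypothetical atom of data: a name, an optional value, optional attributes."""
    name: str
    value: Any = None
    # Attributes "describe" the observation; each is itself a name/value pair.
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Data:
    """Hypothetical structured form: here, simply a collection of datums."""
    items: List[Datum] = field(default_factory=list)


# The simplest case: a datum that is nothing but a label.
label_only = Datum("x")

# Datums with a name, a value, and a describing attribute (sample values are made up).
temperature = Datum("temperature", 21.5, {"kind": "continuous"})
rating = Datum("rating", 4, {"kind": "ordered"})

# Data as a collection of observations.
observations = Data([label_only, temperature, rating])
```

Note that nothing here says whether the values are true, or what they mean to anyone; the structure only holds names, values and descriptions.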

What we call a “data table” is just one common form of data. Other forms of data include network data, transaction data, graph data, time series and text data.
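As a rough illustration of how these forms differ, the sketch below writes a table, a graph and a time series with plain Python containers. The variable names, the sample values and the helper `neighbors` are hypothetical, chosen only to show that each form invites its own kind of question.

```python
# Table-like form: each row is a set of named values.
table = [
    {"customer": "a", "amount": 10.0},
    {"customer": "b", "amount": 7.5},
]

# Graph (network) form: nodes plus undirected edges between them.
graph = {
    "nodes": ["a", "b", "c"],
    "edges": [("a", "b"), ("b", "c")],
}

# Time series form: (time, value) pairs kept in time order.
series = [(0, 1.2), (1, 1.4), (2, 1.1)]


def neighbors(g, node):
    """Return the nodes directly linked to `node`, sorted by name."""
    linked = set()
    for u, v in g["edges"]:
        if u == node:
            linked.add(v)
        elif v == node:
            linked.add(u)
    return sorted(linked)


# A question that is natural for a graph but has no direct counterpart in a table row.
print(neighbors(graph, "b"))
```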

“All of this is common sense”, you say. “What’s new?”

Well, all the common data analytics are analytics on “table-like” data. The analytics for other forms of data are far behind. This is a problem, and it is also an opportunity.