During my research I have searched for a definition of data. Most of the time I found definitions saying something like “data is a collection of characters according to predefined syntax rules” (cf. Bodendorf 2005). But does this technical definition really define what data is? Doesn’t it rather say what data looks like? And how does data relate to information then? Is information really data + context / semantics? Of course, when I interpret data, I use my contextual knowledge and add semantics to it. But isn’t there a much more obvious relationship between data and information?

Data is materialized information.

After thinking about it many times, I finally found a simple, yet useful definition of data. I discovered it while considering the basic purpose of an information system: people use information systems to store and retrieve information. For this purpose, information is materialized into physical data, e.g. when people fill out forms. Thinking this way, data is materialized information. And yes, it is of course also a collection of characters…

There are several different opinions and definitions of what data quality is supposed to be. Most of the time, we adopt the “fitness for use by data consumers” definition by Richard Wang and Diane Strong [1], who investigated data consumers’ subjective perception of data quality. Although this research has been a milestone in data quality research, in my opinion data quality is not necessarily always subjective. Consider valid combinations of cities and countries, or accurate population values. Who defines these quality measures? Surely not an individual. These examples are rules that have been derived from public knowledge or given by natural circumstances. So besides individual requirements from data consumers, at least the following may also serve as sources of data quality rules:

Real-world phenomena (e.g. city/country combinations)

Organizational policies (e.g. all TVs in my data must have a screen size)

Legal regulations (e.g. all groceries must have an expiration date)

IT needs (e.g. URIs must be dereferenceable)

Standards (e.g. the syntax of xsd:dateTime or ZIP codes)

Task requirements (e.g. population data for all populated places must be complete to calculate the world population).

So, reflecting on the sources of data quality definitions, I would define data quality as the degree to which data fits the combined requirements for the task at hand. Many data quality requirements may be derived from the sources listed above. When using these requirements as data quality rules, we should be aware that they may contradict each other and change over time. Hence, we must manage our data quality rules just like our data.
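To illustrate, a task requirement such as “population data for all populated places must be complete” can be turned directly into a checkable rule. The following is a minimal SPARQL sketch, assuming a vocabulary modeled after the DBpedia ontology (the class and property names are illustrative, not prescribed by any particular data set), and using SPARQL 1.1’s FILTER NOT EXISTS:

```sparql
# Hypothetical vocabulary, modeled after the DBpedia ontology:
# dbo:PopulatedPlace (class) and dbo:populationTotal (property).
PREFIX dbo: <http://dbpedia.org/ontology/>

# Report every populated place that lacks a population value.
SELECT ?place
WHERE {
  ?place a dbo:PopulatedPlace .
  FILTER NOT EXISTS { ?place dbo:populationTotal ?population }
}
```

Each result row is a candidate data quality problem with respect to this one task; for a different task, the same missing value might be perfectly acceptable.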

As part of my PhD thesis, I am currently investigating the quality of Semantic Web data sets at the instance level. I have also published a data quality constraints library at http://semwebquality.org/ontologies/dq-constraints# which may be used in conjunction with SPIN ( http://spinrdf.org/ ) to define data quality requirements as constraint rules based on the knowledge derived from the sources cited above. The constraints do not directly restrict the openness of the web, since it is up to the data owner/provider whether instances with potential data quality problems should be cleansed. The constraints shall rather help to identify incorrect or suspicious data and raise transparency about the quality state of Semantic Web data sets in the first place. If you are interested in this kind of quality assessment, please see my publications and/or contact me.
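To give an impression of how such a constraint looks: in SPIN, a constraint is typically attached to a class and expressed as a CONSTRUCT query whose result describes the violation; the variable ?this is bound by SPIN to each instance of that class. The sketch below encodes the “all TVs must have a screen size” policy from above; the ex: namespace and its terms are hypothetical stand-ins, not part of my published library:

```sparql
PREFIX spin: <http://spinrdf.org/spin#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/schema#>   # hypothetical namespace

# SPIN binds ?this to each instance of the class the constraint is attached to.
CONSTRUCT {
  _:violation a spin:ConstraintViolation ;
              spin:violationRoot ?this ;
              rdfs:label "TV without a screen size" .
}
WHERE {
  ?this a ex:TV .
  FILTER NOT EXISTS { ?this ex:screenSize ?size }
}
```

Note that the constraint only reports violations as resources; it neither rejects nor repairs the data, which is what keeps this approach compatible with the open nature of the web.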

Today, I have published a SPARQL query library that contains generic queries for the identification of certain types of data quality problems, namely:

syntax violations,

functional dependency violations,

illegal values,

uniqueness violations,

value range violations, and

missing values.

The query library is based on the SPIN framework. In conjunction with SPIN, it facilitates the easy definition of data quality rules, which can be tested immediately against one’s own data set. Some of the queries, such as those for the identification of functional dependency violations, have been designed to make use of other Semantic Web sources, so that the manual effort for the identification of poor data remains limited.
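As a flavor of what a generic query for one of the problem types above might look like, here is a sketch for functional dependency violations. A functional dependency states that one value determines exactly one other value, e.g. a ZIP code determines exactly one city; the query below reports every ZIP code that co-occurs with more than one city. The ex:zipCode and ex:city properties are hypothetical examples, not the actual properties from my library:

```sparql
PREFIX ex: <http://example.org/schema#>   # hypothetical namespace

# Functional dependency ZIP code -> city: a ZIP code linked to
# more than one distinct city is reported as a violation.
SELECT ?zip (COUNT(DISTINCT ?city) AS ?cityCount)
WHERE {
  ?address ex:zipCode ?zip ;
           ex:city    ?city .
}
GROUP BY ?zip
HAVING (COUNT(DISTINCT ?city) > 1)
```

The same GROUP BY / HAVING pattern also covers uniqueness violations (group by the value that must be unique and count its subjects), which is why such queries can be kept generic and reused across data sets.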