The Data Quality Trap

Much of the work in the data management community treats data quality as a first-class problem, one that can be solved largely on its own and whose solution then benefits many downstream systems. There is certainly a lot going for this line of thinking.

My own view, drummed into me by a former NASA scientist I worked for early in my career, is that data is data and one should never try to change it. The only thing we can validly change is the inference we draw from the data. In practice, this means building noise models into our statistical inference or machine learning models instead of trying to change the data before doing inference.

So we should do this

raw data -> statistical inference that has a noise model as a parameter

rather than this

raw data -> cleansing -> cleansed data -> statistical inference

The advantage of the first approach, building the noise model into the inference, is that the noise model can be adjusted easily to suit the business problem and its context. This points to the weakness of the second approach, cleansing the data before inference: the cleansing is usually done in a way that is largely insensitive to how the data will be used, the cleansed data often becomes the de facto data source, and issues or signals in the raw data can get lost along the way.
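As a rough illustration of what "a noise model as a parameter" might look like, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the function names (`gaussian_nll`, `student_t_nll`, `estimate_location`), the sensor-style readings, the crude grid search, and the choice of a Student-t distribution as the heavy-tailed alternative. The point is only that the raw readings are never edited; what changes is the noise model handed to the inference step.

```python
import math


def gaussian_nll(residual, scale=1.0):
    """Per-point negative log-likelihood under Gaussian noise (constants dropped)."""
    return 0.5 * (residual / scale) ** 2


def student_t_nll(residual, scale=1.0, dof=2.0):
    """Per-point negative log-likelihood under heavy-tailed Student-t noise
    (constants dropped). Large residuals are penalised only logarithmically."""
    return 0.5 * (dof + 1.0) * math.log1p((residual / scale) ** 2 / dof)


def estimate_location(raw, noise_nll, steps=2000):
    """Estimate a single location parameter by grid search: pick the candidate
    value that minimises the total negative log-likelihood of the raw data
    under the supplied noise model. The raw data is only read, never changed."""
    lo, hi = min(raw), max(raw)
    if lo == hi:
        return lo
    best_mu, best_cost = lo, float("inf")
    for i in range(steps + 1):
        mu = lo + (hi - lo) * i / steps
        cost = sum(noise_nll(x - mu) for x in raw)
        if cost < best_cost:
            best_mu, best_cost = mu, cost
    return best_mu


# Hypothetical raw readings, including one suspicious value. Stored as a
# tuple to underline that nothing below modifies them.
raw_readings = (9.8, 10.1, 10.0, 9.9, 10.2, 55.0)

# Same raw data, two different noise models.
gaussian_estimate = estimate_location(raw_readings, gaussian_nll)
heavy_tail_estimate = estimate_location(raw_readings, student_t_nll)

print("Gaussian noise model:  ", round(gaussian_estimate, 2))
print("Student-t noise model: ", round(heavy_tail_estimate, 2))
```

Under the Gaussian noise model the estimate lands near the sample mean of 17.5, dragged upward by the 55.0 reading. Under the heavy-tailed model it stays close to 10. Whether 55.0 is a glitch or a real signal is a question about the context, and here that judgement lives in the choice of noise model, which can be revisited at any time because the raw data is still intact.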

The advantage of the second approach is that it is easier to do, both technically and psychologically. But sometimes we confuse what is easy with what is correct.