Data Massage

This is something we can practice on with existing social survey data, though in fact what we do will not be mere practice: this data will prove invaluable. How good it turns out to be will depend on how limited the vision of the people who created the survey instruments (questionnaires) was, and on how much access to the raw data we can get. As I wrote on my main blog, http://www.SocialTechnology.ca/wordpress: raw data, raw data, raw data. Sing the data miner’s lament with me: “Oh my deepest data mine, lost forever data mine.” Once the data has truly been thrown away, it is gone. When the teacher sends the test pages to the incinerator or the landfill, much of what we want goes with them.

But inevitably data in otherwise well-collected datasets will be missing. And what we have will often need much massage, especially linearization.

One of the first things to do is to fill in holes in the data. The basic method is this: where a column of data representing one user is missing a datum, find several similar columns which do have the answer to that question, then take the mean, median or mode of their answers, whichever seems best, and fill in the hole with it. Fill in all the holes this way, or as many as possible, then iterate. Where a hole was filled in, throw away that guess and fill in the hole again, using the improved dataset in which many other holes have now been filled. Do this for all the holes, throwing away the old guesses and replacing them with improved estimates. That is one iteration. Make several passes through the data. The process is likely to converge on fairly reliable estimates for all the missing data. Since much of this data is multiple choice, the individual columns of choices should be orthogonal, and this can be used to check and fix the process as necessary. All of this can be automated, and must be. We are talking about technology here, not science. We need to do something with this data, not just spend months examining it, diddling with it and writing dissertations about it.
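The fill-and-iterate loop above can be sketched in a few lines of Python with NumPy. This is only a rough illustration under assumptions of my own, not finished software: the table is turned so that each row is one user and each column one question, holes are marked with NaN, "similar" means the smallest Euclidean distance over the other questions, and the median is used as the fill value. The function name `impute_iteratively` and the parameters `k` and `passes` are made up for the example.

```python
import numpy as np

def impute_iteratively(data, k=3, passes=5):
    """Fill NaN holes from the k most similar users, then re-estimate.

    data: 2-D float array, one row per user, one column per question,
    with np.nan marking a missing answer (an assumption of this sketch).
    """
    data = np.asarray(data, dtype=float)
    holes = np.isnan(data)
    # Starting guess: the median answer to each question among those who answered.
    filled = np.where(holes, np.nanmedian(data, axis=0), data)

    for _ in range(passes):
        updated = filled.copy()
        for i, j in zip(*np.nonzero(holes)):
            # Only users who actually answered question j can supply a value.
            donors = np.nonzero(~holes[:, j])[0]
            if donors.size == 0:
                continue
            # Similarity is measured on every other question, using the
            # current estimates for any holes found there.
            others = np.arange(data.shape[1]) != j
            dist = np.linalg.norm(
                filled[donors][:, others] - filled[i, others], axis=1
            )
            nearest = donors[np.argsort(dist, kind="stable")[:k]]
            # Throw away the old guess and replace it with the new estimate.
            updated[i, j] = np.median(filled[nearest, j])
        filled = updated
    return filled

# Two groups of users with one hole in each group.
survey = np.array([
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, np.nan],
    [1.0, 1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0, 5.0],
    [5.0, np.nan, 5.0, 5.0],
    [5.0, 5.0, 5.0, 5.0],
])
result = impute_iteratively(survey, k=2, passes=3)
```

In this toy table the first guesses come from the column medians and are wrong for both holes, but the first pass already replaces them with the values from each user's own group. Real survey data would want a mode for categorical answers and a stopping test in place of a fixed number of passes.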

Another important form of data massage is linearization. We eventually want to use linear algebra on the data, especially factor analysis or principal components analysis (PCA), which are more or less but not exactly the same thing. To do that we want linearized data. A way of getting this is to assume that the data is already linear, then select each column of data in turn and try to estimate it from lots of other columns. Often the original column of data survives linear prediction, but sometimes it will be revealed as the logarithm or a cubic function of what is estimated from the other columns. In those cases a linearization function is obvious, and the column can be changed into one more useful for linear algebra. This can be done over and over, using the linearized columns as they are created to help re-estimate other columns. Eventually we will have transformed the dataset into one very suitable for PCA or other forms of analysis. Note that the linearization functions must be recorded, so that they can be used to undo what has been done. For example, a data simplification method can be this: linearize the data, perform PCA, throw away the lowest-weight factors, rotate the data back, undoing the PCA, then delinearize the data, producing a simplified and actually corrected version of the original data. Note that this process can also be automated, and must be, again because we are doing technology, not science.
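Here is a rough Python sketch of that round trip using NumPy. Everything named in it is illustrative rather than taken from real software: the `TRANSFORMS` table, the functions `best_transform` and `simplify`, and the rule for choosing a transform, which is simply to keep whichever candidate the other columns predict best by least squares. That rule is my assumption for the example and is a crude heuristic. Rows are users and columns are questions.

```python
import numpy as np

# Each candidate linearization is stored with its inverse so it can be undone.
TRANSFORMS = {
    "identity": (lambda x: x, lambda y: y),
    "exp": (np.exp, np.log),                   # column looks like a logarithm
    "cube_root": (np.cbrt, lambda y: y ** 3),  # column looks like a cube
    "log": (np.log, np.exp),
}

def best_transform(data, j):
    """Name the transform of column j that the other columns predict best."""
    others = np.delete(data, j, axis=1)
    design = np.column_stack([others, np.ones(len(data))])
    scores = {}
    for name, (forward, _) in TRANSFORMS.items():
        with np.errstate(all="ignore"):
            target = forward(data[:, j])
        # Skip transforms that are undefined or degenerate for this column.
        if not np.all(np.isfinite(target)) or target.var() == 0:
            continue
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        scores[name] = 1.0 - residual.var() / target.var()
    return max(scores, key=scores.get)

def simplify(data, recipe, keep):
    """Linearize, keep the `keep` strongest components, then undo both steps.

    recipe: one TRANSFORMS key per column, i.e. the recorded linearization.
    """
    data = np.asarray(data, dtype=float)
    linear = np.column_stack(
        [TRANSFORMS[name][0](data[:, j]) for j, name in enumerate(recipe)]
    )
    mean = linear.mean(axis=0)
    # PCA by SVD of the centered data; the rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(linear - mean, full_matrices=False)
    s[keep:] = 0.0                     # throw away the lowest-weight factors
    rebuilt = (u * s) @ vt + mean      # rotate back, undoing the PCA
    # Delinearize with the recorded inverse functions.
    return np.column_stack(
        [TRANSFORMS[name][1](rebuilt[:, j]) for j, name in enumerate(recipe)]
    )

# Toy data: the third column is the cube of the quantity behind the first two.
x = np.linspace(1.0, 3.0, 40)
data = np.column_stack([x, 2 * x + 1, x ** 3])
recipe = ["identity", "identity", best_transform(data, 2)]
simplified = simplify(data, recipe, keep=1)
```

On this toy table the third column is picked out as needing a cube root, after which a single factor is enough to rebuild all three columns. With noisy real data the rebuilt table would differ from the original, and that difference is the simplification and correction described above.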

Note that the correction achieved through such a simplification process can improve the estimates made of missing data and can correct the data in other ways. It cannot, however, eliminate systematic errors resulting from poorly designed questionnaires which have a bad response set (response bias). An example is a questionnaire that asks people for intimate details of their sex lives: they rarely answer correctly unless they are presented with equally appealing or equally unappealing choices. For the social sciences, such poorly designed questionnaires are a disgrace, but they can still be useful, and much is made of using the biased data. For technology we need other methods. I will go into that in another post, but basically the method is to use internal consistency checks to find the most reliable responders, then use their answers as a clue to biases in the responses to some questions. This is harder to do, but I believe it can be done well enough, and it can and must also be automated.

So, there is a quick survey of data massage for social technology, as distinct from social science. Writing software to do all of this is not as difficult as it seems, from my own limited experience with admittedly somewhat crude software which I’ve worked on over the years. — dpw