Combining Data Sources

This will be discussed primarily in terms of working with existing social survey data, but will apply just as much to newly collected data, once we get past a single collection instrument. As noted before, existing social survey data will be very useful. How?

It is easier to understand this in terms of a machine-generated dialogue system than a questionnaire-based system. Given a mass of processed social survey data, preferably from a multitude of data sources, it is not hard to find key questions, ones which correspond to principal factors in question space. One such key question is, “Gender, male, female, no-answer”. Note: “no-answer” is data too. Knowing this question is important, our question generation software can ask it. From that answer, other questions immediately arise. If the user is male, the best next question is almost certainly age. That may be so for a woman as well, but almost as important would be “Have you given birth to any children?” or some variant of that. Given the answers to those questions, others will arise, the exact order and need for them depending on the individual and on previous answers. So to generate questions, we look at existing social survey data, if for no other reason, though there are many other reasons, of course.
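To make the idea of answer-dependent questions concrete, here is a minimal sketch. It is not the principal-factor analysis mentioned above; as a simple stand-in, it ranks the remaining questions by the Shannon entropy of their answers among respondents who match the answers given so far. The toy responses, the field names and the helper rank_questions are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical pooled responses; every field name and value is invented.
responses = [
    {"gender": "male", "age_range": "20-29", "given_birth": "no-answer"},
    {"gender": "male", "age_range": "30-39", "given_birth": "no-answer"},
    {"gender": "male", "age_range": "40-49", "given_birth": "no-answer"},
    {"gender": "female", "age_range": "20-29", "given_birth": "no"},
    {"gender": "female", "age_range": "30-39", "given_birth": "yes"},
    {"gender": "female", "age_range": "40-49", "given_birth": "yes"},
    {"gender": "female", "age_range": "30-39", "given_birth": "no"},
]

def entropy(answers):
    """Shannon entropy, in bits, of a list of categorical answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def rank_questions(responses, answered):
    """Rank unasked questions by how varied their answers are among
    respondents who match every answer given so far (assumed heuristic)."""
    matching = [r for r in responses
                if all(r.get(q) == a for q, a in answered.items())]
    candidates = sorted({q for r in matching for q in r if q not in answered})
    scored = [(q, entropy([r.get(q, "no-answer") for r in matching]))
              for q in candidates]
    # Most informative question first; ties keep alphabetical order.
    return sorted(scored, key=lambda item: -item[1])

for_men = rank_questions(responses, {"gender": "male"})
for_women = rank_questions(responses, {"gender": "female"})
```

With this toy data the childbirth question carries no information once the user has answered "male", but it carries a full bit for women, so it is only worth asking in the second case.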

Combining data from existing data sources (or, eventually, from our own) is done this way. Two questionnaires (survey instruments) will have some overlap. Most will at least ask the user’s gender. Most will ask something about age or date of birth, probably just expecting an age range. Whether asked or not, actual birthdate is rarely made available to users of the datasets. Two different question-and-answer sets with some overlap can be considered supersets of a smaller set given to a larger population. With many overlapping questionnaires, a lot of overlap can be used to estimate questions missing from each set. Let us suppose that instrument A asked a person’s income range, and instrument B did not. But suppose that both questionnaires asked about age, gender, home location, employment record and work location. Then the whole column of data about income range, missing from instrument B because that question was never asked, can be treated as missing data and estimated using the methods given in the last post.
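The instrument A and instrument B example can be sketched in a few lines. This is only an illustration under stated assumptions: the responses and field names are invented, and the estimator is a deliberately crude stand-in (copy the most common answer among the best-matching respondents), not the methods from the earlier post.

```python
from collections import Counter

# Hypothetical toy responses; field names and values are invented.
instrument_a = [  # asked about income
    {"gender": "male", "age_range": "30-39", "employed": "yes", "income_range": "40-60k"},
    {"gender": "male", "age_range": "30-39", "employed": "yes", "income_range": "40-60k"},
    {"gender": "female", "age_range": "30-39", "employed": "yes", "income_range": "60-80k"},
    {"gender": "female", "age_range": "20-29", "employed": "no", "income_range": "0-20k"},
]
instrument_b = [  # never asked about income
    {"gender": "male", "age_range": "30-39", "employed": "yes"},
    {"gender": "female", "age_range": "20-29", "employed": "no"},
]

def stack(*instruments):
    """Stack instruments into one table over the union of all questions.
    Questions an instrument never asked become None."""
    columns = sorted({q for rows in instruments for row in rows for q in row})
    return [{q: row.get(q) for q in columns} for rows in instruments for row in rows]

def estimate_column(table, target, shared):
    """Fill missing `target` values with the most common answer among the
    rows that answered `target` and agree on the most `shared` questions."""
    known = [r for r in table if r[target] is not None]
    filled = []
    for row in table:
        row = dict(row)  # copy, so the stacked table is left untouched
        if row[target] is None and known:
            def overlap(other):
                return sum(row[q] == other[q] for q in shared)
            best = max(overlap(k) for k in known)
            votes = Counter(k[target] for k in known if overlap(k) == best)
            row[target] = votes.most_common(1)[0][0]
        filled.append(row)
    return filled

combined = stack(instrument_a, instrument_b)
estimated = estimate_column(combined, "income_range",
                            shared=["gender", "age_range", "employed"])
```

Here the two instrument B respondents end up with income ranges borrowed from the instrument A respondents they most resemble on the shared questions.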

As discussed in that post, missing data should be “recovered” or estimated iteratively. From all the datasets which overlap more or less in different areas, we can crudely estimate columns of data missing from each one of them, columns full of holes because of unasked questions. It is possible in this way to provide estimates for the whole grand superset of all questions, the set including all of the questions asked on any one of them. For scientific study, this would be outrageous, but for technology, it is an appropriate thing to do. It is much more valid, appropriate and useful if done iteratively. Essentially we are going to be reconstructing a big structure from little pieces of it.
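A minimal sketch of what "iteratively" could mean here, again with invented data and a crude stand-in estimator rather than the method from the earlier post: every hole first gets a rough guess (the most common observed answer in its column), and then each hole is re-estimated from respondents who actually answered that question, weighted by how many of their other answers, real or estimated, agree. The rounds repeat until nothing changes. The function name iterative_fill and all field names are hypothetical.

```python
from collections import Counter

def iterative_fill(table, max_rounds=10):
    """Estimate every missing cell, then keep revising the estimates
    until a full round changes nothing (or max_rounds is reached)."""
    columns = sorted({q for row in table for q in row})
    current = [{q: row.get(q) for q in columns} for row in table]
    holes = [(i, q) for i, row in enumerate(current) for q in columns if row[q] is None]
    hole_set = set(holes)

    # First crude guess: the most common observed answer in each column.
    for i, q in holes:
        observed = [r[q] for j, r in enumerate(current) if (j, q) not in hole_set]
        if observed:
            current[i][q] = Counter(observed).most_common(1)[0][0]

    rounds = 0
    for rounds in range(1, max_rounds + 1):
        changed = False
        for i, q in holes:
            # Only respondents who really answered q may vote on it.
            donors = [j for j in range(len(current)) if (j, q) not in hole_set]
            if not donors:
                continue
            others = [c for c in columns if c != q]
            votes = Counter()
            for j in donors:
                weight = sum(current[i][c] == current[j][c] for c in others)
                votes[current[j][q]] += weight
            guess = votes.most_common(1)[0][0]
            if guess != current[i][q]:
                current[i][q] = guess
                changed = True
        if not changed:
            break
    return current, rounds

# Three hypothetical instruments, each of which skipped a different question.
table = [
    {"gender": "male", "age_range": "30-39", "income_range": "high"},
    {"gender": "female", "age_range": "20-29", "income_range": "low"},
    {"gender": "male", "age_range": "30-39", "employed": "yes"},
    {"gender": "female", "age_range": "20-29", "employed": "no"},
    {"age_range": "30-39", "employed": "yes", "income_range": "high"},
    {"age_range": "20-29", "employed": "no", "income_range": "low"},
]
filled, rounds_used = iterative_fill(table)
```

In this toy case the first guess gives every hole the same answer per column, and the later rounds correct the ones that do not fit the respondent's other answers.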

Think of a statue or pot in an archeological dig. Let us say that it exists in pieces, some of which may be missing, some of which may be extraneous. Gradually, piece by piece, we reconstruct the pot. But don’t use a very permanent glue, because once it is reconstructed (first guess) you may need to remove and replace some pieces, moving them around, discarding them as extraneous, perhaps manufacturing ones whose size and shape can only be seen once the first attempted reconstruction is done.

Gradually, iteratively over time, the amount of reconstructed data can be enormous, even if most of it is apparently extrapolation at first. As we extrapolate to fill in gaps from just two datasets, then move on, bringing in more and more, soon we are interpolating between well-understood points. This is a complicated process, which I will need to describe in more detail, and will at a later date. As I said in my earlier post, we can and must automate this, so as to do technology, not science, though the technology will also provide tools for the scientists. Again, I have done some limited work with rather crude research-only software that I’ve written, but I am quite sure it can all be made to work. It will require much more than just expanding on what I have already done, but that is a start — dpw