Why all that stuff about data?

I am afraid that my last two posts about massaging and combining data were far from clear. What is it all for?

There is a fundamental concept, easy to state. We will often want to match two people, for a platonic friendship or for an intimate sexual one, probably a romantic one. In doing this it would be very nice to have a lot of empirical data about couples. There is very little available, but suppose we did have it. Then we could say that a person who will be a good match for you is someone very much like those who are good matches for people similar to you. If A and B are similar, X is a good match for A, and Y is similar to X, then Y is likely a good match for B. This one inference is not sufficient, but it is evidence. If we found several people like you and knew that each had good matches who were in some ways similar to one another, then another person similar to those matches would much more likely be a match for you.
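For the programmers among you, here is a rough sketch of that inference. It is an illustration only, not a description of any working system. It assumes each person can be reduced to a short list of non-negative trait scores, that "similar" means cosine similarity between those lists, and that evidence from several known couples simply adds up. The function names, the trait lists and the sample people are all made up for the example.

```python
from math import sqrt


def similarity(p, q):
    """Cosine similarity between two equal-length trait lists (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def match_evidence(person, candidate, known_couples):
    """Add up the evidence that `candidate` is a good match for `person`.

    Each known couple (a, x) contributes sim(person, a) * sim(candidate, x),
    which is large only when the person resembles a AND the candidate
    resembles x. This is the "A is like B, Y is like X" step, scored.
    """
    return sum(
        similarity(person, a) * similarity(candidate, x)
        for a, x in known_couples
    )


# Hypothetical data: two well-matched couples, as (A, X) trait lists.
known_couples = [
    ((1.0, 0.0, 1.0), (0.0, 1.0, 1.0)),
    ((1.0, 0.0, 0.8), (0.0, 1.0, 0.9)),
]

you = (1.0, 0.0, 0.9)    # resembles both A's
near = (0.0, 1.0, 1.0)   # resembles both X's
far = (1.0, 0.0, 0.0)    # resembles neither X

near_score = match_evidence(you, near, known_couples)
far_score = match_evidence(you, far, known_couples)
one_couple_score = match_evidence(you, near, known_couples[:1])
```

With these invented numbers the candidate who resembles the known partners scores higher than the one who does not, and two supporting couples give more evidence than one. The score is a ranking aid, not a probability; turning it into a real estimate would need the couple data discussed below.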

This is a matter of probability, of course, of inductive inference, not logical deduction. But it can work. All that is needed is that elusive data about couples. Soon we will be able to collect that data for ourselves, but to get started we need enough data that we can offer people something useful. That data is very hard to find. Couple studies are rare and inadequate. But by sufficient work on the data we do have, we can make estimates.

There are a number of questions we would like to have the answers to, questions that were rarely asked, or that sit in data unavailable to us, or both. Having asked a woman about her life, we would have preferred that the investigators had also asked about her husband and the success of her marriage. A very small number of these questions were asked in various surveys. The idea of making use of all that social survey data seems to be a poor one. But after much study, I have concluded that the information is actually there, buried in the mountain of social survey data, from which only a tiny bit of good ore has been mined. The problem is to extract it.

The term ‘data mining’ has become popular. I think it a good one. Let me push the analogy a bit. In one of my novels, I wrote about fully automated, self-reproducing factories that make factories. Most importantly, I wrote about these becoming a part of a large space ship, which, in the book, flew to an asteroid and mined it for material to reproduce itself. Then there were two self-reproducing space ships, which would multiply exponentially, gradually turning the asteroid belt into a fleet of space ships. In this book I contrasted this approach with the more traditional one in which asteroid mining ships descended on an asteroid, mined the metals out of it, and left the worthless remains behind, just as miners on Earth do. What I noted was that an asteroid miner would be less like an Earthbound strip miner than the Earthbound meat packing company that “uses every part of the pig but the squeal”. Given the virtually limitless power of the sun, every bit of an asteroid could be mined, and most valuable of all would be just the parts traditional science fiction imagines to be left behind, the bare rock. Rock usually contains silicon and oxygen, both vital materials, either in combination or separated. It may contain calcium instead of silicon, but that is also a useful material, a lightweight metal. What part of an asteroid is not of some use? Asteroid mining is essentially just the separation of components, not the mining of good from bad.

Similarly, what part of a mass of social survey data is not useful? It is all information. What part of it is useless information? Prove to me that any of it is useless. Please.

Given all this data, is data mining just the extraction of good information from bad? Not at all. It is just the separation of data. What we want to know is in there. More is in there, useful to somebody. Indeed, social technology will eventually find uses for all of it. That was the main reason for the last two, rather obtuse posts. They were about collecting, massaging and combining data, prior to data separation, preparing for the data mining process — dpw