I wrote earlier about using variable responses as they were recorded, which would probably work, but I don’t quite trust it, since it is not clear how many coordinates would be needed to fully represent the rows (individual people). It also seems that adequately representing those values will take up more memory than using what I call microvariables, which are columns of single bits.

The best way of representing the whole dataset seems to be as a single two dimensional array of bits. Each variable is replaced by several columns of single bits, each representing one possible value of the variable. Each individual person will be represented by one row of bits.

I have written about this before, but now think it the only way to go.

It is hard to find a nice way to do this. Pascal provides lovely arrays of booleans, but each element actually takes up one byte, not a single bit. I do miss VAX Pascal, with its Packed Array of Boolean data type, in which each boolean occupied just a single bit of memory.

Python does have a nice package, of course, but I run on a 64-bit machine and don’t quite trust the 64-bit version of the compiled package, which is only at version 0.3.5 anyway. If anybody has experience with this package, Bitarray, I would like to know about it. See http://pypi.python.org/pypi/bitarray/ for information. It is available as a precompiled binary for Windows at http://www.lfd.uci.edu/~gohlke/pythonlibs/, a very nice page and the best way to access all the well-known packages (and a few good ones that are not so well known).

Anyway, using big bit arrays, I think I can guarantee that 16 coordinates will be enough to represent all possible rows of bit data. Fewer will probably suffice, though I haven’t actually looked for duplicates.
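As a minimal sketch of what working with rows of bits looks like, here are two rows stored as plain Python integers, which support the same bitwise operations the bitarray package provides on real bit arrays; the bit layout and toy values are my own invention.

```python
# Rows of microvariable bits stored as plain Python integers; the
# bitarray package offers the same operations on packed bit arrays.

def popcount(x):
    """Number of set bits in an integer of any size."""
    return bin(x).count("1")

# Two people as rows of bits: bit i set means microvariable i applies.
row_a = 0b101101
row_b = 0b100111

# Bits on which the two rows disagree (a Hamming distance):
distance = popcount(row_a ^ row_b)
# Microvariables set in both rows (a crude agreement measure):
agreement = popcount(row_a & row_b)

print(distance, agreement)  # 2 3
```

The same XOR-and-count idiom works however wide the rows get, which is the point of packing the microvariables into bits in the first place.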

Whether I trust the Python package or not, it seems to be the thing to use, so I will. I’ll report my results as I go along. — dpw


I have worked out a complete approach for dealing with the current dataset, the one chosen to start with, the Wisconsin Longitudinal Study. First is a method I have used before to get coordinates for rows, that is to say for individual people, in this kind of dataset. One uses the full list of variables, almost 14 thousand of them, to compare rows: every variable which exactly matches is counted, and the count determines the similarity between rows. The first step is to find the two most distant rows, by comparing each row with all others. Then extremes are added one at a time, by comparing each row with all extracted extremes, starting with the first two; the row farthest from all existing extremes becomes the next extreme. This process slows down considerably as each new extreme is added. Similarity to the final set of extremes is used to give coordinates. For a dataset of this size at least 10 coordinates must be extracted, in my experience.
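The extreme-finding procedure just described can be sketched as follows; the function names and toy data are mine, and the similarity measure is the exact-match count from the text.

```python
# Sketch of the extremes-extraction step described above.

def similarity(a, b):
    """Count of exactly matching variables between two rows."""
    return sum(1 for x, y in zip(a, b) if x == y)

def find_extremes(rows, k):
    """Pick k mutually distant row indices by farthest-point selection."""
    n = len(rows)
    # Step 1: the most distant pair, found by comparing all pairs.
    first = min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: similarity(rows[p[0]], rows[p[1]]),
    )
    extremes = list(first)
    # Step 2: repeatedly add the row least similar to every current extreme.
    while len(extremes) < k:
        rest = [i for i in range(n) if i not in extremes]
        extremes.append(min(
            rest,
            key=lambda i: max(similarity(rows[i], rows[e]) for e in extremes),
        ))
    return extremes

rows = [[1, 1, 1, 1], [1, 1, 1, 2], [2, 2, 2, 2], [1, 2, 2, 2]]
extremes = find_extremes(rows, 3)
# Coordinates for a row: its similarity to each extreme.
coords = [similarity(rows[3], rows[e]) for e in extremes]
print(extremes, coords)
```

The quadratic comparisons in step 2 are exactly where the slowdown mentioned above comes from: each new extreme means comparing every remaining row against one more extreme.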

The next step involves what I call microvariables, which are simple boolean variables based on the variables defined in the survey. A variable will have several responses. Each possible response defines a microvariable, a column of booleans, single bits. A set of coordinates for each column can then be defined as the mean of the row coordinates for all bits set in this column. Thus, each microvariable will have a set of coordinates, equal in number to the number of row coordinates. It is important to have these coordinates for the microvariables, because they are comparable, where the boolean columns are not — they are apples and oranges.
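That definition is simple enough to sketch directly; the function and toy coordinates here are mine.

```python
# Coordinates for a microvariable: the mean of the row coordinates
# of everyone whose bit is set in that boolean column.

def column_coordinates(column, row_coords):
    """column: one boolean per person; row_coords: one tuple per person."""
    selected = [rc for bit, rc in zip(column, row_coords) if bit]
    dims = len(row_coords[0])
    return tuple(sum(rc[d] for rc in selected) / len(selected)
                 for d in range(dims))

row_coords = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]  # toy row coordinates
column = [True, False, True]  # bit set for persons 0 and 2
print(column_coordinates(column, row_coords))  # (2.0, 3.0)
```

Because every microvariable ends up with a coordinate tuple of the same length, any two of them can be compared directly, which is what makes the apples comparable to the oranges.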

This is a fairly simple process, all of which I have done before on other data, so it should not be too hard for me to implement. — dpw

I am still looking for better sources of the information most important to me, but I have had some luck with the Wisconsin Longitudinal Study. Just a bit, but some. One question, repeated on different instruments, asks how close the respondent feels to his or her spouse. A very few other questions refer to the amount of sexual pleasure with a spouse. Taken together, they do give some very limited indications of compatibility. What makes this useful is that there are a number of questions asking the respondent about his or her spouse, and a few questions in which the spouse gives some data about himself or herself. It’s not a lot of information, but it is some, and since it is a longitudinal study, there is some indication of how this changes over time. It will take me a while to make use of the data and I sure wish there was more, but there is enough to get started.

I have had much more luck locating information about jobs. The WLS asked a lot about occupation and job satisfaction. I’ll work with that data as I find time, but for now I am anxious to see if I really can get anything useful out of the limited interpersonal data. — dpw

Oops. This just happens sometimes. I get involved in writing about something mathematical and that part of my mind gets trapped there. On another blog I am writing a novel, an online chapter-by-chapter novel with daily installments. Today I was trying to deal with a concept that was mathematical enough to occupy that part of my mind until it became too late in the day to work on software development or this blog. For the curious, see Chapter 14, at http://SocialTechNovel.SocialTechnology.ca/ — the previous few chapters talk about advanced social tech hardware, by the way. I’ll get back to software and this blog tomorrow. — dpw


I don’t know who might be following this or stumble across it, but it does get some notice from somebody, since a mistake gets drawn to my attention quickly enough. I may have driven people away by making too many assertions and not enough requests for anything. This may be because I have too much of some things, too much social survey data, for instance. There is a mountain of it. Over the years I’ve collected some, but never found quite the right data. So let me just ask —

Does anyone know of an education dataset with raw test answers in it? Not scores, actual answers. With the questions and correct answers too, of course. It would be especially nice if some other information about the students were included. Has anyone got some of their own material, not a published dataset, but just test answers from one or more of the tests they have administered themselves?

I am always looking for compatibility data. I’m not comfortable with trying to derive this information from more general surveys. I can do that using marriage-success data, but that is a stretch.

Lots of what I want to do involves using data which just doesn’t exist, never having been collected, to the best of my knowledge. But there really is a mountain of data out there, and who knows what might have been collected. Let me just fantasize about something I have never found, that might exist —

I would like a dataset which includes individual characteristics, interpersonal compatibility information, occupation and job history information, and answers to questions of fact.

I can find nothing with all of this, but I do have a tiny bit of information from electoral studies, in which people stated something about themselves and were then asked what they knew about the candidates. Since some of this is factual information which can be checked (party, incumbency, age, views on various topics), it is possible to see factual errors, almost like the answers to test questions. That, combined with information from the respondent about himself or herself, is definitely useful, and some occupation information sometimes goes with it, but nothing whatsoever about interpersonal compatibility.

I wrote before about patching together the data from many sources to get what I want. I am sure I will still have to do that, but it doesn’t hurt to ask. Please help if you know of any available datasets that I might use. — dpw

This blog is about social software, which is only one part of social technology. I am still struggling to deal with social hardware. Nor am I entirely comfortable writing about software which runs on the limited social hardware we have now, the newer cellphones and their descendants.

I have focused instead on software which could run on a large social utility like Facebook, or a search engine, like Google, as it evolves more social capabilities.

The key notions I have spent the most time investigating are profiles and suggestions. By profiles I do not mean the answers to a handful of simple questions, like “What is your favourite book?” Instead I mean large mathematical models, such as could be boiled down out of thousands of questions asked on a questionnaire or through a dialogue.

By suggestions I do mean something simple, but not created in a trivial way. A suggestion might be to connect with a specific person. But it would not be arrived at by the simplistic method of noting the friends of your friends.

Instead the large mathematical model which is your profile would be compared with many others and compatibility predictions would be made. For interpersonal matching, perhaps the most important use, your profile might be compared with a million or more other people’s.

This is where social survey data can come in. We need to be able to draw an inductive inference: “These people are like you. These are the people they are compatible with. So people like them should be compatible with you.”

Eventually it will be possible to use data we collect for that purpose, but for the time being, only social survey data is available to help fill the gap. If the social survey data was sufficiently complete, then a profile of you could be compared with the profiles of people surveyed, and then information about those compatible with them could be used.
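A toy sketch of that inductive step; the profile vectors, names, and similarity measure here are all invented placeholders.

```python
# Toy sketch: find surveyed people most like "you", then suggest
# the people recorded as compatible with them.

def similarity(a, b):
    return sum(1 for x, y in zip(a, b) if x == y)

# Hypothetical surveyed profiles, each with a list of compatible partners.
surveyed = {
    "p1": ([1, 0, 1, 1], ["q7", "q9"]),
    "p2": ([0, 0, 0, 1], ["q2"]),
    "p3": ([1, 0, 1, 0], ["q7", "q4"]),
}

def suggestions(you, k=2):
    ranked = sorted(surveyed,
                    key=lambda p: similarity(you, surveyed[p][0]),
                    reverse=True)
    out = []
    for p in ranked[:k]:
        out.extend(surveyed[p][1])
    return list(dict.fromkeys(out))  # dedupe, keep order

print(suggestions([1, 0, 1, 1]))  # ['q7', 'q9', 'q4']
```

At the scale described above, with a million or more profiles, the ranking step would of course need indexing rather than a full sort, but the inference is the same.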

Unfortunately this information is not readily available. It may have been collected, but since it might reveal the identities of individual people, it is not available to the public. Usually it is not even available to other academic researchers.

The best we can do is extract some information from surveys like the otherwise very helpful Wisconsin Longitudinal Study, in which small amounts of information about spouses are available, plus information about the success or failure of marriages.

There will not be and should not be any way of filling in the rest of the personality and compatibility gaps, which might indeed make it possible to identify some individuals. But it is important to milk the publicly available data for every drop of compatibility information.

We do not need private information about individuals to know, for example, that marriages between Catholics and fundamentalist protestants are rare and when they occur are often unsuccessful.

I wish that more such information had been collected, and think it should have been. Other information, such as that about matches between people and jobs, is more readily available, but again, some of it has been hidden in the interest of protecting the privacy of individuals.

There are ways of making that information available, but first it requires different collection methods, then different selection methods.

In my earlier posts I wrote about mathematical methods for the collection, massage, combination, additional processing and use of the available data. From looking over the available data from the surveys I’ve chosen to use, most especially the WLS, but also the General Social Survey, the GSS, I have come to realize that a combination of mathematical and human work will be necessary.

I will have to feed in some guesses about what data I think most relevant and most different from previously selected data. This is going to require the creation of a tool, a program for the interactive selection and processing of data.

Such tools have already been created and used by the people who analyze survey information for scientific purposes, but this must be more than that. As I say repeatedly, this is not science, it is technology. I think I will have to create artificial user profiles – test profiles, then at each data selection step see how well the current selection of variables permits matching two of these people.
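One way such a tool might score a selection of variables can be sketched as follows; the profile construction, the 90% agreement rate, and every name here are entirely invented for illustration.

```python
import random

# Sketch: score a selection of variables by whether two artificial
# profiles, built as near-duplicates, still find each other among
# random decoys. All construction details are invented.

random.seed(1)
N_VARS = 50

def make_pair_and_decoys(n_decoys=20):
    a = [random.randint(0, 3) for _ in range(N_VARS)]
    # b agrees with a on roughly 90% of variables.
    b = [x if random.random() < 0.9 else random.randint(0, 3) for x in a]
    decoys = [[random.randint(0, 3) for _ in range(N_VARS)]
              for _ in range(n_decoys)]
    return a, b, decoys

def finds_partner(a, b, decoys, cols):
    """True if b is a's best match using only the selected columns."""
    sim = lambda row: sum(1 for c in cols if a[c] == row[c])
    return all(sim(b) >= sim(d) for d in decoys)

a, b, decoys = make_pair_and_decoys()
print(finds_partner(a, b, decoys, range(5)))       # few variables: risky
print(finds_partner(a, b, decoys, range(N_VARS)))  # all variables
```

Running this at each selection step would show whether the current variable subset still lets the two artificial people find each other, which is the test proposed above.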

This requires more thought, but in the meanwhile, to further this thought, I will spend more time examining the available social survey data. — dpw

The staff of the Wisconsin Longitudinal Study told me proudly that they had more than 14 thousand variables. In a sense that is true, but I am going to work with them in a way that makes the number more like 100 thousand. What the scientists who do social surveys consider a variable is not really a variable in the mathematical sense. Instead of a single column of codes like 1, male; 2, female; 3, refused, a mathematician might think of these as three columns of a matrix. The fact that only one answer appears in the set of three columns does not necessarily change this. The numbers in each column may instead be read as probabilities: a zero in one column means certainty that this attribute does not apply to the individual, while a one in another column represents certainty that this attribute does apply.
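The expansion just described, sketched in code; the coding scheme is the gender example from the text, and the function name is mine.

```python
# Sketch: expanding one survey "variable" into one mathematical
# column per possible response, as described above.

CODES = {1: "male", 2: "female", 3: "refused"}

def to_columns(responses):
    """Turn a column of codes into one 0/1 column per response value."""
    return {label: [1.0 if r == code else 0.0 for r in responses]
            for code, label in CODES.items()}

cols = to_columns([1, 2, 2, 3, 1])
print(cols["male"])     # [1.0, 0.0, 0.0, 0.0, 1.0]
print(cols["refused"])  # [0.0, 0.0, 0.0, 1.0, 0.0]
```

A survey variable with n coded responses becomes n columns this way, which is how 14 thousand variables turn into something more like 100 thousand.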

Now as we all know, there are few certainties in this world. People make mistakes, they lie, their responses are coded incorrectly. So various data massage operations can be performed, as described in an earlier post. The results can still be interpreted as probabilities, but will be more realistic. This and the other operations discussed in that earlier post are part of data correction. The social scientist may argue against this level of data correction, but it has great value, especially when done automatically for the purposes of social technology.

I will have no access to computers which can do this level of correction or make proper use of a survey anywhere near the size of the WLS, but I will be able to use some of it. I am still not sure how to select which parts to use. A tentative plan is to collect the columns representing the questions and responses of most significance, such as those for gender, age, race, religion and education. Given these, it should be possible to make predictors for the other columns. When a predictor is quite accurate, it corresponds to a column which gives us little information. When a predictor fails almost completely, it corresponds to important new information. Repeating this process over and over, I should be able to locate perhaps as many as 4,000 columns which my machine will be capable of processing.
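A sketch of that selection test; the predictor here (the most common target value for each combination of key-column values) is my own stand-in, and the toy columns are invented.

```python
from collections import Counter, defaultdict

# Sketch: predict a candidate column from key columns; columns that
# are predicted well carry little new information and can be dropped.

def prediction_accuracy(key_cols, target):
    """key_cols: list of columns (lists); target: the column to predict."""
    keys = list(zip(*key_cols))
    by_key = defaultdict(Counter)
    for key, t in zip(keys, target):
        by_key[key][t] += 1
    correct = sum(1 for key, t in zip(keys, target)
                  if by_key[key].most_common(1)[0][0] == t)
    return correct / len(target)

gender = [1, 1, 2, 2, 1, 2]
redundant = [5, 5, 6, 6, 5, 6]  # fully determined by gender
novel = [1, 2, 1, 2, 1, 2]      # unrelated to gender

print(prediction_accuracy([gender], redundant))  # 1.0: little new info
print(prediction_accuracy([gender], novel))      # lower: worth keeping
```

Ranking candidate columns by how badly they are predicted, then adding the worst-predicted ones to the key set and repeating, is the iteration described above.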

It may be that this can be done iteratively, storing as sufficiently processed the less useful columns after each processing step, then seeking replacements. I am not sure if this will work or not. Even if it does, the amount of processing may be excessive.

This is something I will be working on for quite a while, but at least I know enough about the variables to make a good start. — dpw

I think we will be able to use the data from the Wisconsin Longitudinal Study (WLS) without heroic measures like dealing with an enormous XML file. It seems that the Comma Separated Value (CSV) files (which are actually tab-separated, not comma-separated) can be combined with the catalog files in the SAS distribution to produce something adequate for our purposes. The catalog files are less than full codebooks, but are somewhat descriptive of the data.

Reading over the variable status and description documentation and small parts of the enormous cross-reference tables, I am disturbed to find that many variables are actually constructed ones, combinations of two or more distinct questions. Nevertheless, what is available will do for prototyping and testing.

Once again I have a wish list — I would still like raw data and easy access to the actual questions asked, but this seems a generic problem with social surveys, as far as I can tell. I have downloaded and spent some time on several of them, and except for some smaller studies like simple election studies, I could find nothing available which just tells us the basic facts: “Here is the question, exactly as asked” and “Here is the answer received.” Just the facts, ma’am, just the facts. I recognize that this is difficult for in-person and telephone interviews, where the temptation to prompt the respondent may be overwhelming, but still, a single question should be asked, the response recorded verbatim, and that question should become a single variable.

I’ll write more about this in later posts, but for now I have work to do, making use of what is available. For the WLS data set covering 1957 to 2007, that is 12,988 variables, with data obtained from 10,317 respondents, a very impressive collection indeed. — dpw

I seem to have gotten into trouble by not being clear about what I would hope social survey data might contain. Here are some proposed standards:

everything, including questions and response sets shall be kept in machine readable form

it shall be possible to reconstruct the entire survey procedures and instruments from the stored data without human intervention

if this information is maintained within a statistical package or in a markup language, programs to translate it into a simple, open, human- and machine-readable form like CSV format shall be available, along with their inverses, so that the translations can be verified

routine regression tests shall be made by translating the simple machine-readable files back into the archive or working format, so that changes to the statistical packages or the markup language which affect the data can be discovered.

whether intended for distribution or not, raw data shall be kept in this way, in addition to any corrected data that has been prepared for general use

meta-data to describe all the data including questionnaire and codebook data shall be maintained as data according to these same standards

a description of filters used to subset the data into private, academic and public releases shall be maintained as data according to these standards

The total effect of these standards is to produce a collection of data and meta-data which can be processed automatically, without human intervention, by relatively simple programs whose operation can be checked and debugged easily. The purpose of these proposed standards is to support social technology, the automated use of this data, instead of merely social science, although the social technology to be developed will also provide tools for the scientific researcher.
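The round-trip verification these standards call for can itself be checked mechanically; in this sketch the “working format” is just a list of dicts, and the field names are invented.

```python
import csv
import io

# Sketch of the proposed regression test: translate working data to
# CSV and back, then verify nothing was lost in translation.

def to_csv(records, fields):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def from_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

records = [{"question": "Q1", "answer": "yes"},
           {"question": "Q2", "answer": "no"}]
assert from_csv(to_csv(records, ["question", "answer"])) == records
print("round trip verified")
```

A real archive would pair the statistical package's own export with an independent importer, but the shape of the test, translate out, translate back, compare, is the same.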

I have been upbraided, and justly so, by Professor Robert M. Hauser for my comments about the WLS codebooks not being available in digital form. What I should have said is that they are not in any simple digital form that anyone can use, such as the CSV format in which the actual response data is made available.

I do have a problem with using the codebook data, which is not in any easy-to-use format, but I should have been more careful how I said it. What I meant by my comment was that the PDF and HTML files are not in an easily machine-readable format, and there was not much I could do with the codebooks available on the site in those formats except print them out.

I did ignore the existence of large statistical packages that would give me access to the codebook data along with the rest of the data. I can’t afford to spend a lot of money on a package whose only purpose would be to translate the codebook data into the kind of simple digital format I could use directly.

I did indeed mention one of those packages, Stata, in the immediately previous post, where I noted my hope that, if the freely available stats package R could import Stata files, I might be able to get what I wanted, since from R it is easy enough to export data in a format that I can read and manipulate using Python.

I should have spelled out my frustration at finding only PDF and HTML files, useless as data for my intended Python program, though I did that more clearly in the previous post, which did outline that plan for getting codebook data from Stata files via R.

Even that now seems futile, because the version of R I used to use will not run on the only machine I have now, a 64-bit one. If anyone knows of an inexpensive package which will run on a 64-bit AMD machine and read the files which are available, please let me know.

My real problem is trying to start a project which has no budget for big statistical packages like SAS, SPSS or Stata. This would have been so simple if they had made the codebook data available in something as simple as the CSV format used for the tabular data itself. As it stands, I find myself sitting here with a powerful computer and a good programming language which runs on it, but no way of accessing the codebook data.

That was the source of the rather severe frustration which led me to say the wrong thing about the WLS codebooks, for which I am very sorry. Yes, they are in a digital format, though not one useful to me. — dpw