The Data We Need

Almost all research into human behavior focuses on particular behaviors. (Yes, not extremely particular, but also not extremely general.) For example, an academic journal article might focus on professional licensing of dentists, incentive contracts for teachers, how Walmart changes small towns, whether diabetes patients take their medicine, how much we spend on xmas presents, or if there are fewer modern wars between democracies. Academics become experts in such particular areas.

After people have read many articles on many particular kinds of human behavior, they often express opinions about larger aggregates of human behavior. They say that government policy tends to favor the rich, that people would be happier with less government, that the young don’t listen enough to the old, that supply and demand is a good first approximation, that people are more selfish than they claim, or that most people do most things with an eye to signaling. Yes, people often express opinions on these broader subjects before they read many articles, and their opinions change suspiciously little as a result of reading many articles. But even so, if asked to justify their more general views academics usually point to a sampling of particular articles.

Much of my intellectual life in the last decade has been spent in the mode of collecting many specific results, and trying to fit them into larger simpler pictures of human behavior. So both I and the academics I’m describing above in essence present themselves as using these many results presented in academic papers about particular human behaviors as data to support their broader inferences about human behavior. But we do almost all of this informally, via our vague impressionistic memories of what has been the gist of the many articles we’ve read, and our intuitions about what more general claims seem how consistent with those particulars.

Of course there is nothing especially wrong with intuitively matching data and theory; it is what we humans evolved to do, and we wouldn’t be such a successful species if we couldn’t at least do it tolerably well sometimes. It takes time and effort to turn complex experiences into precise sharable data sets, and to turn our theoretical intuitions into precise testable formal theories. Such efforts aren’t always worth the bother.

But most of these academic papers on particular human behaviors do in fact pay the bother to substantially formalize their data, their theories, or both. And if it is worth the bother to do this for all of these particular behaviors, it is hard to see why it isn’t be worth the bother for the broader generalizations we make from them. Thus I propose: let’s create formal data sets where the data points are particular categories of human behavior.

To make my proposal clearer let’s for now restrict attention to explaining government regulatory policies. We could create a data set where the datums are particular kinds of products and services that governments now provide, subsidize, tax, advise, restrict, etc. For such datums we could start to collect features about them into a formal data set. Such features could say how long that sort of thing has been going on, how widely it is practiced around the world, how variable has been that practice over space and time, how familiar are ordinary people today with its details, what sort of justifications do people offer for it, what sort of emotional associations do people have with it, how much do we spend on it, and so on. We might also include anything we know about how such things correlate with age, gender, wealth, latitude, etc.

Generalizing to human behavior more broadly, we could collect a data set of particular behaviors, many of which seem puzzling at least to someone. I often post on this blog about puzzling behaviors. Each such category of behaviors could be one or more data points in this data set. And relevant features to code about those behaviors could be drawn from the features we tend to invoke when we try to explain those behaviors. Such as how common is that behavior, how much repeated experience do people have with it, how much do they get to see about the behavior of others, how strong are the emotional associations, how much would it make people look bad to admit to particular motives, and so on.

Now all this is of course much easier said than done. Is it a lot of work to look up various papers and summarize their key results as entries in this data set, or just to look at real world behaviors and put them into simple categories. It is also work to think carefully about how to usefully divide up the space of actions and features. First efforts will no doubt get it wrong in part, and have to be partially redone. But this is the sort of work that usually goes into all the academic papers on particular behaviors. Yes it is work, but if those particular efforts are worth the bother, then this should be as well.

As a first cut, I’d suggest just picking some more limited category, such as perhaps government regulations, collecting some plausible data points, making some guesses about what useful features might be, and then just doing a quick survey of some social scientists where they each fill in the data table with their best guesses for data point features. If you ask enough people, you can average out a lot of individual noise, and at least have a data set about what social scientists think are features of items in this area. With this you could start to do some exploratory data analysis, and start to think about what theories might well account for the patterns you see.

Now one obvious problem with my proposal is that while it looks time consuming and tedious, it isn’t obviously impressive. Researchers who specialize in particular areas will complain about your data entries related to their areas, and you won’t be able to satisfy them all. So you will end up with a chorus of critics saying your data is all wrong, and your efforts will look too low brow to cower them with your impressive tech. So I can see why this hasn’t been done much. Even so, I think this is the data set we need.