Parsing complex social study data

NPR’s #15girls project looks at the lives of 15 year old girls around the world. The reporting team was interested in using data from the World Values Survey (WVS) to help inform the reporting and produce a “by-the-numbers” piece.

Analyzing the World Values Survey data represents a fairly typical problem in data journalism: crunching numbers from social science surveys. Social science surveys have some typical features:

The data is in proprietary/non-standard formats like those used by Stata or SPSS. The WVS, happily, distributes comma separated value (CSV) files as well as SPSS and Stata files.

The data has hundreds of columns per respondent that correspond to responses to each question. The WVS has 431 columns and over 86,000 rows.

The possible responses are coded in a separate file, known as the codebook, which match a numerical or text code with the response value.

Parsing and analysis requirements

To crunch such numbers, we need a process that accounts for the issues inherent in importing and parsing data with these qualities. Our end goal is to get all this stuff into a Postgres database where we can analyze it. Here’s what we need to do that:

Implicit column creation: Typing in a schema for hundreds of columns is no fun and error-prone. We need some way to automatically create the columns.

Fast import: Importing tens of thousands of rows with hundreds of columns each can get pretty slow. We need efficient import.

Generic analysis: We need a way to match responses for any given question with the possible responses from the codebook, whether it is a free-form response, Likert scale, a coded value, or something else.

Importing the World Values Survey response data

We use a three-step process to get implicit column creation and fast import.

Dataset, a Python database wrapper, auto-magically creates database fields as data is added to a table. That handles the schema creation. But because of all the magic under the hood, Dataset is very inefficient at inserting large datasets. The WVS data – with over 86,000 rows with 431 columns each – took many hours to import.

The Postgres COPY [table] FROM [file]command is very efficient at importing data from a CSV, but notoriously finkicky about data formatting. Instead of hours, COPY runs in seconds, but your data needs to be perfectly formatted for the table you’re importing into.

The good news is that the WVS provides CSV data files. If they didn’t provide CSV, we’d use a tool like R to convert from Stata or SPSS to CSV. The bad news is that the WVS data files use inconsistent quoting and contain a few other oddities that causes the Postgres COPY routine to choke.

To get the advantages of both tools, we took a hybrid approach. It’s a bit ugly, but it does the job nicely. Our import process looks like this:

Open the dirty source CSV with Python

Read the file line-by-line:

On the first data row:

Create a single database row in the responses table with Dataset which creates all the columns in one go.

Delete the row from the responses table in the database.

Write each cleaned line to a new CSV file, quoting all values.

Use the Postgres COPY command to import the data.

Importing the World Values Survey codebook

The codebook format is fairly typical. There are columns for the question ID, details about the question, and a carriage-return separated list of possible responses. Here’s a simplified view of a typical row:

Note that the potential responses have a complex encoding scheme of their own. Carriage returns separate the responses. Within a line, # characters split the response into a response code, optional middle value (as seen above for the year of birth question), and verbose value. We’re still not sure what the middle value is for, but we learned the hard way we have to account for it.

Our codebook parser writes to two tables. One table holds metadata about the question, the other contains the possible response values. The conceptual operation looks like this:

For each row in the codebook:

Write question id, label, and description to questions table.

Split the possible responses on carriage returns.

For each row in possible responses:

Split response on # character to decompose into response code, middle value (which we throw out) and the real value (the verbose name of the response).

Analyzing the data

Now we have three tables – survey responses, codebook questions, and potential responses to each question. It’s not fully normalized, but it’s normalized enough to run some analysis.

What we need to do is write some code that can dynamically generate a query that gets all the responses to a given question. Once we have that, we can summarize and analyze the numbers as needed with Python code.

The helper query dynamically generates a query against the correct column and joins the correct survey responses using subqueries:

result=db.query("""
select
countries.value as country, c.value as response
from
survey_responses r
join
(select * from categories where question_id='{0}') c
on r.{0}=c.code
join
(select * from categories where question_id='v2a') countries
on r.v2a=countries.code
order by
country
;
""".format(question_id))

The results look like:

Country

Response

Brazil

Agree

Brazil

Agree

Brazil

Neither

Brazil

Disagree

…

We could have expanded on the SQL above to summarize this data further, but using a little basic Python (or a slick analysis tool like Agate) has some advantages.

Specifically, because of our database structure, caculating percentages for arbitrary response values in pure SQL would have led to a rather ugly query (we tried). Post-processing was going to be necessary in all events. And the relatively simple format let us use the query results for more advanced analysis, specifically to add “agree/strongly agree” and favorable Likert scale responses into a composite values for reporting purposes.

Here’s a snippet from our processing code that adds up the counts for each response (initialize_counts is a helper function to create a dict with zeroed out values for all possible responses; you could also use Python’s DefaultDict):

If you were to present the counts dict as a table, the processed data looks like this:

Country

Agree

Neither

Disagree

United States

1,043

683

482

…

A query that returns partially processed data turned out to be the best option for the full range of analysis we wanted to do.

Half-way solutions for the win

None of these techniques would be considered a best practice from a data management standpoint. Each step represents a partial solution to a tough problem. Taken together, they provide a nice middle ground between needing to write a lot of code and schemas and complex queries to do things the Right Way and not being able to do anything at all. The process might be a little ugly but it’s fast and repeatable. That counts for a lot in a newsroom.