Random thoughts of a computer scientist who is working behind the enemy lines; and lately turned into a double agent.

Wednesday, June 10, 2015

An API for MTurk Demographics

A few months back, I launched demographics.mturk-tracker.com, a tool that continuously runs surveys of the Mechanical Turk worker population and displays live statistics about gender, age, income, country of origin, etc.

Of course, there are many other reports and analyses that can be produced from the data. To make it easier for other people to use and analyze the data, we now offer a simple API for retrieving the raw survey data.

Here is a quick example. We first call the API and get back the raw responses:

In [1]:

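import requests
import json
import pprint
import pandas as pd
from datetime import datetime
import time

# The API call that returns the last 10K survey responses
url = "https://mturk-surveys.appspot.com/" + \
    "_ah/api/survey/v1/survey/demographics/answers?limit=10000"
resp = requests.get(url)
# Note: this rebinds the name `json` from the module to the parsed response,
# so re-running this cell requires importing json again
json = json.loads(resp.text)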
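The `limit` query parameter in the call above sets how many responses are returned. Here is a minimal sketch of a helper that builds the same URL for a different limit. The function name `build_answers_url` is hypothetical, and it assumes nothing about the API beyond the `limit` parameter that appears in the original call.

```python
from urllib.parse import urlencode

# Base endpoint copied from the API call above
BASE_URL = ("https://mturk-surveys.appspot.com/"
            "_ah/api/survey/v1/survey/demographics/answers")


def build_answers_url(limit=10000):
    """Return the answers URL for the requested number of responses.

    Hypothetical helper: only the `limit` parameter from the original
    call is used; no other query parameters are assumed to exist.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return BASE_URL + "?" + urlencode({"limit": limit})


print(build_answers_url(100))
```

Passing the result to `requests.get` works exactly as in the cell above; a smaller limit is handy while experimenting.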

Then we need to reformat the returned JSON object and transform the responses into a flat table:

In [2]:

# This function takes as input the response for a single survey,
# and transforms it into a flat dictionary
def flatten(item):
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"

    hit_answer_date = datetime.strptime(item["date"], fmt)
    hit_creation_str = item.get("hitCreationDate")

    if hit_creation_str is None:
        hit_creation_date = None
        diff = None
    else:
        hit_creation_date = datetime.strptime(hit_creation_str, fmt)
        # convert to unix timestamp
        hit_date_ts = time.mktime(hit_creation_date.timetuple())
        answer_date_ts = time.mktime(hit_answer_date.timetuple())
        diff = int(answer_date_ts - hit_date_ts)

    result = {
        "worker_id": str(item["workerId"]),
        "gender": str(item["answers"]["gender"]),
        "household_income": str(item["answers"]["householdIncome"]),
        "household_size": str(item["answers"]["householdSize"]),
        "marital_status": str(item["answers"]["maritalStatus"]),
        "year_of_birth": int(item["answers"]["yearOfBirth"]),
        "location_city": str(item.get("locationCity")),
        "location_region": str(item.get("locationRegion")),
        "location_country": str(item["locationCountry"]),
        "hit_answered_date": hit_answer_date,
        "hit_creation_date": hit_creation_date,
        "post_to_completion_secs": diff
    }
    return result
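The `time.mktime` conversion in `flatten` interprets the parsed timestamps as local time. That is usually harmless when taking a difference, but the result can be off by an hour if the two timestamps straddle a daylight-saving change on the machine running the code. A minimal alternative sketch subtracts the two `datetime` objects directly. The helper name `seconds_between` and the sample timestamps are made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"  # same format string as in flatten()


def seconds_between(creation_str, answer_str):
    """Whole seconds from HIT creation to answer, or None if creation is missing."""
    if creation_str is None:
        return None
    creation = datetime.strptime(creation_str, FMT)
    answer = datetime.strptime(answer_str, FMT)
    # Subtracting naive datetimes yields a timedelta without any
    # local-timezone lookup; int() truncates the fractional part
    return int((answer - creation).total_seconds())


# Made-up timestamps, 90 seconds apart
print(seconds_between("2015-06-10T12:00:00.000Z", "2015-06-10T12:01:30.000Z"))  # 90
```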

# We now transform our API answer into a flat table (Pandas dataframe)
responses = [flatten(item) for item in json["items"]]
df = pd.DataFrame(responses)
df["gender"] = df["gender"].astype("category")
df["household_income"] = df["household_income"].astype("category")
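Once the responses are flat, simple summaries take only a few lines. Here is a sketch that computes the share of each value of a field using only the standard library. The records below are made up in the shape that `flatten` returns (they are not real survey data), and the function name `share_by` is hypothetical.

```python
from collections import Counter


def share_by(records, key):
    """Return {value: fraction of records} for the given key."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Made-up records for illustration only; not real survey responses
sample = [
    {"gender": "female", "year_of_birth": 1985},
    {"gender": "male", "year_of_birth": 1990},
    {"gender": "female", "year_of_birth": 1978},
    {"gender": "male", "year_of_birth": 1982},
]

print(share_by(sample, "gender"))
```

On the full dataframe, `df["gender"].value_counts(normalize=True)` should give the same proportions.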

We can then save the data to a vanilla CSV file, and see what the raw data looks like:

In [3]:

# Let's save the file as a CSV
df.to_csv("data/mturk_surveys.csv")
!head -5 data/mturk_surveys.csv