Working with the Census API

<p>A friend of mine was complaining to me the other day about trying to get data from the <a href="https://en.wikipedia.org/wiki/American_Community_Survey">American Community Survey</a>. Fortunately, it turns out that the Census has a fantastic and well-documented API. So I volunteered to help. I read through <a href="https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf">this guide</a>, found a <a href="https://census.missouri.edu/geocodes/">helpful page</a> for looking up geocodes from the Missouri Census Data Center, and looked through a <a href="https://api.census.gov/data/2015/acs1/cprofile/variables.html">rather cumbersome list</a> of the variables you can request.</p>
<p>My friend needed population, housing info and income brackets for several metropolitan areas in the US: Boston, Denver, Jacksonville, Miami, Pittsburgh, Sacramento and Springfield, MA.</p>
<p>Here’s what I came up with:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
import json
import requests
# API request, divided into manageable chunks
url_string = str("https://api.census.gov/data/2015/acs1?get=" +
"NAME," +
"B01003_001E," +
"B01003_001M," +
"C25032_001E," +
"C25032_001M," +
"C25032_002E," +
"C25032_002M," +
"C25032_003E," +
"C25032_003M," +
"C25032_007E," +
"C25032_007M," +
"C25032_008E," +
"C25032_008M," +
"C25121_012E," +
"C25121_012M," +
"C25121_017E," +
"C25121_017M," +
"C25121_022E," +
"C25121_022M," +
"C25121_027E," +
"C25121_027M&amp;" +
"for=urban%20area:69697,56602,42346,23527,09271,83926,77068&amp;" +
"key=get_your_own_census_key")
# Dictionary for renaming columns
col_dict = {
"NAME" : "URBANIZED_AREA_NAME",
"B01003_001E" : "POPULATION",
"B01003_001M" : "POPULATION_ME",
"C25032_001E" : "TOTAL_HOUSING_UNITS",
"C25032_001M" : "TOTAL_HOUSING_ME",
"C25032_002E" : "OWNER_OCCUPIED_TOTAL",
"C25032_002M" : "OWNER_OCCUPIED_TOTAL_ME",
"C25032_003E" : "OWNER_OCCUPIED_1",
"C25032_003M" : "OWNER_OCCUPIED_1_ME",
"C25032_007E" : "RENTER_OCCUPIED_TOTAL",
"C25032_007M" : "RENTER_OCCUPIED_TOTAL_ME",
"C25032_008E" : "RENTER_OCCUPIED_1",
"C25032_008M" : "RENTER_OCCUPIED_1_ME",
"C25121_012E" : "INCOME_20_34K",
"C25121_012M" : "INCOME_20_35K_ME",
"C25121_017E" : "INCOME_35_49K",
"C25121_017M" : "INCOME_35_49K_ME",
"C25121_022E" : "INCOME_50_74K",
"C25121_022M" : "INCOME_50_74K_ME",
"C25121_027E" : "INCOME_75_UP",
"C25121_027M" : "INCOME_75_UP_ME",
"urban area" : "URBANIZED_AREA_CODE"
}
# Get info from Census website
response = requests.get(url_string).json()
# Match Census column names with readable ones
col_list = []
for item in response[0]:
    col_list.append(col_dict[item])
# Create DataFrame
df = pd.DataFrame(response[1:], columns=col_list)
# Write DataFrame to CSV file
df.to_csv("/filepath/census data.csv", index=False)
</code></pre></div></div>
<p>The first chunk of this code generates a long URL address which asks the Census API for the data. You can <a href="https://api.census.gov/data/2015/acs1?get=NAME,B01003_001E,B01003_001M,C25032_001E,C25032_001M,C25032_002E,C25032_002M,C25032_003E,C25032_003M,C25032_007E,C25032_007M,C25032_008E,C25032_008M,C25121_012E,C25121_012M,C25121_017E,C25121_017M,C25121_022E,C25121_022M,C25121_027E,C25121_027M&amp;for=urban%20area:69697,56602,42346,23527,09271,83926,7706">click here</a> to see what it looks like. If you’re familiar with <a href="http://www.numpy.org">NumPy</a>, it’ll look like a NumPy array — or a list of lists in which each list constitutes a row in the data table.</p>
<p>The second chunk is a dictionary that associates the Census column codes with readable titles. This enabled me to take the first row of the array I got from the Census and create a list of column names. From there, I turned the array into a DataFrame and then into a CSV.</p>
<p>But wait, there’s more! I’ve been learning how to do data visualization with <a href="https://matplotlib.org">matplotlib</a>. Using <a href="https://matplotlib.org/gallery/statistics/barchart_demo.html#sphx-glr-gallery-statistics-barchart-demo-py">this example</a> as a model, I made the following bar chart to look at income brackets (I’ll spare you the code):</p>
<p><img src="/images/census_figure1.png" alt="" /></p>
<p>This is helpful, but it’s hard to compare the different urban areas because of the differences in population. This second figure solves that problem by looking at income brackets as a percentage:</p>
<p><img src="/images/census_figure2.png" alt="" /></p>
<p>This is much more helpful. For example, look at Jacksonville vs. Boston. The income distribution is skewed towards the highest income bracket in the latter.</p>A friend of mine was complaining to me the other day about trying to get data from the American Community Survey. Fortunately, it turns out that the Census has a fantastic and well-documented API. So I volunteered to help. I read through this guide, found a helpful page for looking up geocodes from the Missouri Census Data Center, and looked through a rather cumbersome list of the variables you can request.Aggregating Spanish Placement Exams2017-10-10T00:00:00+00:002017-10-10T00:00:00+00:00http://goodbody.io/aggregating-spanish-placement-exams<p>This week, I began exploring our backlog of language placement exams. I think the best way to talk about his is to walk you through the process of answering a sample question. For instance, how many students have taken our Spanish exam since we started collecting data?</p>
<p>To answer this question, I needed to combine all of the fixed-width text files into a single data set that I could analyze. I did this using <code class="highlighter-rouge">glob</code>. The below script combines all the results into a single text file.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file_pattern = "*.txt"
list_data = []
text_files = glob.glob(file_pattern)
with open("result.txt", "wb") as outfile:
for file in text_files:
with open(file, "rb") as infile:
outfile.write(infile.read())
clean_results_list = []
with open("result.txt", "r") as outfile:
data = outfile.readlines()
for line in data:
if len(line)==62:
clean_results_list.append(line)
clean_results_file = open(pd.to_datetime("today").strftime("%Y-%m-%d") + " - all_results.txt", "w")
for line in clean_results_list:
clean_results_file.write(line)
</code></pre></div></div>
<p>One drawback of this code is that it filters out all lines that aren’t 62 characters long. I figured that it’d be easier to remove those scores than clean them. This reduced the number of results from about 10,000 to 9,100. Maybe a better option would be to use some sort of REGEX expression to find and clean those 900 results. Suggestions welcome.</p>
<p>The next step was to take the remaining results and import them into a <code class="highlighter-rouge">pandas</code> DataFrame. Originally, I planned to use the script I discussed in an <a href="/importing-fixed-width-text-files">earlier post</a>. However, I ran into a few problems. First, there were some missing data in the “birthdate” and “score” columns. That meant I couldn’t assign datatypes to these columns in the <code class="highlighter-rouge">read_csv</code> command. Second, there were several duplicate rows that needed to be eliminated. So here’s how I modified the code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>col_breaks = [(0,19),(20,30), (31,33), (40,42), (43,46), (47,55), (55,61)]
col_names = ["name", "id", "score", "language", "placement", "test date", "birthdate"]
# Abbreviations from ISO 639-2
langs = {"3": "SPA", "4": "FRA", "5": "GER", "6": "LAT", "7": "JPN", "8": "KOR", "9": "RUS", "10": "CHI", "11": "ARA", "12": "PER", "13": "ITA", "14": "POR", "15": "GRE"}
filepath = "2017-10-10 - all_results.txt"
# Import text file
df = pd.read_fwf(filepath, colspecs = col_breaks, header = None, names = col_names,
                 dtype = object, parse_dates = [5], infer_datetime_format = True)
# Map the numeric language codes to readable abbreviations
df["language"] = df["language"].apply(lambda x: langs[x])
# Deal with missing data in the score and birthdate columns
df["score"] = pd.to_numeric(df["score"], errors="coerce")
df["birthdate"] = pd.to_datetime(df["birthdate"], errors = "coerce")
# Remove duplicates
dupes = df.duplicated("name", keep = "last")
deduped = df.loc[dupes == False]
</code></pre></div></div>
<p>Removing duplicates brought the count from 9,100 to 8,700.</p>
<p>Finally, I had to reshape the data using <code class="highlighter-rouge">pivot_table</code>, then resample the Spanish column by year:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>by_lang = deduped.pivot_table(aggfunc="count", index = "test date", values = "score", columns = "language")
by_lang_agg = by_lang.resample("A")["SPA"].sum()
</code></pre></div></div>
<p>…and here’s your answer in the form of a nifty graph. Enjoy!</p>
<p><img src="/images/spanish.png" alt="" /></p>This week, I began exploring our backlog of language placement exams. I think the best way to talk about his is to walk you through the process of answering a sample question. For instance, how many students have taken our Spanish exam since we started collecting data?GitHub Repository2017-10-04T00:00:00+00:002017-10-04T00:00:00+00:00http://goodbody.io/github-repository<p>I’ve created a GitHub repository for the scripts that I’m writing for my work in the Language Center. You can access it <a href="https://github.com/doctornerdis/lc_placement_exams">here</a>. So far, there are two scripts: one for assembling test results into a fixed-width text file, and one for turning a fixed-width text file into a <code class="highlighter-rouge">pandas</code> DataFrame.</p>
<p>This is pretty niche stuff, but I hope that sharing my work might help folks who are in similar situations. Enjoy!</p>I’ve created a GitHub repository for the scripts that I’m writing for my work in the Language Center. You can access it here. So far, there are two scripts: one for assembling test results into a fixed-width text file, and one for turning a fixed-width text file into a pandas DataFrame.Importing Fixed-width Text Files2017-10-02T00:00:00+00:002017-10-02T00:00:00+00:00http://goodbody.io/importing-fixed-width-text-files<p>In my <a href="/combining-csv-files-with-glob">previous post</a>, I discussed the process of converting the results from our language placement test into a fixed-width text file that’s compatible with our student information system. But what about going in the other direction? We have years worth of data in text files, and they’re ripe for analysis!</p>
<p>Fortunately, <code class="highlighter-rouge">pandas</code> comes with an easy solution: <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html"><code class="highlighter-rouge">read_fwf</code></a>. This allows me to wrangle fixed-width data into a DataFrame.</p>
<p>To give you an idea of what I’m working with, here’s a hypothetical result from a placement exam:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TROJAN TOMMY T 0123456789 93 0 0 0 32 II 10/02/17010199
</code></pre></div></div>
<p>What’s all this mean? The line starts with name and student number. Then there’s the test score, three placeholder <code class="highlighter-rouge">0</code>s, a one- or two-digit code for the language, and the placement level (as a digit and Roman numeral). Finally, there’s the test date in <code class="highlighter-rouge">mm/dd/yy</code> format, followed directly by the student’s birthday in <code class="highlighter-rouge">mmddyy</code> format. If you’re confused, that’s because it’s confusing!</p>
<p>Next, here’s the script I wrote:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
import numpy as np
col_breaks = [(0,19),(20,30), (31,33), (40,42), (43,46), (47,55), (55,61)]
col_names = ["name", "id", "score", "language", "placement", "test date", "birthdate"]
col_types = {"name": np.object, "id": np.object, "score": np.int64, "language": np.object, "placement": np.object, "test date": np.object, "birthdate": np.object}
# Three-letter abbreviations from ISO 639-2
langs = {"3": "SPA", "4": "FRA", "5": "GER", "6": "LAT", "7": "JPN", "8": "KOR", "9": "RUS", "10": "CHI", "11": "ARA", "12": "PER", "13": "ITA", "14": "POR", "15": "GRE"}
df = pd.read_fwf("sis_upload.txt", colspecs = col_breaks, header = None, names = col_names, dtype = col_types,
parse_dates = [5, 6], infer_datetime_format = True)
df["language"] = df["language"].apply(lambda x: langs[x])
</code></pre></div></div>
<p>Finally, here’s the first row of the resulting DataFrame. Note how the <code class="highlighter-rouge">parse_dates</code> and <code class="highlighter-rouge">infer_datetime_format</code> arguments convert the test and birth dates into datetime objects.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> name id score language placement test date \
0 TROJAN TOMMY T 0123456789 93 SPA II 2017-10-02
birthdate
0 1999-01-01
</code></pre></div></div>
<p><em>Et voilà!</em> The next step would be to use <code class="highlighter-rouge">glob</code>, which I discussed in my <a href="/combining-csv-files-with-glob">previous post</a>, is to concatenate our backlog of results into a single DataFrame. That leaves the question of storage. I’m working on a SQL solution for this and other data we need to easily access in the Language Center. Stay tuned!</p>In my previous post, I discussed the process of converting the results from our language placement test into a fixed-width text file that’s compatible with our student information system. But what about going in the other direction? We have years worth of data in text files, and they’re ripe for analysis!Combining CSV Files with Glob2017-09-05T00:00:00+00:002017-09-05T00:00:00+00:00http://goodbody.io/combining-csv-files-with-glob<p>An important part of my job at the <a href="http://language.usc.edu">USC Language Center</a> is administering placement tests and making the results available to students, advisors, and other administrators. Several times during the year, students take our tests using Scantron forms, and I end up with several CSV files — one for each of the languages we offer. I then need to make sure that all those results end up in a single, fixed-width text file that’s compatible with the university’s student information system. It’s one of those data management tasks that are perfect for automation with python.</p>
<p>Enter <a href="https://docs.python.org/3/library/glob.html"><code class="highlighter-rouge">glob</code></a>, a python package that helps you find multiple pathnames matching a certain pattern. For example, it allows me to quickly find all the CSV files that I need to combine into a single DataFrame that I then convert into a fixed-width text file.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
import glob
import time
# Create a date string for the exam, required by our student information system
date_str = str(time.strftime("%m/%d/%y"))
# Creating list of columns I want to import into my DataFrame
columns = ["Student Name", "Month", "Day", "Year", "ID", "Special Codes", "Total Score", "Grade"]
# Set my empty DataFrame
frame = pd.DataFrame()
# Use glob to collect all my CSV files into one DataFrame
for f in glob.glob("/filepath/placement_tests/*.csv"):
    df = pd.read_csv(f, usecols = columns, dtype = object)
    frame = frame.append(df, ignore_index = True)
# Add date string
frame["Test Date"] = date_str
</code></pre></div></div>
<p>I’ve been working through a <a href="http://datacamp.com/">DataCamp</a> course on cleaning data taught by <a href="https://twitter.com/chendaniely">Daniel Chen</a>. There he advised a slightly different strategy: using <code class="highlighter-rouge">glob</code> to create a list of DataFrames and then using the <code class="highlighter-rouge">.concat()</code> method to combine them all together. It would look something like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Collect CSV files
pattern = '/filepath/placement_tests/*.csv'
csv_files = glob.glob(pattern)
# Create an empty list
frames = []
# Iterate over csv_files
for csv in csv_files:
    df = pd.read_csv(csv, usecols = columns, dtype = object)
    frames.append(df)
# Concatenate frames into a single DataFrame
test_results = pd.concat(frames)
</code></pre></div></div>
<p>I don’t know if one of these strategies is better than the other for this use case. Still, it’s interesting to see that there is more than one way to get things done!</p>
<p>Do you find yourself dealing with these sorts of repetitive data tasks in your work? How do you deal with them?</p>An important part of my job at the USC Language Center is administering placement tests and making the results available to students, advisors, and other administrators. Several times during the year, students take our tests using Scantron forms, and I end up with several CSV files — one for each of the languages we offer. I then need to make sure that all those results end up in a single, fixed-width text file that’s compatible with the university’s student information system. It’s one of those data management tasks that are perfect for automation with python.