Aggregating Spanish Placement Exams

This week, I began exploring our backlog of language placement exams. I think the best way to talk about this is to walk you through the process of answering a sample question. For instance, how many students have taken our Spanish exam since we started collecting data?

To answer this question, I needed to combine all of the fixed-width text files into a single data set that I could analyze. I did this using Python's glob module. The script below combines all the results into a single text file.
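As a rough illustration of the approach, here is a minimal sketch of a glob-based combiner. The function name, the file pattern, and the output path are hypothetical placeholders, and the 62-character check mirrors the filter described in this post rather than reproducing the original script.

```python
import glob
import os


def combine_fixed_width(pattern, out_path, width=62):
    """Concatenate every file matching `pattern` into `out_path`.

    Only lines that are exactly `width` characters long (ignoring the
    line ending) are kept. Returns (kept, dropped) line counts.
    """
    kept = dropped = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # sorted() keeps the file order stable from run to run.
        for path in sorted(glob.glob(pattern)):
            # Skip the output file in case it matches the pattern too.
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue
            with open(path, encoding="utf-8", errors="replace") as src:
                for line in src:
                    record = line.rstrip("\r\n")
                    if len(record) == width:
                        out.write(record + "\n")
                        kept += 1
                    else:
                        dropped += 1
    return kept, dropped


# Hypothetical usage; the directory and file names are assumptions:
# kept, dropped = combine_fixed_width("exams/spanish/*.txt", "combined_spanish.txt")
```

Returning the two counts makes it easy to see how many lines the length filter throws away.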

One drawback of this code is that it filters out every line that isn’t exactly 62 characters long. I figured that it’d be easier to remove those scores than to clean them. This reduced the number of results from about 10,000 to 9,100. Maybe a better option would be to use a regular expression to find and clean those 900 results. Suggestions welcome.
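Before writing a regular expression, it might help to see what the rejected lines actually look like. The sketch below is only a starting point: both helper names are hypothetical, and the padding step assumes that short lines merely lost trailing blanks, which would need to be checked against the real files.

```python
from collections import Counter


def length_histogram(lines):
    """Count how many records have each length, ignoring line endings."""
    return Counter(len(line.rstrip("\r\n")) for line in lines)


def pad_short(line, width=62):
    """Right-pad a record shorter than `width` with spaces.

    Assumption: a short record only lost its trailing blanks.
    Returns None for records longer than `width`, since those
    need a closer look rather than an automatic fix.
    """
    record = line.rstrip("\r\n")
    if len(record) > width:
        return None
    return record.ljust(width)
```

If the histogram shows that most rejected lines are one or two characters short, padding could recover them; if the lengths are scattered, a pattern-based cleanup is probably the better route.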

The next step was to take the remaining results and import them into a pandas DataFrame. Originally, I planned to use the script I discussed in an earlier post. However, I ran into a few problems. First, some values were missing in the “birthdate” and “score” columns. That meant I couldn’t assign datatypes to these columns in the read_csv command. Second, there were several duplicate rows that needed to be eliminated. So here’s how I modified the code:
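The general pattern can be sketched as follows: read every column as text, convert the problem columns afterwards, then drop duplicates. This is an illustration under assumptions, not the script from the earlier post. The sample data, the comma delimiter, the "student_id" column, and the function name are all hypothetical; only the "birthdate" and "score" column names come from the description above.

```python
import io

import pandas as pd

# Hypothetical sample; the real layout is not shown in this post.
SAMPLE = """student_id,birthdate,score
001,1995-03-14,87
002,,91
003,1996-07-02,
001,1995-03-14,87
"""


def load_results(source):
    """Load results, tolerating blanks and removing duplicate rows."""
    # Read every column as text so blank cells cannot break dtype assignment.
    df = pd.read_csv(source, dtype=str)
    # Convert afterwards; missing or unparseable values become NaN / NaT.
    df["score"] = pd.to_numeric(df["score"], errors="coerce")
    df["birthdate"] = pd.to_datetime(df["birthdate"], errors="coerce")
    # Drop rows that exactly repeat an earlier row.
    return df.drop_duplicates().reset_index(drop=True)


results = load_results(io.StringIO(SAMPLE))
```

Reading as text first also preserves leading zeros in identifier columns, which a numeric dtype would silently strip.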