This tutorial was partially adapted from http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial, where you can learn more about OpenRefine. It used to be called Google Refine, so try that name too when searching for information. Here, however, we use Python to do the same thing.

Downloading Data

The university data can be downloaded from http://enipedia.tudelft.nl/enipedia/images/f/ff/UniversityData.zip

What you can learn

The data contains quite a few issues, and this tutorial shows how to do things like:

Universidad Juárez Autónoma de Tabasco is a public institution of higher learning located in Villahermosa, Tabasco, Mexico.

# Rows whose country is just a stray comma belong to this Mexican university
uk_condition = df['country'] == ','
df.loc[uk_condition, 'country'] = "Mexico"

df[df['country'] == 'Satellite locations:']

| | university | endowment | numFaculty | numDoctoral | country | numStaff | established | numPostgrad | numUndergrad | numStudents |
|---|---|---|---|---|---|---|---|---|---|---|
| 75009 | Nova Southeastern University | US $64.5 million | 2083 | NaN | Satellite locations: | 4319 | 1964 | 22060 | 6397 | 28457 |

Nova Southeastern University (NSU) is a private nonprofit university, with a main campus located on 300 acres (120 ha) in Davie, in the US state of Florida. Formerly referred to as “Nova” and now commonly called “NSU”, the university currently consists of 18 colleges and schools offering over 175 programs of study with more than 250 majors.

Clean up values for the number of students

We need to clean the data for the number of students. Not all of the values are numeric, and many of them contain bits of text in addition to the actual number of students.
To figure out which entries need to be fixed, we need the equivalent of OpenRefine's Numeric facet, which separates the numeric values from the non-numeric ones.
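One way to approximate a numeric facet in pandas is to try converting the column with `pd.to_numeric` and look at the entries that fail. The sketch below is only an illustration: the `sample` DataFrame and its values are made up and stand in for the real `numStudents` column.

```python
import pandas as pd

# Hypothetical sample standing in for df['numStudents']; the values are invented.
sample = pd.DataFrame(
    {"numStudents": ["28457", "about 12,000", "3500 (2010)", None]}
)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_number = pd.to_numeric(sample["numStudents"], errors="coerce")

# "Non-numeric" entries: a value is present, but it did not parse as a number.
non_numeric = sample[as_number.isna() & sample["numStudents"].notna()]
print(non_numeric)
```

Inspecting `non_numeric` plays the same role as selecting the non-numeric bar in the facet: it shows which entries still need manual or pattern-based cleanup.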

Clean up values for the endowment

First remove the numeric facet for numStudents and create a new numeric facet for endowment. Select only the non-numeric values, as was done for the number of students.
Already we see issues like “US$1.3 billion” and “US $186 million”.

df.endowment = [str(i).replace(' million', 'E6').replace(' billion', 'E9').strip() for i in df.endowment]
df.endowment = [str(i).replace('million', 'E6').replace('billion', 'E9').strip() for i in df.endowment]
df.endowment = [str(i).replace(' Million', 'E6').replace(' Billion', 'E9').strip() for i in df.endowment]
df.endowment = [str(i).split(' ')[0] for i in df.endowment]
df.endowment = [str(i).replace('M', 'E6').strip() for i in df.endowment]
df.endowment = [str(i).replace(';', '').replace('+', '').strip() for i in df.endowment]
# df.endowment = [str(i).split('xbf')[1] for i in df.endowment]
# df.endowment = [str(i).split('xb')[1] for i in df.endowment]
# df.endowment = [str(i).split('xa')[1] for i in df.endowment]

After most of this has been cleaned up, select the remaining non-numeric values and delete them, just as was done for numStudents.
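The final step can be sketched in pandas as follows. This is a minimal example, assuming the strings have already been rewritten into scientific notation such as `1.3E9`; the `sample` DataFrame and its values are hypothetical, not taken from the university data.

```python
import pandas as pd

# Hypothetical sample of endowment strings after the replacements above.
sample = pd.DataFrame({"endowment": ["1.3E9", "186E6", "nan", "N/A"]})

# Parse scientific-notation strings as floats; unparseable entries become NaN.
sample["endowment"] = pd.to_numeric(sample["endowment"], errors="coerce")

# Drop the rows that are still non-numeric.
cleaned = sample.dropna(subset=["endowment"])
print(cleaned)
```

Using `errors="coerce"` followed by `dropna` deletes only the rows that could not be parsed, which mirrors selecting the non-numeric values and removing them.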