Although data about faculty salaries at private universities can be difficult to find, getting data regarding faculty salaries at public universities is much easier. I've always been curious (even though I don't want to go into academia myself!) which fields pay the best and why and if there is a way to predict your salary if you are going to be a professor at a certain university.

So, I decided to make this a project I would explore and use two of the very best data science tools to do it: Python and R! Thanks to the rpy2 library and IPython's R magic, I can now include R code inside the notebook instead of having to switch between knitr and here. The first half will be about cleaning and EDA (exploratory data analysis): a key first step of any data science project dealing with structured data especially. The second half will be about machine learning and applying a model for regression.

While Python is making improvements in visualization, I still prefer R for this part. Combining the great tools of the plyr library for splitting the data into pieces with the beautiful visualization of the ggplot2 library makes this look much better than Python's current offerings.

For this project, I decided to choose my current university: Texas A&M. It's a large university, which means it also has a lot of faculty members to look at! The Texas Tribune publishes salary data each year from the university here.

We only need the data from Texas A&M, no other extension agencies.

Here is part of the problem though: all of the files we can download are .xlsx Excel files. R doesn't read these very well (although there are some workarounds). The best way is to save them as a .csv file first, then read them in using read.csv.

Pandas can read Excel files, however. Let's get the files using Pandas, then save them as .csv files locally. Once that is done, we can load them into R.

It is time to play around with some of IPython Notebook's lovely interaction with R using the rpy2 library and the R magic. Note to Windows users: this probably won't work very well for you as rpy2 doesn't play nicely with Windows! You could use a virtual machine with Linux to get around this if you wish.

Let's load all of the libraries we will need in R first.

In [48]:

%load_ext rpy2.ipython

In [49]:

%%R # You need this to use a block of R code.library(plyr)library(ggplot2)library(scales)

Now that we have all of our libraries loaded, let's load our data into R and take a look at it.

Clearly there are some issues we will need to address. First of all, we can see other universities were included in the database that are NOT a part of Texas A&M. For this study we are only looking at Texas A&M faculty, so let's only include rows under the 'MbrName' column that are either part of Texas A&M or the Health Science Center. Let's also trim this down a bit and get rid of the 'X' and 'CurrEmplDate' columns as we won't need them for our analysis.

For this project, we are looking at tenure-track faculty only. We want to get rid of other employees such as "BUS ADMIN I". As a start, let's remove all employees that don't have "Professor" in their title (from the ShortTitleDesc field).

In [53]:

%%R
salaryDB <- salaryDB[grep('PROF', salaryDB$ShortTitleDesc),]

In [54]:

%%R
dim(salaryDB)

[1] 2818 9

We have reduced the number of employees a lot, from 9234 to 2818.

The next thing we want to do is change the OrigEmplDate column to something more meaningful. Instead, let's make it the number of years they have worked at the university. We can use this as one of our features later.

The metadata at the website where we obtained the original salary data says that it was uploaded on Jan 15, 2014. Let's find the difference between that date and the original employment date for all of our employees.

In [55]:

%%R
origDate <-as.Date('2014-01-15')# This is the format R likes: YYYY-MM-DD.
salaryDB$OrigEmplDate <-as.Date(salaryDB$OrigEmplDate)# Get this into the Date form R likes for finding differences # between dates.
salaryDB$OrigEmplDate <-as.numeric((origDate-salaryDB$OrigEmplDate)/365.)# Calculate the difference and divide by 365 for number of # years.colnames(salaryDB)[8]<-'YearsWorked'# Use a more informative column title.

That seems correct. The next thing we are going to do is create another feature: college. All of the departments at universities are organized into colleges. At Texas A&M, that means 12 major colleges. This is somewhat tedious work, but we want to organize all of the faculty members by college as a feature.

Let's get a list of all of the different departments included first.

In [57]:

%%R
colnames(salaryDB)[5]<-'Dept'# Much better title for this column!
deptLevels <-as.factor(salaryDB$Dept)print(summary(deptLevels))

Quite a large variety of departments listed here. To be consistent with the college labelings on the university's website, I will only label a department as part of a college if it is specifically mentioned. We will create a variety of department lists of statements we can grep for each college, then assign a college number. Data science is sometimes a little messy!

In [58]:

%%R
# Include a large list of department terms we can use grep on so that each faculty member is grouped by college.
agList <-c('AG EC','AG ED','ANIMAL SC','BIOCHEM','ECOS','ENTO','HORTI','NUTRI','PLANT P','POULTRY','REC,','SOIL','WILDLIFE')# College of Agriculture
archList <-c('ARCHI','CONSTR','LANDS','VISUALIZ')# College of Architecture
bushList <-'BUSH SCH'# Bush School
maysList <-c('ACCOUNTING','FINANC','INFO & OP','MANAGEM','MARKET','MBA')# Business school
eduList <-c('DEAN OF ED','TEACHING,','EDUCATIONAL PSYC','HEALTH &','EAHR')# Education/Human Development
engrList <-c('ENGR','COMPUTER','CIVIL','MATERIALS','BIOMEDICAL ENG','AEROSPACE','AG ENGINEER')# Engineering
earthList <-c('GEO','OCEAN','METEO')# Geoscience (my college)
libList <-c('ANTHRO','COMMUN','^ECON','ENGL','HISTORY','INTERN','HISP','PERFORM','PHILO','POLITICAL','PSYCHOLOGY','SOCIO','LIBERAL')# Liberal Arts
sciList <-c('^BIOLOGY','CHEMISTRY','MATH','ASTR','STATISTICS')# Science
vetList <-c('VET','BIOMEDICAL SCI','LARGE AN')# Vet School

Now that we have our grep lists, make our new feature and assign a number for each college.

Let's include another feature for our dataset, tenure ranking. There are three basic levels: assistant, associate, and full. Assistant means you haven't reached tenure yet, which takes about 5 years or so after you start (and is not guaranteed!) Once you get tenure, you are promoted to associate rank. Do more good work, and you can reach the rank of full professor. Salary is clearly going to be somewhat dependent on this, so let's see if we can include this as an additional feature.

As you can see in the head of our dataframe, the professors have their rank included in the "ShortTitleDesc" column. Let's use grep again to assign each professor a rank category.

However, there were some other professors that remained we don't want to be including in our dataset (such as adjuncts or visiting professors). Let's give them a label of 0 to reset them. Then, get rid of any other faculty members that aren't in one of these three categories.

In [63]:

%%R
salaryDB$Rank[grep('ADJ|VISITING|CLIN|INSTRUCT|RES|PRACT', salaryDB$ShortTitleDesc)]<-0# Non tenure-track.
salaryDB <-subset(salaryDB, Rank >0)# Get rid of any faculty members not falling in one of the three rank categories.

Let's see how our dataframe is looking (and how many faculty members are left!)

Examining the histogram, we can see that the faculty members predominantly fall in category 1 (white). Category 2 represents Black/African-American, category 3 is Latino, and category 4 is Asian. Category 5 represents more than one race (mixed) while category 7 represents Native American.

While it is not ideal, we can guess the race based on the names of the NA members. White is by far the most common category, so we will assign all remaining faculty members to this category for now.

Our dataframe now has some new features, along with converting all of the categorical features to numeric ones. We have also gotten rid of all NA values. Our data is now (finally) clean! We can now start the next phase.

In this section, we will try some visualization of the salary data to find any interesting patterns or other existing problems with the data we may have missed.

There are several interesting ways we could look at this data (feel free to play around with it!). I am going to choose a few of the more interesting patterns. To do that, we will use the libraries plyr to shape the data and ggplot2 for visualization. These two are a powerful combination for exploring data.

First thing we are going to do is make a very nice plotting function that we can call to help visualize the data quickly and efficiently.
Let's make that now.

This function will make a nice looking density plot of our salaries to see the distribution, along with the median value for comparison. Let's call it now.

In [77]:

%%R
PrettyDensPlot(numericDB$BudgetedSalary,'Salary')

Looks like the median salary is right around the 110,000 dollar mark. The distribution looks to have a bit of a tail on the lower end. Remember that the plot marks below show a logarithmic distribution of salary data, which you often need. Salary data is very rarely normally distributed without a log transformation. Therefore, a summary statistic such as the median will usually be more informative about the distribution than the mean.

It does seem that women do make less money than men, regardless of race. The largest gap is somewhat surprising in Asians. Asian men have the greatest median salary among races, yet Asian women have the smallest! Hispanic faculty members seem to have the smallest wage gap by sex.

Next, let's examine how tenure ranking affects the salary. There should be an increase in median salary for higher ranks.

It appears that going from Assistant to Associate professor, according to the median, will increase the salary by only about 10,000 dollars a year. That doesn't seem like very much! Full rank definitely seems to provide quite the salary boost, however. This could be biased somewhat, however, because Full is the highest rank we have, meaning professors that have been around for ten years and thirty years fall into this category, so the median will be increased.

Now let's see how salaries rank according to which of the 12 colleges you are in.

I had to shorten the names so they would all fit, but hopefully you can understand what they are. Being a professor in the Business school definitely makes a good living! It is far ahead of the other colleges. Engineering isn't far behind, which is understandable, especially since Texas A&M is known for having a large Engineering college. Liberal Arts and Architecture are the lowest paying, which makes sense considering the conventional wisdom regarding salaries in those fields.

Last, let's examine salaries for a few departments. To do that, we will need the information from the larger, original dataframe we made that has the department titles.

As you can see in the code, FIN = Finance, COMP = Computer Science, METEO = Meteorology, STAT = Statistics, and ENG = English. My current field, Meteorology, falls somewhere in between Computer Science and Statistics (the two main branches of Data Science). Finance sure pays a lot, while English not quite as much.

Our last step for finishing with R is to output our final numeric dataframe and save it as a .csv, where Python will take over for the machine learning portion of this notebook. We are going to save two different dataframes. The first will be our numeric frame, while the second will be a lookup table we can create easily for Python. This will come in handy later when we implement our machine learning algorithm.

We have obtained an external data source, cleaned it up, transformed the data, added new features, and did some exploration of the data. Believe it or not, this work actually ends up taking most of a Data Scientist's time! It is very important, as good features make machine learning algorithms work far better than which model you end up choosing, so it is a step that is worth the time investment.

We will now start Part 2 of this notebook: how to take this new dataset and make a predictive model out of it that we can actually use. While this work could be done in R, I personally think Python has better Machine Learning capabilities and can be more easily integrated into existing code than R can.

Now, it is time to switch gears a bit and move back to Python. Our goal is to make a regression model that can be used to predict what your salary would be if you were to become a faculty member at Texas A&M.

We have done most of the necessary preparation already, but a little more remains. First, let's load our two .csv files into Python via the pandas library.

We want to try a couple of different machine learning algorithms for this project. If we were only using a tree method (such as Random Forests or Gradient Boosted Regression Trees), we wouldn't need to alter our data: just run it as is. This is what is convenient about tree methods. However, it may be helpful to also have another algorithm to compare to, such as linear regression. If you are going to use regularization with linear regression (you should!), it requires us to apply two things:

Feature Scaling

One-Hot Encoding

Trees don't care about this, but to set the proper weights for the features in linear regression (and apply regularization appropriately), we need to scale our numerical features. We need One-Hot Encoding for the categorical features because otherwise linear regression will think College 8 is greater than College 1. That isn't really a comparison; they should be treated equally by the regressor.

Let's take care of this now while we can.

First, let's take care of the One-Hot Encoding. We only need to apply this to the categorical variables.

In [87]:

fromsklearn.preprocessingimportOneHotEncoder# Get the categorical features onlycategDF=salaryDF[['Race','College','Rank','SexCode','DeptCode']]encoder=OneHotEncoder()# Create the encoder objectcategDF_encoded=encoder.fit_transform(categDF).toarray()# want it as numpy array since it is not too large # tree algorithms in scikit-learn can't use sparse arrays yet

Now that all of our categorical variables are in zeros and ones, let's add our last feature to the numpy array to create our final feature matrix.

In [88]:

yrs_worked=salaryDF['YearsWorked'].values# Now combine this with the rest of our features.importnumpyasnpx_final=np.c_[yrs_worked,categDF_encoded]

Let's get our salaries as the target. Then, we can create our train/test split.

In [89]:

y_final=salaryDF['BudgetedSalary'].valuesfromsklearn.cross_validationimporttrain_test_splitx_train,x_test,y_train,y_test=train_test_split(x_final,y_final,test_size=0.2,random_state=0)# We now have our train/test splits

What about the feature scaling I talked about earlier?

What a lot of people do is scale the features first, then split into test and training sets. It's easier to do, but this is a bad habit.

DO NOT APPLY THE SAME SCALING TO YOUR TRAINING AND TEST SETS! This is a very common mistake and terribly easy to make. The reason you don't do this is because otherwise the mean and standard deviation being calculated for the scaling includes data from the test set (which in theory you aren't supposed to have access to when building the model!)

Instead, what is more "fair" is to first scale your training set only. Then, apply the exact same scaling to the test set, using the training set average and standard deviation instead. This way, the model won't be unfairly biased by information in the test set. Let's do that here.

Similar to the first project I showed regarding classification of movie reviews, we are going to use GridSearch from scikit-learn to help tune our models.

The first model we are going to try to start with is linear regression with L2 regularization (also known as ridge regression). There is a built-in normalization (feature scaling) option, but since we also have categorical variables that are one-hot encoded, this wouldn't work very well.

Let's set up a ridge regressor with 10 different values for the regularization parameter (alpha). You will need to wait about 30 seconds or so for all of the grid search runs to be completed.

In [91]:

fromsklearn.grid_searchimportGridSearchCVfromsklearn.linear_modelimportRidgefromsklearn.metricsimportmean_absolute_errorridge_params={'alpha':np.linspace(1,10,10)}# Try alpha values ranging from 1 to 10.ridge_model=GridSearchCV(Ridge(),ridge_params,scoring='mean_absolute_error',cv=20)# Run a grid search on all 10, with 20-fold# cross-validation# Fit the model. ridge_model.fit(x_train,y_train)# See how well it did.y_true,y_pred=y_test,ridge_model.predict(x_test)print'Mean absolute error of ridge regression was:'print(mean_absolute_error(y_true,y_pred))

Mean absolute error of ridge regression was:
20088.8770539

Okay, now we have a model as a baseline to compare against. Typically, Gradient Boosted Regression Trees are the go-to solution for any regression problem that doesn't involve a large number of training examples/features (greater than 100,000 or so). They allow non-linear interactions between features that linear regression can't do, so it tends to be more accurate.

However, it requires a LOT of tuning for maximum optimization. I spent a fair amount of time trying several different settings to see what worked the best, so it will save us some time here. Let's now use the GBRT model and see if it improves our mean absolute error. This is a more complicated model than ridge regression, so we will have to wait longer for it to be finished.

In [92]:

fromsklearn.ensembleimportGradientBoostingRegressorasGBRTgbrt_params={'learning_rate':[0.01],'max_depth':[6],'min_samples_leaf':[3],'n_estimators':[1000]}# Found these after a lot of experimentation with gridsearch. You can try tuning these to see what results you get.model_gbrt=GridSearchCV(GBRT(random_state=0),gbrt_params,scoring='mean_absolute_error')# Turned off CV since I already tested thismodel_gbrt.fit(x_train,y_train)y_true,y_pred=y_test,model_gbrt.predict(x_test)print'Mean absolute error of GBRT was:'print(mean_absolute_error(y_true,y_pred))

Mean absolute error of GBRT was:
20249.1455361

Well that is a little surprising! This is an unexpected result. I tried a lot of different parameters and nothing seems to improve over the ridge regression model. Let's try another model: random forests. These tend to not work as well for regression problems, but since our GBRT model didn't improve upon the benchmark, let's give them a try.

Fortunately, these don't require as much tuning and tend to run faster as they can be run in parallel if you have multiple cores.

In [98]:

fromsklearn.ensembleimportRandomForestRegressorasRFRrfr_params={'min_samples_leaf':[1,3,5],'n_estimators':[100,300,500]}model_rfr=GridSearchCV(RFR(random_state=0),rfr_params,scoring='mean_absolute_error',cv=5,n_jobs=-1)# n_jobs = -1 allows you to parallelize grid search to the number of cores on your machinemodel_rfr.fit(x_train,y_train)y_true,y_pred=y_test,model_rfr.predict(x_test)print'Mean absolute error of RFR was:'print(mean_absolute_error(y_true,y_pred))

We are now going to create a function to output the results of our model for any combination of parameters we choose. The function will request inputs of race, years of experience, rank (assistant, associate, full), sex, and the department.

The function will automatically find the associated college from the lookup table we loaded earlier, recombine our inputs and change them to the proper format. Then, the final predicted salary will be displayed.

In [99]:

defsalary_prediction(race='W',yrs_experience=5.0,rank='AST',sex='M',dept='English'):''' This function will allow you to predict a faculty member's salary based upon the parameters entered. Inputs: Race (B for black, W for white, A for Asian, and H for hispanic) Years of Experience (input as a decimal, such as 5.2) Tenure rank (AST for assistant, ASC for associate, and FUL for full) Sex (M for male, F for female) Department (Enter as a string. Function will convert to all capital letters and find the college and department code automatically from the lookup table. If there is more than one matching department from the title you entered, you will given the names and asked to enter again, being more specific.) For parts of words that are in multiple departments exactly (such as BIOLOGY), use the regex format ^BIOLOGY instead. Output: Predicted annual salary based on our best model (ridge regression). '''# Create empty array for storage.info_store=np.zeros(5)# Load our dicts.race_dict={'W':1,'B':2,'H':3,'A':4}rank_dict={'AST':1,'ASC':2,'FUL':3}sex_dict={'M':0,'F':1}# Force our race, rank, sex, and department inputs to all capital letters. This makes it easier in case someone enters# lower case instead of upper.race=race.upper()rank=rank.upper()sex=sex.upper()dept=dept.upper()yrs_experience=np.array([float(yrs_experience)])# Force to array type.# Now find the appropriate college and department codes from the lookup table defined outside of the function.dept_info=lookup_table[lookup_table['Dept'].str.contains(dept)]# Loop back and request additional input if more than one department is found as an additional feature.whiledept_info.shape[0]>1:print'More than one department found. Matching departments include:'foriindept_info['Dept']:printidept=raw_input('Please enter a more specific department:')dept=dept.upper()# Fix possible bug when person types it in again.dept_info=lookup_table[lookup_table['Dept'].str.contains(dept)]# Try again.ifdept_info.shape[0]<1:print'No department match found. Try broadening the search (use fewer letters)'return# Exit early.# Now store the appropriate values in our previously empty array.info_store[0]=int(race_dict[race])info_store[1]=int(dept_info['College'])info_store[2]=int(rank_dict[rank])info_store[3]=int(sex_dict[sex])info_store[4]=int(dept_info['DeptCode'])# Get the numerical equivalent of our department from the lookup table.# Anyway, now that we have the inputs, combine them together as a test example for the model to predict.# Use our scaler from the training set defined earlier in the progam.yrs_experience_scaled=scaler.transform(yrs_experience)# Now one-hot encode the categorical features the same way.combined_onehot=encoder.transform([info_store]).toarray()# Merge them together.# Merge them together.final_test_example=np.c_[yrs_experience_scaled,combined_onehot]# Now predict with this.pred_salary=ridge_model.predict(final_test_example)print'The department you chose was:',dept_info['Dept'].values[0]print'Your predicted annual salary is:','%.2f'%pred_salary[0]# End of function

Now let's give it a few tries (feel free to experiment with this yourself).

Our first potential faculty member is a white male with a little over a year of experience. He is at the assistant professor rank in the Meteorology department. What is his expected salary?

Our predicted salary is less than in the Meteorology department, which we would expect. Feel free to try others in the notebook if you are following along.

I found through trial and error that the predictions tend to be less accurate for younger professors (at least comparing to the examples in the database) than for older, more established ones. This is because there are fewer young professors the model can train on, so you have to treat these salary predictions more skeptically.

In this notebook, we took an interesting external dataset and explored it in detail. A fair amount of work was necessary to engineer our features, but we were able to use them afterwards for exploratory data analysis purposes in R. We shaped the data using plyr and visualized it using ggplot2.

Once our EDA was complete, we built a regression model for predicting salaries and tried a few options. Then, we wrote a function to implement the model in an easy way.

Of course there were some steps missing since this is just an example, but this is the basic Data Science workflow from start to finish!

Possible ideas for improvement:

Include more data from other nearby public universities (or at least ones of similar ranking). This can be tricky, however, because different universities have different college structures, so we would have to design new features again.

Try to implement the final prediction function inside of a website by using something such as Flask from Python. The prediction model could be exported via pickling and using sklearn's joblib class.

This website does not host notebooks, it only renders notebooks
available on other websites.