Fabian Buentello

NBA Machine Learning Chapter 4

15 minute read

Chapter 4. Machine Learning

7/24/2015

Intro

The following command will sync your repo with mine if you’re having issues:

$ git checkout startChapter4 -f

I want to mention this before continuing. When it comes to programming the machine learning portion, it’s actually extremely easy and fast. This tutorial will not teach you everything you need to know about machine learning; it will simply show you how to implement it. If you want to continue with machine learning, I will be referencing free online courses and tutorials that helped me along the way.

So far, we have read in data from CSV files. We cleaned the data, we structured the data, and we exported the data. I’ve been waiting days to say this: now let’s do something with the data!

Before we start I want to go on the record about something. If I told you that I was a PYTHON GURU, I would be lying like the rug.

With the exception of Codecademy exercises and “Hello World!”-type programs, this is my first real program in Python. I apologize if some of my code is very noob-ish. If anyone knows a better way of implementing some of it, by all means, make a pull request to the repo.

You can either use the checkout above or simply create these two new files:

nbaImport.py

runNBA_Data.py

In nbaImport.py, we will have two variables and the methods we need in order to connect to mongo. Let’s get started with that. Add this comment section at the top of the page:

The following methods will connect us to our MongoDB. I followed this StackOverflow answer as a guide for connecting to MongoDB. It’s pretty straightforward; I made a few tweaks to it. Add the following methods to nbaImport.py:

# Connect to MongoDB
def connectMongo(db, host='localhost', port=27017, mongo_uri=None):
    """ A util for making a connection to mongo """
    if mongo_uri:
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)
    return conn[db]


def readMongo(db, collection, query={}, queryReturn=None, _limit=None,
              no_id=True, mongo_uri=None):
    """ Read from Mongo and Store into DataFrame """
    # Connect to MongoDB
    db = connectMongo(db=db, mongo_uri=mongo_uri)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query, queryReturn)

    # Check if a limit was set
    if _limit:
        cursor = cursor.limit(_limit)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df

I will explain DataFrame when we get to runNBA_Data.py. Almost forgot, add these imports at the top of nbaImport.py:
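The import snippet didn’t survive in this copy of the post. Based on the names used in the two methods above (MongoClient and pd), a minimal set would look like this, assuming pymongo and pandas are installed:

```python
# Assumed imports for nbaImport.py -- the methods above use MongoClient and pd
import pandas as pd
from pymongo import MongoClient
```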

Now let’s go over to our runNBA_Data.py and implement our four functions. That’s right, only four functions, and one of them I consider a helper, so you know it’s short. Let’s import our modules at the top of the page.
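The exact import list isn’t shown in this copy, but based on the functions used later in the chapter, it would look something like this. Note that this post predates the scikit-learn reorganization: learning_curve lived in sklearn.learning_curve back then, but has since moved to sklearn.model_selection in newer versions:

```python
# Assumed imports for runNBA_Data.py, based on the functions used below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.learning_curve import learning_curve  # sklearn.model_selection in newer versions
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from nbaImport import readMongo
```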

Part 1 - We are using the readMongo() method that we imported. We set query = {} so it returns everything. queryReturn is the projection document of the query, meaning it tells Mongo: “Hey, I don’t care about all that other stuff, I only want back these values here (WANTED_FEATURES).”

Part 2 - Picture a DataFrame as a spreadsheet/SQL table. Since nbaFrame.Seasons returns an array of objects, and objects don’t go well with spreadsheets, we need to use our flatten() function to transform our data so it can work with the DataFrame.
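To make the flattening step concrete, here’s a minimal sketch of what a flatten() along these lines might do. The shape of the Seasons objects and the column names are assumptions for illustration, not the tutorial’s exact code:

```python
import pandas as pd

def flatten(nbaFrame):
    """Expand each player's Seasons array into one flat row per season."""
    rows = []
    for _, player in nbaFrame.iterrows():
        for season in player["Seasons"]:      # each season is a dict of stats
            row = dict(season)
            row["Player"] = player["Player"]  # keep the player's name on each row
            rows.append(row)
    return pd.DataFrame(rows)

# Tiny example: one player with two nested seasons becomes two flat rows
nbaFrame = pd.DataFrame([{
    "Player": "Derrick Rose",
    "Seasons": [{"Year": "2010-11", "PER": 23.5},
                {"Year": "2011-12", "PER": 23.0}],
}])
statsDF = flatten(nbaFrame)
```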

Let’s console.lo.... I mean, print(statsDF) after we set it. Add the following line after we set statsDF:

print(statsDF)

Be sure you have your MongoDB instance running if you’re not using Mongolab. Let’s run this code:

$ python3 runNBA_Data.py

You should’ve gotten something similar to this:

Do you see what I was talking about with DataFrames and objects? Good luck trying to put that in a spreadsheet.

Let’s go ahead and finish up this function, paste the following in BuildDataSet():

Part 1 - Make a new DataFrame object called stats, which is set to the values of all the totals stats. We make two new features called FT_M and FG_M, which calculate free throws missed and field goals missed. Lastly, we convert all the numbers to floats.

Part 2 - We add PER to the stats from the advanced stats.

Part 3 - We randomize our data, which is very common in machine learning. Next, we set our input X and output y.
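Put together, the three parts might look something like this sketch. The column names (FT, FTA, FG, FGA, PER) follow basketball-reference’s totals and advanced tables, but the code here is an approximation of the steps described above, not the repo’s exact function:

```python
import numpy as np
import pandas as pd

def BuildDataSet(statsDF):
    # Part 1: keep the totals stats and derive the "missed" features
    stats = statsDF[["FT", "FTA", "FG", "FGA"]].copy()
    stats["FT_M"] = stats["FTA"] - stats["FT"]   # free throws missed
    stats["FG_M"] = stats["FGA"] - stats["FG"]   # field goals missed
    stats = stats.astype(float)

    # Part 2: PER (from the advanced stats) is what we want to predict
    stats["PER"] = statsDF["PER"].astype(float)

    # Part 3: shuffle the rows, then split into input X and output y
    stats = stats.reindex(np.random.permutation(stats.index))
    X = stats.drop("PER", axis=1).values
    y = stats["PER"].values
    return X, y

# Tiny example with made-up totals
statsDF = pd.DataFrame({"FT": [100, 50], "FTA": [120, 80],
                        "FG": [300, 200], "FGA": [600, 500],
                        "PER": [20.1, 15.3]})
X, y = BuildDataSet(statsDF)
```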

Now to our PlotLearningCurve(), I am not going to go into detail as to how the machine learning portion is working. I will, however, be referencing documentation which goes into much more detail and contains examples.

Part 1 - Machine Learning: [learning_curve()](http://scikit-learn.org/stable/modules/generated/sklearn.learning_curve.learning_curve.html#sklearn.learning_curve.learning_curve) returns the data we need for our graph. So that you have an idea of what a learning curve looks like, check out this example. The variables algorithm, X_data, and y_data will make more sense when we get to the Analysis() function. Our sizes variable tells learning_curve(), “Hey, I want you to calculate the accuracy at 10%, 20%, 50%, 80%, and 99% of the training data.” The variables ending in _mean and _std are simply the average and the standard deviation of the scores at each of those percentages.

Part 2 - Plot Graph: Here we are placing the data onto the graph and finally displaying it at the end. I put comments to help explain what’s going on since it’s pretty straightforward. Just setting up labels, legends, titles, etc.
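For reference, a learning-curve plot along these lines can be sketched as follows. I’m using the newer sklearn.model_selection import path (the post predates that move), and the plotting details are plausible assumptions rather than the post’s exact code:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen; the post displays a window instead
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def PlotLearningCurve(X_data, y_data, algorithm):
    # Part 1: measure accuracy at 10%, 20%, 50%, 80%, and 99% of the data
    sizes = np.array([0.1, 0.2, 0.5, 0.8, 0.99])
    train_sizes, train_scores, test_scores = learning_curve(
        algorithm, X_data, y_data, train_sizes=sizes, cv=5)

    # Average and spread of the scores at each training size
    train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
    test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)

    # Part 2: shade one standard deviation around each curve, then plot the means
    plt.fill_between(train_sizes, train_mean - train_std,
                     train_mean + train_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_mean - test_std,
                     test_mean + test_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_mean, "o-", color="r", label="Training score")
    plt.plot(train_sizes, test_mean, "o-", color="g", label="Cross-validation score")

    # Labels, legend, title, etc.
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.title("Learning Curve")
    plt.ylim((0, 1.1))
    plt.legend(loc="best")
    plt.close()  # (the post calls plt.show() here to display the graph)
    return train_sizes, train_mean, test_mean

# Quick demo on synthetic regression data
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5, random_state=0)
sizes_out, train_mean, test_mean = PlotLearningCurve(X_demo, y_demo, LinearRegression())
```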

Part 1 - Here we get our X and y which we get returned from BuildDataSet(). Next, we make a new variable called linear_regression which is set to our imported [LinearRegression()](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) algorithm class.

Part 2 - We’re going to be adding [Polynomial Features](http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#example-model-selection-plot-underfitting-overfitting-py) to our algorithm. This video by Andrew Ng does an extraordinary job explaining the importance of this. I will be speaking about Andrew’s course in the conclusion of this tutorial.

Part 3 - We use our [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class to combine polynomial_features with our LinearRegression() algorithm.

Part 4 - Lastly, we pass our X, y, and algorithm as arguments to PlotLearningCurve().
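Stitched together, the pipeline wiring inside Analysis() looks roughly like this. BuildDataSet() and PlotLearningCurve() are the functions described above, so this sketch only shows the PolynomialFeatures + LinearRegression combination (the helper name make_algorithm is mine, not the repo’s), plus a tiny demo of why a higher degree helps on curved data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

def make_algorithm(_deg=1):
    """Combine PolynomialFeatures with LinearRegression in a Pipeline."""
    linear_regression = LinearRegression()
    polynomial_features = PolynomialFeatures(degree=_deg, include_bias=False)
    return Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", linear_regression)])

# With _deg=1 the pipeline is plain linear regression; _deg=2 adds squared
# and interaction terms, which is what Analysis(2) takes advantage of.
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2  # a curve that a straight line can't fit well

linear_score = make_algorithm(1).fit(X, y).score(X, y)
quadratic_score = make_algorithm(2).fit(X, y).score(X, y)
```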

Are y’all ready to predict a player’s PER? Let’s do one last checkup before running it.

I bet you’re saying, “Fabian, what the hell man? I thought you said 95%!?!? What the hell is this 40% that I’m seeing?!? I can’t get jiggy with this!!” I know, I know!! Something went wrong, I think we might have missed something, let’s go back to Chapter 1 and redo it all….. Nah, I’m just kidding. When I first saw this, it kinda scared me. But don’t worry, I got this. Let’s open up basketball-reference in a new tab and click on the PER stat, like so:

It should put all the PER stats in descending order. As you see, our data could use a bit more cleaning. There’s no way Sim Bhullar should have a higher PER than Anthony Davis, Stephen Curry, and especially the best 2-Guard in the game right now, James Harden. Some may find that last statement a bit biased, I don’t know, you tell me, does this gif lie?

I see what’s going on: it’s the number of games (G) the player has played. That’s why Sim Bhullar has a high PER.

I know what you’re thinking, “Fabian, are we going to have to go back to Chapter 2 and redo something?” The answer is “if I was an asshole, I would make you do that, but I’m not. So I’ll tell you what to do.” We’re going to make a new collection in our database that excludes seasons from players who did not play a certain number of games. Derrick Rose will act as our example. We will use all the seasons Derrick has played except his ‘12-‘13 and ‘13-‘14 seasons, during which he was hurt.
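The idea of dropping low-games seasons can be sketched with a simple pandas filter. The threshold and the toy numbers below are purely illustrative, not the exact values used to build the new collection:

```python
import pandas as pd

MIN_GAMES = 41  # illustrative threshold: at least half an 82-game season

# Toy season rows (made-up numbers, just to show the filter)
seasons = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "G":      [3, 81, 51],
    "PER":    [25.0, 26.7, 15.9],
})

# Keep only seasons where the player appeared in enough games
filtered = seasons[seasons["G"] >= MIN_GAMES].reset_index(drop=True)
```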

I’m sure you’re saying: “Fabian, 85% does not equal 95%! I’m getting really close to just leaving this tutorial! I’m tired of you playing with my emotions!!”

OK, I’m sorry, this is the last fix, I PROMISE! Remember that polynomial_features? Since _deg gets set to 1 in Analysis(_deg=1), we aren’t taking advantage of that feature. So at the very bottom of the page, on the very last line where you see Analysis() all by itself, replace it with this:

Analysis(2)

and then run it. Watch your accuracy shoot up! By the way, it may take a little longer than before. Let’s try 3 next.

Analysis(3)

Hopefully you got this:

There’s your 95% + 1!

Let’s zoom into the graph a bit. At the end of PlotLearningCurve(), replace the following lines:

plt.ylim((0,1.1))
plt.legend(loc="best")

with:

plt.ylim((0.5,1.1))
plt.xlim((0,6500))
plt.legend(loc="best")

Looks a lot better, doesn’t it?

Conclusion

I hope you were able to take something away from this tutorial. Machine learning is a very exciting field and I highly recommend you look into it. As you’ve seen, collecting and cleaning the data takes more time than the actual machine learning. Machine learning is all about theory:

What are you trying to predict?

What are your features?

What algorithm are you going to use?

Machine learning is nothing like programming. In programming, you can grind it out, meaning you can work when your body is completely drained and still get something done. Machine learning is not like that; you need a clear mind and a real understanding of your data.

I want to thank basketball-reference and all the hard-working teams involved for allowing me to use their data. I also want to thank Sean Forman for allowing me to make their data available for download in this tutorial. I want to thank Andrew Ng for his amazing FREE Machine Learning course on Coursera. Lastly, I want to thank sentdex for his great Python machine learning tutorials on YouTube. Both sentdex’s and Andrew’s videos really helped out! I recommend you watch them (Andrew’s before sentdex’s) if you want to get into machine learning. If you’ve made it all the way down to this sentence, I want to personally thank you for taking the time out of your day to do this tutorial. I hope you took something away from it, but most importantly, I hope you had a good time. Any feedback would be appreciated. Thank you again!