This content is part of the series: Social power, influence, and performance in the NBA, Part 1

Getting
started

In this tutorial series, learn how to use Python, pandas, Jupyter Notebooks,
and a touch of R to analyze how social media affects the NBA. Here in Part 1,
learn the basics of data science and machine learning as applied to the
teams in the NBA. The players of the NBA are the subject of Part 2.

What are data science, machine learning, and AI?

There is a lot of confusion around the terms data science, machine
learning, and artificial intelligence. They are often used interchangeably,
but from a high level:

Data science is a philosophy of thinking scientifically about
data.

Machine learning is a technique in which computers learn from data
without being explicitly programmed.

Artificial intelligence is intelligence exhibited by machines.
(Machine learning is one example of an AI technique; another example is
optimization.)

80/20 machine learning in practice

An overlooked part of machine learning is the 80/20 rule in which
approximately 80 percent of the time is spent getting and manipulating the data,
and 20 percent is devoted to the fun stuff like analyzing data,
modeling the data, and coming up with predictions.

Figure 1. 80/20 Machine learning in practice

A problem of data manipulation that isn't obvious is getting the data in
the first place. It is one thing to experiment with a publicly available
data set; it is another entirely to scrape the internet, call APIs, and
get the data in usable shape. Even beyond those issues, a problem that can
be even more challenging is getting the data into production.

Figure
2. Full-stack data science

Rightfully so, a lot of attention is paid to machine learning and the
skills required to model: applied math, domain expertise, and knowledge of
tooling. To get a production machine-learning system deployed is a whole
other matter. This is covered at a high level in this tutorial
series with the hope that it will inspire you to create machine-learning
models and deploy them into production.

What is
machine learning?

Beyond a high-level description, there's a hierarchy to machine learning.
At the top are supervised learning and unsupervised learning. There are two
types of supervised learning techniques, classification problems and
regression problems, both of which require a training set with labeled data.

An example of a supervised regression machine-learning problem is
predicting future housing prices from historical sales data. An
example of a supervised classification problem is using a historical
repository of images to classify objects in images: cars, houses, shapes,
etc.
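To make supervised regression concrete, here is a minimal sketch using scikit-learn. This example is not part of the original tutorial, and the housing numbers are invented purely for illustration:

```python
# Supervised regression sketch: predict a house price from its size.
# The training data below is invented purely for illustration.
from sklearn.linear_model import LinearRegression

sizes = [[1000], [1500], [2000], [2500]]   # features: square footage
prices = [200000, 300000, 400000, 500000]  # labels: historical sale prices

model = LinearRegression()
model.fit(sizes, prices)  # "train" on the labeled history

# Predict the price of an unseen 1,750-square-foot house
prediction = model.predict([[1750]])[0]
print(round(prediction))  # 350000
```

Because the model has labeled answers (prices) to learn from, this is supervised learning; a classification version would predict a category (for example, "car" versus "house") instead of a number.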

Unsupervised learning involves modeling data that is not labeled. The
correct answer might not be known and needs to be discovered. A common
example is clustering: finding groups of NBA players with things in common
and labeling those clusters manually, for example, top scorers, top
rebounders, etc.

Phrasing the problem: What is the relationship between social
influence and the NBA?

With the basics out of the way, it is time to dig in:

Does individual player performance affect a team's wins?

Does on-the-court performance correlate with social media
influence?

Does engagement on social media correlate with popularity on
Wikipedia?

Is follower count or engagement a better predictor of popularity on
Wikipedia?

Does salary correlate with on-the-court performance?

Does salary correlate with social media performance?

Does winning bring more fans to games?

What drives the valuation of teams: attendance, local real estate
market?

To answer these questions and others, it is necessary to retrieve several
categories of data:

Wikipedia popularity

Twitter engagement

Arena attendance

NBA performance data

NBA salary data

Figure 3. NBA
data sources

Going deep into the 80-percent problem: Gathering data

Gathering this data is a nontrivial software engineering problem. The
first step to collecting all of the data is figuring out where to start.
For this tutorial, a good place to start is to collect all the players
from the NBA 2016-17 season.

This brings up a helpful point about how to collect data: If it is easy to
collect the data manually, for example, by downloading it from a website
and cleaning it up in Excel, then that is a reasonable way to start a
data science problem. If collecting and manually cleaning one data source
takes more than a few hours, then it's probably best to write code to
solve the problem.

All of the source code and data for this tutorial is also in a GitHub
repo: Social Power NBA.

Fortunately, collecting the first data source is as simple as downloading a
CSV from Basketball Reference. Now that the first data collection is out
of the way, it's time to quickly explore what it looks like using pandas
and Jupyter Notebook. Before you can run some code, you need to:

Create a virtual environment (based on Python 3.6)

Install the packages used in this tutorial: pandas and Jupyter
Notebook.

Because the pattern of installing packages and updating them is so common,
I put it into a Makefile, as shown below:

Listing 1. Makefile contents
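The Makefile contents are not reproduced here; a minimal sketch of what such a Makefile might contain follows. The environment path ~/.nba and the requirements.txt filename are assumptions:

```makefile
setup:
	python3 -m venv ~/.nba

install:
	pip install -r requirements.txt
```

With something like this in place, make setup creates the virtual environment and make install pulls in pandas and Jupyter Notebook from requirements.txt.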

Another trick is to create an alias so that when you want to work on a
particular project, you automatically source the virtualenv
when you cd into the project. The contents of the .zshrc file
with this alias inside look like:
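The alias itself is not shown; a sketch of what it might look like follows. Only the alias name nbatop comes from the text, and the checkout path and virtualenv location are assumptions:

```shell
# ~/.zshrc: jump to the project and activate its virtualenv in one command
alias nbatop="cd ~/src/socialpowernba && source ~/.nba/bin/activate"
```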

To start the virtual environment, type nbatop. It will
cd into the checkout and activate your virtualenv.

To inspect the data set you downloaded or used
from the GitHub repo:

Start Jupyter Notebook by typing jupyter notebook. Running this launches a web browser in which you can explore existing
notebooks or create new ones.

If you are using the files in the GitHub repo, look for basketball_reference.ipynb, which is a simple
notebook that looks at the data inside.

You can create your notebook using the menu on the web or load the
notebook in the GitHub repo called basketball_reference.
To perform an initial validation and exploration, load a CSV
file into a pandas data frame. Loading a CSV file into pandas is easy, but
there are two caveats:

The columns in the CSV file must have names.

Each row must have the same number of columns.

Listing 2 shows how to load the file into pandas.

Listing 2. Jupyter Notebook basketball reference
exploration
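The listing boils down to pd.read_csv followed by describe. The sketch below substitutes a small inline sample for the full nba_2017_br.csv download so that it runs on its own; the three sample rows reflect the 2016-17 scoring leaders:

```python
import io
import pandas as pd

# In the notebook this would be: nba = pd.read_csv("nba_2017_br.csv")
# A small inline sample stands in for the downloaded file here.
csv_data = io.StringIO(
    "Player,Age,PTS\n"
    "Russell Westbrook,28,31.6\n"
    "James Harden,27,29.1\n"
    "Isaiah Thomas,27,28.9\n"
)
nba = pd.read_csv(csv_data)
print(nba.describe())  # count, mean, std, min, quartiles (50% = median), max
```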

The following image shows the result of loading the data. The describe function on a
pandas data frame provides descriptive statistics, including the count for
each of the 27 columns and the median (the 50-percent row) of each. At this
point, it might be a good idea to play around with the Jupyter Notebook
you created and see what insights you can observe. To learn more about
what pandas can do, see the official pandas tutorial page.

Figure 4. NBA dataset load and describe

One thing this data set doesn't have is a clear way to rank
offensive and defensive performance of a player in one statistic.
There are a few ways to rank players in the NBA using just one statistic.
The website FiveThirtyEight has a CARMELO ranking
system. ESPN has Real
Plus-Minus, which includes a handy output of wins attributed to
each player. The NBA's single-number statistic is called PIE (Player Impact Estimate).

The difficulty level increases slightly when you get the data from both
ESPN and the NBA websites. One approach is to scrape the website using a
tool such as Scrapy. The approach used
in this tutorial is a bit simpler than that, though. In this case, cutting
and pasting from the website into Excel, manually cleaning up the data,
then saving the data as a CSV is quicker than writing code to do it.
Later, if this turns into a bigger project, this approach might not work
as well, but for this tutorial, it's a great solution. A key takeaway for
messy data science problems is to keep making forward progress
quickly without getting bogged down in too much detail.

“It is possible to spend a lot of time perfecting a way to
get a data source and clean it up, then realize the data isn't helpful
to the model you are creating.”

The image below shows the NBA PIE dataset, which has a count of 486 (486
rows). Getting the data from ESPN is a similar process. Other data
sources to consider are salary and endorsements. ESPN has the salary
information, and Forbes has a small subset of the endorsement data. Both
of these data sources are in the GitHub project.

Figure 5. NBA
PIE dataset

Table 1 lists the data sources by name and location. In
short order, we have gathered many items from many different data sources.

Table 1. NBA
data sources

Data source          | Filename                     | Rows | Summary
---------------------|------------------------------|------|-------------------
Basketball-Reference | nba_2017_attendance.csv      | 30   | Stadium attendance
Forbes               | nba_2017_endorsements.csv    | 8    | Top players
Forbes               | nba_2017_team_valuations.csv | 30   | All teams
ESPN                 | nba_2017_salary.csv          | 450  | Most players
NBA                  | nba_2017_pie.csv             | 468  | All players
ESPN                 | nba_2017_real_plus_minus.csv | 468  | All players
Basketball-Reference | nba_2017_br.csv              | 468  | All players
FiveThirtyEight      | nba_2017_elo.csv             | 30   | Team rank

There is still a lot of work to do to get all of the data downloaded and
transformed into a unified data set. To make matters worse, collecting
the data thus far was the easy part; there is still a big journey ahead. In
looking at the shape of the data, a good place to start is to take the top
eight players' endorsements and see if there is a pattern to tease out.
Before that, though, explore the valuation of teams in the NBA. From there,
you can determine what impact a player has on the total value of an NBA
franchise.

Exploring team valuation for the NBA

The first order of business is to create a new Jupyter Notebook. Luckily
for you, the Jupyter Notebook is already created. You'll find it in the
GitHub repo: exploring_team_valuation_nba.

Next, import a common set of libraries that are typically used to explore
data in a Jupyter Notebook.
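The import listing is not shown; a typical set of imports for this kind of exploration (the exact list in the notebook may differ) looks like the following:

```python
import pandas as pd                    # data frames
import numpy as np                     # numeric arrays
import matplotlib
matplotlib.use("Agg")                  # headless backend; in Jupyter use %matplotlib inline
import matplotlib.pyplot as plt        # plotting
import seaborn as sns                  # statistical charts (pairplot, heatmap)
import statsmodels.formula.api as smf  # formula-based regression
```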

Listing 6. Seaborn pairplot

Looking at the plots, notice the relationship between average attendance
and player valuation. There is a strong linear relationship between the
two features, as represented by the almost straight line formed by the
points.

Figure 10. Seaborn correlation plot NBA attendance versus
valuation

The correlation plot shows a relationship to value in millions of dollars
(of an NBA team), percentage of average capacity of the stadium
that is filled (PCT), and average attendance. A heatmap showing average
attendance numbers versus valuation for every team in the NBA will help you
dive into this a bit more. To generate a heatmap in Seaborn, it is
necessary to reshape the data into a pivot table (much like what is
available in Excel). A pivot table lets the Seaborn chart pivot among
three values and show how each of the three columns relates to
the other two. The code below shows how to reshape the data into a
pivot shape.
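A sketch of the reshape follows; the column names TEAM, AVG, and VALUE_MILLIONS and the figures are assumptions standing in for the merged frame:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import pandas as pd
import seaborn as sns

# Invented stand-in for the merged team data
df = pd.DataFrame({
    "TEAM": ["Warriors", "Knicks", "Lakers"],
    "AVG": [19500, 18800, 18900],
    "VALUE_MILLIONS": [2600, 3300, 3000],
})
# Reshape into a pivot table: one axis per column, valuation as the cell value
pivot = df.pivot(index="TEAM", columns="AVG", values="VALUE_MILLIONS")
ax = sns.heatmap(pivot)  # render the pivot table as a heatmap
ax.figure.savefig("nba_heatmap.png")
```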

Figure 11. Seaborn correlation plot NBA attendance versus
valuation

One way to investigate further is to perform a linear regression using the
Statsmodels package. According to Statsmodels.org, the
Statsmodels package "is a Python module that provides classes and
functions for the estimation of many different statistical models, as well
as for conducting statistical tests, and statistical data exploration. An
extensive list of result statistics are available for each estimator."

You can install the Statsmodels package by using pip install
statsmodels. Following are the three lines necessary to run the regression.

Listing 9. Linear regression VALUE ~ AVG
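The three lines are not shown; given the VALUE ~ AVG formula in the listing title, they presumably resemble the sketch below. The data frame here is synthetic:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in: team valuation (VALUE, $M) vs. average attendance (AVG)
df = pd.DataFrame({
    "AVG": [16500, 17200, 18000, 18800, 19500, 20000],
    "VALUE": [1100, 1300, 2200, 3300, 2600, 2700],
})
# Build the ordinary least squares model, fit it, and print the summary
results = smf.ols("VALUE ~ AVG", data=df).fit()
print(results.summary())
```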

The image below shows the output of the regression. The R-squared shows that
approximately 28 percent of the valuation can be explained by attendance, and the
P value of 0.044 falls within the range of being statistically significant.
One potential issue with the data is the plot of the residual
values doesn't look completely random. This is a good start of trying to
develop a model to explain what creates the valuation of an NBA
franchise.

Figure 12. Regression with residual plot

One way to potentially add more to the model is to add in the ELO numbers
of each team. According to Wikipedia,
"The ELO rating system is a method for calculating the
relative skill levels of players in competitor-versus-competitor games
such as chess." The ELO rating system is also used in sports.

ELO numbers have more information than a win/loss record because they rank
according to the strength of the opponent played against. It seems like a
good idea to investigate whether how good a team is affects the
valuation.
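The merge itself is essentially a one-liner in pandas. A sketch follows; the team names are real, but the attendance, valuation, and ELO figures are invented for illustration:

```python
import pandas as pd

# Invented stand-ins for the two data sources being joined
attendance_valuation = pd.DataFrame({
    "TEAM": ["Warriors", "Jazz", "Knicks"],
    "AVG": [19500, 19900, 18800],
    "VALUE_MILLIONS": [2600, 910, 3300],
})
elo = pd.DataFrame({
    "TEAM": ["Warriors", "Jazz", "Knicks"],
    "ELO": [1771, 1613, 1393],
})
# Join the two frames on the team name
merged = attendance_valuation.merge(elo, on="TEAM")
print(merged)
```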

After the merge, there are two charts to create. The first, shown in
Figure 13, is a new correlation heatmap. There are some positive
correlations to examine more closely. In particular, attendance and ELO
seem worth plotting out. In the heatmap below, the lighter the color, the
more highly correlated two columns are. Where a column is compared
against itself, the correlation is 1, and the square is
beige. In the case of TOTAL and ELO,
there appears to be a 0.5 correlation.

Figure 13. ELO correlation heatmap

Figure 14 plots ELO versus attendance. There does appear to be a weak
linear relationship between how good a team is (ELO RANK) versus the
attendance. The plot below colors the east and west scatter plots
separately, along with a confidence interval. The weak linear relationship
is represented by the straight line going through the points in the
X,Y space.

You can see from the plot shown below that there are three distinct
groups, and the centers of the clusters represent different labels. Note
that sklearn's MinMaxScaler is used to scale all of the columns to values
between 0 and 1, normalizing the differences in scale between
features.
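A sketch of the scaling-and-clustering step follows; the feature values are invented, and the real notebook would cluster all 30 teams:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Invented team features; the notebook uses attendance, ELO, and valuation
df = pd.DataFrame({
    "AVG": [19500, 19900, 18800, 15500, 16200, 17000],
    "ELO": [1771, 1613, 1393, 1350, 1410, 1520],
    "VALUE_MILLIONS": [2600, 910, 3300, 800, 850, 1200],
})
# Scale every column to [0, 1] so no single feature dominates the distance metric
scaled = MinMaxScaler().fit_transform(df)

# Group the teams into three clusters with k-means
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(scaled)
print(labels)  # one cluster label (0, 1, or 2) per team
```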

Figure 16. Team
clusters

The image below shows the membership of cluster 1. The main takeaway from
this cluster is that its members are both the best teams in the NBA and
the teams with the highest average attendance. Where things break apart is
on total valuation. For example, the Utah Jazz is a very good team
according to ELO, and it has very good attendance, but it is not valued as
highly as other members of the cluster. This may mean there is an
opportunity for the Utah Jazz to make small changes that significantly
raise the valuation of the team.

Figure 15.
Cluster membership

Conclusion

In Part 1 of this two-part series, you learned the basics of data science
and machine learning, and started to explore the relationship of valuation,
attendance, and winning NBA teams. The tutorial's code was kept
in a Jupyter
Notebook you can reference here.
Part 2 leaves the teams
and explores individual athletes in the NBA. Endorsement data, true
on-the-court performance, and social power with Twitter and Wikipedia are
explored.

The lessons learned so far from the data exploration are:

Valuation of an NBA team is affected by average attendance.

ELO ranking (strength of team's record) is related to attendance.
Generally speaking, the better a team is, the more fans attend
games.