NBA Data Science: Breaking Down NBA Data

Recently, the Oklahoma City Thunder and the San Antonio Spurs played to a frenetic 112 – 106 OKC win on October 28th. It was one heck of a statement to open the year for the NBA as the Thunder returned to the court as a complete, healthy unit for the first time since February of last season. It was also a solid win for Billy Donovan, as this was his first game as the OKC head coach.

So for this post, let’s take a look at a game event from the Spurs – Thunder game. By going to the box score on stats.nba.com, we find that the NBA lists this game as “Game 0021500013.” This identifier simply notes that the OKC – San Antonio game is the thirteenth game of the 2015-16 NBA season. In total, there are 1,230 NBA games during the regular season. If we look at the raw play-by-play data, we obtain the following cryptic sequence:

Roughly every 0.04 seconds, we obtain this vector, which is called an NBA Moment. It is a location vector that identifies the 10 players and the basketball. In total, there are five scalar values and an 11×5 matrix resting in the sixth vector entry. So let’s break down each part of the data vector.

3. This is the first value of the vector. This value indicates the quarter in which the moment takes place.

1446082223024. This is the second value of the vector. This is Unix epoch time, reported by the system on which the data is collected. This is known as absolute time and is measured in milliseconds elapsed since midnight UTC on January 1st, 1970. Converting this specific value, we obtain 8:30:23.024 PM Central time on October 28th, 2015.
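As a quick check of that conversion, here is a minimal Python sketch. It assumes the value is in milliseconds and uses a fixed UTC-5 offset for Central Daylight Time, which was in effect on that date; the variable names are ours and are not part of the NBA data.

```python
from datetime import datetime, timedelta, timezone

epoch_ms = 1446082223024  # second entry of the moment, assumed to be milliseconds

# Add the elapsed milliseconds to the Unix epoch (midnight UTC, January 1st, 1970).
unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
utc_time = unix_epoch + timedelta(milliseconds=epoch_ms)

# Assumption: a fixed UTC-5 offset (Central Daylight Time) instead of a full
# time zone database lookup.
central = timezone(timedelta(hours=-5), name="CDT")
local_time = utc_time.astimezone(central)

print(local_time.isoformat(timespec="milliseconds"))
# 2015-10-28T20:30:23.024-05:00
```

For games played after daylight saving time ends, the offset would be UTC-6, so a real pipeline would likely want a proper time zone library rather than a hard-coded offset.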

481.51. This is the third value of the vector. This is time left on the game clock in seconds. Therefore there are 481.51 seconds remaining, which is 8:01.51 remaining in the third quarter.
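The same arithmetic can be written as a small helper. This is only an illustrative sketch; the function name format_game_clock is our own and does not come from the NBA data.

```python
def format_game_clock(seconds_left: float) -> str:
    """Format seconds remaining in a quarter as M:SS.hh."""
    # Work in whole hundredths of a second to avoid floating-point surprises.
    total_hundredths = round(seconds_left * 100)
    minutes, remainder = divmod(total_hundredths, 6000)
    seconds, hundredths = divmod(remainder, 100)
    return f"{minutes}:{seconds:02d}.{hundredths:02d}"


print(format_game_clock(481.51))  # 8:01.51
```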

5.56. This is the fourth value of the vector. This is the time remaining on the shot clock for the possession in seconds.

null. This is the fifth value of the vector. This value is typically null, and we have been unable to discern its meaning.

[-1,-1,10.29043, 39.06515, 2.68302]. This is the first element of the 11×5 matrix in the sixth element of the data vector. The values -1,-1 indicate the basketball. The values 10.29043, 39.06515 indicate the position of the basketball on the court. Here, the point 0,0 is located in a corner where the baseline meets the sideline. The values are measured in feet. Finally, the value 2.68302 most plausibly gives the height of the basketball above the floor rather than its diameter, since a regulation ball is only about 0.78 feet across.

[1610612760, 2555, 5.19317, 21.95614, 0.0]. This is the second element of the 11×5 matrix in the sixth element of the data vector. The value 1610612760 is the Team ID, which here is Oklahoma City. Each team has a roster associated with it, and a player on that roster is looked up by the second value, 2555. Here 2555 is the Player ID, which is Nick Collison. With 481.51 seconds remaining in the third quarter against the San Antonio Spurs, Collison is located at 5.19317, 21.95614 on the court. The final value of 0.0 is merely a placeholder for the ball-only fifth entry and ensures that every row has five values, so the matrix stays 11×5.
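To make the layout concrete, here is a minimal Python sketch that unpacks a moment into named fields. The class and function names (Entity, Moment, parse_moment) are our own, not part of any NBA API, and the sample moment contains only the two rows quoted above rather than all 11.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    team_id: int    # -1 for the ball
    player_id: int  # -1 for the ball
    x: float        # feet
    y: float        # feet
    fifth: float    # fifth column: meaningful only for the ball, 0.0 for players

    @property
    def is_ball(self) -> bool:
        return self.team_id == -1 and self.player_id == -1


@dataclass
class Moment:
    quarter: int
    epoch_ms: int
    game_clock: float  # seconds left in the quarter
    shot_clock: float  # seconds left on the shot clock
    unknown: Optional[object]  # fifth entry, typically null
    entities: List[Entity]


def parse_moment(raw: list) -> Moment:
    """Unpack one raw moment (five scalars plus a list of 5-value rows)."""
    quarter, epoch_ms, game_clock, shot_clock, unknown, rows = raw
    return Moment(quarter, epoch_ms, game_clock, shot_clock, unknown,
                  [Entity(*row) for row in rows])


# Only the two rows discussed in the text; a real moment has 11.
raw_moment = [
    3, 1446082223024, 481.51, 5.56, None,
    [
        [-1, -1, 10.29043, 39.06515, 2.68302],
        [1610612760, 2555, 5.19317, 21.95614, 0.0],
    ],
]

moment = parse_moment(raw_moment)
ball = next(e for e in moment.entities if e.is_ball)
players = [e for e in moment.entities if not e.is_ball]
print(ball.x, ball.y, players[0].player_id)
```

Keeping the ball and the players in the same Entity type mirrors the raw matrix, where every row has the same five columns.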

We can repeat this process for the remainder of the players, but the gist is the same. Using this data, we can identify simple quantities such as distances traveled, speed of players, acceleration rates, locations on the court, and more advanced values. We can also use it for spatio-temporal analysis of NBA plays or to build animations of plays.
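As a rough illustration of the distance and speed calculations, here is a small Python sketch. The first position is Collison's location from the moment above; the second position and its timestamp are hypothetical values made up for the example, spaced 0.04 seconds apart, and the function names are ours.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def path_length(points: List[Point]) -> float:
    """Total distance in feet along a sequence of (x, y) positions."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def average_speed(points: List[Point], epoch_ms: List[int]) -> float:
    """Average speed in feet per second over the tracked positions."""
    elapsed_s = (epoch_ms[-1] - epoch_ms[0]) / 1000
    return path_length(points) / elapsed_s


start = (5.19317, 21.95614)                # from the moment discussed above
end = (start[0] + 0.3, start[1] + 0.4)     # hypothetical next position
positions = [start, end]
timestamps = [1446082223024, 1446082223064]  # second timestamp is hypothetical

print(round(path_length(positions), 3))              # roughly 0.5 feet
print(round(average_speed(positions, timestamps), 1))  # roughly 12.5 feet per second
```

Acceleration follows the same pattern: compute the speed over consecutive pairs of moments and divide the change in speed by the time between them.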

In future posts, we will take brief looks at different types of analysis that we can perform on this type of NBA data. This is an incredibly rich data set that exists for every game. What would you like to see performed from this data?