How-to: Analyze Fantasy Sports using Apache Spark and SQL

As part of the drumbeat for Spark Summit West in San Francisco (June 6-8), learn how analyzing stats from professional sports leagues is an instructive use case for data analytics using Apache Spark with SQL.

In the United States, many diehard sports fans morph into amateur statisticians to get an edge over the competition in their fantasy sports leagues. Depending on one’s technical chops, this “edge” is usually no more sophisticated than simple spreadsheet analysis, but some particularly intense people go to the extent of creating their own player rankings and projection systems. Online tools can provide similar capabilities, but it’s not often transparent where the numbers come from.

Although the data involved is not large in volume, the types of data processing, data analytics, and machine-learning techniques used in this area are common to many Apache Hadoop use cases. So, fantasy sports analytics provides a good (and fun) use case for exploring the Hadoop ecosystem.

Apache Spark is a natural fit in this environment. As a data processing platform with embedded SQL and machine-learning capabilities, Spark gives programmatic access to data while still providing an easy SQL access point and simple APIs to churn through the data. Users can write code in Python, Java, or Scala, and then use Apache Hive, Apache Impala (incubating), or even Cloudera Search (Apache Solr) for exploratory analysis.

In this two-part series, I’ll walk you through a common big data workflow: using Spark and Spark SQL for ETL and complex data processing, all while using fantasy NBA basketball as a contextual backdrop. In particular, we’ll do a lot of data processing and then determine who our system says was the best player in 2015-16 NBA season. (If it is anyone other than Stephen Curry, we’ll need to go back to the drawing board.) For those of you who follow professional sports other than basketball (or don’t follow sports at all), don’t pay too much attention to the subject area itself because the dataset patterns involved are highly similar to those involving other sports as well as real use cases.

Data Processing: Setup

We begin by grabbing our data from Basketball-Reference.com, which allows us to export season-based statistics by year as CSV files. We’ll grab data from the current season all the way back to the 1979-1980 season, which is when the NBA adopted the three-point shot. (Prior to the merger of the ABA and NBA in 1976, the three-point shot was used in the ABA but not the NBA. Blocks and steals are also not available for all historic seasons. Starting from the 1979-1980 season gives us the same information to use on all seasons.) This gives us 26 full seasons of data to use for our analysis. We note that the CSV files have some missing data, repeated headers, and so on, which we will have to account for and scrub out during our data processing. (No dataset is perfect!)

A sample csv file from Basketball-Reference.com

To get started, we’ll download all the CSV files from Basketball-Reference.com and load them into HDFS. Throughout this post, we’ll use the Cloudera QuickStart VM and operate as the user “cloudera”.

1

hadoop fs-put BasketballStats/user/cloudera/

Looking at the files, each line has no indication of the year in which the stat was recorded, so we’ll process each file to include the year. Start up the spark-shell and run the following:

Now that we have all the basic information we need in our data, let’s set up some data structures and helper functions that we’ll use later on. For those without a statistics background, the appendix contains a brief explanation of the z-scores and normalized z-scores, which are the main statistics that we will be computing and using for analysis.

BballData: a class that holds the statistics as received from Basketball-Reference.com, as well as arrays and doubles that will correspond to our z-scores and normalized z-scores as we process the data.

bbParse: a function that takes a line of input from the Basketball-Reference.com data, computes new statistics, and returns a BballData object. This allows us to do all the computations described in the appendix.

BbStatCounter: a new object that we use to calculate aggregate statistics across a dataset. BbStatCounter contains an instance of Spark’s StatCounter class, which allows us to easily calculate things like mean, max, min, and standard deviation across a data set. We’ll use this in our data transformations to easily compute these values across all nine statistics (FG%, FT%, 3P, TRB, AST, STL, BLK, TOV, PTS) in one pass.

processStats: takes in the raw data and calculates our mean, max, min, and standard deviation values. This function takes an RDD of Strings, which are the raw lines of statistical data, parses them via bbParse, groups the data by year, then calculates the average, min, max, and standard deviation for each statistic. The result is stored in an array and passed back. This is the main data processing function.

Data Processing: Calculating z-Scores

We begin by reading all our data into a single RDD. Then we filter our rows we don’t want and cache the result. Caching an RDD keeps the computed RDD in memory so that we won’t need to re-compute it later. This is useful for RDDs that are going to be used more than once.

The RDD is now passed to the processStats function, which will return an array of aggregate statistics that we collect into a map and broadcast. Broadcasting a value in Spark means that every node in the cluster will have a local copy of the value. This is important for objects like lookup tables or maps that are needed by each executor. Although we’re running on a single node, this is still good practice for when we are working in a distributed environment.

Now our data has everything it needs to calculate z-scores. We pass it through processStats again, this time to calculate the z-scores. The resulting Array gets collected as a map and broadcasted. We then parse our stats one more time, this time to calculate the normalized z-scores.

Data Processing: Saving Data as a Table

Once the values are all calculated, we’d like to save this as a table so we can query it with SQL syntax. To do so, we turn our RDD into a RDD[Row] object, define a schema for our table, and create a DataFrame object via createDataFrame. This DataFrame is then saved as a table using the saveAsTable function. (Note that once we have a DataFrame, we can query it using a routine SQL command.) We’ll also modify this table to quickly add a player’s experience level to the data, which was not previously tracked. This will be useful information during later calculations.

vval dfPlayers=sqlContext.sql("select age-min_age as exp,tPlayers.* from tPlayers join (select name,min(age)as min_age from tPlayers group by name) as t1 on tPlayers.name=t1.name order by tPlayers.name, exp ")

//save as table

dfPlayers.saveAsTable("Players")

Programmatic Data Analysis with Spark SQL

Now that we have all the data we need, it’s time to start analyzing it. Since we’re working on top of Cloudera’s platform, we have some options on how to work with the data; we can either start querying it with Spark SQL or query it with Impala since we saved the table into the Hive Metastore. The former may be easier if we’re expecting to need programmatic access to the data, or wish to check our work while we’re still developing in Spark, whereas the latter is a better idea for production workloads or if we want to make the data available to users to query on an ad hoc basis. We’ll start by using Spark SQL on DataFrames to illustrate its API.

Our DataFrame has a lot of information in it—raw stats, z-scores, normalized z-scores—so we can immediately start asking and answering some good questions. For example, let’s figure out how many seasons have been recorded since 1980 for each age:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

scala>dfPlayers.groupBy("age").count.sort("age").show(100)

+---+-----+

|age|count|

+---+-----+

|18|12|

|19|93|

|20|238|

|21|450|

|22|1137|

|23|1623|

|24|1626|

|25|1455|

|26|1356|

|27|1236|

|28|1077|

|29|980|

|30|883|

|31|745|

|32|619|

|33|487|

|34|362|

|35|251|

|36|166|

|37|111|

|38|73|

|39|40|

|40|15|

|41|4|

|42|3|

|43|1|

|44|1|

+---+-----+

We easily see that the bulk of seasons are concentrated around ages 22-32, with the peak at age 24.

As you can see, the DataFrame exposes many functions that allows you to quickly manipulate it and show the desired data. For those more comfortable writing straight SQL queries, Spark SQL allows you to register a DataFrame as a temporary table or save the table in Hive. You can then query it with SQL syntax using the HiveContext object provided in the spark-shell, sqlContext. To illustrate, let’s check in on the MVP race for this year.

1

2

3

4

5

6

7

8

9

10

11

12

scala>sqlContext.sql("Select name, zTot from Players where year=2016 order by zTot desc").take(10).foreach(println)

[Stephen Curry,19.766248304312754]

[Kevin Durant,15.323017389251323]

[Anthony Davis,13.186429940875069]

[Kawhi Leonard,13.18181904233336]

[James Harden,12.622408009920706]

[Russell Westbrook,12.26014043592826]

[Kyle Lowry,11.634357073733122]

[Paul Millsap,11.28903998833887]

[Chris Paul,10.843486407063033]

[Jimmy Butler,10.475908301410975]

Basketball fans should not be surprised to see Stephen Curry topping the list, as he’s having a historic year to date. The rest of the list contains players who are also having great years, so this is a good sanity check that ours z-score is a good statistic. Let’s check out the normalized scores as well to see if they tell a different story:

1

2

3

4

5

6

7

8

9

10

11

12

scala>sqlContext.sql("Select name, nTot from Players where year=2016 order by nTot desc").take(10).foreach(println)

[Stephen Curry,3.911865399387443]

[Kevin Durant,2.8729484855957916]

[Kawhi Leonard,2.7288580160780636]

[Anthony Davis,2.599364621997217]

[Russell Westbrook,2.4555072670169955]

[Paul Millsap,2.3037685595490816]

[LeBron James,2.1939606667616527]

[Kyle Lowry,2.151090172022115]

[Chris Paul,2.1331243771597586]

[James Harden,2.0788749389912686]

We see similar players in both lists, but the ordering is slightly different. Anthony Davis and Kawhi Leonard are neck and neck in z-score, but Leonard has a slight edge in normalized z-score, which suggests that he contributes more significantly across a wider spectrum of stats than does Davis, or that Davis may be overkill in the stats that he is good in. Curry has a commanding lead in both zTot and nTot—29% on the second place player in z-score and 39% in normalized z-score. Let’s have a look at Curry’s full numbers from this year.

1

2

3

scala>sqlContext.sql("Select * from Players where year=2016 and name='Stephen Curry'").collect.foreach(println)

We see his largest contributor to zTot is z3P at 6.05, meaning his three point game is contributing a lot to his value. Indeed, he’s actually top in this regard.

1

2

3

4

5

6

7

8

9

10

11

12

sqlContsqlContext.sql("select name, 3p, z3p from Players where year=2016 order by z3p desc").take(10).foreach(println)

[Stephen Curry,4.9,6.056007802613957]

[Klay Thompson,3.2,3.6363158210435227]

[Damian Lillard,3.1,3.493980998598203]

[Paul George,2.9,3.2093113537075637]

[Kyle Lowry,2.7,2.9246417088169245]

[J.J.Redick,2.7,2.9246417088169245]

[James Harden,2.7,2.9246417088169245]

[Eric Gordon,2.5,2.6399720639262854]

[Wesley Matthews,2.5,2.6399720639262854]

[C.J.McCollum,2.5,2.6399720639262854]

Curry has almost doubled the value of the second place player, teammate Klay Thompson. At just shy of five three-pointers made a game, Curry is arguably having one of the best shooting seasons of all time. Or is he? z-scores can help us here, by telling us how much better a player is relative to the state of the league. We’ll run two queries to demonstrate:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

sqlContext.sql("select name, 3p, z3p from Players order by 3p desc").take(10).foreach(println)

[Stephen Curry,4.9,6.056007802613957]

[Stephen Curry,3.6,4.395853442084636]

[Stephen Curry,3.5,4.372814802572277]

[Ray Allen,3.4,4.735522316886375]

[Ray Allen,3.3,4.863432467545102]

[Dennis Scott,3.3,4.229610850304963]

[George McCloud,3.3,4.229610850304963]

[Stephen Curry,3.3,3.8893664999911772]

[Klay Thompson,3.2,3.6363158210435227]

[Klay Thompson,3.1,3.651613974257506]

sqlContext.sql("select name, 3p, z3p from Players order by z3p desc").take(10).foreach(println)

[Joe Hassett,1.3,10.220836345844305]

[Mike Dunleavy,1.1,8.683932683360899]

[Darrell Griffith,1.1,8.683932683360899]

[Mike Dunleavy,0.8,7.496775843134757]

[Joe Hassett,1.0,7.456044031025608]

[Brian Taylor,1.2,6.94629576951599]

[Darrell Griffith,1.2,6.9433935103235775]

[Michael Adams,2.5,6.679409664703833]

[Don Buse,0.9,6.671197290917649]

[Danny Ainge,1.8,6.282718341161911]

The first query shows that Curry makes, on average, more three-pointers than anyone in the history of the NBA. The second query shows that Joe Hassett had the best three-point shooting season in 1981, as compared to the rest of the league, in the history of the NBA. Curry doesn’t even rank in the top 10. (He barely misses it, coming in at 12th.) Although Hassett only made 1.3 three pointers a game in 1981, the three-point shot was new and few people were taking it. Three-pointers are common in today’s game to the point where there is talk about moving the line back. The proof is always in the numbers:

1

2

3

4

5

scala>broadcastStats.value(“2016_3P_avg”)

0.6452380952380956

scala>broadcastStats.value(“1981_3P_avg”)

0.0478947368421054

In 1981, Hassett’s 1.3 three-pointers a game was such a commanding difference over the league average of .04, that it produces a higher z-score than Curry’s 4.9 in 2016.

This example illustrates why it’s important to keep in mind that z-scores are measuring relative values in year: we’re not claiming that Hassett is a better three-point shooter than Stephen Curry, just that his performance in 1981 was a larger outlier that year than Curry’s in 2016.

Conclusion

In this post, we enriched our data with additional statistics that allow us to assign values to players, which we can then use to determine which players are the best in specific categories, or all around. In Part 2, we’ll do some additional data processing and use Impala to do some interactive analytics.