Moneyball 2.0: Winning in Sports with Data

INFSCI 1091: Special Topics

General Information

Data and analytics have been part of the sports industry since as early as the 1870s, when the first box score in baseball was recorded. However, only recently have advanced data mining and machine learning techniques, enabled by our ability to collect more fine-grained data, been used to support the operations of sports franchises. Draft selection, game-day decision making and player evaluation are just a few of the applications where sports analytics plays a crucial role today. Apart from the sports clubs, other stakeholders in the industry (e.g., the leagues' offices and the media) invest in analytics; the leagues, for instance, increasingly rely on data to decide on potential rule changes. In this course, we will introduce data science concepts for sports analytics. Students will learn concepts related to data collection, data analysis and modeling, as well as data visualization.

For whom? Students need to have some basic statistical background and programming skills. They should also have an interest in sports, since all the analytical examples will be taken from the sports field.

Where are analytics used in sports? What is the current state of the art? In Week 1 we will explore various sports APIs and web scraping frameworks. (ppt)

This week we will introduce the notion of empirical probability and the concept of statistical tests. We will showcase these concepts using examples from in-game strategic decision making, such as two-point conversion and fourth-down decisions in the NFL and end-game basketball strategy. (ppt)
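As an illustration of these two ideas, the sketch below computes an empirical success probability and runs a two-proportion z-test. The conversion counts are made up for the example, and the z-test is only one of several tests that could be used here.

```python
import math

def empirical_probability(successes, attempts):
    """Fraction of attempts that succeeded."""
    return successes / attempts

def two_proportion_z_test(s1, n1, s2, n2):
    """Two-sided z-test for the difference between two success rates."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: two-point conversions on runs vs. passes.
run_made, run_tried = 60, 100
pass_made, pass_tried = 45, 100

print(empirical_probability(run_made, run_tried))
print(two_proportion_z_test(run_made, run_tried, pass_made, pass_tried))
```

A small p-value would suggest that the gap between the two success rates is unlikely to be due to chance alone, given these (hypothetical) sample sizes.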

This week we will introduce the concept of Monte Carlo simulations and resampling with replacement. (ppt)
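A minimal sketch of both ideas follows, using only the standard library. The first function estimates the chance of winning a playoff series by simulation; the second builds a percentile bootstrap interval for a mean by resampling with replacement. The single-game win probability and the list of points scored are hypothetical.

```python
import random

def series_win_probability(p_game, wins_needed=4, n_sims=20000, seed=0):
    """Monte Carlo estimate of winning a best-of-(2*wins_needed - 1) series."""
    rng = random.Random(seed)
    series_won = 0
    for _ in range(n_sims):
        wins = losses = 0
        while wins < wins_needed and losses < wins_needed:
            if rng.random() < p_game:
                wins += 1
            else:
                losses += 1
        series_won += wins == wins_needed
    return series_won / n_sims

def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of data."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices draws n values from data with replacement.
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_resamples))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical inputs for illustration only.
points_scored = [98, 105, 110, 92, 101, 115, 99, 107, 103, 95]
print(series_win_probability(0.6))
print(bootstrap_mean_ci(points_scored))
```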

This week we will introduce linear and logistic regression in the context of sports analytics. We will examine the Bradley-Terry model as well as in-game win probability models. We will also examine appropriate evaluation metrics for probability models (Brier score, probability validation curves). (ppt)
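To make the Bradley-Terry model and the Brier score concrete, here is a small sketch. It assumes the standard form P(i beats j) = s_i / (s_i + s_j) and fits the strengths with a simple iterative (minorization-maximization) update, which is one common fitting approach and not necessarily the one used in class. The head-to-head win counts are invented, and the fit assumes every team has at least one win.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths; wins[i][j] = times team i beat team j."""
    n = len(wins)
    s = [1.0] * n
    for _ in range(n_iter):
        new = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            new.append(total_wins / denom)
        # Rescale so the strengths sum to n (the scale is arbitrary).
        norm = sum(new)
        s = [v * n / norm for v in new]
    return s

def win_probability(s, i, j):
    """Model probability that team i beats team j."""
    return s[i] / (s[i] + s[j])

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical head-to-head record for three teams.
wins = [[0, 3, 4],
        [1, 0, 3],
        [0, 1, 0]]
strengths = bradley_terry(wins)
print(strengths, win_probability(strengths, 0, 1))
```

A lower Brier score is better; always predicting 0.5 yields a score of 0.25, which is a useful baseline when judging a win probability model.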

Readings:

This week we will examine various methods for rating teams and players. In particular, we will examine the Elo rating method, which was first used to rate chess players and today is at the core of FiveThirtyEight's predictions. We will also examine regression-based ratings as well as network ratings. We will further see advanced metrics for player evaluation (adjusted plus/minus, wins above replacement, etc.). (ppt)
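The basic Elo update can be sketched in a few lines. This uses the conventional chess formulation (logistic curve with a 400-point scale); the K-factor of 20 is an assumed value, and sport-specific variants typically add adjustments such as home advantage or margin of victory that are not shown here.

```python
def expected_score(rating_a, rating_b):
    """Expected score (win probability) of A against B under Elo."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(rating_a, rating_b, score_a, k=20):
    """Return new ratings after a game; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta

# Two equally rated teams; team A wins.
print(update_ratings(1500, 1500, 1))  # (1510.0, 1490.0)
```

An upset moves the ratings more than an expected result, because the update is proportional to the gap between the actual and the expected score.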

During this week we will introduce the bias-variance tradeoff and the problem of overfitting. We will also introduce the notion of regularization for preventing overfitting. Finally, we will discuss how we can combine Monte Carlo simulations and team ratings for simulating sports tournaments. (ppt)
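As a rough sketch of how ratings and Monte Carlo simulation can be combined, the code below simulates a single-elimination bracket many times and reports how often each team wins the title. It assumes an Elo-style win probability, a bracket whose size is a power of two, and hypothetical team names and ratings.

```python
import random

def win_probability(rating_a, rating_b):
    """Elo-style probability that A beats B (an assumed rating-to-probability map)."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def simulate_bracket(ratings, rng):
    """Play one single-elimination tournament; teams are paired in dict order."""
    teams = list(ratings)
    while len(teams) > 1:
        next_round = []
        for a, b in zip(teams[0::2], teams[1::2]):
            p_a = win_probability(ratings[a], ratings[b])
            next_round.append(a if rng.random() < p_a else b)
        teams = next_round
    return teams[0]

def title_odds(ratings, n_sims=10000, seed=0):
    """Fraction of simulated tournaments won by each team."""
    rng = random.Random(seed)
    counts = dict.fromkeys(ratings, 0)
    for _ in range(n_sims):
        counts[simulate_bracket(ratings, rng)] += 1
    return {team: c / n_sims for team, c in counts.items()}

# Hypothetical four-team bracket: A plays B, C plays D, winners meet.
ratings = {"A": 1700, "B": 1500, "C": 1500, "D": 1400}
print(title_odds(ratings))
```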

Midterm

During this week we will introduce the concept of schedule strength and the statistical adjustments that control for differences in schedule strength. We will also introduce the ideas behind expected points per play in the NFL. (ppt)

During this week we will examine how Bayesian inference can be used to evaluate players for whom we have only a small number of observations. Related to this problem is the evaluation of draft picks and the efficiency of the underlying market. We will particularly deal with the NFL and NBA drafts through the seminal studies from Massey and Thaler, as well as Winston, Sagarin and Medland. (ppt)
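One common way to handle small samples is a Beta-Binomial model, sketched below for a shooting percentage. The prior parameters (alpha = 36, beta = 64, i.e. a league-average rate of about 36%) and the player counts are hypothetical; this illustrates the general idea and is not necessarily the exact model covered in class.

```python
def beta_posterior(alpha, beta, successes, attempts):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + (attempts - successes)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior: roughly a 36% shooter, worth about 100 attempts of evidence.
prior_alpha, prior_beta = 36, 64

# Player with 4 makes in 4 attempts vs. player with 400 makes in 1000 attempts.
small_sample = beta_posterior(prior_alpha, prior_beta, 4, 4)
large_sample = beta_posterior(prior_alpha, prior_beta, 400, 1000)

print(beta_mean(*small_sample))  # pulled strongly toward the prior
print(beta_mean(*large_sample))  # dominated by the player's own data
```

The raw percentage of the first player is 100%, but with only four attempts the estimate stays close to the prior; the second player's estimate stays close to the observed 40%.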

During this week we will discuss spatial data in sports. We will focus specifically on the NBA, which has been using optical tracking in all of its arenas for several years now. We will introduce matrix factorization techniques (in particular, Singular Value Decomposition and Non-negative Matrix Factorization) that can be used to identify latent patterns in spatial data, as well as metrics that can quantify floor spacing. (ppt)
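To give a feel for Non-negative Matrix Factorization, the sketch below factors a small players-by-court-zones matrix of shot counts into two non-negative factors using multiplicative updates. It is written in pure Python for self-containment (in practice a numerical library would be used), and the shot counts, the rank of 2 and the iteration count are all assumptions made for the example.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of v - w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate v (n x m) as w (n x rank) times h (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Multiplicative update for h, then for w; factors stay non-negative.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

# Hypothetical shot counts: rows are players, columns are court zones.
shots = [[8, 1, 0, 6],
         [7, 2, 1, 5],
         [0, 9, 8, 1],
         [1, 8, 9, 0]]
w, h = nmf(shots, rank=2)
print(frobenius_error(shots, w, h))
```

Each row of `h` can be read as a latent shooting pattern over zones, and each row of `w` as how strongly a player expresses each pattern.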

During this week we will discuss some basic concepts of game theory. We will focus on the notions of pure and mixed strategies, zero-sum games and Nash equilibrium. We will see examples of game theory being applied to American football, basketball and soccer. (ppt)
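A penalty kick is a classic 2x2 zero-sum game. The sketch below solves for the mixed-strategy equilibrium by making each side indifferent between its two options. The scoring probabilities are hypothetical, and the closed form assumes the game has no pure-strategy equilibrium.

```python
def solve_2x2_zero_sum(payoff):
    """Mixed equilibrium of a 2x2 zero-sum game.

    payoff[i][j] is the row player's payoff when the row player picks i
    and the column player picks j. Returns (p, q, value), where p is the
    probability the row player picks option 0 and q is the probability
    the column player picks option 0.
    """
    (a, b), (c, d) = payoff
    denom = a - b - c + d
    if denom == 0:
        raise ValueError("no unique mixed equilibrium for this payoff matrix")
    p = (d - c) / denom  # makes the column player indifferent
    q = (d - b) / denom  # makes the row player indifferent
    if not (0 <= p <= 1 and 0 <= q <= 1):
        raise ValueError("game has a pure-strategy equilibrium")
    value = (a * d - b * c) / denom
    return p, q, value

# Hypothetical scoring probabilities: rows = kicker aims left/right,
# columns = goalkeeper dives left/right.
scoring = [[0.58, 0.95],
           [0.93, 0.70]]
print(solve_2x2_zero_sum(scoring))
```

At equilibrium the kicker scores with the same probability whichever way the goalkeeper dives, so neither side can gain by deviating.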

During this week we will discuss algorithms for identifying clusters in data. We will see k-means and hierarchical clustering, and we will discuss ways of choosing the number of clusters. We will also discuss the curse of (high) dimensionality and Principal Component Analysis for dimensionality reduction. (ppt)
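The k-means procedure (Lloyd's algorithm) alternates between assigning points to the nearest center and recomputing the centers. The sketch below uses a deterministic farthest-point initialization for simplicity, which is an assumption of this example and not the usual random or k-means++ initialization. The player stat lines are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean_point(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical per-game stats: (three-point attempts, rebounds).
players = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
print(kmeans(players, k=2))
```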

During this week we will discuss applications of network science and analysis in the realm of sports analytics. In particular, we will see how player interactions can be represented through networks, as well as how representations can be learned from network relations. (ppt)
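As one possible illustration, a team's passes can be stored as a weighted directed network and scored with PageRank to see which players the ball tends to flow toward. The pass counts and position labels below are hypothetical, and PageRank is just one of many network measures that could be applied.

```python
def pagerank(passes, damping=0.85, n_iter=100):
    """PageRank on a weighted directed graph given as {(src, dst): weight}."""
    players = sorted({p for edge in passes for p in edge})
    n = len(players)
    out_weight = dict.fromkeys(players, 0)
    for (src, _dst), w in passes.items():
        out_weight[src] += w
    rank = {p: 1 / n for p in players}
    for _ in range(n_iter):
        new = {p: (1 - damping) / n for p in players}
        for (src, dst), w in passes.items():
            new[dst] += damping * rank[src] * w / out_weight[src]
        # Players with no outgoing passes spread their rank uniformly.
        dangling = sum(rank[p] for p in players if out_weight[p] == 0)
        for p in players:
            new[p] += damping * dangling / n
        rank = new
    return rank

# Hypothetical pass counts between three positions.
passes = {("PG", "SG"): 10, ("PG", "C"): 5,
          ("SG", "PG"): 8, ("SG", "C"): 2,
          ("C", "PG"): 6}
print(pagerank(passes))
```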