Extracting and Analyzing 1000 Basketball Games using Pandas and Chartify

We will narrow our scope to some specific fields for this project:

- GameId: not crucial for analysis, but database-wise it will be useful to have
- GameDate: so we can group by date and get insights from a given game day, and for historical analysis
- AwayTeam: name of the away team
- HomeTeam: name of the home team
- AwayPts → (Q1, Q2, Q3, Q4): points scored by the away team, stored in separate fields for each quarter
- HomePts → (Q1, Q2, Q3, Q4): points scored by the home team, likewise in separate fields per quarter
- Referees → (Referee1, Referee2, Referee3): each game has three referees; we’re going to store their names separately
- TimesTied: number of times both teams had the same score during the game
- LeadChanges: number of times the lead changed from one team to the other
- LastMeetingWinner: winner of the last meeting between the two teams
- Winner: the winning team’s name

Designing our Database

One record stores data about one game.

Generally, when designing a database, the tables and their normalization always depend on the kind of insights we want to gain from the project.

For example, you could calculate the winner by looking at the points scored by both teams.

Whichever team’s got more points is the winner.

But in our case, I’m creating a separate column for the winner, because I don’t think storing a somewhat redundant field like this will be a problem for us.

With that said, I don’t create a separate column for the points a team scored in the whole game.

I just store the quarterly points by both teams.

Whenever we need a game total, we’ll have to sum up the quarterly points for that team.

I think that’s not a big sacrifice considering that this way we can analyze specifically the quarters of each game.
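The design above can be sketched as a table definition. Below is a minimal sketch assuming SQLite; the table name and the quarter-column naming (Q1AwayPts, Q1HomePts, …) follow the GameStats query used later in the article, while the exact column types are my assumption.

```python
import sqlite3

# A minimal sketch of the schema described above, assuming SQLite.
# Table and quarter-column names follow the GameStats query used later;
# the column types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS GameStats (
    GameId            TEXT PRIMARY KEY,
    GameDate          TEXT,
    AwayTeam          TEXT,
    HomeTeam          TEXT,
    Q1AwayPts INTEGER, Q2AwayPts INTEGER, Q3AwayPts INTEGER, Q4AwayPts INTEGER,
    Q1HomePts INTEGER, Q2HomePts INTEGER, Q3HomePts INTEGER, Q4HomePts INTEGER,
    Referee1 TEXT, Referee2 TEXT, Referee3 TEXT,
    TimesTied INTEGER,
    LeadChanges INTEGER,
    LastMeetingWinner TEXT,
    Winner TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
```

Note there is no total-points column, as discussed: game totals are always derived from the quarter fields.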

Fetching and Filtering the Data

We will follow the steps below for fetching and filtering our data:

1. Iterating over the score pages
2. Collecting GameIDs and storing them
3. Iterating over game data responses and parsing JSON
4. Saving the specified fields into a database
5. Cleaning the data

Let’s understand each step in a bit more detail.

1. Iterating over the score pages

Inspecting even one score page gives us a hint that this page uses a JSON file to get data as well. An example URL of this kind of request:

https://stats.nba.com/stats/scoreboardV2?DayOffset=0&LeagueID=00&gameDate=03/03/2019

Again, rather than scraping data from the page, we use this endpoint to get GameIDs.
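Parsing the GameIDs out of such a response can be sketched as below. The sample payload mimics the tabular "resultSets" layout (headers plus rowSet) that the public stats.nba.com endpoints return; the field names here are my assumption based on that endpoint, not taken from the article.

```python
import json

# Hedged sketch: stats.nba.com scoreboard responses carry tabular "resultSets"
# (headers + rowSet). This sample mimics that layout; the field names are
# assumptions modeled on the public endpoint.
sample_response = json.loads("""
{"resultSets": [{"name": "GameHeader",
                 "headers": ["GAME_ID", "GAME_DATE_EST"],
                 "rowSet": [["0021800979", "2019-03-03"],
                            ["0021800980", "2019-03-03"]]}]}
""")

def extract_game_ids(response):
    """Pull GAME_ID values out of the GameHeader result set."""
    header_set = next(rs for rs in response["resultSets"]
                      if rs["name"] == "GameHeader")
    idx = header_set["headers"].index("GAME_ID")
    return [row[idx] for row in header_set["rowSet"]]

game_ids = extract_game_ids(sample_response)
```

In the real pipeline you would fetch each day's JSON over HTTP (the stats API tends to reject requests without browser-like headers) and feed the decoded body to a parser like this.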

5. Cleaning the data

After storing data about each game played this season, I recognized some outliers in the dataset.

I removed the NBA All-Star game from the database because it was a huge outlier with regard to the points total.

It shouldn’t be lumped together with the regular season games.

I also had to remove some games that were played in the preseason in early October.

So now, we only have regular season data.
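A filtering step like this can be sketched in pandas. The specific cutoff dates below are my assumption (the 2018–19 regular season tip-off and the 2019 All-Star game date); only the idea of dropping pre-season and All-Star rows comes from the text.

```python
import pandas as pd

# Sketch of the cleaning step, assuming GameDate is stored as a date string.
df = pd.DataFrame({
    "GameDate": ["2018-10-05", "2018-10-16", "2019-02-17", "2019-03-03"],
    "HomeTeam": ["BOS", "BOS", "Team LeBron", "MIL"],
})
df["GameDate"] = pd.to_datetime(df["GameDate"])

season_start = pd.Timestamp("2018-10-16")   # assumed regular-season tip-off
all_star_date = pd.Timestamp("2019-02-17")  # assumed All-Star game date

# Keep only regular-season rows: drop early-October pre-season games
# and the All-Star game outlier.
clean = df[(df["GameDate"] >= season_start) & (df["GameDate"] != all_star_date)]
```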

Analyzing the Data and Generating Reports

Finally, the fun part: querying the database to generate insightful reports and interesting stats. But first, we need to figure out what reports we want to create:

- Overall reports about the dataset
- Home court advantage
- Scored points distribution
- Game points by date
- Comparing two teams
- Biggest comebacks
- Biggest blowouts
- Most points in one game day
- Most exciting games
- Prolific referees

These are ad-hoc reports that might be interesting to go through.

There are a bunch of other ways to analyze this dataset – I encourage you to come up with more advanced dashboards.

Installing the required libraries

Before we start generating reports, we need to install some libraries we’re going to use.

We’ll generate a pie chart which tells us if there’s any home court advantage – that is, whether a team has a better chance of winning when playing at home, based on the statistics. (Chartify doesn’t yet support pie charts, so we’re using the pandas plotting wrapper for this task, which is essentially matplotlib.)
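The underlying computation can be sketched as follows: compare the Winner column against HomeTeam and count the split. The toy data here is made up for illustration; in the article the rows come from the GameStats table.

```python
import pandas as pd

# Toy games to illustrate; real rows would come from the GameStats table.
df = pd.DataFrame({
    "HomeTeam": ["MIL", "CLE", "GSW", "SAS"],
    "AwayTeam": ["CLE", "MIL", "UTA", "LAL"],
    "Winner":   ["MIL", "MIL", "GSW", "LAL"],
})

# Label each game by whether the home side won, then count.
counts = (df["Winner"] == df["HomeTeam"]).map(
    {True: "Home win", False: "Away win"}).value_counts()

# counts.plot.pie(autopct="%.1f%%")  # pandas' matplotlib wrapper draws the pie

home_win_rate = counts["Home win"] / counts.sum()
```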

It seems the date of the game doesn’t make any difference to the number of points scored. At least not at a high level.

See that gap on the right side of our plot? It seems to fall somewhere in mid-February.

As it turns out, no games were played between Feb 15-20.

This was the time for the all-star game which we intentionally excluded from our database earlier.
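That gap can also be confirmed numerically rather than by eye: look at the difference between consecutive game days. The dates below are hypothetical rows around the break, just to show the idea.

```python
import pandas as pd

# Hypothetical game dates around the 2019 All-Star break, to show how the
# gap seen in the scatter plot can be confirmed numerically.
dates = pd.to_datetime(["2019-02-13", "2019-02-14", "2019-02-21", "2019-02-22"])
df = pd.DataFrame({"GameDate": dates})

# Sort the distinct game days and measure the spacing between them.
game_days = df["GameDate"].drop_duplicates().sort_values()
gaps = game_days.diff()
longest_break = gaps.max()
```

On the real dataset, the longest spacing lands exactly on the Feb 15–20 All-Star break.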

Incredible what a simple visualization can reveal, right?

Comparing two teams

It’s always a fun exercise comparing teams to see how they are doing relative to each other.

For our study, I chose a high-performing team and an underperformer:

These two teams have pretty different point distributions.

For Cleveland, it’s very rare to reach 120 points in a game.

They usually score between 90 and 110.

For Milwaukee, they are usually on the edge or over 120 points.

Based on this chart, it’s not surprising to learn that the Bucks are first in their conference while the Cavaliers are second-to-last.
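A per-team points-per-game series needs both the home and away rows for that team. One way to build it, assuming the quarter-column naming from the GameStats schema:

```python
import pandas as pd

# Two hypothetical games; quarter columns follow the GameStats naming.
df = pd.DataFrame({
    "HomeTeam": ["MIL", "CLE"],
    "AwayTeam": ["CLE", "MIL"],
    "Q1HomePts": [30, 25], "Q2HomePts": [32, 22],
    "Q3HomePts": [28, 24], "Q4HomePts": [33, 21],
    "Q1AwayPts": [24, 31], "Q2AwayPts": [23, 29],
    "Q3AwayPts": [25, 30], "Q4AwayPts": [22, 33],
})

# Sum quarters into game totals, once from the home side, once from the away side.
home = pd.DataFrame({
    "Team": df["HomeTeam"],
    "Pts": df[["Q1HomePts", "Q2HomePts", "Q3HomePts", "Q4HomePts"]].sum(axis=1),
})
away = pd.DataFrame({
    "Team": df["AwayTeam"],
    "Pts": df[["Q1AwayPts", "Q2AwayPts", "Q3AwayPts", "Q4AwayPts"]].sum(axis=1),
})
points = pd.concat([home, away], ignore_index=True)
avg_pts = points.groupby("Team")["Pts"].mean()
```

The `points["Pts"]` column is exactly what the distribution charts in this section are drawn from, filtered to one team at a time.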

It would be interesting to see this chart with Kyrie and LeBron back in the team, but that’s for another time!

Biggest Comebacks

We want to see some comebacks.

Who doesn’t love a rip-roaring comeback by a team most consider to be out of the game? We’ll take the cases where a team was down by a lot in the first half but managed to win the game:

```python
def comebacks(self):
    query = (
        "SELECT *, ABS(Home1stHalf-Away1stHalf) AS Comeback FROM "
        "("
        "SELECT GameDate, AwayTeam, HomeTeam, (Q1AwayPts+Q2AwayPts) AS Away1stHalf, "
        "(Q1HomePts+Q2HomePts) AS Home1stHalf, (Q3AwayPts+Q4AwayPts) AS Away2ndHalf, "
        "(Q3HomePts+Q4HomePts) AS Home2ndHalf FROM GameStats"
        ") s "
        "WHERE (Home1stHalf > Away1stHalf AND Home1stHalf+Home2ndHalf < Away1stHalf+Away2ndHalf) OR "
        "(Home1stHalf < Away1stHalf AND Home1stHalf+Home2ndHalf > Away1stHalf+Away2ndHalf) "
        "ORDER BY `Comeback` DESC"
    )
    df = pd.read_sql(query, self.conn)  # self.conn: the database connection (assumed attribute name)
    df.index += 1  # rank games from 1 instead of 0
    return df
```

There were 32 lead changes in the Golden State Warriors v Utah Jazz game! The lead changed every 1.5 minutes on average – that sounds like a pulsating affair. Eventually, GSW won the game 124-123.

We’ve got two San Antonio Spurs games on the list – maybe the Spurs tend to play give-and-take type of games more often than others?

Most exciting games (Volume 2)

Another way to statistically define exciting games would be based on the number of ties during a game.
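Since TimesTied is stored per game, this report is a simple ranking. A sketch over toy rows (team names and tie counts here are made up):

```python
import pandas as pd

# Sketch: rank games by the TimesTied field to surface the "tightest" games.
# These rows are illustrative, not real results.
df = pd.DataFrame({
    "HomeTeam":  ["GSW", "SAS", "MIL"],
    "AwayTeam":  ["UTA", "LAL", "CLE"],
    "TimesTied": [15, 22, 3],
})

most_tied = df.sort_values("TimesTied", ascending=False).head(10)
```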

That means you cannot watch three game days in a row without any of these teams being on the court.

Impressive!

End Notes

This article is intended to inspire you on how to make use of web data or other kinds of data.

There are more and more tools available that you can use to draw insights from public data.

I hope this walkthrough gives you some ideas about how to make data work for you.

You can also use this analysis to build machine learning models.

We’ve done the data cleaning and exploration part – take it forward and use your favorite algorithms to predict a team’s chances of winning.
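As a starting point, here is a minimal sketch of turning the cleaned table into features and labels. The feature choice (first-half point differential) and all names are my own illustration, not the article's method; swap in your favorite classifier once X and y are built.

```python
import pandas as pd

# Hypothetical feature prep for a win-prediction model: first-half point
# differential as the single feature, "home team won" as the label.
df = pd.DataFrame({
    "HomeTeam": ["MIL", "CLE"],
    "Winner":   ["MIL", "MIL"],
    "Q1HomePts": [30, 20], "Q2HomePts": [31, 22],
    "Q1AwayPts": [25, 30], "Q2AwayPts": [24, 29],
})

X = pd.DataFrame({
    "half_diff": (df["Q1HomePts"] + df["Q2HomePts"])
                 - (df["Q1AwayPts"] + df["Q2AwayPts"]),
})
y = (df["Winner"] == df["HomeTeam"]).astype(int)

# Naive baseline: predict a home win whenever the home team leads at the half.
baseline_pred = (X["half_diff"] > 0).astype(int)
baseline_acc = (baseline_pred == y).mean()
```

Any real model should beat this half-time-lead baseline to be worth keeping.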

The possibilities are endless.

If you have any questions or suggestions, feel free to leave them in the comments section below.

Thanks for reading!

About the Author

Attila Tóth

Attila is the Founder of ScrapingAuthority.com, where he teaches web scraping and data engineering.

He has expertise in designing and implementing web data extraction and processing solutions.

You can check out his YouTube channel here.
