Teammates of Champions

Starting in the 1983-84 season, every NBA champion has featured a player who was on the same team as Shaquille O'Neal at some point in his career. This summer the Warriors won again -- except they did it without Matt Barnes, their last link to the Big Kevin Bacon -- ending the streak at 33 years. Last week, /u/packmanwiscy heroically sifted through fifty candidates to fill Shaq's outsize shoes. He limited the pool to accomplished players drafted more recently than 1990, and settled on Chris Paul (and soon LeBron James) as the heirs to O'Neal.

That felt like an appropriate finding, since Paul and James are both well-travelled vets and perennial contenders, which means they cross paths with a lot of champion role players looking for their next parade. But with something like 2,600 players getting game checks from the NBA since the beginning of Shaq's streak, I wondered if there was another player out there, perhaps less well-known, with a streak that could rival Shaq's.

To answer that question, I scraped roster data stretching back to the 1983-84 season from stats.nba.com, and fed it into a Neo4j graph database. Graph databases are a little different from typical relational databases. Unlike MySQL, which keeps its data in a series of tables not much different than an Excel spreadsheet, Neo4j puts everything into a giant network, full of nodes connected to each other by links. It's built to answer questions like this one, and here's what it came up with. In each row, after the player's name, you'll find their longest streak in parentheses. Darker shades indicate that the player has logged more time with the players on that year's championship team, based on seasons shared. (One caveat: I don't think stats.nba.com is as reliable as basketball-reference.)

Highlight a season to see which of that year's champions a player played with, and where.

Right now, Tim Thomas has the longest active streak with 16, but as packmanwiscy pointed out, LeBron has a streak of 18 as soon as he plays his first game with JaVale McGee on the Lakers.

Shout out to Chris Quinn, who in a seven year career pinballed between the Heat, Spurs, and Cavaliers... as well as the Tulsa 66ers, BC Khimki, and Valencia. I'm guessing he had to field a lot of questions about Tim Duncan and Shaq when he was overseas. Anyway, as long as Shaun Livingston is winning rings, Quinn's streak will live on.

Grant Hill came very close to having the longest active streak -- NBA.com lists Mikki Moore as a member of that '02-'03 Spurs team, but he was waived before taking the court. Moore and Hill played together for the Pistons.

Credit to the Stats & Info desk at ESPN, who -- I think -- first discovered this stat. See the original story here.

How I Created This Table

Loading the database

If you're curious about graph databases, let me give you an idea of how they can be used, and some of the quirks involved.

First we need to load our database, and that means grabbing data. The endpoints at stats.nba.com are open to the public, so long as your request has a User-Agent header that isn't obviously from a headless browser. A query like this will get you a team's roster for a given season:

... and after churning through all 30 teams and all 34 seasons between my start and end points (with a respectful sleep between each request, of course), I glued them all together in a concat operation.

Once I had my database, I did a little post-processing to give myself some more fields that I knew I'd need to reference later in my queries. All this data wrangling took probably the most time, and was not particularly exciting, so I'll gloss over the rest of it.

One thing that would have helped was setting up a pipeline for getting data from Pandas into Neo4j. I kept making the classic mistake of thing that "well, this is just a fun exploration, I don't need a whole pipeline for this". But nothing makes a fun exploration tedious like copy-pasting a sequence of IMPORT commands into Neo4j's browser, which doesn't allow for multiple queries to execute at once.

(Note: I was just told that newer versions of Neo4j's browser does allow for multiple queries, so you may have better luck than me in this regard!)

Had I known how many times I'd have to modify my schema and reimport CSVs, I would have taken the time to write up a Python script capable of firing off the few requests necessary to the REST API, which is not at all challenging.

Here's a look at some of the cypher queries I used to load my database.

You can load CSVs from regular URLs as well as files on your file system. In order to access an arbitrary directory in your load statement, you need to disable a security feature in your Neo4j configuration file, which is a text file to be found in the same folder as the binaries you execute to start your database. In these queries I'm not messing with that setting, so all these paths are being resolved relative to PATH_TO_MY_NEO4J_DIRECTORY/import

One more to get our players. By the way, these "period commits" are a way to keep your database from dropping dead of a memory error. Wasn't a problem for me at any point, it's just mentioned in most of the examples.

CREATE CONSTRAINT ON (p:Player)
ASSERT p.playerId IS UNIQUE;

Putting a constraint on a node type will also index it, which is important for performance in the later queries. Without indexes, Neo4j can take quite awhile to find and merge your relationships.

Here's the query that lets us really get to work. At this point, Neo4j knows about our players and the seasons, but not how they're related. So I feed it a CSV that represents the join table between the two, where each row indicates a player and where they played that season. I then match both those components, and MERGE a relationship between them. Merging in Neo4j is like saying "find or create this relationship".

I saw Shaq and some other Lakers at the top, which gave me confidence that my data had been imported correctly. I made use of some aggregating functions like collect and count to give myself a more compact output. Keep in mind that a "row" to Neo4j is a single pattern match. We tend to think in terms of the object of interest, here a player, but Neo4j has been instructed to find a particular configuration of players and teams; that's what it'll return unless you tell it to bundle up some of that information.

When I was poking around my database, I was a little concerned to see certain players matched up. They never played together, right? To spot-check these relationships, I asked Neo4j for the shortest path between them.

I should mention how important these WITH statements are. Without them, I wasn't getting all of my results, and from what I could glean in my Googling, this had to do with cardinality errors, explained here.

After a lot more exploration, I was ready to find some streaks. I thought it'd be useful down the road to refer to the length of a player's best streak, so this adds a new property to matching players.

This is a monster query, so I'll annotate it inline.

MATCH (n1:Player)-->(t1:TeamSeason)(champTeam:TeamSeason {champ: true})
WITH n1, collect(distinct champTeam) as champTeams
// We need some way of finding the longest -- not the shortest -- path through
// all the teams a player has connections to. While Neo4j is optimized for
// finding short paths, we need to do something like this to find the longest.
UNWIND champTeams as t
UNWIND champTeams as u
WITH *
MATCH p=(t)

And here's the query that gave me the data for the D3 script responsible for the table above:

MATCH (n:Player)
// I only want to see players with long streaks, so we ignore those who don't
// have that property set, and grab the top 25 according to streak length.
WHERE exists(n.longestChain)
WITH collect(n) as players, n
ORDER BY n.longestChain DESC
LIMIT 25
// We've got the players, now let's try and find their connections to the
// championship teams:
MATCH (t:TeamSeason {champ: true})
WITH players, collect(t) as champTeams
UNWIND players as n
UNWIND champTeams as champTeam
MATCH (n)-[:PLAYED_FOR]->(baconTeam:TeamSeason)(champTeam)
RETURN DISTINCT n.name as player,
n.longestChain as longestChain,
p2.name as championTeammate,
champTeam.displayName as champTeam,
baconTeam.displayName as baconTeam
ORDER BY n.name ASC

I recommend aliasing your output columns, so they've got tidy names when you send them into the next step of you process.

I'd love to tell you about how I continued into D3 for rendering this information... but it's lunch time! I hope you enjoyed seeing some of these queries. Graph databases are fascinating, and can really flex their muscles on problems like this one. How would I have even begun to approach this in something like MySQL? I would speculate, but again... lunch.