The Grand Slams (with IBM) and now the WTA (with SAP) are claiming to deliver powerful analytics to tennis fans. And it’s certainly true that IBM and SAP collect way more data than the tours would without them. But what happens to that data? What analytics do fans actually get?

Based on our experience after several years of IBM working with the Slams and Hawkeye operating at top tournaments, the answers aren’t very promising. IBM tracks lots of interesting stats, makes some shiny graphs available during matches, and the end result of all this is … Keys to the Match?

Once matches are over and the performance of the Keys to the Match is (blessedly) forgotten, all that data goes into a black hole.

Here’s the message: IBM collects the data. IBM analyzes the data. IBM owns the data. IBM plasters their logo and their “Big Data” slogans all over anything that contains any part of the data. The tournaments and tours are complicit in this: IBM signs a big contract, makes their analytics part of their marketing, and the tournaments and tours consider it a big step forward for tennis analysis.

Sometimes, marketing-driven analytics can be fun. They give some fans what they want–counts of forehand winners, or average first-serve speeds. But let’s not fool ourselves. What IBM offers isn’t advancing our knowledge of tennis. In fact, it may be strengthening the same false beliefs that analytical work should be correcting.

SAP will provide the media with insightful and easily consumable post-match notes which offer point-by-point analysis via a simple point tracker, highlight key events in the match, and compare previous head-to-head and 2013 season performance statistics.

“Easily consumable” is code for “we decide what the narratives are, and we come up with numbers to amplify those narratives.”

Narrative-driven analytics are just as bad as–and perhaps more insidious than–marketing-driven analytics, which are simply useless. The amount of raw data generated in a tennis match is enormous, which is why TV broadcasts give us the same small tidbits of Hawkeye data: distance run during a point, average rally hit point, and so on. So, under the weight of all those possibilities, why not just find the numbers that support the prevailing narrative? The media will cite those numbers, the fans will feel edified, and SAP will get its name dropped all over the place.

The first promising sign for Sharapova against Kanepi was her rally hit point. Sharapova made contact with the ball 76% of the time behind the baseline compared to 89% for her opponent. It doesn’t matter so much what the percentage is – only that it is better than the person standing on the other side of the net.

Is that actually true? I don’t think anyone has ever published any research on whether rally hit point correlates with winning, though it seems sensible enough. In any case, these numbers are crying out for more context. Is 76% good for Maria? How about keeping her opponent behind the baseline 89% of the time? Is the gap between 76% and 89% particularly large on the WTA? Does Maria’s rally hit point in one match tell us anything about her likely rally hit point in her next match? After all, the article purports to offer “keys to match” for Maria against her next opponent, Serena Williams.

Here’s another one:

There is a lot to be said for winning the first point of your own service game and that rung true for Sharapova in her quarterfinal. When she won the opening point in 11 of her service games she went on to win nine of those games.

Is there any evidence that winning your first point is more valuable than, say, winning your second point? Does Sharapova typically have a tough time winning her opening service point? Is Kanepi a notably difficult returner on the deuce side, or early in games? “There is a lot to be said” means, roughly, that “we hear this claim a lot, and SAP generated this stat.”
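One way to see why the first-point claim needs context is a rough sketch of the simplest possible model, in which the server wins every point independently with the same probability p. That assumption, the function name hold_probability, and the value p = 0.6 are mine for illustration only; none of them comes from SAP's data or describes Sharapova's actual serve.

```python
from functools import lru_cache


def hold_probability(p, server=0, returner=0):
    """Chance the server wins the game from a score of (server, returner)
    points won, ASSUMING every point is won independently with probability p."""
    q = 1.0 - p

    @lru_cache(maxsize=None)
    def win(a, b):
        if a >= 4 and a - b >= 2:
            return 1.0  # server has won the game
        if b >= 4 and b - a >= 2:
            return 0.0  # returner has won the game
        if a >= 3 and a == b:
            # Deuce: the server must win two points in a row before the returner does.
            return p * p / (p * p + q * q)
        return p * win(a + 1, b) + q * win(a, b + 1)

    return win(server, returner)


p = 0.6  # hypothetical serve-point win rate, not a measured value

# Hold rate given the server won the FIRST point (score is 15-0).
after_first = hold_probability(p, 1, 0)

# Hold rate given the server won the SECOND point: the first point
# was either won (30-0) or lost (15-15) before it.
after_second = p * hold_probability(p, 2, 0) + (1 - p) * hold_probability(p, 1, 1)

print(f"{after_first:.3f} {after_second:.3f}")  # prints "0.842 0.842"
```

Under this model the two conditional hold rates are identical, and a server who wins 60% of points holds about 84% of the time after winning any single early point. Nine holds in 11 games is about 82%, so the quoted stat looks like roughly what such a model predicts rather than anything special about the opening point. Real data could show otherwise, which is exactly why the raw data matters.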

In any type of analytical work, context is everything. Narrative-driven analytics strip out all context.
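Sample size is one piece of that missing context. As a hedged sketch, the standard Wilson score interval shows how much uncertainty sits behind the nine-of-11 figure quoted above. The function name wilson_interval and the choice of a 95% level (z = 1.96) are my own; the only inputs taken from the quoted article are the 9 and the 11.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# 9 holds in 11 service games where the server won the opening point.
low, high = wilson_interval(9, 11)
print(f"{low:.2f} to {high:.2f}")  # prints "0.52 to 0.95"
```

An interval running from about 52% to 95% is consistent with almost any story one might want to tell about a single match, which is the problem with presenting the number on its own.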

The alternative

IBM, SAP, and Hawkeye are tracking a huge amount of tennis data. For the most part, the raw data is inaccessible to researchers. The outsiders who are most likely to provide the context that tennis stats so desperately need just don’t have the tools to evaluate these narrative-driven offerings.

Other sporting organizations–notably Major League Baseball–make huge amounts of raw data available. All this data makes fans more engaged, not less. It’s simply another way for the tours to get fans excited about the game. Statheads–and the lovely people who read their blogs–buy tickets too.

So, SAP, how about it? Make your branded graphics for TV broadcasts. Provide your easily consumable stats for the media. But while you’re at it, make your raw data available for independent researchers. That’s something we should all be able to get excited about.

10 responses to “Analytics That Aren’t: Why I’m Not Excited about SAP in Tennis”

You are so right. As a baseball fan for over 40 years, I find the amount of tennis statistics available to the public paltry. I would love to have the data to do some sabermetrics to find out what truly wins matches. Thanks for your article.

I really do wonder why SAP/IBM/the WTA/whoever even bother. Isn’t what SAP *really* wants just to write a big check so that the WTA will plaster their logo over everything? And doesn’t the WTA mostly just want to cash said check? Which is understandable, but also irrelevant to, y’know, match analysis.

Maybe they’re really just terrified that somebody will look at the raw data and tell everybody that their so-called “insights” are just made up? But

I think I know exactly why IBM and SAP bother. No matter how big your budget, you can’t buy authority by simply loading up on ad space. But if you get folks in the NY Times and the Guardian and every other paper that covers tennis to use your name in conjunction with all the most advanced statistics they’ll ever mention, that’s PR that money can’t buy. Since these are massive corporations, they can spend tons of money to put themselves in a position to get that PR.

Yeah, that’s probably it, isn’t it? “Official statistics by X” on the TV broadcast isn’t enough I suppose. Definitely going to pay attention to whether the NYT etc. do regularly mention the company names from now on, tho. See if it’s working.

PS: Guess my first comment was truncated somehow, last sentence was: “But then again you’re doing that already.”

The only reason I was excited about the WTA partnership with SAP was the extraordinary job that same company has done with the NBA in terms of opening up all sorts of data for everyone to use.

Just now, I went to nba.com/stats to see if something I heard was true: that you can watch all shot attempts from any given player. My beleaguered Los Angeles Lakers started a castoff point guard tonight, and I read that he did well. So I went to his stats page, and lo and behold…I could watch all of his shot attempts.

Of course, the NBA has a vision about making these things available to everyone, so it might just be that they were the ones that pushed SAP to create this incredible portal. It might just be that the WTA won’t demand such a thing, given the establishment’s attitude about making complex data about matches available to the general public.

So I still have a little bit of hope for SAP and the WTA…though I fear the worst and agree with everything you wrote.

SAP really did a much better job with the NBA page than you’d expect. Opportunities for improvement, but significantly better than what the NBA had before. Admittedly, analytics are hard, so it will take a little time for SAP to recreate what IBM had, but then I’m pretty comfortable they’ll start opening the kimono some for average fans.

I don’t buy that analytics are hard. Collecting data is work-intensive, but with a system in place, it’s easy. (Looks like they’re already there.) Releasing data, which I’m asking for, is also easy. Making pretty websites with analysis that serves marketing purposes … that’s probably hard.

Analytics *are* hard – just making data available is easier, but that doesn’t make it easily accessible and meaningful for the majority of end-users. If all people needed were data, IBM and SAP would both have much smaller wallets.

The only thing that confuses me a little about all this is the idea of narrative. What are some of the narratives and who are they serving? I did notice that, around the time unforced errors started becoming a huge part of Federer’s matches, it became harder to track unforced error stats in matches. So is it this sort of thing? Keeping heroes on pedestals, etc?

it’s just laziness. some commentator (tv guy, writer, whoever) thinks they notice a pattern — like Fed hitting lots of shanks, or Novak winning points with DTL BHs, or whatever — and then numbers are generated to support that narrative. There’s no grand conspiracy, but these narratives do take on a life of their own, and just because there is data in the stories doesn’t mean they are factual.

in other words, people using this data are writing the same stories they always wrote. they just want a piece of supporting data, and whatever your story is, SAP/IBM/whoever can come up with some data that supports it.