Sport is a rewarding field for data
journalists: there is a lot of data, but surprisingly few data-driven
stories about it. At Swiss Radio and Television (SRF),
my colleagues have been reporting intensively on Swiss tennis star
Roger Federer for 20 years. We asked ourselves: who, if not us,
should produce the best and most comprehensive data
analysis of Roger Federer? So we gave it a try. If you haven't seen the result
yet, you can find it here.

In tennis, every millimeter of the game
is measured, so it can't be that hard to find data on every move of the “Maestro”
(one of his many nicknames). Or can it?

No official data source

On the ATP
website, you can find the most important metrics for each match:
service games won, number of aces, etc. But a download button? Forget
it. The ATP, which organizes most tennis tournaments and manages the
world rankings, does not offer an official data API from which one could
download all of its stats. So we had to get the data elsewhere.

But this turned out to be more
difficult than expected. The first hurdle: Federer started playing
tennis at a professional level in 1998. Since we had the ambition to
map his entire career, it was clear that we needed a data source that
covered everything from 1998 to the present day, in other words one that is
constantly updated.

And this was not easy to find on
Google. If you search for "Tennis ATP Data API", the first
thing you see is a scraper that downloads data from the ATP website –
but given the large amount of data we wanted to avoid that.

Very quickly, you come across the
GitHub repositories of
Jeff Sackmann, a tennis nerd who gathers a large amount of tennis
data, either by scraping it or by collecting it by hand together with volunteers. He
makes it all available as CSVs. For anyone who wants to start working
with tennis data, this is a great place to start. At the
time of our research, however, it was unclear whether and how
regularly the data would be updated.
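As an illustration of what working with such CSVs can look like (this is not the code used for the story), here is a minimal Python sketch that filters one player's matches and counts wins and losses. The column names `tourney_date`, `tourney_name`, `winner_name` and `loser_name` are an assumption about the file layout, the helper names `matches_of` and `win_loss` are made up for this example, and the sample rows are dummy data, not real results.

```python
import csv
import io

# Dummy rows for illustration only; these are NOT real match results.
# The column names are an assumption about the CSV layout.
SAMPLE_CSV = """tourney_date,tourney_name,winner_name,loser_name
19980706,Tournament X,Player A,Roger Federer
20030630,Tournament Y,Roger Federer,Player B
20030707,Tournament Y,Player C,Player D
"""


def matches_of(player, csv_file):
    """Return every row in which `player` appears as winner or loser."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader
            if player in (row["winner_name"], row["loser_name"])]


def win_loss(player, rows):
    """Count wins and losses of `player` in rows that all involve that player."""
    wins = sum(1 for row in rows if row["winner_name"] == player)
    return wins, len(rows) - wins


federer_rows = matches_of("Roger Federer", io.StringIO(SAMPLE_CSV))
record = win_loss("Roger Federer", federer_rows)
print(record)
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened CSV file, and the column names would need to be checked against its header row.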

Our process took over three months,
from the idea to publication. We wanted to publish the story at a
major event in Federer's career. But what event would that be? In
autumn 2017, when we started working, it wasn't clear at all: would
he win another big tournament? Or might he get injured and retire? We
had to prepare for all eventualities (for those who are interested:
here is our internal
summary in German, a document we write for each of our research projects,
in which we discussed the possible dates).

A little help from
Serbia

Our rescue came in the form of the website
ultimatetennisstatistics.com
by Mileta Cekovic. The code of the whole site is open source. If you
download the repository,
you can recreate the whole site on your (Windows) computer,
including a Postgres database in which all the data is stored in a
structured way.

So we had the data, but what were we
actually looking for? I knew the rules of tennis, but had no idea what
might be interesting in the data. We clearly needed help! We got it
from Bernhard Schär, a Federer insider from day one. Together
with him we formulated a number of hypotheses and checked whether
they held up:

Is it true that Federer was
older than others when he got to the top?

Who is his worst opponent? Why?

Is Federer really the GOAT? The
greatest of all time?

What is the competition doing?

At the same time, we were looking for
projects that could serve as inspiration. In a Google
Doc, all team members collected screenshots of interesting data
journalism projects in the field of sports. With the help of this
collection, we were able to formulate further exciting hypotheses and
see which forms of presentation might be suitable for which data set.

There was one question that we
struggled with for every visualization: How deep can we go? It was clear that
more than half of the readers would read the story on a smartphone
(in the end it was over two thirds). Nevertheless, we didn't want to do
without very detailed graphics.

One way to deal with this dilemma is to
hide less important data on mobile devices. In the following graphic,
for example, we've provided the most important players with small
portraits on the desktop and omitted them on mobile devices. In
addition, on mobile devices we drastically reduced the number of
data points shown in order to make the differences between the
important players more visible.
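How the data points were actually thinned out for the mobile version is not documented here. One simple way to do it, sketched below in Python under that caveat, is to keep a fixed number of evenly spaced points while always preserving the first and the last one. The function name `thin_points`, the `max_points` parameter and the sample series are all hypothetical.

```python
def thin_points(points, max_points):
    """Keep at most `max_points` items, always including the first and last."""
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    if len(points) <= max_points:
        return list(points)
    # Pick evenly spaced indices between the first and the last point;
    # the set removes duplicates that rounding could produce.
    last = len(points) - 1
    indices = sorted({round(i * last / (max_points - 1))
                      for i in range(max_points)})
    return [points[i] for i in indices]


# Dummy series of (week, value) pairs; not real ranking data.
desktop_series = [(week, 100 - week) for week in range(50)]
mobile_series = thin_points(desktop_series, 10)
print(len(desktop_series), len(mobile_series))
```

Even spacing is the crudest possible rule: it keeps the overall shape of a line but can drop individual peaks, so a real chart might need a rule that also protects extreme values.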

Looking back, we probably should have
done the same with other graphics as well. For example, the first and
the last graphics are very detailed. They can still be viewed on mobile
devices, but legibility would have been better if we had
removed some data points. This would have made the lines simpler and
easier to read.

But we also always had to ask
ourselves: Are we able to select data in such a way that even Federer
fans can learn something new without excluding readers who don't know
tennis at all?

In search of an answer, we tried to
talk to as many people as possible: We showed prototypes of our
graphics to sports editors and friends outside the company at an
early stage and asked them: Do you understand what we have visualized
here? Are you surprised? Are you interested?

It was very helpful that we were able
to create a lot of graphics very quickly using R and ggplot. These
graphics can also be found in our method
description (something we always publish at the same time as our
articles).

What was missing?

When the Australian Open took place in
January and Federer won round after round, it
slowly became clear to us: OK, a possible publication date is getting
closer and closer. We knew that the final would take place on 28
January. Suddenly everything had to go pretty fast. That was also the
reason why we couldn't take a closer look at certain things.

One thing I personally find a pity: not
a single tennis court is visualized in the whole story. I would
have loved to work with Hawk-Eye data to analyse Federer's game more
closely and compare it with that of his competitors.

We also didn't analyze much other data
regarding Federer's playing style: With which stroke does he score,
the forehand or the backhand? An analysis of the data from Jeff
Sackmann's Match
Charting Project (a project in which volunteers log every
second of a match) didn't immediately provide the desired answers.

Update: More
difficult than expected

At SRF Data we highly value
reproducibility. It was important to us that we could easily update
the visualizations, so that we could publish the story again
whenever we wanted. Shortly after the first publication, Federer once again
conquered the world number one ranking, as the oldest tennis
player of all time. Further important events in his career
will probably follow, but an update would still require a lot of
effort. Not only the graphics, but also the text would have to be
rewritten. And not only in one language, but in eight. Swissinfo
translated the piece into Japanese,
Russian
and Chinese,
among others.

So we have to admit to ourselves that
we will probably let the project rest after all and devote ourselves
to more classic data journalism topics again.