Democratising data at the Financial Times

The impostor

I feel like an impostor. Many of these talks are about openness, for the
greater benefit of humanity, but my talk is about closed, private data, mostly
for the benefit of one organisation. But a lot of the themes - hackability,
simplicity, usability, interoperability - are directly applicable to private
organisations like the FT. So this talk is about that.

The FT is a 125-year-old news organisation with nearly 800k subscribers; over
70% of those are digital subscriptions, and we have over 5k companies on
corporate licences. For a sense of scale, the central data team has around 30
people.

This is Tom Betts, our Chief Data Officer. Last year he came up with the phrase
'democratising data' to describe a push to help the organisation make data
central to people's day-to-day lives.

No matter what job you do, the better educated you are, the better decisions
you can take. So in democratising data - we are really talking about
accessibility of our data.

If your data is locked up in a warehouse, or (often) a dozen different
warehouses, or it's stored in opaque formats, or has to be fished out with odd
languages or protocols, then it's going to be less accessible, less democratic.

It's not dissimilar to the various open data manifestos. Here's a list of
criteria from a site called Open Definition, part of the Open Knowledge
Foundation. Is it available online? Can we use it? Is it machine-readable?
Is it available in bulk? Is it free?

Although the FT and other organisations won't make their data openly available,
the needs of their internal communities are the same as those of users of, say,
government data. You can ask yourself: does your internal data follow these
rules?

I should say it’s not entirely universal at the FT - financial and personally
identifiable information especially is kept more closed.

Data powers tools for the newsroom

This is Lantern, a tool used by our newsroom to gauge the performance of the
stories we publish. These sorts of tools are quite commonplace; most publishers
use them. Every time a user does something on FT.com, a tracking pixel is
logged with metadata describing what they did. Lantern aggregates this data and
displays it in meaningful ways. It's partly education - what percentage of
users found this type of story via another website - and partly a
decision-making tool - e.g. which stories are going stale on the front page.
This was a UK-focussed story, so we can see the peak of UK traffic for the
article is around 8am, that people took 43 seconds to read it, and that the
retention rate was 14% - that is, 14% of readers went on to do something else
on the site after reading it.

Our news agenda isn't led by this, but in terms of understanding our audience,
the impact of promotion and so on, it has its place in helping us understand
the consequences of our actions.

Similarly, a key part of how we communicate with our users is via email. We
have dozens of daily emails - some automated, some curated: newsletters, alerts
triggered by keywords, breaking news, marketing - and we send several million
each day.

Email has a life-cycle - from when you subscribe, to send, open, click,
unsubscribe. So our editors have dashboards to see the performance of each
email.

The analytics team

The FT has a central data analytics team. They use a mix of SQL, Excel, R and,
more recently, the hosted reporting product Chart.io. This is fairly typical of
the sort of information being produced, describing the referral traffic to our
web app.

You can see the influence of social media referrals, in red, compared to the
search referrals, in green - about a 50/50 split.

Data in our products

Aside from reporting, we also power parts of the website purely from data.
Here's a standard 'top 10' most popular stories, generated from a simple
rolling count of who is reading what over the last 12 hours or so. It's one of
the most popular parts of the front page.
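
To make that concrete, here's a rough sketch of a rolling counter using Redis
sorted sets bucketed by hour - the key names and windowing are illustrative,
not our exact implementation:

    import time
    import redis

    r = redis.Redis()
    WINDOW_HOURS = 12

    def record_view(article_id: str) -> None:
        # Bump this article's score in the current hour's bucket.
        bucket = f"views:{int(time.time() // 3600)}"
        r.zincrby(bucket, 1, article_id)
        r.expire(bucket, WINDOW_HOURS * 3600)

    def top_10() -> list:
        # Merge the last 12 hourly buckets, then take the highest scores.
        now = int(time.time() // 3600)
        buckets = [f"views:{now - h}" for h in range(WINDOW_HOURS)]
        r.zunionstore("views:rolling", buckets)
        return r.zrevrange("views:rolling", 0, 9, withscores=True)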

Our journalists also tag all their stories with topics so we can display
popularity on a more thematic level - what was in the news today, or this week,
or what have we written most about.

To power that sort of thing, internally we have a rich graph of interconnected
topics. We can link stories to companies, and companies to their industry,
their board members, their stock listings and so on.
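
As a hedged illustration of the kind of relationship query this makes possible
- the node labels and properties here are made up for the example, not our
actual schema - using the Neo4j Python driver:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    def stories_about_industry_peers(company: str) -> list:
        # Walk from a company to its industry, then back out to stories
        # tagged with other companies in that industry.
        query = """
        MATCH (c:Company {name: $name})-[:IN_INDUSTRY]->(i:Industry)
              <-[:IN_INDUSTRY]-(peer:Company)<-[:TAGGED_WITH]-(s:Story)
        RETURN DISTINCT s.title AS title LIMIT 10
        """
        with driver.session() as session:
            return [record["title"] for record in session.run(query, name=company)]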

On the FT site you can ‘follow’ any of these topics.

By linking our internal model of the news to our model of the user activity we
can generate personalised journeys through the information we publish.

Here's just one simple example of how that looks. Rather than recommend based
on what everyone else is doing, we can find things that relate to you
personally, which we think is more valuable.

So we've got quite a data-led, multi-layered approach to recommendations at the
moment, and these different ways of helping people find what they are
interested in all complement our journalists' editorial line.

Data is propaganda :)

We have an internal communications team. They build dashboards to display on
the screens around the office - in the canteen, reception and so on. Typically
these are things that give a sense of who is using the FT and what they are
reading. They need to be visual, something that explains itself within a few
seconds as you walk past.

This is article stats cut by country. You can sort of see from this that most
countries like reading about themselves.

Data for growing our audiences

Marketing is one of the most interesting parts of the FT when it comes to data.

The paywall, pricing and so on are all underpinned by A/B testing and
projections based on behavioural and financial data.

On the screen is an example of the output from what we call the Propensity
API. It's a predictive model of an anonymous person's likelihood to subscribe
to the FT. A score of 0 means very unlikely, a score of 1 very likely, and
within that, the person's propensity to subscribe to a particular offer, e.g. a
£1 trial.

It's built by data scientists, distilling 500 variables into a dozen or so key
indicators of future behaviour.

Having this in API form means we can adapt the experience for different types
of user - for example, what would it take for a person who we think has a high
propensity to subscribe to actually do so: marketing, discounts, a showcase,
etc.
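
To give a flavour of how a consumer might use it, here's a sketch - the
endpoint, field names and offer keys are illustrative assumptions, not the real
API:

    import requests

    resp = requests.get(
        "https://propensity.example.ft.com/score",  # hypothetical endpoint
        params={"anonymous_id": "abc123"},
    )
    data = resp.json()  # e.g. {"score": 0.82, "offers": {"trial_1gbp": 0.64}}

    if data["score"] > 0.7:
        # Surface whichever offer this person is most likely to take up.
        best_offer = max(data["offers"], key=data["offers"].get)
        print("show offer:", best_offer)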

Customer research collects structured data

This is a form on the website, prompting the user for feedback. In previous
companies I've worked at, customer research teams used external third-party
survey tools to collect qualitative data from the audience. These work well,
but typically the information ends up disconnected from the rest of the data in
your warehouse.

Collecting it ourselves and connecting it to the rest of our customer data
means we have a much better handle on the value of each piece of feedback - for
example, you can take each piece of negative feedback and look at what the user
had done, or did after they sent it. We might want to know if negative feedback
results in a lower subscription renewal rate.

Or we can split out feedback for loyal users, which we may want to prioritise
or pay closer attention to.

The team who use this data like spreadsheets, so we pump it into Google Sheets
via its API, which lets them do the analysis they need.
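
A minimal sketch of that kind of push, assuming the gspread library and a
hypothetical sheet and row layout:

    import gspread

    gc = gspread.service_account()  # credentials from a service-account file
    sheet = gc.open("customer-feedback").sheet1  # hypothetical sheet name

    def push_feedback(event: dict) -> None:
        # Append one feedback event as a row the research team can filter.
        sheet.append_row([
            event["timestamp"],
            event["user_id"],
            event["rating"],
            event["comment"],
        ])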

And because customer feedback is so important, we broadcast it to all staff on
an internal Slack channel. It's an unhealthy mix of praise, disappointment and
general abuse.

Democracy = Diversity

It’s a very diverse set of use cases.

Lots of users - represented right across the whole business, from the newsroom,
to commercial, to the digital product teams.

Different needs - the newsroom needs real-time data to make decisions, but the
behaviour models for marketing evolve over a 3 month window.

Different levels of complexity - some, like the most-popular globe, are simple
counters, whereas recommendation algorithms require specialist graph databases.

Different skills - some people are comfortable processing raw data, but
crunching terabytes isn't easy, so others need abstractions or spreadsheets to
make it simple to work with.

We need one system to do all of this, as fragmented data easily leads to
multiple versions of the truth: one data warehouse says one thing, another
tells a different story. That's a real problem as it creates conflict - and
data should be used to give answers, not create confusion.

Nothing on the market really fits that, so we started building something ourselves.

Modelling events

Architecturally, an analytics system typically collects events from the world
and puts them in a data warehouse. It's quite straightforward.

If we are going to make a usable system, people need to be able to understand
the data they need to send to it. Almost all the things going on inside our
ecosystem at the FT can be described as events. There’s a mix of things our
users do on the website, as well as things that our back-office systems are
doing. We express each event as a category and an action.

Category is the general domain - like 'page' or 'email' or 'signup' or
'infrastructure' - and action is a verb that describes what happened - 'view',
or, in the context of email, 'opened'.

The idea is to get away from arcane descriptions of these things and make them
more human.

We have several other concepts attached to this model:-

Events happen at a time on a date.

Events often happen to a user on a device. These are two different things and shouldn't be conflated - a person is a known subscriber that we can join up across devices.

Context describes anything you want to associate with the event - for a webpage it might be its URL; for an email, the address of the sender; for a subscription event, the payment transaction ID.

System is metadata associated with the sending system, e.g. API keys.

The API into the data warehouse is a simple HTTP JSON one. You should be able
to see all the concepts I just mentioned serialised as JSON.
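
Since the slide isn't reproduced here, this is a hedged reconstruction of what
such a payload might look like - the endpoint and exact field names are
illustrative, not the real schema:

    import requests

    event = {
        "category": "page",
        "action": "view",
        "time": "2016-05-10T08:01:32Z",
        "user": {"ft_session": "abc123"},    # the known person, if signed in
        "device": {"guid": "device-456"},    # kept separate from the user
        "context": {"url": "https://www.ft.com/some-story"},
        "system": {"api_key": "key-789", "source": "ft.com"},
    }
    requests.post("https://events.example.ft.com/ingest", json=event)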

Again, the emphasis is on simplicity - there are no strange interfaces to
learn, no limitations on key-value pairs, no dependency on particular
libraries. It's essentially schema-less. Anything that can generate an HTTP
request can start sending events, and, importantly, anyone in the FT can read
the API documentation and integrate the collection of the events they care
about in minutes. There are some edge cases where we just need a pixel with
data sent via the query string, but 99% of users use JSON.

So we've got lots of systems generating events.

Client-side applications running in the browser, server-side events, web hooks
from third-party systems, things like AMP - we even capture offline events,
which are buffered and then sent in a batch.

One pipeline, many sinks

The very first version of the system simply had this API for everyone to send
events to, and it put them in the data warehouse. But all databases have
limitations - Redshift is powerful but slow; Elasticsearch is a great cache but
not suitable for 10 years' worth of analytics data. If you think back to some
of the examples at the start of this presentation, for a simple 'top 10 news
stories' Redis is a great, cheap choice. The newsroom analytics tool uses
Elasticsearch - essentially a cache of 30 days. The recommendation system uses
Neo4j, as it needs to query relationships.

So we needed a way to let people consume the event stream into their own
specialist systems. We ended up with two options.

Kinesis is a natural choice for this problem. It's an ordered, seven-day record
of all your events - you can tap into a point in history and start replaying
the events into your consumer. It's a shared stream.
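
A sketch of tapping into the stream with boto3 - the stream name is
hypothetical, and a real consumer would handle multiple shards (or use the
Kinesis Client Library):

    import json
    import boto3

    kinesis = boto3.client("kinesis")
    STREAM = "event-pipeline"  # hypothetical stream name

    # Start from the oldest retained record and replay forwards.
    shard_id = kinesis.describe_stream(StreamName=STREAM)[
        "StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]

    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator)
        for record in batch["Records"]:
            event = json.loads(record["Data"])
            # ...store whichever fields this consumer cares about...
        iterator = batch.get("NextShardIterator")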

SQS is a message queue. It's conceptually simpler than Kinesis - you pick up a
message, process it, delete it, pick up the next one, and so on. Some people
liked this simplicity.
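
The equivalent SQS loop, again as a sketch with a hypothetical queue URL:

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/events"  # hypothetical

    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            print(msg["Body"])  # stand-in for real processing
            # Delete only after successful processing, so failures are redelivered.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])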

Enriching the events to add meaning

So we have this data pipeline. We've got a way for anyone to put something into
it and a way for anyone to take the collective pool of events out again. But
what value can it add?

When we talked to people and looked at what they were using it for, they often
wanted to transform or annotate events with extra information.

Sometimes they were doing the same thing over and over. Each variation
introduces an opportunity for mistakes. So we want to centralise some of the
generally useful things people were doing. We call these enrichments.

If your event contains a URL, we tokenise it. Most analytics software does
something like this with URLs.

Rarely do you need to perform operations on a full URL; you typically want a
parameter from the query string (e.g. if you are analysing search trends), or a
path, or a domain to use as a filter.
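
A minimal sketch of that tokenisation using Python's standard library - the
output field names are illustrative:

    from urllib.parse import urlparse, parse_qs

    def tokenise(url: str) -> dict:
        parts = urlparse(url)
        return {
            "domain": parts.netloc,
            "path": parts.path,
            "query": {k: v[0] for k, v in parse_qs(parts.query).items()},
        }

    # tokenise("https://www.ft.com/search?q=brexit")
    # -> {"domain": "www.ft.com", "path": "/search", "query": {"q": "brexit"}}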

Each annotation is just appended to the event, so the original data exists
alongside the enrichment.

We have a time annotation. This transforms the simple ISO timestamp recorded
when the event was received by the pipeline into lots of useful properties.

Some of this is to speed up data processing for downstream systems with poor or
slow date-handling capabilities - operating on numbers can be faster than
parsing dates. We ensure everything is transposed to the UTC timezone to avoid
confusion.

Annotations like 'week' have a specific meaning at the FT when generating our
weekly reports.
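
As a sketch, this is the sort of expansion that annotation might do - note the
FT's actual 'week' rule isn't defined here, so I've used the ISO week as a
stand-in:

    from datetime import datetime, timezone

    def annotate_time(iso_stamp: str) -> dict:
        # Normalise to UTC, then pre-compute fields that downstream systems
        # would otherwise have to parse out themselves.
        dt = datetime.fromisoformat(iso_stamp.replace("Z", "+00:00"))
        dt = dt.astimezone(timezone.utc)
        return {
            "epoch": int(dt.timestamp()),
            "year": dt.year,
            "month": dt.month,
            "day": dt.day,
            "hour": dt.hour,
            "week": dt.isocalendar()[1],  # stand-in for the FT reporting week
            "weekday": dt.strftime("%A"),
        }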

So enrichments help centralise this business logic.

Sometimes a transform needs to connect to an API. Each event, naturally, has an
associated IP address, so we fire that at the MaxMind API. MaxMind is a service
that geo-locates an IP address to a reasonable degree of accuracy - so our
events can now all be geo-located down to city level in most parts of the
world. You can also see we've hacked on support for FT office detection too, to
help people identify data generated by staff.
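
A sketch of that lookup using MaxMind's geoip2 library, with the office
detection shown as a hypothetical IP-range check:

    import ipaddress
    import geoip2.database

    reader = geoip2.database.Reader("GeoLite2-City.mmdb")
    FT_OFFICE_NETS = [ipaddress.ip_network("203.0.113.0/24")]  # hypothetical ranges

    def annotate_geo(ip: str) -> dict:
        response = reader.city(ip)
        return {
            "city": response.city.name,
            "country": response.country.iso_code,
            "is_ft_office": any(ipaddress.ip_address(ip) in net
                                for net in FT_OFFICE_NETS),
        }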

Enrichments allow us to take data from any other FT API or external API to make
the event richer, more meaningful.

Sometimes we need to invent an API.

Given a domain name, like facebook.com, this one classifies it into one of six
buckets - social, in this case; in other cases search, news, partner and so
on.

This enrichment makes it much easier to do a quick analysis on our referral
traffic.
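
A toy version of that classifier - the bucket contents are illustrative, and
the real service presumably maintains a much larger curated mapping:

    BUCKETS = {
        "social": {"facebook.com", "twitter.com", "linkedin.com"},
        "search": {"google.com", "bing.com", "duckduckgo.com"},
        "news": {"bbc.co.uk", "reuters.com"},
        # ...plus partner and the remaining buckets...
    }

    def classify(domain: str) -> str:
        for bucket, domains in BUCKETS.items():
            if domain in domains:
                return bucket
        return "other"

    # classify("facebook.com") -> "social"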

Enrichment results in a huge JSON document being constructed for every event.
Consumers around the business can pick and choose the events they want to store
in their own systems, then chuck the rest away.

We’ve built 22 annotations to date - each one adding more meaning to the events we collect.

It takes under a second from an event happening to it being exposed on our event stream.

There are some interesting experimental ideas too - market prices, weather - to
help find correlations with the rest of our data.

It's working ok for us. As of March this year we no longer use a third-party
analytics system at the FT.

The most painful thing at the moment is the alignment of event schemas across
multiple products.

If you are tracking something, say how people use video, in three different
ways, it's painful to use that data - so we've started writing schemas for each
type of event and validating them inside the pipeline.
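
A sketch of that validation step using the jsonschema library - the schema
itself is a hypothetical example for a video event:

    from jsonschema import ValidationError, validate

    VIDEO_EVENT_SCHEMA = {
        "type": "object",
        "properties": {
            "category": {"const": "video"},
            "action": {"enum": ["play", "pause", "complete"]},
            "context": {
                "type": "object",
                "properties": {"video_id": {"type": "string"}},
                "required": ["video_id"],
            },
        },
        "required": ["category", "action", "context"],
    }

    def is_valid(event: dict) -> bool:
        try:
            validate(event, VIDEO_EVENT_SCHEMA)
            return True
        except ValidationError:
            return False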

Democracy?

So have we reached the original vision of democracy?

We've focussed on the users - as you saw from the many examples at the start

Iterative - 700+ production releases which let us quickly adapt to people’s needs

Open for contributions - anyone in the company can improve it

None of the ideas I've talked about were really planned as such - we didn't
draw up an 18-month plan of all the things people wanted to do with data; we
just tried to make it usable and accessible and watched how people used it.

Which I think is testimony to the principles of Open Data:- if it's made
accessible, it allows curious people to take it and play with it; it stimulates
ideas; they learn something, then share it and create value from it.