Exclusive: How the Guardian online is breaking news with real-time big data analytics

C-level briefing: Transforming a hackaday project into the backbone of how journalists work in a data-driven organisation.

Staying competitive in any industry increasingly requires greater adoption of technology as a means of differentiation, and one of the focal points for finding that differentiator is the use of data.

Real-time analytics is sometimes viewed as the holy grail of the big data analytics world. The ability to gain immediate insights into data and to act on them quickly can allow organisations across a vast array of industries to get ahead of competitors.

The media industry faces a constant battle to maintain its readership at a time of growing competition and increased pressure to reduce costs.

To stay at the top of the market, the Guardian moved quickly to analyse its data by developing an in-house analytics platform called Ophan.

Born from a hackaday, Ophan was designed to create a feedback loop to make it easier for journalists to see how real-time changes to articles could affect visitor traffic.

Matthew O’Brien, Ophan and Data Lake specialist at Guardian News and Media told CBR: "It was never something that the Guardian had consciously decided that it would create its own in-house analytics system.

"It sort of evolved because of being very, very useful. If it wasn’t useful, if it wasn’t real time, then it may not have been the success it is right now. So it’s something certainly that’s been born of necessity in terms of the newsroom."

From a hackaday project to being used by the majority of the journalists at the Guardian, Ophan has become integral to how they work, and it’s all about real-time data.

The technology behind this data-driven initiative is Elasticsearch, an open source search server that is based on Lucene, an information retrieval library.

Shay Banon, Elasticsearch creator, founder and CTO of Elastic, told CBR: "If you open up Yelp or OpenTable, that is powered by Elasticsearch."

The technology is also used by major banks in the UK, he said: "Major banks in the UK use it to see when an ATM is broken. A system was written that indexed tweets around people tweeting that ATMs were broken, and by using the geo-location in Twitter, can pinpoint where the ATM is."

What this highlights is the power of search. Combined with Ophan, it gives the Guardian an effective feedback loop, which has helped to make it one of the most popular news websites in the UK.

The publisher had previously used systems that took four hours to get feedback to the SEO editor, a long time for a media organisation. This meant that feedback given to editors on tweaks to articles would often be based on outdated data.

The organisation now brings in data through a tracking pixel that sits on all of its pages. This includes the IP address, from which the visitor’s geo-location can be derived, the operating system and browser the user is on, and where they have been referred from. It does not, however, collect personally identifiable information.
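As a rough illustration of the kind of record such a pixel request could produce, here is a minimal Python sketch. The `PageView` class, its field names and the `page_view_from_request` helper are assumptions made for this example; the article does not describe Ophan's actual schema or code.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class PageView:
    """Hypothetical page-view record; field names are illustrative only."""
    ip_address: str      # used to derive an approximate location
    user_agent: str      # raw string carrying browser and operating system
    referrer_host: str   # e.g. "www.reddit.com", or "" for a direct visit
    path: str            # which page was viewed


def page_view_from_request(ip: str, headers: dict, page_url: str) -> PageView:
    """Build a page-view record from the details of a pixel request."""
    referrer = headers.get("Referer", "")
    return PageView(
        ip_address=ip,
        user_agent=headers.get("User-Agent", ""),
        # urlparse("").hostname is None, so fall back to an empty string
        referrer_host=urlparse(referrer).hostname or "",
        path=urlparse(page_url).path,
    )


view = page_view_from_request(
    ip="203.0.113.7",  # address from a range reserved for documentation
    headers={
        "User-Agent": "ExampleBrowser/1.0",
        "Referer": "https://www.reddit.com/r/news",
    },
    page_url="https://www.theguardian.com/uk-news/example-article",
)
print(view.referrer_host, view.path)
```

Note that nothing in this record identifies a named person: it holds only what the browser sends with any request.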

Once the data is collected it is put onto an Amazon Web Services (AWS) Kinesis stream, which helps to sort the data and make it understandable.

Following this, the data is put into Elasticsearch. O’Brien said: "We create documents called page views, which is where they are stored and create an index for each day so every day has an index."

Ophan holds 16 days of data, and it takes only 10-15 seconds from someone visiting a page to a journalist being able to see that data on a graph.
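One index per day makes a 16-day retention window simple to reason about: old data is removed by dropping whole indices rather than individual documents. The Python sketch below shows the idea. The `pageviews-` prefix and the date format are assumptions for illustration, as the article does not say how the Guardian names its indices.

```python
from datetime import date, timedelta

RETENTION_DAYS = 16          # figure quoted in the article
INDEX_PREFIX = "pageviews-"  # hypothetical naming scheme


def index_for(day: date) -> str:
    """Name of the daily index that holds page views for `day`."""
    return f"{INDEX_PREFIX}{day:%Y.%m.%d}"


def live_indices(today: date) -> list[str]:
    """Daily indices inside the retention window, newest first."""
    return [index_for(today - timedelta(days=n)) for n in range(RETENTION_DAYS)]


def expired_indices(existing: list[str], today: date) -> list[str]:
    """Indices from `existing` that fall outside the retention window."""
    keep = set(live_indices(today))
    return [name for name in existing if name not in keep]


today = date(2016, 3, 20)
print(index_for(today))
print(expired_indices(["pageviews-2016.03.04", "pageviews-2016.03.05"], today))
```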

Banon said: "It ended up being used by the whole newsroom, it was transformational. In the beginning it was only a small set of people being exposed to that information and suddenly that information became publicly available in the whole organisation – this is what happens on the site, now go deal with it and make it better."

Not only is this an example of real-time data analytics becoming the foundation of a business process, it is also an example of the democratisation of data empowering staff.

Supported by nine nodes running in AWS, Ophan and Elasticsearch receive a large amount of praise from O’Brien who refers to the technology as "resilient and very fast."

"The beauty of Kinesis is that it keeps your data for longer so you can replay your stream of data, so we hold seven days’ worth of data in Kinesis, that means if we have a problem with our application in some way we can replay that data," said O’Brien.
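The replay idea O’Brien describes can be shown with a toy model. The Python sketch below is an in-memory stand-in written for this example, not the Kinesis API: it keeps records for seven days and lets a consumer re-read everything from a chosen point in time, which is what makes recovery after an application fault possible.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # retention period quoted in the article


class ReplayableStream:
    """Toy in-memory model of a retained stream (not the Kinesis API)."""

    def __init__(self):
        self._records = []  # (arrival_time, payload) pairs in arrival order

    def put(self, arrived_at: datetime, payload: dict) -> None:
        self._records.append((arrived_at, payload))

    def trim(self, now: datetime) -> None:
        """Drop records older than the retention period."""
        cutoff = now - RETENTION
        self._records = [(t, p) for t, p in self._records if t >= cutoff]

    def replay_from(self, start: datetime):
        """Yield every retained record that arrived at or after `start`."""
        for arrived_at, payload in self._records:
            if arrived_at >= start:
                yield payload


now = datetime(2016, 3, 20, 12, 0)
stream = ReplayableStream()
stream.put(now - timedelta(days=8), {"path": "/old"})  # outside retention
stream.put(now - timedelta(hours=3), {"path": "/a"})
stream.put(now - timedelta(hours=1), {"path": "/b"})
stream.trim(now)

# Suppose the consuming application was faulty for the last two hours:
# once fixed, it re-reads the stream from the point where the fault began.
recovered = list(stream.replay_from(now - timedelta(hours=2)))
print(recovered)
```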

To truly make this a democratisation of data, the news outlet allows access to the system for anyone who has a Guardian email address, and there are no roles set, meaning that everyone can see all of the data in it.

O’Brien said that by making everything open and transparent, with a low bar to entry, they didn’t have to do much promotion of it.

"Even though that seems like a really small thing, it’s a really big thing. If you lower the bar to entry for things and you make it simple for people to access them in the first place, you are pretty much halfway there for it to be a useful tool. I can’t overemphasise that enough," said O’Brien.

At a time when protection of data is more important than ever, the Guardian ensures security by having a risk team that assesses all data and ensures that the data sets are encrypted at rest.

This is not a one-off project by any means, and the flexible, open source nature of the Guardian’s approach to technology has allowed it to add new features such as ‘Attention Time’, which measures how long a person is actually on a page and interacting with it.

Attention Time means that the Guardian can see how many minutes a user spends reading an article and how that compares between different referral sites such as Reddit, Facebook, Twitter, and more. That data is then fed into Ophan, deepening the Guardian’s knowledge of its readership and of how to tailor content to different sites, and in the end helping it become a better news organisation.
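A comparison of this kind comes down to grouping attention measurements by referral source and averaging them. The Python sketch below shows one way this could be done. The event format is an assumption and the sample figures are made up for illustration; they are not Guardian data.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample events for illustration only; not Guardian data.
events = [
    {"referrer": "Reddit", "attention_seconds": 95},
    {"referrer": "Facebook", "attention_seconds": 40},
    {"referrer": "Reddit", "attention_seconds": 125},
    {"referrer": "Twitter", "attention_seconds": 60},
    {"referrer": "Facebook", "attention_seconds": 80},
]


def average_attention_by_referrer(events):
    """Average attention time in seconds for each referral source."""
    by_referrer = defaultdict(list)
    for event in events:
        by_referrer[event["referrer"]].append(event["attention_seconds"])
    return {ref: mean(secs) for ref, secs in by_referrer.items()}


averages = average_attention_by_referrer(events)
for referrer, seconds in sorted(averages.items()):
    print(f"{referrer}: {seconds / 60:.1f} min")
```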

The changes that the Guardian has made by using real-time data analytics have resulted in a fundamental shift in how it operates. By becoming a data-driven organisation it has been able to remain one of the leading news organisations in the country.