Radio since 1922

TV since 1930

... and TV since 1930.
Since then it has grown to become one of the largest broadcasters in the world.

On the Web since 1994

The BBC had a web presence from quite early on as well. This
is a screenshot of the BBC web site in 1994.

Programme support

Since quite early on, programmes broadcast on the BBC would
have a section on the BBC web site. However these different
'micro-sites' would be commissioned individually, for each programme,
causing a big disparity in terms of coverage, consistency and persistence.
A few programmes would get a big web presence, while a very long
tail of programmes would get no web presence at all.

1000 to 1500 programmes per day across 70 channels

As we broadcast between 1000 and 1500 programmes per day across around 70 channels,
this approach does not scale well if we want all these programmes to have an online presence.

www.bbc.co.uk/programmes

We launched our BBC Programmes web site in 2007 to tackle this issue, by aggregating data from multiple
sources, such as commissioning data, archive data, data from playout systems,
and creating a persistent web presence for each of our programmes. Each individual
programme has its own (persistent) URI within the BBC Programmes site.
All programmes will get some web presence, which can
be enriched by creating themes around the programme and adding more data around our core data.
Our core data effectively acts as a backbone for all that ancillary content.

The web site is the API

Another important aspect of BBC Programmes is that the data behind each page can be
accessed through content negotiation. So if I request an RDF representation of that same programme
I was looking at earlier I'll get the following. There is no separate API - the web site is its own API, and
provides the data in JSON, XML, RDF and RDFa. I think BBC Programmes was one of the first large 'corporate' Linked
Data sites to be published.
Exposing all this data has many advantages. People experiment with our data and give us
ideas of new sources of data to integrate or new types of user experiences.
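
As a minimal sketch of what content negotiation looks like from the client side, the snippet below builds (without sending) a request for the RDF representation of a programme page. The programme identifier in the URI is a hypothetical placeholder, and the media type mapping is an assumption made for this illustration rather than something stated in this talk.

```python
from urllib.request import Request

# Media types assumed for each representation (illustrative mapping only).
MEDIA_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "xml": "application/xml",
    "rdf": "application/rdf+xml",
}


def build_request(programme_uri: str, fmt: str) -> Request:
    """Build (but do not send) a GET request asking for one representation."""
    return Request(programme_uri, headers={"Accept": MEDIA_TYPES[fmt]})


# Hypothetical programme URI: "b0000000" is a placeholder identifier.
uri = "https://www.bbc.co.uk/programmes/b0000000"
req = build_request(uri, "rdf")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

The same URI is used whatever the format; only the Accept header changes, which is what makes the web site usable as its own API.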

schema.org

We are also working with schema.org to make it compatible with TV and radio-specific concepts.
Schema.org is an effort led by major search engines to define semantic markup
that can be added to pages and used by these search engines to enrich their search results.
Working with schema.org means that major search engines can extract
the RDFa we embed in our pages and surface more
information about our programmes. For example here we see what Google's "Knowledge Graph"
has to say about the BBC series EastEnders, which includes information from Freebase and information
aggregated from schema.org markup from all over the web.

Using Linked Data (external)

We also link to external sources of Linked Data like Musicbrainz or DBpedia. This
enables us to use extra information held within the Linked Data cloud to enrich our pages. For example this page on Tom Waits
has a biography coming from Wikipedia and artist metadata coming from Musicbrainz. The only bit
of BBC data on this page is the playcount data (how often this artist was played on our programmes)
and the album review data.

Using Linked Data (internal)

We also use the Linked Data we publish internally. For example on this page the
aggregation of programmes at the bottom is generated from Linked Data published on the BBC
Programmes web site. However in this particular example the integration was ad-hoc, directly
using the RDF data published by BBC programmes in this web application.

Towards a Linked Data Platform

We're therefore building towards a "Linked Data Platform", centralising all this RDF data across the BBC
and making it queryable through SPARQL.
This platform would enable us to easily perform cross-domain queries, for example to
generate cross-domain aggregations, including
content from BBC sports, news, music, radio, TV, food, etc.
Such cross-domain queries and aggregations were traditionally very difficult to make at the BBC,
due to the way our data was split across many different applications - TV and radio, news and sports,
knowledge and learning...

World Cup 2010

This approach was first tested for the World Cup 2010 website as a means
to automate aggregation pages, for example around specific teams
or footballers, that were previously manually put together and maintained.

Tagging articles

These automated aggregations for various World Cup 2010-related concepts
are driven by journalists tagging articles
with web identifiers available in a centralised triple store. These identifiers denote people, places,
events etc. and are sourced from multiple data providers (including
Linked Open Data sources).
The resulting relationships between news articles and these concepts
are then pushed to the central triple store.
A benefit of using web identifiers as tags
is that they are unambiguous, and that we can retrieve more information about these tags when needed.
For example when news articles are tagged with place URIs, we can easily retrieve the geolocation
of these places, enabling us to plot our articles on a map. Tagging articles with URIs
enables us to tackle a wide range of unforeseen use-cases.

Dynamic aggregations

Aggregation pages are then built by issuing SPARQL queries to that central
triple store.
For example the England page automatically
includes links to news articles that were tagged with
the England team URI using this tagging tool.
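
To make the idea concrete, here is a minimal sketch of a tag-driven aggregation over a toy in-memory triple store, with the equivalent SPARQL query shown for comparison. All URIs, predicate names and article titles are hypothetical placeholders invented for this illustration; they are not the identifiers or vocabulary used on the World Cup site.

```python
# Hypothetical predicates and concept URIs (placeholders for illustration).
ABOUT = "http://example.org/vocab/about"
TITLE = "http://example.org/vocab/title"
ENGLAND = "http://example.org/teams/england"
BRAZIL = "http://example.org/teams/brazil"

# A toy triple store: a set of (subject, predicate, object) tuples.
triples = {
    ("http://example.org/news/1", ABOUT, ENGLAND),
    ("http://example.org/news/1", TITLE, "England squad announced"),
    ("http://example.org/news/2", ABOUT, BRAZIL),
    ("http://example.org/news/2", TITLE, "Brazil arrive in South Africa"),
    ("http://example.org/news/3", ABOUT, ENGLAND),
    ("http://example.org/news/3", TITLE, "England training report"),
}


def match(store, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def aggregation(store, concept):
    """Sorted titles of all articles tagged with the given concept URI."""
    titles = []
    for article, _, _ in match(store, p=ABOUT, o=concept):
        titles.extend(o for _, _, o in match(store, s=article, p=TITLE))
    return sorted(titles)


# The same lookup written as a SPARQL query against a real triple store.
SPARQL = """
SELECT ?title WHERE {
  ?article <http://example.org/vocab/about> <http://example.org/teams/england> ;
           <http://example.org/vocab/title> ?title .
}
"""

print(aggregation(triples, ENGLAND))
```

Because the page is just the result of a query, tagging a new article with the team URI is enough for it to show up in the aggregation.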

BBC London 2012

The same approach was scaled up and used to drive the London 2012 BBC web site, covering
around 250 countries, 300 events, 36 sports, 10,000 athletes and 30 venues. This data
is sourced from multiple places, both from Linked Open Data sources and commercial data providers.
Once all this data is sourced and we have web identifiers for all these things in our centralised triple
store, we can start annotating our content with them, in a similar way as done for the World Cup
2010 web site. We used this mechanism to build automated aggregation pages
for each of these things - for each country, sport, athlete and so on, by querying the
centralised triple store for all BBC items related to them.
The BBC London 2012 web site has been hugely successful, and has proved the feasibility
of this approach, heavily based on Semantic Web technologies, using a triple store, queried
through SPARQL, to store relationships between our content and domain knowledge.

We're now extending that work beyond sports, to aggregate all sorts
of BBC and non-BBC data. This will enable us to easily create feeds such as 'all news articles about
Barack Obama' or 'all videos about the place I am at', and to easily build cross-domain
aggregations.

BBC Ontologies

In order to support this approach we created a bunch of ontologies, modelling the domain knowledge held within the Linked Data Platform (LDP)
and often piggybacking on existing ones
(Event Ontology, FOAF, Music Ontology, GeoNames, etc.). They are all available on our website at bbc.co.uk/ontologies. They cover
programmes, wildlife, sports, learning and news and we use them as a backbone for the data in the LDP.

Linked Data and the BBC archive

Most of the data available within the Linked Data Platform is created manually.
It is a suitable approach going forward, but would be difficult to scale going
backwards. In the rest of this talk we are going to describe ongoing efforts
by BBC R&D to generate Linked Data describing our archive content in a
semi-automatic way, using a mixture of machine-generated interlinks and crowdsourcing.

The BBC Archive

The BBC has been broadcasting since 1922
and has accumulated a very large archive -- radio and TV programmes,
news articles, pictures, sheet music, production notes...

Cataloguing the archive

Items within this archive have been catalogued using a number of systems throughout
the years, such as this index card from April 1967.
At the top of this card, we
can see the classification of this particular item in a system called
Lonclass, based on the Universal Decimal Classification system, and which provides very detailed information.
This cataloguing effort has been geared towards reuse, and the coverage of the catalogue is not uniform across
the BBC's archive, for example excluding the BBC World Service which has been broadcasting to the world since 1932.

In Our Time archive

Traditionally, when it has come to publishing archive content online, the BBC focused on specific
topics or brands. An example of that is the Radio 4 In Our Time archive, giving access
to all In Our Time episodes online.

BBC Four Army collection

Another example is the BBC Four collections, which focus on particular topics rather than brands.
For example this collection aggregates archive content around the army.

Tagging programmes

In order to enable topic-based navigation within these collections, programmes are manually tagged with web identifiers
(DBpedia in particular)
using a tool similar to the journalist tool mentioned earlier. The resulting tags
are then used to generate topic-based aggregations.

Machines + Users

As mentioned before, the process of manual tagging is very time-consuming and would take a considerable time to
apply to the entire archive. The problem is compounded by the lack of
textual metadata for a portion of the archive: programmes for which we have no data,
or worse, wrong data, will always be left out of such brand or topic-specific slices of
the archive.
In the rest of the talk, we describe an alternative approach to publishing archives online.
Instead of focusing on very thin slices of archive content, we purposefully
expose large archives using machine-generated Linked Data describing this archive
that can be inaccurate, and we try to involve users
in helping us correct that automatically generated data.
In particular, there is enough Linked Data available online to help us bootstrap an automated tagging process.

The World Service archive

An interesting example is the BBC World Service archive, which has been left out of the cataloguing
and tagging efforts mentioned previously, but which has fully digitised its archive of
pre-recorded programmes. For the English language part of the World Service, the archive holds around 70,000
programmes since 1947, which amounts to about three years of continuous audio. The data around these digitised programmes is
quite sparse and sometimes wrong (e.g. lots of programmes with a broadcast date of 1901 or 2100).
But as mentioned before,
the content is accessible in digital format, as high quality WAV files.

Machine listening

Can we use the content itself to bootstrap search and navigation within such an archive? Ultimately the content
itself has most of the data needed to unlock this archive. Of course listening
to and annotating all of this archive would take a considerable amount of time, but can we automate this process?
In particular, can we automatically tag content with web identifiers, in a similar way to what editors are currently doing on the BBC web site?

Automated speech recognition

The first step towards understanding what programmes are about is to figure out what was said in the programme.
Here is a (very) high level overview of how speech recognition works.
You extract features from an audio stream, and try to map that
to a stream of text. In order to do so, you use a set of models: an acoustic
model which models how features map to phones, a language model which models
how likely particular sequences of words are, and finally a dictionary mapping
sequences of phones to words.
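
As a toy illustration of how the models combine, the sketch below picks between two similar-sounding word sequences by adding an acoustic score to a language model score. It skips feature extraction and the phone dictionary entirely, all the numbers are made up for the example, and it is in no way how CMU Sphinx is implemented.

```python
import math

# Made-up acoustic log-scores for two candidate transcriptions of one clip.
# On acoustics alone the second candidate is slightly preferred.
ACOUSTIC = {
    ("recognise", "speech"): -10.0,
    ("wreck", "a", "nice", "beach"): -9.5,
}

# Made-up bigram language model: P(word | previous word); "<s>" is the start.
BIGRAMS = {
    ("<s>", "recognise"): 0.01,
    ("recognise", "speech"): 0.2,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.01,
}


def lm_logprob(words, floor=1e-6):
    """Log-probability of a word sequence under the toy bigram model."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(BIGRAMS.get((previous, word), floor))
        previous = word
    return total


def decode(candidates):
    """Pick the candidate with the best combined acoustic + language score."""
    return max(candidates, key=lambda words: candidates[words] + lm_logprob(words))


best = decode(ACOUSTIC)
print(" ".join(best))
```

The language model is what tips the balance towards the more plausible word sequence, even though the other candidate has the better acoustic score.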
We use the Open Source CMU Sphinx toolkit for performing automated speech recognition.

Automated transcripts

The results are pretty noisy though - speech recognition across programmes recorded over six decades,
across lots of different genres (drama, factual, ...) with
a variety of different speakers and recording qualities, is a hard problem.
However, the transcripts do include useful clues as to what the programme is about - proper names, locations, organisations...

Automated tagging

So we now need to isolate these useful clues, and infer the topics of the programme from them.
Most of the existing concept tagging tools are designed to work on text that was manually written, and rely
on punctuation, capitalisation, etc. We therefore developed our own tool. This tool first locates all these useful clues in
the automated transcripts.
It then uses the structure of the DBpedia graph to disambiguate and rank keywords spotted throughout the transcript.
For example, if a programme mentions Paris and Tour Eiffel a lot, we will pick Paris in France, as it is closer in
the DBpedia graph to the Tour Eiffel. If a programme mentions Paris and Texas a lot, we will pick Paris in Texas, as it is
closer in the DBpedia graph to the Texas resource.
For each programme, we get a ranked list of DBpedia tags, describing what the programme is about.
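
The disambiguation step can be sketched as a shortest-path comparison on a graph. This is a minimal illustration of the idea only: the toy graph below stands in for DBpedia, its node names and edges are invented for the example, and the actual ranking used by the tool is not described here.

```python
from collections import deque

# Toy undirected graph standing in for a tiny part of DBpedia.
# Node names and edges are illustrative assumptions, not real DBpedia data.
EDGES = [
    ("Paris", "France"),
    ("Paris", "Eiffel_Tower"),
    ("France", "Europe"),
    ("Paris,_Texas", "Texas"),
    ("Texas", "United_States"),
]

UNREACHABLE = 10**6  # cost used when no path exists between two nodes


def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distance(graph, start, goal):
    """Number of edges on the shortest path (breadth-first search), or None."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == goal:
            return hops
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return None


def disambiguate(graph, candidates, context):
    """Pick the candidate with the smallest total distance to the context."""
    def cost(candidate):
        total = 0
        for other in context:
            hops = distance(graph, candidate, other)
            total += UNREACHABLE if hops is None else hops
        return total
    return min(candidates, key=cost)


graph = build_graph(EDGES)
print(disambiguate(graph, ["Paris", "Paris,_Texas"], ["Eiffel_Tower"]))
print(disambiguate(graph, ["Paris", "Paris,_Texas"], ["Texas"]))
```

Ranking could follow the same logic, for example by ordering the chosen resources by how close they are to everything else mentioned in the transcript.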

Example results

Here are a couple of example results. The first programme is a 1970 profile of the composer Gustav Holst.
The second programme is a 1983 profile of the Ayatollah Khomeini. The third programme is a 1983 episode of the Medical Programme.
The tagging algorithm works relatively well
for programmes that talk about a handful of topics, but works very poorly for magazine programmes, which talk about many
different topics. The algorithm gets very confused in this case.

Processing archives in the cloud

We now have a process which can derive for each programme within our archive a ranked list of
descriptive Linked Data URIs, describing what each programme is about.
However processing large archives remains a challenge. Tagging a programme can take around 90 minutes for a 60-minute programme
on one CPU, meaning it would take more than 4 years to process the entire World Service archive. We therefore
developed a framework to process very large archives using Amazon Web Services.
A message queue distributes work between a number of independent 'workers', hosted on AWS, which pick
up new jobs as soon as they're up and ready to process data. Using AWS gives us a potentially infinite
number of such workers, meaning that the only bottleneck to process a large archive is the bandwidth
between our content servers and Amazon's servers.
In our case it took around two weeks to process the whole
archive (70,000 programmes), for a very low and predictable cost.
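
The sketch below shows the general shape of such a setup, using Python's standard library queue and threads in place of a hosted message queue and AWS instances. The function names and the placeholder tagging step are assumptions made for this illustration and do not describe the framework itself. It also includes the back-of-the-envelope estimate: about three years of audio at 1.5 times real time is about 4.5 years on a single CPU.

```python
import queue
import threading


def estimate_years(audio_years, processing_ratio, workers=1):
    """Wall-clock years needed, assuming the work splits evenly over workers."""
    return audio_years * processing_ratio / workers


def tag_programme(programme_id):
    """Placeholder for the real (slow) transcription and tagging step."""
    return f"tags-for-{programme_id}"


def run_workers(programme_ids, n_workers):
    """Distribute programmes over independent workers through a shared queue."""
    jobs = queue.Queue()
    for programme_id in programme_ids:
        jobs.put(programme_id)

    results = {}
    lock = threading.Lock()

    def worker():
        # Each worker keeps picking up jobs until the queue is empty.
        while True:
            try:
                programme_id = jobs.get_nowait()
            except queue.Empty:
                return
            tags = tag_programme(programme_id)
            with lock:
                results[programme_id] = tags

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


print(estimate_years(3, 1.5))               # single CPU
print(estimate_years(3, 1.5, workers=100))  # 100 workers, evenly split
print(run_workers(["p1", "p2", "p3"], n_workers=2))
```

Since workers never talk to each other, adding more of them shortens the run almost linearly until something else, such as bandwidth, becomes the limit.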

Bootstrapping search and discovery

After running that process, we had uniquely identified topics for all programmes in the World Service archive. This data can be used to
bootstrap search and navigation within the archive, letting users discover and listen to programmes that often
haven't been listened to since they were last broadcast. We built the World Service archive prototype enabling users to explore
this vast archive using that data.

Noise

However, as with all automated processes, this data can be wrong.
For example as mentioned before the tagging algorithm works well
for programmes that talk about a handful of topics, but works poorly for magazine programmes.

Data validation

In order to deal with that we built some features within the prototype enabling users to validate or invalidate
the automatically extracted data.
In particular, users
can vote tags up or down, to approve or disapprove them, as shown on this slide.
This feedback is used to make the search and navigation within
the prototype more reliable, and could also be fed back directly into the automated tagging algorithm. It can also be used to
evaluate how well our tools work from the users' point of view.
As a result we're getting better and better interlinks between our archive and the Linked Data cloud.

Speaker segmentation

We also segment the audio depending on who is speaking. This enables users to get a quick overview of a particular programme
and to jump to specific points in the audio. For example in this From Our Own Correspondent episode we see the presenter of the programme
introducing various BBC correspondents who are each talking for a couple of minutes.

Crowd-sourcing speaker names

However just after processing we know that individual speakers are contributing to the programme, but not who they are.
We built a simple mechanism enabling users to name those speakers. This mechanism is also built on consensus - the name displayed
by default will be the name chosen by the most users.
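
A minimal sketch of that consensus rule, assuming each user contributes at most one name per speaker. The data layout and function name are assumptions for this illustration, and the misspelt entry is invented to show a minority suggestion.

```python
from collections import Counter


def consensus_name(submissions):
    """Return the name chosen by the most users, or None if nobody named them.

    `submissions` maps a user id to the name that user entered.
    """
    if not submissions:
        return None
    counts = Counter(name.strip() for name in submissions.values())
    return counts.most_common(1)[0][0]


# Three hypothetical users naming the same speaker.
votes = {"user-1": "Nick Caistor", "user-2": "Nick Castor", "user-3": "Nick Caistor"}
print(consensus_name(votes))
```

Keeping every submission, rather than only the winning name, means the displayed name can change as more users contribute.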

Propagating speaker names

We are also able to recognise speakers across programmes. Therefore the names added by users can be propagated to other
programmes in the archive, detected as featuring the same speaker.
For that we use an index for speakers based on Locality-Sensitive Hashing, a way to
hash high-dimensional vectors with a collision probability that increases if the distance
between two of these high-dimensional vectors decreases.
Here we see that Nick Caistor's name (which I've entered in the previous slide) has been
propagated to this other programme featuring him. As this was automatically inferred (and therefore can sometimes be mistaken), we
provide a simple interface for users to validate this inference. This means that we can further evaluate and refine our
speaker identification algorithm.
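
To give a feel for Locality-Sensitive Hashing, here is a minimal sketch using random hyperplanes, one common hash family in which vectors pointing in similar directions tend to share a signature. The choice of hash family, the vector dimensions and the speaker vectors are assumptions for this illustration; the talk does not say which variant or which speaker features are used.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (as normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def build_index(speakers, hyperplanes):
    """Group speaker ids by signature so that lookups only scan one bucket."""
    index = {}
    for speaker_id, vector in speakers.items():
        index.setdefault(signature(vector, hyperplanes), []).append(speaker_id)
    return index


planes = make_hyperplanes(n_bits=16, dim=3)

# Made-up speaker vectors: the first two point in exactly the same direction.
speakers = {
    "programme-A/speaker-1": [0.2, -1.0, 0.5],
    "programme-B/speaker-4": [0.4, -2.0, 1.0],
    "programme-C/speaker-2": [-0.2, 1.0, -0.5],
}
index = build_index(speakers, planes)
for bucket in index.values():
    print(bucket)
```

Speakers that land in the same bucket are candidates for being the same person, and propagated names can then be confirmed or rejected by users.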

Evaluating speaker identification

For example, we can use the speaker names as a basis for evaluating our speaker identification algorithm.
For each pair of speakers with the same added name, did our algorithm think they were the same? We can derive
precision and recall from this dataset. Here we see how precision and recall evolved as the dataset of manually added
speaker names grew over time. We can see it is stabilising at around 85% precision and 54% recall.
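
One way to compute such figures is a pairwise comparison, sketched below. It assumes that only user-named speakers are considered, that two speakers with the same name are the same person, and that the algorithm's output is available as a cluster id per speaker; the data layout and the example values are invented for this illustration.

```python
from itertools import combinations


def pairwise_precision_recall(user_names, predicted_clusters):
    """Compare the algorithm's groupings against user-provided names.

    `user_names` maps a speaker id to the name users gave it.
    `predicted_clusters` maps the same ids to the algorithm's cluster id.
    """
    true_pos = false_pos = false_neg = 0
    for a, b in combinations(sorted(user_names), 2):
        same_name = user_names[a] == user_names[b]
        same_cluster = predicted_clusters[a] == predicted_clusters[b]
        if same_name and same_cluster:
            true_pos += 1
        elif same_cluster:
            false_pos += 1  # algorithm merged two different people
        elif same_name:
            false_neg += 1  # algorithm missed that these are the same person
    matched = true_pos + false_pos
    expected = true_pos + false_neg
    precision = true_pos / matched if matched else 0.0
    recall = true_pos / expected if expected else 0.0
    return precision, recall


# Made-up example: three segments named A, one named B.
names = {"s1": "A", "s2": "A", "s3": "A", "s4": "B"}
clusters = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c2"}
precision, recall = pairwise_precision_recall(names, clusters)
print(precision, recall)
```

Re-running this as more names are added gives a curve of precision and recall over time.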

User activity

We have been slowly opening up the archive prototype to users. We have a bit more than two thousand users at the moment,
and we have had around 60,000 edits on tags.
Given the relatively small number of users we have at the moment,
this is encouraging. We noticed in particular that a few committed users have made a very large number of edits. One person
tagged more than 200 programmes in a single weekend!

Emerging shape of the archive

Analysing user activity across the archive also shows an emerging structure. Without any specific community-building features, it looks
like communities are emerging around specific parts of the archive - drama, documentaries about specific
composers, programmes featuring a particular contributor, etc.

Visualising the archive

Such large archives of content hold a significant amount of 'institutional memory'. A large
number of topics will be covered in some form by the programmes held in this archive.
In particular, the archive may hold programmes that could provide context for current news events.
For example a 1983 'Medical Programme' episode on techniques for measles immunisation could help
put in context a recent epidemic.
This visualisation is trying to tackle exactly that: surfacing archive programmes that relate to current news
events. The big blue dots are topics that were discussed on BBC News in the last five minutes.
The small dots are programmes within the archive that relate to those topics. The redder a dot is,
the more connected it is - so the more likely it is to be relevant to a particular news event.
We are going to present this visualisation in more detail at ISWC this year.

ClOud Marketplace for Multimedia Analysis

(COMMA)

We want to apply the tools developed as part of this work to other archives. In order to do so, we need
to create a platform that enables content owners and multimedia analysis service providers to easily
share data and algorithms, without worrying about the boring stuff (scaling, monitoring, etc.).
Such a platform could be used to enrich, interlink and unlock large archives of content.
We just started a new collaborative project to develop such a platform - please come talk to me
or email me if you're interested. We'll be talking more about that at IBC next week.