The Newsroom blog

2 posts from January 2017

There are exciting changes happening in how we use newspapers to study the past. After decades in which the use of newspapers in research meant leafing through volumes or scrolling through microfilms, digitisation made millions of newspapers more readily searchable and far more widely available. But now digitisation has taken us to the next stage in development, which is using the data generated by the digitisation process to look at history on a grand scale. We are moving into the era of big data newspaper studies.

From the University of Bristol study: People in history. (A) famous personalities by occupation using all extracted entities associated with a Wikipedia entry; (B) the probability that a given reference to a person is to a male or a female person

Big data newspaper studies have come about through a combination of large-scale digital resources and a growth in analysis tools. Most will be aware of OCR (optical character recognition), the mechanism by which archival texts can be converted into machine-readable texts: the software takes what a computer sees as an image (i.e. the arrangement of letters on a page) and matches the shapes to letters that it knows. It is an imperfect science, because OCR can struggle to work with older forms of type and deteriorating page originals, but levels of accuracy continue to improve as new OCR software is developed, and the results are generally satisfactory - that is, most of the time a researcher will find what they are looking for, if it is there to be found.

But added to this are software tools that can extract further sense from the raw data set that is generated by OCR. The field of what is called Natural Language Processing, by which computers come to understand human text and speech, includes the extraction of keywords, or named entities, and the matching of these to controlled lists of terms (such as DBpedia), further mapped to geographic areas and time periods, which enables researchers to undertake controlled, thematic analysis of large historical datasets. Our archive of words yields patterns of behaviour with much to tell about our past selves.
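As a rough illustration of matching text against a controlled list of terms and counting the matches by year, here is a minimal Python sketch. It is not the method used by any project described here: the vocabulary, the function names (match_entities, count_by_year) and the sample articles are all invented for the example, and simple string matching stands in for real named-entity extraction.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: surface form -> canonical identifier.
# A real project would draw these from a resource such as DBpedia.
CONTROLLED_TERMS = {
    "telegraph": "Telegraph",
    "telephone": "Telephone",
    "labour party": "Labour_Party",
}


def match_entities(text, vocabulary=CONTROLLED_TERMS):
    """Return the canonical identifiers whose surface form occurs in text."""
    lowered = text.lower()
    found = []
    for surface, canonical in vocabulary.items():
        # Word boundaries stop 'telegraph' matching inside 'telegraphed'.
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.append(canonical)
    return found


def count_by_year(articles, vocabulary=CONTROLLED_TERMS):
    """Count articles mentioning each entity per year.

    articles is an iterable of (year, text) pairs.
    """
    counts = Counter()
    for year, text in articles:
        for entity in match_entities(text, vocabulary):
            counts[(year, entity)] += 1
    return counts


# Invented sample data, standing in for OCR text with publication years.
sample = [
    (1850, "News arrived by telegraph from London."),
    (1925, "The Labour Party spoke by telephone."),
    (1925, "A telephone exchange opened."),
]
counts = count_by_year(sample)
```

Counting per (year, entity) pair is what allows mentions to be plotted over time; a fuller pipeline would also have to cope with OCR misspellings and ambiguous names, which this sketch ignores.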

This is the theme of a major project undertaken by the Intelligent Systems Laboratory at the University of Bristol, led by Professor Nello Cristianini. As described in their paper 'Content analysis of 150 years of British periodicals', the project worked on a corpus of newspapers digitised from the British Library's collection by family history company Findmypast for the British Newspaper Archive website. The figures involved are huge. The project analysed 28.6 billion words from 35.9 million articles contained in 120 UK regional newspapers over the period 1800-1950, which they calculate forms 14% of all regional newspapers published in the UK over the period.

The project then used this study to explore changes in culture and society, determined by changes in the language. It looks at changes in values, political interests, the rise of 'Britishness' as a concept, the spread of technological innovations, the adoption of new communications technologies (the telegraph, telephone, radio, television etc), changing discussion of the economy, and social changes such as mentions of men and women, the growth in human interest news and the rising importance of popular culture. It is the stuff of multi-volume histories of the past, boiled down to eye-catching graphs.

This does not mean that we throw away those multi-volume histories, however. The researchers are at pains to point out that such data analysis is an inexact science, with many caveats needed to explain how the entities have been arrived at and with what degree of caution they should be treated. The data derived from such tools can only work where it is supported by traditional studies, to gain the richer understanding of what happened. The machines may have taken the natural language of humans and converted it into data, but the results need to be converted back into human language to offer real understanding.

So it is that some of the project's findings may seem obvious. We could have guessed beforehand that the newspaper archive would show an increase in discussion of popular culture subjects, that politicians are more likely to achieve notoriety within their lifetimes than scientists, or that there was a rise in coverage of the Labour Party from the 1920s onwards. But the analyses reinforce through data what we have previously inferred through study, while discoveries such as the term 'British' overtaking the term 'English' at the end of the 19th century, or the decline in terms associated with 'Victorian values' - such as 'duty', 'courage' and 'endurance' - call for new studies to explore these things further.
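A finding such as one term overtaking another rests on comparing relative word frequencies over time. The Python sketch below shows the general idea only, and is not the Bristol project's own method: the function names (relative_frequency, first_overtake), the choice of decades as the unit of time and the three sample sentences are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def relative_frequency(articles, term):
    """Occurrences of term per 1,000 words for each decade.

    articles is an iterable of (year, text) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in articles:
        decade = (year // 10) * 10
        words = re.findall(r"[a-z']+", text.lower())
        totals[decade] += len(words)
        hits[decade] += words.count(term.lower())
    return {d: 1000 * hits[d] / totals[d] for d in sorted(totals) if totals[d]}


def first_overtake(series_a, series_b):
    """First decade in which series_a exceeds series_b, or None."""
    for decade in sorted(set(series_a) & set(series_b)):
        if series_a[decade] > series_b[decade]:
            return decade
    return None


# Invented sample data, not taken from any real newspaper.
sample = [
    (1855, "the english court and the english church"),
    (1872, "english trade and british ships"),
    (1895, "british trade and british ships in english ports"),
]
british = relative_frequency(sample, "british")
english = relative_frequency(sample, "english")
crossover = first_overtake(british, english)
```

Dividing by the total number of words in each decade matters because the volume of surviving newspaper text grows over the period, so raw counts alone would show almost every term rising.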

The project is at pains to point out the importance of using newspaper archives. Previously we have had big data analyses of millions of historical books, most familiar through the Google Ngram Viewer. This has caused controversy among some scholars, because of the unevenness of coverage of topics in books, and the limitations of merely counting words and making them searchable again. Opening up newspaper archives for comparable analysis widens the amount of content available, arguably with greater reliability overall, and now with tools to make analysis that much more scientific. The use of controlled terms will also enable analysis across different datasets - so, books and newspapers, but also other news forms, as subtitle extraction and speech-to-text technologies now start to make our television and radio archives available for similar and shared analytical studies. Our big data is only going to get bigger.

There are limitations to this use of newspaper archives. The quality of OCR varies not only according to the original newspaper, but according to the microfilm where this has been used instead of print. Digitisation is quicker and cheaper this way than digitising from print, but older microfilm can be photographically poor, leading to inferior OCR (though there are promising tools appearing for improving poor OCR). The British Newspaper Archive is made up mostly of UK regional newspapers, because the main nationals have often been digitised by their current owners and are available separately. How different was the discourse in newspapers based in London from those around the rest of the country? That has to be the subject of another major study.

One of the better jokes from the Victorian Meme Machine project

The British Library has been engaged in its own big data analyses of newspaper archives. BL Labs is an initiative designed to support and inspire the public use of the British Library's digital collections and data in exciting and innovative ways. It has facilitated several studies of British historical topics through the digital newspaper archive. These include Bob Nicholson of Edge Hill University's study of jokes in Victorian newspapers, with the concept of the Victorian Meme Machine (automatically matching jokes to an archive of contemporary images); Katrina Navickas of the University of Hertfordshire's mapping of nineteenth century protest; and Hannah-Rose Murray of the University of Nottingham's tracing of black abolitionists in 19th century Britain. A major user of our newspaper data is M.H. Beals of Loughborough University, who is researching how ideas travel across the historical news media, creating new insights through understanding newspaper archives as structured data.

Such projects are just the start. The availability of large-scale newspaper archives in digital form, and the data derived from such archives, enables us both to seek answers to traditional questions more quickly, and to start asking new kinds of questions. The latter is the great challenge that newspaper data offers. We need to come up with new questions, because the technology enables us to do so, and because it may question what we previously thought that we knew. As the data from these archives becomes more readily available, and more easily usable by the non-data specialist, so we will find that we have only just started to read the newspapers. We are going to find that they have much more yet to tell us.

Links:

All of the regional newspapers used in the University of Bristol project are available at www.britishnewspaperarchive.co.uk (subscription site, free to use at British Library locations)

The incoming US president, Donald Trump, is rewriting the book on the political process. However, despite the apparent creation of policy via social media, the real impact Trump has made since the presidential election process began has been through the more traditional media, particularly television. His statements made through Twitter have been picked up by newspapers, television and radio, and it is here that the seismic realignment of American political priorities is being digested and disseminated. Twitter has been used to ignite a media process. Social media remains for Donald Trump a means of being on TV, where his mass audience lies (Trump has 18.6m Twitter followers, but there are 114m television sets in the USA alone).

From the Sky News coverage of the US presidential election result, 09/11/2016

Trump's impact on television in Britain can be traced through the news and current affairs programmes recorded for the British Library's Broadcast News service. As well as recording regular television and radio news programmes each day from 22 UK and international channels, we have recorded numerous special programmes on Trump and the US election. On 8/9 November we recorded the election night programmes of BBC One, ITV1, Sky News, Al Jazeera English, CNN, RT (Russia Today), Channels 24 (Nigerian television) and CCTV (China). All of these can be found on our Explore catalogue with links to the playable programmes, which for copyright reasons can only be played on terminals at our London or Yorkshire sites. For ease of searching it is best, if you are onsite and using a British Library terminal, to go to the Broadcast News service itself (http://videoserver.bl.uk) and use the Advanced Search facility to select all recordings for 8/9 November 2016.

We also have many individual television programmes produced through 2016, of which the titles below are only a selection. They document not only the events of recent history, but the struggle that the often incredulous traditional media have had in trying to come to terms with the Trump phenomenon. The links are to our catalogue records, but again please note the programmes will only be playable on a British Library terminal. Descriptions in inverted commas are those provided for the programmes as part of the EPG (Electronic Programme Guide).

The Mad World of Donald Trump (Channel 4, tx. 26/01/2016): "Matt Frei enters the colourful and mad world of presidential hopeful Donald Trump, whose meteoric political rise comes amid one of the most controversial political campaigns America's seen."

President Trump: Can He Really Win? (Channel 4, tx. 30/03/2016): "Donald Trump has emerged as the clear front runner for the Republican Presidential nomination. Matt Frei investigates whether 'the Donald' could make it to the White House."

President Trump: Can He Really Win? (Channel 4, tx. 23/08/2016): "Matt Frei explores how the US presidential contest is shaping up to be one of the most brutal in living memory, and asks if Donald Trump can make it all the way to the White House."

Trump vs Clinton Live (Channel 4, tx. 27/09/2016): "US Presidential Debate: Channel 4 presents live coverage of the first of three US presidential debates, as Donald Trump goes head to head with Hillary Clinton."

Clinton v Trump: The Second Debate (Sky News, tx. 10/10/2016): "We join Sky News for coverage of the second presidential debate of the 2016 US Election between Hillary Clinton and Donald Trump."

Paxman on Trump v Clinton: Divided America (BBC One, tx. 17/10/2016): "Jeremy Paxman travels to Washington and beyond to understand how Americans came to face such unpopular choices in its candidates for the presidency."

US Presidential Debate (BBC News, tx. 20/10/2016): "Hillary Clinton and Donald Trump face each other in the final 2016 presidential debate at the University of Nevada in Las Vegas."

The World According to President Trump (Channel 4, tx. 12/11/2016): "What will a President Trump really do? Will he really ban all Muslims? Build a wall? Pal up to Putin? Smash Isis? Matt Frei speaks to the people who know."

Panorama: Trump's New America (BBC One, tx. 14/11/2016): "Hilary Andersson meets angry Americans on both sides of the electoral race who feel disillusioned and disenfranchised by the electoral process."