Review: The British Newspaper Archive

[update summer 2014] I wrote this review of the British Newspaper Archive way back in 2011. At the time, I was rather critical of some of its shortcomings. However, I’m pleased to say that the BNA has subsequently addressed many of the problems I identified back then. Their subscription packages are more reasonable, the usage caps have been lifted, and they’re very relaxed about people sharing the findings of their research. The hit-term highlighting problem has been solved, and new material seems to be appearing more rapidly than it did in the past. In other words, they have fixed almost all of the teething problems that I identified in the original archive. It’s not perfect by any means, but I do think the BNA has evolved into a good archive that justifies its subscription fee. I’ll be writing a new review shortly.

Christmas arrived early for historians this week. On Tuesday morning, amid a blaze of publicity, the British Library unveiled the new home of its digitised newspaper collection – The British Newspaper Archive (BNA).

Developed in partnership with commercial publisher brightsolid, the BNA provides online access to hundreds of eighteenth, nineteenth and early-twentieth-century newspapers. It’s an ambitious, long-term project – more than 3 million pages have already been digitised and the library hopes to reach 40 million pages over the next decade. If the project is successful it’ll have important implications for both professional and amateur historians. In the next couple of years, the British Library intends to close its newspaper archive at Colindale and transfer its holdings to a remote storage facility in Boston Spa. Whilst hard copies of undigitised newspapers should still be accessible, it’s clear that the British Library wants more and more researchers to access its collections online. The BNA, in other words, is the shape of things to come – and it’s vitally important that the British Library gets it right.

Content

The archive currently provides access to 170 newspapers. Many of these papers were available in the 19th Century British Library Newspapers database and have been transferred directly into the new archive. Only the Penny Illustrated Paper (which always seemed slightly out of place in the previous database) has been omitted from the new collection. Unfortunately, this means that gaps in the original archive are still a problem – the Northern Echo, for example, still has content missing from the crucial period between 1871 and 1872 when W. T. Stead first took over as editor. On the plus side, glitches from the previous database have been solved. The Preston Chronicle, for example, is no longer incorrectly listed as the Preston Guardian.

The real strength of the archive lies in its new content. 100 new newspapers are now accessible for the first time – almost all of them provincial papers. A full list of these papers is available here. Highlights from the new collection include long runs of the Bath Chronicle (1760-1903), the Chelmsford Chronicle (1783-1882), the Leeds Times (1833-1901), the Manchester Evening News (1870-1903), the Northampton Mercury (1770-1903), the Worcestershire Chronicle (1838-1903), and the Yorkshire Gazette (1819-1899). The new database is far less London-centric than previous offerings – most areas of the country are now represented by at least one paper, and major cities like Manchester, Liverpool, Birmingham, and Sheffield have multiple titles.

Whilst the majority of this new content focuses on the nineteenth century, some papers stretch deep into the eighteenth and twentieth centuries. Twelve titles include at least a decade of issues from the eighteenth century, including the Birmingham Gazette which goes all the way back to 1741. Whilst numerous papers cover the first three years of the twentieth century, only four titles stretch beyond the first decade: The Cheltenham Looker-On (1913), The Motherwell Times (1924), the Nottingham Evening Post (1944), and the Western Times (1940). It’s extremely encouraging to see the British Library push beyond the boundary of 1900 – let’s hope that this is the first step towards bridging the ‘digital divide’ which has recently sprung up between 19th and 20th century history.

Perhaps the most exciting thing about the new database is the promise of more content. Unlike previous databases, which were updated in bulk every year or so, the holdings of the BNA are constantly being expanded. 8,000 new pages are supposedly being uploaded to the website every day. Unfortunately, it’s not possible to quickly see which papers have recently been added or updated – this makes it difficult to keep abreast of the archive’s changing contents. Nor, for that matter, does the British Library give any hints about which papers will be digitised in the future. There’s no way of knowing whether a publication that’s critical to your research will appear in the archive tomorrow morning or in 10 years’ time. There’s something exciting about this I suppose. As the website itself points out, “who knows what you’ll find tomorrow, next week, next year, and beyond”. However, I suspect we’ll have to rethink and shore up our methodologies in order to build research projects on constantly shifting sands – more on this in a future blog post.

Search Engine

A good search engine is crucial to the success of a digital archive – the methodological possibilities of these resources are determined primarily by the questions they allow us to ask. The BNA has most of the tools we’ve come to expect from newspaper databases. Users can perform a basic ‘Search’ by inputting keywords into a single search box, or they can construct more complex queries using the ‘Advanced Search’ page. The ‘Advanced’ interface allows users to put keywords into four boxes:

The first option searches for articles which include all inputted keywords. So, putting “America, Twain, New York” into the search box will find all articles which include these three terms somewhere in the text. Articles which only include the words ‘America’ and ‘Twain’ will not be found. For those of you familiar with Boolean searches, this is basically the equivalent of using ‘AND’.

The second option searches for articles which include any (but not necessarily all) of the inputted keywords. So, this time a search for “America, Twain, New York” will find every article in which at least one of these terms appears. This will return articles containing the word ‘America’ which don’t feature ‘Twain’ or ‘New York’. In Boolean terms, this is a straightforward ‘OR’ search.

The third option allows users to exclude articles containing certain keywords. So, we might search for articles featuring the word ‘Twain’ which do not contain a reference to ‘America’. In Boolean terms, this is the equivalent of ‘NOT’.

Finally, the fourth box allows users to search for a complete phrase. This returns articles which feature keywords in a particular order and is broadly equivalent to enclosing an ordinary keyword search in quotation marks.
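For readers who think better in code, the four boxes can be sketched in a few lines of Python. This is only a toy model of the behaviour described above, not the BNA’s actual implementation: the function name `matches` and its parameters are hypothetical, and it does nothing cleverer than case-insensitive substring matching on a string of article text (so ‘America’ would also match ‘American’).

```python
def matches(text, all_of=(), any_of=(), none_of=(), phrase=None):
    """Toy model of the four 'Advanced Search' boxes (hypothetical, not BNA code)."""
    haystack = text.lower()

    def has(term):
        # Naive case-insensitive substring test.
        return term.lower() in haystack

    if not all(has(term) for term in all_of):
        return False  # box 1: every keyword must appear (AND)
    if any_of and not any(has(term) for term in any_of):
        return False  # box 2: at least one keyword must appear (OR)
    if any(has(term) for term in none_of):
        return False  # box 3: excluded keywords must be absent (NOT)
    if phrase is not None and not has(phrase):
        return False  # box 4: the words must appear together, in order
    return True


article = "Mark Twain arrived in New York yesterday."

print(matches(article, all_of=["America", "Twain", "New York"]))  # 'America' is missing
print(matches(article, any_of=["America", "Twain", "New York"]))
print(matches(article, all_of=["Twain"], none_of=["America"]))
print(matches(article, phrase="New York"))
```

The first call fails because ‘America’ never appears in the sample sentence; the other three succeed.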

Search results can be ordered by either ‘relevance’ or ‘date’ – this makes a nice change from Gale databases which only display results in chronological order.

Once a search has been performed, results can be filtered again by date, title, region, country, place, article type, and ‘public tag’. Crucially, the public tag feature allows articles to be sorted by additional categories, including: classifieds, adverts, news, commerce, arts, sport, crime, etc. The accuracy of these tags (many of which seem to have been imported from the previous database) isn’t great, but they can be helpful when filtering out irrelevant articles.

All of this works fairly well – if anything, the search engine is faster and more user-friendly than in previous databases. Unfortunately, some functionality has been lost. Most importantly, ‘proximity operators’ are no longer available. In the previous database, a search for “Twain n10 America” returned all articles in which the words ‘Twain’ and ‘America’ appeared no more than 10 words apart. This was a tremendously useful way to filter out results in which keywords appeared too far apart – it saved a lot of time and opened up a range of interesting methodological possibilities. In my thesis, for example, I use proximity operators to track changes in the number of articles featuring the words ‘America’ and ‘Competition’ in close proximity. It would be tremendously useful if this essential tool was reinstated in the new archive.
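To make the idea of a proximity operator concrete, here is a minimal Python sketch of what a query like “Twain n10 America” does. It is an illustration only, not the code behind either database: the function name `within_n_words` is hypothetical, and I am assuming that ‘no more than 10 words apart’ means the two words’ positions in the text differ by at most 10.

```python
import re


def within_n_words(text, first, second, n):
    """Return True if `first` and `second` occur no more than n words apart.

    Hypothetical helper: distance is measured as the difference between
    word positions, which is an assumption about how 'n10' was counted.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, word in enumerate(words) if word == first.lower()]
    second_positions = [i for i, word in enumerate(words) if word == second.lower()]
    return any(abs(i - j) <= n for i in first_positions for j in second_positions)


sentence = "Mr Twain sailed for America last week."

print(within_n_words(sentence, "Twain", "America", 10))  # three words apart
print(within_n_words(sentence, "Twain", "America", 2))   # too far apart for n=2
```

Counting the articles that pass a test like this, year by year, is essentially the method described above for tracking ‘America’ and ‘Competition’.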

As for more advanced search methodologies like datamining or ‘culturomics’ – the chances of seeing the necessary tools introduced into the new database are slim-to-none.

Interface

The BNA’s interface is a mixed bag. It includes some welcome new additions. Each search result is now accompanied by a snippet of scanned text which helps users to decide whether an article is relevant before opening it – this should save a lot of time when wading through thousands of hits. Similarly, articles are now displayed within the full newspaper page – this makes it possible to zoom out using your mouse’s scroll wheel and explore the rest of the page. This should please historians who have (quite rightly) been warning us about the danger of viewing articles in isolation.

Unfortunately, this is where the good news ends. The BNA’s interface suffers from at least two major problems:

1. No hit-term highlighting. In previous databases, keywords would be highlighted in colour whenever you opened an article. This made it easy to quickly identify which parts of long articles you wanted to read. Every database since the Times Digital Archive has had this feature – it’s absolutely essential. Without hit-term highlighting, wading through a 2000 word article in search of a single keyword is a laborious chore. To do this 100 times in a day is infuriating and massively slows down the research process. I can’t even begin to fathom why the BNA doesn’t include it. Its absence is an inexcusable step backwards. If another element of the interface prevents the use of hit-term highlighting (such as the nice new zoomable images) then it needs to be unceremoniously scrapped. Right now.

2. Saving articles. In the 19th Century British Library Newspapers database, downloading an article was as easy as right clicking it and saving it to a relevant folder. It was quick, easy, flexible, and resulted in easily reusable jpg files. Now, articles can only be downloaded as full-page pdfs. If you want to paste an article into a Word document, slot it into a PowerPoint presentation, or upload it to Twitter, you’ll have to convert it back into a jpg. To make matters worse, the quality of these files is embarrassingly low – in fact, it’s virtually impossible to read them. Here’s a sample:

Fortunately, a solution is at hand: for the low, low price of £35.95 the good people at brightsolid will print out a high-quality version of the page and send it to you through the post. Alternatively, you might prefer to use the print-screen key or the ‘Snipping Tool’ included with recent versions of Windows and save a more readable version for free.

OCR

Ensuring the accuracy of optical character recognition software (OCR) has always been one of the biggest challenges facing newspaper digitisation projects. Even the best software produces patchy results – some articles are transcribed with 100% accuracy, whilst others end up a garbled mess. As a result, software companies have typically preferred to hide raw OCR text from users; if we knew how inconsistent it was, they worry, we’d lose all faith in their product. So, it’s refreshing to see that the BNA openly displays raw, uncorrected OCR text alongside articles. It might put some users off, but we end up with a much better feel for how accurate our searches are.

More impressively, the BNA allows users to correct OCR errors and improve the database for other users. The interface for this process works fairly well. Lines of OCR text are displayed for correction on the left, and a black box highlights the specific area of the article which needs to be transcribed. A red box might have been slightly easier to see amidst the newsprint, but perhaps I’m being picky. In truth, the fact that this idea has been implemented so effectively makes the absence of hit-term highlighting doubly perplexing.

It remains to be seen how many users will bother to make corrections. I’d like to see the process incentivised a bit more – perhaps we could earn credits (more on them shortly) for each article we correct? It’s also unclear how the BNA intends to moderate corrections and prevent people from defacing the archive. However, I don’t want to be too critical of what is undoubtedly a step in the right direction. Whilst this form of ‘crowdsourcing’ won’t deliver 100% accurate OCR across the whole database – it would take thousands of users correcting around the clock to keep up with the 8,000 new pages added each day – it’s certainly better than nothing.

In addition to OCR corrections, users can also ‘tag’ articles with their own descriptive keywords. If enough users take advantage of this feature it promises to be another tremendous innovation. I suspect it’ll be particularly useful for finding images.

Subscriptions

Finally, we reach the dreaded question: how much does all of this cost? It would be nice if the British Library followed the example of their colleagues in Australia and New Zealand and allowed us to explore the archive for free. Sadly, in order to cover the cost of digitisation, the British Library has had to turn the content over to a commercial publisher. Unlike their previous partner Gale (which caters primarily to the academic market), brightsolid has a background in targeting amateur genealogists with websites like findmypast.co.uk and 1911census.co.uk. As a result, the BNA is presently only available to individual subscribers. This renders it immediately unusable for teaching. JISC claim to be in negotiations with the British Library and brightsolid to provide institutional access to the database – until this happens, the BNA won’t be of any use in the classroom.

Three packages are currently available to individual subscribers:

2 days (500 credits) – £6.95

30 days (3000 credits) – £29.95

12 months (unlimited access*) – £79.95

The ‘credit’ system is a bit complicated. It costs 5 credits to view an article published over 107 years ago in black and white, 10 credits to view similar articles in colour, and 15 credits to view articles published within the last 107 years. It’s fair to say, having bought the 2 day package to test the database out, that these credits don’t go very far. Browsing through one 20th century issue of the Nottingham Evening Post wiped out a quarter of my credits in five minutes.
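To put rough numbers on this, the pricing rules can be written out as a short Python sketch. The function name `credits_per_view` is hypothetical, and two things are assumptions on my part rather than anything the BNA states: that material from within the last 107 years costs 15 credits whether or not it is in colour, and that the 107 years are counted back from 2011.

```python
def credits_per_view(publication_year, colour=False, current_year=2011):
    """Credits charged for one view under the rules described above.

    Hypothetical helper. Assumes recent material costs 15 credits
    regardless of colour, which the pricing rules do not spell out.
    """
    age = current_year - publication_year
    if age > 107:
        return 10 if colour else 5
    return 15


package_credits = 500                  # the 2 day package
quarter = package_credits // 4         # 125 credits

print(credits_per_view(1850))                 # older, black and white
print(credits_per_view(1850, colour=True))    # older, colour
print(credits_per_view(1940))                 # within the last 107 years
print(quarter // credits_per_view(1940))      # full views a quarter of the package buys
print(package_credits // credits_per_view(1940))  # full views in the whole package
```

On these assumptions a quarter of the 2 day package buys only eight views of twentieth-century material, and the whole package just 33.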

For serious researchers, the 12 month unlimited subscription is the only real option. At first glance, £80 seems fairly reasonable – I’d spend way more than that on a two-day research trip to Colindale. However, buried in the small-print is a rather unpleasant surprise. If subscribers to the ‘unlimited’ package view more than 1000 pages in a calendar month, their account is frozen until the start of the next month. For some researchers, this cap will be perfectly tolerable. Unfortunately, as a press historian I’d expect to burn through at least 500 page views on a routine day of research. I’d be locked out of the database for 28 days of every month (save February, which has 28 days clear and 29 in a leap year). These quotas place an unacceptable restriction on research – I never want to be in a situation where my decision to read an article is determined not by its potential value to my research, but by the number of credits left in my account.
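The arithmetic behind that lockout is simple enough to sketch in Python. The function name `locked_out_days` is hypothetical, and the model is deliberately crude: it assumes the same number of page views every day from the first of the month, and treats any day on which at least some views remain as a usable day.

```python
import calendar


def locked_out_days(year, month, views_per_day=500, monthly_cap=1000):
    """Days of a calendar month spent frozen out by the page-view cap.

    Hypothetical helper: assumes constant daily usage starting on the
    first of the month.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    # Ceiling division: days on which at least some research is possible.
    usable_days = -(-monthly_cap // views_per_day)
    return max(days_in_month - usable_days, 0)


print(locked_out_days(2011, 11))  # a 30-day month
print(locked_out_days(2011, 12))  # a 31-day month
print(locked_out_days(2012, 2))   # February in a leap year
```

At 500 views a day the cap is reached after two days, which leaves 28 locked-out days in a 30-day month, 29 in a 31-day month, and 27 in a leap-year February.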

I e-mailed the archive’s customer service team and informed them that the cap would make many forms of academic research extremely difficult. They informed me that the BNA was intended for ‘personal use’ only. It’s nice to know where we stand.

Conclusion

In sum, there’s a lot to like about The British Newspaper Archive. The open approach to OCR, the introduction of crowdsourcing, and, above all, the incredible range of new content make it a potentially fantastic new tool for researchers. I want to love it. Unfortunately, it currently suffers from at least four critical faults. The lack of hit-term highlighting, the inability to download a usable version of an article, the absence of institutional subscriptions, and the misjudged cap on the ‘unlimited’ package are all in need of urgent attention. Until these issues are fixed, its potential for academic research (not to mention its usefulness in the classroom) will remain frustratingly limited.

12 Comments

Thanks for the in-depth review. I used this service in its previous incarnation, and would have liked to have known what to expect before subscribing. My major issue was that so many of the pages were so faded or deteriorated that it was difficult to get the full text of certain lengthy articles.

It has always surprised me that, while early books are freely available from Google and other archives, 19th century newspapers are so hard to access. The fact that the British Library is going on the paid subscription model (for public domain content whose quality they cannot guarantee, no less) is discouraging. The cultural value of an archive like this one is so great that it really ought to be freely accessible to everyone.

Regarding hit-term highlighting, can’t a user simply use the Find function in the web viewer, or, once downloaded, the Find function in Adobe Reader? I always found the automatic highlighting a problem when I made printouts.

Thank you for the review. I nearly bought a subscription in my excitement. I am at a loss wondering at the price scheme: either base it on time or credits. Why both? It serves no one well! And the 1000 page limit per month just seems so arbitrary.

As a big fan of the Gale 19CBLN site (despite its flaws), the new version is a huge disappointment. Can we complain to the British Library?

Walter: I agree wholeheartedly. This content (which we all pay to preserve) should be available for free to everyone. I’m not sure why the British Library didn’t enter into a deal with Google or Microsoft to digitise their collections and make them freely available online. That said, Google’s existing newspaper digitisation schemes have left a bit to be desired. Unfortunately, it seems that the only way for the BL to fund these projects is to shack up with commercial publishers. One of the most worrying things about the new database is that brightsolid seem to own the digitised versions – I’m not sure how much control the BL has over what they do with them.

Peter: The articles are displayed as images, so a direct search using the Find function isn’t possible. The best solution at the moment is to open up the OCR text, use the Find function on that, and then click on it to view the image. It works, but it’s unnecessarily fiddly – not practical when trawling through 500 articles. It was possible to disable the auto-highlighting in Gale databases by unchecking a box in the user preferences – I always used that for printing and copying.

Troy: Great to hear from you. Every time I watch Jonathan Trott grind out another century I’m reminded of us trying to sell the joys of cricket to Patrick Leary at a sports bar in New Haven! I’m at a loss to understand the subscription model – the prices are just about fair, but the restrictions make research very difficult.

A lot of people are angry about the 1000 page limit – putting together a petition/letter would be a definite possibility. It’d be fairly easy for them to raise the cap. I’m not sure what a fairer amount would be though – 5,000, or maybe 10,000?

This post and the December 4th post give me pause. What exactly does the company think users will do with this information? By “personal use” I presume they mean the amateur family genealogist (which I have nothing against). There is no a.f.g. I know of who will come close to those usage limits. But there are few a.f.g. who are going to pay that much for a handful of obituaries or subscribe for long periods. They are the only possible users who may want a framed print of a newspaper page however.

But even the a.f.g. will print out pages, use the information in reports, and share that information with others.

Also, is my research on Victorian fiction considered “personal” or not? I guess it’s part of my job, but I spend a lot of my spare time on it! There are a lot of unaffiliated researchers who work outside academia.

So really what they have done is deliberately exclude researchers, teachers, and students by making the use of the site expensive, limited, and prescriptive. This, to me, is incredibly short-sighted. Our use of the site would earn them money (probably lots of money) and publicize it to others through our work. As it is now, I must join Bob in advising researchers not to subscribe.

Judging by the comments to this blog, the company has already lost 4 x £80 = £320 in income this year (not counting others who have just read this blog). I don’t think they can stay in business for long at those rates. They need the researchers like me: if they continue to add newspapers at a constant rate, I would subscribe year after year. Academics are their audience for this!

What I miss most from the previous BL newspaper site was the ability to categorize the search terms by Document Title (Headline) and Full Text. This “headline” searching really helps in targeting helpful (or more likely to be helpful) articles.

I agree, Tom. The new site does bring some useful new features (and they’ve shown a willingness to make improvements since I wrote this review), but they can’t match the power and flexibility of the previous site’s search engine. The basic search interface is fine, but once you try and construct more complex and precise queries it isn’t quite up to the task.

There has been a four-month dearth of new content, and the homepage boast of thousands of new pages added every day has been untrue since June. What is worse is the BNA has so far refused to compensate customers for the lost months of their subscriptions. Complaints abound, and frankly with their abysmal level of customer service it is a wonder the BNA have any left at all.

I got all excited about the British Newspaper Archive. Then decided to check reviews before purchasing a subscription. Wow ….. all the negative reviews have made me reconsider, and I will not be signing up. And by the way, a couple thousand pages is very limiting even for the “weekend” and “weeknight” researcher/genealogist. Thanks for telling it like it is.

Thanks for your comment. I must point out that I wrote this review three years ago, and that the BNA has improved quite a lot since then. Their subscription packages are more reasonable, the usage caps have been lifted, and they’re very relaxed about people sharing the findings of their research. The hit-term highlighting problem has been solved, and new material seems to be appearing more rapidly than it did in the past. In other words, they’ve fixed almost all of the teething problems that I identified in the original archive.

It’s not perfect by any means, but it’s certainly the best service out there for anybody who wants to explore British newspapers. It’s up to you, obviously, but I think it’s worth the subscription fee.