Using Transkribus for a solution to the automated text recognition of historical Bengali Books

As part of the Two Centuries of Indian Print project, Tom Derrick, our Digital Curator based with the
project at the British Library, has been working on solutions to automate text recognition of early
printed Bengali books.

He has recently been testing Transkribus for this purpose. Transkribus, a READ project, is available
as a free tool for users who want to automate recognition of historical documents. The British
Library has already had some success using Transkribus on manuscripts from our India Office
collection, and this inspired him to see how it would perform on printed Bengali texts, which
present an altogether different type of challenge. Transkribus has the potential to help ‘unlock’
keyword searching and text mining in digitised printed collections.

Although Transkribus is most commonly used for automated recognition of handwritten texts, Tom
found that it also worked fairly well for printed texts, including printed texts in Indian scripts. He
tested it with a training set of 50 pages from the British Library’s 19th-century printed books in
Bengali script that have been digitised through the project.

Although this is a very small training set compared with those of other projects using Transkribus,
he thinks the accuracy could be vastly improved by creating more transcriptions and re-training the
Transkribus recognition engine. This approach may be the key to unlocking automated text
recognition not only for Bengali but, in time, for other South Asian languages.

He has written a detailed blog post about his initial pilot work here. Have a read and see if this is
a tool you could use for a similar project!