OCR and OER – update

We welcome this short posting from Subhashish Panigrahi which updates a 2014 posting of his on Indic Language Wikipedias as Open Educational Resources at http://education.okfn.org/indic-language-wikipedias-as-open-educational-resources/

This blog post was published by the Open Education Working Group.

Subhashish Panigrahi (@subhapa) is an educator, author, blogger, Wikimedian, language activist and free knowledge evangelist based in Bengaluru (often called Bangalore), India. After working for a while at the Wikimedia Foundation’s India Program, he is currently at the Centre for Internet and Society’s Access To Knowledge program. He works primarily on building partnerships with universities, language research and GLAM (Galleries, Libraries, Archives and Museums) organizations to bring more scholarly and encyclopedic content under free licenses, and designs outreach programs for South Asian language Wikipedia/Wikimedia projects and communities. He wears many other hats: Editor for Global Voices Odia, Community Moderator of Opensource.com, and Ambassador for India in OpenGLAM Local. Subhashish is the author of the piece “Rising Voices: Indigenous language Digital Activism” in the book Digital Activism in Asia Reader.

Google’s OCR and its use by Wikimedians in South Asia

Some time back, Google announced on the OCR project support network that Google Drive could be used for Optical Character Recognition (OCR). The software now works for more than 248 world languages (including all the major South Asian languages). Though the exact pattern of development of the software is not clear, some Wikimedians have reported that recognition of their native languages, Malayalam and Tamil, has improved over time. In recent use, the software has proved simple, easy to use and robust, detecting most languages with over 90% accuracy.
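The post does not say how the accuracy figure was measured. One common way to put a number on OCR quality is character-level accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the length of the reference. The sketch below is a minimal illustration of that calculation; the function names are hypothetical, and counting Unicode code points after NFC normalization is an assumption of this sketch, not a method described by Google or by the Wikimedians mentioned above.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Return 1 - (edit distance / reference length), clamped at 0.

    Both strings are NFC-normalized first so that equivalent Unicode
    encodings of the same text are not counted as errors.
    Assumption: errors are counted per code point, so a wrong vowel sign
    in an Indic script counts as one error of its own.
    """
    ref = unicodedata.normalize("NFC", reference)
    out = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        return 1.0 if not out else 0.0
    return max(0.0, 1.0 - edit_distance(ref, out) / len(ref))
```

To use it, a volunteer would type a sample page by hand, run the same page through OCR, and pass both strings to `character_accuracy`. A result above 0.9 on such a sample would correspond to the "over 90%" level mentioned above, under this particular definition.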

OCR technology extracts text from images, scans of printed text, and, to some extent, even handwriting, which means that text can be extracted from almost any old book, manuscript, or image. This brings hope for most Indian languages, as there is a lot to digitize. Most of the major Indian languages have plenty of non-digitized literature, and the existing OCR systems do not match Google’s when so many languages are considered as a whole.

Google’s OCR engine probably uses aspects of Tesseract, an OCR engine released as free software, or OCRopus, a free document analysis and OCR system that is primarily used in Google Books. Originally developed at Hewlett-Packard between 1985 and 1994, released as open source in 2005, and sponsored by Google from 2006, Tesseract is considered one of the most accurate OCR engines and works for over 60 languages. The source code is available on GitHub.

The OCR project support page offers additional details on how far character formatting, such as bold and italics, is preserved in the output text after OCR:

“When processing your document, we attempt to preserve basic text formatting such as bold and italic text, font size and type, and line breaks. However, detecting these elements is difficult and we may not always succeed. Other text formatting and structuring elements such as bulleted and numbered lists, tables, text columns, and footnotes or endnotes are likely to get lost.”

Using the OCR software is currently rather simple. The user uploads a scan as an image in any common format (.jpg, .png, .gif, etc.) or as a PDF to Google Drive. Once the upload is complete, opening the file in Google Drive shows both the image and the converted text in the same document.
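The steps above use the Drive web interface. For batches of scans, the same upload-and-convert flow can be approximated through the Google Drive API v3, which accepts an `ocrLanguage` hint when a file is imported as a Google Doc. The following is a sketch under stated assumptions, not the workflow described in this post: the helper names `build_ocr_request` and `upload_for_ocr` are hypothetical, the `credentials` object is assumed to come from the google-auth library with Drive access already granted, and the `google-api-python-client` package is assumed to be installed.

```python
import io
import mimetypes
import os

GOOGLE_DOC_MIME = "application/vnd.google-apps.document"
SUPPORTED = {"image/jpeg", "image/png", "image/gif", "application/pdf"}


def build_ocr_request(path, language_hint):
    """Assemble the arguments for a Drive v3 files.create call (hypothetical helper)."""
    source_mime, _ = mimetypes.guess_type(path)
    if source_mime not in SUPPORTED:
        raise ValueError(f"unsupported scan format: {path}")
    name = os.path.splitext(os.path.basename(path))[0]
    return {
        # Asking for a Google Doc as the target type triggers the conversion.
        "body": {"name": name, "mimeType": GOOGLE_DOC_MIME},
        "source_mime": source_mime,
        "ocrLanguage": language_hint,  # ISO 639-1 code, e.g. "or" for Odia
    }


def upload_for_ocr(path, language_hint, credentials):
    """Upload a scan, let Drive convert it, and return the text (sketch only)."""
    # Imported lazily so the helper above works without the client library.
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload

    req = build_ocr_request(path, language_hint)
    drive = build("drive", "v3", credentials=credentials)
    media = MediaFileUpload(path, mimetype=req["source_mime"])
    created = drive.files().create(
        body=req["body"],
        media_body=media,
        ocrLanguage=req["ocrLanguage"],
        fields="id",
    ).execute()

    # Export the converted document as plain text.
    export = drive.files().export_media(fileId=created["id"], mimeType="text/plain")
    buffer = io.BytesIO()
    downloader = MediaIoBaseDownload(buffer, export)
    done = False
    while not done:
        _, done = downloader.next_chunk()
    return buffer.getvalue().decode("utf-8-sig")
```

A call such as `upload_for_ocr("page_001.png", "or", credentials)` would return the recognized text for one scanned page. As the support page notes, formatting such as lists and tables may be lost, and a plain-text export discards the rest, so the output still needs proofreading.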

Wikisource, one of the most popular free and open digitization platforms, currently hosts hundreds or thousands of free books that are either out of copyright or under Creative Commons licenses (CC BY or CC BY-SA), which allows users to digitize them.

While OCR works quite well for languages written in the Latin script, text in many other scripts is not recognized perfectly. So Wikisourcers (Wikisource contributors) often have to type the text by hand.

Thus the new Google OCR might be useful both for the Wikisource community and for many others whose mission is to digitize and archive old texts.

The image below shows a screen from a tutorial to convert text in the Odia language from a scanned image using Google’s OCR.


