Wednesday, September 16, 2009

Google Acquires reCAPTCHA To Improve Books and Newspaper Scanning

Google today announced that it has acquired reCAPTCHA, a company that provides CAPTCHAs to help protect more than 100,000 websites from spam and fraud. reCAPTCHA is a free CAPTCHA service that also helps to digitize books, newspapers and old-time radio shows.

Since computers have trouble reading the squiggly words shown in a CAPTCHA, CAPTCHAs are designed to let humans in while preventing malicious programs from scalping tickets or obtaining millions of email accounts for spamming. But there’s a twist: the words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books. Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text.

Many of us have seen such CAPTCHAs when posting comments on blogs or other websites.

In this way, reCAPTCHA’s technology improves Optical Character Recognition (OCR), the process that converts scanned images into plain text. This technology also powers large-scale text scanning projects like Google Books and Google News Archive Search. Having a text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users.
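To make the "crowds teach computers to read" idea concrete, here is a minimal sketch of how several people's answers for one hard-to-read scanned word might be combined into a single transcription. This is an illustration only, not reCAPTCHA's actual method: the announcement does not describe the aggregation rule, and the function name `resolve_word`, the majority-vote approach and the thresholds `min_votes` and `min_agreement` are all assumptions made for this example.

```python
from collections import Counter


def resolve_word(transcriptions, min_votes=3, min_agreement=0.6):
    """Return the transcription most people agree on, or None if undecided.

    Hypothetical rule (an assumption, not reCAPTCHA's documented behavior):
    accept a word once at least `min_votes` people have typed it in and at
    least `min_agreement` of them gave the same answer.
    """
    # Ignore case and surrounding whitespace so "Morning " matches "morning".
    normalized = [t.strip().lower() for t in transcriptions if t.strip()]
    if len(normalized) < min_votes:
        return None  # not enough human answers yet

    word, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) >= min_agreement:
        return word
    return None  # people disagree too much; keep showing the word


# Four people typed the same scanned word; one misread it.
answers = ["morning", "Morning ", "mornlng", "morning"]
print(resolve_word(answers))
```

A word that returns None would simply be shown to more people until enough of them agree, at which point the agreed text could be used in place of the OCR software's guess.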

Google will apply the technology internally, not only to strengthen fraud and spam protection for its products but also to improve its book and newspaper scanning process.