Google bolsters OCR capabilities with reCAPTCHA acquisition

Aharon Etengoff, 17th September 2009

San Francisco, Calif. - Google has bolstered its optical character recognition (OCR) capabilities with the acquisition of reCAPTCHA. The company - which was originally spun off from a Carnegie Mellon University research project - protects over 100,000 websites from spam and fraud by providing CAPTCHAs gleaned from printed texts.

"CAPTCHAs are designed to allow humans in but prevent malicious programs from scalping tickets or obtain millions of email accounts for spamming. But there's a twist - the words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books," Google Product Manager Will Cathcart explained in an official blog post.

"Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text."
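To illustrate the idea of crowds "teaching" a computer, the Python sketch below pools several typed answers for one scanned word and accepts a transcription only when enough people agree. It is a minimal sketch under stated assumptions, not reCAPTCHA's or Google's actual method: the function names, the sample answers and the two thresholds are all hypothetical.

```python
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
MIN_VOTES = 3    # answers needed before a word can be accepted
MIN_SHARE = 0.6  # fraction of answers that must agree


def normalize(answer: str) -> str:
    """Trim and lowercase so trivially different answers count as one."""
    return answer.strip().lower()


def consensus(answers):
    """Return the agreed transcription, or None if agreement is too weak."""
    counts = Counter(normalize(a) for a in answers if a.strip())
    total = sum(counts.values())
    if total < MIN_VOTES:
        return None
    word, votes = counts.most_common(1)[0]
    return word if votes / total >= MIN_SHARE else None


# Four invented answers typed for the same degraded word.
typed = ["Gazette", "gazette ", "gazctte", "Gazette"]
print(consensus(typed))  # gazette
```

In this sketch a word with too few answers, or with answers that disagree too much, stays unresolved and would simply be shown to more people.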

According to Cathcart, reCAPTCHA's technology can also be used to improve the OCR process of converting scanned images into plain text.

"This technology powers large scale text scanning projects like Google Books and Google News Archive Search. Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users. So we'll be applying the technology within Google not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process," added Cathcart.

Luis von Ahn, who founded the company in 2008, noted that Google was the "best fit" for reCAPTCHA.

"From the very start, people often assumed the project was connected to Google, so it only makes sense that reCAPTCHA Inc. ultimately would find a home within Google."