Google's reCAPTCHA Weakened by Recent Changes [Updated]


[Update: After re-examining all my sources, it appears that I misinterpreted them and that none of this is recent. At one point - December 2009 - this problem with reCAPTCHA may have existed, but that's a long time ago and I don't know how well it responds to these attacks now. I apologize to you and to Google.]

A researcher has discovered that recent changes to Google's reCAPTCHA system have made it much less resilient to machine analysis.

CAPTCHAs attempt to restrict access to a web resource (such as a login) to humans by presenting words as graphics that have been distorted so that reading them is not straightforward. See the example below. The idea is that humans can figure it out, but computers will have a tougher time. When Jonathan Wilkins originally wrote his paper on CAPTCHA security in late 2009, reCAPTCHA tested well. But changes Google has since made to the system have altered the results.

reCAPTCHA uses words which have been put through OCR in Google's book/magazine scanning service. One of the words will be a real word and the other will be a word which was not recognized after OCR. reCAPTCHA is used as a voting system to determine what such words really are.
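The voting idea can be illustrated with a minimal sketch. It is not Google's implementation. It assumes the "real" word acts as a control whose answer is already known, that an answer for the unrecognized word only counts as a vote when the control word is typed correctly, and that a word is settled after a fixed number of matching votes. The class name, method names, and threshold are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers settle an unknown word.
# The article gives no number; this value is an assumption.
VOTES_NEEDED = 3


class WordVoter:
    """Toy model of a two-word CAPTCHA used as a voting system."""

    def __init__(self, votes_needed=VOTES_NEEDED):
        self.votes_needed = votes_needed
        # Maps an unknown word's id to a tally of the answers typed for it.
        self.votes = defaultdict(Counter)

    def submit(self, control_expected, control_answer, unknown_id, unknown_answer):
        """Return True if the user passes; record a vote only on a pass."""
        if control_answer.strip().lower() != control_expected.lower():
            return False
        self.votes[unknown_id][unknown_answer.strip().lower()] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed reading once one answer has enough votes, else None."""
        tallies = self.votes.get(unknown_id)
        if not tallies:
            return None
        answer, count = tallies.most_common(1)[0]
        return answer if count >= self.votes_needed else None
```

In this sketch a user who gets the control word wrong is rejected and contributes no vote, which keeps random guesses by bots out of the tally.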

Like many other systems, reCAPTCHA takes the words and adds distortion to them, partly by creating waves in them. Originally, they also added a horizontal line into the text. See the example below, taken from a sample on the reCAPTCHA home page.

This line caused one of the methods used by Wilkins to fail utterly, weakening his overall detection to 5 out of 200. That method relies on OCRopus, an open source document analysis and OCR program from (guess who?) Google. With the removal of the horizontal line, OCRopus suddenly started detecting a respectable number.
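The two distortions described above, a wave and a horizontal line, can be sketched on a toy text "bitmap" as follows. This is only an illustration under simple assumptions: the image is a list of equal-length strings, the wave is a sine-based vertical shift per column, and the line is drawn through the middle row. The function names and parameters are hypothetical, and reCAPTCHA's actual rendering is not described in the sources.

```python
import math


def wave(rows, amplitude=1.0, wavelength=8.0):
    """Shift each column up or down by a sine offset, producing a wavy image."""
    height, width = len(rows), len(rows[0])
    pad = math.ceil(amplitude)  # extra rows so shifted pixels stay in bounds
    out = [[" "] * width for _ in range(height + 2 * pad)]
    for x in range(width):
        dy = round(amplitude * math.sin(2 * math.pi * x / wavelength))
        for y in range(height):
            out[y + pad + dy][x] = rows[y][x]
    return ["".join(r) for r in out]


def strike(rows, y=None, mark="-"):
    """Draw a horizontal line across the image at row y (the middle by default)."""
    y = len(rows) // 2 if y is None else y
    return [mark * len(r) if i == y else r for i, r in enumerate(rows)]


if __name__ == "__main__":
    image = ["#  #  ##  #  #", "#  #  ##  #  #", "#  #  ##  #  #"]
    for row in strike(wave(image)):
        print(row)
```

A line that runs through every character joins the glyphs into one connected shape, which is one plausible reason it would trouble an OCR pipeline that first tries to split text into separate characters.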

Hat tip to The H Online, via AllSpammedUp, via Slashdot. I disagree with the headlines in these publications that reCAPTCHA has been "cracked." While the author claims significant improvements in parsing the text, he still fails the large majority of the time, and what success he has comes from Google's actions, not his. And that's assuming his claims are accurate, as they haven't been verified, and that Google doesn't quickly change its algorithms to combat these methods.