Are You Human? Then Type Out This Book

Google has acquired reCAPTCHA and plans to use the system for digitizing books. Wait… what? CAPTCHA is the method of requiring a user to type in a visually obscured word to prove they are human. How can this digitize books? The answer is a bit obscure and takes some time to discover, but you’ll have fun along the way.

The Google blog links to a Google TechTalk video on Human Computation as an example of how they plan to use their new acquisition. It’s embedded below but at 51 minutes we figure most won’t watch it all so we did it for you. This fascinating discussion looks at how people are already being tricked into solving CAPTCHA challenges, and shows several tested implementations of getting people to input cognitive data computers cannot, under the guise of playing games.

Spammers have to beat the CAPTCHA system in order to get large numbers of free email accounts. There have been examples of software overcoming this test, such as the Greasemonkey script that beat MegaUpload’s security, or Time Magazine’s poll being hacked. But, for the most part, only humans can pass the test. People seeking to bypass millions of CAPTCHA challenges either pay for sweatshop laborers to solve them or, more creatively, they get you to solve them when cruising for porn. This is the proof of concept: we can use people to interpret words computers cannot if we use the right carrot.

The ESP game, discussed in the video, was written in order to correctly tag photographs. Two players are paired up, shown the same picture, and asked to type what they see. The round keeps going until both have typed the same word. With a lot of players, and proper safeguards, these tags are incredibly accurate. Furthermore, the game has been very popular and has the potential to accomplish herculean feats in short amounts of time (namely, tag every image in Google’s image search in just a few months).
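The matching rule is simple enough to sketch. What follows is only an illustration of the idea, not the game's actual code: the function name `play_round`, the turn-by-turn ordering of guesses, and the `taboo` list (one guess at what a "proper safeguard" might look like) are all assumptions.

```python
from itertools import zip_longest


def play_round(guesses_a, guesses_b, taboo=()):
    """Return the first word both players have typed, or None if they never agree.

    guesses_a and guesses_b are the words each player types, in order.
    taboo is a hypothetical safeguard: words that are ignored for this image.
    """
    banned = {word.lower() for word in taboo}
    seen_a, seen_b = set(), set()
    for guess_a, guess_b in zip_longest(guesses_a, guesses_b):
        # Alternate between the players; each guess is checked against
        # everything the partner has typed so far.
        for guess, mine, theirs in ((guess_a, seen_a, seen_b),
                                    (guess_b, seen_b, seen_a)):
            if guess is None:
                continue  # this player has run out of guesses
            word = guess.strip().lower()
            if not word or word in banned:
                continue
            if word in theirs:
                return word  # agreement: this becomes the tag
            mine.add(word)
    return None


tag = play_round(["dog", "grass", "ball"], ["puppy", "ball", "dog"])
```

Neither player sees the other's guesses, so the only reliable way to match is to type what is actually in the picture.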

It seems that Google plans to use these methods to digitize books that are otherwise very difficult to scan with Optical Character Recognition. According to the video, 9 billion human hours were spent playing solitaire in 2003. What if a small portion of this time could be diverted over to playing games that added to the digitized knowledge cache? If the right type of verification game can be developed, it will allow Google to tap society as its typing minions. It’s an interesting proposition and frankly we hope to see it happen.

Awesome idea. Yes, we can still read yellowed, old paper far better than any OCR-toting computer. Just throw a captcha of a snippet from an old book out to eager participants, do it at least three times for consensus, and you have a digitized tome in just hours, and you can do many in parallel. You can even fool people into participating by offering additional captchas alongside required captchas in the marketplace!

So a regular Captcha tests to see if you are human by comparing your typed “decode” to what it already knows the answer is… so if these are non-OCR’able snippets of old books… how does the system know if you’ve typed it correctly? I know in the long run you can send the same snippet to multiple people to make sure they all agree, but how would this work as a Captcha? Will I have to wait to post my message board message until 2 other people have been exposed to the same Captcha I just typed?!

Interesting that you mention Time Magazine’s poll being “hacked”. They didn’t break CAPTCHA at all, they just came up with ideas for how humans could break the system. ReCAPTCHA even came out and said that they knew that they were being attacked, but that the way that 4chan was trying to game the system wouldn’t work.

It was an interesting try, but it shouldn’t be considered a breach of reCAPTCHA’s security.

I think reCaptcha is inherently flawed because it still works in concepts that machines understand. All the spambot needs is better OCR, a technical hurdle. True captchas should base their decision on things that only humans understand. KittenAuth is on the right track, but still asks to determine which is a kitten…the possibility is that image recognition will eventually be able to identify kittens. So we have to take it into the human-only domain. Show pictures and ask “Which animal is cutest?” “Which animal is scariest?” “Which face appears untrustworthy?”. Maybe show short passages of text and ask “Does this make you feel happy, sad, scared” etc. You would need a large pool of sample material that may eventually be manually identified, but there is no shortage of new images. I think we have a ways to go before computers understand emotions, and by that time they may refuse to spam us :)

this isn’t the first time i’ve seen this story, but it still strikes me as a very clever way to kill two birds with one stone.

@vonskippy

is this not a fair trade for the services they provide to you without charging monetary costs?

@macegr

i guess the advantage of this is that even if there is improved spambot ocr, you’re still digitizing text that your internal automated process was incapable of digitizing. i would assume the processing workload of recognizing text in an image is greater than the workload of providing the recaptcha.

@Jay – One way it could work is to give people two words to type. You already know the answer to one, so if they give the correct answer for that word, you let them through and record their answer for the second. When you get consensus on the second word you add it to the pool of known words. The known and unknown words are presented in random order.
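A minimal sketch of that two-word scheme, for anyone who wants to see it spelled out. This is not reCAPTCHA's real implementation: the class name `TwoWordCaptcha`, the image ids, and the `needed` agreement count are made up for illustration, and the random ordering of the two words is left out.

```python
from collections import Counter, defaultdict


class TwoWordCaptcha:
    """Gate on a known word and collect votes on an unknown one (hypothetical)."""

    def __init__(self, known, needed=3):
        self.known = dict(known)            # image id -> verified text
        self.votes = defaultdict(Counter)   # image id -> tally of typed answers
        self.needed = needed                # assumed number of matching answers

    def submit(self, known_id, known_answer, unknown_id, unknown_answer):
        # Only the known word decides whether the user gets through.
        if known_answer.strip().lower() != self.known[known_id].lower():
            return False
        # The user passed, so their reading of the unknown word counts as a vote.
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= self.needed:
            # Enough people agree: promote the word to the known pool.
            self.known[unknown_id] = guess
            del self.votes[unknown_id]
        return True


captcha = TwoWordCaptcha({"img1": "upon"}, needed=2)
first = captcha.submit("img1", "upon", "img2", "marine")
failed = captcha.submit("img1", "wrong", "img2", "marine")
second = captcha.submit("img1", "Upon", "img2", "Marine")
```

Nobody has to wait for anyone else: the user is let through on the known word alone, and the unknown word simply accumulates votes in the background.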

that’s hilarious, my buddy’s wife is always playing all these stupid flash games, one day i made a comment about how possible it was that she was actually controlling a machine in some warehouse somewhere and the game was just a complex UI for free labor……..

@Jay, that’s the way recaptcha works, one known word, one unknown word, if you type in the correct known word though you can type in any word for the unknown, and visually you can tell the known and unknown apart most of the time, once the unknown word was “Marine” and I typed in “Fag”, so I hope that unknown words have to go through a consensus or we could have some very funny ReCaptcha OCRed books.

I cringe every time someone says that we are safeguarding plainly readable data by counting on a digital copy. I see something inherently dangerous in encoding valuable information into a format that requires some sort of device to decode and make human readable. I think the original is far more valuable. It is an issue my family and I struggle with. Despite all of the safeguards on our data and the numerous copies, I would say the oldest digital file we have is from the late ’90s.

I am 33 and I have 33 year old pictures, ~23 year old CDs and 31 year old drawings. I have been using a computer for 25 years, but I think my oldest file is from 1997. Will I have 23 year old CDs? Digital Photos? I am not sure.

It may make an interesting post to see the best way to preserve data and also to find the oldest files that people have that are still readable (I don’t think a 12 inch floppy counts). Is there any digital format that could last as long as the Rosetta Stone without regular maintenance?

Yes, this is what reCAPTCHA has been doing for years now. I imagine Google bought it both so that the sources being typed out were from their own digital library project rather than CMU’s, and also just to get Luis von Ahn.

Hirudinea: How can you tell the test word apart from the unknown word? Even the test words were unknown words before they were added to the pool. They should be of about equal difficulty for you or the OCR to read.
Also, the idea about trying to insert bad words into the pool won’t work for two reasons:
1) Two or more people need to agree on an unknown word for it to be added.
2) If the reCAPTCHA programmers are worth their salt, the OCR will still give a prediction of the unknown word. So if the word was marine, you might get away with entering nnarine or manine, but not fag.
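
The second point can be sketched as an edit-distance check against the OCR's best guess. Whether reCAPTCHA actually does this is not confirmed anywhere in this thread; the function names and the `max_distance` threshold of 2 are assumptions chosen so the examples above come out as described.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # delete char_a
                cur[j - 1] + 1,                     # insert char_b
                prev[j - 1] + (char_a != char_b),   # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def plausible(answer, ocr_guess, max_distance=2):
    """Accept a typed answer only if it is close to what the OCR predicted."""
    return edit_distance(answer.lower(), ocr_guess.lower()) <= max_distance


near_miss = plausible("nnarine", "marine")   # two edits away
typo = plausible("manine", "marine")         # one edit away
unrelated = plausible("cat", "marine")       # nothing like the OCR guess
```

With this threshold, "nnarine" and "manine" are accepted as votes, while a short unrelated word is thrown out before it ever reaches the consensus step.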

When I see a recaptcha I always only enter the word the computer already knows, the other one I write just the letter ‘a’. It’s really easy to see which one is the unknown after some training. This really speeds up entering the captchas (yes, I need to use a service where I need to enter TWO per sent message)

Especially now that google owns it, I’m not wasting my time and making them money.

Yeah Guys,
We have all seen the CAPTCHA stuff with two words. This is a common way of getting an answer. They have two words, one they know, one they do not. They distort the one they do not a little.

For proof, go find one, type in the word on the left correctly, the word on the right incorrectly (something very obviously wrong, like cat where it says dog). You may need two or three goes, but it will eventually pass.

@macegr
The kitten test is unfortunately very stupid. Computers can’t recognize kittens, so humans must categorise the image database. And since one group of humans can categorise the images, another group (the spammers, or the 50c an hour Indian programmers that the spammers hired) can also build up their own database.

I’ve seen several talks about Captcha systems lately, and it is hilarious how easy it is to break them. All you need is OCR and basic image processing facilities, and there are great open source implementations of those available. With those, all it takes is a little bit of knowledge and a 5-10 line script.

I’ve actually just got involved in Project Gutenberg Distributed Proofreaders, which is similar, but more involved – people proofread and format OCR text from scanned copies of public domain texts, one page at a time.

The benefit of the CAPTCHA method, of course, is that getting people to volunteer their brains toward one word for a specific purpose is a lot easier than asking them to do so for multiple pages of text without any sort of compensation.

It depends on one’s feelings about “Social Value” being denatured or not by Financial Entanglements.

Look at it this way:

Goog makes $ off the adverts associated with whatever they “Do.” Even if that’s empowered by Human OCR as Captcha etc, There’s a net GAIN to Humanity’s access abilities. It means in simple reality that Ink on paper is getting captured in electrons at a conversion rate simply impossible to achieve in affordable alternatives. Which is a Very Good Thing for humanity as some people will have their lives enriched way beyond what most of us can grasp. You likely never give a thought to what life would be like with no eyes. If you are dependent on others to read for you as one example.

Another example is texts you might NEVER otherwise lay your eyes on a “Dead Tree” copy of. Everything between Archimedes to Zymurgy has the potential of being accessible to the eyeless thru an Instrumentality like Captcha.

I tire of people who are anti-capitalistic. Yes, Google is making money on this. Yes, people are doing work for Google for free. No, this is not hurting anyone. And YES, THERE IS A MUCH GREATER VALUE FOR EVERYONE! Whether or not you want to read a Marxist publication or you want to read something from the Wall Street Journal 45 years ago. This is still a boon for the human race as a whole. And someday, and I hope it is not soon, your eyes may need help in reading something. And I know, and I hope it happens very soon, a computer will either be able to magnify or even read the text that you wish to read to you. Think about the long run when you anti this or anti that before you say die capitalist pig die.

On the technical aspect of this, you must tip your hat to Google for thinking about this. Google had to take some of their evil capitalistically gained money and buy a helpless small business (capitalistic company) and when they could have used it for simple passphrases, they decided to translate old texts much cheaper and probably much more accurately than any team of humans on the same book would require and probably do wrong, without having to worry about interpretation.

As far as Capitalist Pig Company Google is concerned, I could think of much worse things than this they could be doing and are not. And while I am not exactly a fan of Capitalist Pig Microsoft either, I can tell you that computers have done a lot more to enrich our lives than they have done to harm us. One being this forum, which I hate going political on, it is a site for innovation, not for political discussion.

The known word and unknown word is slick. Yes someone is going to go “flag” on the word “marine”, but you are not going to get fifty different IP address polls to agree to that, even if you spoof. It won’t be two or three people agreeing, it will be a thousand or more. Even with spoofing it is not worth the effort to do. And if there is a fifty/fifty consensus poll on a word, then you can send in a few humans to look at the word without Captcha. Congratulations you Capitalistic Pig Google, die Capitalistic Pig Google, long time from now!
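
That decision rule could look something like the sketch below. The thresholds (`min_voters=50`, `min_share=0.9`), the function name `decide`, and counting one vote per IP address are all assumptions for illustration; nothing here is taken from reCAPTCHA itself.

```python
from collections import Counter


def decide(votes, min_voters=50, min_share=0.9):
    """Decide what to do with an unknown word from (ip, answer) votes.

    Returns ("pending", None), ("accepted", word) or ("review", None).
    """
    latest = {}
    for ip, answer in votes:
        # One vote per address, so hammering from a single IP counts once.
        latest[ip] = answer.strip().lower()
    if len(latest) < min_voters:
        return ("pending", None)      # not enough distinct voters yet
    word, count = Counter(latest.values()).most_common(1)[0]
    if count / len(latest) >= min_share:
        return ("accepted", word)     # clear consensus
    return ("review", None)           # split vote: hand it to a human


honest = [("10.0.0.%d" % i, "marine") for i in range(50)]
spam = [("1.1.1.1", "flag")] * 100
split = ([("10.0.0.%d" % i, "marine") for i in range(25)]
         + [("10.0.1.%d" % i, "flag") for i in range(25)])

results = [decide(honest), decide(spam), decide(split)]
```

A lone prankster never gets past "pending", and a genuinely contested word ends up in front of a human instead of in the book.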

No… Google could be considered as an example of “Microsoft” type business; find something someone else is doing, buy it, say it was your idea.

Also, 4chan already attempted to destroy a CAPTCHA process by spamming the word “penis” into the system… it didn’t work.

Also, the compared words are stored for a while before being compared to a new pair word to test. Allegedly the reCAPTCHA system also has some rough estimates of what the word might be initially… tough to fool it too much with that… hacking it has been tried and hasn’t succeeded… (yet?)

Last year I did some 700 reCaptchas one day… it was a great way to get up on my Dvorak layout.
;-)