Posted by timothy on Thursday March 29, 2012 @05:10PM
from the you-are-the-crowd-being-sourced dept.

smolloy writes "Apparently some users of reCAPTCHA have recently begun seeing photographs appear in their CAPTCHA puzzles: photos that look very much like zoomed-in house numbers taken from Google Street View. It appears that Google has decided to put the reCAPTCHA system to work helping clean up Google Street View images, and 'according to a Google spokesperson, the system isn't limited to street addresses, but also involves street names and even traffic signs.' A large collection of these has appeared on the Blackhatworld website."

I'm guessing you've never done a copy-and-paste on, say, Google Books, because the OCRed text quite frequently contains typos, randomly inserted spaces and completely wrong words. And since reCAPTCHA is used to supplement the OCR on Google Books, it would appear they aren't as smart as you would like them to seem.

Can we all agree on a word for the addresses just to have some fun with Google?

Actually, words instead of numbers could be an issue already. My parents' house does not have a number anywhere. The house has a visible name instead, and that's what is used in letters addressed to them (including government letters): house-name, street-name, etc. Some houses on their street have numbers, but most just have names, and the house names are nothing to do with the names of the occupants. BTW that particular first world country does not have any postal codes, either.

Bingo, AC got it. I think almost every other EU country has postal codes - at least they do where I live. Incidentally, I've had to complain to more than one web-shop in the EU since they have the postal code as a required part of the address. So when ordering a gift for a parent, I have to put some bogus crap down (e.g. repeat the town name) as their "post code".

In the USA, when e911 service is introduced into an area, each street is named and each house numbered in the maps e911 uses; I assume these are official postal addresses as well. It is a good idea to have your house number visible for emergency services to find you.

If they're using this as a way to identify the street numbers, then I would assume that they're randomly matching the numbers with different words and seeing if they can get several matches to the same numbers. I would guess that they're also comparing the results to attempts at automated OCR. It would be difficult to bomb.

Wait, you mean that property records are in public databases! And sales of houses get reported to the government and published in local records! This brand new invasion of privacy cannot be allowed!
Seriously, there are privacy invasions out there that actually matter. Making public information more public is not a privacy invasion, and it makes for a "boy who cried wolf" appearance.

Theoretically, there is enough information in the actual font of the text, the frame that the text is on, and the material and texture of the wall to identify that location uniquely.

If you have seen that sign before, you would be able to recognise that location. USA street names tend to be white text on small blue rectangular signs at 90 degrees to each other, and on posts. London street names tend to be white rectangular plates mounted on walls along with the postcode at the bottom. In Scotland, the street

What is your point? Who cares where a street sign is from? None of them are private information. And you're wrong about US street signs: they vary regionally, but the majority tend to be white on green. And we were talking about house numbers anyway, or so I thought. How is any of what you wrote relevant?

"Could just be me being paranoid, but this sounds like something out of a science fiction book."

Warp drive, teleporters, FTL, light sabers and robots are all in science fiction. Why do you think things in science fiction are bad?

They give you two words to solve. One is an old, known word and the other is a new, unknown word. You have no way to tell which is which. To pass the CAPTCHA, you need to answer both and get the known one correct. Eventually entries can go from unknown to known when enough people provide the same answer.
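A minimal sketch of the grading rule described above, in Python. The names (`grade`, `votes`, `normalize`) are made up for illustration and nothing here is taken from reCAPTCHA's actual code. It only assumes the behaviour the comment describes: passing depends on the known word alone, and the guess for the unknown word is recorded as a vote.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: one vote tally per unknown image.
votes = defaultdict(Counter)


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def grade(control_answer, control_guess, unknown_id, unknown_guess):
    """Return True if the solver passes.

    Only the control (known) word decides pass or fail. The guess for
    the unknown word is recorded as a vote, and only on a pass.
    """
    passed = normalize(control_guess) == normalize(control_answer)
    if passed:
        votes[unknown_id][normalize(unknown_guess)] += 1
    return passed
```

Under this rule a solver who gets the known word right passes no matter what they type for the other word, which is the loophole other comments in the thread describe.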

They give you 2 words: one is an already solved, known value, and the other is an unknown word.
If you get the first word correct, they take the value from your second word and add it to the "possible solutions" list.

After 2000 or so people have solved the word, they examine the results for a statistically unique answer. If there is no outlier (say, 65% giving the same answer), it goes back into the unknown pile.

Once they find a statistically significant answer, it's considered "solved" and is used as one of the initial validation words.
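The promotion step described above could look roughly like this in Python. The 2000-vote minimum comes from the comment. Treating 65% as the minimum share for a winning answer is an assumption, since the comment is not clear on which side of the line 65% falls. `try_promote` is a hypothetical name, not anything from reCAPTCHA itself.

```python
from collections import Counter


def try_promote(votes, min_votes=2000, min_share=0.65):
    """Return the consensus answer for an unknown word, or None.

    votes     -- Counter mapping each submitted answer to its count
    min_votes -- how many answers to collect before deciding
    min_share -- assumed fraction the top answer needs to win
    """
    total = sum(votes.values())
    if total == 0 or total < min_votes:
        return None  # not enough data yet, keep collecting
    answer, count = votes.most_common(1)[0]
    if count / total >= min_share:
        return answer  # "solved": can now serve as a known word
    return None  # no clear winner, back into the unknown pile


# Example: 1500 of 2000 solvers agree, so the word is promoted.
tally = Counter({"upon": 1500, "up0n": 300, "open": 200})
promoted = try_promote(tally)
```

A scheme like this only works if most of the votes come from honest humans, which is exactly the weakness raised further down the thread.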

I have read the quote from Google about what they are doing several times, and I don't see what everyone else sees. It appears to me that they are using the already known street names and numbers as possible ReCAPTCHA images. What they are NOT doing is using the results given by people to define what the image says. The point of the experiment is to determine whether these images are sufficient to separate people from web-bots. I imagine that they will look at the number of 'wrong' answers from both sides of the test, and see if bots are able to parse the street view images significantly more often than the standard test images.

So... can anyone point to something in the Google quote to show me where I went wrong? From TFA, here is the quote:

We’re currently running an experiment in which characters from Street View images are appearing in CAPTCHAs. We often extract data such as street names and traffic signs from Street View imagery to improve Google Maps with useful information like business addresses and locations. Based on the data and results of these reCaptcha tests, we’ll determine if using imagery might also be an effective way to further refine our tools for fighting machine and bot-related abuse online.

Getting around reCAPTCHA logins is usually easy. Just correctly type the easy-to-read word, and an approximation of the number of characters in the hard-to-read one. You don't even have to be close.

Google could have a few thousand house numbers they already know (their own recognition system is probably capable of this), and they can swap these in as well as a hard to read scanned word from a book, and you could never be sure which one was the reCAPTCHA and which was the CAPTCHA.

Yes, I understand this. I understand that they can look for the most common answers among correct control responses, and crowdsource the OCR of difficult Street View images. My point is that this is not what the experiment is doing. The point of the experiment is to determine if these images are as effective as the current images used in the tests. For the purposes of that experiment, it would be much easier (and probably more scientifically accurate) to use images where the correct answer is already known. As

What they are NOT doing is using the results given by people to define what the image says.

Um, no, that's exactly what ReCaptcha is for! The standard ReCaptcha images are all from old books that were scanned in (and presumably had trouble being OCRed with high confidence), and Google used ReCaptcha to "read" the words.

I read how it works. Multiple users are shown the same image, and once a few people have identified a given image as the same word, it's treated as the "correct" answer, and then later users have to match t

Yeah, the problem with that is that it can't work when most of the humans are robots. The robots will make guesses using standard algorithms, and their guesses will be pretty consistent with the other robots' guesses (which are quite probably the same robot in another instance). Then Google thinks the robot guess is correct, because it's overwhelmingly the most consistent answer. And humans who give the correct answer get marked wrong, because they're a minority.

It's quite noticeable if you use a site which relies heavily on reCAPTCHAs. For example, when you get a word which has the old English S [wikipedia.org], which looks like a modern lowercase F, you're much better off claiming it's an F instead of giving the correct answer.

My understanding of ReCAPTCHA is that it's to help translate books for libraries. Google has distorted that by using it to improve its own databases. I personally don't have ReCAPTCHA on my website, but if I did I would be completely pissed off. Google is a for-profit company and can pay to do user studies to see how well people can read images. I'm willing to donate my time/reading ability to random libraries, not Google.

Back when reCaptcha showed two words that you could find in the dictionary, black on white, I had no problem with it; it seemed like a good idea and you might be contributing to digitizing a book or something. But now you just get randomly generated characters with a zigzag going through the middle and blobs that invert it, and it's hard to tell if this one letter is an 'i' or an 'r' or a 't'. So I don't even bother looking at the real word and just solve the generated one.

I thought text in Streetview was blurred out by design in the same way that faces were-- automatically and for security reasons (read: so Google doesn't get sued by crazy OMG I'M ON TEH INTERNET people).

I'd actually prefer if they un-blurred all street numbers and signs. It's fine to rely on Maps' street number location when you're in a huge city, and the difference between 123 fake street and 125 fake street is ten feet or so. But last time I planned a ro

Yet Google would have to know what the address number really was in order to validate the reCAPTCHA, so that can hardly be why they are doing it. They don't need to crowdsource an answer that they already know.

No they don't. They also add an altered text image alongside the picture (which presumably they generated), and can use that to validate the CAPTCHA. The street number can be validated by numerical probability (if 70% of them say it is "257", and the numbers "2,5,7" appear frequently in the rest, it is probably "257") even if they don't already know what it is.
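One way to read the "numerical probability" idea above, as a rough Python sketch. The 70% figure is from the comment. The rule that at least half of the dissenting answers must still contain every digit of the leading answer is an assumption made up to have something concrete to run, and `likely_number` is a hypothetical name.

```python
from collections import Counter


def likely_number(answers, min_share=0.70, min_support=0.5):
    """Guess a house number from crowd answers, or return None.

    The top answer must reach min_share of all answers, and at least
    min_support of the remaining answers must contain all its digits.
    """
    if not answers:
        return None
    leader, count = Counter(answers).most_common(1)[0]
    share = count / len(answers)
    rest = [a for a in answers if a != leader]
    # Dissenting answers that still contain every digit of the leader.
    supporting = sum(1 for a in rest if all(d in a for d in leader))
    support = supporting / len(rest) if rest else 1.0
    if share >= min_share and support >= min_support:
        return leader
    return None


# Seven of ten say "257"; two of the other three still contain 2, 5 and 7.
guess = likely_number(["257"] * 7 + ["2557", "2157", "19"])
```

No answer known in advance is needed here, only enough solvers who passed the other half of the challenge.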

I don't think you know how reCAPTCHA works. You are always presented with two different items to decode. One of them is always a known answer, and the other they are less sure about, but become more sure after they show it to enough people and get a crowd sourced answer. They don't give you two prompts just to be double sure you are human.

I only have about a 60% success rate on those swirly semi-inverted ones. My wife's friend's decaptcha software does a much better job than I do with its 79% success rate. I had wondered whether, as they get harder to read, the day was almost here when only machines would have the ability to decode captchas and prove that they were human.

ReCaptcha will accept any sequence of symbols for the unknown word. The most telling sign that a word is unknown is that, out of the two, it is the one that is ACTUALLY A WORD. Other signs are non-standard fonts, scanning distortions, non-Latin symbols, and punctuation marks.

Furthermore, there is a 1-character fault tolerance for the sequence of letters that ReCaptcha actually uses to check whether you pass or fail.
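A one-character tolerance like the one claimed above could be checked as follows. This is a sketch in Python that assumes "one character" means a single substituted, missing or extra character (edit distance of at most 1). The comment does not say which kinds of error are forgiven, and `within_one_edit` is a hypothetical name.

```python
def within_one_edit(guess, answer):
    """True if guess equals answer or differs by one substitution,
    one missing character, or one extra character."""
    if guess == answer:
        return True
    if abs(len(guess) - len(answer)) > 1:
        return False
    if len(guess) == len(answer):
        # Same length: allow exactly one mismatched position.
        return sum(a != b for a, b in zip(guess, answer)) == 1
    # Lengths differ by one: dropping one character from the longer
    # string must give the shorter one.
    short, long_ = sorted((guess, answer), key=len)
    return any(long_[:i] + long_[i + 1:] == short
               for i in range(len(long_)))
```

With this rule "mornng" and "marning" both pass for "morning", while a swap such as "mroning" counts as two errors and fails.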

Wait - you can type something other than "nigger" for the unknown word?

One of these days I'm going to do that when someone's looking over my shoulder and get a serious WTF from them.

To whoever modded the parent at -1, pay attention. Out of the two images you are presented with, one is known, the other is unknown. When a large enough number of people have entered the same answer for the unknown image, it gets moved to the 'known' list with that particular answer.

So on some places like 4chan, there has been a large effort to get as many people as possible to answer the unknown image with the word 'nigger'. If enough people do it on a single unknown image, it will get added to the pool with the

Yet Google would have to know what the address number really was in order to validate the reCAPTCHA, so that can hardly be why they are doing it. They don't need to crowdsource an answer that they already know.

Doubtful. They post two images. One they know and one they don't. They use the data for the one they don't, combine it with data from 1000s of other people who have also solved that captcha to get an accurate picture of what that particular number is. They use the one they know to validate the recaptcha data and verify you're human...

What makes this more of an invasion of privacy than whatever they used to do to find house numbers? I assume they used some combination of databases, OCR, and paying someone to do it.

I'm surprised that this is a big help to them - if they can identify that something on a house is the house number (as opposed to a shadow or some home design pattern), it's surprising that they can't identify the number itself. It seems like there's going to be relatively few instances where something is identifiable as a hous

Different angles make it hard to be sure you have the number right. If you look at a street photo like a book you're going to OCR, you have first the layout detection, then identify the image part and the text part. Solving this problem would be similar to identifying where the page number is, to be eliminated from the text.

Taking a laser measurement, un-warping the photo, and then doing traditional OCR would be awesome, if they had the forethought to include the laser part in their vast collection, but t