Leveraging the power of homoglyphs in CAPTCHAs

A homoglyph is a character that looks like another character. For example, if you look at this letter: P, you would say it is a 'P'; whereas I say it is the Cyrillic equivalent of 'R'. Which one of us is right?

Regardless of the answer, there are ways in which this information can be applied for noble and evil purposes.

Note: this article contains exotic characters. If your browser does not handle Unicode gracefully, you will see squares here and there. This is discussed in the pros/cons section.

There are a lot of homoglyphs out there. Here are some more:

C - looks like a Cyrillic 'S'

Y - looks like a Cyrillic 'U'

3 - the digit three looks like a Cyrillic 'Z', etc

The list above should be rewritten as:

С - looks like a Latin 'C'

У - looks like a Latin 'Y'

З - the Cyrillic 'Z' looks like the digit '3', etc
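To see that the two lists really contain different characters, you can ask Unicode for their names. The snippet below is a minimal sketch using Python's standard unicodedata module; the pairs list and the describe helper are names made up for this illustration.

```python
import unicodedata

# Each pair holds a Latin character (or digit) and its Cyrillic lookalike.
# The Cyrillic ones are written as escapes so they cannot be confused here.
pairs = [("C", "\u0421"), ("Y", "\u0423"), ("3", "\u0417")]

def describe(ch):
    """Return the code point and the official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for latin, cyrillic in pairs:
    print(f"{latin}: {describe(latin)}  |  {cyrillic}: {describe(cyrillic)}")
```

The characters in each pair look the same on screen, but they have different code points and different names, which is all a computer cares about.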

With Unicode being widely supported, it gets better and better. There are a lot of other characters that can be easily interpreted by people:

¹ - that's a superscript 1

ʙ - a "small capital B", which looks like a Cyrillic 'V'

∪ - that's a "union" sign, not a 'U'

￮ - that's a "halfwidth white circle", etc

Wait, it gets even better, but brace yourselves, for you may not see these if your fonts can't handle them:

ℍ - H

ℝ - R

ℳ - M

Ⅿ - obviously, it is not an 'M', it is the Roman numeral for 1000

⒳ - this is not '(x)', it is a single character that has the x within parens

Γ - a Greek Gamma, looks like a Cyrillic 'G', etc

There are homoglyphs for punctuation marks as well:

: - ：∶:։

) - ）)

? - ?？

There are a lot of ways to write the same thing. This can be used against us, now that there is such a thing as an IDN (internationalized domain name). A phisher can create a site whose address looks like "paypal.com" but is in fact something else. Even a careful web-surfer can fall into the trap, because the address looks right.
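Here is a rough illustration of the trick. The spoofed name below is a hypothetical lookalike built for this example, in which every letter except the final 'l' is Cyrillic. Python's built-in "idna" codec (which implements the older IDNA 2003 rules, so treat this as a sketch) shows the ASCII form that such a name would take on the wire.

```python
# Hypothetical lookalike: Cyrillic er, a, u, er, a followed by a Latin 'l'.
spoofed = "\u0440\u0430\u0443\u0440\u0430l.com"
genuine = "paypal.com"

# On screen the two names can look alike, yet the strings are not equal.
print(spoofed == genuine)

# The ASCII-compatible form of the spoofed name starts with "xn--",
# while the genuine all-ASCII name is left untouched.
ascii_form = spoofed.encode("idna")
print(ascii_form)
print(genuine.encode("idna"))
```

The "xn--" prefix is what gives the lookalike away, but only if the browser chooses to display it.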

However, we live in a yin-yang-ish universe, so there should be something positive in this story. There is! One way to leverage the power of homoglyphs for the benefit of humanity is to apply them in CAPTCHAs.

The idea is very simple - ask a person to type what they see in order to post a comment. This is exactly like the classic CAPTCHA; the only difference is in what you show.

A regular CAPTCHA can be extremely convoluted, whereas a homoglyph CAPTCHA looks straightforward and is easy to read.

Such CAPTCHAs can be randomly generated from character equivalence tables: comprehensive tables that do the same thing as the lists at the beginning of this story. Need something that looks like ':'? Go ahead and use any of these: ：∶:։.

A hypothetical spam-bot that wants to post a comment will input the character 'as is', whereas a reasonable human being will type ':'.
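The generate-and-check cycle described above could be sketched roughly as follows. The HOMOGLYPHS table is a tiny hand-made stand-in for a real equivalence table, and make_challenge and is_human are hypothetical names chosen for this example.

```python
import random

# Tiny, hypothetical equivalence table: the key is what a human is expected
# to type, the values are lookalikes that may be shown instead.
HOMOGLYPHS = {
    "c": ["\u0441"],                      # Cyrillic small es
    "o": ["\u043e"],                      # Cyrillic small o
    "p": ["\u0440"],                      # Cyrillic small er
    "y": ["\u0443"],                      # Cyrillic small u
    ":": ["\uff1a", "\u2236", "\u0589"],  # fullwidth colon, ratio, Armenian full stop
}

def make_challenge(answer, rng=random):
    """Swap every character that has lookalikes for a randomly chosen one."""
    return "".join(
        rng.choice(HOMOGLYPHS[ch]) if ch in HOMOGLYPHS else ch
        for ch in answer
    )

def is_human(answer, typed):
    """Accept only the plain answer; echoing the challenge back fails."""
    return typed.strip() == answer

answer = "copy:"
challenge = make_challenge(answer)
print(challenge)                     # what the visitor sees
print(is_human(answer, "copy:"))     # True: typed on an ordinary keyboard
print(is_human(answer, challenge))   # False: the challenge sent back as is
```

Note that the check compares against the plain answer only, so anything that sends the displayed text back unchanged is rejected.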

You can further simplify the scheme by using a static CAPTCHA (something that used to work on the Coding Horror blog: all you had to do was type the word 'orange', and it never changed). For example, I could say:

Type "РУТНОN" to post your comment

You will think of a snake and write "python", but the truth is that the only Latin character in that word is the 'N' at the end; everything else is Cyrillic (in Russian it reads as "rutnoN"). What you just saw is a homograph - the same as a homoglyph, but scaled up to the level of a word.

Using homoglyphs in CAPTCHAs has advantages and disadvantages. The cons:

Popularity is bad - eventually, the bad guys will figure it out and use their own homoglyph tables. This method will work well if only a few sites use it.

Copy/pasting - some people are lazy and will not type what they see; they'll just copy the text. Perhaps this can be addressed by writing "do not copy/paste the text" or by making the edit box "unpastable".

Unicode coverage is not perfect. Some people who use old browsers (or who lack the right fonts) will see a bunch of squares and will be unable to type anything. The solution is to avoid exotic homoglyphs (Cyrillic characters should work everywhere [citation needed]).