Special Project

Compute Visual Similarity of Top-Level Domains

Compute the visual similarity between a possible new Generic Top-Level
Domain (TLD) and other proposed TLDs, current TLDs, and reserved
words.

This web site describes
experimental software developed at
the National Institute of Standards and Technology (NIST).
No algorithms, code, or descriptions in whole or in part are
recommended, used, or endorsed by the Internet Corporation for
Assigned Names and Numbers (ICANN) or any other entity.

Compare Against All

Compare a string to other proposed TLDs, current
TLDs, including country codes, and reserved words.

Proposed Top-Level Domain:

Note: All characters other than 0-9, a-z, A-Z, and hyphen (-) are removed.

Compare Two Strings

Compare two strings to each other.

String 1:
String 2:

Note: All characters other than 0-9, a-z, A-Z, and hyphen (-) are removed.

Background

Computers connect the Internet with numbers, IP addresses. However,
people use strings ending in short segments like .edu, .uk, .tv, and
.com, called Top-Level Domains (TLDs) to navigate the World Wide
Web. Ensuring
the ongoing security and stability of the Domain Name System is one
activity of the Internet Corporation for Assigned Names and Numbers
(ICANN). With the growth of the Web, there is a possibility of a lot
of new TLDs.
As one method to implement the recommendation that, "Strings
must not be confusingly similar to an existing top-level domain ...",
this web page invokes an algorithm developed at NIST "to provide an
open, objective, and predictable mechanism for assessing the degree of
visual confusion" between proposed or existing TLDs.

The algorithm takes into
account varying degrees of similarity between some 60 pairs,
like 0 and O (zero and oh), 1 and l (one and L), Z and 2, h and n, rn
(R and N) and m, and w
and vv (v v). Even insertions or deletions may cause confusion, for
example .aaaah and .aaaaah look very much alike. The task is all the
more challenging because domain names are not case sensitive.

Does such confusion happen in real life?
Here's a piece of text from my screen.

I actually read the
second line as "w a m s", not "w a r n s", even though "wams" is not a
word!

Note that the algorithm is not meant to consider phonetic similarity.
For example, "fish", "phish", and "fiche" sound alike, but are visually
distinct and
unlikely to be confused.

Web Interface

The first form has the algorithm compare a string against
proposed generic TLDs, existing TLDs, including country codes, and
reserved words. It reports exact matches, near matches, and best
matches if there are no near matches.

The second form uses the algorithm to compare two strings against each
other.

Only alphabetic characters (a-z and A-Z), numerals (0-9),
and hyphens (-) are allowed in
strings. See
RFC 1035 2.3.1.

The code is written in
Python.
The interface to the algorithm itself is a single function,
howConfusableAre(). It takes two parameters: the two strings to be
compared.

HowConfusableAre() calls levenshtein()
to compute a form of edit difference, then normalizes the score and
accounts for string lengths.

Levenshtein() takes two strings. It is an enhanced Levenshtein
distance algorithm that accounts for substituting two
characters by one (or vice versa), inserting an additional repeated
character, and transposition, as well as the usual insertion,
deletion, and substitution.
The result is roughly the number of visual differences
between the strings. I say "roughly" because substituting O (upper
case letter "o") for 0 (zero) is a much smaller difference than
substituting, say, w for t. Levenshtein() calls two routines to find
similarity, and hence cost, for substituting or transposing characters
or "digraph" (character pairs): characterSimilarity() and
digraphSimilarity().

Both characterSimilarity() and digraphSimilarity() take two strings
(single characters in the case of characterSimilarity()). Both work
much the same way: look up
the passed strings in a table. If they are in the table, return the
value in the table. If not, use the default: 0 means completely
different and 1 means identical.

Cautions

The algorithm, consisting of the distance measure, the scoring
formula, and the character similarities are mostly just my
estimations. Several people have looked at resulting scores. I
adjusted the formula and a few weights to correspond with feedback.
Still the algorithm has not been systematically or independently
validated.