Visualizing city similarity

This blog post explains an alternative way to figure out how
similar cities are. After you read it, you will realize why I think Madison
and Reykjavik are very similar cities.

Teleport Cities allows you to research the
most interesting cities in the world using a sophisticated scoring system
that compares and contrasts them. Using Santiago as
an example, this is what you see when you first visit its Teleport City
page:

Each score is backed by carefully designed data science, which helps us
rank all 134 cities that are currently part of Teleport Cities. By the
way, these scores are not absolute, but are relative to the rest of the
cities we feature.

A question that comes to mind is: how can we compare different
cities? There are two main ways
of doing this in Teleport Cities. The first one lets you compare two cities
side by side. For example, if we want to compare Santiago with Oslo, we
get this view:

The other way is to rank all cities by
logging in and setting your personal preferences. We then calculate a score
for each city, giving you a ranked view of all of them. This is the world
through your personal lens, if you will:

However, there is one slight problem with
just adding up all scores. If two cities both have a total score of 75,
does that mean that they are similar? A simple counter-example shows that
a single number does not capture similarity. Say that we have only two
categories: commute and safety. If city A has 50 for commute and 25 for
safety, but city B has 25 for commute and 50 for safety, both totals are
75, yet the cities are clearly not alike. You can extend this to all our
categories (we have 15 categories in our public view, but 20 categories for
logged-in users).

Similarity Metric

We are fortunate to have lots of
mathematical techniques for estimating how close or far from each
other two data sets are. One of the best-known metrics is the Euclidean
distance. You have probably used it already, because in its
most basic form, it is used to calculate the length of the hypotenuse
of a right triangle (remember the Pythagorean theorem?).

The Euclidean distance can be calculated
between points in any number of dimensions. We can treat each score
category in a Teleport City as a dimension. In the public view of our
cities, that gives us fifteen individual dimensions. The formula is
very simple: you take the difference between the two cities' scores in one
category and square it, repeat the same for every category, add up all the
results and then take the square root of the total sum.
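
To make the formula concrete, here is a minimal sketch in Python with
NumPy. The function name and the two score lists are my own placeholders,
taken from the commute and safety counter-example earlier in this post, not
real Teleport data.

```python
import numpy as np

def euclidean_distance(scores_a, scores_b):
    """Euclidean distance between two equally long score vectors."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both cities need the same categories")
    # Square the per-category differences, add them up, take the root.
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical scores in the order (commute, safety).
city_a = [50, 25]
city_b = [25, 50]

total_a = sum(city_a)
total_b = sum(city_b)
distance_ab = euclidean_distance(city_a, city_b)
print(total_a, total_b, round(distance_ab, 2))
```

Both totals are the same, but the distance between the two cities is far
from zero, which is exactly what the single summed score hides.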

This gives us a single metric: a number that tells you whether two cities
are close to or far from each other. Most importantly, for any city you
pick, you can now find which cities in the whole set are closest to it and
which are farthest from it.
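
As a rough sketch of that lookup, the snippet below builds the full table
of pairwise distances and then picks the closest and farthest city for a
given one. The city names, the scores and the helper function are made up
for illustration.

```python
import numpy as np

# Hypothetical cities and scores: placeholders, not Teleport data.
names = ["A", "B", "C", "D"]
scores = np.array([
    [50.0, 25.0, 60.0],
    [25.0, 50.0, 60.0],
    [48.0, 27.0, 58.0],
    [90.0, 90.0, 10.0],
])

# Pairwise Euclidean distances via broadcasting: entry [i, j] is the
# distance between city i and city j.
diff = scores[:, None, :] - scores[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

def closest_and_farthest(index):
    """Return the names of the closest and farthest city from `index`."""
    row = distances[index].copy()
    farthest = names[int(np.argmax(row))]
    row[index] = np.inf  # exclude the city itself from the closest search
    closest = names[int(np.argmin(row))]
    return closest, farthest

print(closest_and_farthest(0))
```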

Plotting distances in a two-dimensional chart

Human beings can easily represent two
dimensions on any surface (or three with a bit of imagination).
Comprehending more than three dimensions is nearly impossible for most
people. This means that if we want to visualize the relationship between
cities, we should do it in two dimensions preferably.

There are several methods to “reduce” the
number of dimensions; one family of them is called multidimensional scaling
(MDS). With it, we can take our 15 city dimensions and represent them in
two. One technique that has stood the test of time, and a favorite of
mine, is Sammon’s Mapping, developed by John W. Sammon in 1969.

In summary, this algorithm starts with a random or pre-configured set of
two-dimensional points and then iterates over all data points (our
cities). With every pass, it tries to reduce the error (called stress by
Sammon) between the distances in the two-dimensional set and the actual
distances in the original fifteen-dimensional set. Once the stress
conditions are met, the algorithm stops and we are left with a set of
points that can easily be drawn on any flat surface. The x- and y-axes are
not meaningful in these charts, but the distance between points is.
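
For the curious, here is a small sketch of the idea in NumPy. It is not the
code behind the charts in this post: it uses plain gradient descent with
step halving instead of the update rule from Sammon's 1969 paper, it runs
on random made-up scores, and all function names are my own. It also
assumes that no two cities have identical scores.

```python
import numpy as np

def pairwise_distances(points):
    """Matrix of Euclidean distances between all rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(target, current):
    """Sammon's stress: squared distance errors, each divided by the
    original distance, normalized by the sum of original distances."""
    upper = np.triu_indices_from(target, k=1)
    t, d = target[upper], current[upper]
    return float(np.sum((t - d) ** 2 / t) / np.sum(t))

def sammon_mapping(data, n_iter=200, seed=0, eps=1e-12):
    """Map the rows of `data` to 2-D points by gradient descent on the
    stress. Assumes no two rows of `data` are identical."""
    rng = np.random.default_rng(seed)
    target = pairwise_distances(data)
    scale = target[np.triu_indices_from(target, k=1)].sum()
    np.fill_diagonal(target, 1.0)  # dummy value, keeps divisions safe
    layout = rng.normal(size=(data.shape[0], 2))  # random starting points
    stress = sammon_stress(target, pairwise_distances(layout))
    step = 1.0
    for _ in range(n_iter):
        dist = pairwise_distances(layout)
        np.fill_diagonal(dist, 1.0)
        # Weight of each pair in the gradient; zero on the diagonal.
        weight = (target - dist) / (target * dist + eps)
        diff = layout[:, None, :] - layout[None, :, :]
        grad = (-2.0 / scale) * (weight[:, :, None] * diff).sum(axis=1)
        improved = False
        while step > 1e-12 and not improved:
            candidate = layout - step * grad
            new_stress = sammon_stress(target, pairwise_distances(candidate))
            if new_stress < stress:
                layout, stress, improved = candidate, new_stress, True
                step *= 2.0  # be bolder after a successful step
            else:
                step *= 0.5  # overshoot: retry with a smaller step
        if not improved:
            break  # no step reduces the stress any further
    return layout, stress

rng = np.random.default_rng(42)
scores = rng.uniform(0, 10, size=(20, 15))  # 20 made-up cities, 15 categories
layout, stress = sammon_mapping(scores)
print(layout.shape, round(stress, 4))
```

The step halving only guarantees that the stress never goes up between
passes. It does not guarantee the best possible layout, so different
starting points can give different maps.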

Here is an example of running this algorithm on the original fifteen
categories for all 134 Teleport Cities and representing the result in a
two-dimensional chart. Two cities that are close in the plot very likely
have similar category scores, and vice versa.


Sammon’s mapping applied to all 134 Teleport Cities

Let’s zoom in on the plot and check a pair of cities that are close in the
chart above, but that we wouldn’t have expected to be similar:

Although not every category matches exactly, the categories with a high
score are mostly the same in both cities. Naturally, the scores are not
identical; if they were, both dots would sit in exactly the same location
in Sammon’s mapping. Who knew that Wisconsin and Iceland had something in
common?

Improving Sammon’s Mapping with a Voronoi Tessellation

As we saw above, Sammon’s mapping is pretty
good at representing city similarity in a two-dimensional plot. It is
difficult to see what the Teleport score for each city is, though. I tried
to show it above by making the dot size proportional to the city score, but
I am fairly sure that most people did not even notice.

Voronoi diagrams are named after Georgy Voronoy, who worked on the
mathematics behind these tessellations in the early 20th century. However,
these diagrams are older than his research and were used even by Descartes
himself in 1644. The basic idea of these representations is to draw
polygons (called cells) around points. Cells are determined by distance:
each cell covers the area that is closer to its own point than to any
other point.
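
The nearest-point idea behind the cells can be sketched with a brute-force
NumPy snippet that labels every node of a grid with its nearest point. The
three positions below are made up and simply stand in for the output of
Sammon's mapping; the function name is my own.

```python
import numpy as np

# Hypothetical 2-D positions, standing in for mapped cities.
points = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])

def voronoi_cell_index(locations, sites):
    """For each location, return the index of the nearest site."""
    diff = locations[:, None, :] - sites[None, :, :]
    return np.argmin((diff ** 2).sum(axis=-1), axis=1)

# Rasterize: label every node of a coarse grid with the cell it falls in.
xs, ys = np.meshgrid(np.linspace(-1, 5, 61), np.linspace(-1, 4, 51))
grid = np.column_stack([xs.ravel(), ys.ravel()])
labels = voronoi_cell_index(grid, points).reshape(xs.shape)

print(labels.shape, np.unique(labels))
```

This only approximates the cells on a grid. A library routine such as
`scipy.spatial.Voronoi` computes the actual polygon vertices, although
SciPy is my suggestion here and may not be what was used for these charts.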

Each cell was colored based on the score
for the city. Reds are for low scores and greens are for high scores. To
increase the differentiation between cells, scores were normalized so
strong greens were given to the highest scores and strong reds to the
lowest.
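
One simple way to do this kind of coloring is min-max normalization
followed by a red-to-green ramp, sketched below. The scores and function
names are placeholders, and the exact normalization and color scale used
for the chart may differ.

```python
import numpy as np

def normalize_scores(scores):
    """Rescale scores so the lowest becomes 0.0 and the highest 1.0."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.full_like(scores, 0.5)  # all equal: use the midpoint
    return (scores - scores.min()) / span

def red_green_ramp(normalized):
    """Map values in [0, 1] to RGB rows: 0 is pure red, 1 is pure green."""
    t = np.asarray(normalized, dtype=float)
    return np.column_stack([1.0 - t, t, np.zeros_like(t)])

# Hypothetical city scores, not Teleport data.
city_scores = [52.0, 61.0, 70.0]
colors = red_green_ramp(normalize_scores(city_scores))
print(colors)
```

A ready-made diverging colormap, for example Matplotlib's `RdYlGn`, would
give a smoother scale than this two-color ramp.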

Here is the resulting diagram, using the same mapping
as the previous chart:

Teleport City similarity using Sammon’s Mapping and Voronoi Cells

As the chart above shows, cities that are similar can still differ clearly
in score. If we return to our Madison and Reykjavik example, the
difference is now much easier to see. Madison seems to score slightly
higher than Reykjavik, but both are definitely higher than Milwaukee:

Customized Visualizations of City Similarity

The charts above use the default scores
provided by Teleport Cities. Even though we update our data frequently,
default scores are pretty much stable. For example, a city that is
expensive today will not get cheaper in just a few days.

Where it gets interesting is when users
actually choose their preferences and generate a personal set of scores for
all cities. To give you an idea, anyone can create an account and log in to
set their preferences. Some examples of our preference dialogs are:

By selecting a few preferences, the
resulting scores become a unique view of what a person is looking for.
Also, cities that are similar under one combination of preferences are not
necessarily similar under another. In this way, the similarity map
is truly personal. Below I have collected three example maps from
hypothetical users with different priorities who have activated various
preferences.
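
To illustrate why the neighbors change, here is a toy sketch in which
preferences are modeled as per-category weights applied before measuring
distance. This weighting scheme is purely an assumption for illustration;
it is not how Teleport computes personalized scores, and the cities and
numbers are made up.

```python
import numpy as np

# Hypothetical scores in the order (internet, safety); not Teleport data.
names = ["X", "Y", "Z"]
scores = np.array([
    [8.0, 2.0],
    [8.0, 9.0],
    [3.0, 2.5],
])

def nearest_city(index, weights):
    """Name of the city closest to `index` after weighting categories."""
    weighted = scores * np.asarray(weights, dtype=float)
    dist = np.sqrt(((weighted - weighted[index]) ** 2).sum(axis=1))
    dist[index] = np.inf  # ignore the city itself
    return names[int(np.argmin(dist))]

print(nearest_city(0, [1.0, 1.0]))  # equal weights (assumed)
print(nearest_city(0, [3.0, 0.5]))  # internet matters most (assumed)
```

With equal weights, city X is closest to Z. Once internet access dominates,
its nearest neighbor becomes Y.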

Preferences (example 1):

Climate: similar to Denver

Internet access: very important

Culture: cinemas, comedy clubs, concerts

Example 1

Preferences (example 2):

Low crime rate: very important

Healthcare: best possible care

Education: schools for my kids

Example 2

Preferences (example 3):

Startup scene vibrance: very important

Venture capital ecosystem: somewhat important

Travel connectivity: medium hub

Ease of starting a business: very important

Low corporate taxes: somewhat important

Fast internet connectivity: very important

Example 3

Do you have any comments or corrections? Just leave a comment below or get
in touch with us.

All charts in this post were generated using IPython Notebook, NumPy,
scikit-learn, Matplotlib, D3 and mpld3.