Visualizing possible cities for Amazon’s new headquarters (using R)

Last week, Amazon announced that it has started to search for a new city in which to build a second headquarters.

Among several selection criteria, Amazon indicated that it is looking for a city with more than 1 million people and a good pool of tech talent.

While reading about Amazon’s new HQ search on a news website, I encountered a dataset of cities that might qualify: cities with over 1 million people in the metro area, and the corresponding percent of people with college degrees in each city.

The news website already visualized it, but I want to show you how to do this in R. It’s an excellent little exercise to show you again how to scrape/wrangle/visualize data.

With that in mind, we’re going to scrape the data, wrangle it into shape (using dplyr and a few other tools), and then visualize it as a map using ggplot2.

I’ll preface this by saying that this is an imperfect analysis. We don’t know the full and final selection criteria, and even if we did, a full analysis would be far beyond the scope of a simple blog post.

Having said that, this is a good “first pass” at such an analysis: the quick-and-dirty version.

Furthermore, if you’re getting involved in data science, this will give you some hints about how to use R as an analytical tool. If you’re in marketing, sales, or operations, you could very easily use this quick-and-dirty analysis as a template and starting point for some of your own work.

As it turns out, when we scraped the data, the original column names (the column names that appeared on the website) ended up in the first row of our newly created dataframe.

That header row is not real data, so we need to remove the first row.

#==============================================
# REMOVE FIRST ROW
# - when we scraped the data, the column names
# on the table were read in as the first row
# of data.
# - Therefore, we need to remove the first row
#==============================================
df.amz_cities <- df.amz_cities %>% filter(row_number() != 1)
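To see the effect on a small scale, here is a minimal sketch that uses a made-up data frame rather than the scraped table. The column names X1 and X2 and all of the values are placeholders for illustration, and the sketch assumes only that dplyr is installed.

```r
library(dplyr)

# Toy stand-in for a freshly scraped table: the header text
# ended up in row 1 instead of in the column names.
# (Placeholder values for illustration, not the real data.)
df.toy <- tibble(
  X1 = c("metro_area", "Metro A-Suburb A", "Metro B-Suburb B"),
  X2 = c("population_tot", "1,500,000", "2,250,000")
)

# Keep every row except the first
df.toy_clean <- df.toy %>% filter(row_number() != 1)

# slice(-1) is an equivalent, slightly shorter way to drop row 1
df.toy_sliced <- df.toy %>% slice(-1)

print(df.toy_clean)
```

Both versions return the same two data rows, so which one to use is mostly a matter of taste.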

Now we’re going to modify the data type of two variables.

bachelors_degree_pct and population_tot need to be numeric variables, but when we scraped them, they were read in as character variables. This being the case, we’re going to use some techniques to parse/coerce them into numerics.

#===================================================================================
# MODIFY VARIABLES
# - both bachelors_degree_pct and population_tot were scraped as character variables
# but we need them in numeric format
# - we will use techniques to parse/coerce these variables from char to numeric
#===================================================================================
#--------------------------------
# PARSE AS NUMBER: population_tot
#--------------------------------
df.amz_cities <- mutate(df.amz_cities, population_tot = parse_number(population_tot))
df.amz_cities %>% head()
#-----------------------------
# COERCE: bachelors_degree_pct
#-----------------------------
df.amz_cities <- mutate(df.amz_cities, bachelors_degree_pct = as.numeric(bachelors_degree_pct))
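The difference between parsing and coercing is easier to see on a couple of standalone values. The sketch below is only an illustration: it assumes the parsing step relies on readr::parse_number(), and the strings are placeholders in roughly the same shape as the scraped columns, not the actual data.

```r
library(readr)

# Placeholder strings in the same shape as the scraped columns
population_chr <- c("1,500,000", "2,250,000")
degree_chr     <- c("31.5", "42.0")

# parse_number() drops grouping marks such as commas before converting
population_num <- parse_number(population_chr)

# as.numeric() is enough when the strings are already plain numbers
degree_num <- as.numeric(degree_chr)

# as.numeric() alone cannot handle the commas: it returns NA (with a warning)
population_bad <- suppressWarnings(as.numeric(population_chr))

typeof(population_num)
typeof(degree_num)
```

This is why a population column with thousands separators likely needs a parser, while a plain percentage column can simply be coerced.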

Next, we're going to create a variable that contains the city name.

When we read in the data, there was a variable called ‘metro_area’. It contains values like “New York-Newark-Jersey City.” The metro_area variable might be useful for some things, but these metro names are too broad and may cause errors when we geocode our data (to get the lat-long coordinates). For geocoding, it is better to have a single, specific city name.

This being the case, we will create a new city variable by extracting the city names from the metro names. To do this, we will use stringr::str_extract() along with a regular expression that pulls out the city names that we want.

#=============================================================
# CREATE VARIABLE: city
# - here, we're using the stringr function str_extract() to
# extract the primary city name from the metro_area variable
# - to do this, we're using a regex to pull out the city name
# prior to the first '-' character
#=============================================================
df.amz_cities <- df.amz_cities %>% mutate(city = str_extract(metro_area, "^[^-]*"))
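If the regular expression looks cryptic, it helps to try it on a few strings in isolation. This is a small sketch using stringr only; the metro names in the vector are examples chosen for illustration and are not pulled from the scraped data.

```r
library(stringr)

metro <- c("New York-Newark-Jersey City", "Seattle-Tacoma-Bellevue", "Pittsburgh")

# "^[^-]*" reads: starting at the beginning of the string (^),
# match as many characters as possible that are not a hyphen ([^-]*)
city <- str_extract(metro, "^[^-]*")

print(city)

# Caveat: a city whose own name contains a hyphen gets cut short
hyphenated <- str_extract("Winston-Salem", "^[^-]*")
print(hyphenated)
```

A metro name with no hyphen passes through unchanged, while a hyphenated city name would be truncated, so it is worth eyeballing the new city column before geocoding.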

Now that we have proper city names, we will geocode our data. We will use the geocode() function to retrieve the latitude and longitude for each city. Then we will merge the geo-data back onto the dataframe using cbind().
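The sketch below shows the shape of this step on a toy data frame. It assumes that geocode() is the function from the ggmap package, which returns one row per input with 'lon' and 'lat' columns. Because a live geocoding call needs network access (and, in current versions of ggmap, a registered API key), the call is left as a comment and approximate coordinates are hard-coded so that the example runs offline.

```r
# Toy stand-in for the scraped data (two example cities)
df.toy <- data.frame(city = c("Boston", "Denver"), stringsAsFactors = FALSE)

# In the real workflow, the coordinates would come from something like:
#   data.geo <- ggmap::geocode(df.toy$city)
# Here we hard-code approximate values so the sketch runs without network access.
data.geo <- data.frame(
  lon = c(-71.06, -104.99),
  lat = c(42.36, 39.74)
)

# cbind() attaches the coordinates column-wise; this relies on
# data.geo having exactly one row per city, in the same order
df.toy <- cbind(df.toy, data.geo)

print(df.toy)
```

Since cbind() matches rows purely by position, it is only safe when the geocoded result has not been filtered or re-sorted in between.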

Next, we'll quickly use the dplyr::rename() function to rename the ‘lon’ variable to ‘long’.

#==============================================================
# RENAME VARIABLE: lon -> long
# - we'll rename lon to long, just because 'long' is consistent
# with the name for longitude in other data sources
# that we will use
#==============================================================
df.amz_cities <- rename(df.amz_cities, long = lon)
df.amz_cities %>% names()

Next, to make our data a little easier to read, we will re-order the variables. We’ll organize it so that the city, state, and metro are first, followed by the geo-data, and then the analysis variables at the end (population and ‘college degree percent’).
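One way to do this re-ordering is with dplyr::select(), which returns the columns in the order they are listed. The following is a minimal sketch on a toy data frame: the column name 'state', the exact column order, and all of the values are assumptions for illustration, not the scraped figures.

```r
library(dplyr)

# Toy data frame with the column names in scrambled order
# (placeholder values, not the scraped figures)
df.toy <- tibble(
  population_tot       = c(1500000, 2250000),
  lat                  = c(42.36, 39.74),
  metro_area           = c("Boston-Cambridge-Newton", "Denver-Aurora-Lakewood"),
  bachelors_degree_pct = c(31.5, 42.0),
  state                = c("Massachusetts", "Colorado"),
  long                 = c(-71.06, -104.99),
  city                 = c("Boston", "Denver")
)

# select() keeps the columns in the order given:
# identifiers first, then geo-data, then the analysis variables
df.toy <- df.toy %>%
  select(city, state, metro_area, long, lat, population_tot, bachelors_degree_pct)

print(names(df.toy))
```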

#================================================
# GET USA MAP
# - this is the map of the USA states, upon which
# we will plot our city data points
#================================================
map.states <- map_data("state")

Finally, we're ready to plot.

We'll initially do a “first iteration” to check that everything looks good.
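A first iteration along these lines might look like the sketch below. It is a rough illustration rather than the exact code behind the chart: it uses a two-row toy data frame with placeholder values, and it assumes that the ggplot2 and maps packages are installed (map_data() depends on maps).

```r
library(ggplot2)

# State outlines; map_data() needs the 'maps' package to be installed
map.states <- map_data("state")

# Toy city data (placeholder values, not the scraped figures)
df.toy <- data.frame(
  city                 = c("Boston", "Denver"),
  long                 = c(-71.06, -104.99),
  lat                  = c(42.36, 39.74),
  population_tot       = c(1500000, 2250000),
  bachelors_degree_pct = c(31.5, 42.0)
)

# Layer 1: state polygons as the base map
# Layer 2: one point per city, sized by population and colored by degree percent
p <- ggplot() +
  geom_polygon(data = map.states, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.toy,
             aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct))

print(p)
```

The group aesthetic matters here: without it, geom_polygon() would connect the outlines of separate states into one tangled shape.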

At a high level, everything looks OK: the points are in the right locations, and nothing stands out as wrong at a glance.

Keep in mind that, compared to the finalized version below, the ‘first iteration’ is much simpler to build. This is a great example of the 80/20 rule in data analysis: in this visualization, you can get 80% of the way with only 20% of the total ggplot() code.

Now that we have an initial version, we’ll polish it by adding titles, formatting theme elements, and by adjusting the legends.
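As a rough sketch of what that polishing can involve, the example below adds a title, relabels the legends, and strips the axes and grid from the same kind of map. The colors, sizes, legend labels, and toy data are all assumptions chosen for illustration, so treat it as a starting point rather than the finished chart.

```r
library(ggplot2)

map.states <- map_data("state")  # requires the 'maps' package

# Toy city data (placeholder values, not the scraped figures)
df.toy <- data.frame(
  city                 = c("Boston", "Denver"),
  long                 = c(-71.06, -104.99),
  lat                  = c(42.36, 39.74),
  population_tot       = c(1500000, 2250000),
  bachelors_degree_pct = c(31.5, 42.0)
)

p.final <- ggplot() +
  geom_polygon(data = map.states, aes(x = long, y = lat, group = group),
               fill = "grey85", color = "white") +
  geom_point(data = df.toy,
             aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct),
             alpha = 0.8) +
  # Legends: readable names, comma-formatted population labels
  scale_size_continuous(name = "Metro population", range = c(2, 10),
                        labels = scales::comma) +
  scale_color_gradient(name = "Bachelor's degree (%)",
                       low = "#56B1F7", high = "#132B43") +
  # Titles
  labs(title = "Possible cities for Amazon's new headquarters",
       subtitle = "Toy data for illustration") +
  # Theme: remove axes and grid, which add little on a map
  theme(
    panel.background = element_blank(),
    panel.grid       = element_blank(),
    axis.text        = element_blank(),
    axis.ticks       = element_blank(),
    axis.title       = element_blank(),
    plot.title       = element_text(size = 16, face = "bold")
  )

print(p.final)
```

Most of the extra lines are theme() and scale_*() calls, which is where the remaining 80% of the code tends to go.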

A quick note: this is not supposed to be a comprehensive analysis

I want to point out that this is not intended to be comprehensive or conclusive in any way. Without detailed selection criteria, it will be difficult to come to any solid conclusions.

Rather, this is intended to give you a hint of what’s possible using R tools. If you were so inclined, you could certainly extend this into a much more comprehensive analysis by gathering more data and producing more charts.

Creating great visualizations gets easier once you master your toolkit

As you progress as a data scientist, you will get better at creating visualizations like this.

If you practice and master individual R functions, you will be able to create visualizations like this quickly.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.