Quickly play with Polity IV and OECD data (and see the danger of US democracy)

The Polity IV Project released new data yesterday, with democratization scores for 169 countries up to 2016. I wanted to check if the ongoing erosion of US democratic institutions since the 2016 elections registered in the US’s Polity score, and, lo and behold, it did! We dropped from our solid, historically consistent 10 to an 8.

But is that bad? How does that compare to other advanced democracies, like countries in the OECD?

What follows below shows how relatively easy it is to quickly and reproducibly grab the new data, graph it, and compare scores across countries. (This notebook is also in a GitHub repository.)

First, we have to download the new Polity data. We could navigate to the Polity IV data page and download the data manually, but that’s not scriptable. Instead, we can use GET() from the httr package (or httr::GET() for short) to download the file directly rather than hunting it down with a browser. After saving the Excel file to a temporary file, we use readxl::read_excel() to load and parse the data. The chain of functions following read_excel() (chained together with dplyr’s %>% pipes) selects and renames the Correlates of War country code, year, and polity score; ensures that those columns are integers (rather than text or decimal-based numbers); and finally filters the data to 2001 and beyond.

The Polity project does not include information about OECD membership, so we have to download that on our own. It’s relatively easy to google “OECD member countries” and copy/paste the results from any of the resulting webpages into a CSV file, but that’s not reproducible or scriptable.

Instead, we can use the rvest package to scrape data directly from the OECD’s most recent list of members. There are several ways to target HTML elements on a page, including both CSS selectors and XPath queries, each with advantages and disadvantages. Using CSS is fairly simple if you’re familiar with the structure of the HTML—if you want to select a table inside an <article> element in the main <section id="content"> section, you can specify #content > article > table.

The syntax for XPath is easier to use, but is really hard to write by hand. Fortunately, you don’t have to do it manually. In Chrome, go to the OECD’s list, right click on the webpage, and select “Inspect” to open Chrome’s built in web inspector. Activate the inspection tool and find the table with the list of countries:

Right click on <table align="center" border="0" style="width: 75%;"> in the inspector and choose “Copy” > “Copy XPath” to copy an XPath query for that table to the clipboard:

We can then use that query (which should be something like //*[@id="webEditContent"]/table[2]) in a dplyr chain that will load the webpage, select the HTML<table> element, and parse it into something R can read.

The dplyr chain has a few interesting quirks. read_html() will download the given URL as HTML, and html_nodes() will select the HTML element specified by the XPath query. html_table() will parse that element into an R object (here we specify fill = TRUE since not all the rows in the table have the same number of columns; filling the table adds additional empty columns to rows that lack them). bind_rows() and as_data_frame() convert the parsed HTML table into a data frame.

# Pro tip: the HTML structure of the OECD page can change over time, and the# XPath query will inevitably break. To avoid this, use a snapshot of the# webpage from the Internet Archive instead of the current OECD page, since the# archived version won't change.
oecd.url <-"https://web.archive.org/web/20170821160714/https://www.oecd.org/about/membersandpartners/list-oecd-member-countries.htm"
oecd.countries.raw <- read_html(oecd.url)%>%# Use single quotes instead of double quotes since XPath uses " in the query
html_nodes(xpath ='//*[@id="webEditContent"]/table[2]')%>%
html_table(fill =TRUE)%>%
bind_rows()%>% as_data_frame()

Because this table has extra empty columns, R doesn’t recognize header rows automatically and names each column X1, X2, and so on.

Before we start plotting, we have to manipulate and filter the data a little bit more. First, we’ll create a data frame of just US Polity scores, selecting only countries with a COW code of 2 (which is the US)

We then want to calculate the average Polity score for all OECD countries over time, excluding the US. We select all rows with a COW code in oecd.countries$cowcode using %in% in the filter query and then exclude the US with cowcode != 2. We then calculate the yearly mean with group_by() and summarise():

In the final plot, we want to include shaded regions showing Polity’s general classifications of regime type, where autocracies range from −10 to −6, anocracies range from −5 to 5, and democracies range from 6 to 10. We can use dplyr::tribble() to quickly create a small data frame with these ranges. The mutate() command at the end uses forcats::fct_inorder to change the democracy column into an ordered factor so the ranges are plotted in the correct order. Finally, to prevent gaps, I add/subtract 0.5 from the start and end values (i.e. since democracies end at 6 and anocracies start at 5, there would be an empty gap in the plot between 5 and 6).

Now that we have all the cleaned up data frames, we can finally put it all together in one final plot. Check the heavily commented code below for an explanation of each layer

# We start with an empty ggplot object since we're using so many separate data frames
ggplot()+# First we include the shaded ranges for democracies, anocracies, and# autocracies. Since we want the regions to go from edge to edge# horizontally, we set xmin and xmax to ±Inf. ymin and ymax use the ranges we# created in the polity.breaks data frame. We use democracy as the fill# variable. The alpha value makes the layer 80% transparent.
geom_rect(data = polity.breaks, aes(xmin =-Inf, xmax =+Inf,
ymin = end, ymax = start,
fill = democracy),
alpha =0.2)+# We then add the line for the US polity score
geom_line(data = us.polity, aes(x = year, y = polity), size =2, color ="#00A1B0")+# And then the line for the average OECD polity score
geom_line(data = oecd.polity, aes(x = year, y = polity), size =2, color ="#EB6642")+# We add a label for the US line above the line at y = 11, using the same color
geom_text(aes(x =2000, y =11, label ="United States"),
hjust =0, family ="Open Sans Condensed Bold", color ="#00A1B0")+# We add a similar label for the OECD line at y = 8.5, again with the same color
geom_text(aes(x =2000, y =8.5, label ="OECD average"),
hjust =0, family ="Open Sans Condensed Bold", color ="#EB6642")+# All of the data happens in the democracy region, but we want to show the# full range of Polity values, so we force the y axis to go from -10 to 11
coord_cartesian(ylim =c(11,-10))+# Remove the "democracy" title from the legend
guides(fill = guide_legend(title =NULL))+# Add titles and labels and captions
labs(x =NULL, y ="Polity IV score",
title ="Democracy in the USA", subtitle ="I wonder what happened in 2016…",
caption ="Source: Polity IV Project")+# Use a light theme with Open Sans Light as the default font
theme_light(base_family ="Open Sans Light")+# Move the legend to the bottom and make the title font bigger, bolder, and condenseder
theme(legend.position ="bottom",
plot.title = element_text(family ="Open Sans Condensed Bold",
size = rel(1.6)))

Once we have this basic plot, we can extend it with more data. For instance, we can compare the US’s Polity score not only to the OECD average, but to specific countries: