Talk like a pirate day 2017

International Talk Like A Pirate Day

Q: What has 8 legs, 8 arms and 8 eyes?
A: 8 pirates.

Avast, ye scurvy scum! Today be September 19, the International Talk Like A Pirate Day. While it’s a silly “holiday”, it’s a great chance to switch your Facebook to Pirate English (“Arr, this be pleasing to me eye!”). Also, it’s as good a reason as any to go sailing the Internet Main, scouring for booty (data)!

Well, any excuse, y’know…

What is known today (more or less) as “Pirate English” (the bit with all the “arrr!”, “ahoy, ye matey!”, etc.) is in fact the accent employed by Robert Newton, the actor who played Long John Silver in the 1950 film adaptation of Treasure Island. He speaks with a Bristol accent, and Bristol was in fact the main trading port with the West Indies. The accent we associate with pirates today could therefore well have been historical!

More useless information on pirates (for instance, there is no historical evidence of a pirate ever owning a parrot as a pet; this stereotype, again, evolved from the Treasure Island movie) can be found in that ever-wonderful treasure trove of superfluous information, the QI page on pirates.

Data

First, let’s get some data on pirates. I have some hazy memories of seeing data on historical shipping, and also on piracy, somewhere a while ago, but I cannot find it any more. I must be getting older. So, I’ll do what any self-respecting young whippersnapper would do and grab data from Wikipedia. There is a list of famous pirates; it’s biased, of course. I see relatively few entries on Barbary pirates, for instance. No matter, the list covers most of the Golden Age of Piracy in the Caribbean, and I’ll use it to generate a dataset on these pirates.

We’ll grab the wiki page, linking to the static ID of the page for reproducibility.

We then use purrr::map_dfr to grab all tables in the HTML, push them into a single dataframe, and do some basic cleaning. Man, I hate Windows; utf-8 and Windows still in 2017 is a major pain in the stern.

## grab all tables on the page; tables 3-5 are the ones of main interest
# 3: Rise of the English Sea Dogs and Dutch Corsairs: 1560-1650
# 4: Age of the Buccaneers: 1650-1690
# 5: Golden Age of Piracy: 1690-1730
library(rvest)  # html_nodes(), html_table()
library(dplyr)  # %>%, mutate(), if_else(), select()
library(readr)  # parse_character()

# pirates.wiki is the parsed wiki page from the previous step
pirates.raw <- pirates.wiki %>%
  # grab all data tables; they all have the class "wikitable"
  html_nodes(".wikitable") %>%
  # extract all tables and put them into one dataframe
  purrr::map_dfr(html_table, fill = TRUE) %>%
  # clean column names
  setNames(make.names(names(.))) %>%
  mutate(
    # stupid wiki has capital and non-capital letters in headers...
    Years.active = if_else(is.na(Years.active), Years.Active, Years.active),
    # empty table cells should be NA
    Years.active = parse_character(Years.active),
    Life = parse_character(Life),
    # god, I hate windows... R on Win will choke on ‒ and – characters, so replace them with ASCII
    # http://unicode-search.net/unicode-namesearch.pl?term=dash
    # U+2012 to U+2015 are dashes
    Life = stringr::str_replace_all(Life, "(\u2012|\u2013)", "-"),
    Years.active = stringr::str_replace_all(Years.active, "(\u2012|\u2013)", "-")
  ) %>%
  # keep columns we want and re-order
  select(Name, Life, Country.of.origin, Years.active, Comments)

With the raw data pulled, we now need to clean it a bit. Our goal is to generate a data set of the “pirating times” of each individual pirate. Wiki lists the active times, the “flourished” times, and/or the life times. We’ll split each of these date columns (usually yyyy-yyyy) into two, providing some padding if only one of the dates is available. After this, we’ll convert the columns into integer columns, ignoring the rows which did not provide accurate enough year data (e.g. “17th century”). This is a fun project, not a scientific one.

## generate our pirating times dataset
pirates.dta <- pirates.raw %>%
  mutate(
    # born and died forced into the "yyyy-yyyy" schedule for later tidyr::separate()
    Life.parsed = gsub("(b.|born) *([[:digit:]]+)", "\\2-", x = Life),
    Life.parsed = gsub("(d.|died) *([[:digit:]]+)", "-\\2", x = Life.parsed),
    Years.active = gsub("(to) *([[:digit:]]+)", "-\\2", x = Years.active),
    # move the "flourished" dates to its own column
    flourished = if_else(grepl("fl.", Life.parsed), gsub("fl. *", "", Life.parsed), as.character(NA)),
    # remove "flourished" dates from life
    Life.parsed = if_else(grepl("fl.", Life.parsed), as.character(NA), Life.parsed)
  ) %>%
  # split the life, active, and flourished years into 2 columns (start/stop), each
  separate(Life.parsed, c("born", "died"), sep = "-", extra = "merge", fill = "right") %>%
  separate(Years.active, c("active.start", "active.stop"), sep = "-", extra = "merge", fill = "right", remove = FALSE) %>%
  separate(flourished, c("fl.start", "fl.stop"), sep = "-", extra = "merge", fill = "right", remove = FALSE) %>%
  mutate_at(vars(born, died, active.start, active.stop, fl.start, fl.stop), f.keep.last.four.digits) %>%
  # as.integer
  mutate_at(vars(born, died, active.start, active.stop, fl.start, fl.stop), parse_integer) %>%
  # get the best guess of piracing time: active > flourished > estimated from life
  mutate(
    # take active if available, if not take flourished
    piracing.start = if_else(is.na(active.start), fl.start, active.start),
    piracing.stop = if_else(is.na(active.stop), fl.stop, active.stop),
    # neither active, nor flourished date, estimate from life
    piracing.start = if_else(is.na(piracing.start) & is.na(piracing.stop), f.activity.from.life(b = born, d = died), piracing.start),
    piracing.stop = if_else(is.na(piracing.start) & is.na(piracing.stop), f.activity.from.life(b = born, d = died, return.b = FALSE), piracing.stop),
    # finally, let's get single-year start/stops cleaned up
    piracing.start = if_else(is.na(piracing.start), piracing.stop, piracing.start),
    piracing.stop = if_else(is.na(piracing.stop), piracing.start, piracing.stop)
  ) %>%
  # reorder
  select(Name, Life, piracing.start, piracing.stop, Country.of.origin, everything())
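
To make the string handling concrete, here is a minimal base-R sketch on a few made-up Life strings (these are assumptions for illustration, not rows from the Wikipedia table). It escapes the dots in the patterns, which the code above does not, and keep.last.four is only a hypothetical stand-in for the f.keep.last.four.digits helper, whose definition is not shown here.

```r
# Made-up examples of what the Life column might contain (assumption)
life <- c("b. 1650", "d. 1718", "1680-1718", "born 1635")

# Force "born"/"died" notes into the "yyyy-yyyy" pattern
parsed <- gsub("(b\\.|born) *([[:digit:]]+)", "\\2-", life)
parsed <- gsub("(d\\.|died) *([[:digit:]]+)", "-\\2", parsed)

# Split at the first dash into a start part and a stop part
born <- sub("-.*$", "", parsed)
died <- sub("^[^-]*-", "", parsed)

# Hypothetical stand-in for f.keep.last.four.digits
keep.last.four <- function(x) sub(".*([[:digit:]]{4}).*", "\\1", x)

# Empty strings become NA when converted to integer
lifespans <- data.frame(
  Life = life,
  born = as.integer(keep.last.four(born)),
  died = as.integer(keep.last.four(died)),
  stringsAsFactors = FALSE
)
print(lifespans)
```

A missing half of the range simply ends up as NA, which is what the later if_else() fallbacks rely on.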

With the dataset now sufficiently cleaned up, we will reduce the dates to decades only, so 1643 becomes 1640, and graph our data by decade. We’ll use some rowwise() %>% do() magic to generate one row per pirate per activity decade (so two rows if he was active from 1648 to 1651). Adding up the pirates active in each decade gives us our visualisation of pirate activity; note that some pirates were active for far longer than others!

## building the pirate activity, with each pirate having one row per decade in which he was active
pirate.activity <- pirates.dta %>%
  mutate(
    piracing.start = piracing.start %/% 10L * 10L,
    piracing.stop = piracing.stop %/% 10L * 10L,
    arrr = piracing.stop - piracing.start,
    country = if_else(Country.of.origin %in% top.n.countries(10), Country.of.origin, "other")
  ) %>%
  filter(piracing.start > 1000) %>%
  rowwise() %>%
  do(data.frame(
    Name = .$Name,
    Life = .$Life,
    piracing.start = .$piracing.start,
    piracing.stop = .$piracing.stop,
    # need one row per decade active
    piracing.years = seq(.$piracing.start, .$piracing.stop, by = 10L),
    arrr = .$arrr,
    Country.of.origin = .$Country.of.origin,
    country = .$country,
    stringsAsFactors = FALSE
  ))
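
The decade flooring and the one-row-per-decade expansion can be tried out on their own. The following is a small base-R sketch on two made-up pirates (the names and years are assumptions); it uses lapply() and rbind() instead of rowwise() %>% do() just to stay free of dependencies.

```r
# Two made-up pirates with start/stop years (assumption, not Wikipedia data)
toy <- data.frame(
  Name = c("Pirate A", "Pirate B"),
  piracing.start = c(1648L, 1715L),
  piracing.stop = c(1651L, 1718L),
  stringsAsFactors = FALSE
)

# Floor each year to its decade: 1648 -> 1640, 1651 -> 1650
toy$piracing.start <- toy$piracing.start %/% 10L * 10L
toy$piracing.stop <- toy$piracing.stop %/% 10L * 10L

# One row per pirate per decade in which the pirate was active
rows <- lapply(seq_len(nrow(toy)), function(i) {
  data.frame(
    Name = toy$Name[i],
    piracing.years = seq(toy$piracing.start[i], toy$piracing.stop[i], by = 10L),
    stringsAsFactors = FALSE
  )
})
toy.activity <- do.call(rbind, rows)
print(toy.activity)
```

Pirate A, active from 1648 to 1651, ends up with two rows (1640 and 1650), while Pirate B, active within a single decade, gets one.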

With the data now set up, we can set out to graph the pirate activity, and then also to build a random pirate generator!

England is over-represented by quite a large margin, especially if you group Wales/Scotland/Ireland, and potentially Colonial America. The Netherlands and France come in second and third place, respectively. No big surprises here.

We’ll pull the data on pirate activity to build a graph of the overall pirate activity over the decades. This is a rather straightforward bar chart:
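
As a rough sketch of the counting behind such a chart, the snippet below tallies active pirates per decade on made-up long-format data (one row per pirate per active decade; the rows are assumptions). It uses base R's table() and barplot() to stay dependency-free, which is not necessarily how the chart in this post was drawn.

```r
# Made-up activity data: one row per pirate per active decade (assumption)
toy.activity <- data.frame(
  Name = c("Pirate A", "Pirate A", "Pirate B", "Pirate C", "Pirate C"),
  piracing.years = c(1640L, 1650L, 1710L, 1700L, 1710L),
  stringsAsFactors = FALSE
)

# Count the pirates active in each decade
pirates.per.decade <- table(toy.activity$piracing.years)
print(pirates.per.decade)

# Simple bar chart of the counts
barplot(pirates.per.decade, xlab = "Decade", ylab = "Active pirates")
```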

One can nicely see the increase in pirate activity during the Golden Age (roughly 1560-1700). It drops briefly around 1700, building up again shortly thereafter. The middle of the 18th century is more or less free from pirates (at least those infamous enough to end up on Wikipedia), until piracy starts again around 1800 with the English-French wars. The Somali pirates of the 2000s are the last on the graph!

This leads us to the next question: Somalia had its heyday of piracy in the 2000s, but we cannot reasonably expect the same of the 1600s. When was each country’s individual golden age of pirating?

To answer this, we’ll build a ridgeplot (formerly also known as a joyplot, from the Joy Division band t-shirt; the name is now discouraged for obvious reasons), showcasing the density plots of pirate activity per country.
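
The idea behind a ridgeplot can be sketched in plain base R: estimate one density per group, then draw each on its own baseline. The data below are simulated for two made-up countries (an assumption), and this is only an illustration of the principle, not the plotting code used for the figure in this post.

```r
set.seed(1919)

# Simulated activity years for two made-up countries (assumption)
toy <- data.frame(
  country = rep(c("Country A", "Country B"), each = 40),
  year = c(rnorm(40, mean = 1680, sd = 25), rnorm(40, mean = 1820, sd = 15)),
  stringsAsFactors = FALSE
)

# One kernel density estimate per country, on a shared range of years
year.range <- range(toy$year)
dens <- lapply(split(toy$year, toy$country), density,
               from = year.range[1], to = year.range[2])

# Draw each density on its own baseline, scaled to unit height
plot(NULL, xlim = year.range, ylim = c(0, length(dens) + 0.5),
     xlab = "Year", ylab = "", yaxt = "n")
for (i in seq_along(dens)) {
  lines(dens[[i]]$x, i - 1 + dens[[i]]$y / max(dens[[i]]$y))
}
axis(2, at = seq_along(dens) - 1, labels = names(dens), las = 1)
```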

The “usual suspects” add up nicely: England, the Netherlands, France. It’s interesting to see Germany show up so early (around 1350-1400). These are mostly the pirates of the Hansa. Venezuela also had a “late blooming” pirate life, mostly in the 19th century. Interesting, I had not known about that!

A Pirate’s Life For Me!

With the historical data on pirates and their activity times, we can also try to generate a random pirate story! For this, we will restrict the data to the golden age of piracy, dropping all pirates who started their pirate life before 1500 or after 1750. We will want to create a random pirate with a statistically “correct” starting year of piracy, a statistically “correct” piracy tenure, and varnish him (or her) with a random pirate name!

Unfortunately, the distribution of starting years is heavily left-skewed, and after some hours of looking into distributions, I couldn’t find a good fit. We’ll have to resort to pulling from the empirical distribution (i.e., the data itself) to generate a random start year for a pirate:

f.gen.pirating.startyear <- function() {
  # generate a startyear by pulling from the empirical distribution (pirates.fit$piracing.start)
  base::sample(pirates.fit$piracing.start, 1)
}

Golden Age Pirate: piracing tenure

We have better luck with the tenure of a pirate: the distribution follows a nice count-data distribution, I’m guessing a negative binomial.

pirates.fit %>%
  ggplot() +
  geom_density(aes(x = piracing.start))

As it’s discrete data, we’ll use the negative binomial distribution to estimate the theoretical distribution of a pirate’s tenure using library(fitdistrplus).

The fit is good. We can clearly see the oversampling of “round” years (10, in particular), which comes from the fact that with uncertain dates, Wikipedia noted a pirate’s activity in decades, rather than actual years. But still, our distribution of rnbinom(n, size = 0.3876683, mu = 6.395578) works nicely.
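
To get a feel for these parameters, here is a minimal sketch that simulates tenures from the quoted distribution. The refit at the end is optional and only runs if fitdistrplus happens to be installed; it shows one way such a fit can be obtained on simulated data and is not a reproduction of the original fit on the pirate data.

```r
set.seed(2017)

# Parameters quoted in the text
nb.size <- 0.3876683
nb.mu <- 6.395578

# Simulate tenures (in years) for 1000 imaginary pirates
tenure <- rnbinom(1000, size = nb.size, mu = nb.mu)
print(summary(tenure))

# Optional: re-estimate size and mu from the simulated data
if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  fit <- fitdistrplus::fitdist(tenure, "nbinom")
  print(fit$estimate)
}
```

With a size this small the distribution is heavily right-skewed: many pirates get a tenure of zero or one year, while a few last for decades.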

Golden Age Pirate: silly name generator

All that is missing now is a random pirate name generator. Fantasy Name Generators has a nice collection of name generators, and a pirate one is included! They allow the use of the names “to name anything in any of your own works”, so let’s use them here!

We source the names into R and create a simple function pulling a random name, allowing you to specify male or female.

Golden Age Pirate: a pirate’s story!

With all the pieces set up, we can now generate a random pirate:

f.pirate.story <- function(female = FALSE) {
  # build the pirate story by joining name, startyear, and tenure time
  yy <- f.gen.pirating.startyear()
  rr <- paste0("Your name is ", f.gen.pirate.name(female), " and you roamed the Caribbean from ", yy, " to ", yy + f.gen.pirating.tenure(), ". Arrr! \u2620")
  return(rr)
}
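
Since the definitions of pirates.fit, f.gen.pirate.name and f.gen.pirating.tenure are not shown here, the sketch below fills them in with stand-ins so that the story generator can be run end to end. The start years and the placeholder names are made up (assumptions, not the Wikipedia data or names from the name generator site); the tenure draw uses the negative binomial parameters quoted earlier.

```r
set.seed(919)

# Stand-ins for objects whose definitions are not shown (all assumptions)
pirates.fit <- data.frame(piracing.start = c(1650L, 1680L, 1690L, 1715L, 1718L))
toy.names <- list(
  male = c("One-Eyed Placeholder", "Salty Example"),
  female = c("Stormy Placeholder", "Scarlet Example")
)

# Start year: pull from the empirical distribution
f.gen.pirating.startyear <- function() {
  base::sample(pirates.fit$piracing.start, 1)
}

# Tenure: one draw from the negative binomial quoted in the text
f.gen.pirating.tenure <- function() {
  rnbinom(1, size = 0.3876683, mu = 6.395578)
}

# Name: pick one placeholder name for the requested sex
f.gen.pirate.name <- function(female = FALSE) {
  base::sample(if (female) toy.names$female else toy.names$male, 1)
}

# Join name, start year, and tenure into a story
f.pirate.story <- function(female = FALSE) {
  yy <- f.gen.pirating.startyear()
  paste0("Your name is ", f.gen.pirate.name(female),
         " and you roamed the Caribbean from ", yy,
         " to ", yy + f.gen.pirating.tenure(), ". Arrr! \u2620")
}

story <- f.pirate.story(female = TRUE)
cat(story, "\n")
```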