Scrape country data from Wikipedia

Someone recently asked me if there’s a way to translate a 2-letter country code (e.g. US) to a country name (e.g. United States), and similarly, to translate a 3-letter country code (e.g. CAN) to a country name (e.g. Canada).

Wikipedia has a page that lists different data properties for each country. This data includes country codes, mobile country codes, country top-level domains, etc.

In this tutorial we will scrape Wikipedia for the information about each country, and then translate between the different possible country names. We will perform the following steps:

The countries are listed on several different Wikipedia pages, so we will first automatically collect the urls for all of these pages

We will iterate over all of these urls and extract the listed details for each country

Save the results in a file

Load the data from the file and create an object that translates a country code to a full country name

Requirements
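
The requirements list didn’t survive in this copy; presumably the tutorial relies on the requests and beautifulsoup4 packages (pip install requests beautifulsoup4). The imports and the base url constant used by the code below would look something like this (the constant name BASE_URL matches its use in the scraping function later on):

```python
# Third-party dependencies: pip install requests beautifulsoup4
import requests
import bs4

# Base url for resolving the relative "href" values we scrape
BASE_URL = "https://en.wikipedia.org"
```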

Step 1: Get urls for all countries

First, we will extract all the urls for the countries’ data.

In order to do that we will fetch the Wikipedia page using the requests module.
After that we will use BeautifulSoup to create a soup object from the content of the page. The soup object will help us to easily retrieve the data that we want from the HTML.

We can see that for each letter we have an <a> tag whose “href” follows the template "/wiki/Country_codes:_<LETTER / LETTER RANGE>". We want the “href” values of these <a> tags; they will lead us to the country data.

In the next one liner we perform the following actions:

soup.findAll('a') - extracts all the <a> tags from the page.

a_elem.attrs.get('href', '').startswith('/wiki/Country_codes') - for each a_elem, we check that it has an “href” attribute whose value starts with "/wiki/Country_codes".

Save the “href” of each a_elem that fulfills this condition in countries_urls.
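
The one-liner itself is missing from this copy; a sketch of what it likely looks like, demonstrated here on a small inline HTML sample rather than the live Wikipedia page:

```python
import bs4

# Small HTML sample standing in for the fetched Wikipedia page
html = """
<a href="/wiki/Country_codes:_A">A</a>
<a href="/wiki/Country_codes:_B">B</a>
<a href="/wiki/Main_Page">Main</a>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Keep only the hrefs that lead to country-code pages
countries_urls = [
    a_elem.attrs.get('href', '')
    for a_elem in soup.findAll('a')
    if a_elem.attrs.get('href', '').startswith('/wiki/Country_codes')
]
```

Against the real page, countries_urls would hold the relative urls of all the country-code listing pages.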

Step 2: Get the details for each country

Now that we have all the urls for the country data saved in “countries_urls“, we will extract the data that we actually want from these urls.

Each url holds a list of countries and information about them.
You can see an example of how this data looks on any of these country-code pages.

In the code we’ll create a new function called “scrape_countries_details“.
This function will help us collect the data for all of the countries from each url.
We will use the function as we iterate over the urls that we fetched in step 1.
The “scrape_countries_details“ function will return a list of country data from the url.
We will save these results in the “all_countries_details“ list.

# Fetch all elements that hold country names
country_names_elems = soup.findAll('span', 'mw-headline')

3) Iterate over "country_names_elems" and for each "country_name_elem" retrieve the relevant data table

# Find the next table element after the country name, which holds
# the country's data
country_table = country_name_elem.parent.findNext("table")

4) From the country’s data table, retrieve all the cells and create a dictionary, “country_data”, from the keys (cell names) and values (cell values) in the table

# Fetch all the cells in the table
tds = country_table.findAll("td")
# Each cell holds the cell name and the value,
# so we can create a dict by reading each cell
# with the cell name as the key and the cell data as the value
country_data = {td.find("a").text: td.find("span").text for td in tds}

5) Add the country name and the Wikipedia page url for the country to the "country_data" dictionary, taken from the "country_name_elem" object

6) Add the "country_data" to the "countries_data" list that will contain all the countries and their data from the page. This list will be the output of the function.

countries_data.append(country_data)

The full code of the function:

def scrape_countries_details(url):
    # Get remote page
    response = requests.get(url)
    # Create soup object from page content
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    countries_data = []
    # Fetch all elements that hold country names
    country_names_elems = soup.findAll('span', 'mw-headline')
    # For each country element retrieve all the relevant data
    for country_name_elem in country_names_elems:
        # Find the next table element after the country name, which holds
        # the country's data
        country_table = country_name_elem.parent.findNext("table")
        if not country_table:
            continue
        # Fetch all the cells in the table
        tds = country_table.findAll("td")
        # Each cell holds the column name and the value,
        # so we can create a dict by reading each cell
        # with the column name as the key and the cell data as the value
        country_data = {td.find("a").text: td.find("span").text for td in tds}
        # Add the country name and Wikipedia page url for the country
        country_a_elem = country_name_elem.find('a')
        country_data["country_name"] = country_a_elem.text.replace("\n", "").strip()
        country_data["country_url"] = BASE_URL + country_a_elem['href']
        countries_data.append(country_data)
    return countries_data
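
The loop that feeds each url from step 1 into this function isn’t shown in this copy; it presumably looks like the sketch below (scrape_countries_details is stubbed out here so the sketch runs without network access):

```python
# Stub standing in for the real scrape_countries_details above,
# so this sketch runs without fetching anything
def scrape_countries_details(url):
    return [{"country_name": "Andorra", "country_url": url}]

BASE_URL = "https://en.wikipedia.org"        # assumed constant
countries_urls = ["/wiki/Country_codes:_A"]  # from step 1

all_countries_details = []
for country_url in countries_urls:
    # Each url yields a list of country dicts; collect them all
    all_countries_details.extend(scrape_countries_details(BASE_URL + country_url))
```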

In the end we will want to save our data in a file so we can use it later.
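
The saving step isn’t shown here either; assuming JSON as the file format (the format and file name are assumptions), it could be as simple as:

```python
import json

# Hypothetical sample of the scraped results
all_countries_details = [
    {"country_name": "United States", "ISO 3166-1 alpha-2": "US"},
]

# Dump the collected data so it can be reloaded later in step 3
with open("countries_data.json", "w", encoding="utf-8") as f:
    json.dump(all_countries_details, f, ensure_ascii=False, indent=2)
```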

Step 3: Create country code translator

Once we have a file with all the country data, we can create our country code translator.
We want to create an object that will resolve every type of country code to the country name.
For example, we want to be able to do the following translations:

“US” => “United States”

“USA” => “United States”

“ca” => “Canada”

“can” => “Canada”

In order to do that, we’ll create an object that gets a file path and loads the data from the file:
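
The loader itself is not included in this excerpt; a minimal sketch of such an object, assuming the JSON format from the previous step and that the 2- and 3-letter codes were captured under Wikipedia’s "ISO 3166-1 alpha-2" / "ISO 3166-1 alpha-3" column names (both assumptions):

```python
import json

class CountryCodeTranslator:
    # Keys under which the 2- and 3-letter codes are assumed to be stored
    CODE_KEYS = ("ISO 3166-1 alpha-2", "ISO 3166-1 alpha-3")

    def __init__(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            countries_data = json.load(f)
        # Map every known code (lowercased) to the full country name
        self._code_to_name = {}
        for country in countries_data:
            for key in self.CODE_KEYS:
                code = country.get(key)
                if code:
                    self._code_to_name[code.lower()] = country["country_name"]

    def translate(self, code):
        # Case-insensitive lookup, so "US", "us" and "Us" all match
        return self._code_to_name.get(code.lower())
```

With such an object, translate("US") and translate("usa") would both resolve to “United States”, matching the examples above.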