Someone recently asked me if there’s a way to translate a 2-letter country code (e.g. US) to a country name (e.g. United States), and similarly, if there’s a way to translate a 3-letter country code (e.g. CAN) to a country name (e.g. Canada).
Wikipedia has a page that lists different data properties for each country. This data includes country codes, mobile country codes, country top-level domains, etc.
In this tutorial we will scrape Wikipedia for the information about each country, and then translate between the different possible country names. We will perform the following steps:
The countries are listed on several different Wikipedia pages, so we will first collect the urls of all these pages automatically
We will iterate over all of these urls and extract the listed details for each country
Save the results in a file
Load the data from the file and create an object that translates a country code to a full country name
The code for this tutorial can be found on Github.
First, we will extract all the urls for the countries’ data.
In order to do that we will fetch the Wikipedia page using the requests module.
After that we will use BeautifulSoup to create a soup object from the content of the page. The soup object will help us to easily retrieve the data that we want from the HTML.
In the page source, the data that we want to extract looks like this:
We can see that for each letter we have an <a> tag whose “href” follows the template "/wiki/Country_codes:_<LETTER / LETTER RANGE>". We want the “href” values of these <a> tags, because they lead us to the country data.
In the next one liner we perform the following actions:
soup.findAll('a') - extracts all the <a> tags from the page.
a_elem.attrs.get('href', '').startswith('/wiki/Country_codes') - for each a_elem, checks that it has an “href” attribute and that its value starts with "/wiki/Country_codes".
Save the “href” of the a_elem that fulfills the required condition in countries_urls.
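The filtering described above can be sketched as follows. This is a minimal illustration, not the original listing: it runs on a small hand-written HTML snippet (SAMPLE_HTML, a hypothetical stand-in for the page that would normally be fetched with requests), wraps the one-liner in a helper named extract_countries_urls (also hypothetical), and uses find_all, the current BeautifulSoup 4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia index page; in the tutorial the HTML
# would come from the content of a requests response instead.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Country_codes:_A">A</a></li>
  <li><a href="/wiki/Country_codes:_B">B</a></li>
  <li><a href="/wiki/Main_Page">Main page</a></li>
  <li><a name="no-href">anchor without an href</a></li>
</ul>
"""


def extract_countries_urls(html):
    """Return the href of every <a> tag that points to a country codes page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        a_elem.attrs["href"]
        for a_elem in soup.find_all("a")
        # .get("href", "") keeps <a> tags without an href from raising KeyError
        if a_elem.attrs.get("href", "").startswith("/wiki/Country_codes")
    ]


countries_urls = extract_countries_urls(SAMPLE_HTML)
print(countries_urls)
```

The hrefs are relative, so they would need to be joined with the site's base address before being requested.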
After running this line, "countries_urls" should have the following values:
Step 2: Get the details for each country
Now that we have all the urls for the country data saved in “countries_urls“, we will extract the data that we actually want from these urls.
Each url holds a list of countries and information about them.
You can see examples of how this data looks in the following page or in the following image:
In the code we’ll create a new function called “scrape_countries_details“.
This function will help us collect the data for all of the countries from each url.
We will use the function as we iterate on the urls that we fetched in step 1.
The “scrape_countries_details“ function will return a list of country data from the url.
We will save these results in the “all_countries_details“ list.
The "scrape_countries_details" function will work in the following way:
1) Get the url content and convert it to a soup object
2) Fetch all elements that hold country names
3) Iterate over "country_names_elems" and for each "country_name_elem" retrieve the relevant data table
4) From the country’s data table, retrieve all the cells and create a dictionary, “country_data”, from the keys (cell names) and values (cell values) in the table
5) Add the country name and Wikipedia page url for the country to "country_data" dictionary from the "country_name_elem" object
6) Add the "country_data" to the "countries_data" list that will contain all the countries and their data from the page. This list will be the output of the function.
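The six steps above can be sketched roughly as below. The HTML layout is an assumption made for illustration: the sample page (SAMPLE_PAGE), the "mw-headline" selector, the <a>/<span> layout of each cell, and the "name" and "url" keys are all hypothetical, and the real Wikipedia markup may differ. To keep the sketch runnable offline, this variant (named scrape_countries_details_from_html here) takes the page HTML directly instead of a url, so step 1's download is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: a heading with the country name, followed by a
# table whose cells each hold a code name (<a>) and its value (<span>).
SAMPLE_PAGE = """
<h2><span class="mw-headline"><a href="/wiki/Canada">Canada</a></span></h2>
<table>
  <tr>
    <td><a href="/wiki/ISO_3166-1_alpha-2">ISO 3166-1 alpha-2</a> <span>CA</span></td>
    <td><a href="/wiki/ISO_3166-1_alpha-3">ISO 3166-1 alpha-3</a> <span>CAN</span></td>
  </tr>
</table>
"""


def scrape_countries_details_from_html(html):
    # 1) Convert the page content to a soup object
    soup = BeautifulSoup(html, "html.parser")
    countries_data = []
    # 2) Fetch all elements that hold country names (assumed selector)
    country_names_elems = soup.find_all("span", class_="mw-headline")
    for country_name_elem in country_names_elems:
        # 3) Retrieve the first table that follows the country heading
        table = country_name_elem.find_next("table")
        if table is None:
            continue
        # 4) Build a dictionary from the cell names and cell values
        country_data = {}
        for cell in table.find_all("td"):
            key_elem = cell.find("a")
            value_elem = cell.find("span")
            if key_elem is None or value_elem is None:
                continue  # skip cells that do not match the assumed layout
            key = key_elem.get_text(strip=True)
            country_data[key] = value_elem.get_text(strip=True)
        # 5) Add the country name and its Wikipedia page url
        link = country_name_elem.find("a")
        country_data["name"] = country_name_elem.get_text(strip=True)
        country_data["url"] = link.get("href", "") if link else ""
        # 6) Collect the result for this country
        countries_data.append(country_data)
    return countries_data


countries_data = scrape_countries_details_from_html(SAMPLE_PAGE)
print(countries_data)
```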
The full code of the function:
In the end we will want to save our data in a file so we can use it later.
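One simple way to do this, assuming JSON is an acceptable file format (the text does not name one), is sketched below. The sample records, the helper name save_countries_details, and the file name countries_details.json are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample of the scraped results
all_countries_details = [
    {"name": "Canada", "ISO 3166-1 alpha-2": "CA", "ISO 3166-1 alpha-3": "CAN"},
    {"name": "United States", "ISO 3166-1 alpha-2": "US", "ISO 3166-1 alpha-3": "USA"},
]


def save_countries_details(countries_details, file_path):
    """Write the list of country dictionaries to a JSON file."""
    with open(file_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII country names readable in the file
        json.dump(countries_details, f, ensure_ascii=False, indent=2)


# A temporary directory is used here only so the sketch runs anywhere
file_path = os.path.join(tempfile.gettempdir(), "countries_details.json")
save_countries_details(all_countries_details, file_path)
```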
Once we have a file with all the country data, we can create our country code translator.
We want to create an object that will resolve every type of country code to the country name.
For example, we want to be able to do the following translations:
“US” => “United States”
“USA” => “United States”
“ca” => “Canada”
“can” => “Canada”
In order to do that, we’ll create an object that gets a file path and loads the data from the file:
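A minimal sketch of such an object, assuming the data was saved as a JSON list of dictionaries (the class name CountryCodeTranslator, the countries attribute, and the demo file are hypothetical):

```python
import json
import os
import tempfile


class CountryCodeTranslator:
    """Loads the scraped country details from a JSON file (assumed format)."""

    def __init__(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            # Expected content: a list with one dictionary per country
            self.countries = json.load(f)


# Demo: write a tiny hypothetical data file, then load it through the class
sample = [{"name": "Canada", "ISO 3166-1 alpha-2": "CA", "ISO 3166-1 alpha-3": "CAN"}]
demo_path = os.path.join(tempfile.gettempdir(), "countries_demo.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

translator = CountryCodeTranslator(demo_path)
print(len(translator.countries))
```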
The next code is for creating dictionaries that will allow us to do the following:
Translate a 2-letter country code to a country name (using the ‘ISO 3166-1 alpha-2’ values from each country)
Translate a 3-letter country code to a country name (using the ‘ISO 3166-1 alpha-3’ values from each country)
Translate country name to country details.
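The three dictionaries listed above could be built along these lines. This is a sketch under assumptions: the function name build_lookups, the "name" key, and the choice to normalise codes to upper case and names to lower case are not taken from the original code.

```python
def build_lookups(countries):
    """Build code-to-name and name-to-details dictionaries from country records."""
    alpha2_to_name = {}
    alpha3_to_name = {}
    name_to_details = {}
    for country in countries:
        name = country["name"]
        alpha2 = country.get("ISO 3166-1 alpha-2")
        alpha3 = country.get("ISO 3166-1 alpha-3")
        # Codes are stored upper-cased so lookups can ignore the input's case
        if alpha2:
            alpha2_to_name[alpha2.upper()] = name
        if alpha3:
            alpha3_to_name[alpha3.upper()] = name
        # Names are stored lower-cased for the same reason
        name_to_details[name.lower()] = country
    return alpha2_to_name, alpha3_to_name, name_to_details


# Hypothetical sample records
countries = [
    {"name": "Canada", "ISO 3166-1 alpha-2": "CA", "ISO 3166-1 alpha-3": "CAN"},
    {"name": "United States", "ISO 3166-1 alpha-2": "US", "ISO 3166-1 alpha-3": "USA"},
]
alpha2_to_name, alpha3_to_name, name_to_details = build_lookups(countries)
print(alpha2_to_name["US"], alpha3_to_name["CAN"])
```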
We will also create a function for retrieving country details no matter what type of country name we choose:
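One possible shape for that function is sketched here. It is an illustration only: the name get_country_details, its parameters, and the small hard-coded dictionaries are hypothetical, and it assumes codes are keyed in upper case and country names in lower case.

```python
# Hypothetical lookup tables, keyed by upper-cased code or lower-cased name
ALPHA2_TO_NAME = {"CA": "Canada", "US": "United States"}
ALPHA3_TO_NAME = {"CAN": "Canada", "USA": "United States"}
NAME_TO_DETAILS = {
    "canada": {"name": "Canada", "ISO 3166-1 alpha-2": "CA"},
    "united states": {"name": "United States", "ISO 3166-1 alpha-2": "US"},
}


def get_country_details(query, alpha2_to_name, alpha3_to_name, name_to_details):
    """Resolve a 2-letter code, 3-letter code, or full name to country details.

    Returns None when the query matches nothing.
    """
    key = query.strip()
    # Try the code tables first; otherwise treat the query as a country name
    name = alpha2_to_name.get(key.upper()) or alpha3_to_name.get(key.upper()) or key
    return name_to_details.get(name.lower())


for query in ("US", "USA", "ca", "can", "Canada"):
    details = get_country_details(query, ALPHA2_TO_NAME, ALPHA3_TO_NAME, NAME_TO_DETAILS)
    print(query, "=>", details["name"])
```

Normalising the case on both sides is what lets “ca” and “CAN” resolve to the same record.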