Chapter 2 Data sources

As shown in the introduction, we need a wide variety of information, including economic, social welfare, education, greenhouse gas emissions and electricity production structure data. We therefore chose a comprehensive dataset that touches on all of these issues: the World Development Indicators from the World Bank. It is a collection of development indicators compiled by the World Bank from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

The dataset contains several files. WDISeries.csv describes the series that are collected, with 1443 records and 20 columns. Each record describes one series and includes its name, a short code, its topic, its source and so on. WDICountry.csv describes the countries and regions that are covered, with 265 rows and 30 columns. Each row represents a country or region and records its name, region, income group and so on. WDIData.csv contains the data for every country and every series listed above. Since it has an extra Not classified class for countries, there are in total (265+1) * 1443 = 383838 records. Each row contains the country name and code, the series name and code, and the values from 1960 to 2020, which are 66 columns altogether.
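
The record count quoted above can be checked with a short sketch. It uses only the Python standard library; the helper name csv_shape, the header names and the toy rows are assumptions made for illustration, not part of the dataset.

```python
import csv
import io


def csv_shape(handle):
    """Return (data_rows, columns) for a CSV stream, not counting the header."""
    reader = csv.reader(handle)
    header = next(reader)
    n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


# Row count expected for WDIData.csv, from the figures quoted in the text.
N_SERIES = 1443
N_COUNTRIES = 265
expected_rows = (N_COUNTRIES + 1) * N_SERIES  # +1 for the "Not classified" class

# Toy stand-in for a real file (assumed header, placeholder values, not WDI data).
toy = io.StringIO(
    "Country Name,Country Code,Indicator Name,Indicator Code,1960\n"
    "Country A,AAA,Series X,X.CODE,1.0\n"
    "Country A,AAA,Series Y,Y.CODE,2.0\n"
)
toy_shape = csv_shape(toy)
print(expected_rows, toy_shape)
```

To run the same check on the real file, one would pass a handle opened with open("WDIData.csv", newline="", encoding="utf-8") to csv_shape and compare the row count with expected_rows. The file location and encoding here are assumptions.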

When matching the country names in WDICountry.csv and WDIData.csv, we found that some names differ. There are two reasons for the difference. First, WDICountry.csv uses country names with non-ASCII characters, such as Côte d'Ivoire and Curaçao, while WDIData.csv converts them into ASCII characters. Second, the two files differ slightly in how they name country groups: for example, WDICountry.csv uses IDA & IBRD while WDIData.csv uses IDA & IBRD countries. We manually changed the names in WDICountry.csv to match those in WDIData.csv.
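
The names that need manual attention could be located with a sketch like the following. It is not the procedure used here (the names were changed by hand). It assumes that the ASCII form in WDIData.csv is obtained by simply dropping diacritics, and the helper names to_ascii and unmatched_names are hypothetical.

```python
import unicodedata


def to_ascii(name):
    """Drop diacritics by decomposing characters and removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def unmatched_names(country_names, data_names):
    """Return names from the country file with no match in the data file,
    even after diacritics are removed (assumed transliteration rule)."""
    data_set = set(data_names)
    return sorted(n for n in country_names if to_ascii(n) not in data_set)


# Names taken from the examples in the text.
country_file = ["Côte d'Ivoire", "Curaçao", "IDA & IBRD"]
data_file = ["Cote d'Ivoire", "Curacao", "IDA & IBRD countries"]

leftover = unmatched_names(country_file, data_file)
print(leftover)  # names that still need a manual fix
```

Under this assumption, the accented names are resolved automatically and only the country-group names, such as IDA & IBRD, remain to be aligned by hand. Characters that have no Unicode decomposition would pass through unchanged, so any such names would also show up in the leftover list.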