Chapter 3 Data transformation
Importing the data is straightforward using the read_csv function in R. We import WDIData.csv, WDISeries.csv, and WDICountry.csv as the data frames df_data, df_series, and df_countries, respectively. Because each country name and series name is repeated to cover every country-series combination, we store them as factors to simplify later operations.
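As a minimal sketch of this import step, the snippet below reads a small stand-in file with read_csv from the readr package and parses the name and code columns as factors. The toy file and its column names ("Country Name", "Indicator Name", and so on) are assumptions modelled on the WDI bulk-download layout, not the authors' actual code; with the real data, csv_path would simply be "WDIData.csv".

```r
library(readr)

# Toy stand-in for WDIData.csv so the sketch runs without the real file.
# The column names are assumed, not taken from the original analysis.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country Name,Country Code,Indicator Name,Indicator Code,2019,2020",
  "Aruba,ABW,Indicator A,IND.A,1.0,1.1",
  "Aruba,ABW,Indicator B,IND.B,2.0,2.1",
  "Chile,CHL,Indicator A,IND.A,3.0,3.1",
  "Chile,CHL,Indicator B,IND.B,4.0,4.1"
), csv_path)

# Parse the repeated name/code columns as factors and the year columns as numbers.
df_data <- read_csv(
  csv_path,
  col_types = cols(
    `Country Name`   = col_factor(),
    `Country Code`   = col_factor(),
    `Indicator Name` = col_factor(),
    `Indicator Code` = col_factor(),
    .default         = col_double()
  )
)

# Number of distinct countries and indicators in the toy data
nlevels(df_data$`Country Name`)
nlevels(df_data$`Indicator Name`)
```

Declaring the factor columns in col_types at import time avoids a separate conversion pass afterwards; df_series and df_countries could be read the same way.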
First, as stated in the previous section, we match the series and country names in df_series and df_countries to those in df_data to avoid any potential mismatches.
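One way this name alignment could be done is to look up each entry by its code and overwrite the name in the lookup table with the spelling used in df_data. The sketch below uses base R on toy data; the column names country_code and country_name and the example values are hypothetical, and the original analysis may have aligned the names differently.

```r
# Toy data: column names and values are illustrative assumptions.
df_data <- data.frame(
  country_code = c("ABW", "ABW", "CHL"),
  country_name = c("Aruba", "Aruba", "Chile"),
  stringsAsFactors = FALSE
)
df_countries <- data.frame(
  country_code = c("CHL", "ABW"),
  country_name = c("Republic of Chile", "Aruba"),
  stringsAsFactors = FALSE
)

# For each row of df_countries, find the first row of df_data with the same code.
idx <- match(df_countries$country_code, df_data$country_code)
matched <- !is.na(idx)

# Overwrite the lookup-table name with the spelling used in df_data.
df_countries$country_name[matched] <- df_data$country_name[idx[matched]]

# Names still present in df_countries but unknown to df_data (ideally none)
unmatched_names <- setdiff(df_countries$country_name, df_data$country_name)
unmatched_names
```

The same pattern would apply to series names, keyed on the series code instead of the country code.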
After that, because the data set contains far more series than we need, we focus on the series whose Topic appears in the following list and is related to our research:
- Economic Policy & Debt: National accounts: Adjusted savings & income
- Economic Policy & Debt: National accounts: US$ at constant 2015 prices: Aggregate indicators
- Education: Efficiency
- Education: Outcomes
- Education: Participation
- Environment: Emissions
- Environment: Energy production & use
- Health: Population: Structure
- Poverty: Income distribution
- Poverty: Poverty rates
In addition, since df_data contains both country-level and region-level data, we concentrate on the country-level data first and extract the country data for the selected topics as df_countryData_selected. The reduced data set covers 217 countries and 279 selected indicators, giving a total of 217 * 279 = 60543 rows, and it retains all 66 columns.
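The two filters described above (by Topic and by country versus region) can be sketched in base R as follows. This is an illustration on toy data, not the original code: the column names series_code, topic, country_code, and region are hypothetical, and identifying regional aggregates by a missing region value is an assumption about how df_countries marks them.

```r
# Two of the topics from the list above, enough for the toy example
selected_topics <- c("Education: Outcomes", "Poverty: Poverty rates")

# Toy lookup tables; all column names and values are illustrative assumptions.
df_series <- data.frame(
  series_code = c("S1", "S2", "S3"),
  topic = c("Education: Outcomes", "Health: Nutrition", "Poverty: Poverty rates"),
  stringsAsFactors = FALSE
)
df_countries <- data.frame(
  country_code = c("ABW", "CHL", "WLD"),
  region = c("Latin America & Caribbean", "Latin America & Caribbean", NA),
  stringsAsFactors = FALSE
)

# One row per (series, country) combination, mirroring the layout of df_data
df_data <- expand.grid(
  series_code = df_series$series_code,
  country_code = df_countries$country_code,
  stringsAsFactors = FALSE
)
df_data$value <- seq_len(nrow(df_data))

# Series in the selected topics, and entries assumed to be real countries
keep_series <- df_series$series_code[df_series$topic %in% selected_topics]
keep_countries <- df_countries$country_code[!is.na(df_countries$region)]

df_countryData_selected <- df_data[
  df_data$series_code %in% keep_series &
    df_data$country_code %in% keep_countries,
]

# Sanity check: rows = countries * indicators, as in 217 * 279 = 60543
stopifnot(
  nrow(df_countryData_selected) == length(keep_countries) * length(keep_series)
)
```

Because the filter only drops rows, the number of columns is unchanged, which is consistent with the reduced data set keeping all of its original columns.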