Project 3: Population density and life expectancy

by RJ, November 2nd 2015

This is the project notebook for Week 3 of The Open University's Learn to code for Data Analysis course.

Does a high population density have an impact on life expectancy? The following analysis checks whether there is any correlation between the population density of a country in 2013 and the life expectancy of people born in that country in 2013.

Getting the data

Two World Bank datasets are used. One, available at http://data.worldbank.org/indicator/EN.POP.DNST, lists the population density of the world's countries for various years. The other, available at http://data.worldbank.org/indicator/SP.DYN.LE00.IN, lists the life expectancy in the world's countries.

The datasets are downloaded directly, using the unique indicator name given in the URL.

In [12]:
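from pandas import *
# pandas.io.wb is no longer part of pandas; download() is provided
# by the separate pandas-datareader package.
from pandas_datareader.wb import download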

YEAR = 2013
POPDENS_INDICATOR = 'EN.POP.DNST'
popdens = download(indicator=POPDENS_INDICATOR, country='all', start=YEAR, end=YEAR)
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
life = download(indicator=LIFE_INDICATOR, country='all', start=YEAR, end=YEAR)

Cleaning the data

Inspecting the data with head() and tail() shows that:

  1. country names are the row indices, not column values;
  2. the first 34 rows are aggregated data, for the Arab World, the Caribbean small states, and other country groups used by the World Bank;
  3. population density and life expectancy values are missing for some countries.

The data is therefore cleaned by:

  1. transforming the dataframe index into columns and creating a new index 0, 1, 2, etc.;
  2. removing the first 34 rows;
  3. removing rows with unavailable values.
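
The three cleaning steps above can be sketched on a small hand-made table. This is only an illustration: the table `toy`, its values and the constant `N_AGGREGATES` are made up for the example, and just one aggregate row is removed instead of 34.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a World Bank download: a (country, year) index,
# one aggregate row first, and one missing value. All values are illustrative.
index = pd.MultiIndex.from_tuples(
    [('Arab World', '2013'), ('Canada', '2013'),
     ('Chad', '2013'), ('Chile', '2013')],
    names=['country', 'year'])
toy = pd.DataFrame({'EN.POP.DNST': [27.7, 3.9, np.nan, 23.7]}, index=index)

N_AGGREGATES = 1  # 34 in the real data; 1 in this toy table

cleaned = toy.reset_index()        # step 1: index levels become columns
cleaned = cleaned[N_AGGREGATES:]   # step 2: drop the leading aggregate rows
cleaned = cleaned.dropna()         # step 3: drop rows with missing values
print(cleaned)
```

Only the rows for Canada and Chile remain: the aggregate row is removed by position and the row for Chad is removed because its value is missing.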
In [13]:
popdens.head(3)
Out[13]:
EN.POP.DNST
country year
Arab World 2013 27.684115
Caribbean small states 2013 17.230626
Central Europe and the Baltics 2013 93.943347
In [15]:
popdens = popdens.reset_index()[34:].dropna()
life = life.reset_index()[34:].dropna()
popdens.head(3)
Out[15]:
index country year EN.POP.DNST
34 68 Canada 2013 3.866307
35 69 Cayman Islands 2013 243.204167
36 70 Central African Republic 2013 7.561524

The unnecessary columns can be dropped.

In [16]:
COUNTRY = 'country'
POPDENS = 'Population Desity (people per sq. km of land area)'
popdens[POPDENS] = popdens[POPDENS_INDICATOR].apply(round)
headings = [COUNTRY, POPDENS]
popdens = popdens[headings]
popdens.head()
Out[16]:
country Population Desity (people per sq. km of land area)
34 Canada 4
35 Cayman Islands 243
36 Central African Republic 8
37 Chad 10
38 Channel Islands 853

The World Bank reports the population density and life expectancy with several decimal places. After rounding, the original column is discarded.

In [17]:
LIFE = 'Life expectancy (years)'
life[LIFE] = life[LIFE_INDICATOR].apply(round)
headings = [COUNTRY, LIFE]
life = life[headings]
life.head()
Out[17]:
country Life expectancy (years)
34 Afghanistan 61
35 Albania 78
36 Algeria 71
39 Angola 52
40 Antigua and Barbuda 76

Combining the data

The tables are combined through an inner join on the common 'country' column.
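
As a minimal sketch of what an inner join does, the two small tables below share only some countries. The tables `left` and `right`, their column names and the value for China are made up for the example.

```python
import pandas as pd

# Two hypothetical tables that overlap only partially on 'country'.
left = pd.DataFrame({'country': ['Canada', 'Chad', 'Chile'],
                     'density': [4, 10, 24]})
right = pd.DataFrame({'country': ['Chad', 'Chile', 'China'],
                      'life': [51, 80, 75]})

# An inner join keeps only the countries present in both tables.
joined = pd.merge(left, right, on='country', how='inner')
print(joined)
```

Canada and China each appear in only one table, so they are left out of the result. The same happens in the real data to any country for which one of the two indicators is unavailable.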

In [18]:
pdVsLife = merge(popdens, life, on=COUNTRY, how='inner')
pdVsLife.head()
Out[18]:
country Population Desity (people per sq. km of land area) Life expectancy (years)
0 Canada 4 81
1 Central African Republic 8 50
2 Chad 10 51
3 Channel Islands 853 80
4 Chile 24 80

Calculating the correlation

To measure whether life expectancy and population density grow together, the Spearman rank correlation coefficient is used. It is a number from -1 (perfect inverse rank correlation: if one indicator increases, the other decreases) to 1 (perfect direct rank correlation: if one indicator increases, so does the other), with 0 meaning there is no rank correlation. A perfect correlation doesn't imply any cause-effect relation between the two indicators. A p-value below 0.05 means the correlation is statistically significant.
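
To illustrate what "rank correlation" means, the sketch below uses made-up numbers in which one variable always increases with the other, but not along a straight line. The variable names are chosen for the example only.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Made-up data: y always grows with x, but not linearly (y = x**3).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

# Spearman only looks at the order of the values ...
rho, p = spearmanr(x, y)

# ... so it equals the Pearson correlation computed on the ranks.
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

# Pearson on the raw values measures linear association instead.
r_raw, _ = pearsonr(x, y)

print('Spearman:', rho)
print('Pearson on ranks:', r_on_ranks)
print('Pearson on raw values:', r_raw)
```

The Spearman coefficient is 1 here because the ordering of the two lists agrees perfectly, while the Pearson coefficient on the raw values is below 1 because the relationship is not linear.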

In [19]:
from scipy.stats import spearmanr

pdColumn = pdVsLife[POPDENS]
lifeColumn = pdVsLife[LIFE]
(correlation, pValue) = spearmanr(pdColumn, lifeColumn)
print('The correlation is', correlation)
if pValue < 0.05:
    print('It is statistically significant.', pValue)
else:
    print('It is not statistically significant.', pValue)
('The correlation is', 0.2940620186159793)
('It is statistically significant.', 0.00012623396817285377)

The result is a correlation of 0.29 with a p-value well below 0.05, so it is statistically significant. I am not sure what to conclude from this, other than that there appears to be a weak positive rank correlation between population density and life expectancy.

Showing the data

Measures of correlation can be misleading, so it is best to see the overall picture with a scatterplot.

In [22]:
%matplotlib inline
pdVsLife.plot(x=POPDENS, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(10, 4))
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b2fb610>

All I can conclude from the scatterplot is that the countries with a high population density are also in the upper range of life expectancy, but a low population density does not imply a low life expectancy.

It should also be noted that population density is averaged over a whole country. As many countries contain large uninhabitable areas, this number is likely to misrepresent the conditions in which the majority of the population lives in some cases.
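
A small worked example may make this concrete. The country below is entirely hypothetical: the region names, areas and populations are invented to show how a country-wide average can differ from the density most inhabitants live in.

```python
# Hypothetical country with a large, nearly empty region and a small dense one.
regions = [
    {'name': 'sparse interior', 'area_km2': 900_000, 'population': 900_000},
    {'name': 'dense coast', 'area_km2': 100_000, 'population': 29_100_000},
]

total_area = sum(r['area_km2'] for r in regions)
total_pop = sum(r['population'] for r in regions)

# Country-wide density, as reported per country: people / land area.
country_density = total_pop / total_area

# Density experienced by the average inhabitant: the mean of the regional
# densities, weighted by how many people live in each region.
weighted_density = sum(
    r['population'] * (r['population'] / r['area_km2']) for r in regions
) / total_pop

print('Country-wide density:', country_density)
print('Population-weighted density:', round(weighted_density, 1))
```

The country-wide figure is 30 people per sq. km, yet 97% of the inhabitants live in the region with 291 people per sq. km, so the population-weighted figure is close to ten times higher.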

In [24]:
# the 10 countries with lowest population density
pdVsLife.sort_values(POPDENS).head(10)
Out[24]:
country Population Desity (people per sq. km of land area) Life expectancy (years)
88 Mongolia 2 68
47 Iceland 3 83
93 Namibia 3 64
135 Suriname 3 71
42 Guyana 4 66
71 Libya 4 75
83 Mauritania 4 62
0 Canada 4 81
31 Gabon 6 63
58 Kazakhstan 6 70
In [25]:
# the 10 countries with lowest life expectancy
pdVsLife.sort_values(LIFE).head(10)
Out[25]:
country Population Desity (people per sq. km of land area) Life expectancy (years)
122 Sierra Leone 86 46
69 Lesotho 69 49
136 Swaziland 73 49
1 Central African Republic 8 50
91 Mozambique 34 50
8 Congo, Dem. Rep. 32 50
11 Cote d'Ivoire 68 51
2 Chad 10 51
100 Nigeria 190 52
22 Equatorial Guinea 28 53

Conclusions

Based on the information shown above, I would not dare to conclude that there is a relationship between country-wide population density and life expectancy in those countries. It would be interesting to apply this analysis on a regional basis (consistent population density and life expectancy data per region) if that data were available.