by RJ, November 2nd 2015
This is the project notebook for Week 3 of The Open University's Learn to code for Data Analysis course.
Does a high population density have an impact on life expectancy? The following analysis checks whether there is any correlation between the population density of a country in 2013 and the life expectancy of people born in that country in 2013.
Two World Bank datasets are considered. One, available at http://data.worldbank.org/indicator/EN.POP.DNST, lists the population density of the world's countries for various years. The other, available at http://data.worldbank.org/indicator/SP.DYN.LE00.IN, lists the life expectancy of the world's countries.
The datasets are downloaded directly, using the unique indicator name given in the URL.
from pandas import *
# pandas.io.wb was moved out of pandas into the separate pandas-datareader package
from pandas_datareader.wb import download
YEAR = 2013
POPDENS_INDICATOR = 'EN.POP.DNST'
popdens = download(indicator=POPDENS_INDICATOR, country='all', start=YEAR, end=YEAR)
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
life = download(indicator=LIFE_INDICATOR, country='all', start=YEAR, end=YEAR)
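Inspecting the data with head() and tail() shows which rows are unusable for a per-country comparison.
The data is therefore cleaned by resetting the index, skipping the first 34 rows, and removing rows with missing values.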
popdens.head(3)
popdens = popdens.reset_index()[34:].dropna()
life = life.reset_index()[34:].dropna()
popdens.head(3)
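As a small illustration of what the cleaning cell above does, here is a sketch on a made-up table. The country rows and values are invented for the example, and only one leading row is skipped instead of 34; the table is merely assumed to have the same (country, year) index shape as the downloaded data.

```python
import pandas as pd

# Hypothetical toy data: a (country, year) index and one indicator column.
index = pd.MultiIndex.from_tuples(
    [('World', '2013'), ('Aruba', '2013'), ('Andorra', '2013'), ('Belgium', '2013')],
    names=['country', 'year'])
toy = pd.DataFrame({'EN.POP.DNST': [55.0, 572.0, None, 369.0]}, index=index)

# reset_index() turns the index levels into ordinary columns,
# [1:] skips the first row (an aggregate in this toy table),
# and dropna() removes the row whose value is missing.
cleaned = toy.reset_index()[1:].dropna()
print(cleaned)
```

Only the rows for Aruba and Belgium remain, and 'country' is now a regular column that can be used for selecting and joining.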
The unnecessary columns can be dropped.
COUNTRY = 'country'
POPDENS = 'Population density (people per sq. km of land area)'
popdens[POPDENS] = popdens[POPDENS_INDICATOR].apply(round)
headings = [COUNTRY, POPDENS]
popdens = popdens[headings]
popdens.head()
The World Bank reports the population density and life expectancy with several decimal places. After rounding, the original column is discarded.
LIFE = 'Life expectancy (years)'
life[LIFE] = life[LIFE_INDICATOR].apply(round)
headings = [COUNTRY, LIFE]
life = life[headings]
life.head()
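One detail of the rounding step is worth knowing: Python 3's built-in round, applied to plain floats, rounds exact halves to the nearest even integer rather than always up. The values below are made up for illustration.

```python
# Made-up values; exact halves go to the nearest even integer.
values = [0.5, 1.5, 2.5, 70.49, 70.51]
rounded = [round(v) for v in values]
print(rounded)  # [0, 2, 2, 70, 71]
```

For this analysis the effect is negligible, since the indicators are rarely exact halves and the rank correlation depends only on the ordering of the values.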
The tables are combined through an inner join on the common 'country' column.
pdVsLife = merge(popdens, life, on=COUNTRY, how='inner')
pdVsLife.head()
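To make the effect of the inner join concrete, here is a minimal sketch with two invented tables. The country names and numbers are hypothetical; the point is only that rows without a match in both tables are dropped.

```python
import pandas as pd

# Made-up values for illustration only.
left = pd.DataFrame({'country': ['A', 'B', 'C'], 'density': [10, 200, 35]})
right = pd.DataFrame({'country': ['B', 'C', 'D'], 'life': [80, 71, 65]})

# An inner join keeps only the countries present in both tables.
joined = pd.merge(left, right, on='country', how='inner')
print(joined)
```

Countries 'A' and 'D' disappear from the result, so any country that was removed from one of the two datasets during cleaning is also absent from the combined table.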
To measure whether life expectancy and population density grow together, the Spearman rank correlation coefficient is used. It is a number from -1 (perfect inverse rank correlation: if one indicator increases, the other decreases) to 1 (perfect direct rank correlation: if one indicator increases, so does the other), with 0 meaning there is no rank correlation. A perfect correlation doesn't imply any cause-effect relation between the two indicators. A p-value below 0.05 means the correlation is statistically significant.
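The idea behind the coefficient can be sketched in plain Python: replace each value by its rank, then compute the ordinary (Pearson) correlation of the ranks. The helper names ranks, pearson and spearman are hypothetical and written only for this illustration; this is not how scipy implements it, it computes no p-value, and the data are made up.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


density = [3, 20, 150, 900, 7000]   # made-up values
rising = [50, 55, 61, 62, 80]       # same ordering as density
falling = [80, 62, 61, 55, 50]      # reversed ordering

print(spearman(density, rising))
print(spearman(density, falling))
```

Because only the ordering matters, the huge spread in the density values has no influence: the first pair gives a coefficient of 1 and the second a coefficient of -1.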
from scipy.stats import spearmanr
pdColumn = pdVsLife[POPDENS]
lifeColumn = pdVsLife[LIFE]
(correlation, pValue) = spearmanr(pdColumn, lifeColumn)
print('The correlation is', correlation)
if pValue < 0.05:
    print('It is statistically significant.', pValue)
else:
    print('It is not statistically significant.', pValue)
The result is a statistically significant p-value and a correlation of 0.29. I am not sure what to conclude from this, other than that there appears to be some weak positive connection between population density and life expectancy.
Measures of correlation can be misleading, so it is best to see the overall picture with a scatterplot.
%matplotlib inline
pdVsLife.plot(x=POPDENS, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(10, 4))
All I conclude from the scatterplot is that the countries with a high population density are also in the upper ranges of life expectancy, but a low population density is not necessarily accompanied by a low life expectancy.
It should also be noted that population density is measured per country, and as there are many uninhabitable areas in the world, this number is likely to misrepresent the conditions in which the majority of the population lives in some cases.
# the 10 countries with lowest population density
pdVsLife.sort_values(POPDENS).head(10)
# the 10 countries with lowest life expectancy
pdVsLife.sort_values(LIFE).head(10)
Based on the information shown above, I would not dare to conclude that there is a relationship between country-wide population density and life expectancy in those countries. It would be interesting to apply this analysis on a regional basis (consistent population density and life expectancy data per region) if that data were available.