What is the World Happiness Report Database ?

Context

The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, the second in 2013, the third in 2015, and the fourth in the 2016 Update. The World Happiness 2017, which ranks 155 countries by their happiness levels, was released at the United Nations at an event celebrating International Day of Happiness on March 20th. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.

Content

The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.

What do the columns succeeding the Happiness Score (like Family, Generosity, etc.) describe?

The following columns: GDP per Capita, Family, Life Expectancy, Freedom, Generosity, Trust Government Corruption describe the extent to which these factors contribute in evaluating the happiness in each country. The Dystopia Residual metric actually is the Dystopia Happiness Score(1.85) + the Residual value or the unexplained value for each country as stated in the previous answer.

Before doing anything, I have to set up R, and load the data, which is in the data_raw file of the package

knitr::opts_chunk$set(echo = TRUE, message = TRUE, eval = TRUE)
library(readxl)
library(utils)
library(assertthat)
library(dplyr)
library(tidyverse)
library(ProjetAlex)
library(dygraphs)
library(ggplot2)
library(rvest)

When building my package, I used the console to read the .csv files I downloaded on Kaggle, and converted it to .rda using devtools::use_data function. Now I am loading them.

load("../data/data2015.rda")
load("../data/data2016.rda")
load("../data/data2017.rda")

Now that I have the data for each year, my first goal is to aggregate them in order to be able to observe the evolution of all countries over time

# first, I have to mutate the data in order to add a column with the year in each database
# I'm going to use a function I coded, in the R file, called add_year

data2015 <- add_year(data2015, 2015)
data2016 <- add_year(data2016, 2016)
data2017 <- add_year(data2017, 2017)
# Now that I have the year of each observation, in order to bind all my data together, I have to tidy it so I have the same columns in each of my dataframes, and only vectors

data2015 <- data2015 %>% 
  select(-Region, -Standard.Error)
indx <- sapply(data2015 %>% 
                 select(-Country)
               , is.factor)
data2015[indx] <- lapply(data2015[indx], function(x) as.numeric(as.character(x)))

data2016 <- data2016 %>% 
  select(-Region, -Lower.Confidence.Interval, -Upper.Confidence.Interval)

indx <- sapply(data2016 %>% 
                 select(-Country)
               , is.factor)
data2016[indx] <- lapply(data2016[indx], function(x) as.numeric(as.character(x)))

data2017 <- data2017 %>% 
  select(-Whisker.high, -Whisker.low)
indx <- sapply(data2017 %>% 
                 select(-Country)
               , is.factor)
data2017[indx] <- lapply(data2017[indx], function(x) as.numeric(as.character(x)))

rm(indx)
#at long last, I can finally bind the data together
data_Happiness <- bind_rows(data2015, data2016, data2017)
# and make it a bit clearer
data_Happiness <- data_Happiness %>% 
  rename( Economy = Economy..GDP.per.Capita.) %>% 
  rename( Health = Health..Life.Expectancy.) %>% 
  rename( Trust = Trust..Government.Corruption.)

Let's start the analysis!

Now, I can finally see how happiness has evolved in the last three years, for, say, top 5 countries in average happiness over the last 3 years.

top_5 <- data_Happiness %>%
  group_by(Country) %>% 
  mutate(average_happiness = sum(Happiness.Score)/3) %>% 
  select(Country, average_happiness) %>% 
  distinct() %>% 
  head(5) %>% 
  pull(Country)

top_5

evolution_top_5 <- ggplot(data_Happiness %>% 
                             filter(Country %in% top_5),
                           aes(x = year, y = Happiness.Score, color = Country)) + geom_line() + theme_classic()   

evolution_top_5

The most surprising thing in this graph is the apparent drop of happiness in Canada. How can we explain it?

#Here, we are going to use a separate function I coded which gives us the evolution of the main indicators for any given country

country_analysis("Canada")

This doesn't make much sense... The happiness in Canada dropped in a significant way from 2016 to 2017, but most indicators are either stable or increasing (especially Family, which should supposedly have the most impact on Happiness, along with GDP per Capita, which increases too).

In hindsight, this is not such a surprise : all the indicators aren't actually scores of the countries, simply the extent to which this factors make Canada happier than the imaginary country of Dystopia. So this actually means that Family and Social Support accounts for Happiness in Canada more than in 2016.

This is interesting, but not helpful at all for an analysis... I still don't know why Canada is less happy than it was. And apparently I have no way to figure it out, since looking for the evolution of the indicators is meaningless.

My first great learning of this project : choose the database more wisely. This data is not actionable, not easy to analyze over time, and doesn't have enough depth... I'll continue anyway, but it's a lesson for next time.

I guess the best way to analyze this data is actually to compare the extent to which one given factor contributes to Happiness in all countries.

For example, let's see which country benefits the most from its GDP per capita, from an happiness perspective.

data_Happiness %>%
           filter(year == 2016) %>%
           select(Country, Economy) %>% 
           arrange(desc(Economy), Country) %>% 
           top_n(10, Economy)

This is interesting, because it doesn't quite match the ranking of countries per GDP. GDP per capita seems to be contributing to happiness in Qatar the most, but it is ranked 5th in GDP per Capita. More striking, Kuwait is respectively ranked 4th in economic contribution to happiness, and 29th in GDP. Conversely, Ireland is ranked 4th in GDP per Capita, but doesn't appear in our top.

This is kind of obvious, but we now know that GDP per capita and GDP per capita actual contribution to happiness are two different things.

Let's see if we observe the same thing for the poorest countries :

data_Happiness %>%
           filter(year == 2017) %>%
           select(Country, Economy) %>%
           arrange(desc(Economy), Country) %>% 
           top_n(-10, Economy)

The rankings are not quite the same, but there are fewer surprises than for high GDP per capita.

We can probably assume that once a certain threshold has been reached, Economical contribution to Happiness and GDP per capita are not related anymore.

Let's try to confirm this hypothesis.

First, I have to import actual GDP per Capita data. I can do this scraping wikipedia.

 wikiGDP <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita")
 datawiki <- html_table(wikiGDP,fill = TRUE)

Now I want to keep only the third dataframe, which is 2016 data from the IMF, in USD, and tidy it.

dataGDP <- datawiki[3]
dataGDP <- dataGDP[[1]]
dataGDP <- dataGDP %>% 
           rename( "GDP.per.Capita" = "US$")
dataGDP
# I have to convert Rank and GDP.per.Capita to numbers, drop the lines where Rank is equal to "–" and replace all commas by points in the GDP.per.Capita column

dataGDP[, 1] <- sapply(dataGDP[,1], as.numeric)
dataGDP[] <- lapply(dataGDP, gsub, pattern =',', replacement = '')
dataGDP[,3] <- sapply(dataGDP[,3], as.numeric)
dataGDP <- na.omit(dataGDP)
#I can now join my two dataframes
dataGDP <- dataGDP %>% 
  inner_join(data_Happiness %>% 
               filter(year == 2016), by = 'Country')
dataGDP <- dataGDP %>% 
  select(Rank, Country, GDP.per.Capita, Economy, Happiness.Score)

dataGDP

Now, let's try to see what we get when we plot each countries GDP.per.Capita and Economic contribution to Happiness on the same graph.

ggplot(dataGDP, aes(x = GDP.per.Capita, y = Economy)) + geom_point() + theme_classic() 

This is interesting ! We can see that for very poor countries, a slight increase in GDP per Capita results in a great increase in the Economic contribution to Happiness. But once the GDP has exceeded a given threshold, GDP per capita increase doesn't result in an increase in economic contribution to Happiness. This tends to confirm my hypothesis.

There SHOULD be some way to find something quite interesting with a non-linear regression - unfortunately my knowledge of non-linear regression (and how to do it with R) is not enough... yet.

It would have been really interesting to be able to quantify the uplift in Happiness resulting from an increase in GDP per capita, but well, there's no point just wasting time regretting the lack of relevance of my database.

There is just one last thing I want to do : see how suicide rates and happiness are related, in order to judge the relevance of the KPI.

I am going to scrape it from Wikipedia, using 2017 data.

 wikiSuicide <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_suicide_rate")
 datawikileretour <- html_table(wikiSuicide,fill = TRUE)
 dataSuicide <- datawikileretour[2]
 dataSuicide <- dataSuicide [[1]]
dataSuicide <- dataSuicide %>% 
               select(1:3) %>% 
              rename( "Suicides.per.100k" = "Both sexes") 
colnames(dataSuicide)[1] <- "Suicide.Rank"
dataSuicide[, 1] <- sapply(dataSuicide[,1], as.numeric)
dataSuicide[, 2] <- sapply(dataSuicide[,2], gsub, pattern =' \\(more info)', replacement = '')
dataSuicide[, 2] <- sapply(dataSuicide[,2], gsub, pattern ='\\*', replacement = '')
dataSuicide[, 3] <- sapply(dataSuicide[,3], as.numeric)
dataSuicide <- na.omit(dataSuicide)
dataSuicide <- dataSuicide %>% 
  inner_join(data_Happiness %>% 
               filter(year == 2017), by = 'Country')
dataSuicide <- dataSuicide %>% 
               select(1:6) %>% 
               select(-year)
ggplot(dataSuicide, aes(x = Suicides.per.100k, y = Happiness.Score)) + geom_point() + theme_classic() 

*Wow! Opposite of what I expected, there's no way Happiness Score and Suicide rate can be related ! Or if they are, I'll probably not be able to find a suitable model before I'm a full fledged data scientist... The most suicidal country is far from being the unhappiest one, and a lot of unhappy countries seem less suicidal than happier countries. *

Conclusion

During this analysis, I found two main results : - Increase of GDP for poor countries leads to a higher spike in Economical Contribution to Happiness than for richer countries - Suicides and Happiness Score don't seem related

This database was definitely disappointing... - No explainability of the Happiness.Score - Not enough data to predict anything in a realistic way - No way to account for the evolution of all happiness factors over time in a relevant way - Happiness is apparently unrelated to suicide, which seem quite surprising - maybe suicidal people are extreme cases, not likely taken into account in the studies used to construct the Happiness Score

Now I know much better how to judge a database before starting an analysis, I think ; that's a great learning! And I'm eager to learn more of regression to be able to analyze with more depth.

Thanks for reading! If you're interested in the subject, and you haven't seen it yet, here is a nice, inspiring video!



alexchouraki/ProjetAlex documentation built on May 28, 2019, 4:54 p.m.