knitr::opts_chunk$set(echo = FALSE)
# Load packages

# install.packages("tidybiology")
library(tidyverse)
# install.packages("fontawesome")
library(fontawesome)
library(tidybiology)

# Load data
data(happy_full)
data(happy_select)

Analysing the World Happiness Report

In this exercise, you will apply what you've learned in class to perform exploratory data analysis (EDA) on the World Happiness Report.

This dataset was downloaded from the website Kaggle. We will use the 2021 data in this exercise. The dataset is stored in an object called happy_full.

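In this exercise, you will practice selecting variables, filtering observations, creating new variables, and computing summary statistics.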

Take a look at your data

What does the dataset look like?

A couple of useful things to know about your dataset are:
- The number of rows and columns
- The types of variables the dataset contains

What function can you use to get this information?

dim(happy_full)
glimpse(happy_full)
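
If you only need one of these facts, narrower helpers exist as well. The sketch below uses a small made-up tibble (toy_df and its values are hypothetical, not taken from happy_full) so that it runs without the tidybiology package.

```r
library(dplyr)

# Hypothetical toy data standing in for happy_full
toy_df <- tibble(
  country_name = c("A", "B", "C"),
  region       = c("North", "North", "South"),
  ladder_score = c(7.1, 5.4, 4.9)
)

n_rows    <- nrow(toy_df)            # number of rows
n_cols    <- ncol(toy_df)            # number of columns
col_types <- sapply(toy_df, typeof)  # storage type of each column

glimpse(toy_df)                      # dimensions, types and first values in one view
```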

Double data type

We can see that happy_full contains many variables that are of type double.

The double data type refers to which of the following?
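
To check how R stores a value, typeof() can help. A minimal sketch with made-up values:

```r
# typeof() reports the storage type of a value
type_decimal <- typeof(7.5)   # a number with a decimal part
type_whole   <- typeof(7)     # numeric literals are stored as double by default
type_integer <- typeof(7L)    # the L suffix requests an integer
type_text    <- typeof("7")   # quoted values are character
```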

Selecting Variables

The happy_full dataset contains many variables. This gives us the chance to practice our select()-ing skills!

Simple selects

Let's warm up by performing some basic select operations.

How would you select just the columns region and ladder_score?

happy_full %>% 
  select(region, ladder_score)

Now select everything between (and including) social_support and generosity.

happy_full %>% 
  select(social_support:generosity)

Slightly-more-difficult selects

Let's try something more challenging now. Select all variables that do not have underscores in their names.

Hint: You'll need a helper function. Also, don't forget the negation operator, !

happy_full %>% 
  select(!contains("_"))
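
Selection helpers can be used on their own or negated with !. The sketch below shows a few of them on a hypothetical toy tibble (toy_df and its values are made up for illustration).

```r
library(dplyr)

# Hypothetical toy data
toy_df <- tibble(
  region         = c("North", "South"),
  generosity     = c(0.1, -0.2),
  ladder_score   = c(7.1, 4.9),
  social_support = c(0.9, 0.7)
)

# Keep columns whose names contain an underscore
with_underscore <- toy_df %>% select(contains("_"))

# Negate the helper with ! to keep everything else
without_underscore <- toy_df %>% select(!contains("_"))

# starts_with() matches on the beginning of a column name
ladder_cols <- toy_df %>% select(starts_with("ladder"))
```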

Helper functions can be really...helpful! OK, now we're ready to select the variables we will need for the rest of this exercise. Create a new dataframe called happy_df that contains the following variables (in the specified order!): country_name, region, ladder_score_in_dystopia, logged_gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption.

Avoid simply typing out the names of all these variables. Add glimpse(happy_df) as the last line to see if you got the right answer.
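
One way to avoid typing every name is to mix individual names with a range, since a:b selects every column positioned from a to b. The sketch below shows the idea on a hypothetical toy tibble whose column order is made up. The column positions in happy_full may differ, so check them with glimpse(happy_full) before adapting it.

```r
library(dplyr)

# Hypothetical toy data; the column order is an assumption for illustration
toy_df <- tibble(
  country_name            = c("A", "B"),
  region                  = c("North", "South"),
  standard_error          = c(0.03, 0.05),
  social_support          = c(0.9, 0.7),
  healthy_life_expectancy = c(72, 60),
  generosity              = c(0.1, -0.2)
)

# Two single names, then a range that covers three adjacent columns
toy_selected <- toy_df %>%
  select(country_name, region, social_support:generosity)

glimpse(toy_selected)
```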


Note: happy_df hasn't actually been saved anywhere, so we will use an identical dataset called happy_select for the rest of this exercise.

Filtering

happy_select contains both numeric and character variables, with lots and lots of observations (rows). This gives us a great opportunity to practice our filtering skills!

Simple filters

Say we're only interested in looking at data for countries in East Asia. How would we do this?

Hint: To see which major regions are in this dataset, run unique(happy_select$region)

happy_select %>% 
  filter(region == "East Asia")
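
Because == compares strings exactly, it helps to check the spelling of the values before filtering. A minimal sketch on hypothetical toy data (the countries and regions below are made up):

```r
library(dplyr)

# Hypothetical toy data
toy_df <- tibble(
  country_name = c("A", "B", "C", "D"),
  region       = c("East Asia", "East Asia", "Western Europe", "South Asia")
)

# count() lists each distinct region with its number of rows
region_counts <- toy_df %>% count(region)

# == is case-sensitive, so "east asia" matches nothing here
exact_match <- toy_df %>% filter(region == "East Asia")
wrong_case  <- toy_df %>% filter(region == "east asia")
```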

Filter%in%g

Now use filter() to keep only the data for the following countries: Algeria, Belgium, India, Tunisia, and Uganda. Try to do this without writing multiple filter statements.

Hint: Try the %in% operator

happy_select %>% 
  filter(country_name %in% c("Algeria", "Belgium", "India", "Tunisia", "Uganda"))
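
The %in% operator is shorthand for a chain of == comparisons joined by |. The sketch below compares the two forms on hypothetical toy data (the scores are made up).

```r
library(dplyr)

# Hypothetical toy data
toy_df <- tibble(
  country_name = c("Algeria", "Belgium", "Chile", "India"),
  ladder_score = c(4.9, 6.8, 6.2, 3.8)
)

wanted <- c("Algeria", "India")

# One %in% test ...
with_in <- toy_df %>% filter(country_name %in% wanted)

# ... keeps the same rows as several == tests joined by |
with_or <- toy_df %>%
  filter(country_name == "Algeria" | country_name == "India")

# Negating the test keeps every other country
others <- toy_df %>% filter(!country_name %in% wanted)
```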

Numeric filtering

Finally, let's filter out the countries that have a below-average ladder score, keeping only those whose score is above the mean.

Hint: Use the base R mean() function

happy_select %>% 
  filter(ladder_score > mean(ladder_score))
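
If a numeric column contains missing values, mean() returns NA and the comparison keeps no rows. The sketch below shows the na.rm = TRUE argument on hypothetical toy data. Whether happy_select has any missing ladder scores is not stated here, so treat this as a precaution rather than a required fix.

```r
library(dplyr)

# Hypothetical toy data with one missing score
toy_df <- tibble(
  country_name = c("A", "B", "C", "D"),
  ladder_score = c(7, 5, 3, NA)
)

# mean() is NA when any value is missing, so no row passes the filter
no_rows <- toy_df %>% filter(ladder_score > mean(ladder_score))

# na.rm = TRUE ignores missing values when computing the mean
above_avg <- toy_df %>%
  filter(ladder_score > mean(ladder_score, na.rm = TRUE))
```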

Creating new variables

Creating normalized ladder scores

Create a new variable called normalized_ladder_score that contains the ladder score for each country in happy_select divided by the ladder score in dystopia.

As a bonus, reorder the resulting data frame in descending order of normalized_ladder_score.

happy_select %>% 
  mutate(normalized_ladder_score = ladder_score/ladder_score_in_dystopia) %>% 
  arrange(desc(normalized_ladder_score))

Discarding variables

Do the same as for the previous question, but keep only the following columns: ladder_score, ladder_score_in_dystopia, normalized_ladder_score.

happy_select %>% 
  mutate(normalized_ladder_score = ladder_score/ladder_score_in_dystopia) %>% 
  arrange(desc(normalized_ladder_score)) %>% 
  select(contains("ladder"))

# or
happy_select %>% 
  mutate(normalized_ladder_score = ladder_score/ladder_score_in_dystopia,
         .keep = "used") %>% 
  arrange(desc(normalized_ladder_score))
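
The .keep argument of mutate() controls which columns survive the call. The sketch below compares two of its options on hypothetical toy data (the values and the column name normalized are made up).

```r
library(dplyr)

# Hypothetical toy data
toy_df <- tibble(
  country_name             = c("A", "B"),
  ladder_score             = c(7.5, 5.0),
  ladder_score_in_dystopia = c(2.5, 2.5)
)

# "used" keeps the columns that fed the calculation plus the new column
kept_used <- toy_df %>%
  mutate(normalized = ladder_score / ladder_score_in_dystopia, .keep = "used")

# "none" keeps only the new column
kept_none <- toy_df %>%
  mutate(normalized = ladder_score / ladder_score_in_dystopia, .keep = "none")
```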

Summary statistics

Which regions are the happiest?

Let's now find out which regions are the happiest in the world. We'll do this by working out the average ladder_score of all the countries in each region.

Hint: Remember the best friends, group_by() and summarise()

happy_select %>% 
  group_by(region) %>% 
  summarise(mean(ladder_score))
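
Naming the summary column makes the result easier to sort and reuse. The sketch below does this on hypothetical toy data. The column name mean_ladder_score is an illustrative choice.

```r
library(dplyr)

# Hypothetical toy data
toy_df <- tibble(
  region       = c("North", "North", "South", "South"),
  ladder_score = c(7, 6, 4, 5)
)

region_means <- toy_df %>%
  group_by(region) %>%
  summarise(mean_ladder_score = mean(ladder_score)) %>%
  arrange(desc(mean_ladder_score))   # happiest region first
```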

Maxed out

Now let's work out the maximum value of each numeric variable within each region. Save the output of this code as an object named output.

Hint: across() is helpful here. To learn more about this function, run ?across

output <- happy_select %>% 
            group_by(region) %>% 
            summarise(across(where(is.numeric), max))
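
across() also accepts a named list of functions, which yields one column per variable and function. A minimal sketch on hypothetical toy data (the values are made up):

```r
library(dplyr)

# Hypothetical toy data
toy_df <- tibble(
  region       = c("North", "North", "South"),
  ladder_score = c(7, 6, 4),
  generosity   = c(0.1, 0.3, -0.2)
)

# Each numeric column gets a _min and a _max summary per region
toy_summary <- toy_df %>%
  group_by(region) %>%
  summarise(across(where(is.numeric), list(min = min, max = max)))
```

The new columns are named by joining the variable name and the function name with an underscore, for example ladder_score_max.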

Save output as a new file (e.g. a CSV file).

write_csv(output, "dplyr_output.csv")
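
To confirm that a file was written as expected, you can read it back in. The sketch below uses hypothetical toy data and a temporary file path (csv_path) so that nothing in your working directory is overwritten.

```r
library(dplyr)
library(readr)

# Hypothetical toy data standing in for output
toy_output <- tibble(
  region       = c("North", "South"),
  ladder_score = c(7, 4)
)

# tempfile() returns a path in the session's temporary directory
csv_path <- tempfile(fileext = ".csv")
write_csv(toy_output, csv_path)

# Read the file back and inspect it
toy_check <- read_csv(csv_path, show_col_types = FALSE)
glimpse(toy_check)
```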

For more help

Run the following to browse the dplyr vignettes.

browseVignettes("dplyr")


hirscheylab/tidybiology documentation built on May 20, 2022, 10:55 p.m.