knitr::opts_chunk$set(echo = FALSE)

# Load packages
# install.packages("tidybiology")
library(tidyverse)
# install.packages("fontawesome")
library(fontawesome)
library(tidybiology)

# Load data
data(happy_full)
data(happy_select)
In this exercise, you will apply what you've learned in class to perform exploratory data analysis (EDA) on the World Happiness Report.
This dataset was downloaded from the website Kaggle. We will use the 2021 data in this exercise. The dataset is stored in an object called happy_full.
In this exercise, you will practice:
A couple of useful things to know about your dataset are:
- The number of rows and columns
- The types of variables the dataset contains
What function can you use to get this information?
dim(happy_full)
glimpse(happy_full)
We can see that happy_full contains many variables that are of type double.
The double data type refers to which of the following?
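As a quick illustration of the type in question, here is a minimal base R sketch (not part of the original exercise). In R, "double" is the default numeric type, a double-precision floating-point number, and integers have to be requested explicitly.

```r
# A bare number is stored as a double, even when it has no decimal part.
typeof(3.14)    # "double"
typeof(2)       # "double"

# The L suffix requests an integer instead.
typeof(2L)      # "integer"

# Both doubles and integers count as numeric.
is.numeric(2)   # TRUE
is.numeric(2L)  # TRUE
```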
The happy_full dataset contains many variables. This gives us the chance to practice our select()-ing skills!
Let's warm up by performing some basic select operations.
How would you select just the columns region and ladder_score?
happy_full %>% select(region, ladder_score)
Now select everything between (and including) social_support and generosity.
happy_full %>% select(social_support:generosity)
Let's try something more challenging now. Select all variables that do not have underscores in their names.
Hint: You'll need a helper function. Also, don't forget the ! operator, which negates a selection.
happy_full %>% select(!contains("_"))
Helper functions can be really...helpful! OK, now we're ready to select the variables we will need for the rest of this exercise. Create a new dataframe called happy_df that contains the following variables (in the specified order!): country_name, region, ladder_score_in_dystopia, logged_gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption.
Avoid simply typing out the names of all these variables. Add glimpse(happy_df) as the last line to see if you got the right answer.
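One possible approach, sketched on a small made-up tibble rather than the real data: name the first few columns individually, then use a : range for the run of adjacent columns. The object toy_full, its values, and its column order are assumptions for illustration only; check the actual column order of happy_full with glimpse() before relying on a range.

```r
library(dplyr)

# Hypothetical stand-in for happy_full: the column names follow the exercise,
# but the values and the column order are made up.
toy_full <- tibble(
  country_name = c("A", "B"),
  region = c("X", "Y"),
  ladder_score = c(7.1, 5.2),
  logged_gdp_per_capita = c(10.5, 9.1),
  social_support = c(0.95, 0.80),
  healthy_life_expectancy = c(72, 65),
  freedom_to_make_life_choices = c(0.9, 0.7),
  generosity = c(0.1, -0.05),
  perceptions_of_corruption = c(0.2, 0.8),
  ladder_score_in_dystopia = c(2.43, 2.43)
)

# select() returns columns in the order they are requested, so the dystopia
# column can be moved ahead of the range of adjacent columns.
toy_df <- toy_full %>%
  select(
    country_name,
    region,
    ladder_score_in_dystopia,
    logged_gdp_per_capita:perceptions_of_corruption
  )

glimpse(toy_df)
```

The range only works because those six columns sit next to each other in the toy tibble; if they did not, a helper such as contains() or all_of() with a character vector would be an alternative.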
Note: happy_df hasn't actually been saved anywhere, so we will be using an identical dataset called happy_select for the rest of this exercise.
happy_select contains both numeric and character variables, with lots and lots of observations (rows). This gives us a great opportunity to practice our filtering skills!
Say we're only interested in looking at data for countries in East Asia. How would we do this?
Hint: To see which major regions are in this dataset, run unique(happy_select$region)
happy_select %>% filter(region == "East Asia")
Now use filter() to only keep data for the following countries: Algeria, Belgium, India, Tunisia, and Uganda. Try to do this without writing multiple filter statements.
Hint: Try the %in% operator.
happy_select %>% filter(country_name %in% c("Algeria", "Belgium", "India", "Tunisia", "Uganda"))
Finally, let's filter out information for countries that have a below-average ladder score, keeping only the countries that score above the mean.
Hint: Use the base R mean() function.
happy_select %>% filter(ladder_score > mean(ladder_score))
Create a new variable called normalized_ladder_score that contains the ladder score for each country in happy_select divided by the ladder score in dystopia.
As a bonus, re-order the resulting data frame in descending order of normalized_ladder_score.
happy_select %>% mutate(normalized_ladder_score = ladder_score/ladder_score_in_dystopia) %>% arrange(desc(normalized_ladder_score))
Do the same as for the previous question, but only keep the following columns: ladder_score, ladder_score_in_dystopia, normalized_ladder_score.
happy_select %>% mutate(normalized_ladder_score = ladder_score/ladder_score_in_dystopia) %>% arrange(desc(normalized_ladder_score)) %>% select(contains("ladder"))
# or
happy_select %>% mutate(normalized_ladder_score = ladder_score/ladder_score_in_dystopia, .keep = "used") %>% arrange(desc(normalized_ladder_score))
Let's now find out which are the happiest regions in the world. We'll do this by working out the average ladder_score of all the countries in each region.
Hint: Remember the best friends, group_by() and summarise().
happy_select %>% group_by(region) %>% summarise(mean(ladder_score))
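A small variation, sketched on made-up data: giving the summary column a name and sorting on it makes the "happiest region" question easier to read off. The tibble toy_select and the column name mean_ladder_score are hypothetical and only stand in for the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for happy_select with made-up scores.
toy_select <- tibble(
  region = c("North", "North", "South", "South"),
  ladder_score = c(7.0, 6.0, 4.0, 5.0)
)

# Name the summary column, then sort so the highest average comes first.
region_means <- toy_select %>%
  group_by(region) %>%
  summarise(mean_ladder_score = mean(ladder_score)) %>%
  arrange(desc(mean_ladder_score))

region_means
```

Without an explicit name, the summary column is labelled after the expression itself, which is awkward to refer to in later steps.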
Now let's work out the maximum value for each numeric variable, for each region. Save the output of this code as an object named output.
Hint: across() is helpful here. To learn more about this function, run ?across
output <- happy_select %>% group_by(region) %>% summarise(across(where(is.numeric), max))
Save output as a new file (e.g. csv).
write_csv(output, "dplyr_output.csv")
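To check that a file was written as expected, one option is to read it straight back in. This is a minimal sketch using a temporary file and a made-up tibble; toy_output and csv_path are hypothetical names and not part of the exercise.

```r
library(dplyr)
library(readr)

# Hypothetical stand-in for the summarised output.
toy_output <- tibble(
  region = c("North", "South"),
  max_ladder_score = c(7, 5)
)

# Write to a temporary location so the working directory stays untouched.
csv_path <- tempfile(fileext = ".csv")
write_csv(toy_output, csv_path)

# Read the file back in to confirm the contents survived the round trip.
reread <- read_csv(csv_path)
reread
```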
Run the following to browse the dplyr vignettes:
browseVignettes("dplyr")