knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE ) #necessary to render tutorial correctly library(learnr) library(htmltools) #tidyverse library(dplyr) library(ggplot2) #non tidyverse library(knitr) source("./www/discovr_helpers.R") #Read dat files needed for the tutorial ice_tib <- discovr::ice_bucket
# Create bib file for R packages here::here("inst/tutorials/discovr_02/packages.bib") |> knitr::write_bib(c('here', 'tidyverse', 'dplyr', 'readr', 'forcats', 'tibble', 'datawizard', 'knitr'), file = _)
r cat_space(fill = blu)
Welcome to the discovr
space pirate academyHi, welcome to discovr space pirate academy. Well done on embarking on this brave mission to planet r rproj()
s, which is a bit like Mars, but a less red and more hostile environment. That's right, more hostile than a planet without water. Fear not though, the fact you are here means that you can master r rproj()
, and before you know it you'll be as brilliant as our pirate leader Mae Jemstone (she's the badass with the gun). I am the space cat-det, and I will pop up to offer you tips along your journey.
On your way you will face many challenges, but follow Mae's system to keep yourself on track:
r bmu(height = 1.5)
This icon flags materials for teleporters. That's what we like to call the new cat-dets, you know, the ones who have just teleported into the academy. This material is the core knowledge that everyone arriving at space academy must learn and practice. For accessibility, these sections will also be labelled with [(1)]{.alt}.r user_visor(height = 1.5)
Once you have been at space pirate academy for a while, you get your own funky visor. It has various modes. My favourite is the one that allows you to see everything as a large plate of tuna. More important, sections marked for cat-dets with visors goes beyond the core material but is still important and should be studied by all cat-dets. However, try not to be disheartened if you find it difficult. For accessibility, these sections will also be labelled with [(2)]{.alt}.r user_astronaut(height = 1.5)
Those almost as brilliant as Mae (because no-one is quite as brilliant as her) get their own space suits so that they can go on space pirate adventures. They get to shout RRRRRR really loudly too. Actually, everyone here gets to should RRRRRR really loudly. Try it now. Go on. It feels good. Anyway, this material is the most advanced and you can consider it optional unless you are a postgraduate cat-det. For accessibility, these sections will also be labelled with [(3)]{.alt}.It's not just me that's here to help though, you will meet other characters along the way:
r alien(height = 1.5)
aliens love dropping down onto the planet and probing humanoids. Unfortunately you'll find them probing you quite a lot with little coding challenges. Helps is at hand though. r robot(height = 1.5)
bend-R is our coding robot. She will help you to try out bits of r rproj()
by writing the code for you before you encounter each coding challenge.r bug(height = 1.5)
we also have our friendly alien bugs that will, erm, help you to avoid bugs in your code by highlighting common mistakes that even Mae Jemstone sometimes makes (but don't tell her I said that or my tuna supply will end). Also, use hints and solutions to guide you through the exercises (Figure 1).
By for now and good luck - you'll be amazing!
Before attempting this tutorial it's a good idea to work through this tutorial on how to install, set up and work within r rproj()
and r rstudio()
.
The tutorials are self-contained (you practice code in code boxes). However, so you get practice at working in r rstudio()
I strongly recommend that you create a Quarto document within an r rstudio()
project and practice everything you do in the tutorial in the Quarto document, make notes on things that confused you or that you want to remember, and save it. Within this Quarto document you will need to load the relevant packages and data.
This tutorial uses the following packages:
datawizard
[@R-datawizard]here
[@R-here]knitr
[@R-knitr]It also uses these tidyverse
packages [@R-tidyverse; @tidyverse2019]: readr
[@R-readr], dplyr
[@R-dplyr], forcats
[@R-forcats], ggplot2
[@wickhamGgplot2ElegantGraphics2016] and tibble
[@R-tibble].
There are (broadly) two styles of coding:
Explicit: Using this style you declare the package when using a function: package::function()
. For example, if I want to use the mutate()
function from the package dplyr
, I will type dplyr::mutate()
. If you adopt an explicit style, you don't need to load packages at the start of your Quarto document (although see below for some exceptions).
Concise: Using this style you load all of the packages at the start of your Quarto document using library(package_name)
, and then refer to functions without their package. For example, if I want to use the mutate()
function from the package dplyr
, I will use library(dplyr)
in my first code chunk and type the function as mutate()
when I use it subsequently.
Coding style is a personal choice. The Google r rproj()
style guide and tidyverse style guide recommend an explicit style, and I use it in teaching materials for two reasons (1) it helps you to remember which functions come from which packages, and (2) it prevents clashes resulting from using functions from different packages that have the same name. However, even with this style it makes sense to load tidyverse
because the dplyr
and ggplot2
packages contain functions that are often used within other functions and in these cases explicit code is difficult to read. Also, no-one wants to write ggplot2::
before every function from ggplot2
.
You can use either style in this tutorial because all packages are pre-loaded. If working outside of the tutorial, load the tidyverse
package (and any others if you're using a concise style) at the beginning of your Quarto document:
library(tidyverse)
To work outside of this tutorial you need to download the following data file:
Set up an r rstudio()
project in the way that I recommend in this tutorial, and save the data files to the folder within your project called [data]{.alt}. Place this code in the first code chunk in your Quarto document:
ice_tib <- here::here("data/ice_bucket.csv") |> readr::read_csv()
r bmu()
Frequency distributions [(1)]{.alt}We will use data about memes, which typically have a common pattern: they start small, rapidly expand in popularity and then die out. Specifically we'll use data from the ice bucket challenge. You can check Wikipedia for the full story, but it all started (arguably) with golfer Chris Kennedy tipping a bucket of iced water on his head to raise awareness of the disease amyotrophic lateral sclerosis (ALS, also known as Lou Gehrig's disease). The idea is that you are challenged and have 24 hours to post a video of you having a bucket of iced water poured over your head. In this video you also challenge at least three other people. If you fail to complete the challenge your forfeit is to donate to charity (in this case ALS). Many people completed the challenge and made donations.
The ice bucket challenge generated something like 2.3 million on YouTube. The data are stored in [ice_tib]{.alt}, which contains one variable upload_day that is the number of days after Chris Kennedy’s initial challenge that each of 2,323,452 ice bucket related videos were uploaded to YouTube. For example, if the value is 21 it means that the video was uploaded 21 days after Chris Kennedy’s initial challenge. We can't see the pattern of the data easily from 2,323,452 scores. However, we can if we count the frequency of each score and place it in a table.
r bmu()
Frequency tables [(1)]{.alt}To get a basic frequency table we use the group_by()
and summarise()
and n()
functions from the dplyr
package (automatically loaded as part of tidyverse). These functions do the following:
group_by()
: groups the data by whatever variable (or variables) you name within the function.summarise()
: creates a summary table based on whatever variables you create within the functionn()
: counts the number of scoresTo count the frequencies of scores we need to do two things:
r rproj()
to treat values that are the same, as being in the same group or category. For example, tell it that scores of 21 are from the same group/category. We do this using group_by(upload_day)
, which tells r rproj()
to treat scores that are the same within upload_day as being in the same group. Any subsequent operation will be conducted on these 'groups'. summarize()
, within which we create a variable called frequency that counts how many items are in each group created by group_by()
. This variable will, therefore, contain the number of times each value of upload_day occurs.r robot()
Code exampleWe can create a frequency table called [freq_tbl]{.alt} as follows:
freq_tbl <- ice_tib |> dplyr::group_by(upload_day) |> dplyr::summarise( frequency = n() ) freq_tbl
To sum up, the code feeds the data stored in [ice_tib]{.alt} into the group_by()
function and asks it to group the output by the variable upload_day. Having done this, a variable called frequency is created that counts how many scores is in each 'group'.
We can see, for example, that 29000 videos were uploaded 30 days after Chris Kennedy’s initial challenge, and 10000 were uploaded 27 days after.
r bmu()
Grouped frequency tables [(1)]{.alt}The frequency table that we just produced is large and a little unwieldy. It might be useful to instead have a table that counts the frequencies for a range of scores. For example, how many videos were uploaded between 21 and 24 days, and 25 and 28 days. This is known as a grouped frequency distribution. To produce one of these, first we have to place the values of upload_days into what are known as bins. The values of the number of days range from 21 to 76 days after Chris Kennedy’s initial challenge. Imagine we wanted to group the data in such a way as to see how many videos were uploaded every 4 days (instead of every day). To do this, we'd need to group the values of days into bins of 21-24, 25-28, 29-32 and so on. Notice that each 'bin' is made up of 4 days worth of values.
r robot()
Code exampleWe can create these bins using ggplot2::cut_width()
which takes this form:
ggplot2::cut_width(variable, width_of_bin)
In which we place the variable that we wish to spread across bins, and [width_of_bin]{.alt} is how wide we want the bins to be. If we want to split the variable upload_day into bins containing 4 days worth of data we'd use:
ggplot2::cut_width(upload_day, 4)
Combining this function with dplyr::mutate()
(which we encountered in the discovr_01 tutorial) to create a new variable called days_group, we could execute:
r robot()
Code examplegp_freq_dist <- ice_tib |> dplyr::mutate( days_group = ggplot2::cut_width(upload_day, 4) ) gp_freq_dist
This creates a new object called [gp_freq_dist]{.alt} that contains each value within [ice_tib]{.alt} but with an extra column/variable called days_group that indicates into which 'bin' the value of upload_day falls.
Having done this we can again use summarize()
and n()
to count the scores like we did before. However, this time we want to group the summary by days_group instead of upload_day. Doing so means that the number of times a score falls within each bin will be counted rather than the number times each score occurs.
r alien()
Alien coding challengeCreate a grouped frequency table called [gp_freq_dist]{.alt} by starting with the code in the code example and then using the code we used to create [freq_tbl]{.alt} to create a pipe that summarizes the grouped scores.
# Start with the text from the code example, which creates the bins: gp_freq_dist <- ice_tib |> dplyr::mutate( days_group = ggplot2::cut_width(upload_day, 4) )
# Previously we used this code to count scores # Think about how to connect this code to the previous # code. Focus on the right hand side. freq_tbl <- ice_tib |> dplyr::group_by(upload_day) |> dplyr::summarise( frequency = n() )
# We can combine the two pieces of code like this. # We take the code that bins the data, then # pipe the resulting tibble into # group_by and summarize. gp_freq_dist <- ice_tib |> dplyr::mutate( days_group = ggplot2::cut_width(upload_day, 4) ) |> dplyr::group_by(upload_day) |> dplyr::summarise( frequency = n() ) # If you run the above code you'll get the same # table as before (freq_tbl) because we have grouped| # by upload day
# We want to group by days_group not upload day.. # Change upload_day in the group_by() function # to be days_group gp_freq_dist <- ice_tib |> dplyr::mutate( days_group = ggplot2::cut_width(upload_day, 4) ) |> dplyr::group_by(days_group) |> dplyr::summarise( frequency = n() ) # Job done. We want to see the resulting table so we add: gp_freq_dist
We can see, for example that 534000 videos were uploaded after 38 days and up to and including 42 days.
r bmu()
Relative frequencies [(1)]{.alt}We now have an object [gp_freq_dist]{.alt} that contains the number of days grouped into bins of 4 days (days_group) and the number of videos uploaded during each of the time periods represented by those bins (frequency).
r robot()
Code exampleIf we want to calculate the relative frequency (i.e., the proportion of videos uploaded during each of the time periods represented by the bins) we can use dplyr::mutate()
to add a variable that divides the frequency by the total number of videos uploaded. We can find this total using sum()
.
... |> dplyr::mutate( relative_freq = frequency/sum(frequency) )
Within the mutate()
function we create a new column called relative_freq using [frequency/sum(frequency)]{.alt}. The effect of this command is that for each value of the variable frequency within [gp_freq_dist]{.alt}, a new value is computed that is the original value divided by the sum (or total) of all of the frequencies. This process converts the raw frequency into a relative frequency.
r alien()
Alien coding challengeUsing the code example, pipe the [gp_freq_dist]{.alt} object into mutate()
, calculate the relative frequency and store it as a variable called relative_freq.
gp_freq_dist <- ice_tib |> dplyr::mutate( days_group = ggplot2::cut_width(upload_day, 4) ) |> dplyr::group_by(days_group) |> dplyr::summarise( frequency = n() )
# We want to add a variable to an existing tibble (gp_freq_dist) # so a good place to start is: gp_freq_dist <- gp_freq_dist |> ... # Think about what will be on the other wise of the pipe
# We pipe the data into the mutate() from the code example gp_freq_dist <- gp_freq_dist |> dplyr::mutate( relative_freq = frequency/sum(frequency) )
# To see the results we need to execute the name of rthe table: gp_freq_dist <- gp_freq_dist |> dplyr::mutate( relative_freq = frequency/sum(frequency) ) gp_freq_dist
r user_visor()
Efficient code [(2)]{.alt}In the tasks above we created the table of relative frequencies step by step:
r robot()
Code exampleUsually it's more efficient to carry out these steps in one piece of code. For example, if we combine all of the previous operations we get:
gp_freq_dist <- ice_tib |> dplyr::mutate( days_group = ggplot2::cut_width(upload_day, 4) ) |> dplyr::group_by(days_group) |> dplyr::summarise( frequency = n() ) |> dplyr::mutate( relative_freq = frequency/sum(frequency), percent = relative_freq*100 ) gp_freq_dist
gp_freq_dist <- ice_tib |> dplyr::mutate( days_group = ggplot2::cut_width(upload_day, 4) ) |> dplyr::group_by(days_group) |> dplyr::summarise( frequency = n() ) |> dplyr::mutate( relative_freq = frequency/sum(frequency), percent = relative_freq*100 ) gp_freq_dist
This code takes the data in [ice_tib]{.alt}, feeds it into mutate()
to create a new variable called days_group which contains the 'bin' to which the score in upload_day belongs. Having added these 'variable bins', we group the data by them using group_by()
, count how many scores fall into each bin using summarize()
and store these in a variable called frequency. We then feed this summary table into mutate()
to create new variables containing the relative frequency in its raw form and expressed as a percentage. We store the summary table in an object called [gp_freq_dist]{.alt}.
r bmu()
Histograms [(1)]{.alt}As well as tabulated frequency distributions, we can visualise distributions using histograms. ggplot2
is a powerful package for producing data visualisations that we will explore in detail in the discovr_05 tutorial. For now, we will use it to create a histogram without worrying about how it works.
r bmu()
A basic histogram [(1)]{.alt}Let's start by plotting a histogram of number of days between the original ice bucket challenge video and when subsequent videos were uploaded. Remember, the data are stored in a variable called upload_day in the [ice_tib]{.alt} tibble. To initiate the plot we use the ggplot()
function, which at its simplest has this general form:
r robot()
Code example`ggplot2::ggplot(my_tib, aes(variable_for_x_axis, variable_for_y_axis))`
Within the ggplot()
function replace [my_tib]{.alt} with the name of the tibble containing the data you want to plot, and within the aes()
function replace [variable_for_x_axis]{.alt} with the name of the variable to be plotted on the x-axis (horizontal), and replace [variable_for_y_axis]{.alt} with the name of the variable to be plotted on the y-axis (vertical).
To plot the days since the first ice bucket challenge video that each video was uploaded, we could execute the code below, which uses the tibble called [ice_tib]{.alt}, and plots the variable upload_day on the x axis (for a histogram we don't need to specify y). Try it.
ggplot2::ggplot(ice_tib, aes(upload_day))
That doesn't look right!?
Don't worry, you haven't done anything wrong: this command tells ggplot2
what to plot, but not how to plot it. We need to add something called a geom to display the data. For a histogram we use geom_histogram()
.
r alien()
Alien coding challengeTake the code from the code example (reproduced below), add + geom_histogram()
to it and run it.
ggplot2::ggplot(ice_tib, aes(upload_day))
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram()
The completed command tells r rproj()
to take the object created by ggplot2::ggplot(ice_tib, aes(upload_day))
and add (+
) a layer to it using geom_histogram()
. You should see a beautiful histogram now!
r bmu()
Changing the bin widths [(1)]{.alt}By default ggplot2
constructs the bins of the histogram to be 1/30th the width of the data. You can over-ride this default by specifying [binwidth =]{.alt} within the geom_histogram()
function. In the code box below, type [binwidth = 1]{.alt} into the brackets after geom_histogram()
and execute the code. Note how the histogram changes. A binwidth of 1 makes sense for these data because responses could only be whole numbers (days), but try some other values and note how the histogram changes.
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram()
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1)
r bmu()
Changing the colours of the bars [(1)]{.alt}We can change the colour of the bars by including [fill =]{.alt} within the geom_histogram()
function.
r robot()
Code exampleFor example, we could specify the colour red as:
geom_histogram(binwidth = 1, fill = "red")
r alien()
Alien coding challengeAdapt the code below to colour the histogram bars red.
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1)
You can also specify any HEX colour code. For example, the shade of blue defined by hex code #56B4E9 is good for colour blind people, so we could specify this by replacing ["red"]{.alt} with ["#56B4E9"]{.alt}.
r alien()
Alien coding challengeReplacing ["red"]{.alt} with ["#56B4E9"]{.alt} and run the code to see the effect this change has. Using the HEX colour code website, try changing the fill colour to a different colour using a hex code.
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "red")
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9")
r bmu()
Changing the transparency of the bars [(1)]{.alt}You can also make filled objects semi-transparent by using [alpha =]{.alt} where alpha is a proportion (i.e., between 0 and 1). For example, if you want the histograms to have 20% opacity you could include [alpha = 0.2]{.alt} in the geom_histogram()
function (remembering to separate it from other options with a comma).
r alien()
Alien coding challengeTaking our previous histogram, try setting 50% opacity. Try out some other values of alpha (between 0 and 1) and note the effect on the transparency of the bars.
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9")
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5)
r bmu()
Editing axis labels [(1)]{.alt}r robot()
Code exampleTo change the labels on the x- and y-axis we can use the labs()
function. To do this, we add a +
after the geom_histogram()
function and on the next line type:
labs(y = "label_for_y_axis", x = "label_for_x_axis")
Replacing [label_for_y_axis]{.alt} with the text we want on the y-axis and [label_for_x_axis]{.alt} with the text that we want on the x-axis. For the current plot, we are plotting the frequency of uploads (because it's a histogram) on the y-axis and the number of days since the original ice bucket challenge on the x-axis, so we might use:
labs(y = "Frequency", x = "Days since first ice bucket challenge video")
r alien()
Alien coding challengeRun the code in the box below to reproduce our histogram, then add the code from the code example and run it again. You should see that the axis labels change.
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5)
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5) + labs(y = "Frequency", x = "Days since first ice bucket challenge video")
r bmu()
Changing theme [(1)]{.alt}ggplot2
has various built in themes that change the appearance of the plot. The two we will use most often are theme_bw()
, which applies a black and white theme and theme_minimal()
which applies a minimalist theme. Both of these themes are good for scientific plots (like the ones you'll find in journal articles).
r robot()
Code exampleTo apply a theme we add +
after the previous function and then type theme_bw()
or theme_minimal()
. For example, to apply a minimalist theme to our histogram code we'd use the code below. Note that we have taken the code from the previous section and simply tagged + theme_minimal()
to the end. Compare the result to the previous version of the histogram.
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5) + labs(y = "Frequency", x = "Days since first ice bucket challenge video") + theme_minimal()
r alien()
Alien coding challengeTry applying theme_bw()
, theme_classic()
and theme_dark()
to see what effect it has on the plot.
ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5) + labs(y = "Frequency", x = "Days since first ice bucket challenge video")
# Theme black and white: ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5) + labs(y = "Frequency", x = "Days since first ice bucket challenge video") + theme_bw()
# Theme classic: ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5) + labs(y = "Frequency", x = "Days since first ice bucket challenge video") + theme_classic()
# Theme dark: ggplot2::ggplot(ice_tib, aes(upload_day)) + geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5) + labs(y = "Frequency", x = "Days since first ice bucket challenge video") + theme_dark()
We'll return to the ggplot2
package in more depth in discovr_05.
r bmu()
Summarizing data [(1)]{.alt}We've seen already that functions take the form of a command followed by brackets. We also saw that there are usually options that can be placed within those brackets (for example, in the previous section we changed the binwidth and bar colour of the histogram by placing instructions within the geom_histogram()
function). We'll explore this idea more by looking at the function to estimate the mean and median of the ice bucket scores.
r bmu()
The mean and median [(1)]{.alt}We can compute the mean using the built-in function mean()
. The full format of the function is:
mean(variable, trim = 0, na.rm = FALSE)
Which just says that you need to include a reference to the variable/data that you want the mean for, and that you can set two arguments:
r rproj()
missing values are denoted as [NA]{.alt} (not available), so by setting [na.rm = TRUE]{.alt} (or [na.rm = T]{.alt} for short) you ask r rproj()
to remove missing values before computing the mean. The function for the median has a similar format except that there isn't an argument to trim the data because that wouldn't make sense (the median is effectively the data with a 50% trim):
median(variable, na.rm = FALSE)
r robot()
Code exampleIf you are happy with the default settings you don't need to specify those arguments explicitly. For example, to find the mean of the variable upload_day from the [ice_tib]{.alt} tibble, we could execute:
mean(ice_tib$upload_day)
However, if we wanted to remove missing values we need to override the default setting for the [na.rm]{.alt} argument:
mean(ice_tib$upload_day, na.rm = TRUE)
We can obtain the median in much the same way.
r alien()
Alien coding challengeFind the median number of days after the original ice bucket video that other videos were uploaded.
median(ice_tib$upload_day)
r bmu()
Quantifying the 'fit' of the mean [(1)]{.alt}We can use the functions var()
, sd()
to get the variance and standard deviation of the ice bucket scores. These functions behave exactly like mean()
in that we input the variable for which we want the variance and standard deviation and specify how we treat missing values (by default they are not removed):
r robot()
Code examplevar(variable_name, na.rm = FALSE) sd(variable_name, na.rm = FALSE)
r alien()
Alien coding challengeUse what you learned in the previous section and the code example above to get the variance and standard deviation of the days since the original ice bucket video that other videos were uploaded.
var(ice_tib$upload_day) sd(ice_tib$upload_day)
r bmu()
The inter-quartile range [(1)]{.alt}We can use the IQR()
function to obtain the interquartile range of a set of scores. This function has an additional option of [type =]{.alt} which allows you to specify one of 8 different ways to calculate the IQR. The default is 7. There is an argument for using [type = 8]{.alt}, which uses a method recommended by [@hyndman_sample_1996].
r robot()
Code exampleIQR(variable_name, na.rm = FALSE, type = 7)
r alien()
Alien coding challengeAdapt the code example to get the inter-quartile range of the days since the original ice bucket video. Recalcutae it using method 8.
IQR(ice_tib$upload_day, type = 8)
In the code box below, use this function to get the inter-quartile range of the upload_day variable.
# The code example is: IQR(variable_name, na.rm = FALSE, type = 7) # Replace variable_name with the name of the variable that we want to summarize.
# This gives us: IQR(ice_tib$upload_day, na.rm = FALSE, type = 7) # We have no missing values, and to start with we want the default method # of 7 so we can omit these arguments
# Solution 1: IQR(ice_tib$upload_day) # To use method 8, insert type = 8 into the function
# Solution 2: IQR(ice_tib$upload_day, type = 8)
r user_visor()
Creating a summary table [(2)]{.alt}So far we have looked at computing individual statistics for a set of scores, but what if we want to combine these values into a table? We can do this using the summarise()
function described earlier in the tutorial.
r robot()
Code exampleThe code looks like this in general:
ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), ... )
The code feeds the data stored in [ice_tib]{.alt} into the summarise()
function. In this function new variables are created. The first variable we name median
and it stores the output of median(upload_day)
. In other words, we create a variable that we chose to call median
(left hand side of the command) that stores the value of the median of the variable upload_day (right-hand side of the command). Similarly, we store the mean upload day in a variable called mean and so on. We can add as many new variables as we wish, but for the last variable we create we need to omit the comma at the end of the line (like we do when using mutate()
).
r alien()
Alien coding challengeCreate a summary table containing the mean, median, IQR, variance and SD of the number of days since the original ice bucket video.
#start with the code example: ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), ... )
# Now add the command to get the IQR into the summarize() function ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), IQR = IQR(upload_day) )
# Now add the command to get the variance into the summarize() function ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), IQR = IQR(upload_day), variance = var(upload_day) )
# Now add the command to get the standard deviation into the summarize() function ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), IQR = IQR(upload_day), variance = var(upload_day), std_dev = sd(upload_day) ) # Check that you have commas at the end of every line except the last within summarise.
r robot()
Code exampleIf we want to store this table of summary statistics we can do so by assigning it to a new object. Let's say we want to assign it to an object called [upload_summary]{.alt} then we'd add upload_summary <-
to the beginning of the command
upload_summary <- ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), IQR = IQR(upload_day), variance = var(upload_day), std_dev = sd(upload_day) )
r user_visor()
Rounding values [(1)]{.alt}We can use the round()
function to round values, and we can use the kable()
function from knitr
to round an entire table of values.
r user_visor()
The round()
function [(1)]{.alt}The round()
function takes this form
round(x, digits = 0)
In which [x]{.alt} is the thing we want to round, and [digits]{.alt} is the number of decimal places we want (the default is 0, which returns a whole number). For example, to round the value 3.211420 to a whole number we would execute:
round(3.211420)
But to round it to two decimal places we would include the number two after a comma:
round(3.211420, 2)
We could also use a pipe to feed a mean, median or variance into the round function. For example, to round the value of the variance of upload_day to 3 decimal places we could execute:
r robot()
Code examplevar(ice_tib$upload_day) |> round(3)
r alien()
Alien coding challengeRound the standard deviation and mean of upload_day to 2 decimal places.
# SD sd(ice_tib$upload_day) |> round(2) # Mean mean(ice_tib$upload_day) |> round(2)
r alien()
Alien coding challengeIf we have a table that only contains numbers, then we can also apply round()
to the entire table. For example, in the previous section we created an object called [upload_summary]{.alt} using the following code in the code box.
upload_summary <- ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), IQR = IQR(upload_day), variance = var(upload_day), std_dev = sd(upload_day) ) upload_summary
upload_summary <- ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), IQR = IQR(upload_day), variance = var(upload_day), std_dev = sd(upload_day) ) upload_summary |> round(2)
Execute the code in the code box to view the table. Then edit the last line to pipe the table into round()
using
upload_summary |> round(2)
Execute this new code and you should find the values are now rounded to 2 decimal places.
r user_visor()
The kable()
function [(1)]{.alt}The code in the previous section works only because every column in the table contains numbers. If one of the columns contained text, or something other than a number, r rproj()
will throw an error because round()
doesn't know what to do with it. A more straightforward option for tables is to pipe them into the kable()
function from knitr()
. It has these main arguments
knitr::kable(x = my_table, digits = number_of_decimal_places, row.names = a_list_of_row_names, col.names = a_list_of_column_names, align, caption = "Write a caption" )
The function takes your table as the main input, but has options to set the maximum number of decimal places using [digits]{.alt}, you can add row and column names (in the case of column names you can provide a list to override the variable names in the tibble), you can set the alignment of each column, and include a caption. For now, let's add a caption and set the maximum number of decimal places to 2 using
upload_summary |> knitr::kable(digits = 2, caption = "Summary statistics for the ice bucket challenge." )
r alien()
Alien coding challengeLet's again use the code to create an object called [upload_summary]{.alt} but this time pipe it through kable()
.
upload_summary <- ice_tib |> dplyr::summarise( median = median(upload_day), mean = mean(upload_day), IQR = IQR(upload_day), variance = var(upload_day), std_dev = sd(upload_day) ) upload_summary |> knitr::kable(digits = 2, caption = "Summary statistics for the ice bucket challenge." )
Execute this new code and you should again find the values are rounded to 2 decimal places. However, kable()
has the advantage that it will intelligently round the columns, so it won't try to round columns with text in, and we have options to align and caption the table. So, other things being equal, use kable()
to round values in tables.
r user_visor()
Quick descriptives with datawizard
[(2)]{.alt}Creating a bespoke table of summary statistics is useful and helps us to practice our tidyverse
skills. However, if we want a quick set of descriptive statistics we can use the describe_distribution()
function from the datawizard
package. It has the following arguments:
datawizard::describe_distribution(x = my_data, select = NULL, exclude = NULL, centrality = "mean", dispersion = TRUE, iqr = TRUE, range = TRUE, quartiles = FALSE, include_factors = FALSE, ci = NULL)
You replace [my_data]{.alt} with the name of a tibble that you want to summarize, you can use [select]{.alt} to select specific variables within the tibble, or [exclude]{.alt} to exclude them (as you would if you used the select()
function from dplyr
). By default mean is displayed [centrality = "mean"]{.alt} but this can be set to "median", "MAP" (the maximum a posteriori probability, which we don't cover) or "all" to get all of them. The remaining arguments can be set to [TRUE]{.alt} or [FALSE]{.alt} to indicate which summary statistics to include, by default you get measures of dispersion (standard deviation for the mean and mean absolute deviation for the median), the interquartile range [iqr]{.alt}, and the range, but not the upper ($q_{0.75}$) and lower ($q_{0.25}$) quartiles or confidence intervals (which we'll learn about in discovr_03
).
r alien()
Alien coding challengeUsing the default options, get some descriptive statistics for the ice bucket challenge data.
#replace the xxx datawizard::describe_distribution(xxx)
#replace the xxx datawizard::describe_distribution(ice_tib) # now use kable() to round the valeues to 2dp
datawizard::describe_distribution(ice_tib) |> knitr::kable(digits = 2)
r rproj()
r rproj()
and r rstudio()
.r rstudio()
cheat sheets.r rstudio()
list of online resources.I'm extremely grateful to Allison Horst for her very informative blog post on styling learnr tutorials with CSS and also for sending me a CSS template file and allowing me to adapt it. Without Allison, these tutorials would look a lot worse (but she can't be blamed for my colour scheme).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.