adventr: summarizing data

library(forcats)
library(learnr)
library(tidyverse)
#library(ggplot2)
knitr::opts_chunk$set(echo = FALSE)
tutorial_options(exercise.cap = "Exercise")

hint_text <- function(text, text_color = "#E69F00"){
  hint <- paste("<font color='", text_color, "'>", text, "</font>", sep = "")
  return(hint)
}

# Read data files needed for the tutorial


ha_tib <- adventr::ha_dat

An Adventure in R: Summarizing data (introducing ggplot2)

Overview

This tutorial is one of a series that accompanies An Adventure in Statistics [@RN10163] by me, Andy Field. These tutorials contain abridged sections from the book, so there are some copyright considerations, but I offer them under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.^[Basically you can use this tutorial for teaching and non-profit activities but do not meddle with it or claim it as your own work.]

Story précis

Why a précis?

Because these tutorials accompany my book An Adventure in Statistics, which uses a fictional narrative to teach the statistics, some of the examples might not make sense unless you know something about the story. For those of you who don't have the book I begin each tutorial with a précis of the story. If you're not interested then fair enough - click past this section.

General context for the story

It is the future. Zach, a rock musician, and Alice, a geneticist, have been together since high school and live together in Elpis, the ‘City of Hope’.

Zach and Alice were born in the wake of the Reality Revolution which occurred after a Professor Milton Gray invented the Reality Prism – a transparent pyramid worn on the head – that brought honesty to the world. Propaganda and media spin became unsustainable, religions collapsed, advertising failed. Society could no longer be lied to. Everyone could know the truth about anything that they could look at. A gift, some said, to a previously self-interested, self-obsessed society in which the collective good had been eroded.

But also a curse. For, it soon became apparent that through this Reality Prism, people could no longer kid themselves about their own puffed-up selves as they could see what they were really like – by and large, pretty ordinary. And this caused mass depression. People lost faith in themselves. Artists abandoned their pursuits, believing they were untalented and worthless.

Zach and Alice have never worn a Reality Prism and have no concept of their limitations. They were born after the World Governance Agency (WGA) destroyed all Reality Prisms, along with many other pre-revolution technologies, with the aim of restoring community and well-being. However, this has not been straightforward and in this post-Prism world, society has split into pretty much two factions

Everyone has a star, a limitless space on which to store their digital world.

Zach and Alice are Clocktarians. Their technology consists mainly of:

Main Protagonists

How Zach's adventure begins

Alice has been acting strangely, on edge for weeks, disconnected and uncommunicative, as if she is hiding something and Zach can’t get through to her. When Zach arrives home from band practice, unusually, she is already home and listening to an old album that the two of them enjoyed together, back in a simpler, less complicated time in their relationship. During an increasingly testy evening, which involves a discussion with the Head about whether or not a Proteus causes brain cancer, Alice is interrupted by an urgent call which she takes in private. She returns looking worried and is, once again, distracted. She tells Zach that she has ‘a big decision to make’. Before going to bed, Zach asks her if he can help with the decision but she says he ‘already has’, thanking him for making ‘everything easier.’ He has no idea what she means and goes to sleep, uneasy.

On waking, Zach senses that something is wrong. And he is right. Alice has disappeared. Her clothes, her possessions and every photo of them together have gone. He can’t get hold of any of her family or friends as their contact information is stored on her Proteus, not on his diePad. He manages to contact the Beimeni Centre but is told that no one by the name of Alice Nightingale has ever worked there. He logs into their constellation but her star has gone. He calls her but finds that her number never existed. She has, thinks Zach, been ‘wiped from the planet.’ He summons The Head but he can’t find her either. He tells Zach that there are three possibilities: Alice doesn’t want to be found, someone else doesn’t want her to be found or she never existed.

Zach calls his friend Nick, fellow band member and fan of the WGA-installed Repositories, vast underground repositories of actual film, books, art and music. Nick is a Chipper – solely for the purpose of promoting the band using memoryBank – and he puts the word out to their fans about Alice missing.

Thinking as hard as he can, Zach recalls the lyrics of the song she’d been playing the previous evening. Maybe they are significant? It may well be a farewell message and the Head is right. In searching for clues, he comes across a ‘memory stone’ which tells him to read what’s on there. File 1 is a research paper that Zach can’t fathom. It’s written in the ‘language of science’ and the Head offers to help Zach translate it and tells him that it looks like the results of her current work were ‘gonna blow the world’. Zach resolves to do ‘something sensible’ with the report.

Zach doesn’t want to believe that Alice has simply just left him. Rather, that someone has taken her and tried to erase her from the world. He decides to find her therapist, Dr Murali Genari and get Alice’s file. As he breaks into his office, Dr Genari comes up behind him and demands to know what he is doing. He is shaking but not with rage – with fear of Zach. Dr Genari turns out to be friendly and invites Zach to talk to him. Together they explore the possibilities of where Alice might have gone and the likelihood, rating her relationship satisfaction, that she has left him. During their discussion Zach is interrupted by a message on his diePad from someone called Milton. Zach is baffled as to who he is and how he knows that he is currently discussing reverse scoring. Out of the corner of his eye, he spots a ginger cat jumping down from the window ledge outside. The counsellor has to go but suggests that Zach and ‘his new friend Milton’ could try and work things out.

Packages and data

Packages

This tutorial uses the following packages:

This package is automatically loaded within this tutorial. If you are working outside of this tutorial (i.e. in RStudio) then you need to make sure that the package has been installed by executing install.packages("package_name"), where package_name is the name of the package. If the package is already installed, then you need to reference it in your current session by executing library(package_name), where package_name is the name of the package.
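As a minimal sketch of the two steps just described, the code below installs a package only if it is missing and then loads it. I've used ggplot2 as the example because it is the focus of this tutorial; the requireNamespace() check is one common pattern for avoiding an unnecessary re-install, not something the tutorial requires.

```r
# Install ggplot2 only if it is not already available on this machine
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Reference the package in the current session so its functions can be used
library(ggplot2)
```

You only need to install a package once, but you need to run library() in every new R session.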

Data

This tutorial has the data files pre-loaded so you shouldn't need to do anything to access the data from within the tutorial. However, if you want to play around with what you have learnt in this tutorial outside of the tutorial environment (i.e. in a stand-alone RStudio session) you will need to download the data files and then read them into your R session. This tutorial uses the following file:

You can load the file in several ways:

ggplot2

In Chapter 3 of the book, Zach visits Alice's boss, Professor Catherine Pincus, in the hope that she might have clues as to where Alice has gone. When inspecting Alice's file, which he obtained from her counsellor, Zach found completed questionnaires (the relationship assessment scale, RAS) measuring her relationship satisfaction. He is convinced that Alice has left him because she is dissatisfied with their relationship. Prof. Pincus helps Zach to explore this idea by looking at data from a study by [@RN9017] who investigated what characteristics 1,913 teenagers (aged 13-18) valued in relationship partners. She gave adolescents a list of 21 characteristics of a future partner (reliable, honest, kind, attractive, healthy, sense of humour, gets along with friends, interesting personality, caring, romantic, flexible, intelligent, ambitious, easy going, educated, creative, wants to have children, high salary, good family, has relationship experience, and religious) and asked them to rate each one along a 10-point scale ranging from 1 (not important at all) to 10 (very important). A sub-sample of these scores is in a tibble called ha_tib. Use the code box to inspect this tibble.


ha_tib

You should see that the tibble contains 13 variables:

Catherine produced a histogram of the scores for the variables hi_salary, kind, humour and ambitious.

The best package for producing graphs in R is ggplot2, which automatically installs as part of the tidyverse package. ggplot2 is great because it is so versatile, but the price for its versatility is that it is extremely complicated. In my other book [@RN4832] I dedicate an entire chapter to it and still only scratch the surface. Through these tutorials we will learn by doing rather than me trying to explain every aspect of ggplot2.

Figure 1 shows how ggplot2 works. You begin with some data and you initialize a plot with the ggplot() function, within which you name the tibble or data frame that you want to use. You then set a bunch of aesthetics using the aes() function. Primarily, you name the variable you want plotted on the x-axis, the variable for the y-axis and any aesthetics that you want to set for the plot using a variable (for example, you might want to vary the colour of bars by levels of a variable). You then add layers to the plot that control what the plot shows and the visual properties. For example, you might add bars to show group means, then layer error bars on top. There are various key concepts that relate to controlling aspects of the layers of the plot:

This is a lot to take in, so consider this a reference point (rather than expecting to remember all of the above) and we'll get a feel for ggplot2 by doing examples as we progress through the module. You may also find the official reference guide helpful.

Figure 1: the anatomy of *ggplot()*
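To make the anatomy concrete, here is a small sketch of the bars-plus-error-bars example mentioned above. The data frame toy_means and its values are invented purely for illustration (they are not from the book), and the particular geoms are just one way to build this kind of plot.

```r
library(ggplot2)

# Hypothetical summary data: mean rating and standard error for three groups
toy_means <- data.frame(
  group = c("A", "B", "C"),
  mean  = c(6.2, 7.5, 8.1),
  se    = c(0.4, 0.3, 0.5)
)

# 1. Initialize the plot: name the data and map variables to aesthetics
anatomy_plot <- ggplot(toy_means, aes(x = group, y = mean, fill = group)) +
  # 2. First layer: bars showing the group means
  geom_col() +
  # 3. Second layer: error bars drawn on top of the bars
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.2) +
  # 4. Adjust the labels and the overall appearance
  labs(x = "Group", y = "Mean rating") +
  theme_bw()

anatomy_plot
```

Each + adds one more piece to the plot, which is the pattern used throughout the rest of this tutorial.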

Histograms

A basic histogram using ggplot2

Let's start by plotting a histogram of the scores for humour. To initiate the plot we use the ggplot() function, which at its simplest has this general form:

my_plot <- ggplot2::ggplot(my_data, aes(x = variable_for_x_axis, y = variable_for_y_axis))

This command creates a new graph object called my_plot, and within the ggplot() function I have told ggplot2 to use the tibble called my_data, and I name the variables to be plotted on the x (horizontal) and y (vertical) axis within the aes() function. We can also set other aesthetic values at this top level. For example, if we had a variable called sex representing the biological sex of participants and we wanted our layers/geoms to display data from males and females in different colours then we could execute:

my_plot <- ggplot2::ggplot(my_data, aes(x = variable_for_x_axis, y = variable_for_y_axis, colour = sex))

In doing so any subsequent geom that we define will take on the aesthetic of producing different colours for males and females assuming that this is a valid aesthetic for the particular geom. If you set a general aesthetic like this, you can override it within the specific geom function. To plot the ratings given to the characteristic of being humorous we could execute:

humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))

This command creates a new graph object called humour_hist using the tibble called ha_tib, and plotting the variable humour on the x axis (for a histogram we don't need to specify y). This command tells ggplot2 what to plot, but not how to plot it. We need to add a geom to display the data. If we want a histogram we could execute:

humour_hist + geom_histogram()

This command tells R to take the object humour_hist (which we created above) and add (+) a layer to it using geom_histogram().
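Returning to the earlier point about setting an aesthetic at the top level and overriding it within a geom, the sketch below shows both behaviours. The data frame toy_sex is hypothetical (ha_tib may not contain a sex variable), and the choice of geom_point() and geom_line() is only for illustration.

```r
library(ggplot2)

# Hypothetical data: two ratings for six people, with their biological sex
toy_sex <- data.frame(
  humour = c(6, 7, 8, 7, 9, 10),
  kind   = c(7, 8, 8, 9, 9, 10),
  sex    = c("Female", "Male", "Female", "Male", "Female", "Male")
)

# colour = sex is set at the top level, so every geom inherits it
sex_plot <- ggplot(toy_sex, aes(x = humour, y = kind, colour = sex)) +
  # Points inherit the mapping: one colour per level of sex
  geom_point(size = 3) +
  # This geom overrides the inherited colour with a single fixed value
  geom_line(colour = "grey40")

sex_plot
```

The points are coloured by sex, whereas both lines are grey because the value set inside geom_line() takes priority over the top-level aesthetic.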

Tip: for more complex plots you will add lots of different layers, so it is helpful to structure the command with a different layer on each line. I tend to specify a layer then include the + symbol, then hit return to specify the next layer. RStudio will indent the lines for you making it easier to read. See the example in the code box below.

The full code is in the box below, execute it to see what happens.

humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram()

By default ggplot2 divides the range of the data into 30 bins, so each bin is roughly 1/30th of the width of the data. You can override this default by specifying binwidth = within the geom_histogram() function. In the code box above, type binwidth = 1 into the brackets after geom_histogram and execute the code. Note how the histogram changes. Feel free to try other binwidths, but 1 makes sense because responses could only be whole numbers.
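If you are working outside of the tutorial, the sketch below compares the default binning with binwidth = 1. It uses a small made-up data frame, toy_data, as a stand-in for ha_tib, so the shape of the histogram will not match the real data.

```r
library(ggplot2)

# Hypothetical whole-number ratings standing in for the humour variable
toy_data <- data.frame(
  humour = c(5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10)
)

toy_hist <- ggplot(toy_data, aes(humour))

# Default binning: ggplot2 prints a message suggesting that you pick a binwidth
toy_hist + geom_histogram()

# One bin per whole-number response
toy_hist + geom_histogram(binwidth = 1)
```

With the default you should see narrow bars separated by gaps, because whole-number responses leave most of the 30 bins empty.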

Changing the colours of the bars

We can change the colour of the bars by including fill = within the geom_histogram() function. For example, we could specify the colour red as:

geom_histogram(binwidth = 1, fill = "red")

Try this in the code box below and run the code.

Tip: Note that options within a function are separated by a comma. In this example, 'binwidth = 1, fill = "red"' will work but 'binwidth = 1 fill = "red"' (note the comma is missing) would throw an error.

You can also specify any Hex colour code. For example, the shade of blue defined by hex code "#56B4E9" is good for colour blind people, so we could specify this:

geom_histogram(binwidth = 1, fill = "#56B4E9")

Try this below by replacing "red" with "#56B4E9" and running the code. Play around with other hex codes.

humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1)

You can also make filled objects semi-transparent by using alpha = where alpha is a proportion (i.e., between 0 and 1). For example, if you want the histograms to have 20% opacity you could include alpha = 0.2 in the geom_histogram() function (remembering to separate it from other options with a comma). In the code box above try setting 50% opacity by editing the geom to be:

geom_histogram(binwidth = 1, fill = "#56B4E9", alpha = 0.5)

Editing axis labels

Run the code below to view the histogram so far. Let's add a layer that changes the y-axis label to 'Frequency' and the x-axis label to 'Importance of humour (1-10)'. We can do this by adding a + after the geom_histogram() function and on the next line typing:

labs(y = "Frequency", x = "Importance of humour (1-10)")

Do this and run the code again.

# Exercise: starting code
humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9")

# Solution: axis labels added
humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  labs(y = "Frequency", x = "Importance of humour (1-10)")

Editing axis limits and breaks

Run the code below to view the histogram so far.

At the moment the x-axis is scaled from 5 to 10. Let's show the full range of the scale. To do this we need to set the limits of the x-axis using the coord_cartesian() function:

coord_cartesian(xlim = c(begin, end), ylim = c(begin, end))

You set the limits of the x-axis using xlim and the limits of the y-axis with ylim. After each you specify numbers representing the start and end values for the axis. You need to collect these values into a single object by enclosing them in c(). So, to set the x-axis to begin at 1 and end at 10 (the lowest and highest points of the scale) add a + after the geom_histogram() function and on the next line type:

coord_cartesian(xlim = c(1, 10))

Add this line to the code box below and run the code to see how the limits of the x-axis change.

Tip: You may notice that you can also set limits using scale_x_continuous(). However, if you do this then data outside of the limits you set are discarded. Therefore, most of the time we would use coord_cartesian() because it leaves the underlying data alone.
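The difference between the two approaches is easiest to see by counting how many scores end up in the bins. The sketch below does this with a hypothetical data frame, toy_data, that includes two low scores; layer_data() is a ggplot2 helper that returns the values computed for a layer.

```r
library(ggplot2)

# Hypothetical ratings, including two low scores (2 and 3)
toy_data <- data.frame(
  humour = c(2, 3, 5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10)
)

toy_hist <- ggplot(toy_data, aes(humour)) +
  geom_histogram(binwidth = 1, fill = "#56B4E9")

# Zooming with coord_cartesian(): all 17 scores are still counted
zoomed <- toy_hist + coord_cartesian(xlim = c(5, 10))

# Limiting the scale: scores below 5 are dropped before binning (with a warning)
trimmed <- toy_hist + scale_x_continuous(limits = c(5, 10))

sum(layer_data(zoomed)$count)   # total number of scores counted in the bins
sum(layer_data(trimmed)$count)  # smaller, because 2 and 3 were discarded
```

Anything computed from the data (bin counts here, but also means or fitted lines in other plots) can therefore change when you set limits on the scale, but not when you zoom with coord_cartesian().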

Now let's add a layer that changes what breaks are displayed on the x-axis. At the moment the x-axis displays 2.5, 5, 7.5, and 10. Let's change this to display the numbers 1 to 10. In R we can generate this sequence of numbers by using :. For example, 1:5 will return 1, 2, 3, 4, 5. The x-axis displays a continuous variable (humour) so we can use the function scale_x_continuous() to change aspects of this axis. In particular, the breaks = option will override the default breaks along the axis.
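As a quick illustration of generating sequences, the lines below can be run in the console. The seq() function is a base R alternative that I mention only because it allows step sizes other than 1; the tutorial itself uses the colon operator.

```r
# The colon operator builds a sequence of whole numbers
1:5

# The sequence needed for the axis breaks in this tutorial
breaks_1_to_10 <- 1:10
breaks_1_to_10

# seq() does the same job but lets you choose the step size
seq(1, 10, by = 1)
```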

In the code box below, type + after the coord_cartesian() function and on the next line type:

scale_x_continuous(breaks = 1:10)

Execute this code and pay attention to the numbers displayed along the x-axis.

# Exercise: starting code
humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  labs(y = "Frequency", x = "Importance of humour (1-10)")

# Solution: axis limits and breaks added
humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  labs(y = "Frequency", x = "Importance of humour (1-10)") +
  coord_cartesian(xlim = c(1, 10)) +
  scale_x_continuous(breaks = 1:10)

Changing theme

Run the code below to view the histogram so far. Let's apply the built-in theme theme_bw() (which is probably the most useful for creating publication style images). To do this type + after the scale_x_continuous() function and on the next line type:

theme_bw()

Add this line to the code box below and run the code to see the theme change:

# Exercise: starting code
humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  labs(y = "Frequency", x = "Importance of humour (1-10)") +
  coord_cartesian(xlim = c(1, 10)) +
  scale_x_continuous(breaks = 1:10)

# Solution: theme added
humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  labs(y = "Frequency", x = "Importance of humour (1-10)") +
  coord_cartesian(xlim = c(1, 10)) +
  scale_x_continuous(breaks = 1:10) +
  theme_bw()

Now try changing theme_bw() to theme_minimal(), theme_classic() and theme_dark() to see what effect it has on the plot.

Other variables

To see histograms of the other variables (e.g., hi_salary, kind and ambitious) we can simply replace humour in the code box above with these variables. Try this and run the code to view the resulting histograms.
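If you find yourself replacing the variable name by hand several times, one option is to wrap the plotting code in a small function. The sketch below is only an illustration: rating_hist() is a hypothetical helper (it is not part of ggplot2 or adventr), and toy_data is a made-up stand-in for ha_tib with invented values.

```r
library(ggplot2)

# Hypothetical stand-in for ha_tib with three of its rating variables
toy_data <- data.frame(
  hi_salary = c(3, 4, 5, 5, 6, 6, 7, 8),
  kind      = c(7, 8, 8, 9, 9, 10, 10, 10),
  ambitious = c(5, 6, 6, 7, 7, 8, 8, 9)
)

# Hypothetical helper: draws this tutorial's histogram for one named variable
rating_hist <- function(data, variable) {
  # .data[[variable]] looks up the column whose name is stored in 'variable'
  ggplot(data, aes(.data[[variable]])) +
    geom_histogram(binwidth = 1, fill = "#56B4E9") +
    labs(y = "Frequency", x = paste0("Importance of ", variable, " (1-10)")) +
    coord_cartesian(xlim = c(1, 10)) +
    scale_x_continuous(breaks = 1:10) +
    theme_bw()
}

# One histogram per variable, stored in a named list
rating_plots <- lapply(
  c(hi_salary = "hi_salary", kind = "kind", ambitious = "ambitious"),
  function(v) rating_hist(toy_data, v)
)

# Display one of the plots
rating_plots$kind
```

With the real data you would pass ha_tib instead of toy_data, for example rating_hist(ha_tib, "kind").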

Frequency polygons

A basic frequency polygon

We can plot frequency polygons in the same way but replacing geom_histogram() with geom_freqpoly(). The code in the exercise has done this. Note a few changes:

humour_poly <- ggplot2::ggplot(ha_tib, aes(humour))
humour_poly +
  geom_freqpoly(binwidth = 1, colour = "#56B4E9") +
  labs(y = "Frequency", x = "Importance of humour (1-10)") +
  coord_cartesian(xlim = c(1, 10)) +
  scale_x_continuous(breaks = 1:10) +
  theme_bw()

Run the code to see what the polygon looks like.

Line size

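We can change the size of the line by including the size = option in the geom_freqpoly() function (from ggplot2 3.4.0 onwards the preferred name for this option is linewidth =; size = still works for lines but produces a deprecation warning). Remember that options within a function need to be separated by commas, so in the code box above add size = 2 to the geom_freqpoly() function so that it reads: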

geom_freqpoly(binwidth = 1, colour = "#56B4E9", size = 2)

Try some other sizes if you like.

Line style

We can change the style of the line by including the linetype = option in the geom_freqpoly() function. Line types can either be defined using numbers (0 = blank, 1 = solid (default), 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash) or as text ("blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash"). For example, we could change the line to a dashed line by adding either linetype = 2 or linetype = "dashed" to the geom_freqpoly() function. Remember that options within a function need to be separated by commas, so in the code box above edit the geom_freqpoly() function to read:

geom_freqpoly(binwidth = 1, colour = "#56B4E9", linetype = 2) or geom_freqpoly(binwidth = 1, colour = "#56B4E9", linetype = "dashed")

Try some other styles if you like.

We can of course specify both the style and size by adding both options and separating them with a comma. In the code box above edit the geom_freqpoly() function to read:

geom_freqpoly(binwidth = 1, colour = "#56B4E9", size = 2, linetype = 2)

Try out different combinations of styles and sizes.
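Here is a self-contained sketch of these line options that you can run outside of the tutorial. It uses a hypothetical data frame, toy_data, in place of ha_tib. Note one assumption: the last plot uses linewidth = rather than size =, which needs ggplot2 3.4.0 or later; on older versions use size = as in the text above.

```r
library(ggplot2)

# Hypothetical whole-number ratings standing in for the humour variable
toy_data <- data.frame(
  humour = c(5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10)
)

toy_poly <- ggplot(toy_data, aes(humour))

# linetype = 2 and linetype = "dashed" request the same line style
poly_number <- toy_poly +
  geom_freqpoly(binwidth = 1, colour = "#56B4E9", linetype = 2)
poly_name <- toy_poly +
  geom_freqpoly(binwidth = 1, colour = "#56B4E9", linetype = "dashed")

poly_number
poly_name

# A thicker dashed line (linewidth = assumes ggplot2 3.4.0 or later)
toy_poly +
  geom_freqpoly(binwidth = 1, colour = "#56B4E9", linewidth = 2, linetype = 2)
```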

Adding geoms

We can layer several geoms onto the same plot. Imagine, for example, we wanted to plot the frequency polygon on top of the original histogram. We can do this by adding the geom_freqpoly() function to our original histogram. The code for the histogram is below. Try adding

geom_freqpoly(binwidth = 1) +

underneath the geom_histogram() function (the + is there because there are commands on the following lines). Now try moving the same command to the line above the geom_histogram() function. What happens?

You should see the frequency polygon disappear behind the histogram. This illustrates the idea of layering a plot. ggplot2 processes the functions in order. If geom_freqpoly(binwidth = 1) comes before geom_histogram() then ggplot2 draws the polygon first and then layers the histogram on top whereas if geom_histogram() comes before geom_freqpoly(binwidth = 1) then the histogram is drawn first and the polygon is layered on top of it.
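The two orderings can be compared side by side with the sketch below, which again uses a hypothetical data frame (toy_data) rather than ha_tib. The object names are mine and chosen only to describe what each plot shows.

```r
library(ggplot2)

# Hypothetical whole-number ratings standing in for the humour variable
toy_data <- data.frame(
  humour = c(5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10)
)

base_plot <- ggplot(toy_data, aes(humour))

# Histogram first, polygon second: the line is drawn on top of the bars
poly_on_top <- base_plot +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  geom_freqpoly(binwidth = 1)

# Polygon first, histogram second: the bars are drawn over the line
poly_behind <- base_plot +
  geom_freqpoly(binwidth = 1) +
  geom_histogram(binwidth = 1, fill = "#56B4E9")

poly_on_top
poly_behind
```

Both plots contain exactly the same layers; only the order in which they are drawn differs.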

You can use what you know to change the frequency polygon's line to be a pleasant red ("#DF4738") and size 2.

# Exercise: starting code
humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  labs(y = "Frequency", x = "Importance of humour (1-10)") +
  coord_cartesian(xlim = c(1, 10)) +
  scale_x_continuous(breaks = 1:10) +
  theme_bw()

# Solution: frequency polygon layered on top of the histogram
humour_hist <- ggplot2::ggplot(ha_tib, aes(humour))
humour_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  geom_freqpoly(binwidth = 1, colour = "#DF4738", size = 2) +
  labs(y = "Frequency", x = "Importance of humour (1-10)") +
  coord_cartesian(xlim = c(1, 10)) +
  scale_x_continuous(breaks = 1:10) +
  theme_bw()

As we work through the module we'll learn how to do other things with ggplot2 but you should, hopefully, now have an idea of how the package works.

Saving plots

You can use the ggsave() function to save plots (this is what I generally use) but most people find it easier to click on Export in the pane of RStudio where the plot is displayed. This activates a drop-down menu that lets you save the current plot as an image or a PDF file or copy it to the clipboard.

Saving to an image or PDF will open a fairly self-explanatory dialog box in which you specify the name of the file, the directory in which to save it, the image format and the size of the image. If you use a Notebook or R Markdown file you don't need to worry about saving images.
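If you would rather save plots with code, the sketch below shows one way to use ggsave(). The data frame toy_data, the file name and the dimensions are all arbitrary choices for illustration, and the plot is written to a temporary folder so that nothing in your own project is overwritten.

```r
library(ggplot2)

# Hypothetical whole-number ratings standing in for the humour variable
toy_data <- data.frame(
  humour = c(5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10)
)

toy_hist <- ggplot(toy_data, aes(humour)) +
  geom_histogram(binwidth = 1, fill = "#56B4E9") +
  theme_bw()

# Hypothetical file location inside R's temporary folder
plot_file <- file.path(tempdir(), "humour_hist.pdf")

# The file extension sets the format; width and height use the stated units
ggsave(filename = plot_file, plot = toy_hist, width = 15, height = 10, units = "cm")

# Check that the file was written
file.exists(plot_file)
```

Changing the extension in the file name (for example to .png) changes the format of the saved file. If you leave out plot =, ggsave() saves the most recently displayed plot.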

Other resources

Statistics

R

References




adventr documentation built on July 1, 2020, 11:50 p.m.