knitr::opts_chunk$set(
  echo = TRUE,
  fig.retina = 3
)

Questions

How can I create publication-quality graphics in R?

Objectives

To be able to use ggplot2 to generate publication quality graphics.

To understand the basic grammar of graphics, including the aesthetics and geometry layers, adding statistics, transforming scales, and colouring or panelling by groups.

Plotting the data is one of the best ways to quickly explore it and generate hypotheses about various relationships between variables.

There are several plotting systems in R, but today we will focus on ggplot2, which implements the grammar of graphics: a coherent system for describing the components that make up a visual representation of data. For more information on the principles and thinking behind the ggplot2 graphics system, please refer to "A Layered Grammar of Graphics" by Hadley Wickham (@hadleywickham).

The advantage of ggplot2 is that it allows R users to create publication quality graphics with a few lines of code. ggplot2 has a large user base and is constantly developed and extended by the community.

Getting started

ggplot2 is a core member of the tidyverse family of packages. Installing and loading the package of the same name (tidyverse) will load all of the tidyverse packages we need for this workshop. Let's get started!

# install.packages("tidyverse")
# install.packages("palmerpenguins")
library(tidyverse)

If the code above produces the error "there is no package called ‘tidyverse’", uncomment the relevant install.packages() line (remove the #) and run it before you load the library. You only need to install a package once, but you have to reload it with the library() command every time you restart R.
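As an optional alternative to uncommenting lines by hand, the check can be scripted. The following is only a minimal sketch: the variable names needed, is_installed and not_installed are made up for illustration, and it assumes an internet connection and a configured CRAN mirror.

```r
# Packages used in this lesson
needed <- c("tidyverse", "palmerpenguins")

# requireNamespace() returns FALSE (rather than stopping with an error)
# when a package is not installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

# Install only the packages that are missing
if (length(not_installed) > 0) {
  install.packages(not_installed)
}
```

After this has run once, library(tidyverse) should load without the "no package" error.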

Today we will be working with the penguins dataset from the palmerpenguins package. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

First let's load the data.

penguins <- palmerpenguins::penguins

Here, we grab the data called penguins from the palmerpenguins package and assign (<-) it to an object in our R environment that we also call penguins. Notice how it appears in the environment pane in RStudio. You can look at the content of the penguins data frame by typing penguins either in an R chunk or in the console. A data frame is a rectangular collection of data, where variables are organized as columns and observations are listed as rows.

penguins

The dataset contains the following fields: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year.

More information about the package and the data is available in help. Type ?penguins in console, located in the bottom panel of your RStudio, or type penguins in the search field of the Help tab of the bottom-right RStudio panel. Whenever you are unsure about anything in R, it is a good idea to check out the help file using one of the two methods described above.
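Besides the help file, a few base R functions give a quick overview of a data frame. This is a small sketch of one way to inspect penguins; nothing here is specific to this lesson.

```r
penguins <- palmerpenguins::penguins

# Names of the columns (variables)
names(penguins)

# Number of rows (observations) and columns (variables)
dim(penguins)

# Storage type of each column, e.g. factor for categories, numeric/integer for numbers
sapply(penguins, class)
```

Note that year is stored as a number rather than as a category, which matters when it is mapped to colour later in the lesson.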

Creating the first plot

Here's a question that we would like to answer using penguins data: Do penguins with deep beaks also have long beaks? This might seem like a silly question, but it gets us exploring our data.

To plot penguins, run the following code in the R-chunk or in console. The following code will put bill_depth_mm on the x-axis and bill_length_mm on the y-axis:

ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_depth_mm,
                  y = bill_length_mm)
  )

Note that we split the function call across several lines. In R, a function has a name followed by parentheses, and inside the parentheses we place the information the function needs to run. Here we use two main functions, ggplot() and geom_point(). To save screen space, we have placed each function on its own line and also split the arguments across several lines. How you do this is up to you; R does not enforce any particular layout. We will use the tidyverse coding style throughout this course, to be consistent and to save space on the screen. The plus sign indicates that the plot is not finished yet and that the next line should be interpreted as an additional layer on top of the preceding ggplot() call. In other words, when a ggplot2 command spans several lines, the + sign goes at the end of a line, not at the beginning of the next one.

The plot does not show an obvious positive relationship between bill depth and bill length. If anything, the cloud of points slopes slightly downwards.

Does this graph confirm or disprove your initial hypothesis about the relationship between these variables?

Note that in order to create a plot with ggplot2, you start your command with the ggplot() function. It creates an empty coordinate system and initializes the dataset to be used in the graph (supplied as the first argument to ggplot()). To create a graphical representation of the data, we add one or more layers to this otherwise empty graph. Functions starting with the prefix geom_ create a visual representation of the data. In this case we added scattered points using the geom_point() function. There are many geoms in ggplot2, some of which we will meet in this lesson.

geom_ functions map variables from the previously defined dataset to aesthetic elements of the graph, such as axes, shapes or colours. The first argument of any geom_ function, mapping, is where these mappings are specified, wrapped in the aes() (short for aesthetics) function. In this case, we mapped the bill_depth_mm and bill_length_mm variables from the penguins dataset to the x- and y-axis, respectively (using the x and y arguments of aes()).
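The idea that ggplot() starts an empty plot and each geom_ function adds a layer can be checked directly. The sketch below stores the plots in objects (empty_plot and scatter_plot are illustrative names) and counts their layers through the layers element of the plot object.

```r
library(ggplot2)
penguins <- palmerpenguins::penguins

# ggplot() on its own only registers the data and sets up an empty plotting area
empty_plot <- ggplot(data = penguins)
length(empty_plot$layers)

# Adding a geom_ function adds one layer that draws the mapped variables
scatter_plot <- empty_plot +
  geom_point(mapping = aes(x = bill_depth_mm,
                           y = bill_length_mm))
length(scatter_plot$layers)

# Typing the object name draws the plot
scatter_plot
```

Storing a plot in an object like this is also handy when you want to try several different layers on top of the same starting point.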

Challenge 1. {.tabset}

Assignment

Room: plenary
Duration: 5 minutes

1a: How has bill length changed over time? What do you observe?

Hint: the penguins dataset has a column called year, which should appear on the x-axis.

1b: Try a different geom_ function called geom_jitter. It will spread the points apart a little bit using random noise.

Solution

# 1a
ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = year, 
                  y = bill_length_mm)
  )

# 1b
ggplot(data = penguins) +
  geom_jitter(
    mapping = aes(x = year, 
                  y = bill_length_mm)
  )

Mapping data pt.1

What if we want to combine the earlier graphs and show the relationship between three variables in the same graph? It turns out we don't necessarily need a third geometrical dimension: we can employ colour.

The following graph maps the island variable from the penguins dataset to the colour aesthetic of the plot. Let's take a look:

ggplot(data = penguins) + 
  geom_jitter(
    mapping = aes(x = bill_depth_mm, 
                  y = bill_length_mm, 
                  colour = island)
  )

Challenge 2. {.tabset}

Assignment

Room: break-out
Duration: 10 minutes

2a: What happens if you map colour to year instead? Is the graph still useful? Why or why not?

2b: What if you map colour aesthetic to species? What has changed? How is year different from species or island? What is the limitation of the colour aesthetic, when used to visualize different types of data?

Solution

# 2a
ggplot(data = penguins) + 
  geom_jitter(
    mapping = aes(x = bill_depth_mm, 
                  y = bill_length_mm,
                  colour = year)
  )

# 2b
ggplot(data = penguins) + 
  geom_jitter(
    mapping = aes(x = bill_depth_mm, 
                  y = bill_length_mm,
                  colour = species)
  )
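As a possible follow-up to 2a (a sketch, not part of the original solution): year is stored as a number, so ggplot2 gives it a continuous colour gradient. Wrapping it in factor() inside aes() makes ggplot2 treat each year as a separate category with its own colour. The object name year_as_category is just for illustration.

```r
library(ggplot2)
penguins <- palmerpenguins::penguins

# factor(year) turns the numeric year into a categorical variable,
# so each year gets a distinct colour instead of a shade on a gradient
year_as_category <- ggplot(data = penguins) +
  geom_jitter(
    mapping = aes(x = bill_depth_mm,
                  y = bill_length_mm,
                  colour = factor(year))
  )

year_as_category
```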

Mapping data pt.2

There are other aesthetics that can come in handy. One of them is size. The idea is that we can vary the size of the data points to show another variable, such as year. Let's look at four dimensions at once!

ggplot(data = penguins) + 
  geom_jitter(
    mapping = aes(x = bill_depth_mm, 
                  y = bill_length_mm, 
                  colour = species, 
                  size = year)
  )

It might be even better to try another type of aesthetic, like shape, for categorical data like species.

ggplot(data = penguins) + 
  geom_jitter(
    mapping = aes(x = bill_depth_mm, 
                  y = bill_length_mm, 
                  colour = species, 
                  shape = species)
  )

Playing around with different aesthetic mappings until you find something that really makes the data "pop" is a good idea. A plot is rarely perfect on the first try; we all try different configurations until we find the one we like.

Setting values

Until now, we have explored different aesthetic properties of a graph by mapping them to variables. What if you want to give all data points the same colour or shape? In that case the colour or shape is no longer mapped to any data, so you need to supply it to the geom_ function as a separate argument (outside of the mapping). This is called "setting" in the ggplot2 world. We "map" aesthetics to data columns, or we "set" single values outside aes() that apply to the entire geom. Here's our initial graph with all points coloured blue.

ggplot(data = penguins) + 
  geom_point(
    mapping = aes(x = bill_depth_mm, 
                  y = bill_length_mm),
    colour = "blue"
  )

Once more, observe that the colour is no longer mapped to any variable from the penguins dataset and applies equally to all data points. It therefore sits outside the mapping argument and is not wrapped in the aes() function. Note that colours set this way are supplied as character strings (in quotes).
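A common slip is to put the fixed colour inside aes() by mistake. The sketch below contrasts the two; the object names mapped_plot and set_plot are illustrative only.

```r
library(ggplot2)
penguins <- palmerpenguins::penguins

# Mapping by mistake: "blue" is treated as data (a variable with one value),
# so ggplot2 picks a colour from its default scale and adds a legend
mapped_plot <- ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_depth_mm,
                  y = bill_length_mm,
                  colour = "blue")
  )

# Setting: colour is a fixed value outside aes(), so all points are drawn blue
set_plot <- ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_depth_mm,
                  y = bill_length_mm),
    colour = "blue"
  )

mapped_plot
set_plot
```

If a plot shows an unexpected colour together with a legend entry that reads "blue", this is usually the cause.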

Challenge 4. {.tabset}

Assignment

Room: break-out
Duration: 10 minutes

4a: Change the transparency (alpha) of the data points by year. Hint: alpha takes a value from 0 (transparent) to 1 (solid).

4b: Move the transparency outside the aes() and set it to 0.7. What are the benefits of each of these methods?

Solution

# 4a
ggplot(data = penguins) + 
  geom_point(
    mapping = aes(x = bill_depth_mm, 
                  y = bill_length_mm, 
                  alpha = year)
  )

# 4b
ggplot(data = penguins) + 
  geom_point(
    mapping = aes(x = bill_depth_mm, 
                  y = bill_length_mm),
    alpha = 0.7
  )

Controlling the transparency can be a great way to "mute" the visual effect of certain data while still keeping it visible. It's a great tool when you have many data points or when you combine several geoms, as we will see soon.

Geometrical objects

Next, we will consider different options for geoms. By using different geom_ functions, you can highlight different aspects of the data.

A useful geom function is geom_boxplot(). It adds a layer with a "box and whiskers" plot illustrating the distribution of values within categories. The following chart breaks down bill length by species. The box spans the first and third quartiles (the 25th and 75th percentiles) and the middle bar marks the median. By default, each whisker extends to the most extreme value within 1.5 times the interquartile range of the box edge. Points beyond the whiskers are treated as outliers and shown individually.

ggplot(data = penguins) + 
  geom_boxplot(
    mapping = aes(x = species, 
                  y = bill_length_mm)
  )
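The numbers behind a single box can also be worked out by hand. The sketch below does so for Adelie penguins, assuming ggplot2's documented default of drawing whiskers to the most extreme points within 1.5 times the interquartile range (IQR) of the box edges. All object names are illustrative.

```r
penguins <- palmerpenguins::penguins

# Bill lengths of one species, with missing values dropped
adelie <- penguins$bill_length_mm[penguins$species == "Adelie"]
adelie <- adelie[!is.na(adelie)]

# Box: first quartile, median, third quartile
quartiles <- quantile(adelie, probs = c(0.25, 0.5, 0.75))
iqr <- quartiles[["75%"]] - quartiles[["25%"]]

# Limits beyond which a point counts as an outlier
lower_fence <- quartiles[["25%"]] - 1.5 * iqr
upper_fence <- quartiles[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whisker_ends <- range(adelie[adelie >= lower_fence & adelie <= upper_fence])

# Anything outside the fences would be drawn as an individual point
outliers <- adelie[adelie < lower_fence | adelie > upper_fence]

quartiles
whisker_ends
outliers
```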

Layers can be added on top of each other. In the following graph we will place the boxplots over jittered points to see the distribution of the individual observations, including outliers, more clearly. We can map two aesthetic properties to the same variable: here species is mapped to both the x-axis and the colour of the points.

ggplot(data = penguins) + 
  geom_jitter(
    mapping = aes(x = species, 
                  y = bill_length_mm, 
                  colour = species)
  ) +
  geom_boxplot(
    mapping = aes(x = species,
                  y = bill_length_mm)
  )

Now, this was slightly inefficient due to duplication of code: we had to specify the same mappings for two layers. To avoid this, you can move the arguments that the geom_ functions have in common to the main ggplot() function. Every layer will then "inherit" the arguments specified in the "parent" function. The next plot does this, this time grouping the penguins by island.

ggplot(data = penguins,
       mapping = aes(x = island, 
                     y = bill_length_mm)
) + 
  geom_jitter(aes(colour = island)) +
  geom_boxplot(alpha = .6)

You can still add layer-specific mappings or other arguments by specifying them within individual geoms. Here, we've set the transparency of the boxplot to .6, so we can see the points behind it, and also mapped colour to island in the points. We would recommend building each layer separately and then moving common arguments up to the "parent" function.

We can add a linear model to summarise the relationship between bill depth and bill length. Notice that we added a separate argument to the geom_smooth() function to specify the type of model we want ggplot2 to build from the data (method = "lm" for a linear model). The geom_smooth() function also draws a confidence interval around the fitted line (the shaded grey area), which shows how uncertain the fit is. For more information on the available models, please refer to the help (by typing ?geom_smooth).

ggplot(data = penguins, 
       mapping = aes(x = bill_depth_mm, 
                     y = bill_length_mm)
) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
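If the shaded band gets in the way, geom_smooth() has an se argument that switches it off. A minimal sketch (line_only is an illustrative object name):

```r
library(ggplot2)
penguins <- palmerpenguins::penguins

# se = FALSE draws only the fitted line, without the confidence band
line_only <- ggplot(data = penguins,
                    mapping = aes(x = bill_depth_mm,
                                  y = bill_length_mm)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)

line_only
```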

Challenge 5. {.tabset}

Assignment

Room: Break-out
Duration: 10 minutes

5a. Modify the plot so that the points are coloured by species, but there is a single regression line.

5b. Add a regression line to the plot that plots one line for each species, while also plotting one across all species. Hint: Add another geom!

Solution

Layers inherit the mappings given in the "parent" ggplot() function. If colour were mapped there, geom_smooth() would inherit it too and fit a separate model for each group. To get a single linear model, we limit the colour aesthetic to geom_point() by placing it in that layer rather than in the "parent" function. Note, though, that because we still want colour to be mapped to the species variable, it needs to be wrapped in aes() and supplied to the mapping argument.

# 5a
ggplot(data = penguins, 
       mapping = aes(x = bill_depth_mm, 
                     y = bill_length_mm)) +
  geom_point(mapping = aes(colour = species),
             alpha = 0.5) +
  geom_smooth(method = "lm")

# 5b
ggplot(penguins, 
       aes(x = bill_depth_mm, 
           y = bill_length_mm)) +
  geom_point(aes(colour = species),
             alpha = 0.5) +
  geom_smooth(method = "lm", 
              aes(colour = species)) +
  geom_smooth(method = "lm", 
              colour = "black")

Look at that! The data reveal something called "Simpson's paradox": a relationship appears to go in one direction overall, but within the groups that make up the data it goes in the opposite direction. Here, the overall relationship between bill length and bill depth looks negative, but once we take the different species into account, the relationship within each species is actually positive.
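The same pattern can be read from the slopes of the linear models that geom_smooth(method = "lm") fits behind the scenes. The sketch below refits them with lm(); bill_slope is a hypothetical helper written for this illustration.

```r
penguins <- palmerpenguins::penguins

# Hypothetical helper: slope of bill length regressed on bill depth
bill_slope <- function(data) {
  fit <- lm(bill_length_mm ~ bill_depth_mm, data = data)
  coef(fit)[["bill_depth_mm"]]
}

# One model across all penguins
overall_slope <- bill_slope(penguins)

# One model per species
species_slopes <- sapply(split(penguins, penguins$species), bill_slope)

overall_slope
species_slopes
```

A negative overall slope alongside positive slopes for every species is the numerical signature of the paradox seen in the plot.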

Wrap-up

We learned about different parameters of ggplot functions, and how to combine different geoms into more complex charts.



Athanasiamo/swc.tidyverse documentation built on Dec. 17, 2021, 9:48 a.m.