knitr::opts_chunk$set( echo = TRUE, fig.retina = 3 )
Questions
How can I create publication-quality graphics in R?
Objectives
To be able to use
ggplot2
to generate publication quality graphics.To understand the basic grammar of graphics, including the aesthetics and geometry layers, adding statistics, transforming scales, and colouring or panelling by groups.
Plotting the data is one of the best ways to quickly explore it and generate hypotheses about various relationships between variables.
There are several plotting systems in R, but today we will focus on ggplot2
which implements grammar of graphics - a coherent system for describing components that constitute visual representation of data. For more information regarding principles and thinking behind ggplot2
graphic system, please refer to Layered grammar of graphics by Hadley Wickham (@hadleywickham).
The advantage of ggplot2
is that it allows R users to create publication quality graphics with a few lines of code. ggplot2
has a large user base and is constantly developed and extended by the community.
ggplot2
is a core member of tidyverse
family of packages. Installing and loading the package under the same name will load all of the packages we will need for this workshop. Lets get started!
# install.packages("tidyverse") # install.packages("palmerpenguins") library(tidyverse)
If above code produces an error "there is no package called ‘tidyverse’", uncomment (remove #) the line above and run install.packages()
command before you load the library. You only need to install the package once, but you will have to reload it, using the library()
command, every time you restart R.
Today we will be working with the penguins
dataset, which is the excerpt from the penguins data. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
First let's load the data.
penguins <- palmerpenguins::penguins
Here, we grab the data called penguins
from the palmerpenguins package, and assign (<-
) it to an object
in our R environment we call penguins
.
Notice how you can see it in your environment pane in RStudio.
You can have a look at the content of the penguins
data frame by typing penguins
either in the R-chunk or in the console. Data frame is a rectangular collection of data, where variables are organized as columns and observations are listed as rows.
penguins
The dataset contains the following fields:
More information about the package and the data is available in help. Type ?penguins
in console, located in the bottom panel of your RStudio, or type penguins
in the search field of the Help tab of the bottom-right RStudio panel. Whenever you are unsure about anything in R, it is a good idea to check out the help file using one of the two methods described above.
Here's a question that we would like to answer using
penguins
data: Do penguins with deep beaks also have long beaks? This might seem like a silly question, but it gets us exploring our data.
To plot penguins
, run the following code in the R-chunk or in console. The following code will put bill_depth_mm
on the x-axis and bill_length_mm
on the y-axis:
ggplot(data = penguins) + geom_point( mapping = aes(x = bill_depth_mm, y = bill_length_mm) )
Note that we split the function into several lines.
In R, any function has a name and is followed by parentheses. Inside the parentheses we place any information the function needs to run.
Here, we are using two main functions, ggplot()
and geom_point()
.
To save screen space, we have placed each function on its own line, and also split up arguments into several lines.
How this is done depends on you, there are no real rules for this.
We will use the tidyverse coding style throughout this course, to be consistent and also save space on the screen.
The plus sign indicates that the ggplot is not over yet and that the next line should be interpreted as additional layer to the preceding ggplot()
function. In other words, when writing a ggplot()
function spanning several lines, the +
sign goes at the end of the line, not in the beginning.
The plot shows positive linear relationship between bill length and body mass.
Does this graph confirm or disprove your initial hypothesis about the relationship between these variables?
Note that in order to create a plot using ggplot2
system, you should start your command with ggplot()
function. It creates an empty coordinate system and initializes the dataset to be used in the graph (which is supplied as a first argument into the ggplot()
function). In order to create graphical representation of the data, we can add one or more layers to our otherwise empty graph. Functions starting with the prefix geom_
create a visual representation of data. In this case we added scattered points, using geom_point()
function. There are many geoms
in ggplot2
, some of which we will learn in this lesson.
geom_
functions create mapping of variables from the earlier defined dataset to certain aesthetic elements of the graph, such as axis, shapes or colours. The first argument of any geom_
function expects the user to specify these mappings, wrapped in the aes()
(short for aesthetics) function. In this case, we mapped bill_depth_mm
and bill_length_mm
variables from penguins
dataset to x and y-axis, respectively (using x
and y
arguments of aes()
function).
Room: plenary
Duration: 5 minutes
1a: How has bill length changed over time? What do you observe?
Hint: the
penguins
dataset has a column calledyear
, which should appear on the x-axis.1b: Try a different
geom_
function calledgeom_jitter
. It will spread the points apart a little bit using random noise.
## 1a ggplot(data = penguins) + geom_point( mapping = aes(x = year, y = bill_length_mm) ) # 1b ggplot(data = penguins) + geom_jitter( mapping = aes(x = year, y = bill_length_mm) )
What if we want to combine graphs from the previous two challenges and show the relationship between three variables in the same graph? Turns out, we don't necessarily need to use third geometrical dimension, we can employ colour.
The following graph maps island
variable from penguins
dataset to the colour
aesthetic of the plot. Let's take a look:
ggplot(data = penguins) + geom_jitter( mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = island) )
Room: break-out
Duration: 10 minutes
2a: What will happen if you switch colour to also be by year? Is the graph still useful? Why or why not?
2b: What if you map
colour
aesthetic tospecies
? What has changed? How isyear
different fromspecies
orisland
? What is the limitation of thecolour
aesthetic, when used to visualize different types of data?
## 2a ggplot(data = penguins) + geom_jitter( mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = year) ) # 2b ggplot(data = penguins) + geom_jitter( mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species) )
There are other aesthetics that can come handy. One of them is size
. The idea is that we can vary the size of data points to illustrate another continuous variable, such as species bill depth. Lets look at four dimensions at once!
ggplot(data = penguins) + geom_jitter( mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species, size = year) )
It might be even better to try another type of aesthetic, like shape, for categorical data like species.
ggplot(data = penguins) + geom_jitter( mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species, shape = species) )
Playing around with different aesthetic mappings until you find something that really makes the data "pop" is a good idea. A plot is rarely made nice on the first try, we all try different configurations until we find the one we like.
Until now, we explored different aesthetic properties of a graph mapped to certain variables. What if you want to recolour or use a certain shape to plot all data points? Well, that means that such colour or shape will no longer be mapped to any data, so you need to supply it to geom_
function as a separate argument (outside of the mapping
).
This is called "setting" in the ggplot2-world. We "map" aesthetics to data columns, or we "set" single values outside aesthetics to apply to the entire geom or plot.
Here's our initial graph with all colours coloured in blue.
ggplot(data = penguins) + geom_point( mapping = aes(x = bill_depth_mm, y = bill_length_mm), colour = "blue" )
Once more, observe that the colour is now not mapped to any particular variable from the penguins
dataset and applies equally to all data points, therefore it is outside the mapping
argument and is not wrapped into aes()
function. Note that set colours are supplied as characters (in quotes).
Room: break-out
Duration: 10 minutes
4a: Change the transparency (alpha) of the data points by year. _Hint:
alpha
takes a value from 0 (transparent) to 1 (solid).4b: Move the transparency outside the
aes()
and set it to0.7
. What can we benefit of each one of these methods?
## 4a ggplot(data = penguins) + geom_point( mapping = aes(x = bill_depth_mm, y = bill_length_mm, alpha = year) ) ## 4b ggplot(data = penguins) + geom_point( mapping = aes(x = bill_depth_mm, y = bill_length_mm), alpha = 0.7)
Controlling the transparency can be a great way to "mute" the visual effect of certain data, while still keeping it visible. Its a great tool when you have many data points or if you have several geoms together, like we will see soon.
Next, we will consider different options for geoms
. Using different geom_
functions user can highlight different aspects of data.
A useful geom function is geom_boxplot()
. It adds a layer with the "box and whiskers" plot illustrating the distribution of values within categories. The following chart breaks down bill length by island, where the box represents first and third quartile (the 25th and 75th percentiles), the middle bar signifies the median value and the whiskers extent to cover 95% confidence interval. Outliers (outside of the 95% confidence interval range) are shown separately.
ggplot(data = penguins) + geom_boxplot( mapping = aes(x = species, y = bill_length_mm) )
Layers can be added on top of each other. In the following graph we will place the boxplots over jittered points to see the distribution of outliers more clearly. We can map two aesthetic properties to the same variable. Here we will also use different colour for each island.
ggplot(data = penguins) + geom_jitter( mapping = aes(x = species, y = bill_length_mm, colour = species) ) + geom_boxplot( mapping = aes(x = species, y = bill_length_mm) )
Now, this was slightly inefficient due to duplication of code - we had to specify the same mappings for two layers. To avoid it, you can move common arguments of geom_
functions to the main ggplot()
function. In this case every layer will "inherit" the same arguments, specified in the "parent" function.
ggplot(data = penguins, mapping = aes(x = island, y = bill_length_mm) ) + geom_jitter(aes(colour = island)) + geom_boxplot(alpha = .6)
You can still add layer-specific mappings or other arguments by specifying them within individual geoms. Here, we've set the transparency of the boxplot to .6, so we can see the points behind it, and also mapped colour to island in the points. We would recommend building each layer separately and then moving common arguments up to the "parent" function.
We can use linear models to highlight differences in dependency between bill length and body mass by island. Notice that we added a separate argument to the geom_smooth()
function to specify the type of model we want ggplot2
to built using the data (linear model). The geom_smooth()
function has also helpfully provided confidence intervals, indicating "goodness of fit" for each model (shaded gray area). For more information on statistical models, please refer to help (by typing ?geom_smooth
)
ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm) ) + geom_point(alpha = 0.5) + geom_smooth(method = "lm")
Room: Break-out
Duration: 10 minutes
5a. Modify the plot so the the points are coloured by island, but there is a single regression line.
5b. Add a regression line to the plot that plots one line for each species, while also plotting one across all species. Hint: Add another geom!
In the graph above, each geom inherited all three mappings: x, y and colour. If we want only single linear model to be built, we would need to limit the effect of colour
aesthetic to only geom_point()
function, by moving it from the "parent" function to the layer where we want it to apply. Note, though, that because we want the colour
to be still mapped to the island
variable, it needs to be wrapped into aes()
function and supplied to mapping
argument.
# 5a ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point(mapping = aes(colour = species), alpha = 0.5) + geom_smooth(method = "lm") # 5b ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point(aes(colour = species), alpha = 0.5) + geom_smooth(method = "lm", aes(colour = species)) + geom_smooth(method = "lm", colour = "black")
Look at that! The data actually reveals something called the "simpsons paradox". It's when a relationship looks to go in a specific direction, but when looking into groups within the data the relationship is the opposite. Here, the overall relationship between bill length and depths looks negative, but when we take into account that there are different species, the relationship is actually positive.
We learned about different parameters of ggplot functions, and how to combine different geoms into more complex charts.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.