require(learnr) require(printr) knitr::opts_chunk$set(echo=TRUE)
We now wish to learn about data visualisation. Data visualisation is a crucial part of any statistical enquiry. It allows to better understand the data by means of graphical exploration. In this laboratory, we are going to rely on R for Data Science, a book by Hadley Wickham and Garrett Grolemund.
The first thing we ought to do is to load the library we will use.
library(ggplot2) library(dplyr)
mpg
data frameWe want to understand whether cars with big engines consume more fuels with respect to cars with smaller engines. What do you think?
question("The bigger the engine:", answer("The greater the fuel consumption", message = "* Good. We will see!", correct = TRUE), answer("The smaller the fuel consumption", message = "* Good. We will see!", correct = TRUE), answer("The fuel consumption should be the same", message = "* Good. We will see!", correct = TRUE), type = "single" )
We are going to use a data frame containing information for 38 models of car, created by the US Environmental Protection Agency.
The data frame is already contained in ggplot2
. It is a package that allows for visualising data in an elegant and structured fashion.
Let us give a closer look at the data.
# type mpg and analyse the output
Amongst the column of the mpg
data frame, we are interested in two: displ
, the engine size in litres, and hwy
, the car's fuel efficiency on a highway (in miles per gallon).
Let us plot the data.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
There seems to be a negative correlation between the engine size and the fuel efficiency, at least after a first look. This plot is called a scatterplot: values for two variables are displayed on the x and y axes.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
ggplot()
. Setting the argument data = mpg
told ggplot about the data we want to use. This function creates an empty coordinate system.geom_point
. geom_point
is a particular type of geom
. Every geom
has a mapping
argument. This is paired with aes
. Arguments of aes
is to tell where on the axes we want to put variables.The generic syntax to make a plot is then the following:
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Here you have a list of exercises! If you find yourself stuck, click on the solution button!
Run ggplot(data = mpg)
. What happens?
ggplot(data = mpg)
How many rows and columns in the data frame?
mpg
Make a scatterplot of hwy
vs cyl
(the number of cylinders).
ggplot(data = mpg) + geom_point(mapping = aes(x = hwy, y = cyl))
Make a scatterplot of class
(the "type" of car) and drv
..
ggplot(data = mpg) + geom_point(mapping = aes(x = class, y = drv))
question("Do you believe that the last plot was particularly useful?", answer("No", correct = TRUE, message = "It was not. Scatterplots are not very useful when applied to categorical variables. Each plot serves its purpose!"), answer("Yes", message = "Why do you think so? The plot doesn't show anything interesting. Is it because we are only considering categorical variables?") )
There appears to be a bunch of points in the bottom right corner that do not seem to follow the trend.
What may be the reason? Maybe the better efficiency is because those cars are hybrid? Let us find out!
To gain more insight regarding why this may happen, we exploit the mappings of the plot. We map the class
of the cars to a colour.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))
These cars are "2seater". It makes sense: sports car have smaller bodies and gain in efficiency. With hindsight, it was unlikely for hybrid cars to have bigger engines.
(Notice that if you prefer to use British English, you can also type colour
.)
We can map the class
variable in other ways. For instance, we can map it to size.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = class))
Notice the warning message! We are mapping a discrete variable to a continuous mapping and this is not advised.
Another possibility is to use different shapes.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class))
We see that "suv" vehicles are not displayed. ggplot
handles at most six shapes.
A third option would be that to map it to the alpha
aesthetic. alpha
regulates the transparency of the points.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
Notice that we obtain a similar warning to the one we had for size
.
If we simply wish to change the output, regardless of the points, we can do something like the following.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), colour = "blue", alpha = 0.8)
Notice how some points are more transparent than others. It is because the colour becomes more intense if two or more points overlap.
What is wrong with this plot? Fix it so that the output has the same colour.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
We can add more variables by combining different aesthetics. If a variables in the data frame is discrete, we can split the plots into facets.
To do this, we add a layer called facet_wrap
.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)
The argument of facet_wrap
is called a formula (~ class
). This is not to be confused with a formula in the mathematical sense; it is a data structure of R.
If we wish to facet the plot according to two variables, we can use the facet_grid
function in a similar way. We can decompose according to drv
(f
is front-wheel, r
is rear-wheel and 4
is 4wd) and cyl
is the number of cylinders.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)
What happens if you facet considering a variable with continuous values?
What does the following code do? What is the effect of .
?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ .) ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(. ~ cyl)
Take a look at the following plots.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))
These plots use the same data and the same variables. The difference is given by the different choice of geom
layers.
A geom
is, in facts, a geometrical object employed for the representation of the data.
Every geom
takes as input a mapping
argument. Not every aesthetic, however, is compatible with every geom
. For instance, there is no such thing as the shape
of a line. Yet, we can define a linetype
ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
In this case, we have three different smoothed lines, obtained by separating the different classes. If this sounds strange, let us superpose the plots and add some colours.
ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, colour = drv)) + geom_point(mapping = aes(x = displ, y = hwy, linetype = drv, colour = drv))
There are more than thirty geometric objects in the base ggplot2
package and many more extensions have been created. A cheatsheet is available to guide you in the process!
We notice, however, that we had to duplicate the code:
ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, colour = drv)) + geom_point(mapping = aes(x = displ, y = hwy, linetype = drv, colour = drv))
We can avoid this by telling ggplot
that we want it to push the arguments to the following geom
s as follows:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) + geom_smooth(mapping = aes(linetype = drv)) + geom_point()
The same idea works backward. If we wish to only draw a line for the "midsize" cars, we can do the following.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(colour = drv)) + geom_smooth(data = filter(mpg, class == "midsize"), se = FALSE)
Do you think these graphs will differ?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth() ggplot() + geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
Congratulations! You have made your first steps in the amazing field of data visualisation. If you want to learn more, check the ggplot2 website.
ggplot2
is widely employed. Next time you read some news on the BBC® website, you may want to take a closer look...
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.