library(learnr)
library(gapminder)
library(ggrepel)
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE,
                      fig.align = "center", fig.width = 5, fig.height = 4)
tutorial_options(exercise.timelimit = 60, exercise.blanks = "___+", exercise.eval = TRUE)

# no factors please
gapminder <- gapminder %>%
  mutate(country = as.character(country), continent = as.character(continent))

gap_92 <- gapminder %>%
  filter(year == 1992) %>%
  mutate(gdp = gdpPercap * pop / 1e9)

df <- gapminder %>% filter(country == 'Romania')
Base R plots don't look as good by default
They are hard to extend into more complex plots, and hard to fine-tune
plot(gapminder$gdpPercap, gapminder$lifeExp)
ggplot2
What is a statistical graphic?
Take variables from a dataset and map them to aesthetic attributes (aes()) of geometric objects (geom_*())
How are variables mapped to aesthetic attributes of points?
gapminder %>%
  filter(year == 1992) %>%
  mutate(gdp = gdpPercap * pop / 1e9) %>%
  ggplot(aes(gdp, lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10() +
  xlab('Gross Domestic Product (Billions $)') +
  ylab('Life Expectancy at birth (years)') +
  ggtitle('Gapminder for 1992')
Construct a graphic by adding modular pieces
ggplot(data, mapping)
Define aesthetic mappings with aes()
function
e.g. aes(x = var1, y = var2)
Add 'layers' of geometric objects
e.g. geom_point()
Adjustments to axis scales, colors, labels, aesthetic mods
"Chaining" together ggplot components (use + rather than %>%)
Using + rather than %>% is unfortunate and hard to remember!
The key is to understand the concepts and basic mechanics
The details for any given plot type, or attribute are easy to look up
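To make the + versus %>% distinction concrete, here is a minimal sketch (assuming the dplyr, ggplot2 and gapminder packages are installed). The data steps are chained with %>%, the result is piped into ggplot() as its data argument, and from that point on every plot component is added with +.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Data-wrangling steps are chained with %>% ...
p <- gapminder %>%
  filter(year == 1992) %>%
  mutate(gdp = gdpPercap * pop / 1e9) %>%
  # ... and the result is piped in as the `data` argument of ggplot()
  ggplot(aes(x = gdp, y = lifeExp)) +
  # From here on, plot components are added with +
  geom_point() +
  scale_x_log10()

p
```

Using %>% after ggplot() instead of + is a common slip; ggplot2 stops with an error in that case, so the mistake is usually easy to spot.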
gap_92 <- gapminder %>%
  filter(year == 1992) %>%
  mutate(gdp = gdpPercap * pop / 1e9)
gap_92 %>% head(4)
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + geom_point()
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + geom_point() + scale_x_log10()
Change how data values are translated to visual properties
e.g. scale_x_log10(), scale_y_reverse()
Change limits of axes: xlim(0, 10)
Applies to other attributes as well
Fine-tune color, shape, size aesthetics.
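As a rough illustration of scales applying to more than the axes, the sketch below combines an axis transformation, explicit axis limits, and a size scale. The size = pop mapping, the ylim(20, 90) limits and the range = c(1, 10) values are illustrative choices, not something prescribed by the tutorial.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gap_92 <- gapminder %>%
  filter(year == 1992) %>%
  mutate(gdp = gdpPercap * pop / 1e9)

p <- ggplot(gap_92, aes(x = gdp, y = lifeExp, size = pop)) +
  geom_point(alpha = 0.6) +
  scale_x_log10() +              # transform the x axis
  ylim(20, 90) +                 # set the limits of the y axis
  scale_size(range = c(1, 10))   # smallest and largest point size

p
```

Note that xlim() and ylim() drop observations that fall outside the limits (with a warning), so they are best used with limits that cover the data.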
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp, shape = continent)) + geom_point() + scale_x_log10()
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp, color = continent)) + geom_point() + scale_x_log10()
The labs() function adds custom axis labels and titles
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + geom_point() + scale_x_log10() + labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)', title = 'Gapminder for 1992')
Comparing 2 continuous variables
Scatterplot: geom_point()
Line graph: geom_line()
Smoothing functions: geom_smooth()
Summarizing distribution of a single variable
Histogram: geom_histogram()
Density: geom_density()
Discrete vs continuous
Boxplot: geom_boxplot()
Bar graph: geom_col()
Violin plot: geom_violin()
And many more...
df <- gapminder %>% filter(country == 'Romania')
ggplot(df, mapping = aes(x = year, y = lifeExp)) + geom_line()
We can add as many geoms to a plot as we want, stacked on as 'layers' in order
ggplot(df, mapping = aes(x = year, y = lifeExp)) + geom_line() + geom_point()
What if we had multiple data points per year?
df <- gapminder %>% filter(country %in% c('Romania', 'Thailand'))
ggplot(df, mapping = aes(x = year, y = lifeExp)) + geom_line() + geom_point()
Need to separate them by country (group aesthetic)
ggplot(df, mapping = aes(x = year, y = lifeExp, group = country)) + geom_line() + geom_point()
Often useful to color lines by group: use the color aesthetic with a categorical variable and it automatically groups
ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) + geom_line() + geom_point()
Aesthetic mappings set in ggplot() are inherited by every geom, but you can override this for individual 'geoms'
ggplot(df, mapping = aes(x = year, y = lifeExp)) + geom_line(mapping = aes(color = country)) + geom_point()
ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) + geom_line(linetype = 'dashed', size = 0.5) + geom_point(color = 'black', size = 3, alpha = 0.75)
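The line above sets linetype, size, and color as fixed values outside aes(). A small sketch of why the placement matters: a constant given outside aes() is used literally, while the same constant inside aes() is treated as data and run through the color scale. The object names p_set and p_map are only for illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

df <- gapminder %>% filter(country %in% c('Romania', 'Thailand'))

# Setting: a constant outside aes() is applied as-is, so the lines are blue
p_set <- ggplot(df, aes(x = year, y = lifeExp, group = country)) +
  geom_line(color = 'blue')

# Mapping: inside aes() the string 'blue' is treated as a one-level
# categorical variable, so the lines get the first default palette color
# and a legend entry labelled "blue"
p_map <- ggplot(df, aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = 'blue'))

p_set
p_map
```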
How to depict the 'average' relationship between noisy variables?
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + geom_point() + scale_x_log10() + labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
geom_line() doesn't work!
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + geom_line() + geom_point() + scale_x_log10() + labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
geom_smooth() shows the average ('smoothed') relationship
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + geom_point() + geom_smooth() + scale_x_log10() + labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
Can be used to show a linear trendline
ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + geom_point() + geom_smooth(method = 'lm') + scale_x_log10() + labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
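To see what the linear trendline represents, the sketch below fits the model by hand. Because ggplot2 applies scale transformations before computing statistics, the line drawn under scale_x_log10() should correspond to a regression of lifeExp on log10(gdp); the names fit and smooth_line are illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gap_92 <- gapminder %>%
  filter(year == 1992) %>%
  mutate(gdp = gdpPercap * pop / 1e9)

# Fit the same model by hand: life expectancy against log10(GDP)
fit <- lm(lifeExp ~ log10(gdp), data = gap_92)
coef(fit)

p <- ggplot(gap_92, aes(x = gdp, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  scale_x_log10()

# Values of the drawn line (second layer); x is on the log10 scale
smooth_line <- layer_data(p, 2)
head(smooth_line[, c('x', 'y')])
```

Setting se = FALSE hides the confidence band; leave it at the default if the uncertainty of the fit should be visible.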
Can be very helpful to condense down relationships from complicated data
ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point() + scale_x_log10()
ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_smooth(method = 'lm') + scale_x_log10()
Above were all examples based around plotting 2 continuous variables (other 'aesthetics' can encode additional variables)
Other common scenarios are:
Plot distribution of a single variable (continuous or discrete)
Plot the distribution of a continuous variable against a discrete variable
Given a single discrete variable we can plot its distribution as a 'bar plot' using geom_bar()
ggplot(gapminder, mapping = aes(x = continent)) + geom_bar()
For a single continuous variable, we can generate a histogram using geom_histogram(), which bins the values and then makes a bar plot
ggplot(gapminder, mapping = aes(x = gdpPercap)) + geom_histogram()
We can adjust the axis scale and other features as usual
ggplot(gapminder, mapping = aes(x = gdpPercap)) + geom_histogram() + scale_x_log10()
We can change the number of bins (can also specify details of bin positions)
ggplot(gapminder, aes(gdpPercap)) + geom_histogram(bins = 100) + scale_x_log10()
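As one way of specifying bin positions rather than a bin count, the sketch below uses the binwidth and boundary arguments of geom_histogram(). The values 0.1 and 0 are illustrative; because the x axis is log-transformed before binning, the width is in log10 units.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(x = gdpPercap)) +
  # Each bin spans 0.1 log10 units, with bin edges aligned to 0
  # (that is, to 10^0 = 1 on the original scale)
  geom_histogram(binwidth = 0.1, boundary = 0) +
  scale_x_log10()

p

# Inspect the computed bins: xmin/xmax are on the log10 scale
bins <- layer_data(p)
head(bins[, c('xmin', 'xmax', 'count')])
```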
Can also encode different continents in different colors by stacking the histograms
ggplot(gapminder, mapping = aes(x = gdpPercap, color = continent)) + geom_histogram() + scale_x_log10()
ggplot(gapminder, mapping = aes(x = gdpPercap, fill = continent)) + geom_histogram() + scale_x_log10()
Density plots are another way to depict the distribution of a continuous variable. A density plot is essentially a smoothed histogram.
ggplot(gapminder, mapping = aes(x = gdpPercap)) + geom_density() + scale_x_log10()
Separate by continent and give each a separate fill color
ggplot(gapminder, mapping = aes(x = gdpPercap, fill = continent)) + geom_density(alpha = 0.5) + scale_x_log10()
The boxplot is the most common choice for showing the distribution of a continuous variable broken down by a categorical variable
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) + geom_boxplot() + scale_y_log10()
The violin plot is similar, but shows the distribution as a density plot, rather than a box.
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) + geom_violin() + scale_y_log10()
Another useful option is a 'dotplot' or 'beeswarm' plot.
library(ggbeeswarm)
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_beeswarm(size = 0.5, alpha = 0.75, cex = 1) +
  scale_y_log10()
By default x-axis values ordered alphabetically
Need to use the idea of a factor
Factors used to encode categorical variables, specify the possible 'levels', and optionally an ordering
cont_order <- c('Oceania', 'Europe', 'Americas', 'Asia', 'Africa')
gap_cat <- gapminder %>%
  mutate(continent = factor(continent, levels = cont_order))
head(gap_cat)
ggplot(gap_cat, mapping = aes(x = continent, y = gdpPercap)) + geom_boxplot() + scale_y_log10()
forcats package has lots of useful helper functions for changing order of factor variables.
gap_cat <- gap_cat %>%
  mutate(continent = fct_reorder(continent, gdpPercap, median))
ggplot(gap_cat, mapping = aes(x = continent, y = gdpPercap)) + geom_boxplot() + scale_y_log10()
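A quick way to check what a reordering did is to inspect the factor levels directly, since the levels determine the axis order. The sketch below assumes the forcats package is installed; by_gdp and by_gdp_desc are illustrative names, and fct_rev() is shown as one way to flip the order.

```r
library(dplyr)
library(forcats)
library(gapminder)

# Order continents by their median GDP per capita (lowest first)
by_gdp <- gapminder %>%
  mutate(continent = fct_reorder(as.character(continent), gdpPercap, .fun = median))
levels(by_gdp$continent)

# Reverse the order (highest first)
by_gdp_desc <- by_gdp %>%
  mutate(continent = fct_rev(continent))
levels(by_gdp_desc$continent)
```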
If you want to plot a single value of a continuous variable for each level of a discrete variable, use geom_col()
gap_82 <- gapminder %>% filter(year == 1982, continent == 'Americas')
ggplot(gap_82, mapping = aes(x = country, y = gdpPercap)) + geom_col()
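The difference between geom_bar() and geom_col() can be sketched as follows: geom_bar() counts rows for you, while geom_col() draws the y values it is given. Counting by hand with dplyr::count() and plotting with geom_col() should therefore reproduce the geom_bar() plot; the names counts, p_bar and p_col are illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# geom_bar(): bar height = number of rows per continent
p_bar <- ggplot(gapminder, aes(x = continent)) +
  geom_bar()

# geom_col(): bar height = the value supplied as y
counts <- gapminder %>% count(continent)
p_col <- ggplot(counts, aes(x = continent, y = n)) +
  geom_col()

p_bar
p_col
```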
You can customize MANY details of the plot using the theme() function
It's a bit complicated at first, but most common changes are easy to google.
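As a starting point, here is a sketch of a few frequently used theme() adjustments. The specific choices (rotated axis labels, no minor grid lines, bold title) are illustrative, not recommendations from the tutorial.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = 'GDP per capita by continent') +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # rotate x-axis labels
    panel.grid.minor = element_blank(),                 # remove minor grid lines
    plot.title = element_text(face = 'bold')            # bold title
  )

p
```

The theme() call is placed after theme_minimal() on purpose: a complete theme added later would overwrite the individual settings.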
ggsave
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_violin() +
  scale_y_log10()
ggsave(filename = here::here('results', 'my_fig.png'))
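By default ggsave() saves the last plot that was displayed. A slightly more explicit sketch is shown below: it passes the plot object and the output size directly. The temporary output path and the 5 x 4 inch, 300 dpi settings are assumptions for illustration; substitute your own path and dimensions.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
  geom_violin() +
  scale_y_log10()

# Illustrative output location; replace with a path in your project
out_file <- file.path(tempdir(), 'my_fig.png')

# Pass the plot explicitly and fix the size so the result is reproducible
ggsave(filename = out_file, plot = p, width = 5, height = 4, units = 'in', dpi = 300)
```

The file type is inferred from the extension, so changing 'my_fig.png' to 'my_fig.pdf' writes a PDF instead.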
You don't need to remember the details, just the basic mechanics. You can quickly look up the details (check out this useful ggplot cheat sheet)
Find example plots online that you like and just copy/paste as a template. Browse the ggplot gallery
If we map a continuous variable to color it won't group automatically
ggplot(df, mapping = aes(x = year, y = lifeExp, color = gdpPercap)) + geom_line() + geom_point(size = 3)
We need to specify group manually
ggplot(df, mapping = aes(x = year, y = lifeExp, group = country, color = gdpPercap)) + geom_line() + geom_point(size = 3)
ggplot assumes a continuous color scale for numeric data and a discrete scale for strings
Make numeric data into factors if you want discrete colors
my_df <- gapminder %>% filter(year %in% c(1957, 1977, 1997))
ggplot(my_df, mapping = aes(x = gdpPercap, y = lifeExp, color = factor(year))) +
  geom_point() +
  scale_x_log10() +
  labs(color = 'year')
We can use scale_color_manual() to set the color of each group manually
my_cols <- c(Romania = 'green', Thailand = 'orange')
ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  scale_color_manual(values = my_cols)
scale_color_brewer() offers some useful default color schemes
ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) + geom_line() + scale_color_brewer(palette = 'Dark2')
https://www.r-bloggers.com/a-detailed-guide-to-ggplot-colors/
Facets allow you to easily break a single plot into multiple plots based on a variable.
gap_early <- gapminder %>% filter(year < 1970)
ggplot(gap_early, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() + geom_smooth(se = FALSE) + scale_x_log10() + facet_wrap(~continent)
Or based on multiple variables
ggplot(gap_early, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() + geom_smooth(se = FALSE) + scale_x_log10() + facet_grid(year ~ continent)
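By default all facets share the same axis ranges. When the panels cover very different ranges, the scales argument can relax this. The sketch below is one possible variation; nrow = 1 and scales = 'free_x' are illustrative choices.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gap_early <- gapminder %>% filter(year < 1970)

p <- ggplot(gap_early, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  # One row of panels; each panel gets its own x-axis range,
  # while the y axis stays shared for easy comparison
  facet_wrap(~continent, nrow = 1, scales = 'free_x')

p
```

Free scales make each panel easier to read but harder to compare across panels, so shared scales remain the safer default.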
gap_df <- gapminder %>% filter(year == 1992, continent == 'Americas') %>% mutate(gdp = gdpPercap * pop / 1e9) %>% head(20)
You can add text labels to the points with geom_text
ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp, label = country)) + geom_text() + geom_point() + geom_smooth(method = 'lm', se = FALSE) + scale_x_log10() + labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
Or with geom_label
ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp, label = country)) + geom_label() + geom_point() + geom_smooth(method = 'lm', se = FALSE) + scale_x_log10() + labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')
Text labels are often not placed optimally
ggrepel is a very useful package that will automatically find good positioning for labels
library(ggrepel)
ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp)) + geom_point() + geom_smooth(method = 'lm', se = FALSE) + scale_x_log10() + labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)') + geom_label_repel(aes(label = country), size = 2.5)
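When labelling every point is still too crowded, one option is to give the label layer its own, smaller dataset. The sketch below labels only the five largest economies; the top5 name and the choice of five are illustrative, and it assumes the ggrepel package is installed.

```r
library(dplyr)
library(ggplot2)
library(gapminder)
library(ggrepel)

gap_df <- gapminder %>%
  filter(year == 1992, continent == 'Americas') %>%
  mutate(gdp = gdpPercap * pop / 1e9)

# Subset used only for the labels
top5 <- gap_df %>% slice_max(gdp, n = 5)

p <- ggplot(gap_df, aes(x = gdp, y = lifeExp)) +
  geom_point() +
  # The `data` argument overrides the plot data for this layer only
  geom_text_repel(data = top5, aes(label = country), size = 3) +
  scale_x_log10()

p
```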
There are lots of ways to add aesthetic improvements to your figures relatively easily
my_plot <- ggplot(gap_92, aes(gdp, lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10() +
  xlab('Gross Domestic Product (Billions $)') +
  ylab('Life Expectancy at birth (years)') +
  ggtitle('Gapminder for 1992')
my_plot
There are a number of pre-packaged 'themes' you can apply
my_plot + theme_minimal()
Set the marker shape to one that can be 'filled' (pch = 21 is a filled circle), then use a thin white border around a filled shape to help distinguish overlaps.
ggplot(gap_92, aes(gdp, lifeExp)) +
  geom_point(pch = 21, stroke = 0.5, alpha = 0.8, size = 2.5, color = 'white', aes(fill = continent)) +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)',
       y = 'Life Expectancy at birth (years)',
       title = 'Gapminder for 1992') +
  theme_minimal()
Add stats directly to your figures
library(ggpubr)
my_comparisons <- list(c('Africa', 'Asia'), c('Europe', 'Oceania'))
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_violin() +
  scale_y_log10() +
  stat_compare_means(method = 'wilcox.test', comparisons = my_comparisons)
Easily add correlation coefficients
ggplot(gap_92, mapping = aes(x = lifeExp, y = gdpPercap)) + geom_point() + scale_y_log10() + geom_smooth(method = 'lm') + stat_cor()
cowplot is a great tool for combining multiple 'panels' into one plot
library(cowplot)
p1 <- ggplot(mtcars, aes(disp, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(qsec, mpg)) + geom_point()
plot_grid(p1, p2, labels = c('A', 'B'))
Great tool for making heatmaps. See VERY detailed documentation with examples here