The transferrable skills from ggplot2 are not the idiosyncrasies of plotting syntax, but a powerful way of thinking about visualisation, as a way of mapping between variables and the visual properties of geometric objects that you can perceive.
--- Hadley Wickham
Base plotting is imperative: it’s about what you do. You set up your layout(), then you go to the first group (drug). You add the points for that group (drug) along with a title. Then you fit and plot a best-fit line for the first grouping, then the second grouping, and so on. Then you go on to the next plot. After 20 of those, you end with a legend.
ggplot2 plotting is declarative: it’s about what your graph is. The graph has drug group mapped to the x-axis, prevalence rate mapped to the y, and abuse type mapped to the color. The graph displays both points and best-fit lines for each drug group, and it is faceted into one plot per drug group, with a drug group described by its market name.
`ggplot2` is a huge package (philosophy + functions), but it is very well organized.

`ggplot2` has its own website with some very good examples and how-tos for common tasks. See http://ggplot2.tidyverse.org/reference
```{block2, type='rmdwarning'}
On 6/15 ggplot2 2.3.0 will come out and there are breaking changes. If you upgrade your version of ggplot2, there may be instances where code from the book (or websites) will no longer work.
```
## Example

Going to throw a lot at you, but you'll know where and what to look for. For just about every plotting task there are multiple ways to achieve the desired result.

```r
knitr::include_graphics("images/dataviz/api_life_means.png", dpi = 450)
# knitr::include_graphics("images/dataviz/api_nmu_means.png", dpi = 450)
knitr::include_graphics("images/dataviz/comp_recent_means.png", dpi = 450)
```
What is similar / different between these plots? What is and what isn't driven by data?
We'll build this style of plot in stages. In chapter 9 of R for Data Science we will go into detail about how to get our data in this format.
All plots start with data. `ggplot` expects the data to be in a "Tidy Data" format. We'll dive deeper into "tidy data" in Chapter 9 of R for Data Science, but for now the basic principle is that each variable is a column and each observation is a row.
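As a rough illustration of that shape, here is a small hand-made table. The column names mirror the ones used later in this chapter (`drug`, `use_type`, `mean`, `lower`, `upper`), but the drug names and numbers are invented for the example and are not the contents of `bargraphdat.RDS`.

```r
library(tibble)

# Hypothetical values, for illustration only.
# One row per drug / use type combination, one column per variable.
tidy_example <- tribble(
  ~drug,    ~use_type, ~mean, ~lower, ~upper,
  "Drug A", "use",      30.1,   28.0,   32.2,
  "Drug A", "nmu",       8.4,    7.1,    9.7,
  "Drug B", "use",      12.5,   11.0,   14.0,
  "Drug B", "nmu",       3.2,    2.5,    3.9
)

tidy_example
```

Because every variable lives in its own column, each one can later be mapped to an axis or a color by name.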
```r
library(tidyverse)
dat <- readRDS("./data/bargraphdat.RDS")
dat
```

```r
p <- ggplot(data = dat)
p
```
That's uninteresting. We haven't mapped the data to our plot yet. Let's work on getting the bar chart roughly right.
Aesthetics map data to visual elements or parameters.

* `drug` -> x-axis
* `mean` -> y-axis
* `use_type` -> color

```r
p <- ggplot(data = dat, aes(x = drug, y = mean, color = use_type))
p
```
Geoms are short for geometric objects which are displayed on the plot. Some of the more familiar ones are
| Type | Function |
|:----:|:--------:|
| Point | `geom_point()` |
| Line | `geom_line()` |
| Bar | `geom_bar()`, `geom_col()` |
| Histogram | `geom_histogram()` |
| Regression | `geom_smooth()` |
| Boxplot | `geom_boxplot()` |
| Text | `geom_text()` |
| Vert./Horiz. Line | `geom_vline()`, `geom_hline()` |
| Count | `geom_count()` |
| Density | `geom_density()` |
Those are just the 10 most popular geoms. See http://ggplot2.tidyverse.org/reference/ for many more options, or just start typing `geom_` in RStudio.
```r
# List every geom_ function exported by ggplot2
old_width <- options(width = 80)
lsf.str("package:ggplot2") %>% grep("^geom_", ., value = TRUE)
options(width = old_width$width)
```
There are also many ggplot extensions that add other useful geoms. See https://www.ggplot2-exts.org/ for many useful features and extensions.
```{block type='rmdnote'}
There are two types of bar charts: `geom_bar` makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use `geom_col` instead.

`geom_bar` will calculate the counts or proportions from the raw data. There is no reason to precompute those.
```
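To make the difference concrete, here is a minimal sketch using the `mpg` data set that ships with ggplot2 (not this chapter's data). The names `p_bar`, `p_col`, and `class_counts` are made up for the example.

```r
library(ggplot2)

# Raw data: one row per car. geom_bar() counts the rows in each class for us.
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Precomputed data: one row per class, with the count already stored in `n`.
class_counts <- as.data.frame(table(class = mpg$class), responseName = "n")

# geom_col() uses the value in `n` directly as the bar height.
p_col <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_col()
```

Both plots show the same bars. Our `dat` already holds the values we want to display (`mean`), which is why we reach for `geom_col()`.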
```r
p <- ggplot(data = dat, aes(x = drug, y = mean, color = use_type)) +
  geom_col()
p
```
Oops....
The `color` only controls the border of our bar chart; what we want to do is `fill` the bar. Also, note that by default the bars are stacked. We can fix that by having the position of each subgroup `dodge` each other.
```r
p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type)) +
  geom_col(position = "dodge")
p
```
`geom_*(mapping, data, stat, position)`

* `data`: Geoms can have their own data
* `map`: Geoms can have their own aesthetics
    * `geom_point` needs `x` and `y`, optional `shape`, `color`, `size`, etc.
    * `geom_ribbon` requires `x`, `ymin` and `ymax`, optional `fill`
    * `?geom_ribbon`
* `stat`: Some geoms apply further transformations to the data
    * `stat = 'identity'`
    * `geom_histogram` uses `stat_bin()` to group observations
* `position`: Some adjust location of objects
    * `'dodge'`, `'stack'`, `'jitter'`
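As a small sketch of a geom carrying its own data and aesthetics, the example below again uses the built-in `mpg` data rather than `dat`. The object names `class_means` and `p_layers` are invented for illustration.

```r
library(ggplot2)

# Summary table used only by the second layer: mean highway mpg per class.
class_means <- aggregate(hwy ~ class, data = mpg, FUN = mean)

p_layers <- ggplot(mpg, aes(x = class, y = hwy)) +
  # Layer 1 inherits data and aesthetics from ggplot(); jitter spreads the points out.
  geom_point(position = "jitter", alpha = 0.4) +
  # Layer 2 supplies its own data and sets a fixed color and size.
  geom_point(data = class_means, color = "red", size = 4)
```

The second layer still inherits the `x` and `y` mappings, so `class_means` only needs columns with the same names.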
Now let's add the error bars to our plot. We will have to add the upper and lower bounds to our aesthetics, and align them with our bars.

```r
p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5)
p
```
We've come pretty close to recreating the original plot. We still have some tweaking to do.
The first tweak, the legend labels, is handled in our data: factors to the rescue. The second, the bar colors, can be done with a named vector.
```r
# convert the use_type to a factor with the correct label
dat$use_type <- factor(dat$use_type,
                       levels = c("use", "nmu"),
                       labels = c("Lifetime Use", "Lifetime Non-Medical Use"))

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5)
p
```
Scales control the details of how data values are translated to visual properties. Override the default scales to tweak details like the axis labels or legend keys, or to use a completely different translation from data to aesthetic.
`labs()`, `xlab()`, `ylab()`, and `ggtitle()` modify the axis, legend, and plot labels.
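A minimal sketch of `labs()` on the built-in `mpg` data (the label text and the name `p_labs` are made up for the example). Each argument is named after the aesthetic or plot element it relabels.

```r
library(ggplot2)

p_labs <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(
    title = "Engine size and fuel economy",  # plot title, same as ggtitle()
    x = "Engine displacement (litres)",      # x-axis label, same as xlab()
    y = "Highway miles per gallon",          # y-axis label, same as ylab()
    color = "Drive train"                    # legend title for the color aesthetic
  )
```

Setting a label to `NULL` removes it, which is how the x-axis label is dropped in the chapter's plot.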
```r
bar_colors <- c("Lifetime Use" = "grey", "Lifetime Non-Medical Use" = "blue")

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  scale_fill_manual(values = bar_colors) +                            # change the bar colors
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper)), 5)) +   # change the y-axis scale
  labs(x = NULL,                                                      # remove the x-axis label "drug"
       y = "Prevalence % (95% CI)")                                   # change the y-axis label
p
```
```{block2, type='rmdtip'}
The `scales` library provides many useful functions for automatically determining breaks and labels for axes and legends. It also has many useful formatting functions, such as for commas and percentages.
```
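A quick sketch of two of those formatters, assuming the `scales` package is installed. The plot uses the `economics` data set that ships with ggplot2, and `p_scales` is just an example name.

```r
library(ggplot2)
library(scales)

# Formatters turn numbers into display strings.
big_number <- comma(1234567)               # thousands separators
share <- percent(0.256, accuracy = 1)      # rounded to whole percent

# The same functions can be handed to a scale to format its labels.
p_scales <- ggplot(economics, aes(x = date, y = pop)) +
  geom_line() +
  scale_y_continuous(labels = comma)
```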
### Themes

Themes control the display of all non-data elements of the plot. You can change just about everything: fonts, font sizes, background colors, etc. You can override all settings with a complete theme like `theme_bw()`, or choose to tweak individual settings by using `theme()` and the `element_` functions.

There are a handful of built-in themes and tons of packages that have additional themes. `ggthemes` has a collection of themes used by various organizations (e.g. The Economist, FiveThirtyEight.com, The Wall St. Journal, etc.).

Themes contain a huge number of parameters, grouped by plot area:

* Global options: `line`, `rect`, `text`, `title`
* `axis`: x-, y- or other axis title, ticks, lines
* `legend`: Plot legends
* `panel`: Actual plot area
* `plot`: Whole image
* `strip`: Facet labels

```r
p + theme_classic()
```
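Tweaking individual settings on top of a complete theme might look like the sketch below. It uses the built-in `mpg` data, and the specific settings are only examples, not the chapter's final choices.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  theme_classic() +                                       # start from a complete theme
  theme(
    legend.position = "bottom",                           # legend group
    axis.text.x = element_text(angle = 45, hjust = 1),    # axis group: rotate x labels
    panel.grid.major.y = element_line(color = "grey90")   # panel group: light horizontal grid
  )
```

The argument names follow the groups listed above: the prefix (`legend`, `axis`, `panel`) says which plot area is being changed.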
This is almost what we want. Our final code would look like:
```r
library(tidyverse)
dat <- readRDS("./data/bargraphdat.RDS")

# convert the use_type to a factor with the correct label
dat$use_type <- factor(dat$use_type,
                       levels = c("use", "nmu"),
                       labels = c("Lifetime Use", "Lifetime Non-Medical Use"))

bar_colors <- c("Lifetime Use" = "grey", "Lifetime Non-Medical Use" = "blue")

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  scale_fill_manual(values = bar_colors) +                                # change the bar colors
  coord_cartesian(ylim = c(0, 50)) +
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper) + 5), 5),     # change the y-axis scale
                     expand = c(0, 0)) +                                  # remove the spacing between the x axis and the bars
  labs(x = NULL,                                                          # remove the x-axis label "drug"
       y = "Prevalence % (95% CI)") +                                     # change the y-axis label
  theme_classic() +
  theme(legend.position = "bottom",                                       # move the legend to the bottom
        legend.title = element_blank(),                                   # remove the legend variable
        axis.text.x = element_text(angle = 90, hjust = 1),                # rotate the x-axis text
        axis.ticks.x = element_blank())                                   # remove the x-axis tick marks
p
```
Facets are subplots of the data, with each subplot displaying one subset of the data. There are two ways to create facets: `facet_grid` and `facet_wrap`.

`facet_grid` forms a matrix of panels defined by row and column faceting variables. It is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

`facet_wrap` wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than `facet_grid` because most displays are roughly rectangular.
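Here is a small side-by-side sketch of the two on the built-in `mpg` data. The object names are invented for the example.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# facet_grid: a matrix of panels, rows by drive train and columns by cylinders.
# Combinations with no data still get an (empty) panel.
p_grid <- base_plot + facet_grid(drv ~ cyl)

# facet_wrap: one panel per class, wrapped into 3 columns.
p_wrap <- base_plot + facet_wrap(~ class, ncol = 3)
```

Printing `p_grid` shows why it suits two crossed variables, while `p_wrap` packs a single variable's panels into the available space.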
```r
p <- ggplot(data = dat, aes(x = fct_reorder(drug, mean), y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  facet_wrap(~ use_type, scales = "free") +
  scale_fill_manual(values = bar_colors) +                             # change the bar colors
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper)), 5),      # change the y-axis scale
                     expand = c(0, 0)) +                               # remove the spacing between the x axis and the bars
  labs(x = NULL,                                                       # remove the x-axis label "drug"
       y = "Prevalence % (95% CI)") +                                  # change the y-axis label
  theme_classic() +
  theme(legend.position = "bottom",                                    # move the legend to the bottom
        legend.title = element_blank()) +                              # remove the legend variable
  coord_flip()
p
```
While we didn't use them for this particular plot, `stat_*()` functions can be a huge time saver. `stat_*` functions display statistical summaries of the data. For a bar plot there is no reason to precompute the number of items in a group (or the percentage) from the data. Instead, we can use the appropriate function and have it calculated automatically for us.
```r
# List every stat_ function exported by ggplot2
old_width <- options(width = 80)
lsf.str("package:ggplot2") %>% grep("^stat_", ., value = TRUE)
options(width = old_width$width)
```
There are many more useful `stat_*()` functions in various packages.
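As one possible sketch, `stat_summary()` can compute means and standard-error bars straight from raw observations. The example uses the built-in `mpg` data and ggplot2's `mean_se()` helper; note that it draws mean plus or minus one standard error, not the 95% confidence intervals stored in `dat`.

```r
library(ggplot2)

p_stat <- ggplot(mpg, aes(x = class, y = hwy)) +
  # mean_se() returns y (the mean), ymin and ymax for each class
  stat_summary(fun.data = mean_se, geom = "col", fill = "grey70") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.3)
```

No summary table is built by hand here: the raw `hwy` values go in and the stat layer does the aggregation.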
Save your plot with `ggsave`. Use the correct extension for the plot type you wish to save, e.g. `.pdf` for PDF, `.png` for PNG, etc. See `?ggsave` for details and other parameters.
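A minimal sketch of saving a plot, writing to a temporary directory so nothing in the project is overwritten. The file name, size, and resolution are arbitrary choices for the example.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# The .png extension tells ggsave() which graphics device to use.
out_file <- file.path(tempdir(), "example_plot.png")
ggsave(out_file, plot = p_save, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

Passing `plot =` explicitly avoids relying on whichever plot happened to be displayed last.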
```r
dat <- readRDS("./data/bargraphdat2.RDS")
```

```r
knitr::include_graphics("images/dataviz/comp_recent_means.png", dpi = 450)
```
If you wanted to make this style of plot a function, what would you need to pass to the function? What customization would you allow a user to make and what would you not?
For the plot you brought, create a data set and create the plot using ggplot.
For the above plot (exercise 3), re-imagine a different visualization for the data and create the plot using ggplot.
Begin making a RADARS theme. What is our font, font size for various elements, background, etc. We will end up making a custom theme based on this for everyone to use. This will allow us to get presentation quality graphics quickly.
Read Chapter 2 (Workflow: Basics)
Learn more
ggplot2 docs: http://ggplot2.tidyverse.org/
Hadley Wickham's ggplot2 book: https://www.amazon.com/dp/0387981403/
Noteworthy RStudio Add-Ins
ggplotThemeAssist: Customize your ggplot theme interactively
ggedit: Layer, scale, and theme editing
General Help and How-To's
Tips and Tricks
Math and symbols
Base Plot