In DavisBrian/rclassnotes: Does not matter.

Data Visualization

Why ggplot2

The transferrable skills from ggplot2 are not the idiosyncracies of plotting syntax, but a powerful way of thinking about visualisation, as a way of mapping between variables and the visual properties of geometric objects that you can perceive.

--- Hadley Wickham

Base plotting is imperative,it’s about what you do. You set up your layout(), then you go to the first group (drug) You add the points for that group (drug) along with a title. Then you fit and plot a best-fit-line for the first grouping, then the second grouping, and so on. Then you go on to the next plot. After 20 of those, you end with a legend.

ggplot2 plotting is declarative, it’s about what your graph is. The graph has drug group mapped to the x-axis, prevalence rate mapped to the y, and abuse type mapped to the color. The graph displays both points and best-fit lines for each drug group and it is faceted into one-plot-per-drug group, with a drug group described by its market name.

Functional data visualization
1. Wrangle your data
2. Map data elements to visual elements
3. Tweak scales, guides, axis, labels, theme
Easy to reason about how the data drives the visualization
Easy to iterate
East to be consistent

ggplot2 is a huge package: philosophy + functions ...but it is very well organized

ggplot2 has it's one website with some very good examples and how to do common task.

See http://ggplot2.tidyverse.org/reference

```{block2, type='rmdwarning'} On 6/15 ggplot2 2.3.0 will come out and there are Breaking Changes. If you upgrade your version of ggplot there may be instances that code from the book (or websites) will no longer work.

## Example

Going to throw a lot at you ...but you'll know where and what to look for.  For just about every plotting task there are multiple ways to achieve the desired result.

```r
knitr::include_graphics("images/dataviz/api_life_means.png", dpi = 450)
# knitr::include_graphics("images/dataviz/api_nmu_means.png", dpi = 450)
knitr::include_graphics("images/dataviz/comp_recent_means.png", dpi = 450)

What is similar / different between these plots? What is and what isn't driven by data?

We'll build this style of plot in stages. In chapter 9 of R for Data Science we will go into detail about how to get our data in this format.

Data

All plots start with data. `ggplot expects the data to be in a "Tidy Data" format. We'll dive deeper into "tidy data" in Chapter 9 of R for Data Science, but for now the basic principle is

Each variable forms a column
Each observation forms a row
Each observational unit forms a table

library(tidyverse)
dat <- readRDS("./data/bargraphdat.RDS")
dat

p <- ggplot(data = dat)
p

That's uninteresting. We haven't mapped the data to our plot yet. Let's work on getting the bar chart roughly right.

Aesthetics

Aesthetics map data to visual elements or parameters.

drug -> x-axis
mean -> y-axis
use_type -> color

p <- ggplot(data = dat, aes(x = drug, y = mean, color = use_type)) 
p

Geoms

Geoms are short for geometric objects which are displayed on the plot. Some of the more familiar ones are

| Type | Function | |:----:|:--------:| | Point | geom_point() | | Line | geom_line() | | Bar | geom_bar(), geom_col() | | Histogram | geom_histogram() | | Regression | geom_smooth() | | Boxplot | geom_boxplot() | | Text | geom_text() | | Vert./Horiz. Line | geom_{vh}line() | | Count | geom_count() | | Density | geom_density() |

Those are just the top 10 most popular geoms

See http://ggplot2.tidyverse.org/reference/ for many more options

Or just start typing geom_ in RStudio

# geom_
old_width = options(width = 80)
lsf.str("package:ggplot2") %>% grep("^geom_", ., value = TRUE)
options(width = old_width$width)

There are also many ggplot extensions that add other useful geoms. See https://www.ggplot2-exts.org/ for many useful features and extensions.

``{block type='rmdnote'} There are two types of bar charts:geom_barmakes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, usegeom_colinstead.geom_bar` will calculate the counts or proportions from the raw data. There is no reason to precompute those.

```r
p <- ggplot(data = dat, aes(x = drug, y = mean, color = use_type)) +
  geom_col()
p

Oops....

The color only controls the border of our bar chart, what we want to do is fill the bar. Also, note that by default the bars are stacked. We can fix that by having the position of each subgroup dodge each other.

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type)) +
  geom_col(position = "dodge")
p

geom_*(mapping, data, stat, position)

data Geoms can have their own data
- Has to map onto global coordinates
map Geoms can have their own aesthetics
- Inherits global aesthetics
- Have geom-specific aesthetics
  - geom_point needs x and y, optional shape, color, size, etc.
  - geom_ribbon requires x, ymin and ymax, optional fill
- ?geom_ribbon
stat Some geoms apply further transformations to the data
- All respect stat = 'identity'
- Ex: geom_histogram uses stat_bin() to group observations
position Some adjust location of objects
- 'dodge', 'stack', 'jitter'

Now lets add the error bars to our plot. We will have to add the upper and lower bounds to our aesthetics, and align them with our bars.

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5)
p

We've come pretty close to recreating the original plot. We still have some tweaking to do.

Reorder the grouping so that "Use" comes before "Non-Medical Use" and use the full description.
Change the fill colors
Change the y-axis label to "Prevalence % (95% CI)"
Remove the x-axis label "drug".
Change the y-axis scales to go in increments of 5
Rotate the x-axis labels
Remove the variable name over the legend.
Move the legend to the bottom

The first one is handled with our data. Factors to the rescue. while the second can be done with a named vector.

# convert the use_type to a factor with the correct label
dat$use_type <-factor(dat$use_type, 
                      levels = c("use", "nmu"), 
                      labels = c("Lifetime Use", "Lifetime Non-Medical Use"))

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5)
p

Scales

Scales control the details of how data values are translated to visual properties. Override the default scales to tweak details like the axis labels or legend keys, or to use a completely different translation from data to aesthetic.

labs() xlab() ylab() and ggtitle() modify the axis, legend, and plot labels.

bar_colors <- c("Lifetime Use" = "grey", "Lifetime Non-Medical Use" = "blue")

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  scale_fill_manual(values=bar_colors) +     # change the bar colors
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper)), 5) ) +  # change the y-axis scale
  labs(x = NULL,                             # Remove the x-axis label "drug"
       y = "Prevalence % (95% CI)")          # Change the y-axis label
p

``{block2, type='rmdtip'} libraryscales` provides many useful functions for automatically determining breaks and labels for axes and legends. Also has many useful formatting functions such as commas and percentages

### Themes

Themes control the display of all non-data elements of the plot. You can change just about everything, fonts, font sizes, background colors, etc.  You can override all settings with a complete theme like theme_bw(), or choose to tweak individual settings by using theme() and the element_ functions. 

There are a handful of built in themes and tons of packages that have additional themes.  `ggthemes` has a collection of themes used by various organization (Ex. The Economist, Fivethiryeight.com, The Wall St. Journal, etc)

Themes contain a huge number or parameters, grouped by plot area:

*  Global options: `line`, `rect`, `text`, `title`
*  `axis`: x-, y- or other axis title, ticks, lines
*  `legend`: Plot legends
*  `panel`: Actual plot area
*  `plot`: Whole image
*  `strip`: Facet labels

```r
p + theme_classic()

This is almost what we want. Our final code would look like:

library(tidyverse)
dat <- readRDS("./data/bargraphdat.RDS")
# convert the use_type to a factor with the correct label
dat$use_type <-factor(dat$use_type, levels = c("use", "nmu"), labels = c("Lifetime Use", "Lifetime Non-Medical Use"))
bar_colors <- c("Lifetime Use" = "grey", "Lifetime Non-Medical Use" = "blue")

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  scale_fill_manual(values=bar_colors) +     # change the bar colors
  coord_cartesian(ylim=c(0, 50)) +
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper)+5), 5),  # change the y-axis scale
                     expand = c(0,0)) +      # remove the spacing between the x axis and the bars
  labs(x = NULL,                             # Remove the x-axis label "drug"
       y = "Prevalence % (95% CI)") +        # Change the y-axis label
  theme_classic() +
  theme(legend.position = "bottom",          # move the legend to the bottom
        legend.title    = element_blank(),   # remove the legend variable
        axis.text.x     = element_text(angle = 90, hjust = 1),   # rotate the x-axis text
        axis.ticks.x    = element_blank())       # remove the x asix tick marks
p

Facets

Facets are subplots of the data with each subplot displaying one subset of the data. there are two ways to create facets: facet_grid and facet_wrap.

facet_grid forms a matrix of panels defined by row and column faceting variables. It is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

facet_wrap wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid because most displays are roughly rectangular.

p <- ggplot(data = dat, aes(x = fct_reorder(drug, mean), y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  facet_wrap(~ use_type, scales = "free") +
  scale_fill_manual(values=bar_colors) +     # change the bar colors
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper)), 5),  # change the y-axis scale
                     expand = c(0,0)) +      # remove the spacing between the x axis and the bars
  labs(x = NULL,                             # Remove the x-axis label "drug"
       y = "Prevalence % (95% CI)") +        # Change the y-axis label
  theme_classic() +
  theme(legend.position = "bottom",          # move the legend to the bottom
        legend.title    = element_blank()) + # remove the legend variable
  coord_flip()
p

Stats

While we didn't use them for this particular plot stat_*() function can be a huge time saver. stat_* functions display statistical summaries of the data. For a bar plot there is no reason the count then number of items in a group (or percentage) on the data. Instead we can use the appropriate function have it calculated automatically for us.

# geom_
old_width = options(width = 80)
lsf.str("package:ggplot2") %>% grep("^stat_", ., value = TRUE)
options(width = old_width$width)

There are many more useful stat_*() functions in various packages.

Saving

Save your plot with ggsave. Use the correct extension for the plot type you wish to save. E.g .pdf for pdf, .png for png, etc. See ?ggsave for details and other parameters.

Exercises

Modify the above code to produce the plot below. You can read in the data with: dat <- readRDS("./data/bargraphdat2.RDS")

knitr::include_graphics("images/dataviz/comp_recent_means.png", dpi = 450)

If you wanted to make this style of plot a function, what would you need to pass to the function? What customization would you allow a user to make and what would you not?
For the plot you brought, create a data set and create the the plot using ggplot.
For the above plot (exercise 3). Re-imagine a different visualization for the data and create the plot using ggplot.
Begin making a RADARS theme. What is our font, font size for various elements, background, etc. We will end up making a custom theme based on this for everyone to use. This will allow us to get presentation quality graphics quickly.
Read Chapter 2 (Workflow: Basics)

Resources and Links

Learn more

ggplot2 docs: http://ggplot2.tidyverse.org/
Hadley Wickham's ggplot2 book: https://www.amazon.com/dp/0387981403/

Noteworthy RStudio Add-Ins

ggplotThemeAssist: Customize your ggplot theme interactively
ggedit: Layer, scale, and theme editing

General Help and How-To's

Tips and Tricks

Math and symbols

Base Plot

DavisBrian/rclassnotes documentation built on May 17, 2019, 8:19 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DavisBrian/rclassnotes
Does not matter.

In DavisBrian/rclassnotes: Does not matter.

Data Visualization

Why ggplot2

Data

Aesthetics

Geoms

Scales

Facets

Stats

Saving

Exercises

Resources and Links

R Package Documentation

Browse R Packages

We want your feedback!

DavisBrian/rclassnotes Does not matter.

In DavisBrian/rclassnotes: Does not matter.

Data Visualization

Why ggplot2

Data

Aesthetics

Geoms

Scales

Facets

Stats

Saving

Exercises

Resources and Links

R Package Documentation

Browse R Packages

We want your feedback!

DavisBrian/rclassnotes
Does not matter.