# Data Visualization

## Why ggplot2

The transferrable skills from ggplot2 are not the idiosyncrasies of plotting syntax, but a powerful way of thinking about visualisation, as a way of mapping between variables and the visual properties of geometric objects that you can perceive.

--- Hadley Wickham

Base plotting is imperative: it’s about what you do. You set up your layout(), then you go to the first group (drug). You add the points for that group along with a title. Then you fit and plot a best-fit line for the first grouping, then the second grouping, and so on. Then you go on to the next plot. After 20 of those, you end with a legend.

ggplot2 plotting is declarative: it’s about what your graph is. The graph has drug group mapped to the x-axis, prevalence rate mapped to the y-axis, and abuse type mapped to the color. The graph displays both points and best-fit lines for each drug group, and it is faceted into one plot per drug group, with each drug group described by its market name.
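The contrast can be sketched on a built-in data set. This is only an illustration under assumptions: it uses `mtcars` (not the drug data described above), and the object name `p_decl` is made up for the example.

```r
library(ggplot2)

# Imperative (base graphics): build the picture one step at a time.
plot(mtcars$wt, mtcars$mpg, type = "n",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
cyls <- sort(unique(mtcars$cyl))
for (i in seq_along(cyls)) {
  sub <- mtcars[mtcars$cyl == cyls[i], ]
  points(sub$wt, sub$mpg, col = i, pch = 19)   # add the points for this group
  abline(lm(mpg ~ wt, data = sub), col = i)    # then its best-fit line
}
legend("topright", legend = cyls, col = seq_along(cyls), pch = 19, title = "cyl")

# Declarative (ggplot2): describe what the graph is.
p_decl <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "cyl")
p_decl
```

In the ggplot2 version the grouping, the colors, and the legend all follow from the single `color = factor(cyl)` mapping, so there is no loop to maintain.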

ggplot2 is a huge package (philosophy + functions), but it is very well organized.

ggplot2 has its own website with some very good examples and guidance on how to do common tasks.

See http://ggplot2.tidyverse.org/reference

```{block2, type='rmdwarning'}
On 6/15 ggplot2 2.3.0 will come out and there are breaking changes. If you upgrade your version of ggplot2, code from the book (or websites) may no longer work.
```

## Example

We're going to throw a lot at you, but you'll know where and what to look for. For just about every plotting task there are multiple ways to achieve the desired result.

```r
knitr::include_graphics("images/dataviz/api_life_means.png", dpi = 450)
# knitr::include_graphics("images/dataviz/api_nmu_means.png", dpi = 450)
knitr::include_graphics("images/dataviz/comp_recent_means.png", dpi = 450)
```

What is similar / different between these plots? What is and what isn't driven by data?

We'll build this style of plot in stages. In chapter 9 of R for Data Science we will go into detail about how to get our data in this format.

### Data

All plots start with data. `ggplot2` expects the data to be in a "tidy data" format. We'll dive deeper into tidy data in Chapter 9 of R for Data Science, but for now the basic principles are:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each observational unit forms a table

```r
library(tidyverse)
dat <- readRDS("./data/bargraphdat.RDS")
dat
p <- ggplot(data = dat)
p
```

That's uninteresting. We haven't mapped the data to our plot yet. Let's work on getting the bar chart roughly right.
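To make the tidy-data principles above concrete, here is a minimal sketch. The `wide` table and its values are invented purely for illustration (they are not the contents of `bargraphdat.RDS`), and `pivot_longer()` from tidyr is just one way to reshape it.

```r
library(tidyr)

# Hypothetical "wide" table: one row per drug, one column per use type.
# The numbers are made up for illustration only.
wide <- data.frame(
  drug = c("Drug A", "Drug B"),
  use  = c(40.1, 25.3),
  nmu  = c(10.2, 5.7)
)

# Tidy version: each row is one observation (a drug / use-type pair),
# and each variable (drug, use_type, mean) has its own column.
tidy <- pivot_longer(wide, cols = c(use, nmu),
                     names_to = "use_type", values_to = "mean")
tidy
```

The tidy layout is what lets a column such as `use_type` be mapped directly to an aesthetic later on.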

### Aesthetics

Aesthetics map data to visual elements or parameters.

```r
p <- ggplot(data = dat, aes(x = drug, y = mean, color = use_type))
p
```
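A short sketch of the difference between mapping an aesthetic to data and setting it to a constant may help here. It assumes the `mpg` data set that ships with ggplot2 rather than the class data, and the object names are made up for the example.

```r
library(ggplot2)

# Mapped: color varies with a data column, so it goes inside aes().
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Set: one constant color for every point goes outside aes().
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# The mapped aesthetics are stored on the plot object.
names(p_mapped$mapping)
```

Only mapped aesthetics produce a legend, because only they carry information from the data.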

### Geoms

Geoms are short for geometric objects, which are displayed on the plot. Some of the more familiar ones are:

| Type | Function |
|:----:|:--------:|
| Point | `geom_point()` |
| Line | `geom_line()` |
| Bar | `geom_bar()`, `geom_col()` |
| Histogram | `geom_histogram()` |
| Regression | `geom_smooth()` |
| Boxplot | `geom_boxplot()` |
| Text | `geom_text()` |
| Vert./Horiz. Line | `geom_vline()`, `geom_hline()` |
| Count | `geom_count()` |
| Density | `geom_density()` |

Those are just ten of the most popular geoms.

See http://ggplot2.tidyverse.org/reference/ for many more options.

Or just start typing `geom_` in RStudio.

```r
# list every geom_* function exported by ggplot2
old_width <- options(width = 80)
lsf.str("package:ggplot2") %>% grep("^geom_", ., value = TRUE)
options(width = old_width$width)  # restore the previous console width
```

There are also many ggplot extensions that add other useful geoms. See https://www.ggplot2-exts.org/ for many useful features and extensions.

```{block, type='rmdnote'}
There are two types of bar charts: `geom_bar` makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use `geom_col` instead. `geom_bar` will calculate the counts or proportions from the raw data. There is no reason to precompute those.
```
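The two kinds of bar chart can be compared side by side. This sketch assumes the `mpg` data set bundled with ggplot2, and `class_counts` is a helper table created only for the example.

```r
library(ggplot2)

# geom_bar(): counts the rows in each group from the raw data.
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col(): bar heights come straight from a column of precomputed values.
class_counts <- as.data.frame(table(class = mpg$class))
p_col <- ggplot(class_counts, aes(x = class, y = Freq)) +
  geom_col()
```

Both plots show the same bars; the difference is whether ggplot2 or the analyst does the counting.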

```r
p <- ggplot(data = dat, aes(x = drug, y = mean, color = use_type)) +
  geom_col()
p
```

Oops....

The `color` aesthetic only controls the border of the bars; what we want is to `fill` them. Also note that by default the bars are stacked. We can fix that by having the bars of each subgroup dodge each other.

```r
p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type)) +
  geom_col(position = "dodge")
p
```

`geom_*(mapping, data, stat, position)`

Now let's add the error bars to our plot. We will have to add the upper and lower bounds to our aesthetics and align them with our bars.

```r
p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5)
p
```

We've come pretty close to recreating the original plot. We still have some tweaking to do.

  1. Reorder the grouping so that "Use" comes before "Non-Medical Use" and use the full description.
  2. Change the fill colors.
  3. Change the y-axis label to "Prevalence % (95% CI)".
  4. Remove the x-axis label "drug".
  5. Change the y-axis scale to go in increments of 5.
  6. Rotate the x-axis labels.
  7. Remove the variable name over the legend.
  8. Move the legend to the bottom.

The first one is handled in our data, with factors to the rescue, while the second can be done with a named vector.

```r
# convert the use_type to a factor with the correct label
dat$use_type <- factor(dat$use_type,
                       levels = c("use", "nmu"),
                       labels = c("Lifetime Use", "Lifetime Non-Medical Use"))

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5)
p
```
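The effect of `levels` and `labels` can be checked on a toy vector. The vector `x` below is made up for illustration; only the level codes and labels mirror the ones used above.

```r
# Toy input in arbitrary order (illustrative values only).
x <- c("nmu", "use", "use", "nmu")

f <- factor(x,
            levels = c("use", "nmu"),
            labels = c("Lifetime Use", "Lifetime Non-Medical Use"))

levels(f)      # the order given in `levels`, relabelled by `labels`
as.integer(f)  # underlying codes follow the level order, not the alphabet
```

ggplot2 uses the level order for discrete scales, which is why "Lifetime Use" is drawn and listed first.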

### Scales

Scales control the details of how data values are translated to visual properties. Override the default scales to tweak details like the axis labels or legend keys, or to use a completely different translation from data to aesthetic.

`labs()`, `xlab()`, `ylab()`, and `ggtitle()` modify the axis, legend, and plot labels.

```r
bar_colors <- c("Lifetime Use" = "grey", "Lifetime Non-Medical Use" = "blue")

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  scale_fill_manual(values = bar_colors) +   # change the bar colors
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper)), 5)) +  # change the y-axis scale
  labs(x = NULL,                             # Remove the x-axis label "drug"
       y = "Prevalence % (95% CI)")          # Change the y-axis label
p
```

```{block2, type='rmdtip'}
The `scales` library provides many useful functions for automatically determining breaks and labels for axes and legends. It also has many useful formatting functions, such as commas and percentages.
```
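A few `scales` helpers in action, as a rough sketch. It assumes a version of scales recent enough to provide `breaks_width()` (1.1.0 or later, to the best of my knowledge) and uses the bundled `mpg` data rather than the class data.

```r
library(ggplot2)
library(scales)

comma(1234567)                  # thousands separators
percent(0.256, accuracy = 0.1)  # proportion formatted as a percentage

# Let scales compute breaks every 5 units instead of building seq() by hand.
p_scales <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_continuous(breaks = breaks_width(5))
```

`breaks_width(5)` returns a function of the axis limits, so the breaks adapt if the data change.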

### Themes

Themes control the display of all non-data elements of the plot. You can change just about everything: fonts, font sizes, background colors, etc. You can override all settings with a complete theme like `theme_bw()`, or choose to tweak individual settings by using `theme()` and the `element_` functions.

There are a handful of built-in themes and tons of packages that have additional themes. `ggthemes` has a collection of themes used by various organizations (e.g., The Economist, FiveThirtyEight.com, The Wall St. Journal, etc.).

Themes contain a huge number of parameters, grouped by plot area:

*  Global options: `line`, `rect`, `text`, `title`
*  `axis`: x-, y- or other axis title, ticks, lines
*  `legend`: Plot legends
*  `panel`: Actual plot area
*  `plot`: Whole image
*  `strip`: Facet labels
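One way to combine a complete theme with individual tweaks is to wrap both in a small function. The name `theme_house()` and the particular settings are hypothetical, chosen only to touch several of the plot areas listed above.

```r
library(ggplot2)

# Hypothetical house theme: start from a complete theme, then override pieces.
theme_house <- function(base_size = 12) {
  theme_bw(base_size = base_size) +
    theme(
      legend.position  = "bottom",                             # legend
      legend.title     = element_blank(),
      panel.grid.minor = element_blank(),                      # panel
      axis.text.x      = element_text(angle = 90, hjust = 1),  # axis
      strip.text       = element_text(face = "bold")           # strip
    )
}

p_theme <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  theme_house()
```

Keeping the tweaks in one function means every plot that adds `theme_house()` picks up later changes automatically.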

```r
p + theme_classic()
```

This is almost what we want. Our final code would look like:

```r
library(tidyverse)
dat <- readRDS("./data/bargraphdat.RDS")
# convert the use_type to a factor with the correct label
dat$use_type <- factor(dat$use_type, levels = c("use", "nmu"), labels = c("Lifetime Use", "Lifetime Non-Medical Use"))
bar_colors <- c("Lifetime Use" = "grey", "Lifetime Non-Medical Use" = "blue")

p <- ggplot(data = dat, aes(x = drug, y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  scale_fill_manual(values = bar_colors) +   # change the bar colors
  coord_cartesian(ylim = c(0, 50)) +         # show the y-axis from 0 to 50
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper) + 5), 5),  # change the y-axis scale
                     expand = c(0, 0)) +     # remove the spacing between the x axis and the bars
  labs(x = NULL,                             # Remove the x-axis label "drug"
       y = "Prevalence % (95% CI)") +        # Change the y-axis label
  theme_classic() +
  theme(legend.position = "bottom",          # move the legend to the bottom
        legend.title    = element_blank(),   # remove the legend variable
        axis.text.x     = element_text(angle = 90, hjust = 1),   # rotate the x-axis text
        axis.ticks.x    = element_blank())   # remove the x-axis tick marks
p
```

### Facets

Facets are subplots of the data, with each subplot displaying one subset of the data. There are two ways to create facets: `facet_grid` and `facet_wrap`.

facet_grid forms a matrix of panels defined by row and column faceting variables. It is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

facet_wrap wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid because most displays are roughly rectangular.
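The two faceting functions can be compared on the same base plot. This sketch assumes the bundled `mpg` data set and the `vars()` interface available in ggplot2 3.0.0 and later; the object names are made up for the example.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# facet_grid(): a matrix of panels, rows by drive type and columns by cylinders.
p_grid <- base + facet_grid(rows = vars(drv), cols = vars(cyl))

# facet_wrap(): one panel per vehicle class, wrapped into three columns.
p_wrap <- base + facet_wrap(vars(class), ncol = 3)
```

With `facet_grid()`, combinations that do not occur in the data still get an (empty) panel, which is why it works best when all combinations exist.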

```r
p <- ggplot(data = dat, aes(x = fct_reorder(drug, mean), y = mean, fill = use_type, ymin = lower, ymax = upper)) +
  geom_col(width = 0.75) +
  geom_errorbar(position = position_dodge(width = 0.75), width = 0.5) +
  facet_wrap(~ use_type, scales = "free") +
  scale_fill_manual(values = bar_colors) +   # change the bar colors
  scale_y_continuous(breaks = seq(0, ceiling(max(dat$upper)), 5),  # change the y-axis scale
                     expand = c(0, 0)) +     # remove the spacing between the x axis and the bars
  labs(x = NULL,                             # Remove the x-axis label "drug"
       y = "Prevalence % (95% CI)") +        # Change the y-axis label
  theme_classic() +
  theme(legend.position = "bottom",          # move the legend to the bottom
        legend.title    = element_blank()) + # remove the legend variable
  coord_flip()
p
```

### Stats

While we didn't use them for this particular plot, the `stat_*()` functions can be a huge time saver. `stat_*` functions display statistical summaries of the data. For a bar plot, there is no reason to precompute the number of items in a group (or the percentage) from the data. Instead, we can use the appropriate function to have it calculated automatically for us.

```r
# list every stat_* function exported by ggplot2
old_width <- options(width = 80)
lsf.str("package:ggplot2") %>% grep("^stat_", ., value = TRUE)
options(width = old_width$width)  # restore the previous console width
```

There are many more useful stat_*() functions in various packages.
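As a rough illustration of letting a stat do the summarising, the sketch below draws group means with standard-error bars directly from raw data. It assumes the bundled `mpg` data set and the `fun` argument of `stat_summary()` (ggplot2 3.3.0 or later; earlier versions used `fun.y`). Note that `mean_se` gives the mean plus or minus one standard error, not the 95% CI used in the plots above.

```r
library(ggplot2)

p_stat <- ggplot(mpg, aes(x = class, y = hwy)) +
  stat_summary(fun = mean, geom = "col") +       # bar height = group mean
  stat_summary(fun.data = mean_se,               # ymin / ymax = mean -/+ 1 SE
               geom = "errorbar", width = 0.5)
p_stat
```

No summary table is built by hand; the means and error bounds are computed when the plot is drawn.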

### Saving

Save your plot with `ggsave`. Use the correct extension for the plot type you wish to save, e.g. `.pdf` for PDF, `.png` for PNG, etc. See `?ggsave` for details and other parameters.
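A minimal sketch of saving a plot follows. The file name, the dimensions, and the use of `tempdir()` are arbitrary choices for the example, not recommendations from the text.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# The output format is chosen from the file extension (.pdf here).
out_file <- file.path(tempdir(), "example_plot.pdf")
ggsave(out_file, plot = p_save, width = 6, height = 4, units = "in")

file.exists(out_file)
```

Changing the extension to `.png` would write a PNG instead, in which case the `dpi` argument controls the resolution.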

## Exercises

  1. Modify the above code to produce the plot below. You can read in the data with: `dat <- readRDS("./data/bargraphdat2.RDS")`

```r
knitr::include_graphics("images/dataviz/comp_recent_means.png", dpi = 450)
```

  2. If you wanted to make this style of plot a function, what would you need to pass to the function? What customization would you allow a user to make and what would you not?

  3. For the plot you brought, create a data set and create the plot using ggplot.

  4. For the above plot (exercise 3), re-imagine a different visualization for the data and create the plot using ggplot.

  5. Begin making a RADARS theme. What is our font, font size for various elements, background, etc.? We will end up making a custom theme based on this for everyone to use. This will allow us to get presentation-quality graphics quickly.

  6. Read Chapter 2 (Workflow: Basics).

## Resources and Links

### Learn more

### Noteworthy RStudio Add-Ins

### General Help and How-To's

### Tips and Tricks

### Math and symbols

### Base Plot



DavisBrian/rclassnotes documentation built on May 17, 2019, 8:19 a.m.