In jasonmtroos/rook: Introduction to Data Visualization, Web Scraping, and Text Analysis in R

knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(blank = FALSE, width.cutoff = 50),
  cache = 1
)
knitr::knit_hooks$set(
  source = function(x, options) {
    if (options$engine == 'R') {
      # format R code
      x = highr::hilight(x, format = 'html')
    } else if (options$engine == 'bash') {
      # format bash code
      x = paste0('<span class="hl std">$</span> ',
                 unlist(stringr::str_split(x, '\\n')),
                 '\n',
                 collapse = '')
    }
    x = paste(x, collapse = "\n")
    sprintf(
      "<div class=\"%s\"><pre class=\"%s %s\"><code class=\"%s %s\">%s</code></pre></div>\n",
      'sourceCode',
      'sourceCode',
      tolower(options$engine),
      'sourceCode',
      tolower(options$engine),
      x
    )
  }
)

Review

In class
- git --- cloning, committing, pushing
- R Markdown --- mixing text with R code using ```r blocks
This past week and/or during the break
- Reviewing/learning the R programming language
- Installing packages

library(tidyverse)
library(stringr)

test_data <- NULL
local({
  relabel_factors <-
    function(z) {
      eval(parse(text = paste0(
        'c(', paste0(z, 1:4, '=', 1:4, collapse = ','), ')'
      )))
    }

  test_data <<-
    datasets::anscombe %>% 
    reshape2::melt(
      id.vars = paste0('y', 1:4),
      value.name = 'num.correct',
      variable.name = 'x'
    ) %>% 
    tbl_df %>% 
    reshape2::melt(
      id.vars = c('x', 'num.correct'),
      value.name = 'duration',
      variable.name = 'y'
    ) %>% 
    tbl_df %>% 
    mutate(
      x = plyr::revalue(x, relabel_factors('x')),
      y = plyr::revalue(y, relabel_factors('y'))
    ) %>% 
    filter(x == y) %>% 
    group_by(x) %>% 
    mutate(respondent = factor(1:n())) %>% 
    select(round = x, respondent, num.correct, duration) %>% 
    ungroup
})

Applied problem: Merging samples

Repeated measures for 11 individuals, mean (sd)

test_data %>%
  group_by(Round = round, stat = 'mean') %>%
  summarise(Duration = round(mean(duration), 1),
            `Number Correct` = mean(num.correct)) %>%
  bind_rows(test_data %>%
              group_by(Round = round, stat = 'sd') %>%
              summarise(Duration = round(sd(duration), 1),
                        `Number Correct` = round(sd(num.correct), 1))) %>%
  mutate_at(vars(Duration, `Number Correct`), funs(ifelse(stat == 'sd', paste0('(', format(.), ')'), format(.)))) %>%
  ungroup %>%
  arrange(Round, stat) %>% 
  mutate(Round = ifelse(stat == 'mean', format(Round), '')) %>% 
  select(-stat) %>%
  as.data.frame %>%
  pander::pander() %>%
  asis_output

Applied problem: Merging samples

Regression of Duration on Number Correct repeated for each round

test_data %>%
  plyr::ddply(.variables = 'round', .fun = function(d) {
    lm(duration ~ num.correct, data = d) %>%
      broom::tidy()
  }) %>%
  mutate_at(vars(-round, -term), funs(format(round(., 2)))) %>%
  select(-statistic, -p.value) %>%
  mutate(round = ifelse(term == 'num.correct', '', round)) %>%
  rename(Round = round, Term = term, Estimate = estimate, `SE` = std.error) %>%
  pander::pander(justify = 'clrr')

if (str_detect(opts_current$get()$fig.path, 'handout'))
  asis_output('\n<!--\n')

wzxhzdk:6

wzxhzdk:7

Remember...

Always look at the data first

if (str_detect(opts_current$get()$fig.path, 'handout'))
  asis_output('\n-->\n')

**ALWAYS. LOOK. AT. THE. DATA.**

Today

This session: Data visualization with ggplot2
- Basics
- In-class exercises
- More advanced concepts
- In-class exercises
Later: Tidying and summarizing data

ggplot2

Plotting package in R intended to replace the core plotting routines
Based on the concept of a grammar of graphics
- Plots are constructed from simpler components, much as sentences are constructed from nouns, verbs, etc.
- Not all arrangements of words lead to comprehensible sentences --- the same is true for plots, and ggplot2 helps you avoid (visual) nonsense
- This approach leads to a modularity of design, making it easy for programmers to extend
Sensible and aesthetically pleasing default settings
- Informed by what we know about visual perception and cognition

What is a graph?

A visual display that illustrates one or more relationships among numbers...a shorthand means of presenting information that would take many more words and numbers to describe.

---Stephen M. Kosslyn. Graph Design for the Eye and Mind. Oxford University Press, 2006

It depends on the goal:

A tool for discovery --- gain an overview of, convey the scale and complexity of, or facilitate an exploration of data (dataviz)
A tool for communication --- help you to help others understand, tell a story about, or stimulate interest in a problem or solution (infographics)

Communicating with graphs

A graph intended for others to look must have at least these two properties

It should ask and answer a central question
- Only one question can be the most important
- Both the question and its answer should be evident
It should compare quantities
- One comparison is both the most important and the easiest to see
- Other, secondary comparisons should not distract from the main one

Both the question and main comparison should be obvious to you and the viewer

Psychological principles (Kosslyn, 2006)

Get their attention
1. Relevance
2. Appropriate knowledge
Hold and direct their attention
1. Salience
2. Discriminability
3. Organization
Help them remember
1. Compatability
2. Informative changes
3. Capacity limitations

Get their attention

Relevance
- Not too much or too little information
- Present information that reflects the message you want to convey
- Don’t present extraneous information
Appropriate knowledge
- Prior knowledge must be sufficient to understand the graph
- If you assume too much prior knowledge, viewers will be confused
- If you violate norms, viewers will be confused

If they are confused, they won’t try to understand your graph

Hold and direct their attention

Salience
- Attention is drawn to large perceptible differences
- The most visually striking aspect receives the most attention
- Annotations help direct viewers' attention
Discriminability
- Properties must differ enough to be noticed
- Defaults in ggplot2 do much of this work for you
Organization
- Groups of elements are seen and remembered as a whole

Try to anticipate the process the audience will go through while looking at your graph

Help them remember

Compatibility
- Form should be aligned with meaning
- Lines express continuous change, bars discrete quantities
- More = more (higher, better, bigger, etc.)
Informative changes
- Changes in properties should carry information
- ...and vice versa
Capacity limitations
- If too much information is presented, none is remembered
- Four chunks in working memory
- Graph designers err on the side of presenting too much, graph readers err on the side of paying too little attention

Decide what you want them to remember; everything else is secondary to that

ggplot2's grammar

Decomposes graphs into basic parts
Sets rules for interactions among those parts
Helps us stay out of trouble

ggplot2's grammar

Default values for Data and Mapping available to all layers
Layers --- one or more, each with the following:
- Data (overriding the default) --- a data.frame
- Mapping (overriding the default) of columns to Aesthetics
- Geometry specifying what to draw
- Statistic specifying how to transform the data before drawing
- Position specifying how to arrange items
Scales specifying how to translate the data to lengths, colors, sizes, etc. in the graph
Coordinates which is Cartesian (the default) 99% of the time, so ignore this for now
Facet specification for generating subplots

Layers

Layers contain everything we see, often showing different views of the same data

wzxhzdk:9

wzxhzdk:10

wzxhzdk:11

Test data

test_data

Defaults

Specify the defaults first
Most graphs use a single set of data (data.frame) for every layer
Most graphs use a single set of mappings between columns and aesthetics

my_plot <- ggplot( data = test_data, mapping = aes( x = duration, y = num.correct ) )

aes() is used to create a list of aesthetic mappings
- x refers to the graph's x-axis, y to the y-axis
- duration $\rightarrow$ x-axis
- num.correct $\rightarrow$ y-axis
my_plot now represents a ggplot object set to our defaults
You don't need to name the arguments; data comes first, mapping comes second

my_plot <- ggplot( test_data, aes( x = duration, y = num.correct ) )

An empty plot

Defaults by themselves do nothing

print( my_plot )

By default, we get an "empty" plot
To see something, we need to specify a layer

Adding a layer

Use the + operator to combine ggplot elements

my_plot + geom_point()

Usually you do not need the print() call, so the following two lines are equivalent:

my_plot + geom_point()
print( my_plot + geom_point() )

Each layer has a geometry

my_plot + geom_point()
my_plot + geom_line()

wzxhzdk:19

Each layer has a statistic

Usually the statistic is the identity function, $$f(x)=x$$ That is, the data are left unchanged
The default statistic for geom_point and geom_line is identity so these plots show the data as is
The default statistic for geom_histogram is a binning function (called stat_bin)

ggplot( test_data, aes( x = duration ) ) + geom_histogram( binwidth = 2 )

Result of applying binning function to duration

wzxhzdk:21

Geoms and statistics

Each geom/statistic has a default statistic/geom

| Item | Default stat/geom | |:-----------------|:-----------------------------| |geom_point |stat_identity ($f(x)=x$) | |geom_line |stat_identity ($f(x)=x$) | |geom_histogram |stat_bin (binning) | |geom_smooth |stat_smooth (regression) | |stat_smooth |geom_smooth (line + ribbon) | |stat_bin |geom_bar (vertical bars) | |stat_identity |geom_point (dots) |

Hence, these produce the same output:

ggplot( test_data, aes(x=duration) ) + stat_bin(binwidth=1)
ggplot( test_data, aes(x=duration) ) + geom_histogram(binwidth=1)

Data versus statistics

Be sure you understand: "Does this layer contain data or statistics?"
When in doubt, prefer data to statistics
Example: Scatter plot conveys more information than a box plot

wzxhzdk:23

wzxhzdk:24

Aesthetics

Each geometry interacts with one or more aesthetics

|Item | Required | Optional | |:-------|:-------------------|:----------------------| |geom_point|x, y|alpha, colour, fill, shape, size, stroke| |geom_line|x, y|alpha, colour, linetype, size| |geom_pointrange|x, ymax, ymin|alpha, colour, linetype, size|

You can either map data to an aesthetic, or set it explicitly

wzxhzdk:25

wzxhzdk:26

Position

Each layer also has a position specification
The default is again identity meaning don't do anything special
Examples: bars can be positioned with stack or dodge

g <- ggplot(test_data, aes(x = num.correct, fill = round))

wzxhzdk:28

wzxhzdk:29

Practice with layers (Tasks 1--4)

Work with a neighbor
First discuss the task, then one of you does the typing (take turns for each task)
Discuss what you are doing as you write code
Write your code in an empty File > New File... > R Script and execute each line using Cmd-Enter (Mac) or Control-Enter (Windows)
Use the data set called mpg which is included in the ggplot2 package

Data

library(tidyverse)
?mpg

Fuel economy data from 1999 and 2008 for 38 popular models of car

Description:
     This dataset contains a subset of the fuel economy data that the
     EPA makes available on http://fueleconomy.gov. It contains
     only models which had a new release every year between 1999 and
     2008 - this was used as a proxy for the popularity of the car.

Usage:
     mpg

Format:
     A data frame with 234 rows and 11 variables

     manufacturer
     model         model name
     displ         engine displacement, in litres
     year          year of manufacture
     cyl           number of cylinders
     trans         type of transmission
     drv           f = front-wheel drive, r = rear wheel drive, 4 = 4wd
     cty           city miles per gallon
     hwy           highway miles per gallon
     fl            fuel type
     class         "type" of car

rd2markdown <- function(rd) {
  html <- tempfile()
  md <- tempfile()
  tools::Rd2HTML(rd, out = html)
  system(paste0('pandoc -f html -t markdown ', html, ' -o ', md))
  rendered_md <- readr::read_file(md)
  unlink(md)
  unlink(html)
  rendered_md <- stringr::str_replace(rendered_md, '.*\\n.*\\n.*\\n.*\\n', '')
  rendered_md <- paste0('## ', rendered_md)
  rendered_md <- stringr::str_replace(rendered_md, '-{5,1000}', '')

  rendered_md
}
rd2markdown(tools::Rd_db('ggplot2')$mpg) %>% asis_output

mpg

Task 0 (Example)

Create a plot with 1 layer:
- x mapped to cty
- y mapped to hwy
- point geometry
- identity stat
- identity position

Go to http://jasonmtroos.github.io/rook/ and click on session_2_in_class_work_handout

Do Tasks 1--4

Facets and discrete groups

Two main options when comparing subsets of data
- Each discrete set is given a different colour, shape, or size
- Each discrete set is plotted in its own facet

g <- ggplot(mpg, aes(x = displ, y = hwy))

wzxhzdk:35

wzxhzdk:36

Groups

When you map discrete variables to colour, shape, or size, ggplot2 automatically maps those variables to group
The group aesthetic controls how collections of items are rendered
- In geom_line the group aesthetic determines which points will be connected by a continuous line
- In stat_summary the group aesthetic determines which points are summarised by a common statistic
If a variable v is continuous but you want to use it for grouping, either specificy group = v or transform it into a discrete variable, e.g., colour = factor(v)

wzxhzdk:37

wzxhzdk:38 *** * To override the automatic grouping, specify `aes(group=1)` when creating a layer wzxhzdk:39 Scales ====== * Scales apply to the entire plot, i.e., to every layer * ggplot2 can detect what type of scale you might want, but it isn't perfect * For example, you might want a logarithmic scale instead of the default linear scale wzxhzdk:40 Labels ====== * Always annotate graphs with a title and human-readable labels for each aesthetic * x- and y-axes * Legends and colour bars

wzxhzdk:41

wzxhzdk:42

Relabelling =========== wzxhzdk:43 *** wzxhzdk:44 - Another alternative is to use the `forcats` package to relabel/reorder factors Task 5 ======================= wzxhzdk:45 More reading ============ * See the [ggplot2 documentation](http://docs.ggplot2.org/current/) for a visual summary of the available geometries, list of stats, and more; as well as detailed documentation * [All the Graph Things at the UBC STAT 545 site](https://stat545.com/graphics-overview.html) is part of an in-depth course covering a lot of the same material we cover here * [Chapter 3 of R for Data Science](http://r4ds.had.co.nz/data-visualisation.html) has a very nice introduction to ggplot2 that follows a similar flow to what we covered today * [39 studies about human perception in 30 minutes](https://medium.com/@kennelliott/39-studies-about-human-perception-in-30-minutes-4728f9e31a73) is a nice review of what we know about perception of data visualizations

jasonmtroos/rook documentation built on May 24, 2020, 3:16 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jasonmtroos/rook
Introduction to Data Visualization, Web Scraping, and Text Analysis in R

In jasonmtroos/rook: Introduction to Data Visualization, Web Scraping, and Text Analysis in R

Review

Applied problem: Merging samples

Applied problem: Merging samples

Remember...

Always look at the data first

Today

ggplot2

What is a graph?

Communicating with graphs

Psychological principles (Kosslyn, 2006)

Get their attention

Hold and direct their attention

Help them remember

ggplot2's grammar

ggplot2's grammar

Layers

Test data

Defaults

An empty plot

Adding a layer

Each layer has a geometry

Each layer has a statistic

Geoms and statistics

Data versus statistics

Aesthetics

Position

Practice with layers (Tasks 1--4)

Data

Task 0 (Example)

Facets and discrete groups

Groups

R Package Documentation

Browse R Packages

We want your feedback!

jasonmtroos/rook Introduction to Data Visualization, Web Scraping, and Text Analysis in R

In jasonmtroos/rook: Introduction to Data Visualization, Web Scraping, and Text Analysis in R

Review

Applied problem: Merging samples

Applied problem: Merging samples

Remember...

Always look at the data first

Today

ggplot2

What is a graph?

Communicating with graphs

Psychological principles (Kosslyn, 2006)

Get their attention

Hold and direct their attention

Help them remember

ggplot2's grammar

ggplot2's grammar

Layers

Test data

Defaults

An empty plot

Adding a layer

Each layer has a geometry

Each layer has a statistic

Geoms and statistics

Data versus statistics

Aesthetics

Position

Practice with layers (Tasks 1--4)

Data

Task 0 (Example)

Facets and discrete groups

Groups

R Package Documentation

Browse R Packages

We want your feedback!

jasonmtroos/rook
Introduction to Data Visualization, Web Scraping, and Text Analysis in R