library(learnr)
library(gradethis)
library(knitr)

gradethis::gradethis_setup()
tutorial_options(exercise.timelimit = 60)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

# Ensure that library is loaded.
library(tidyverse)
# Ensure that the data is loaded for the remainder of this tutorial.
consumers <- UsingRTutorials::consumers
# Alternatives: readr::read_csv("http://82.196.4.233:3838/www/consumers.csv") or readr::read_csv("data/consumers.csv")

Overview

Course Organization

Lecturers

Dr. Christin Scholz, c.scholz@uva.nl

Assistant Professor Health & Persuasive Communication

ACHC Communication, Brain & Society Lab, www.cobras-lab.com

Dr. Wouter de Nooy

Senior Lecturer Research Methods, w.denooy@uva.nl

Our expectations

Who are you?

Tell us about yourself: study programme and programming experience.

What should you expect?

At the end of this class, you will NOT know EVERYTHING about R.

We hope you will:

Materials

Book:

Additional materials: Canvas.

Study the syllabus (module guide) on Canvas!

Topics

data.frame(Week = c(1, 1, 2, 2, 3, 3, 4, 4),
           Date = c("Jan 10", "Jan 12", "Jan 17",
                    "Jan 19", "Jan 24", "Jan 26",
                    "Jan 31", "Feb 2"),
           Topics = c(
             " A New Way of Working: Preface, Part I, Ch. 1-2 {1-4}",
             " Descriptive Statistics and Reports: Ch. 3-6, 21 {5-8, 27}",
             " Principles of Database Management: Ch. 7-10 {10-13}",
             " Handling Special Types of Data: Ch. 11-13 {14-16}",
             " Programming: Ch. 14-17 up to p. 322 {18-21 up to 21.3}",
             " Modelling: Sections 3-4 in _Help, My Collaborator Uses R!_  instead of Ch. 18-20 {23-25}",
             " Communicating results: Ch. (21) 22-24 {(27) 28-30}",
             " Project presentation and final report submission"
           )) %>%
  knitr::kable(format = "html", col.names = c("Week", "Date", "Topics & Chapters"), align = c("l", "l", "l")) %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = TRUE) %>%
  kableExtra::add_footnote(label = "{} = online chapter numbers",
                          notation = "none")

Session Organization

Course Content (~1.5 hours)

Data Project Collaboration (~1.5 hours)

Thu: New problem set (due Sunday).

Use your study buddy, come prepared, and bring questions.

Study Buddies

…an optional source of knowledge, advice, and support.

Why have a study buddy?

Recommendations

We will assign a study buddy to you.

Grading

Final grade:

  1. Problem Sets (60%).
  2. Data Project: reproducible report with one data visualization per team member (40%).

Problem Sets

Selection of exercises like those in R for Data Science.

Assigned to students on Thursday as a Canvas assignment.

Submission deadline: following Sunday.

Data Project: Tell a visual story

Visual story example

knitr::include_graphics("images/wolf.png")

Source: https://interaktiv.morgenpost.de/woelfe-in-deutschland/

A visualization requires a lot of data wrangling:

Get inspired...

Some examples of data visualizations:

Books:

SCRUM(-Light)

SCRUM is a framework for managing teamwork in a systematic and empirically informed manner.

What is SCRUM(-Light)?

Why SCRUM(-Light)?

Teaming Up for the Data Project

  1. Alcohol Posts and Alcohol Use.
  2. Dynamics and Performance of an IT Call Center.
  3. Social Evolution.
  4. Friends and Families.
  5. German Chancellor Candidates Debate.
  6. EU 2014 elections.
  7. Or your own... (See Canvas for requirements)

The Canvas data project data sets module page contains the data and data descriptions.

Who will work on which complex dataset (and with whom)?

Submit a list of (3-4) team members & a team name on the Canvas discussion board during the break.

Questions about the Syllabus?

Intermezzo: Tutorial Installation

You installed the latest versions, didn’t you?

In addition, install this tutorial:

Start the tutorial

Now you can start the first tutorial:

Alternatively:

Using a tutorial

Your tutorial answers are saved until you press Start Over.

Use button to stop the tutorial.

Coding in R: Using functions

We use functions in R to accomplish something.

function_name(argument_name = value, ...)

Function arguments specify the input for a function:

We recommend named arguments.
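
As a minimal sketch of why named arguments help, the example below calls the base R function `substr()` three ways. The function and the input string are chosen only for illustration; they are not part of the course data.

```r
# Positional matching: R assigns values to arguments by their order.
by_position <- substr("tutorial", 1, 3)

# Named matching: each value is explicitly tied to an argument name.
by_name <- substr(x = "tutorial", start = 1, stop = 3)

# With names, the order of the arguments no longer matters.
reordered <- substr(stop = 3, x = "tutorial", start = 1)

by_position                        # "tut"
identical(by_position, by_name)    # TRUE
identical(by_name, reordered)      # TRUE
```

Named arguments make the code longer, but a reader does not have to remember the argument order to understand what each value means.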

What does the command (below) do? Add the argument names to the arguments in the function.
seq(1, 15, 2)
__Hint:__ Use the Help page for this function (`?seq`). The help page will open in a browser window. If you use the help function in RStudio, the help page opens in the _Help_ tab.
seq(from = 1, to = 15, by = 2)
gradethis::grade_code(
  incorrect = "Supply each argument name with an equals sign."
  )

__Programming Tips__
- Run code and have a look at the output to figure out what the arguments mean.
- Look up the arguments of the function in the Help page or from the code completion options shown when you start typing within the brackets. __Hint__: In the RStudio interface, put your cursor in the name of a function and press F1 to get help.
- If the code is not valid, R displays an error message if the code is run. Correct the mistake before you continue.
- Run the updated code to see if the results are what they should be.
Use this programming approach to find out what the `length.out` argument means.
seq(1, 15, length.out = )

Correct the code below, which should give integer numbers from 10 down to 1.
seq(10, 1, 1)
__Hint:__ What does the error message say if you run the code?

Where do the results go?

We can assign the results of a function to a data object:

left_hand_object <- function_name(argument_name = value, ...)

If left-hand data object:

In the previous exercises, the function results were sent to the screen.
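
The difference between sending a result to the screen and storing it can be sketched as follows. The object names `word_length` and `doubled` are hypothetical and used only for this illustration.

```r
# Without assignment, the result is printed to the screen and then lost.
nchar(x = "tutorial")

# With assignment, nothing is printed; the result is stored instead.
word_length <- nchar(x = "tutorial")

# Typing the object name prints the stored value.
word_length                # 8

# The stored value can be reused as input for further calculations.
doubled <- word_length * 2
doubled                    # 16
```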

Now, store the results of the three functions below in data objects named `my_output`, `my_Output`, and `my_output` respectively.
seq(from = 1, to = 20, by = 1)
seq(from = 10, to = 20, by = 1)
seq(from = 10, to = 10, by = 1)
__Hint:__ Use the <- operator to store results. Spell names correctly. R is case-sensitive: lower case is different from uppercase.
my_output <- seq(from = 1, to = 20, by = 1)
my_Output <- seq(from = 10, to = 20, by = 1)
my_output <- seq(from = 10, to = 10, by = 1)
gradethis::grade_code()

What does the data object `my_output` contain? Enter the value(s) below.

gradethis::grade_result(
  pass_if(10, "The first data object created with the name `my_output` is overwritten by the data object last created because they have the same name. In this way, a data object may not contain the data that you initially intended it to contain. Be careful!"),
  fail_if(~ TRUE, "If you run 'my_output', you can see what it contains.")
)

Cleaning up

__Programming Tip__ - The number of data objects in the R environment may grow quickly. It is good practice to discard data objects that are not needed (anymore). The base R functions `remove()` and, more concisely, `rm()` remove one or more objects from the environment.
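
A small sketch of this clean-up habit, using two throwaway objects (`scratch_a` and `scratch_b`, hypothetical names created only for this example):

```r
# Hypothetical objects, created only for this illustration.
scratch_a <- 1:3
scratch_b <- "temporary"

# Both objects now exist in the current environment.
exists("scratch_a")   # TRUE

# Remove both objects with a single call.
rm(scratch_a, scratch_b)

# After removal, R can no longer find them.
exists("scratch_a")   # FALSE
exists("scratch_b")   # FALSE
```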
# Ensure that the data objects are available in the tutorial.
my_Output <- seq(10, 20, 1)
my_output <- seq(10, 10, 1)
Remove all data objects created in the preceding exercise with one `rm()` command.
rm()
__Hint:__ Separate data objects with commas. Do not include the same data object name more than once.
rm(my_output, my_Output)
gradethis::grade_code(
  incorrect = "Names of data objects must be used only once and be separated by commas."
  )

Visual Exploration

The trend is towards visualizing data properties rather than tabulating them.

# Summary of brand awareness by gender and wom.
means <- consumers %>% group_by(`Gender` = gender, `Word of mouth` = wom) %>%
  summarise(`Average brand awareness` = round(mean(brand_aw), digits = 1)) 
# Numeric summary.
knitr::kable(means, "html") %>%
  kableExtra::kable_styling(font_size = 16, full_width = FALSE, position = "left")
means %>% 
  ggplot() +
  geom_bar(aes(x = `Word of mouth`, y = `Average brand awareness`, fill = Gender), stat = "identity", position = "dodge") +
  theme_bw(base_size = 12) +
  scale_x_discrete(name = "Heard of brand by word of mouth") +
  scale_fill_discrete(name = "") +
  theme(legend.position = "top",
        plot.background = element_blank()) #see ggplot book
rm(means)

Plots with ggplot()

Philosophy: Grammar of Graphics (Leland Wilkinson)
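
One way to read the Grammar of Graphics idea in ggplot2 terms is that a plot is assembled from data, aesthetic mappings, and one or more layers. The sketch below uses R's built-in `mtcars` data as a stand-in (an assumption for illustration; the tutorial itself works with `consumers`).

```r
library(ggplot2)

p <- ggplot(
  data = mtcars,                   # 1. the data
  mapping = aes(x = wt, y = mpg)   # 2. aesthetics: variables mapped to the plane
  ) +
  geom_point() +                   # 3. a layer that draws the observations
  geom_smooth(                     # 4. a second layer: a fitted straight line
    method = "lm",
    formula = y ~ x,
    se = FALSE
    )

# Printing the object draws the plot.
p
```

Each `+` adds a component, so a plot can be built and checked one step at a time.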

Recreate a plot in steps

We will use some (fake) consumers data about a particular brand and exposure to an advertising campaign for the brand. These are the variables:

tibble::tibble(
  `Variable name` = c("ad_expo", "wom", "gender", "brand_aw", "firstname"),
  `Variable Label` = c("Exposure to the campaign", "Heard about the brand through word of mouth", "Gender of the respondent", "Awareness of the brand", "Respondent's first name"),
  `Value Labels`= c("1 = No exposure; 10 = Max exposure", "yes, no", "female, male", "1 = Not aware; 10 = Max aware", "")) %>%
  knitr::kable(booktabs = TRUE)
# Standard ggplot plot with title and axis labels.
ggplot2::ggplot(
  data = consumers,
  mapping =
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(color = gender, shape = wom),
    size = 4
    ) +
  geom_smooth(
    method = "lm",
    formula = y ~x,
    se = FALSE,
    color = "black"
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    )

We are going to recreate the above plot in steps.

Step 1: Aesthetics: the plane

Add the points to the plot.
ggplot(
  data = consumers
  )
__Hint:__ Use the *Data Visualization with ggplot cheat sheet* (available from the RStudio *Help > Cheat Sheets* menu) if you cannot remember the geom that you need. It is usually better to specify the x and y coordinates of the plane in the `ggplot()` function than in a geom.
ggplot(
  data = consumers,
  mapping =
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point()
gradethis::grade_code()

Step 2 - Aesthetics: Attributes

# Standard ggplot plot with title and axis labels.
ggplot2::ggplot(
  data = consumers,
  mapping =
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(color = gender, shape = wom),
    size = 4
    ) +
  geom_smooth(
    method = "lm",
    formula = y ~x,
    se = FALSE,
    color = "black"
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    )
Set the colour, shape, and size of the dots.
# Copy the solution to the preceding exercise here as your starting point.
ggplot(
  data = consumers
  )
__Hint:__ An aesthetic, such as colour or size, must be included in the `aes()` function if it must be linked to a variable. Each group on the variable then receives its own color.
ggplot(
  data = consumers,
  mapping =
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    )
gradethis::grade_code(
   correct = "Dot colour and shape are linked to a variable, so they must be inside an aes() function. In contrast, dot size is constant (one value for all dots), so it must be outside an aes() function."
)
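
The distinction in the feedback above, mapping an aesthetic to a variable versus setting it to a constant, can be sketched with R's built-in `mtcars` data (used here only as an illustration; the object names `mapped` and `constant` are hypothetical).

```r
library(ggplot2)

# Colour is mapped to a variable, so it goes inside aes():
# each number of cylinders gets its own colour and a legend appears.
mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(mapping = aes(colour = factor(cyl)), size = 3)

# Colour is a constant, so it goes outside aes():
# all points are blue and there is no legend.
constant <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue", size = 3)

mapped
constant
```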

Step 3 - Additional layers

# Standard ggplot plot with title and axis labels.
ggplot2::ggplot(
  data = consumers,
  mapping =
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(color = gender, shape = wom),
    size = 4
    ) +
  geom_smooth(
    method = "lm",
    formula = y ~x,
    se = FALSE,
    color = "black"
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    )
Add the linear regression line to the plot.
# Copy the solution to the preceding exercise here as your starting point.
ggplot(
  data = consumers
  )
__Hint:__ Mind the details of the regression line.
ggplot(
  data = consumers,
  mapping =
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black"
    )
gradethis::grade_code()

Step 4 - Labels

# Standard ggplot plot with title and axis labels.
ggplot2::ggplot(
  data = consumers,
  mapping =
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(color = gender, shape = wom),
    size = 4
    ) +
  geom_smooth(
    method = "lm",
    formula = y ~x,
    se = FALSE,
    color = "black"
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    )
Add the labels to the plot or change the labels. Mind the details.
# Copy the solution to the preceding exercise here as your starting point.
ggplot(
  data = consumers
  )
__Hint:__ Mind the details of the regression line.
ggplot(
  data = consumers,
  mapping =
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black"
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    )
gradethis::grade_code()

Facetting

Different graphs for different groups of observations.

# Adaptation: word of mouth as facets instead of shape.
ggplot(
  data = consumers,
  mapping = 
    aes(x = ad_expo,
        y =brand_aw)
  )+
  geom_point(
    mapping = aes(color = gender),
    size = 4
    ) +
  geom_smooth(
    mapping = 
    aes(color = gender),
    method = "lm",
    se = FALSE
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    ) +
  facet_wrap(vars(wom))
Create the two plots in one row.
# Copy the solution to the preceding exercise here as your starting point.
ggplot()
__Hint:__ You must use the vars() function to identify the variable that is used to create facets.
# Adaptation: word of mouth as facets instead of shape.
ggplot(
  data = consumers,
  mapping = 
    aes(x = ad_expo,
        y =brand_aw)
  )+
  geom_point(
    mapping = aes(color = gender),
    size = 4
    ) +
  geom_smooth(
    mapping = 
    aes(color = gender),
    method = "lm",
    se = FALSE
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    ) +
  facet_wrap(~wom)
gradethis::grade_code(
  correct = "You can also use facet_wrap(vars(wom))."
)
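
As a hedged illustration of the two notations, the sketch below facets the built-in `mtcars` data (not the course data) by the variable `am`, once with a formula and once with `vars()`. The `nrow` argument of `facet_wrap()` is added here to request a single row explicitly; with only two panels this is also what happens by default.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Formula notation.
p_formula <- base_plot + facet_wrap(~ am, nrow = 1)

# vars() notation; this should produce the same panels.
p_vars <- base_plot + facet_wrap(vars(am), nrow = 1)

p_formula
p_vars
```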

Adding labels (advanced)

We may want to know who is the outlier in the plot, having an exceptionally low exposure score. Use geom_text() to add the participants' first names to the plot, as shown below. Carefully inspect the arguments for this geom.

# Adaptation: add firstname as label.
ggplot2::ggplot(
  data = consumers,
  mapping = 
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    ) +
  geom_smooth(
    mapping = 
    aes(color = gender),
    method = "lm",
    se = FALSE,
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    ) +
  geom_text(
    aes(label = firstname),
    nudge_y = 0.4,
    check_overlap = TRUE
    )
Create this graph yourself.
ggplot(
  data = consumers,
  mapping = 
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    ) +
  geom_smooth(
    mapping = 
    aes(color = gender),
    method = "lm",
    se = FALSE,
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    ) +
  geom_text()
__Hint:__ Use help on this geom to get more information about the arguments of this geom.
# Adaptation: add firstname as label.
ggplot(
  data = consumers,
  mapping = 
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    ) +
  geom_smooth(
    mapping = 
    aes(color = gender),
    method = "lm",
    se = FALSE,
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    ) +
  geom_text(
    aes(label = firstname),
    nudge_y = 0.4,
    check_overlap = TRUE
    )
gradethis::grade_code(
  correct = "The example plot uses `geom_text()` because the boxes created by `geom_label()` obscure the data."
  )

Adding primitives (advanced)

Actually, we can draw all kinds of shapes on the plot, for example, an arrow drawing attention to the extremely low exposure score.

Use `geom_segment` to create this plot.
# Adaptation: Add arrow pointing to extreme value.
ggplot2::ggplot(
  data = consumers,
  mapping = 
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    ) +
  geom_smooth(
    mapping = 
    aes(color = gender),
    method = "lm",
    se = FALSE
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    ) +
  geom_segment(
    x = 1, xend = 1, y = 4, yend = 2,
    arrow = arrow(
      type = "closed"
    )
  )
ggplot(
  data = consumers,
  mapping = 
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    ) +
  geom_smooth(
    mapping = 
    aes(color = gender),
    method = "lm",
    se = FALSE
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    ) +
  geom_segment()
__Hint:__ Pay attention to the `arrow` argument. It uses a function! Use help on the geom and on this function.
# Adaptation: Add arrow pointing to extreme value.
ggplot(
  data = consumers,
  mapping = 
    aes(x = ad_expo,
        y = brand_aw)
  )+
  geom_point(
    mapping = aes(shape = wom, color = gender),
    size = 4
    ) +
  geom_smooth(
    mapping = 
    aes(color = gender),
    method = "lm",
    se = FALSE
    ) +
  labs(
    title = "Does brand awareness depend on exposure, word-of-mouth, and gender?",
    x = "Exposure to the campaign",
    y = "Brand awareness"
    ) +
  geom_segment(
    x = 1, xend = 1, y = 4, yend = 2,
    arrow = arrow(
      type = "closed"
    )
  )
gradethis::grade_code(
  correct = "And yes, you can change the color and size of the arrow just like you can change them for points and other geoms."
)
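
A possible variation, sketched on the built-in `mtcars` data rather than the course data: the arrow is drawn with `annotate()` instead of `geom_segment()`, and its colour and arrowhead length are set explicitly. The coordinates are arbitrary values chosen so that the arrow points towards the heaviest cars. Using `annotate()` is a design choice of this sketch, not the tutorial's solution; it draws the segment once instead of taking the data from the plot.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  annotate(
    geom = "segment",
    x = 5, xend = 5.35, y = 20, yend = 16,   # start and end of the arrow
    colour = "red",                          # constant colour for the arrow
    arrow = arrow(
      type = "closed",
      length = unit(0.3, "cm")               # size of the arrowhead
    )
  )

p
```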

Evaluating a Plot

Once you master the ggplot2 package and other packages for creating plots (see the Fancy Stuff part of this tutorial), you should start thinking about what you want to convey with a plot and whether the plot conveys your message in a clear and attractive way.

We will use the following criteria in this course:

  1. The data visualization is sufficiently complex to tell a story. For example, it presents both a pattern (as a first impression) and deviations from this pattern (inviting reflection about the pattern). The deviations, however, should not be overwhelming because of too much information.

  2. The data visualization is self-explanatory. It should be comprehensible if it is presented by itself. For example, use informative, readable labels.

  3. The data visualization gives an accurate, not a biased view of the data. For example, sizes should accurately reflect quantity.

  4. The data visualization uses graphic features (size, font type, colors, line styles) in such a way that the main parts are stressed and unimportant parts remain visually in the background. Justify your choices with comments in the code.

If you see a plot (or any other graphic):

Practice with the plots below. Use the radio buttons to select a plot and press the play button under the slider; the plot is first shown for a few seconds and, a bit later, for a longer time.
fluidPage(
  fluidRow(
    column(2,
      radioButtons("radio", label = h3("Select a plot"),
    choices = list("Plot 1" = 1, "Plot 2" = 2, "Plot 3" = 3, "Plot 4" = 4, "Plot 5" = 5, "Plot 6" = 6), 
    selected = 1),
      sliderInput("slider", label = "", min = 0, max = 20, value = 0,
                  step = 1, ticks = FALSE, 
                  animate = TRUE
      )
    ),
    column(10,
      plotOutput("regPlot")
    )
  )  
)
output$regPlot <- renderPlot({
  g <- ggplot(
    data = consumers,
    aes(x = ad_expo, y = brand_aw)
    ) +
    scale_x_continuous(name = "Exposure", breaks = 1:10, limits = c(1, 10)) +
    scale_y_continuous(name = "Brand awareness", breaks = 1:10, limits = c(1, 10)) +
    theme_bw( base_size = 14 )

  ## Create plot versions
  if (input$radio == 1) { 
    # just regression line and confidence interval
    g <- g + geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "black")
    }
  else if (input$radio == 2) {
    # point size reflects brand_aw
    g <- g + 
      geom_point(aes(size = brand_aw), color = "grey") +
      geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
    }
  else if (input$radio == 3) {
    # outlier not visible (and regression line without outlier)
    g <- g + 
      geom_point(size = 2, color = "grey") +
      geom_smooth(data = consumers[consumers$ad_expo > 1,], 
        mapping = aes(x = ad_expo, y = brand_aw),
        method = "lm", formula = y ~ x, se = FALSE) +
      scale_x_continuous(limits = c(4, 10))
    }
  else if (input$radio == 4) {
    # grey total regression line, red regression line without outlier
    g <- g + 
      geom_point(size = 2, aes(color = ifelse(ad_expo > 1, "grey", "red")), show.legend = FALSE) +
      geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "grey", size = 2) +
      geom_smooth(
        data = consumers[consumers$ad_expo > 1,], 
        mapping = aes(x = ad_expo, y = brand_aw),
        method = "lm", formula = y ~ x, se = FALSE, color = "red"
        ) +
      geom_text(x = 3.9, y = 4.3, label = "without outlier", size = 3, color = "red", alpha = 0.6) +
      scale_color_manual(values = c("red", "grey"))
    }
  else if (input$radio == 5) {
    # regression line per gender, with additional regression line for males without outlier
    g <- g + 
      geom_point(aes(color = gender), size = 2) +
      geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(color = gender)) +
      geom_smooth(
        data = consumers[consumers$ad_expo > 1 & consumers$gender == "male",], 
        mapping = aes(x = ad_expo, y = brand_aw),
        method = "lm", formula = y ~ x, se = FALSE, color = "blue", linetype = "dashed"
        ) +
      geom_text(x = 4.1, y = 3.4, label = "without outlier", size = 3, color = "blue", alpha = 0.6)
    }
  else if (input$radio == 6) {
    # density contours and regression lines per gender
    g <- g + 
      geom_density2d(aes(color = gender),  adjust = 2) +
      geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(color = gender))
    }

  ## Show plot, depending on slider value
  if (input$slider == 0) {
    #show press button text
    ggplot(data = consumers) + geom_text(x = 0.5, y = 0.5, label = "Press the little play button", size = 12) 
    } 
  else if ((input$slider > 0 & input$slider < 4) | (input$slider > 10)) {
    # ask user to reset the slider 
    if (input$slider == 20) {
      if (input$radio == 3 ) {hor_pos = 7}
      else {hor_pos = 5.5}
      g <- g + geom_text(x = hor_pos, y = 8.5, label = "Reset the slider to 0\nbefore you watch another plot.", size = 8)
      }
    #show plot
    g
    } 
  else {
    # What did you see?
    ggplot(data = consumers) + geom_text(x = 0.5, y = 0.5, label = "What did you see?", size = 18)
    }
})

Additional Materials

Book on ggplot2:

Interactive training:

Debugging (advanced)

Your friend tried to create a non-stacked bar chart showing the proportion of females in the consumers data set who heard by word of mouth against the proportion who did not, as well as the proportion of males who heard and who did not hear by word of mouth.

Your friend did not manage to get the code to work. Can you do that?
ggplot(data consumers) +
  geom_bar(
    mapping = aes(
      x = wom, 
      color = gender,
      position = "dodge"
  )
__Hints:__
- Pay attention to the error messages when you run the code.
- For proportions, you have to use a computed variable: check out the help on `geom_bar()`.
- You may want to check the _R for Data Science_ book: bar charts are tricky.
ggplot(data = consumers) +
  geom_bar(
    mapping = aes(
      x = wom, 
      y = ..prop..,
      group = gender,
      fill = gender
    ),
    position = "dodge"
  )
gradethis::grade_code(
  correct = "To get proportions, you must indeed specify both the y argument (indicating that you want proportions) and the group argument (specifying which total to use for calculating proportions).", 
  incorrect = "If you don't see the problems, build up the graph from zero, step by step."
  )
__Programming Tips__
- When you program something complicated, build it in steps.
- Check intermediate results regularly by running the code and inspecting the results.
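
To see what the `group` argument does to the proportions, the sketch below builds a dodged bar chart on the built-in `mtcars` data (an illustration only, not the course data) and inspects the values that ggplot2 computes. It uses `after_stat(prop)`, the notation available from ggplot2 3.3.0 onwards, where the tutorial code uses `..prop..`.

```r
library(ggplot2)

p <- ggplot(data = mtcars) +
  geom_bar(
    mapping = aes(
      x = factor(gear),
      y = after_stat(prop),   # computed variable: proportion instead of count
      group = factor(am),     # proportions are calculated within each group
      fill = factor(am)
    ),
    position = "dodge"
  )

# Inspect the values behind the bars as an intermediate check.
bars <- layer_data(p, i = 1)
prop_totals <- tapply(bars$prop, bars$group, sum)

# If the grouping works as intended, the proportions sum to 1 within each group.
prop_totals

p
```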

Fancy Stuff

gganimate package

If you want to animate your ggplot plot, the package gganimate provides you with tools to create an animated GIF (with the gifski package) or a video (with the av package).

The code below creates an animated gif using Gapminder data on life expectancy, GDP per capita, and population size by country.

Play around with the settings that are specific to `animate()`.

Note that it takes quite some time to generate the animation.

# Install the following packages if they haven't been installed.
library(gganimate)
library(gifski)
library(gapminder) #data used

# This code creates an animated ggplot
g <- ggplot(gapminder::gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  facet_wrap(~continent) +
  # Here comes the gganimate specific bits
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear') +
  shadow_wake(
    wake_length = 0.2
  )

# And here, we show the animated plot.
gganimate::animate(
  plot = g, #gganimate plot to be shown
  nframes = 78, #1 frame for each year from 1952 to 2007 plus 2x11 additional frames for start and end 
  renderer = gifski_renderer( #save as animated GIF
    file = "gapminder.gif",
    loop = TRUE
    ),
  start_pause = 12, #first frame shows 12 times
  end_pause = 12, #last frame shows 12 times
  rewind = FALSE #do not roll back to the start
)

If you have a look at the gapminder data (e.g., with View(gapminder)), you will see that the data are available for 1952, 1957, 1962, 1967, and so on. The gganimate animation creates frames for the years in between. In a way, the data for the in-between years are fabricated; they may give a wrong view of reality.

__Programming Tip__ - Carefully inspect the results created by packages that do a lot of work for you. They may do things that you do not want.
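
One quick way to check which years are actually observed, assuming the gapminder package is installed, is to list the distinct values of the `year` variable:

```r
library(gapminder)

# The distinct years for which the data set contains observations.
years <- sort(unique(gapminder$year))
years

# The gaps between consecutive observed years.
diff(years)
```

Any frame that the animation shows for a year not in this list is based on interpolated rather than observed values.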

Movement is fascinating but it can also be frustrating if the user cannot pause or determine the speed of the animation. Evaluate the pros and cons of animations critically.

ggplotly() in the plotly package

The plotly library is designed for creating interactive graphics. It has its own language for creating graphs but for the ggplot user, it provides the 'ggplotly()' function to change a ggplot plot into an interactive plotly plot.

The plotly library offers the option to zoom, select items in the graph, and see additional information about the items in the graph.

Run the code to see the graph.

If you carefully position your cursor on a dot, the respondent's first name will pop up (works better if you select the Compare data on hover option).

#these packages have been installed by UsingRTutorials
library(plotly) 
library(gapminder) #contains the data used here

#Step 1: create a ggplot
g <- ggplot(
  data = consumers,
  aes(x = ad_expo, y = brand_aw)
  ) +
    geom_point(aes(color = gender), size = 2) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(color = gender)) +
    geom_smooth(
      data = consumers[consumers$ad_expo > 1 & consumers$gender == "male",], 
      mapping = aes(x = ad_expo, y = brand_aw),
      method = "lm", formula = y ~ x, se = FALSE, color = "blue", linetype = "dashed"
      ) +
    geom_text(
      aes(label = firstname),
      alpha = 0 #trick: make labels invisible (transparent)
      ) +
    scale_x_continuous(name = "Exposure", breaks = 1:10, limits = c(1, 10)) +
    scale_y_continuous(name = "Brand awareness", breaks = 1:10, limits = c(1, 10)) +
    theme_bw()

#Step 2: Turn the ggplot into a plotly plot and use plotly options.
ggplotly(g, tooltip = c("text"), dynamicTicks = TRUE)

There are (limited) options for animation, as demonstrated in the plot below. Run the code to see the plot.

Change the code so that the pop-ups only show the country name.
#Step 1: create a ggplot and use a variable to define the frames of the animation
p <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(aes(size = pop, frame = year, ids = country)) +
  scale_x_log10() #this changes the scale to a log scale, so very large differences are compressed

#Step 2: Turn the ggplot into a plotly plot.
ggplotly(p)
__Hint:__ Have a look at the help for the `ggplotly()` function.
p <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(aes(size = pop, frame = year, ids = country)) +
  scale_x_log10() #this changes the scale to a log scale, so very large differences are compressed

ggplotly(p, tooltip = "country")
gradethis::grade_code()

Shiny

The shiny package is the RStudio contribution to interactive R products. Actually, this tutorial is made with Shiny and it contains a Shiny app, namely, the animated plots used in the Evaluating a Plot section.

Here is the code for these plots. You cannot run the code here because we cannot start a Shiny app from within a Shiny app (this tutorial). If you create a new Shiny app in RStudio (File > New File > Shiny Web App) and copy and paste the code below into the Shiny app file (replacing all existing contents), you can run the app from RStudio (use the Run App button).

Experiment with the user interface and the R code.
# load the shiny package
library(shiny)

# first part of the app: the User Interface (ui)
ui <- fluidPage(
    fluidRow( #the first (and only) row in the interface
        column(2, #the first (left) column, width 2 out of 12
               radioButtons("radio", label = h3("Select a plot"),
                            choices = list("Plot 1" = 1, #label and value if selected
                                           "Plot 2" = 2, 
                                           "Plot 3" = 3, 
                                           "Plot 4" = 4, 
                                           "Plot 5" = 5, 
                                           "Plot 6" = 6), 
                            selected = 1),
               #only a slider can be animated
               sliderInput("slider", label = "", min = 0, max = 20, value = 0,
                           step = 1, ticks = FALSE, 
                           animate = TRUE
               )
        ),
        column(10, #the second (right) column, width 10 out of 12
               plotOutput("regPlot") #containing the plot named regPlot
        )
    )  
)

# second part of the app: the R code
server <- function(input, output, session) {
  # create a plot named regPlot to be shown in the output
  output$regPlot <- renderPlot({
    # the basic (empty) plot  
    g <- ggplot(
      data = consumers,
      aes(x = ad_expo, y = brand_aw)
      ) +
      scale_x_continuous(name = "Exposure", breaks = 1:10, limits = c(1, 10)) +
      scale_y_continuous(name = "Brand awareness", breaks = 1:10, limits = c(1, 10)) +
      theme_bw( base_size = 14 )

    ## Create a plot version for each radio option
    if (input$radio == 1) { 
      # just add regression line and confidence interval to the empty plot
      g <- g + geom_smooth(method = "lm", 
                           formula = y ~ x, se = TRUE, 
                           color = "black")
    }
    else if (input$radio == 2) {
      # point size reflects brand_aw
      g <- g + 
        geom_point(aes(size = brand_aw), color = "grey") +
        geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
    }
    else if (input$radio == 3) {
      # outlier not visible (and regression line without outlier)
      g <- g + 
        geom_point(size = 2, color = "grey") +
        geom_smooth(data = consumers[consumers$ad_expo > 1,], 
                    mapping = aes(x = ad_expo, y = brand_aw),
                    method = "lm", formula = y ~ x, se = FALSE) +
        scale_x_continuous(limits = c(4, 10))
    }
    else if (input$radio == 4) {
      # grey total regression line, red regression line without outlier
      g <- g + 
        geom_point(size = 2, 
          aes(color = ifelse(ad_expo > 1, "grey", "red")), show.legend = FALSE) +
        geom_smooth(method = "lm", formula = y ~ x, se = FALSE, 
                    color = "grey", linewidth = 2) +
        geom_smooth(
          # trick: base R to create a subset of all consumers, omitting the outlier
          data = consumers[consumers$ad_expo > 1,], 
          mapping = aes(x = ad_expo, y = brand_aw),
          method = "lm", formula = y ~ x, se = FALSE, color = "red"
        ) +
        geom_text(x = 3.9, y = 4.3, label = "without outlier", 
                  size = 3, color = "red", alpha = 0.6) +
        scale_color_manual(values = c("red", "grey"))
    }
    else if (input$radio == 5) {
      # regression line per gender, with additional regression line 
      # for males without outlier
      g <- g + 
        geom_point(aes(color = gender), size = 2) +
        geom_smooth(
          method = "lm", formula = y ~ x, se = FALSE, aes(color = gender)
          ) +
        geom_smooth(
          data = consumers[consumers$ad_expo > 1 & consumers$gender == "male",], 
          mapping = aes(x = ad_expo, y = brand_aw),
          method = "lm", formula = y ~ x, se = FALSE, 
          color = "blue", linetype = "dashed"
        ) +
        geom_text(x = 4.1, y = 3.4, label = "without outlier", 
                  size = 3, color = "blue", alpha = 0.6)
    }
    else if (input$radio == 6) {
      # density contours and regression lines per gender
      g <- g + 
        geom_density2d(aes(color = gender),  adjust = 2) +
        geom_smooth(
          method = "lm", formula = y ~ x, se = FALSE, aes(color = gender)
          )
    }

    # Show plot, depending on slider value
    # This is the animation trick: slider values range from 0 to 20.
    # If it is 0, a text is shown; if it is 1 to 3 or above 10,
    # the plot is shown; a text is shown from 4 to 10; and
    # a text is added to the plot if the slider is 20.
    if (input$slider == 0) {
      #show press button text
      ggplot(data = consumers) + 
        geom_text(
          x = 0.5, y = 0.5, 
          label = "Press the little play button", size = 12
          ) 
    } 
    else if ((input$slider > 0 && input$slider < 4) || input$slider > 10) {
      # ask user to reset the slider 
      if (input$slider == 20) {
        # plot 3 uses a narrower x axis (4 to 10), so its text is placed further right
        hor_pos <- if (input$radio == 3) 7 else 5.5
        g <- g + geom_text(
          x = hor_pos, y = 8.5, 
          label = "Reset the slider to 0\nbefore you watch another plot.", 
          size = 8
          )
      }
      #show plot
      g
    } 
    else {
      # What did you see?
      ggplot(data = consumers) + 
        geom_text(
          x = 0.5, y = 0.5, 
          label = "What did you see?", size = 18
          )
    }
  })  
}

# the command to create and run the app
shinyApp(ui, server)

Check out the Shiny demos gallery for inspiration. Start simple!

If you want to create a Shiny app, first create your plot with ggplot(), then add it to the Shiny app.
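The plot-first workflow can be sketched as follows. This is a minimal illustration, not the app used in this tutorial: the built-in `mtcars` data stands in for your own data, and the names `make_plot`, `line`, and `scatter` are made up for this example.

```r
library(shiny)
library(ggplot2)

# Step 1: create the plot outside Shiny, so you can check it in the console first.
# mtcars is a built-in stand-in for your own data; make_plot is a hypothetical helper.
make_plot <- function(show_line = TRUE) {
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point()
  if (show_line) {
    # optional regression line, switched on or off by the user
    p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
  }
  p
}

# Step 2: wrap the plot in a small app with one input and one output.
ui <- fluidPage(
  checkboxInput("line", label = "Show regression line", value = TRUE),
  plotOutput("scatter")
)

server <- function(input, output, session) {
  output$scatter <- renderPlot({
    make_plot(show_line = input$line)
  })
}

# Create the app object; start it with runApp(app) or the Run App button.
app <- shinyApp(ui, server)
```

Keeping the plot code in its own function means you can test it without starting the app, and the server part stays short.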

Collaboration on the Data Project

Plots are very useful for getting a first idea of your data.

As a first step, use ggplot plots to describe individual variables and relations between variables in (one of) your Data Project file(s).
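A few typical starting plots are sketched below. The built-in `mtcars` data and its variables are stand-ins only; replace them with your own data frame and variable names.

```r
library(ggplot2)

# mtcars is a built-in stand-in; replace it with your Data Project data.

# One numeric variable: histogram.
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# One categorical variable: bar chart of counts.
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Relation between two numeric variables: scatterplot.
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Relation between a categorical and a numeric variable: boxplots.
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()
```

Type the name of a plot object (for example, `p_hist`) to display it.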

The tidyverse approach.

We use the tidyverse approach to R programming.

knitr::include_graphics("images/tidyverse2.png")

Source: https://r-unimelb.gitbook.io/rbook/into-the-tidyverse/the-tidyverse

tidyverse suite of packages:

Cheat sheets!

NOTE: The tidyverse packages are already loaded in the tutorials, so you can use their functions right away.

Load and inspect one of your Data Project data files

Time to start exploring (visualizing) your Data Project data.

If you haven't done so previously, download the Data Project data file(s) from Canvas to a directory that you create for your Data Project.

The tutorial must be able to find this data set, so you have to make the Data Project directory your working directory:
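In RStudio, you can set the working directory via the menu (Session > Set Working Directory) or with `setwd()`. The sketch below shows the `setwd()` route. The folder name `DataProject` is hypothetical, and the folder is created in a temporary location only so that the example runs anywhere; use the path to your own Data Project directory instead.

```r
# Hypothetical directory for illustration; replace it with your own
# Data Project directory, for example the folder you created for the data files.
project_dir <- file.path(tempdir(), "DataProject")
dir.create(project_dir, showWarnings = FALSE)

# Make it the working directory; setwd() returns the previous one.
old_dir <- setwd(project_dir)

# Check where R is looking for files now.
current_dir <- getwd()
current_dir

# List the files R can find there (your data file should be among them).
list.files()

# Go back to the previous working directory (only needed in this sketch).
setwd(old_dir)
```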

Use the code box below to import your data set, inspect it, and create plots.
# Set your Data Project directory as working directory.
# (Copy the setwd() command from RStudio here.)

# Load your data in the object myData.
myData <- read_csv("filename.csv")

# Have a look at the variables in your data.
str(myData)

# Create your first plot.
ggplot()

# Create a second plot.
ggplot()

# And a third?
ggplot()

__Programming Tips__

- Build up plots in steps, from simple to complex.
- Use lots of comments to explain what you are trying to do. Your team mates and your future self would like to know.
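As an illustration of building a plot in commented steps, here is a small sketch. It uses the built-in `mtcars` data as a stand-in for your own data.

```r
library(ggplot2)

# Step 1: data and aesthetics only; this gives an empty coordinate system.
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add the observations as points.
p <- p + geom_point()

# Step 3: add a linear regression line to summarize the relation.
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Step 4: finish with readable axis labels.
p <- p + labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```

Displaying `p` after each step shows what that step added, which makes it easier to spot the step where something goes wrong.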

Note: As long as you do not press the Start Over button, the code (and plots) are preserved in this tutorial, so you can use the code later.

Planning of Sprint 1

SCRUM masters update

We’ve made a start!



WdeNooy/UsingRTutorials documentation built on Jan. 25, 2023, 2:39 a.m.