library(gradethis)
library(learnr)
library(qsslearnr)
library(tidyverse)
tutorial_options(exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE)
tut_reptitle <- "QSS Tidyverse Tutorial 3: Output Report"
data(STAR, package = "qss")
star <- STAR

classcounts <- star %>%
  group_by(classtype) %>%
  count()

Handling Missing Data in R

Small class size data

In this chapter, you'll analyze data from the STAR project, a four-year randomized trial on the effect of small class sizes on educational performance. The star data frame has been loaded into your workspace so that you can play around with it a bit.

Exercises


star <- as_tibble(star)
grade_code()

str(star)
grade_code()

glimpse(star)
grade_code()

dim_desc(star)
grade_code()

summary(star)
grade_code()

Data wrangling with tidyverse: Handling missing data

You probably noticed that there were some NA values in the data when you used the str and glimpse functions. These are missing values, where the value for that unit on that variable is missing or unknown. They pose problems when we calculate quantities of interest like means or medians: by default, functions such as mean and median return NA whenever any input value is missing.
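As a minimal sketch of why missing values matter, the snippet below uses a small made-up vector (the name scores and its values are hypothetical, not taken from the STAR data) to show how mean behaves with and without na.rm = TRUE.

```r
# Hypothetical scores with one missing value (not from the STAR data)
scores <- c(700, NA, 720)

# By default, a single NA makes the result NA
mean_default <- mean(scores)

# na.rm = TRUE drops missing values before computing the mean
mean_clean <- mean(scores, na.rm = TRUE)

# is.na() flags which entries are missing; sum() counts them
n_missing <- sum(is.na(scores))
```

Here mean_default is NA, while mean_clean is the average of the two observed values.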

The drop_na function looks at the data frame and removes all rows (observations) with at least one missing value. In other words, it keeps only the complete rows, deleting any row in which some column (variable) contains an NA value. This is listwise deletion: the entire row is discarded even when it contains useful information in other columns. Thus, it is often better to tell drop_na which columns to check, so that rows are dropped only when those columns are missing.
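The difference between dropping on all columns and dropping on a named column can be sketched on a toy tibble. The data frame toy and its values are made up for illustration; only the column names echo the STAR data.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data; column names mimic STAR but values are made up
toy <- tibble(
  classtype = c(1, 2, 3, 1),
  g4math    = c(700, NA, 720, 690),
  g4reading = c(NA, 710, 715, 705)
)

# No columns named: any row with at least one NA is dropped
complete_rows <- drop_na(toy)

# Only g4math named: rows are dropped only when g4math is NA
math_rows <- drop_na(toy, g4math)

nrow(complete_rows)  # 2 rows survive
nrow(math_rows)      # 3 rows survive
```

Naming the column keeps the first row, whose reading score is missing but whose math score is still usable.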

Exercises


star %>% drop_na(g4math)
grade_code()

Visualizing Data

Barplots

The barplot is a useful way to visualize a categorical or factor variable. In this exercise, you are going to visualize the classtype variable from the star data frame, which can take on the following values: 1 (small class), 2 (regular class), and 3 (regular class with aid).

Exercises

## create a data frame called classcounts that has
## the counts of each category of classtype
classcounts <- star %>%
  count(classtype) # number of rows in each classtype category

## make a barplot with ggplot
classcounts %>% ggplot(aes(x = classtype, y = n)) +
  geom_bar(stat='identity') 
grade_code("Awesome. The graph is looking a little unhelpful, though. Let's spruce it up.")
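The count step behind the barplot can be checked on a toy data frame. This is a sketch under the assumption that a tiny made-up tibble (toy, hypothetical) stands in for star; it shows that count produces a column named n, which is what the y aesthetic refers to.

```r
library(dplyr)
library(tibble)

# Hypothetical class assignments, not the STAR data
toy <- tibble(classtype = c(1, 1, 2, 3, 1, 2))

# One row per category, with the number of observations in column n
toy_counts <- toy %>% count(classtype)

# The group_by() + count() form used at the top of this tutorial gives
# the same counts, except that the result stays grouped by classtype
toy_counts_grouped <- toy %>%
  group_by(classtype) %>%
  count()
```

Both versions return three rows, one for each value of classtype.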

Making the barplot readable

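The default barplot usually isn't all that readable: the x-axis shows the numeric codes of classtype rather than descriptive labels, and the axis titles are just the variable names.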

Exercises

## Use the function scale_x_discrete(labels = c()) 
## i.e., scale_x_discrete(labels = c(
##     "1" = "Small class", 
##     "2" = "" ,
##     ...))
classcounts %>% ggplot(aes(x=factor(classtype), y = n)) +
  geom_bar(stat='identity') + 
   scale_x_discrete(labels = c(
     "1" = "Small class", 
     "2" = "Regular class", 
     "3" = "Regular class with aid"))
## Use the function `labs` with its `x`, `y`, and `title` arguments
classcounts %>% ggplot(aes(x=factor(classtype), y = n)) +
  geom_bar(stat='identity') + 
   scale_x_discrete(labels = c(
     "1" = "Small class", 
     "2" = "Regular class", 
     "3" = "Regular class with aid")) +
  labs( x = "Classroom Type", 
        y = "Number of students", 
        title = "The Distribution of Students in Different Class Types")

Histograms

For quantitative (numerical) variables, the barplot won't work because there are too many unique values. In this case, you will often use a histogram to visualize a numerical variable.

Exercises

## create a histogram of g4math with ggplot
star %>% ggplot(aes(x = g4math)) +
  geom_histogram()
grade_code("Great job, though the graph is a bit spartan. Let's make it more readable.")

Sprucing up the histogram

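As with the barplot, there are several layers and arguments you can add to the ggplot() call that will improve its readability, such as mapping y to after_stat(density), setting axis limits with lims(), and adding labels with labs().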

Exercises

## create the histogram with the specifications given in the instructions
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score",
       title = "Distribution of fourth-grade math scores")
## create the histogram that shows the proportion of each bin
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density*width))) 
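To see why density times width gives the proportion of observations in each bin, here is a minimal sketch that swaps in base R's hist() for geom_histogram, purely so the numbers are easy to inspect. The vector scores is simulated and hypothetical, and the bins hist() picks will generally differ from those ggplot chooses.

```r
# Hypothetical scores, not the STAR data
set.seed(1)
scores <- rnorm(200, mean = 700, sd = 40)

# Compute histogram bins without drawing the plot
h <- hist(scores, plot = FALSE)

# Bin widths from consecutive break points
bin_width <- diff(h$breaks)

# density * width is the proportion of observations in each bin
bin_prop <- h$density * bin_width

# The proportions across all bins add up to 1 (up to rounding error)
sum(bin_prop)
```

The same arithmetic is what after_stat(density*width) performs inside the ggplot call, bin by bin.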

Adding lines and text to a plot

We'll often want to add more information to a plot to make it even more readable. You can do that with layers that add to the current plot, such as geom_vline (or the more general geom_abline) and annotate.

Exercises

## add a vertical line at the mean of the variable
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score",
       title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean( , na.rm = TRUE)) #use the dollar sign 
## add a vertical line at the mean of the variable with geom_vline
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score",
       title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean(star$g4math, na.rm = TRUE))
grade_code()

Adding text to a plot

We'll sometimes want to add text to a plot to make it more informative. annotate(geom = "text", x = 8, y = 9, label = "A") adds the character string A centered at the point (8, 9) on the plot. You can use the axis labels to see where you might want to add these parts of the plot.

Exercise

## add the text "Average Score" at the specified location
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score",
       title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean(star$g4math, na.rm = TRUE)) +
  annotate(geom = "text", x = 750, y = 0.014, label = "Average Score")
grade_code()

Submit

submission_ui
submission_server()


mattblackwell/qsslearnr documentation built on Sept. 17, 2022, 6:25 p.m.