In mattblackwell/qsslearnr: Learnr tutorials based on Quantitative Social Science

library(gradethis)
library(learnr)
library(qsslearnr)
library(tidyverse)
tutorial_options(exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE)
tut_reptitle <- "QSS Tidyverse Tutorial 3: Output Report"
data(STAR, package = "qss")
star <- STAR

classcounts <- star %>%
  group_by(classtype) %>%
  count()

Handling Missing Data in R

Small class size data

In this chapter, you'll analyze data from the STAR project, which is a four-year randomized trial on the effectiveness of small class sizes on education performance. The star data frame as been loaded into your space so that you can play around with it a bit.

Exercises

The tibble package is a core part of tidyverse. It allows you to convert your traditional R data frames into tibbles, which are data frames too. But tibbles make it easier and faster to work with tidyverse. You can coerce and save a current data frame to a tibble with as_tibble.

star <- as_tibble(star)

grade_code()

Use the str function on the star to see what the data looks like. You can always use head function to view the first six rows of the data set.

str(star)

grade_code()

Use the glimpse function on the star to see what the data looks like. glimpse is particularly useful when the data set contains a long list of variables (columns), as it allows you to see every column by transposing the original data set.

glimpse(star)

grade_code()

Use the dim_desc function on the star to see what the dimensions of the data look like.

dim_desc(star)

grade_code()

Use the summary function on the star to get a sense for each variable.

summary(star)

grade_code()

Data wrangling with tidyverse: Handling missing data

You probably noticed that there were some NA values in the data when you used the str and glimpse functions. These are missing values, where the value for that unit on that variable is missing or unknown. These values pose problems when we are trying to calculate quantities of interest like means or medians because R doesn't know how to handle them.

The drop_na function looks at the data frame and removes all rows (observations) with at least one missing value. In other words, it only keeps the complete rows by deleting rows where any column (variable) is filled with a NA value. However, this also means that drop_na will cause listwise deletion, which gets rid of the entire row even when it contains important information in other columns. Thus, it is important to specify which columns to use to drop missing data.

Exercises

Use the drop_na function to drop the rows with missing values in variable g4math.

star %>% drop_na(g4math)

grade_code()

Visualizing Data

Barplots

The barplot is a useful way to visualize a categorical or factor variable. In this exercise, you are going to visualize the classtype variable from the star data frame, which can take on the following values:

1 = small class
2 = regular class
3 = regular class with aid

Exercises

Use the count function to create a data frame of counts classcounts for each category of the classtype in the star data.
Use the geom_bar function to plot a barplot for the classcounts data frame. Remeber to state geom_bar(stat='identity') to make sure ggpplot takes the correct input for x and y axes.
You are encouraged to use the pipe operator.

## creat a data frame called classcounts that has
## the counts of each category of classtype


## make a barplot with ggplot

classcounts <- star %>%
  count(classtype) # of uniques

## make a barplot with ggplot
classcounts %>% ggplot(aes(x = classtype, y = n)) +
  geom_bar(stat='identity')

grade_code("Awesome. The graph is looking a little unhelpful, though. Let's spruce it up.")

Making the barplot readable

The default barplot usually isn't all that readable.

Exercises

Use the function scale_x_discrete(labels = c()) to name the categories of the dependent variable. Use "Small class" for 1, "Regular class" for 2, and "Regular class with aid" for 3. Remember to factorize the variable classtype in aes() with factor(), otherwise the x values will be defined as numeric rather than discrete.
You are encouraged to use the pipe operator.

## Use the function scale_x_discrete(labels = c()) 
## i.e., scale_x_discrete(labels = c(
##     "1" = "Small class", 
##     "2" = "" ,
##     ...))

classcounts %>% ggplot(aes(x=factor(classtype), y = n)) +
  geom_bar(stat='identity') + 
   scale_x_discrete(labels = c(
     "1" = "Small class", 
     "2" = "Regular class", 
     "3" = "Regular class with aid"))

Use the function labs to add the lables and title to the plot. Lable the x axis "Classroom Type", the y axis "Number of students", and the title "The Distribution of Students in Different Class Types".
You are encouraged to use the pipe operator.

## Use the function `labs` and `title`

classcounts %>% ggplot(aes(x=factor(classtype), y = n)) +
  geom_bar(stat='identity') + 
   scale_x_discrete(labels = c(
     "1" = "Small class", 
     "2" = "Regular class", 
     "3" = "Regular class with aid")) +
  labs( x = "Classroom Type", 
        y = "Number of students", 
        title = "The Distribution of Students in Different Class Types")

Histograms

For quantitative (numerical) variables, the barplot won't work because there are too many unique values. In this case, you will often use a histogram to visualize the a numerical variable.

Exercises

With the ggplot() function, use the geom_histogram() geom to create a histogram for the g4math variable in the star data frame.
You are encouraged to use the pipe operator.

## create a histogram of g4math with ggplot

## create a histogram of g4math with ggplot
star %>% ggplot(aes(x = g4math)) +
  geom_histogram()

grade_code("Great job, though the graph is a bit spartan. Let's make it more readable.")

Sprucing up the histogram

As with the barplot, there are several arguments you can pass to the ggplot() function that will improve its readability:

aes(y = after_stat(density)) argument in geom_histogram allows you to make a density plot with ggplot.
labs allows you to add character strings that print a main title for the plot, and set the labels for the x (horizontal) and y (vertical) axes.
lims: to specify the interval for the x and y axes.

Exercises

Create a histogram with ggplot where you (a) include aes(y = after_stat(density)) argument in geom_histogram to make a density plot, (b) set the y-axis to be between 0 and 0.015 using the lims argument, (c) include an informative x-axis label using the labs argument, and (d) include a title for the plot using the labs argument.
Make sure to separate the arguments in function calls with commas.
You are encouraged to use the pipe operator.

## create the histogram with the specifications given in the instructions

## create the histogram with the specifications given in the instructions
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score",
       title = "Distribution of fourth-grade math scores")

Density plots can be less useful when we want to show the proportion of each bin in a histogram, and we use y = stat(density*width) to convert the density back to percentage.

## create the histogram that shows the proportion of each bin

## create the histogram with the specifications given in the instructions
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density*width)))

Adding lines and text to a plot

We'll often want to add more information to a plot to make it even more readable. You can do that with geoms that add to the current plot, such as geom_abline and annotate.

geom_abline: adds a line with specific slope and intercept
geom_vline: adds a vertical line
geom_hline: adds a horizontal line

Exercises

Use the geom_vline function to add a vertical line at the mean of the g4math variable from the star data. By default, missing values are removed by ggplot with a warning and you can use the na.rm = TRUE argument to silently remove them.
You are encouraged to use the pipe operator.

## add a vertical line at the mean of the variable
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score",
       title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean( , na.rm = TRUE)) #use the dollar sign

## add a vertical line at the mean of the variable with geom_vline
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score",
       title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean(star$g4math, na.rm = TRUE))

grade_code()

Adding text to a plot

We'll sometimes want to add text to a plot to make it more informative. annotate(geom = "text", x = 8, y = 9, label = "A") adds a character string A centered at point on the (8, 9) on the plot. You can use the axis labels to see where you might want to add these parts of the plot.

Exercise

Use the annotate function to add the string Average Score to the plot at the point (750, 0.014).
Make sure to separate the arguments in function calls with commas.
You are encouraged to use the pipe operator.

## add the text "Average Score" at the specified location

## add the text "Average Score" at the specified location
star %>% ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score",
       title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean(star$g4math, na.rm = TRUE)) +
  annotate(geom = "text", x = 750, y = 0.014, label = "Average Score")

grade_code()

Submit

submission_ui

submission_server()

mattblackwell/qsslearnr documentation built on Sept. 17, 2022, 6:25 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

mattblackwell/qsslearnr
Learnr tutorials based on Quantitative Social Science

In mattblackwell/qsslearnr: Learnr tutorials based on Quantitative Social Science

Handling Missing Data in R

Small class size data

Exercises

Data wrangling with tidyverse: Handling missing data

Exercises

Visualizing Data

Barplots

Exercises

Making the barplot readable

Exercises

Histograms

Exercises

Sprucing up the histogram

Exercises

Adding lines and text to a plot

Exercises

Adding text to a plot

Exercise

Submit

R Package Documentation

Browse R Packages

We want your feedback!

mattblackwell/qsslearnr Learnr tutorials based on Quantitative Social Science

In mattblackwell/qsslearnr: Learnr tutorials based on Quantitative Social Science

Handling Missing Data in R

Small class size data

Exercises

Data wrangling with tidyverse: Handling missing data

Exercises

Visualizing Data

Barplots

Exercises

Making the barplot readable

Exercises

Histograms

Exercises

Sprucing up the histogram

Exercises

Adding lines and text to a plot

Exercises

Adding text to a plot

Exercise

Submit

R Package Documentation

Browse R Packages

We want your feedback!

mattblackwell/qsslearnr
Learnr tutorials based on Quantitative Social Science