library(gradethis)
library(learnr)
library(qsslearnr)
tutorial_options(exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE)
tut_reptitle <- "QSS Tutorial 3: Output Report"
data(STAR, package = "qss")
star <- STAR

Handling Missing Data in R

Small class size data

In this chapter, you'll analyze data from the STAR project, a four-year randomized trial on the effect of small class sizes on educational performance. The star data frame has been loaded into your workspace so that you can play around with it a bit.

Exercises


head(star)
grade_code()

dim(star)
grade_code()

summary(star)
grade_code()

Handling missing data

You probably noticed that there were some NA values in the data when you used the head() function. These are missing values, where the value for that unit on that variable is missing or unknown. These values pose problems when we try to calculate quantities of interest like means or medians because, by default, R returns NA whenever a missing value is involved in the calculation.

The first tool in your toolkit for missing data is the is.na() function. When you pass a vector x to is.na(x), it will return a logical vector of the same length where each entry is TRUE if the corresponding value of x is NA and FALSE otherwise. Using the logical negation operator !, you can easily get the opposite vector !is.na(x), which is TRUE when x is observed and FALSE when x is missing.
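To see how this works on something small, here is a minimal sketch using a made-up vector (x is a toy example for illustration, not part of the star data). Because R treats TRUE as 1 and FALSE as 0, sum() counts the missing values and mean() gives the proportion missing.

```r
# toy vector with one missing value (hypothetical, for illustration only)
x <- c(NA, 1, 2, 3)

missing <- is.na(x)     # TRUE where x is NA
observed <- !is.na(x)   # TRUE where x is observed

sum(missing)     # number of missing values: 1
mean(missing)    # proportion of missing values: 0.25
x[observed]      # keep only the observed values: 1 2 3
```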

Exercises


head(is.na(star$g4math))
grade_result(
  pass_if(~ identical(.result, head(is.na(star$g4math))))
)

sum(is.na(star$g4math))
grade_code()

mean(is.na(star$g4math))
grade_code()

Calculating means in the face of missing data

Missing values make it difficult to calculate numerical quantities of interest like the mean, median, standard deviation, or variance. Many of these functions will simply return NA if there is a single missing value in the vector. We can instruct many functions to ignore the missing values and do their calculation on just the observed data by using the na.rm = TRUE argument. For instance, suppose we have x <- c(NA, 1, 2, 3). Then mean(x) will return NA, but mean(x, na.rm = TRUE) will return 2.
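The same toy vector can be used to check this behavior for a few other summary functions. This is only a sketch on made-up data; the functions shown (mean, median, sd) all accept na.rm in base R.

```r
# toy vector with one missing value (hypothetical, for illustration only)
x <- c(NA, 1, 2, 3)

mean(x)                   # NA, because one value is missing
mean(x, na.rm = TRUE)     # 2, the mean of 1, 2, and 3
median(x, na.rm = TRUE)   # 2
sd(x, na.rm = TRUE)       # 1
```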

Exercises


mean(star$g4math)
grade_code(
  correct = "This isn't that useful though!"
)

mean(star$g4math, na.rm = TRUE)
grade_code()

Visualizing Data

Barplots

The barplot is a useful way to visualize a categorical or factor variable. In this exercise, you are going to visualize the classtype variable from the star data frame, which can take on the following values:

Exercises

## create a vector called classcounts that has
## the counts of each category of classtype


## pass classcounts to barplot
classcounts <- table(star$classtype)

## pass classcounts to barplot
barplot(classcounts)
grade_code("Awesome. The graph is looking a little unhelpful, though. Let's spruce it up.")

Making the barplot readable

The default barplot usually isn't all that readable.
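Two arguments to barplot() do most of the work here: names.arg sets the label under each bar and ylab sets the y-axis label. The sketch below uses a small made-up variable (group, groupcounts, and groupnames are hypothetical names, not objects from the tutorial) to show the pattern.

```r
# made-up categorical data coded 1, 2, 3 (hypothetical, for illustration only)
group <- c(1, 2, 2, 3, 3, 3)

# table() counts how many times each category appears
groupcounts <- table(group)

# one label per category, in the same order as the table
groupnames <- c("Group A", "Group B", "Group C")

# names.arg labels the bars, ylab labels the y-axis
barplot(groupcounts, names.arg = groupnames, ylab = "Number of observations")
```

The labels are matched to the bars by position, so the vector passed to names.arg needs one entry per category, in the order that table() reports them.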

Exercises

## create a vector giving the counts of each category of classtype
classcounts <- table(star$classtype)

## create a vector of labels called classnames


## pass classcounts to barplot and set the y-axis label
classcounts <- table(star$classtype)

## create a vector of labels
classnames <- c("Small class", "Regular class", "Regular class with aid")

## pass classcounts to barplot and set the y-axis label
barplot(classcounts, names.arg = classnames, ylab = "Number of students")

Histograms

For quantitative (numerical) variables, the barplot won't work because there are too many unique values. In this case, you will often use a histogram to visualize a numerical variable.
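As a rough sketch of what hist() does, the example below uses a handful of made-up scores (the scores vector is hypothetical). Setting freq = FALSE puts density rather than counts on the y-axis, which means the areas of the bars add up to 1.

```r
# made-up scores (hypothetical, for illustration only)
scores <- c(690, 700, 705, 710, 712, 715, 720, 725, 730, 745)

# freq = FALSE plots densities instead of counts;
# hist() also returns the bin information invisibly
h <- hist(scores, freq = FALSE)

# each bar's area is its density times its bin width;
# with density scaling these areas sum to 1
sum(h$density * diff(h$breaks))
```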

Exercises

## create a histogram of g4math
hist(star$g4math, freq = FALSE)
grade_code("Great job, though the graph is a bit spartan. Let's make it more readable.")

Sprucing up the histogram

There are several arguments you can pass to hist that will improve its readability:

Exercises

## create the histogram with the specifications given in the instructions
hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015))

Adding lines and text to a plot

We'll often want to add more information to a plot to make it even more readable. You can do that with commands that add to the current plot, such as abline and text. abline(v = 1) will add a vertical line to the plot at the specified value (1 in this example).
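Here is a minimal sketch of that pattern on made-up data (the scores vector and avg are hypothetical names). Note that abline() draws on whatever plot is currently open, so the histogram has to be created first.

```r
# made-up scores with one missing value (hypothetical, for illustration only)
scores <- c(690, 700, NA, 710, 720)

# draw the histogram first; hist() drops missing values on its own
hist(scores, freq = FALSE)

# mean() needs na.rm = TRUE, otherwise the result would be NA
avg <- mean(scores, na.rm = TRUE)

# add a vertical line at the mean to the existing plot
abline(v = avg)
```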

Exercises

hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015))

## add a vertical line at the mean of the variable
hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015))

## add a vertical line at the mean of the variable
abline(v = mean(star$g4math, na.rm = TRUE))
grade_code()

Adding text to a plot

We'll sometimes want to add text to a plot to make it more informative. text(x, y, z) adds a character string z centered at the point (x, y) on the plot. You can use the axis labels to see where you might want to add these parts of the plot.
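The sketch below shows text() on a toy histogram. The data and the label position are made up for illustration; in practice you would read a sensible (x, y) location off the axes of your own plot.

```r
# made-up scores (hypothetical, for illustration only)
scores <- c(690, 700, 705, 710, 712, 715, 720, 725, 730, 745)

# ylim leaves some empty room at the top of the plot for the label
hist(scores, freq = FALSE, ylim = c(0, 0.05))

# vertical line at the mean
abline(v = mean(scores))

# place the label to the right of the line, near the top of the plot
text(x = 735, y = 0.045, "Average Score")
```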

Exercise

hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015))
abline(v = mean(star$g4math, na.rm = TRUE))

## add the text "Average Score" at the specified location
hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015))

## add a vertical line at the mean of the variable
abline(v = mean(star$g4math, na.rm = TRUE))

## add the text "Average Score" at the specified location
text(x = 750, y = 0.014, "Average Score")
grade_code()

Submit

submission_ui
submission_server()


mattblackwell/qsslearnr documentation built on Sept. 17, 2022, 6:25 p.m.