library(gradethis) library(learnr) library(qsslearnr) tutorial_options(exercise.checker = gradethis::grade_learnr) knitr::opts_chunk$set(echo = FALSE) tut_reptitle <- "QSS Tutorial 3: Output Report" data(STAR, package = "qss") star <- STAR
In this chapter, you'll analyze data from the STAR project, which is a four-year randomized trial on the effectiveness of small class sizes on education performance. The star
data frame as been loaded into your space so that you can play around with it a bit.
head
function on the star
to see what the data looks like.head(star)
grade_code()
dim
function on the star
to see what the dimensions of the data look like.dim(star)
grade_code()
summary
function on the star
to get a sense for each variable.summary(star)
grade_code()
You probably noticed that there were some NA
values in the data when you used the head()
function. These are missing values, where the value for that unit on that variable is missing or unknown. These values pose problems when we are trying to calculate quantities of interest like means or medians because R doesn't know how to handle them.
The first tool in your toolkit for missing data is the is.na()
function. When you pass a vector x
to is.na(x)
, it will return a vector of the same length where each entry is TRUE
if the value of x
is NA
and FALSE
otherwise. Using logicals, you can easily get the opposite vector !is.na(x)
which is TRUE
when x
is observed and FALSE
when x
is missing.
is.na
and head
functions to show whether or not the first 6 values of the g4math
variable from the star
data frame are missing.head(is.na(star$g4math))
grade_result( pass_if(~ identical(.result, head(is.na(star$g4math)))) )
is.na
and sum
functions to show how many values of the g4math
variable are missing.sum(is.na(star$g4math))
grade_code()
is.na
and mean
functions to show what proportion of the g4math
variable is missing.mean(is.na(star$g4math))
grade_code()
Missing values makes it difficult to calculate numerical quantities of interest like the mean, median, standard deviation, or variance. Many of these function will simply return NA
if there is a single missing value in the vector. We can instruct many function to ignore the missing values and do their calculation on just the observed data by using the na.rm = TRUE
argument. For instance, suppose we have x <- c(NA, 1,2,3)
, then mean(x)
will return NA
, but mean(x, na.rm = TRUE)
will return 2
.
mean
of the g4math
variable in the star
data frame without setting na.rm
.mean(star$g4math)
grade_code( correct = "This isn't that useful though!" )
mean
of the g4math
variable when setting na.rm = TRUE
.mean(star$g4math, na.rm = TRUE)
grade_code()
The barplot is a useful way to visualize a categorical or factor variable. In this exercise, you are going to visualize the classtype
variable from the star
data frame, which can take on the following values:
1
= small class2
= regular class3
= regular class with aidtable
function to create a vector of counts classcounts
for each category of the classtype
in the star
data.classcounts
vector to the barplot
function.## creat a vector called classcounts that has ## the counts of each category of classtype ## pass classcounts to barplot
classcounts <- table(star$classtype) ## pass classcounts to barplot barplot(classcounts)
grade_code("Awesome. The graph is looking a little unhelpful, though. Let's spruce it up.")
The default barplot usually isn't all that readable.
classnames
that has the values "Small class"
(for 1), "Regular class"
(for 2), and "Regular class with aid"
(for 3).barplot
with the heights as before, but now passing classnames
to the names.arg
argument and use "Number of students"
as the ylab
argument.## create a vector giving the counts of each category of classtype classcounts <- table(star$classtype) ## create a vector of labels called classnames ## pass classcounts to barplot and set the y-axis label
classcounts <- table(star$classtype) ## create a vector of labels classnames <- c("Small class", "Regular class", "Regular class with aid") ## pass classcounts to barplot and set the y-axis label barplot(classcounts, names.arg = classnames, ylab = "Number of students")
For quantitative (numerical) variables, the barplot won't work because there are too many unique values. In this case, you will often use a histogram to visualize the a numerical variable.
hist
function to create a histogram for the g4math
variable in the star
data frame.freq = FALSE
argument to the hist
function to get the histogram.$
.## create a histogram of g4math
## create a histogram of g4math hist(star$g4math, freq = FALSE)
grade_code("Great job, though the graph is a bit spartan. Let's make it more readable.")
There are several arguments you can pass to hist
that will improve its readability:
main
: a character string that prints a main title for the plot.xlab
, ylab
: character strings that set the labels for the x (horizontal) and y (vertical) axesxlim
, ylim
: numeric vectors of length 2 to specify the interval for the x and y axes.freq = FALSE
) where you (a) set the y-axis to be between 0
and 0.015
using the ylim
argument, (b) include an informative x-axis label using the xlab
argument, and (c) include a title for the plot using the main
argument.## create the histogram with the specifications given in the instructions
## create the histogram with the specifications given in the instructions hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015))
We'll often want to add more information to a plot to make it even more readable. You can do that with commands that add to the current plot, such as abline
and text
.
abline(v = 1)
will add a vertical line to the plot at the specified value (1
in this example).
abline
function to add a vertical line at the mean of the g4math
variable from the star
data. Remember, there are missing values in that variable so be sure to use the na.rm = TRUE
argument to drop them.hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015)) ## add a vertical line at the mean of the variable
hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015)) ## add a vertical line at the mean of the variable abline(v = mean(star$g4math, na.rm = TRUE))
grade_code()
We'll sometimes want to add text to a plot to make it more informative. text(x,y,z)
adds a character string z
centered at point on the (x
, y
) on the plot. You can use the axis labels to see where you might want to add these parts of the plot.
text
function to add the string Average Score
to the plot at the point (750, 0.014).hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015)) abline(v = mean(star$g4math, na.rm = TRUE)) ## add the text "Average Score" at the specified location
hist(star$g4math, freq = FALSE, xlab = "Score", main = "Distribution of fourth-grade math scores", ylim = c(0,0.015)) ## add a vertical line at the mean of the variable abline(v = mean(star$g4math, na.rm = TRUE)) ## add the text "Average Score" at the specified location text(x = 750, y = 0.014, "Average Score")
grade_code()
submission_ui
submission_server()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.