library(gradethis)
library(learnr)
library(qsslearnr)
library(tidyverse)
tutorial_options(exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE)
tut_reptitle <- "QSS Tidyverse Tutorial 3: Output Report"
data(STAR, package = "qss")
star <- STAR
classcounts <- star %>% group_by(classtype) %>% count()
In this chapter, you'll analyze data from the STAR project, a four-year randomized trial on the effect of small class sizes on educational performance. The star data frame has been loaded into your workspace so that you can play around with it a bit.
The tibble package is a core part of the tidyverse. It lets you convert traditional R data frames into tibbles, which are still data frames but make it easier and faster to work with the tidyverse. You can coerce an existing data frame to a tibble and save it with as_tibble.
star <- as_tibble(star)
grade_code()
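As a minimal sketch of the conversion described above, the following uses a small made-up data frame (the name `df` and its values are illustrative, not taken from STAR) to show that a tibble is still a data frame:

```r
library(tibble)

# Hypothetical data frame; values are made up for illustration
df <- data.frame(classtype = c(1, 2, 3), g4math = c(700, NA, 720))

# Coerce the data frame to a tibble
tb <- as_tibble(df)

is.data.frame(tb)  # a tibble is still a data frame
is_tibble(tb)      # and it carries the tibble class
is_tibble(df)      # the original data frame does not
```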
Use the str function on the star data frame to see what the data looks like. You can also use the head function to view the first six rows of the data set.
str(star)
grade_code()
Use the glimpse function on the star data frame to see what the data looks like. glimpse is particularly useful when the data set contains a long list of variables (columns), because it transposes the display so that every column is visible.
glimpse(star)
grade_code()
Use the dim_desc function on the star data frame to see the dimensions of the data.
dim_desc(star)
grade_code()
Use the summary function on the star data frame to get a sense of each variable.
summary(star)
grade_code()
You probably noticed that there were some NA values in the data when you used the str and glimpse functions. These are missing values, where the value for that unit on that variable is missing or unknown. They pose problems when we calculate quantities of interest like means or medians, because functions such as mean() and median() return NA by default whenever any input value is missing.
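A short sketch of this behavior, using a made-up vector (the `scores` values are illustrative and not taken from STAR):

```r
# Hypothetical scores; NA marks a missing value
scores <- c(700, 720, NA, 690)

# By default, a single NA makes the result NA
mean_default <- mean(scores)

# na.rm = TRUE drops missing values before computing
mean_complete <- mean(scores, na.rm = TRUE)
median_complete <- median(scores, na.rm = TRUE)

# Count how many values are missing
n_missing <- sum(is.na(scores))
```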
The drop_na function looks at the data frame and removes all rows (observations) with at least one missing value. In other words, it keeps only the complete rows by deleting every row in which any column (variable) contains an NA value. This is listwise deletion: the entire row is discarded even when it contains useful information in other columns. It is therefore important to specify which columns drop_na should check for missing data.
Use the drop_na function to drop the rows with missing values in the variable g4math.
star %>% drop_na(g4math)
grade_code()
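To make the difference between listwise and targeted deletion concrete, here is a minimal sketch on a made-up tibble. The column names mirror STAR, but the object `toy` and its values are assumptions for illustration only:

```r
library(tibble)
library(tidyr)

# Hypothetical data; column names mirror STAR but values are made up
toy <- tibble(
  classtype = c(1, 2, 3, 1),
  g4math    = c(700, NA, 720, 690),
  g4reading = c(NA, 710, 705, 695)
)

# Listwise deletion: drops every row with an NA in any column
complete_rows <- drop_na(toy)

# Targeted deletion: drops only rows where g4math is NA
math_rows <- drop_na(toy, g4math)

nrow(complete_rows)
nrow(math_rows)
```

In this toy example, listwise deletion keeps 2 of the 4 rows, while restricting the check to g4math keeps 3.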
The barplot is a useful way to visualize a categorical or factor variable. In this exercise, you are going to visualize the classtype variable from the star data frame, which can take on the following values:
- 1 = small class
- 2 = regular class
- 3 = regular class with aid
Use the count function to create a data frame of counts, classcounts, for each category of classtype in the star data.
Use the geom_bar function to plot a barplot for the classcounts data frame. Remember to state geom_bar(stat='identity') to make sure ggplot takes the correct input for the x and y axes.
## create a data frame called classcounts that has
## the counts of each category of classtype
## make a barplot with ggplot
classcounts <- star %>% count(classtype) # one row per class type, with its count in n
## make a barplot with ggplot
classcounts %>%
  ggplot(aes(x = classtype, y = n)) +
  geom_bar(stat = 'identity')
grade_code("Awesome. The graph is looking a little unhelpful, though. Let's spruce it up.")
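As a side note, ggplot2 also provides geom_col(), which draws bars from precomputed heights and behaves like geom_bar(stat = 'identity'). The sketch below compares the two on a made-up tibble (`toy` and its values are assumptions, not STAR data); the exercise itself still expects geom_bar(stat = 'identity').

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Hypothetical class assignments; values are made up
toy <- tibble(classtype = c(1, 1, 2, 3, 3, 3))

# One row per class type, with the count stored in n
toy_counts <- toy %>% count(classtype)

# Bars whose heights come directly from the n column
p_identity <- ggplot(toy_counts, aes(x = classtype, y = n)) +
  geom_bar(stat = "identity")

# geom_col() uses the y values as bar heights without extra arguments
p_col <- ggplot(toy_counts, aes(x = classtype, y = n)) +
  geom_col()
```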
The default barplot usually isn't all that readable.
Use scale_x_discrete(labels = c()) to name the categories of the classtype variable. Use "Small class" for 1, "Regular class" for 2, and "Regular class with aid" for 3. Remember to convert classtype to a factor inside aes() with factor(); otherwise the x values will be treated as numeric rather than discrete.
## Use the function scale_x_discrete(labels = c())
## i.e., scale_x_discrete(labels = c(
##   "1" = "Small class",
##   "2" = "" ,
##   ...))
classcounts %>%
  ggplot(aes(x = factor(classtype), y = n)) +
  geom_bar(stat = 'identity') +
  scale_x_discrete(labels = c(
    "1" = "Small class",
    "2" = "Regular class",
    "3" = "Regular class with aid"))
Use labs to add the labels and title to the plot. Label the x axis "Classroom Type", the y axis "Number of students", and the title "The Distribution of Students in Different Class Types".
## Use the function `labs` and its `title` argument
classcounts %>%
  ggplot(aes(x = factor(classtype), y = n)) +
  geom_bar(stat = 'identity') +
  scale_x_discrete(labels = c(
    "1" = "Small class",
    "2" = "Regular class",
    "3" = "Regular class with aid")) +
  labs(
    x = "Classroom Type",
    y = "Number of students",
    title = "The Distribution of Students in Different Class Types")
For quantitative (numerical) variables, the barplot won't work because there are too many unique values. In this case, you will often use a histogram to visualize a numerical variable.
With the ggplot() function, use the geom_histogram() geom to create a histogram for the g4math variable in the star data frame.
## create a histogram of g4math with ggplot
## create a histogram of g4math with ggplot
star %>%
  ggplot(aes(x = g4math)) +
  geom_histogram()
grade_code("Great job, though the graph is a bit spartan. Let's make it more readable.")
As with the barplot, there are several things you can add to a ggplot() call to improve readability:
- The aes(y = after_stat(density)) argument in geom_histogram rescales the histogram so that the y-axis shows density rather than counts.
- labs lets you add character strings for the main title of the plot and for the labels of the x (horizontal) and y (vertical) axes.
- lims lets you specify the interval for the x and y axes.
Create a histogram with ggplot where you (a) include the aes(y = after_stat(density)) argument in geom_histogram to plot densities, (b) set the y-axis to be between 0 and 0.015 using the lims function, (c) include an informative x-axis label using the labs function, and (d) include a title for the plot using the labs function.
## create the histogram with the specifications given in the instructions
## create the histogram with the specifications given in the instructions
star %>%
  ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score", title = "Distribution of fourth-grade math scores")
Use y = after_stat(density * width) to convert the density into the proportion of observations in each bin.
## create the histogram that shows the proportion of each bin
## create the histogram that shows the proportion of each bin
star %>%
  ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density * width)))
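The reason density times width gives a proportion is that each bin's density equals its count divided by (number of observations × bin width), so multiplying by the width leaves count / n. The sketch below checks this on made-up scores by summing the bar heights; `toy`, the `binwidth` of 20, and the use of layer_data() to inspect the computed bins are choices made for this illustration, not part of the exercise.

```r
library(ggplot2)
library(tibble)

# Hypothetical scores; values are made up and contain no NAs
toy <- tibble(g4math = c(650, 660, 675, 690, 700, 705, 710, 730, 745, 760))

# Histogram whose bar heights are the share of observations in each bin
p <- ggplot(toy, aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density * width)), binwidth = 20)

# layer_data() returns the values ggplot computed for the layer
bins <- layer_data(p)

# The bar heights are proportions, so they should add up to 1
total_share <- sum(bins$y)
```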
We'll often want to add more information to a plot to make it even more readable. You can do that with functions that add to the current plot, such as geom_abline and annotate.
- geom_abline: adds a line with a specific slope and intercept
- geom_vline: adds a vertical line
- geom_hline: adds a horizontal line
Use the geom_vline function to add a vertical line at the mean of the g4math variable from the star data. By default, ggplot removes missing values with a warning, and you can pass na.rm = TRUE to a geom to remove them silently. mean(), in contrast, returns NA when the variable has missing values unless you set na.rm = TRUE.
## add a vertical line at the mean of the variable
star %>%
  ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score", title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean( , na.rm = TRUE)) # use the dollar sign
## add a vertical line at the mean of the variable with geom_vline
star %>%
  ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score", title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean(star$g4math, na.rm = TRUE))
grade_code()
We'll sometimes want to add text to a plot to make it more informative. annotate(geom = "text", x = 8, y = 9, label = "A") adds the character string A centered at the point (8, 9) on the plot. You can use the axis labels to see where you might want to place the text.
Use the annotate function to add the string Average Score to the plot at the point (750, 0.014).
## add the text "Average Score" at the specified location
## add the text "Average Score" at the specified location
star %>%
  ggplot(aes(x = g4math)) +
  geom_histogram(aes(y = after_stat(density))) +
  lims(y = c(0, 0.015)) +
  labs(x = "Score", title = "Distribution of fourth-grade math scores") +
  geom_vline(xintercept = mean(star$g4math, na.rm = TRUE)) +
  annotate(geom = "text", x = 750, y = 0.014, label = "Average Score")
grade_code()
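One possible variation, sketched here on made-up data, is to compute the mean once and reuse it for both the vertical line and the position of the label, so the two stay aligned. The names `toy` and `avg_score`, the `binwidth`, the label height of 0.03, and the `hjust` nudge are all assumptions chosen for this illustration rather than part of the exercise.

```r
library(ggplot2)
library(tibble)

# Hypothetical scores with one missing value; values are made up
toy <- tibble(g4math = c(650, 660, NA, 690, 700, 705, 710, 730, 745, 760))

# Compute the mean once, ignoring the missing value
avg_score <- mean(toy$g4math, na.rm = TRUE)

p <- ggplot(toy, aes(x = g4math)) +
  # na.rm = TRUE here silences ggplot's warning about the removed NA row
  geom_histogram(aes(y = after_stat(density)), binwidth = 20, na.rm = TRUE) +
  # Vertical line at the mean
  geom_vline(xintercept = avg_score) +
  # Label anchored at the same x value; negative hjust shifts it to the right
  annotate(geom = "text", x = avg_score, y = 0.03,
           label = "Average Score", hjust = -0.1)
```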
submission_ui
submission_server()