knitr::opts_chunk$set( collapse = TRUE, comment = "#>", echo = TRUE )
library(Stat231) library(tidyverse)
We will be using the diamonds data set from the ggplot2
package. To view the data either run the code chunk below or type the command into your Console.
View(diamonds)
As you can see, this is a large data set with information on 53,940 individual diamonds! First lets calculate some summary statistics.
summary(diamonds$price)
The summary()
function outputs the five number summary, plus the mean, of a numeric variable.
Let's see what happens when we use the summary()
function on a categorical variable.
summary(diamonds$color)
With a categorical variable, summary()
outputs a frequency distribution!
With such a large data set, it might be useful to have some summary statistics for specific subsets. For example, we might expect diamonds with different clarity to have different prices. The dplyr
package from the Tidyverse allows us to group our data set by a specific variable before calculating summary values. For this, we'll use the fivenum()
function.
First, lets look at how to pull out one value from the results of fivenum()
.
fivenum(diamonds$price)[1]
Will give the minimum value (1st number of the summary) of the variable. Now lets calculate the whole five number summary for the individual clarity values and print the results in a table.
diamonds_summary <- diamonds %>% #This will create a new data set called diamonds_summary by taking the diamonds dataset "and then"... group_by(clarity) %>% summarize(min = fivenum(price)[1], q1 = fivenum(price)[2], med = fivenum(price)[3], q3 = fivenum(price)[4], max = fivenum(price)[5]) knitr::kable(diamonds_summary)
Let's create a few boxplots.
ggplot(data = diamonds, aes(y = price))+ geom_boxplot()
What if we wanted the boxplot for the grouped data?
ggplot(data = diamonds, aes(x = clarity, y = price))+ geom_boxplot()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.