knitr::opts_chunk$set(comment = "", prompt = TRUE, collapse = TRUE) #devtools::load_all()
The main purpose of this vignette is to provide R code to calculate the summary statistics that feature in Section 2.3 of the STAT0002 notes (apart from correlation, which we defer until Chapter 9). An important point to appreciate is that usually there is more than one way to estimate from data a particular theoretical property of the distribution from which the data came. For example, we will see that there are many different rules (estimators) that can be used to estimate a quantile of a distribution.
The R code used in this vignette are available: descriptive-statistics-vignette.R.
The functions five_number, skew and q_skew can be viewed either by typing the name of the function at R command prompt > or at GitHub
These data are available in the data frame ox_births. Use ?ox_births to find out about these data.
library(stat1004)
We manipulate the data into a matrix that is of the same format as Table 2.1 in the notes. The number of birth times varies between days so we pad the matrix with R's missing values code NA in order that each column of the matrix has the same number of rows.
ox_mat <- matrix(NA, ncol = 7, nrow = 16) for (i in 1:7) { day_i_times <- ox_births$time[which(ox_births$day == i)] ox_mat[1:length(day_i_times), i] <- sort(day_i_times) colnames(ox_mat) <- paste("day", 1:7, sep = "") } ox_mat
i <- 4 ox_births$day == i which(ox_births$day == i) ox_births$time[which(ox_births$day == i)] paste("day", 1:7, sep = "") paste("day", 1:7, sep = " ")
We return to this matrix later. Until then we calculate summary statistics of the dataset containing the birth times from all days of the week.
birth_times <- ox_births[, "time"] sort(birth_times)
The function five_number calculates the five number summary of data, using the particular method for estimating the lower quartile, median and upper quartile described in the STAT0002 notes.
five_number(birth_times)
The summary function can also be used to calculate a five number summary.
summary(birth_times)
summary also calculates the sample mean) does summary produce the same values as five_number?No, the estimates of the lower quartile differ. This is because the functions summary and five_number use different rules to estimate quantiles: summary calls quantile using type = 7 whereas five_number uses type = 6. If we call five_number with type = 7 we get the same numbers as summary.
five_number(birth_times, type = 7)
In fact the function quantile has 9 different options for type. Use ?quantile for more information.
mean(birth_times)
sd(birth_times) var(birth_times) sd(birth_times) ^ 2
# Standardized sample skewness skew(birth_times)
# Sample quartile skewness q_skew(birth_times)
Until 2017/18 the STAT0002 notes gave -0.063 as the sample quartile skewness. This was because I used the default setting, type = 7, in the quantile function when calculating it ...
q_skew(birth_times, type = 7)
We can also calculate summary statistics for each of the seven days of the week, i.e. for each of the columns of ox_mat. In the following the effect of the colMeans function is fairly obvious. apply is a useful function. Use ?apply to see what it does.
five_number(ox_mat, na.rm = TRUE)
summary(ox_mat)
colMeans(ox_mat, na.rm = TRUE)
apply(ox_mat, 2, sd, na.rm = TRUE)
skew(ox_mat, na.rm = TRUE)
q_skew(ox_mat, na.rm = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.