knitr::opts_chunk$set(comment = "", prompt = TRUE, collapse = TRUE) #devtools::load_all()
The main purpose of this vignette is to provide R code to calculate the summary statistics that feature in Section 2.3 of the STAT0002 notes (apart from correlation, which we defer until Chapter 9). An important point to appreciate is that usually there is more than one way to estimate from data a particular theoretical property of the distribution from which the data came. For example, we will see that there are many different rules (estimators) that can be used to estimate a quantile of a distribution.
The R code used in this vignette is available in descriptive-statistics-vignette.R.
The functions five_number, skew and q_skew can be viewed either by typing the name of the function at the R command prompt > or at GitHub.
These data are available in the data frame ox_births. Use ?ox_births to find out about these data.
library(stat1004)
We manipulate the data into a matrix that is of the same format as Table 2.1 in the notes. The number of birth times varies between days so we pad the matrix with R's missing values code NA in order that each column of the matrix has the same number of rows.
# A 16 by 7 matrix of NAs: one column per day
ox_mat <- matrix(NA, ncol = 7, nrow = 16)
for (i in 1:7) {
  # Sorted birth times for day i, placed at the top of column i
  day_i_times <- ox_births$time[which(ox_births$day == i)]
  ox_mat[1:length(day_i_times), i] <- sort(day_i_times)
}
colnames(ox_mat) <- paste("day", 1:7, sep = "")
ox_mat
i <- 4
ox_births$day == i
which(ox_births$day == i)
ox_births$time[which(ox_births$day == i)]
paste("day", 1:7, sep = "")
paste("day", 1:7, sep = " ")
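The padding step above can also be sketched without an explicit loop. The following is a minimal illustration on a small made-up data frame (toy, with hypothetical values, not the ox_births data): split() groups the times by day and assigning to length() pads each sorted vector with NA.

```r
# Hypothetical toy data: 3 days with 3, 2 and 1 observations
toy <- data.frame(day = c(1, 1, 1, 2, 2, 3),
                  time = c(5.2, 3.1, 7.4, 2.0, 6.5, 4.4))

# One vector of times per day
by_day <- split(toy$time, toy$day)

# Number of rows needed: the largest number of observations on any day
max_n <- max(lengths(by_day))

# Sort each day's times and pad with NA up to max_n values
toy_mat <- sapply(by_day, function(v) {
  v <- sort(v)
  length(v) <- max_n  # extending a vector fills the new entries with NA
  v
})
colnames(toy_mat) <- paste0("day", names(by_day))
toy_mat
```

Each column of toy_mat holds the sorted times for one day, with NA in any unused rows.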
We return to this matrix later. Until then we calculate summary statistics of the dataset containing the birth times from all days of the week.
birth_times <- ox_births[, "time"]
sort(birth_times)
The function five_number calculates the five number summary of data, using the particular method for estimating the lower quartile, median and upper quartile described in the STAT0002 notes.
five_number(birth_times)
The summary function can also be used to calculate a five number summary.
summary(birth_times)
Apart from the fact that summary also calculates the sample mean, does summary produce the same values as five_number? No, the estimates of the lower quartile differ. This is because the functions summary and five_number use different rules to estimate quantiles: summary calls quantile using type = 7 whereas five_number uses type = 6. If we call five_number with type = 7 we get the same numbers as summary.
five_number(birth_times, type = 7)
In fact the function quantile has 9 different options for type. Use ?quantile for more information.
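To see how much the choice of type can matter, here is a small sketch that estimates the lower quartile of a made-up vector (x, hypothetical values chosen only for illustration) under each of the nine rules.

```r
# Hypothetical toy data (not the birth times)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Estimate the lower quartile using each of the 9 rules in quantile()
lower_q <- sapply(1:9, function(k) {
  quantile(x, probs = 0.25, type = k, names = FALSE)
})
names(lower_q) <- paste0("type", 1:9)
lower_q

# The two rules discussed above give different answers for these data
lower_q[c("type6", "type7")]
```

For these ten values type = 6 gives 2.75 and type = 7 gives 3.25, so the two rules can disagree even for very simple data.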
mean(birth_times)
sd(birth_times)
var(birth_times)
sd(birth_times) ^ 2
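The last two lines above return the same value because the sample variance is the square of the sample standard deviation. As a quick check of what var computes, the sketch below calculates the sample variance by hand, with divisor n - 1, for a made-up vector x (hypothetical values).

```r
# Hypothetical toy data
x <- c(2, 4, 4, 5, 7, 9)
n <- length(x)

# Sample variance by hand: sum of squared deviations divided by n - 1
manual_var <- sum((x - mean(x)) ^ 2) / (n - 1)

# Compare with the built-in functions
all.equal(manual_var, var(x))
all.equal(sqrt(manual_var), sd(x))
```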
# Standardized sample skewness
skew(birth_times)
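To give an idea of what a standardized skewness measures, here is a rough sketch using one common convention: the average cubed deviation from the sample mean, divided by the cube of the sample standard deviation. The function toy_skew is hypothetical, and it is an assumption that this convention matches the one used by skew, which may use different divisors.

```r
# Hypothetical helper: one common convention for standardized skewness.
# This is NOT claimed to be the formula used by skew().
toy_skew <- function(x) {
  x <- x[!is.na(x)]                  # drop any missing values
  mean((x - mean(x)) ^ 3) / sd(x) ^ 3
}

# Symmetric data: skewness is zero
toy_skew(c(1, 2, 3, 4, 5))

# One large value in the upper tail: skewness is positive
toy_skew(c(1, 2, 3, 4, 10))
```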
# Sample quartile skewness
q_skew(birth_times)
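A quartile-based skewness compares the distance from the median to the upper quartile with the distance from the lower quartile to the median. The sketch below uses the usual form of this measure, with a hypothetical function toy_q_skew and made-up data; it is an assumption that q_skew is defined in exactly this way.

```r
# Hypothetical helper: the usual quartile-based skewness measure.
# The default type = 6 mirrors the rule described for five_number().
toy_q_skew <- function(x, type = 6) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = type,
                names = FALSE, na.rm = TRUE)
  # (upper quartile - median) minus (median - lower quartile),
  # scaled by the inter-quartile range
  ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
}

# Hypothetical toy data
x <- c(1, 2, 3, 4, 10)

toy_q_skew(x)            # type = 6
toy_q_skew(x, type = 7)  # type = 7
```

For these five values the two rules give quite different answers (positive under type = 6, zero under type = 7), which shows how sensitive this measure can be to the quantile rule in small samples.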
Until 2017/18 the STAT0002 notes gave -0.063 as the sample quartile skewness. This was because I used the default setting, type = 7, in the quantile function when calculating it ...
q_skew(birth_times, type = 7)
We can also calculate summary statistics for each of the seven days of the week, i.e. for each of the columns of ox_mat. In the following the effect of the colMeans function is fairly obvious. apply is a useful function. Use ?apply to see what it does.
five_number(ox_mat, na.rm = TRUE)
summary(ox_mat)
colMeans(ox_mat, na.rm = TRUE)
apply(ox_mat, 2, sd, na.rm = TRUE)
skew(ox_mat, na.rm = TRUE)
q_skew(ox_mat, na.rm = TRUE)
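To make the column-wise calculations above more concrete, here is a small sketch on a made-up matrix (toy_mat, hypothetical values) that contains an NA. Any extra arguments given to apply, such as na.rm = TRUE, are passed on to the function being applied to each column.

```r
# Hypothetical toy matrix: the second column is padded with an NA
toy_mat <- cbind(day1 = c(1, 2, 3), day2 = c(4, 6, NA))

# Column means, ignoring missing values
col_means <- colMeans(toy_mat, na.rm = TRUE)
col_means

# Apply sd to each column (MARGIN = 2), passing na.rm = TRUE to sd
col_sds <- apply(toy_mat, 2, sd, na.rm = TRUE)
col_sds

# Without na.rm = TRUE any column containing an NA gives NA
apply(toy_mat, 2, sd)
```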