In paulnorthrop/stat1004: stat1004: Introduction to Probability and Statistics at UCL

knitr::opts_chunk$set(comment = "", prompt = TRUE, collapse = TRUE)
#devtools::load_all()

The main purpose of this vignette is to provide R code to calculate the summary statistics that feature in Section 2.3 of the STAT0002 notes (apart from correlation, which we defer until Chapter 9). An important point to appreciate is that there is usually more than one way to use data to estimate a particular theoretical property of the distribution from which the data came. For example, we will see that there are many different rules (estimators) that can be used to estimate a quantile of a distribution.

The R code used in this vignette is available in descriptive-statistics-vignette.R. The functions five_number, skew and q_skew can be viewed either by typing the name of the function at the R command prompt (>) or on GitHub.

The Oxford Birth Times data

These data are available in the data frame ox_births. Use ?ox_births to find out about these data.

library(stat1004)

We manipulate the data into a matrix that is of the same format as Table 2.1 in the notes. The number of birth times varies between days so we pad the matrix with R's missing values code NA in order that each column of the matrix has the same number of rows.

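# Start with a matrix of NAs: one column per day, 16 rows
ox_mat <- matrix(NA, ncol = 7, nrow = 16)
for (i in 1:7) {
  # Sorted birth times for day i, placed at the top of column i
  day_i_times <- ox_births$time[which(ox_births$day == i)]
  ox_mat[seq_along(day_i_times), i] <- sort(day_i_times)
}
# Name the columns once, after the loop
colnames(ox_mat) <- paste("day", 1:7, sep = "")
ox_mat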
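
The same kind of NA-padded matrix can also be built without an explicit loop. The following is a minimal sketch, not the approach used in the notes. It uses a small made-up data frame, toy_births, which is hypothetical and stands in for ox_births so that the example runs without the package. It relies on the fact that extending the length of a vector pads it with NA.

```r
# Hypothetical toy data standing in for ox_births: 3 "days" with unequal counts
toy_births <- data.frame(
  day  = c(1, 1, 1, 2, 3, 3),
  time = c(7.5, 2.1, 4.0, 9.3, 6.2, 3.3)
)

# Split the times by day and sort each group
times_by_day <- lapply(split(toy_births$time, toy_births$day), sort)

# Pad each group with NA up to the length of the longest group
n_max <- max(lengths(times_by_day))
toy_mat <- sapply(times_by_day, function(x) {
  length(x) <- n_max   # extending a vector's length fills with NA
  x
})
colnames(toy_mat) <- paste0("day", names(times_by_day))
toy_mat
```

A benefit of this version is that the number of rows is calculated from the data rather than typed in by hand.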

Can you see what the following parts of the code do?

i <- 4
ox_births$day == i
which(ox_births$day == i)
ox_births$time[which(ox_births$day == i)]
paste("day", 1:7, sep = "")
paste("day", 1:7, sep = " ")
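
To check your answers, it may help to run the same steps on data small enough to inspect by eye. The sketch below uses a made-up data frame, toy_births (hypothetical, standing in for ox_births), and stores each intermediate result under a descriptive name.

```r
# Hypothetical toy data standing in for ox_births
toy_births <- data.frame(
  day  = c(1, 4, 2, 4, 3),
  time = c(7.5, 2.1, 4.0, 9.3, 6.2)
)
i <- 4

is_day_i <- toy_births$day == i    # logical vector: TRUE where day equals i
rows_day_i <- which(is_day_i)      # integer positions of the TRUE values

# Indexing by positions or directly by the logical vector selects the same times
times_by_position <- toy_births$time[rows_day_i]
times_by_logical  <- toy_births$time[is_day_i]

# sep controls what paste puts between the pieces it joins
labels_no_space <- paste("day", 1:3, sep = "")    # "day1" "day2" "day3"
labels_space    <- paste("day", 1:3, sep = " ")   # "day 1" "day 2" "day 3"
```

The two ways of indexing agree here because day contains no missing values. If it did, which would drop the NA entries whereas logical indexing would return NA for them.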

We return to this matrix later. Until then we calculate summary statistics of the dataset containing the birth times from all days of the week.

birth_times <- ox_births[, "time"]
sort(birth_times)

Five number summary

The function five_number calculates the five number summary of data, using the particular method for estimating the lower quartile, median and upper quartile described in the STAT0002 notes.

five_number(birth_times)

The summary function can also be used to calculate a five number summary.

summary(birth_times)

If we ignore the fact that summary also calculates the sample mean, does summary produce the same values as five_number?

No, the estimates of the lower quartile differ. This is because the functions summary and five_number use different rules to estimate quantiles: summary calls quantile using type = 7 whereas five_number uses type = 6. If we call five_number with type = 7 we get the same numbers as summary.

five_number(birth_times, type = 7)
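
To see how the choice of rule feeds into a five number summary, here is a minimal sketch built directly on quantile. The helper five_number_sketch is hypothetical and is not the package's five_number; it only assumes, as stated above, that the quartiles are estimated using type = 6 by default. The data are made up.

```r
# Hypothetical helper, not the package's five_number: min, quartiles, max
five_number_sketch <- function(x, type = 6) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = type, names = FALSE)
  c(min = min(x), lower_quartile = q[1], median = q[2],
    upper_quartile = q[3], max = max(x))
}

x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.8, 2.7, 6.1, 4.9)  # made-up data
fn6 <- five_number_sketch(x)            # rule 6
fn7 <- five_number_sketch(x, type = 7)  # rule 7, the default in quantile
rbind(type6 = fn6, type7 = fn7)
```

For these nine values the minimum, median and maximum agree under both rules, but the lower and upper quartiles do not.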

In fact the function quantile has 9 different options for type. Use ?quantile for more information.
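
As a quick illustration, the following sketch estimates the lower quartile of a small made-up dataset using each of the nine rules in turn. Only quantile from base R is used; the data and variable names are chosen purely for illustration.

```r
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.8, 2.7, 6.1, 4.9)  # made-up data

# Estimate the lower quartile with each of the nine rules
lower_quartile_by_type <- sapply(1:9, function(type) {
  quantile(x, probs = 0.25, type = type, names = FALSE)
})
names(lower_quartile_by_type) <- paste0("type", 1:9)
lower_quartile_by_type
```

With so few observations the rules can give noticeably different estimates. The differences tend to shrink as the sample size grows.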

Sample mean

mean(birth_times)

Sample standard deviation and variance

sd(birth_times)
var(birth_times)
sd(birth_times) ^ 2
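
As a check on what mean, var and sd calculate, the sketch below reproduces them by hand on a small made-up dataset. Note that var and sd use the divisor n - 1, not n.

```r
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.8, 2.7, 6.1, 4.9)  # made-up data
n <- length(x)

x_bar <- sum(x) / n                          # sample mean
var_by_hand <- sum((x - x_bar)^2) / (n - 1)  # sample variance, divisor n - 1
sd_by_hand <- sqrt(var_by_hand)              # sample standard deviation

all.equal(x_bar, mean(x))
all.equal(var_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```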

Measures of skewness

# Standardized sample skewness
skew(birth_times)

# Sample quartile skewness
q_skew(birth_times)
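
To give a feel for what a standardized skewness measures, here is a minimal sketch of one common definition: the average cubed deviation from the sample mean, divided by the cube of the sample standard deviation. The helper skew_sketch is hypothetical and is not the package's skew, whose exact divisors may differ; type skew at the R prompt to see the version used in this vignette. The data are made up.

```r
# Hypothetical helper, not the package's skew: third central moment over s^3
skew_sketch <- function(x) {
  n <- length(x)
  m3 <- sum((x - mean(x))^3) / n  # average cubed deviation from the mean
  m3 / sd(x)^3                    # sd uses the divisor n - 1
}

symmetric <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 2, 3, 4, 20)
skew_sketch(symmetric)     # zero for perfectly symmetric data
skew_sketch(right_skewed)  # positive: long tail to the right
```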

Until 2017/18 the STAT0002 notes gave -0.063 as the sample quartile skewness. This was because I used the default setting, type = 7, in the quantile function when calculating it ...

q_skew(birth_times, type = 7)
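
The dependence on type can be seen in a minimal sketch that assumes the usual definition of quartile skewness: the difference between the upper and lower halves of the interquartile range, divided by the interquartile range. The helper q_skew_sketch is hypothetical and is not the package's q_skew, and the data are made up.

```r
# Hypothetical helper, not the package's q_skew
q_skew_sketch <- function(x, type = 6) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = type, names = FALSE)
  # (upper quartile - median) minus (median - lower quartile), over the IQR
  ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
}

x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.8, 2.7, 6.1, 4.9)  # made-up data
q_skew_type6 <- q_skew_sketch(x, type = 6)
q_skew_type7 <- q_skew_sketch(x, type = 7)
c(type6 = q_skew_type6, type7 = q_skew_type7)
```

For these data the value is negative under rule 6 but essentially zero under rule 7, so the rule used to estimate the quartiles should be reported alongside the statistic.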

Summary statistics for each day

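We can also calculate summary statistics for each of the seven days of the week, i.e. for each of the columns of ox_mat. In the following, the effect of the colMeans function is fairly obvious: it calculates the mean of each column. The function apply is more general: it applies a function of our choosing to each row or each column of a matrix. Use ?apply to see what it does.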
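
As a minimal sketch of how these two functions behave on an NA-padded matrix, consider the small made-up matrix toy_mat below (hypothetical, standing in for ox_mat). The argument na.rm = TRUE is needed because, by default, mean and colMeans return NA for any column that contains a missing value.

```r
# Hypothetical NA-padded matrix standing in for ox_mat
toy_mat <- cbind(day1 = c(2.1, 4.0, 7.5),
                 day2 = c(9.3, NA, NA),
                 day3 = c(3.3, 6.2, NA))

col_means <- colMeans(toy_mat, na.rm = TRUE)              # mean of each column
col_means_apply <- apply(toy_mat, 2, mean, na.rm = TRUE)  # same, via apply
col_medians <- apply(toy_mat, 2, median, na.rm = TRUE)    # any function works
rbind(col_means, col_means_apply, col_medians)
```

The second argument of apply selects the margin: 2 means "for each column", 1 means "for each row". Extra arguments such as na.rm = TRUE are passed on to the function being applied.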