knitr::opts_chunk$set(comment = "", prompt = TRUE, collapse = TRUE, fig.width = 7, fig.height = 5, fig.align = 'center')
This vignette provides R code related to some of the content of Chapter 7 of the STAT0002 notes: Statistical Inference.
This section relates to Section 7.5.1 of the notes.
We use Kerrich's coin data to calculate an estimate $\hat{p}$ of the probability that the coin lands with its heads side up when he tosses it.
library(stat0002)
# The final row of the kerrich data contains the totals after all of the throws
phat <- kerrich[nrow(kerrich), "heads"] / kerrich[nrow(kerrich), "throws"]
phat
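In other words, writing $h$ for the total number of heads and $n$ for the total number of throws, the code above calculates

$$\hat{p} = \frac{h}{n} = \frac{5067}{10000} = 0.5067,$$

where the values $h = 5067$ and $n = 10{,}000$ correspond to the estimate of 0.5067 discussed below.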
We produce a plot like Figure 7.2. The idea is to repeat Kerrich's 10,000 coin-toss experiment a large number (100,000) times. These experiments are not real: they are computer simulations. The simulations are based on the assumption that the value of $p$ is $1/2$. If this is the case then the number of heads obtained in one experiment (of 10,000 simulated tosses of the coin) has a binomial($10,000, 1/2$) distribution.
# Set the same random number seed that was used to create Figure 7.2
# If we change the value of the seed then we will obtain a slightly different plot
set.seed(37)
# Number of simulated experiments
nsimulations <- 100000
# Number of coin tosses in each experiment
ntosses <- 10000
# Simulate nsimulations (= 100,000) times from a binomial(10,000, 1/2) distribution
# This produces a vector nheads of length nsimulations (= 100,000)
# Divide each value in nheads by ntosses (= 10,000) to produce estimates of p
nheads <- rbinom(nsimulations, size = ntosses, prob = 1 / 2) / ntosses
# Plot a histogram
hist(nheads, prob = TRUE, col = 8, xlab = "proportion of heads", ylab = "density",
     main = "")
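As a quick check, which is an addition to the code used to produce Figure 7.2, we can compare the mean and standard deviation of the simulated proportions with their theoretical values under the assumption that $p = 1/2$: a mean of $1/2$ and a standard deviation of $\sqrt{p(1-p)/n} = \sqrt{0.25/10000} = 0.005$.

# Simulated vs theoretical moments of the proportion of heads
# (uses nheads and ntosses from the chunk above)
mean(nheads)
sd(nheads)
sqrt(0.5 * 0.5 / ntosses)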
The histogram above describes the sampling distribution of the estimator $\hat{p}$ of $p$ for Kerrich's experiment under the assumption that $p = 1/2$. We could use the simulated values of $\hat{p}$ to help us judge whether we are surprised that Kerrich's data produced an estimate of 0.5067. For example, we might want to calculate the proportion of simulated values of $\hat{p}$ that are greater than $0.5067$, or the proportion that are either greater than $0.5067$ or less than $1 - 0.5067 = 0.4933$. The following code does this. Can you see how it works?
mean(nheads > 0.5067)
mean(nheads > 0.5067 | nheads < 1 - 0.5067)
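As a cross-check, and not part of the notes, we can calculate the corresponding exact binomial probabilities: under $p = 1/2$ a proportion greater than 0.5067 means more than 5067 heads in 10,000 tosses, and a proportion less than 0.4933 means fewer than 4933 heads.

# Exact P(more than 5067 heads) under p = 1/2
pbinom(5067, size = 10000, prob = 1 / 2, lower.tail = FALSE)
# Exact P(more than 5067 heads or fewer than 4933 heads) under p = 1/2
pbinom(5067, size = 10000, prob = 1 / 2, lower.tail = FALSE) +
  pbinom(4932, size = 10000, prob = 1 / 2)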
This section relates to Section 7.5.2 of the notes. We provide code that produces plots like those in Figure 7.3. For details please see the notes. You can change the value of n
to investigate the effect of changing the sample size $n$.
set.seed(37)
# Set values of mean and standard deviation
mu <- 7.22
sigma <- sqrt(1.36)
# Set the sample size n
n <- 3
# The number of simulated datasets of size n to produce
nsim <- 100000
# Simulate, producing a matrix with nsim columns and n rows
# Each column contains a simulated sample of size n
x <- matrix(rnorm(nsim * n, mean = mu, sd = sigma), ncol = nsim, nrow = n)
# Calculate the sample means and sample variances of each of the nsim samples
xbars <- colMeans(x)
xvars <- apply(x, 2, var)
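Before producing the plots we can check, as an aside that is not part of the code in the notes, that the simulated values behave as theory predicts: the sample mean $\bar{X}$ has mean $\mu$ and variance $\sigma^2 / n$, and the sample variance $S^2$ has mean $\sigma^2$.

# Simulated vs theoretical moments (uses xbars and xvars from the chunk above)
c(mean(xbars), mu)
c(var(xbars), sigma ^ 2 / n)
c(mean(xvars), sigma ^ 2)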
The following code produces the plots.
# Histogram of sample means
br <- seq(from = min(xbars), to = max(xbars), len = 100)
hist(xbars, prob = TRUE, breaks = br, col = 8, xlab = "sample means",
     ylab = "density", main = "")
title(paste0("sample size = ", n))
curve(dnorm(x, mean = mu, sd = sigma), lty = 2, add = TRUE)
# Histogram of sample variances
br <- seq(from = min(xvars), to = max(xvars), len = 100)
hist(xvars, prob = TRUE, breaks = br, col = 8, xlab = "sample variances",
     ylab = "density", main = "")
title(paste0("sample size = ", n))
We calculate the values that appear in Table 7.1 of the notes. Firstly, we calculate the observed frequencies, which appear in the first row of the table. I have made two attempts at tabulating the numbers of births in each hour. Can you see why the first attempt is not correct? Can you see how the second attempt corrects this?
# Hour of birth
birth_hour <- floor(aussie_births$time / 60) + 1
# Tabulate the number of births in each hour, attempt 1
table(birth_hour)
# Tabulate the number of births in each hour, attempt 2
add_a_birth_in_each_hour <- 1:max(birth_hour)
births_each_hour <- table(c(birth_hour, add_a_birth_in_each_hour)) - 1
births_each_hour
# Check that we have 44 births
sum(births_each_hour)
# Find the observed frequencies
observed <- table(births_each_hour)
# Add the 5+ category, with frequency 0
observed <- c(observed, 0)
names(observed)[length(observed)] <- "5+"
observed
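An alternative to the add-a-birth-then-subtract trick above, which is not used in the notes but is a common R idiom, is to convert birth_hour to a factor whose levels contain all 24 hours of the day, so that hours with no births appear in the table with a count of zero.

# Tabulate the number of births in each hour, including hours with no births
# (assumes that the births are recorded over a 24-hour day)
table(factor(birth_hour, levels = 1:24))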
Now, we estimate the rate $\lambda$ of the assumed Poisson process of births.
# Estimate lambda: the total number of births divided by the 24-hour observation period
lambda_hat <- sum(births_each_hour) / 24
lambda_hat
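There are 44 births in 24 hours, so this gives

$$\hat{\lambda} = \frac{44}{24} \approx 1.83 \text{ births per hour}.$$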
Finally, we calculate the estimated expected frequencies and the residuals, and then print the middle part of Table 7.1, rounding the values to 1 decimal place.
# The possible numbers of births per hour observed in the data
nbirths <- 0:max(births_each_hour)
# Estimated expected frequencies over the 24 hours, for each count in nbirths
estimated_expected <- dpois(nbirths, lambda = lambda_hat) * 24
# The 5+ category receives the remaining expected frequency, so that the total is 24
estimated_expected <- c(estimated_expected, 24 - sum(estimated_expected))
estimated_expected
# Residuals: observed minus estimated expected frequencies
resids <- observed - estimated_expected
# The middle part of Table 7.1, rounded to 1 decimal place
round(rbind(observed, estimated_expected, resids), 1)
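As a final check, which is an addition to the code above, the estimated expected frequency in the 5+ category can also be calculated directly from the upper tail of the fitted Poisson distribution, and the rows of the table can be checked for consistency: the observed and estimated expected frequencies should each sum to 24 (one entry for each hour of the day) and the residuals should sum to 0.

# The 5+ expected frequency via the upper tail of the fitted Poisson distribution
ppois(max(births_each_hour), lambda = lambda_hat, lower.tail = FALSE) * 24
# Row totals: 24, 24 and 0 respectively
rowSums(rbind(observed, estimated_expected, resids))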