knitr::opts_chunk$set(comment = "", prompt = TRUE, collapse = TRUE)
#devtools::load_all()

Preliminaries

This vignette contains ideas and techniques that are not covered until later in STAT0002 and others (stochastic simulation and logistic regression) that are not part of the STAT0002 syllabus. Therefore, you should not worry too much about these details: just try to get a sense of how the R code works. The questions and comments below are there to prompt you to think about things.

For information about R see the STAT0004 Moodle page (if you are taking STAT1006) or the Introduction to R Moodle page (if you are not taking STAT1006, e.g. Natural Sciences students).

The R code used in this vignette is available in the file shuttle-vignette.R. The functions shuttle_sim and shuttle_sim_plot can be viewed either by typing the name of the function at the R command prompt > or on GitHub.

If you have any questions please ask them using the STAT0002 discussion forum.

The Challenger Space Shuttle Disaster

In the first lecture we discussed the Challenger Space Shuttle Disaster. In this vignette we are concerned mainly with the R code, so there is very little discussion of the problem, the data or statistical ideas. For discussion of these see the notes and/or lecture slides in the Notes section of the STAT0002 Moodle page.

In this vignette we consider how R can be used to analyse data related to this disaster. See @shuttle for more information about these data and analyses to estimate the probability of a catastrophic failure of the Challenger space shuttle under the launch conditions on 28th January 1986. This paper is used as a worked example for the STAT0002 Meet your Professor ICA.

In the following, R commands follow the command prompt > and # is used to create a comment, that is, to prevent R from trying to execute text that is a comment rather than a command.

First we load the stat1004 package.

library(stat1004)

The O-ring data

The data are available in a data frame called shuttle. A data frame is what R calls a table of data. Use ?shuttle to find out about these data.

# Print the data to the screen 
shuttle
  1. What do you think NA means?
# The function head() prints only the first 6 lines of the data (useful to see the structure of large datasets)
head(shuttle)
  1. Can you guess which R function prints only the last 6 lines of the data?
# Tabulate the number of O-rings with thermal distress
table(shuttle[, "damaged"])
# Repeat and assign the output to the vector o_ring_table 
o_ring_table <- table(shuttle[, 3])

Note the different way of selecting the column that we want: by its position (column 3) rather than by its name ("damaged").
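
As a quick check (not part of the main analysis), we can confirm that selecting the column by name, by number and with the $ operator all give the same result:

# These three expressions all extract the same column of shuttle
identical(shuttle[, "damaged"], shuttle[, 3])
identical(shuttle[, "damaged"], shuttle$damaged)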

We produce a bar plot of the numbers of distressed O-rings and a scatter plot of temperature against pressure. See also the vignette Chapter 2: Graphs (more than one variable).

barplot(o_ring_table, xlab = "number of distressed O-rings", ylab = "frequency")
attach(shuttle)
plot(pressure, temperature)
  1. Can you see what attach(shuttle) enabled us to do? Type ?attach to find out what the attach() function does.

Did you also note the warning message? It means that there is another object called pressure, in the R package called datasets. [Type library(help = "datasets") to see all the datasets available in the datasets package.] Using attach(shuttle) means that if we ask R for the variable pressure then R gives us shuttle[, "pressure"]. Otherwise pressure would give us the data from the datasets package. This highlights that we need to be careful if we use attach().
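
For example, while shuttle is attached the name pressure gives the shuttle column, and the masked dataset can still be reached explicitly using datasets::pressure:

# While shuttle is attached, pressure refers to shuttle[, "pressure"]
head(pressure)
# The pressure dataset in the datasets package is still available explicitly
head(datasets::pressure)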

# The opposite of the function attach() is the function detach()
detach(shuttle)
pairs(shuttle[, 3:5])
  1. Can you work out what the 3:5 bit is used for and what the pairs() function does?
# Remove the zeros from the O-ring damage data
not_zero <- shuttle$damaged > 0
not_zero
  1. Can you see what the vector not_zero contains?
xlim <- range(shuttle$temperature)
xlim
# Plot with no zeros
plot(shuttle$temperature[not_zero], shuttle$damaged[not_zero], xlim = xlim, ylim = c(0, 3), ann = FALSE)
title(xlab = "temperature / deg F", ylab = "number of distressed O-rings")
# Plot of all the data
plot(shuttle$temperature, shuttle$damaged, ann = FALSE)
title(xlab = "temperature / deg F", ylab = "number of distressed O-rings")
  1. Can you see/guess what the functions range and title and the function arguments xlim, ann, xlab and ylab do?

  2. Did you see yet another way to refer to the variable temperature in the shuttle dataset?

Relating the probability of O-ring damage to temperature

This section involves a statistical model, a logistic regression model, that is not part of the STAT0002 syllabus, so don't worry too much about the details that follow. However, this section involves things that we will study later in STAT0002, namely independence, the binomial distribution and (the general concept of) regression. We fit the model using a general fitting method called maximum likelihood estimation. This method is not formally part of the STAT0002 syllabus, but we will look at it very briefly later in the course.

Consider the six O-rings on the space shuttle. We suppose that on a launch at temperature $t$ degrees F:

  1. each of the six O-rings suffers thermal distress with the same probability $p(t)$;

  2. the O-rings suffer thermal distress, or do not, independently of each other.

Under these assumptions the number of O-rings that suffer thermal distress has a binomial distribution with parameters $6$ and $p(t)$. We will study this distribution in Section 5.3 of the STAT0002 notes. We have reason to believe that $p(t)$ depends on $t$, that is, the probability that an O-ring suffers thermal distress is different for different launch temperatures. A simple model that is used in this kind of situation is a linear logistic regression model, in which
$$ \ln \left(\frac{p(t)}{1-p(t)}\right) = \alpha + \beta t, $$
for some unknown constants $\alpha$ and $\beta$. That is, the logit $\ln[p(t) / (1-p(t))]$ of $p(t)$ is assumed to be a linear function of $t$. Inverting this equation gives
$$ p(t) = \frac{e^{\alpha + \beta t}}{1 + e^{\alpha + \beta t}}. $$

  1. Why would it not make sense to suppose that $p(t)$ is a linear function of $t$? [Bear in mind that $p(t)$ is a probability and therefore can be no smaller than 0 and no larger than 1.]

  2. How does $p(t)$ behave as $t$ becomes very small, and as $t$ becomes very large, and how does this depend on the value of $\beta$?

  3. What happens in the special case where $\beta = 0$?
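
Before fitting this model to the shuttle data we can get a feel for the shape of $p(t)$ by plotting it for some illustrative values of $\alpha$ and $\beta$ (these values are made up for illustration, not estimated from the data):

# Illustrative (made-up) values of alpha and beta
alpha <- 5
beta <- -0.1
temps <- seq(from = 30, to = 85, by = 0.1)
p <- exp(alpha + beta * temps) / (1 + exp(alpha + beta * temps))
plot(temps, p, type = "l", xlab = "temperature / deg F", ylab = "p(t)")
# plogis() computes the inverse logit, so this gives the same values
all.equal(p, plogis(alpha + beta * temps))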

A linear logistic regression model is a special case of a Generalized Linear Model. The function glm() can be used to fit this type of model, using maximum likelihood estimation to estimate the unknown parameters (or coefficients) $\alpha$ and $\beta$. First we create response data y (numbers of O-rings with and without thermal distress) and the explanatory data x (the launch temperatures).

# Create a matrix y containing 2 columns:
#   column 1: number of O-rings WITH thermal distress
#   column 2: number of O-rings WITHOUT thermal distress
y <- cbind(shuttle[1:23, 3], 6 - shuttle[1:23, 3])
head(y)
x <- shuttle[1:23, 4]
x

Then we fit the model.

shuttle_fit <- glm(y ~ x, family = binomial)
# Produce a summary of the estimates.  There is a lot of output: only look at the
# numbers in the column headed "Estimate".
summary(shuttle_fit)
alpha_hat <- shuttle_fit$coefficients[1]
beta_hat <- shuttle_fit$coefficients[2]

We calculate the fitted, or estimated, probability $\hat{p}(t)$ over a range of temperatures that includes 31 degrees F, and store these numbers in a vector called fitted_curve.

temp <- seq(from = 30, to = 85, by = 0.1)
linear_predictor <- alpha_hat + beta_hat * temp
fitted_curve <- exp(linear_predictor) / (1 + exp(linear_predictor))
  1. Can you guess what the function seq does? Use ?seq to find out.
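
An alternative, which should give the same numbers, is to use R's predict() function on the fitted model object, asking for predictions on the probability (response) scale:

# Predictions on the probability scale at the temperatures in temp
glm_probs <- predict(shuttle_fit, newdata = data.frame(x = temp), type = "response")
# These agree with the values calculated directly above
all.equal(as.numeric(glm_probs), fitted_curve)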

We plot the proportions of distressed O-rings against temperature, with the fitted logistic probability curve $\hat{p}(t)$ superimposed. The second plot, produced further below, is similar to the plot in the lecture slides.

plot(shuttle$temperature, shuttle$damaged / 6, ann = FALSE, ylim = c(0, 1))
title(xlab = "temperature / deg F", ylab = "proportion of distressed O-rings")
lines(temp, fitted_curve)

Now we produce a better plot, one that is similar to the plot that appears in the lecture slides.

# Find rows that repeat an earlier (damaged, temperature) combination
repeated_data <- which(duplicated(shuttle[, 3:4]))
shuttle[repeated_data, ]
# Shift the repeated values slightly so that overlapping points are visible
new_damaged <- shuttle$damaged
new_damaged[c(11, 13, 17, 22)] <- new_damaged[c(11, 13, 17, 22)] + 0.2
new_damaged[15] <- new_damaged[15] - 0.2
plot(shuttle$temperature, new_damaged / 6, ann = FALSE, ylim = c(0, 1), pch = 16)
title(xlab = "temperature (deg F)", ylab = "proportion of distressed O-rings")
lines(temp, fitted_curve)
legend("topright", legend = c("sample proportions", "fitted curve"), pch=c(16, -1), lty = c(-1, 1))
abline(v = 31, lty = 2)
  1. Can you see how the second plot differs from the first plot, and identify the bits of the R code that produce the differences? To find out about the many options for changing the appearance of a plot use ?par.

Using stochastic simulation to assess uncertainty

The smooth curve superimposed on the plot immediately above cannot be thought of as "the truth": it is merely an estimate $\hat{p}(t)$, based on a single dataset and some simple modelling assumptions, of how the probability $p(t)$ of O-ring distress might depend on launch temperature $t$. If the launches were repeated, under exactly the same conditions that produced the original data, then we would (with high probability) obtain a different dataset and hence a different fitted smooth curve. That is, different experiments produce different datasets, which leads to different estimated curves and therefore uncertainty about $p(t)$.

In reality we are not able to repeat the launches and so we have only one dataset, producing one estimated curve. However, we should appreciate that this dataset (and its estimated curve) is one example taken from the population of datasets (and their respective estimated curves) that could be obtained. The size of our uncertainty based on this one dataset is related to how variable this population of estimated curves is. If the estimated curves are very similar then it may not matter which of them we get from our real dataset, because our findings will be similar for all datasets. However, if the estimated curves are very different then our findings will vary a lot depending on which dataset we have obtained.

Although we cannot repeat the launches we can get an idea of how much uncertainty there is about $p(t)$ for given values of $t$ by simulating fake datasets from the model fitted to the real data. This involves getting a computer to `decide' whether or not a given O-ring suffers thermal distress during a launch at temperature $t$, in a random manner such that the probability of distress equals $\hat{p}(t)$. For more details about how we can use a computer to generate pseudo-random numbers (numbers that are close enough to being random for our purposes) see the vignette Stochastic simulation and/or the short Stochastic simulation video.
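
A minimal sketch of this idea, using the alpha_hat, beta_hat and x objects created above, is to simulate binomial counts at the real launch temperatures using the fitted probabilities. (The function shuttle_sim, used below, does something similar but with more options.)

# Fitted probabilities of O-ring distress at the 23 real launch temperatures
p_hat <- plogis(alpha_hat + beta_hat * x)
# One fake dataset: for each launch the number of distressed O-rings is
# simulated from a binomial(6, p_hat) distribution
fake_damaged <- rbinom(n = length(x), size = 6, prob = p_hat)
fake_damaged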

The function shuttle_sim simulates fake O-ring thermal distress data for Challenger Space Shuttle launches at different launch temperatures. The following code simulates 10 fake datasets of size 23, using the real launch temperatures.

# Set a random number `seed'.  
set.seed(1004)
# Simulate 10 fake datasets of size 23, using the real temperatures.
res <- shuttle_sim(n_sim = 10)
res

This function can also be used to simulate the number of distressed O-rings under the scenario that we launch the space shuttle many times at a single temperature. Here we do this 1000 times at 31 degrees F.

# Simulate the number of distressed O-rings for 1000 launches at 31 deg F.
res <- shuttle_sim(n_sim = 1000, temperature = 31)
res[1:100]
table(res)
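
Assuming that res contains the simulated numbers of distressed O-rings, we can use it to estimate, for example, the probability that at least one O-ring suffers thermal distress during a launch at 31 degrees F:

# Proportion of the 1000 simulated launches with at least one distressed O-ring
mean(res > 0)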

The function shuttle_sim_plot plots multiple linear logistic curves. Each curve is the result of simulating a fake dataset from the linear logistic model fitted to the real data and then estimating the linear logistic curve using the fake dataset. The following plot also contains the real data and, in blue, the curve fitted to the real data.

shuttle_sim_plot(n_sim = 50, col = "grey")

Given that we are simulating data using a computer we could consider a scenario where 10 launches are carried out at each temperature in the real dataset, i.e. the simulated datasets now have size 23 x 10 = 230, rather than 23. This plot shows how the estimated curves would vary between different simulated datasets.

shuttle_sim_plot(n_sim = 50, n_reps = 10, plot_real_data = FALSE, lty = 1)

Now we consider the case where the simulated datasets each have size 2300.

shuttle_sim_plot(n_sim = 50, n_reps = 100, plot_real_data = FALSE, lty = 1)

See also the function shuttle_movie for an animated version of shuttle_sim_plot.

Finally, we produce a plot that is (almost) the same as Figure 1.6 in the notes.

x <- shuttle_sim_plot(n_sim = 1000, plot = FALSE)
shuttle_sim_hists(x, temps = c(31, 50, 65, 80), col = 8)

References

Dalal, S. R., Fowlkes, E. B. and Hoadley, B. (1989) Risk analysis of the space shuttle: pre-Challenger prediction of failure. Journal of the American Statistical Association, 84(408), 945-957.
