.tutorial <- FALSE .solutions <- TRUE begin_sol <- function(sol = .solutions) {if (!sol) {"<!--\n"} else {"\n<hr />**Solution:**\n"}} end_sol <- function(sol = .solutions) {if (!sol) {"-->\n"} else {"\n<hr />\n"}} if (.tutorial) { library(learnr) require(gradethis) learnr::tutorial_options( exercise.timelimit = 120, exercise.checker = grade_learnr, exercise.startover = TRUE, exercise.diagnostics = TRUE, exercise.completion = TRUE # exercise.error.check.code = exercise.error.check.code. ) } library(ggplot2) knitr::opts_chunk$set( echo = FALSE ) ### set.seed(123L) z <- rep(1:10, times = 10) data <- data.frame(z = z, y = 2 * z + rnorm(length(z), sd = 4)) mod <- lm(y ~ z, data) newdata <- data.frame(z = -10:20) data_pred <- cbind(newdata, data.frame(fit = predict(mod, newdata))) pred <- cbind(newdata, predict(mod, newdata, interval = "prediction")) library(ggplot2) theme_set(theme_bw())
In this lab session you'll gradually develop R code for estimating multiple linear models, and plotting the results. You will not hand in anything, but you should keep your code script file for later use.
File
$\rightarrow$New Project
$\rightarrow$New Directory
$\rightarrow$New Project
)
Make sure you choose a location where no part of the path contains any blank spaces, since that can sometimes cause problems. Name the folder Lab01
, for example.
3.
Next time you want to work on the course files, use
File
$\rightarrow$Open Project
instead (or, if your system has linked .Rproj
files to RStudio, you can open RStudio by opening the .Rproj
file in the project folder)Console
window, run devtools::install_github("finnlindgren/StatCompLab", dependencies = TRUE)
Lecture01.pdf
lecture slides so you can refer to them easily during the lab. Read through the pages on formulas and functions before you continue with the lab.Tools
$\rightarrow$ Global Options
$\rightarrow$ Code
$\rightarrow$ Diagnostics
This allows you to turn on margin notes that automatically alert you to potential code problems. * Run the current line of code (or a selected section of code) and step to the next line:
Press the Run
button in the top right corner of the editor window, or
Press Ctrl+Enter
* Run all the code in the current script file:
Press the Source
button in the top right corner of the editor window, or
Press Ctrl+Shift+S
, or
Press Ctrl+Shift+Enter
Note: the first two options disable the normal automatic value printing; you'll only see the ouput that is explicitly fed into print()
. The third option displays both the code and the results in the Console window.
* The styler
package (should already be available on rstudio.cloud
. Use install.packages("styler")
in R if you're running on a local installation).
This will enable options to reformat your script code to (usually) more readable code, by pressing the Addins
button below the main program menu and choosing e.g. Style active file
.
File
$\rightarrow$New File
$\rightarrow$R Script
File
$\rightarrow$Save
, or
Press Ctrl+S
,
and choose a descriptive filename (e.g. lab_1_code.R
)During the lab, remember to save the script file regularly, to avoid losing any work ( unsaved changes may still be remembered by the system, but it is much safer to explicitly save the files).
There are several ways to view this tutorial document; either on the StatCompLab webpage, by running
learnr::run_tutorial("Tutorial01", package = "StatCompLab")
in RStudio, or vignette("Tutorial01", package = "StatCompLab")
.
(Note: the run_tutorial
option contains some additional interactive questions and exercises that are not supported by the other two methods.)
We will start with the simple linear model $$ \begin{aligned} y_i &= \beta_0 + z_i \beta_z + e_i , \end{aligned} $$ where $y_i$ are observed values, $z_i$ are observed values ("covariates") that we believe have a linear relationship with $y_i$, $e_i$ are observation noise components with variance $\sigma_e^2$, and $\beta_0$, $\beta_z$, and $\sigma_e^2$ are model parameters.
First, generate synthetic data to use when developing the code for estimating models of this type. Enter the following code into your script file and run it with e.g. Ctrl+Enter
:
z <- rep(1:10, times = 10) data <- data.frame(z = z, y = 2 * z + rnorm(length(z), sd = 4))
What does rep()
, length()
, and rnorm()
do?
Look at the help pages by running ?rep
, ?rnorm
, and ?length
in the interactive Console
window, or searching for them in the Help
pane.
question( "What values of $\\beta_0$, $\\beta_z$, and $\\sigma_e^2$ did the synthetic data generation use?", answer("$\\beta_0=0$, $\\beta_z=2$, and $\\sigma_e^2=4$", message = "The `sd` parameter to `rnorm` is $\\sigma_e$, not $\\sigma_e^2$"), answer("$\\beta_0=2$, $\\beta_z=0$, and $\\sigma_e^2=4$", message = "The synthetic data did not use an offset/intercept for $y$, and the `sd` parameter to `rnorm` is $\\sigma_e$, not $\\sigma_e^2$"), answer("$\\beta_0=0$, $\\beta_z=2$, and $\\sigma_e^2=16$", correct = TRUE), answer("$\\beta_0=2$, $\\beta_z=0$, and $\\sigma_e^2=16$", message = "The synthetic data did not use an offset/intercept for $y$."), allow_retry = TRUE )
You can plot the data with "base graphics",
plot(data$z, data$y)
or with ggplot
,
# Put the first two lines (library and theme_set) at the beginning of your script file. # They only need to be run once in your R session! # Load the ggplot2 package library(ggplot2) # Change the default theme (avoids grey background in the plots): theme_set(theme_bw()) # Now we can plot: ggplot(data) + geom_point(aes(z, y))
For this simple plot, the ggplot
approach may seem more complicated than necessary, but for more complex plots ggplot
will make plotting much easier. In the code above, we first supply the data set, and then add information about how to plot it, using a grammar of graphics.
See ?aes
for some more information about the parameters controlling the aesthetics of the geom that adds points to the plot.
Use the lm()
function to estimate the model and save the estimated model in a variable called mod
.
mod <- lm(y ~ z, data)
Now we want to plot the linear predictor for the model as a function of $z$, between $z=-10$ and $z=20$. First, create a new data.frame
:
newdata <- data.frame(z = -10:20)
Note that we don't have any corresponding $y$-values yet!
Use predict()
, plot()
, and geom_line()
to plot the linear predictor and the original data points in a single figure. Also add the true predictor with the help of geom_abline(intercept = ..., slope = ...)
(use the col
parameter to set a different color!). Create a new data frame and supply it directly to geom_line
to override the original data:
data_pred <- cbind(newdata, data.frame(fit = predict(mod, newdata))) ggplot(???) + ??? + geom_line(aes(???), data = data_pred)
Structure the code by adding a linebreak after each +
operation.
data_pred <- cbind(newdata, data.frame(fit = predict(mod, newdata))) ggplot(data) + geom_point(???) + geom_line(aes(???), data = data_pred) + geom_abline(intercept = ???, slope = ???, col = ???)
data_pred <- cbind(newdata, data.frame(fit = predict(mod, newdata))) ggplot(data) + geom_point(???) + geom_line(aes(???), data = data_pred) + geom_abline(intercept = 0, slope = 2, col = "red")
data_pred <- cbind(newdata, data.frame(fit = predict(mod, newdata))) ggplot(data) + geom_point(aes(z, y)) + geom_line(aes(z, fit), data = data_pred) + geom_abline(intercept = 0, slope = 2, col = "red")
data_pred <- cbind(newdata, data.frame(fit = predict(mod, newdata)))
r begin_sol()
r end_sol()
Now modify the plotting code to also show prediction intervals. Hint: use the interval="prediction"
parameter for predict()
, and combine the output (which is now a matrix, and there is no need to convert it to a data.frame
first) with newdata
with the help of cbind
. The resulting data.frame
should have variables named z
, fit
, lwr
, and upr
, and the whole object should be called pred
.
pred <- cbind( newdata, predict(???) )
pred <- cbind( newdata, predict(mod, newdata, ???) )
pred <- cbind( newdata, predict(mod, newdata, interval = "prediction") )
grade_this({ msg <- character(0) if (!is.data.frame(.result)) {msg <- c(msg, "The result should be a data.frame")} if (length(msg) == 0) { for (nm in c("z", "fit", "lwr", "upr")) { if (!(nm %in% names(.result))) { msg <- c(msg, paste0("Column '", nm, "' is missing")) } } } if (length(msg) == 0) { graded(correct = TRUE, message = "OK!") } else { graded(correct = FALSE, message = paste0(msg, collapse = "\n\n")) } })
r begin_sol()
r end_sol()
Use the geom_ribbon(data = ???, aes(x = ???, ymin = ???, ymax = ???), alpha = 0.25)
function to add the prediction intervals to the figure. The alpha
parameter sets the "transparency" of the ribbon. Experiment with different values between 0 and 1. The result should look similar to this:
ggplot(data) + geom_point(aes(z, y)) + geom_line(data = pred, aes(z, fit)) + geom_abline(intercept = 0, slope = 2, col = "red") + geom_ribbon(data = pred, aes(???), alpha = 0.25)
ggplot(data) + geom_point(aes(z, y)) + geom_line(data = pred, aes(z, fit)) + geom_abline(intercept = 0, slope = 2, col = "red") + geom_ribbon(data = pred, aes(x = z, ymin = lwr, ymax = upr), alpha = 0.25)
In "base graphics", the following would produce a similar result, but without the shaded bands:
plot(newdata$z, pred[, "fit"], type = "l", ylim = range(pred)) lines(newdata$z, pred[, "lwr"], lty = 2) lines(newdata$z, pred[, "upr"], lty = 2) points(data$z, data$y, pch = 20) abline(0, 2, col = 2)
Note: This section introduces functions that will be discussed in more detail in week 2.
Let's say we want to redo the data analysis using different models (after all, we only know the true model because we generated synthetic data!). Instead of copying the plotting code for each new model, let's create a function instead! Start with the following skeleton code, and fill in the missing bits:
# x: a formula # newdata: data for prediction # xname: A string containing the name of the variable on the x-axis plot_predictions <- function(x, newdata, xname = "x") { pred <- ??? ggplot(pred) + geom_line(aes_string(xname, "???")) + geom_ribbon(???) }
plot_predictions <- function(x, newdata, xname = "x") { pred <- cbind(newdata, predict(x, newdata, interval = "prediction")) ggplot(pred) + geom_line(aes_string(xname, "fit")) + geom_ribbon(aes_string(x = xname, ymin = "lwr", ymax = "upr"), alpha = 0.25) }
r begin_sol()
r end_sol()
What should xname
be set to when we call this function for our specific data?
You should be able to run the following code to regenerate your previous figure:
plot_predictions(mod, newdata, xname = "z") + geom_point(data = data, aes(z, y)) + geom_abline(intercept = 0, slope = 2, col = "red")
Remark: Why can we use the + geom_*
technique to add to the plot drawn by plot_predictions
even though we don't directly call ggplot()
here? Every function returns its last computed object as output. In this case, that's a ggplot
object, and we can use it just as if we had spelled out the call to ggplot()
. This also means that we could in principle write a function that adds features to a plot, taking an existing ggplot
object as input, making complex plotting more structured and modular.
You can change the plot labels and add a title by adding calls to xlab()
, ylab()
, and ggtitle()
to the plot:
plot_predictions <- function(x, newdata, xname = "x", ???, ...) { ??? ggplot(???) + ??? + xlab(xname) + ylab(ylab) } plot_predictions(mod, newdata, xname = "z", xlab = "z", ylab = "y")
This passes any additional parameters through to the plot()
function.
The provided solution has a further generalised version of the plotting function, that also adds the confidence intervals for the predictor curve to the plot.
plot_prediction <- function(x, newdata, xname = "x", xlab = NULL, ylab = NULL, ...) { if (is.null(xlab)) { xlab <- xname } if (is.null(ylab)) { ylab <- "Response" } pred <- cbind(newdata, predict(x, newdata, interval = "prediction")) pl <- ggplot() + geom_line(data = pred, aes_string(xname, "fit")) + geom_ribbon(data = pred, aes_string(xname, ymin = "lwr", ymax = "upr"), alpha = 0.25) # Also add the confidence intervals for the predictor curve conf <- cbind(newdata, predict(x, newdata, interval = "confidence")) pl <- pl + geom_ribbon(data = conf, aes_string(xname, ymin = "lwr", ymax = "upr"), alpha = 0.25) # Add lables: pl + xlab(xlab) + ylab(ylab) }
# Basic solution plot_prediction <- function(x, newdata, xname = "x", xlab = NULL, ylab = NULL, ...) { if (is.null(xlab)) { xlab <- xname } if (is.null(ylab)) { ylab <- "Response" } pred <- cbind(newdata, predict(x, newdata, interval = "prediction")) pl <- ggplot() + geom_line(data = pred, aes_string(xname, "fit")) + geom_ribbon(data = pred, aes_string(xname, ymin = "lwr", ymax = "upr"), alpha = 0.25) # Add labels: pl + xlab(xlab) + ylab(ylab) }
# Extended version of the solution: plot_prediction <- function(x, newdata, xname = "x", xlab = NULL, ylab = NULL, ...) { if (is.null(xlab)) { xlab <- xname } if (is.null(ylab)) { ylab <- "Response" } pred <- cbind(newdata, predict(x, newdata, interval = "prediction")) pl <- ggplot() + geom_line(data = pred, aes_string(xname, "fit")) + geom_ribbon(data = pred, aes_string(xname, ymin = "lwr", ymax = "upr"), alpha = 0.25) # Also add the confidence intervals for the predictor curve conf <- cbind(newdata, predict(x, newdata, interval = "confidence")) pl <- pl + geom_ribbon(data = conf, aes_string(xname, ymin = "lwr", ymax = "upr"), alpha = 0.25) # Add lables: pl + xlab(xlab) + ylab(ylab) }
r begin_sol()
r end_sol()
Use your new function to plot the predictions for the quadratic model
mod2 <- lm(y ~ 1 + z + I(z ^ 2), data)
The I()
syntax means that you don't have to create a separate data variable equal to the square of the z-values; the lm()
function will create it for you internally. The result should look similar to this:
# Run the code in your script
plot_prediction(mod2, newdata, xname = "z", ylab = "y") + geom_point(data = data, aes(z, y)) + geom_abline(intercept = 0, slope = 2, col = "red") + ggtitle("Confidence and prediction intervals")
Finally, let's estimate and plot predictions for four polynomial models, by first creating a list of formulas:
formulas <- c(y ~ 1, y ~ z, y ~ z + I(z^2), y ~ z + I(z^2) + I(z^3))
Look at the lecture slides and use the lapply()
technique to estimate all four models and store the result in a variable called mods
.
formulas <- c(y ~ 1, y ~ z, y ~ z + I(z^2), y ~ z + I(z^2) + I(z^3)) mods <- lapply(...)
formulas <- c(y ~ 1, y ~ z, y ~ z + I(z^2), y ~ z + I(z^2) + I(z^3)) mods <- lapply(formulas, function(x) lm(...))
formulas <- c(y ~ 1, y ~ z, y ~ z + I(z^2), y ~ z + I(z^2) + I(z^3)) mods <- lapply(formulas, function(x) lm(x, data))
r begin_sol()
r end_sol()
You can now use something like the following code to plot all the results:
for (k in seq_along(formulas)) { pl <- plot_prediction(mods[[k]], newdata, xname = "z", ylab = "y") + geom_point(data = data, aes(z, y)) + geom_abline(intercept = 0, slope = 2, col = "red") + ggtitle(as.character(formulas[k])) print(pl) }
Look at the help text for seq_along()
! Storing the plot as an object, pl
, and then explicitly "printing" it is sometimes needed to ensure that all the plots are generated into a report document.
Note: We will look at ways of combining multiple plots later.
Does the information displayed in the plots match your intuition about the different models?
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.