```r
.tutorial <- FALSE
.solutions <- FALSE
begin_sol <- function(sol = .solutions, quiet = FALSE) {
  if (!sol) {
    "<!--\n"
  } else if (quiet) {
    "\n<hr />\n"
  } else {
    "\n<hr />**Solution:**\n"
  }
}
end_sol <- function(sol = .solutions) {
  if (!sol) {
    "-->\n"
  } else {
    "\n<hr />\n"
  }
}
library(StatCompLab)
if (.tutorial) {
  library(learnr)
  require(gradethis)
  learnr::tutorial_options(
    exercise.timelimit = 120,
    exercise.checker = grade_learnr,
    exercise.startover = TRUE,
    exercise.diagnostics = TRUE,
    exercise.completion = TRUE
  )
}
library(ggplot2)
knitr::opts_chunk$set(
  echo = FALSE,
  fig.width = 6
)
theme_set(theme_bw())
set.seed(1234L)
suppressPackageStartupMessages(library(tidyverse))
```
This document can either be accessed in html form on https://finnlindgren.github.io/StatCompLab/, as a vignette in StatCompLab version 21.4.1 or later, with `vignette("Project01", "StatCompLab")`, or in pdf format on Learn.
Use the template material in the zip file `project01.zip` in Learn to write your report. Add all your function definitions to the `code.R` file and write your report using `report.Rmd`. You must upload the following three files as part of this assignment: `code.R`, `report.html`, and `report.Rmd`. Specific instructions for these files are in the `README.md` file.
The main text in your report should be a coherent presentation of theory and discussion of methods and results, showing code for code chunks that perform computations and analysis but not code for code chunks that generate figures or tables.
Use the `echo=TRUE` and `echo=FALSE` chunk options to control what code is visible.
The `styler` package addin is useful for restyling code for better and consistent readability. It works for both `.R` and `.Rmd` files.
The `Project01Hints` file contains some useful tips, and the `CWmarking` file contains guidelines. Both are attached in Learn as pdf files.
Submission should be done through Gradescope.
As in Lab 4, consider the Poisson model for observations $\bm{y}=\{y_1,\dots,y_n\}$:
$$
y_i \sim \pPo(\lambda), \quad\text{independent for $i=1,\dots,n$,}
$$
which has joint probability mass function
$$
p(\bm{y}|\lambda) = \exp(-n\lambda) \prod_{i=1}^n \frac{\lambda^{y_i}}{y_i!} .
$$
In Lab 4, one of the considered parameterisation alternatives was
Create a function called `estimate_coverage` (in `code.R`; also document the function) to perform interval coverage estimation, taking arguments `CI` (a function object for confidence interval construction, taking arguments `y` and `alpha` and returning a 2-element vector; see Lab 4), `N` (the number of simulation replications to use for the coverage estimate), `alpha` (`1-alpha` is the intended coverage probability), `n` (the sample size), and `lambda` (the true `lambda` value for the Poisson model).
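A minimal sketch of such a coverage estimator is shown below. The argument names follow the description above, but the internal structure is only one possible implementation, not a prescribed solution.

```r
# Sketch: estimate the empirical coverage of a confidence interval method.
# CI: function(y, alpha) returning a 2-element vector (lower, upper).
# N: number of simulation replications; alpha: 1 minus the intended coverage;
# n: sample size; lambda: true Poisson parameter.
estimate_coverage <- function(CI, N, alpha, n, lambda) {
  cover <- 0
  for (k in seq_len(N)) {
    y <- rpois(n, lambda)
    ci <- CI(y, alpha)
    cover <- cover + as.numeric(ci[1] <= lambda && lambda <= ci[2])
  }
  cover / N
}
```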
Use your function `estimate_coverage` to estimate the coverage of the confidence interval construction for samples from $\pPo(\lambda)$, and discuss the following:
The aim is to estimate the parameters of a Bayesian statistical model of material use in a 3D printer. The printer uses rolls of filament that get heated and squeezed through a moving nozzle, gradually building objects. The objects are first designed in a CAD program (Computer Aided Design) that also estimates how much material will be required to print the object.
The data can be loaded with `data("filament1", package = "StatCompLab")`, and contains information about one 3D-printed object per row. The columns are

- `Index`: an observation index
- `Date`: printing dates
- `Material`: the printing material, identified by its colour
- `CAD_Weight`: the object weight (in grams) that the CAD software calculated
- `Actual_Weight`: the actual weight of the object (in grams) after printing

If the CAD system and printer were both perfect, the `CAD_Weight` and `Actual_Weight` values would be equal for each object.
In reality, there are random variations, for example, due to varying humidity and temperature, and systematic deviations due to the CAD system not having perfect information about the properties of the printing materials.
When looking at the data (see below), it is clear that the variability of the data is larger for larger values of `CAD_Weight`. The printer operator has made a simple physics analysis and settled on a model where the connection between `CAD_Weight` and `Actual_Weight` follows a linear model, and the variance increases with the square of `CAD_Weight`.
If we denote the CAD weight for observation $i$ by $x_i$, and the corresponding actual weight by $y_i$, the model can be defined by
$$
y_i \sim \pN(\beta_1 + \beta_2 x_i,\, \beta_3 + \beta_4 x_i^2) .
$$
To ensure positivity of the variance, the parameterisation $\mv{\theta}=[\theta_1,\theta_2,\theta_3,\theta_4]=[\beta_1,\beta_2,\log(\beta_3),\log(\beta_4)]$ is introduced, and the printer operator assigns independent prior distributions as follows:
$$
\begin{aligned}
\theta_1 &\sim \pN(0,\gamma_1), \\
\theta_2 &\sim \pN(1,\gamma_2), \\
\theta_3 &\sim \pLogExp(\gamma_3), \\
\theta_4 &\sim \pLogExp(\gamma_4),
\end{aligned}
$$
where $\pLogExp(a)$ denotes the logarithm of an exponentially distributed random variable with rate parameter $a$, as seen in Tutorial 4. The $\mv{\gamma}=(\gamma_1,\gamma_2,\gamma_3,\gamma_4)$ values are positive parameters.
The printer operator reasons that random fluctuations in the material properties (such as the density) and in the room temperature should lead to a relative error rather than an additive error, and that the model above is an approximation of that. The basic physics assumption is that the error in the CAD software calculation of the weight is proportional to the weight itself.
Start by loading the data and plotting it.
data("filament1", package = "StatCompLab") ggplot(filament1, aes(CAD_Weight, Actual_Weight, colour = Material)) + geom_point()
With the help of `dnorm` and the `dlogexp` function (see the `code.R` file for documentation), define and document (in `code.R`) a function `log_prior_density` with arguments `theta` and `params`, where `theta` is the $\mv{\theta}$ parameter vector, and `params` is the vector of $\mv{\gamma}$ parameters. Your function should evaluate the logarithm of the joint prior density $p(\mv{\theta})$ for the four $\theta_i$ parameters.
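One possible structure is sketched below. It assumes that `dlogexp()` follows the interface documented in the template `code.R` (arguments `x`, `rate`, and `log`), and that the second parameter of $\pN$ denotes the variance; check both assumptions against the template before relying on this.

```r
# Sketch: log of the joint prior density p(theta), assuming independent priors
# theta_1 ~ N(0, gamma_1), theta_2 ~ N(1, gamma_2)  (gamma_1, gamma_2 as variances),
# theta_3 ~ LogExp(gamma_3), theta_4 ~ LogExp(gamma_4).
log_prior_density <- function(theta, params) {
  dnorm(theta[1], mean = 0, sd = sqrt(params[1]), log = TRUE) +
    dnorm(theta[2], mean = 1, sd = sqrt(params[2]), log = TRUE) +
    dlogexp(theta[3], rate = params[3], log = TRUE) +
    dlogexp(theta[4], rate = params[4], log = TRUE)
}
```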
With the help of `dnorm`, define and document a function `log_like`, taking arguments `theta`, `x`, and `y`, that evaluates the observation log-likelihood $p(\mv{y}|\mv{\theta})$ for the model defined above.
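A sketch under the model definition above, using the back-transformations $\beta_3 = \exp(\theta_3)$ and $\beta_4 = \exp(\theta_4)$:

```r
# Sketch: observation log-likelihood log p(y | theta) for
# y_i ~ N(beta_1 + beta_2 * x_i, beta_3 + beta_4 * x_i^2).
log_like <- function(theta, x, y) {
  mu <- theta[1] + theta[2] * x
  sigma2 <- exp(theta[3]) + exp(theta[4]) * x^2
  sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}
```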
Define and document a function `log_posterior_density` with arguments `theta`, `x`, `y`, and `params`, which evaluates the logarithm of the posterior density $p(\mv{\theta}|\mv{y})$, apart from some unevaluated normalisation constant.
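Given the two functions above, a minimal sketch is simply their sum on the log scale:

```r
# Sketch: unnormalised log-posterior-density log p(theta | y),
# i.e. log-prior plus log-likelihood, dropping the normalisation constant.
log_posterior_density <- function(theta, x, y, params) {
  log_prior_density(theta, params) + log_like(theta, x, y)
}
```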
Define a function `posterior_mode` with arguments `theta_start`, `x`, `y`, and `params`, that uses `optim` together with `log_posterior_density` and the filament data to find the mode $\bm{\mu}$ of the log-posterior-density, and evaluates the Hessian at the mode as well as the inverse of the negated Hessian, $\bm{S}$. This function should return a list with elements `mode` (the posterior mode location), `hessian` (the Hessian of the log-density at the mode), and `S` (the inverse of the negated Hessian at the mode). See the documentation for `optim` for how to do maximisation instead of minimisation.
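One way to organise this is sketched below, using `control = list(fnscale = -1)` to maximise and the numerically estimated Hessian returned by `optim`; other optimisation settings are equally valid.

```r
# Sketch: posterior mode and Gaussian approximation ingredients.
posterior_mode <- function(theta_start, x, y, params) {
  opt <- optim(
    par = theta_start,
    fn = log_posterior_density,
    x = x, y = y, params = params,
    control = list(fnscale = -1),  # maximise instead of minimise
    hessian = TRUE
  )
  list(
    mode = opt$par,          # posterior mode location mu
    hessian = opt$hessian,   # Hessian of the log-density at the mode
    S = solve(-opt$hessian)  # inverse of the negated Hessian
  )
}
```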
Let all $\gamma_i=1$, $i=1,2,3,4$, and use `posterior_mode` to evaluate the inverse of the negated Hessian at the mode, in order to obtain a multivariate Normal approximation $\pN(\mv{\mu},\mv{S})$ to the posterior distribution for $\mv{\theta}$. Use start values $\mv{\theta} = \mv{0}$.
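For example, along these lines (the object names `fit`, `mu`, and `S` are hypothetical and reused in the later sketches):

```r
# Sketch: Gaussian approximation of the posterior with all gamma_i = 1.
params <- c(1, 1, 1, 1)
fit <- posterior_mode(
  theta_start = c(0, 0, 0, 0),
  x = filament1$CAD_Weight,
  y = filament1$Actual_Weight,
  params = params
)
mu <- fit$mode
S <- fit$S
```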
The aim is to construct a 90% Bayesian credible interval for each $\beta_j$ using importance sampling, similarly to the method used in Lab 4. There, a one-dimensional Gaussian approximation of the posterior of a parameter was used. Here, we will instead use a multivariate Normal approximation as the importance sampling distribution. The functions `rmvnorm` and `dmvnorm` in the `mvtnorm` package can be used to sample from and evaluate densities of multivariate Normal distributions.
Define and document a function `do_importance` taking arguments `N` (the number of samples to generate), `mu` (the mean vector for the importance distribution), `S` (the covariance matrix), and any other additional parameters that are needed by the function code. The function should output a `data.frame` with five columns, `beta1`, `beta2`, `beta3`, `beta4`, and `log_weights`, containing the $\beta_i$ samples and normalised $\log$-importance-weights, so that `sum(exp(log_weights))` is $1$. Use the `log_sum_exp` function (see the `code.R` file for documentation) to compute the needed normalisation information.
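A sketch of one possible implementation follows. It assumes the importance distribution is the $\pN(\mv{\mu},\mv{S})$ approximation for $\mv{\theta}$, that `log_sum_exp(x)` returns $\log\sum_i \exp(x_i)$ as documented in the template, and that `x`, `y`, and `params` are the extra arguments needed here.

```r
# Sketch: importance sampling from a multivariate Normal approximation,
# returning beta-scale samples and normalised log-importance-weights.
do_importance <- function(N, mu, S, x, y, params) {
  theta <- mvtnorm::rmvnorm(N, mean = mu, sigma = S)
  # Unnormalised log-weights: log-posterior minus log-importance-density
  log_w <- vapply(
    seq_len(N),
    function(k) {
      log_posterior_density(theta[k, ], x = x, y = y, params = params) -
        mvtnorm::dmvnorm(theta[k, ], mean = mu, sigma = S, log = TRUE)
    },
    numeric(1)
  )
  log_w <- log_w - log_sum_exp(log_w)  # so that sum(exp(log_w)) == 1
  data.frame(
    beta1 = theta[, 1],
    beta2 = theta[, 2],
    beta3 = exp(theta[, 3]),  # back-transform from theta_3 = log(beta_3)
    beta4 = exp(theta[, 4]),  # back-transform from theta_4 = log(beta_4)
    log_weights = log_w
  )
}
```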
Use your defined functions to compute an importance sample of size $N=10000$.
Plot the empirical weighted CDFs together with the un-weighted CDFs for each parameter, with the help of `stat_ewcdf`, and discuss the results. To achieve simpler `ggplot` code, you may find `pivot_longer(???, starts_with("beta"))` and `facet_wrap(vars(name))` useful.
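Following these hints, one possible sketch is shown below. It reuses the hypothetical `do_importance`, `mu`, `S`, and `params` objects from the earlier sketches, assumes the tidyverse is loaded (as in the document setup), and assumes `stat_ewcdf` accepts a `weights` aesthetic as described in the StatCompLab documentation.

```r
# Sketch: importance sample of size N = 10000, then weighted vs unweighted CDFs.
imp <- do_importance(10000, mu = mu, S = S,
                     x = filament1$CAD_Weight,
                     y = filament1$Actual_Weight,
                     params = params)
imp_long <- pivot_longer(imp, starts_with("beta"))
ggplot(imp_long, aes(value)) +
  stat_ewcdf(aes(weights = exp(log_weights), colour = "Weighted")) +
  stat_ecdf(aes(colour = "Unweighted")) +
  facet_wrap(vars(name), scales = "free_x") +
  labs(y = "CDF", colour = NULL)
```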
Construct 90% credible intervals for each of the four model parameters, based on the importance sample. In addition to `wquantile` and `pivot_longer`, the methods `group_by` and `summarise` are helpful. You may wish to define a function `make_CI` taking arguments `x`, `weights`, and `prob` (to control the intended coverage probability).
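A sketch of such a helper and its use is shown below; it assumes `wquantile(x, probs, weights = ...)` as documented in StatCompLab, and reuses the hypothetical `imp_long` object from the previous sketch.

```r
# Sketch: weighted credible interval helper and 90% intervals per parameter.
make_CI <- function(x, weights, prob = 0.9) {
  q <- wquantile(x,
                 probs = c((1 - prob) / 2, 1 - (1 - prob) / 2),
                 weights = weights)
  data.frame(lower = q[1], upper = q[2])
}

imp_long %>%
  group_by(name) %>%
  summarise(make_CI(value, weights = exp(log_weights), prob = 0.9),
            .groups = "drop")
```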