.tutorial <- FALSE .solutions <- FALSE begin_sol <- function(sol = .solutions, quiet = FALSE) { if (!sol) { "<!--\n" } else if (quiet) { "\n<hr />\n" } else { "\n<hr />**Solution:**\n" } } end_sol <- function(sol = .solutions) { if (!sol) { "-->\n" } else { "\n<hr />\n" } } library(StatCompLab) if (.tutorial) { library(learnr) require(gradethis) learnr::tutorial_options( exercise.timelimit = 120, exercise.checker = grade_learnr, exercise.startover = TRUE, exercise.diagnostics = TRUE, exercise.completion = TRUE # exercise.error.check.code = exercise.error.check.code. ) } library(ggplot2) knitr::opts_chunk$set( echo = FALSE, # dev = "png", # dev.args = list(type = "cairo-png"), fig.width = 6 ) library(ggplot2) theme_set(theme_bw()) set.seed(1234L)
report.Rmd
,
that you should expand with your answers and solutions for this project assignment,
and two code files, code.R
containing some predefined functions, and my_code.R
that should be used for some of your own code. Do not change these filenames.my_code.R
file, and included in the report as indicated in
the template document (also see the RMdemo.Rmd
, my_code.R
, and T4sol.Rmd
document examples from the Week 4 lecture and example materials on Learn).report.Rmd
file should be
shown in a separate section at the end of the report, so that the main text is
a coherent presentation of theory and discussion of methods and results.
In the code presentation section, include a brief code comment
for each chunk. Also include brief documentation as code comments before each function
defined in my_code.R
.styler
package Addin for restyling code for better and consistent readability
works for both .R
and .Rmd
files.The assignment is submitted by pushing your work to the github repository
before your submission deadline.
Make sure you include your name, student number, and github username in both report.Rmd
and my_code.R
. When submitting, also include a generated html version of the report,
with filename report.html
. Normally, generated images are automatically
included in the html file itself; otherwise they need to be added to the repository as well.
It is strongly recommended to commit and push to github frequently. If you're uncertain about whether you're pushing all the needed submission files to github, ask Finn to check the repository contents at least a week before the submisison deadline.
The 10 sub-tasks contribute equally to the total mark, but the project will be marked as a whole, between $0$ and $40$. A marking guide will be posted on Learn.
\newpage
In Lab 3, you investigated data from an archaeological excavation on Gotland in the Baltic sea. In the 1920s, a battle gravesite from 1361 was subject to several archaeological excavations. A total of $493$ femurs (thigh bones) ($256$ left, and $237$ right) were found. We wanted to figure out how many persons were likely buried at the gravesite. It must reasonably have been at least $256$, but how many more? In this assignment, you will use importance sampling to estimate credible intervals for this problem.
In the lab, it seemed difficult to extract much information from the sample, since there
were only two observations and two parameters. You will here investigate if
including data from other excavations of the same type might decrease the uncertainty about
the number of persons buried, through improved information about the average detection
probability, $\phi$. The function arch_data
in code.R
returns a data frame
with data for up to four different excavations (note: only one of them is real; the other
three are synthetic data generated for this assignment).
We assume the same model as in the lab for each excavation, with an unknown $N_j$, $j=1,\dots,J$, for each of the $J$ excavations, and a common detection probability $\phi$. We assume them model that, conditionally on ${N_j}$ and $\phi$,
$(Y_{j,i}|N_j,\phi)\sim\pBin(N_j,\phi)$, $i=1,2$, all independent, and prior distributions
$N_j\sim\pGeom(\xi)$, $0<\xi<1$, and $\phi\sim\pBeta(a, b)$, $a,b>0$,
$$
\begin{aligned}
p_{Y_{j,i}|N_j=n,\phi}(y) = \pP(Y_{j,i}=y|N_j=n,\phi) &= {n \choose y} \phi^y(1-\phi)^{n-y},
\quad y=0,\dots,n, \
p_{N_j}(n) = \pP(N_j=n) &= \xi\,(1-\xi)^n,\quad n=0,1,2,3,\dots, \
p_\phi(\phi) &= \frac{\phi^{a-1}(1-\phi)^{b-1}}{B(a,b)}, \quad \phi\in(0,1) ,
\end{aligned}
$$
where $B(a,b)$ is the Beta function, see ?beta
in R.
Conditionally on $\bm{N}={N_1,\dots,N_J}$, and the observations $\bm{Y}$, the conditional posterior distribution for $(\phi|\bm{N},\bm{Y})$ is $\pBeta(\wt{a},\wt{b})$ where $\wt{a}=a+\sum_{j,i} Y_{j,i}$ and $\wt{b}=b+2\sum_j N_j -\sum_{j,i} Y_{j,i}$, and the sums go over $j=1,\dots,J$ and $i=1,2$.
Show that the joint probability function for $(\bm{N},\bm{Y})$, with
$N_j\geq\max(Y_{j,1},Y_{j,2})$ and $Y_{j,i}=0,1,2,\dots$,
is given by^[Binomial coefficients can be notated with {N \choose k}
in LaTeX formulas.]
$$
p(\bm{N},\bm{Y}) = \frac{B(\wt{a},\wt{b})}{B(a,b)} \prod_{j=1}^J \left[
p(N_j)
\prod_{i=1}^2
{N_j \choose Y_{j,i}}
\right],
$$
where $p(N_j)$ is the probability mass function for the Geometric distribution prior
for $N_j$.
With the help of the functions lbeta
, lchoose
, and dgeom
, define (in my_code.R
)
a function log_prob_NY(N,Y,xi,a,b)
with arguments N
(a vector of length $J$) and $Y$
(a data frame of size $J\times 2$, formatted like the output from arch_data(J)
).
The arguments xi
, a
, b
are the model hyperparameters $\xi$, $a$, and $b$.
The function should return the logarithm of $p(\bm{N},\bm{Y})$ when
the combination of $\bm{N}$ and $\bm{Y}$ is valid, and should otherwise return -Inf
.
The aim is to construct a Bayesian credible interval for each $N_j$ using importance sampling, similarly to the method used in lab 4. There, a Gaussian approximation of the posterior of a parameter was used. Here, we will instead use the prior distribution for $\bm{N}$ as sampling distribution, with samples $N_j^{[k]}\sim\pGeom(\xi)$.
Since the observations $\bm{Y}$ are fixed, the posterior probability function $p(\bm{N}|\bm{Y})$ for $\bm{N}$ is proportional to $p(\bm{N},\bm{Y})$.
We define the logarithm of unnormalised importance weights, $\log[w^{[k]}]=\log[p(\bm{N}^{[k]},\bm{Y})]-\log[p(\bm{N}^{[k]})]$, where $p(\bm{N}^{[k]})$ is the product of $J$ geometric distribution probabilities.
Define (in my_code.R
) a function arch_importance(K,Y,xi,a,b)
with the same arguments Y
,
xi
, a
, and b
as before, but with the first argument K
defining the number of
samples to generate. The function should return a data frame with $K$ rows and
$J+1$ columns,
named N1
, N2
, etc, for each $j=1,\dots,J$, containing samples from the prior distributions for $N_j$,
and a final column called Log_Weights
,
containing re-normalised log-importance-weights, constructed by shifting $\log[w^{[k]}]$ by
subtracting the largest $\log[w^{[k]}]$ value, as in lab 4.
Using the $\pGeom(\xi)$ prior distribution for each $N_j$ leads to a rather inefficient importance sampler, especially for $J=4$; Most of the weights will become zero (when some $N_j^{[k]}<Y_{j,i}$) or near zeros, and only a few samples will influence the practical estimates and credible intervals.
As an alternative, we can use sampling distributions that are better adapted to the posterior distributions. Let $N^\text{min}j=\max(Y{j,1}, Y_{j,2})$ be the smallest allowed value to sample, for each $j=1,\dots,J$. This would eliminate all the impossible combinations, so that all weights would be (in theory if not numerically) positive.
To further improve the method, we can choose different values of the probability parameter for the samples, and also adapt it differently for each $j$. Without motivation, you can use values $\xi_j=1/(1 + 4 N^\text{min}_j)$ when sampling, so that $N_j^{[k]}=N^\text{min}_j + \pGeom(\xi_j)$.
Note: Due to the nature of the geometric distribution, this is also equivalent to an ordinary $\pGeom(\xi_j)$ conditioned on being at least $N^\text{min}_j$.
To take this different sampling method into account, the log-weights need to be defined as $\log[w^{[k]}]=\log[p(\bm{N}^{[k]},\bm{Y})]-\log[\wt{p}(\bm{N}^{[k]})]$, where $\wt{p}(\bm{N}^{[k]})=\prod_{j=1}^J p_{\pGeom(\xi_j)}(N_j^{[k]}-N^\text{min}_j)$, the product of $J$ different geometric distribution probabilities.
You may adapt this example code (the details will depend on which vectorisation method you use):
# matrix of minimum allowed N for each j, repeated K times N_min <- matrix(rep(pmax(Y[, 1], Y[, 2]), each = K), K, J) xi_sample <- 1 / (1 + 4 * N_min) # Sample values and call log_prob_NY for each sample, # store N-values in a K-by-J matrix, and log(p_NY)-values in a vector log_PY N <- matrix(rgeom(..., prob = xi_sample), K, J) + N_min log_PY <- ... # Subtract the sampling log-probabilities log_PY - rowSums(dgeom(N - N_min, prob = xi_sample, log = TRUE))
Using this method, $K$ doesn't need to be larger than $100000$,
as small as $K=1000$ starts to give reasonable values for code testing purposes,
and $K=10000$ should only take at most a few seconds to run if the code is fully
vectorised (the vapply
function can be used, but a basic for
-loop is also
fine in this case).
As in the lab, let $\xi=1/1001$, and $a=b=1/2$.
Use arch_importance
and the wquantile
function with type=1
(see the
help text for information) to compute credible intervals for $N_j$.
Start by including only the data from a single excavation, and observe how the
interval for $N_1$ changes when data from the other excavations are added.
Present and discuss the results.
Update^[Note: your final submission should only contain the updated function.] the arch_importance
function to add a column Phi
with samples from the
conditional distribution $\pBeta(\wt{a},\wt{b})$ for each row of the importance sample.
Use the updated function to construct credible intervals for $\phi$, one for $J=1$, and one for $J=4$. Present and discuss the results.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.