# Distributions and Likelihoods {#dists}

This section describes distributions and likelihoods that I have used or found useful to have as a reference. These are less frequently encountered than the Normal, Multinomial, and other common distributions.

## Gamma

The density for the Gamma is parameterised either by the shape ($k$) and scale ($\theta$) or by the shape ($\alpha$) and rate ($\beta$). Many of the models I apply that use the Gamma (and general model frameworks such as GLMs) have coefficients that define the moments, i.e., mean $E[Y] = \mu = k\theta$ and variance $Var[Y] = \mu^2/\phi$, where $\phi = k$ is an estimable parameter (its reciprocal $1/\phi = 1/k$ is the GLM dispersion coefficient). These moments can be reparameterised back to the original Gamma parameters. The likelihood for the Gamma follows

\begin{equation} l = \prod_i \frac{1}{\Gamma(k) \theta^{k}} x_i^{k - 1} e^{-x_i/\theta} \end{equation}

and the log-likelihood

\begin{equation} ll = \sum_i \left[ -\log(\Gamma(k)) - k\log(\theta) + (k - 1)\log(x_i) - x_i/\theta \right] \end{equation}

To calculate $\theta$ and $k$ from $\mu$ and $Var[Y]$, first rearrange the mean,

\[ k = \frac{\mu}{\theta} \]

and then substitute $k$ into the variance definition to solve for $\theta$,

\[ Var[Y] = \mu^2/\phi = k\theta^2 = \mu\theta \]

Given $\phi$ is an estimable parameter, we rearrange the above to make $\theta$ a function of $\mu$ and $\phi$,

\[ \theta = \frac{\mu}{\phi} \]

and consequently $k = \mu/\theta = \phi$.
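As a quick numerical check of this reparameterisation (a minimal sketch; the values `mu_true` and `phi_true` below are made up), the moments and the written log-likelihood can be compared against `rgamma()` and `dgamma()`:

```r
mu_true  <- 5                  # hypothetical mean
phi_true <- 2                  # hypothetical phi (= shape)
shape <- phi_true              # k = phi
scale <- mu_true / phi_true    # theta = mu / phi
x <- rgamma(1e5, shape = shape, scale = scale)
mean(x)                        # ~ mu_true
var(x)                         # ~ mu_true^2 / phi_true
## the written log-likelihood should match dgamma()
ll_manual <- sum(-lgamma(shape) - shape * log(scale) + (shape - 1) * log(x) - x / scale)
ll_dgamma <- sum(dgamma(x, shape = shape, scale = scale, log = TRUE))
all.equal(ll_manual, ll_dgamma)
```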

# credit for this code https://stat.ethz.ch/pipermail/r-help/2011-July/283736.html
set.seed(1001)
shape = 2
scale = 3
z <- rgamma(10000, shape = shape ,scale = scale)
var(z)
# should be close to 
shape * scale^2
# calculate CV
1/sqrt(shape)
sd(z) / mean(z) # compare to cv
#> [1] 1.002399
## inverse link GLM
g1 <- glm(z ~ 1, family = Gamma(link = "inverse"))
#Here the intercept estimate is 0.167 , which is very
#nearly 1/6 = 1/(shape*scale) -- i.e. the Gamma GLM is
#parameterized in terms of the mean (on the inverse scale).  
# If you want to recover the scale parameter for the intercept case, then
summary(g1)$dispersion / coef(g1)[1]

## to get the scale parameter from the log-link gamma model
g2 <- glm(z ~ 1, family = Gamma(link = "log"))
## scale param
summary(g2)$dispersion * exp(coef(g2)[1])
## shape param
1 / summary(g2)$dispersion
# the shape parameter (the reciprocal of the dispersion parameter)

## Negative Binomial

The negative binomial distribution is a discrete probability distribution that models the number of failures in a sequence of iid Bernoulli trials before a specified (non-random) number of successes (denoted $r$) occurs, with the probability of success in each trial denoted by $p$.

\begin{equation} y \sim NB(\mu, \phi) \end{equation}

where $\mathbb{E}\left[y\right] = \mu$, $Var\left[y\right] = \mu \left(1 +\frac{\mu}{\phi}\right)$, and $\phi$ is constrained to be $\phi > 0$. These parameters can be linked back to the traditional parameterisation \begin{equation} y \sim NB(r, p) \end{equation}

where $r$ is the target number of successes at which the trials stop, and $p$ is the success probability in each trial. With $\sigma^2 = Var\left[y\right]$,

\[ r = \frac{\mu^2}{\sigma^2 - \mu} = \phi \]

\[ p = 1 - \frac{\sigma^2 - \mu}{\sigma^2} = \frac{\phi}{\phi + \mu} \]

set.seed(123)
## generate nbinomial draw
n_sim = 100
phi = 100 # also = r 
mu = rlnorm(n_sim, log(5), 0.4)
var = mu + (mu^2/phi)
p_success = phi/(phi+mu)  ## probability of success in each trial.
p_fail = (var - mu) / var ## probability of failure in each trial.
head(cbind(p_success, p_fail, 1 - p_success))
## 
r = mu^2 / (var - mu) ## = phi
n = mu * (p_success / p_fail)
summary(r)
summary(p_fail)
## simulates the number of failures which occur in a sequence of Bernoulli 
## trials before a target number of successes is reached
y_sim = rnbinom(n_sim, size = phi, mu = mu)

## Calculate the densities the two ways
sum(dnbinom(y_sim, size = phi, mu = mu, log = T))
sum(dnbinom(y_sim, size = n, prob = p_success, log = T))
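As a sketch that is not part of the original example, the negative binomial log-likelihood can also be written out directly with `lgamma()`, using the pmf with size $r = \phi$ and success probability $p$, and compared with the `dnbinom()` calls above:

```r
## manual log-pmf: lgamma(y + r) - lgamma(r) - lgamma(y + 1) + r*log(p) + y*log(1 - p)
sum(lgamma(y_sim + phi) - lgamma(phi) - lgamma(y_sim + 1) +
      phi * log(p_success) + y_sim * log(1 - p_success))
## should match the two dnbinom() sums above
```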

## Dirichlet-Multinomial

If $(p_1,\dots,p_k) \sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_k)$ and $(x_1,\dots,x_k) \sim \mathrm{Multinomial}(n, p_1,\dots,p_k)$, then $(x_1,\dots,x_k) \sim \mathrm{DirichletMultinomial}(n, \alpha_1,\dots,\alpha_k)$.

The probability density function is \begin{equation} f(x) = \frac{\left(n!\right)\Gamma\left(\sum \alpha_k\right)}{\Gamma\left(n+\sum \alpha_k\right)}\prod_{k=1}^K\frac{\Gamma(x_{k}+\alpha_{k})} {\left(x_{k}!\right)\Gamma(\alpha_{k})} \end{equation}

with dispersion parameter $\beta = \sum_k \alpha_k$; in the code below the Dirichlet parameters are constructed as $\alpha_k = \beta \pi_k$, where $\pi_k$ are the expected (fitted) proportions.

When simulating from this distribution you first simulate the Dirichlet variable, using independent gamma draws (normalised to sum to one) with shape parameters set to the Dirichlet parameters $\alpha_k$. Then draw from the multinomial with cell probabilities given by the Dirichlet draw.
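A minimal sketch of that two-stage simulation (the sample size, proportions, and dispersion below are made-up values for illustration):

```r
set.seed(42)
n_draw <- 500                    # multinomial sample size
pi_hat <- c(0.2, 0.3, 0.4, 0.1)  # fitted/expected proportions
beta   <- 10                     # Dirichlet dispersion
alpha  <- beta * pi_hat          # Dirichlet concentration parameters
## step 1: Dirichlet draw via normalised independent gamma draws
g <- rgamma(length(alpha), shape = alpha, rate = 1)
p <- g / sum(g)
## step 2: multinomial draw conditional on the Dirichlet proportions
x_sim <- rmultinom(1, size = n_draw, prob = p)
x_sim
```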

The Dirichlet-Multinomial can also be written using the linear re-parameterised approach from @thorson2017model. For observed proportions $\tilde{\pi}_b$ in composition bin $b$ (age or length), with sample size $n$, expected proportions for the same bin denoted by $\pi_b$, and estimable overdispersion parameter $\theta$, the negative log-likelihood is:

\begin{align} -\log \left(L \right) &= -\log \Gamma \left(n + 1 \right) + \sum\limits_b \log \Gamma \left(n\tilde{\pi}_b + 1\right) - \log \Gamma \left(\theta n\right) + \log \Gamma \left(n + \theta n\right) \\ & \quad - \sum\limits_b \left[ \log \Gamma \left(n\tilde{\pi}_b + \theta n \pi_b\right) - \log \Gamma \left(\theta n \pi_b\right) \right] \end{align}

with $\beta = n \theta$, which gives an effective sample size $n_{eff}$

\[ n_{eff} = \frac{1 + \theta n}{1 + \theta} = \frac{1}{1 + \theta} + n\frac{\theta}{1 + \theta} \]

i.e., the effective sample size is a linear function of the input sample size with intercept $(1 + \theta)^{-1}$ and slope $\frac{\theta}{1 + \theta}$. Interpreting $\theta$: if $\theta$ is large then $n_{eff} \rightarrow n$, and if $\theta \ll 1$ (with $n > 1$) then $\theta$ can be interpreted as the approximate ratio of effective sample size to input sample size.

#' Return the PDF for the Dirichlet-multinomial, following the formulation in 
#' the Thorson paper. It deviates from the classic formulation by pulling (x!) 
#' out of the denominator of the right-hand product
#' and moving it to the left denominator
#' @param obs vector of observed compositions, assumes sum(obs) = 1
#' @param fitted vector of fitted compositions, assumes sum(fitted) = 1 
#' @param beta variance inflation coefficient
#' @param n the total number of samples in the available data (can be any non-negative real number)
#' @return PDF
#'
ddirichmult = function(obs, beta, n, fitted, log = F) {
  if(abs(sum(obs) - 1) > 1e-8)
    stop("obs needs to sum to 1")
  if(abs(sum(fitted) - 1) > 1e-8)
    stop("fitted needs to sum to 1")
  val = (lgamma(n + 1) + lgamma(beta)) - (lgamma(n + beta) + 
               sum(lgamma(n*obs + 1))) + sum(lgamma(n*obs + beta * fitted) 
                                             - lgamma(beta* fitted))
  if(log == F)
    val = exp(val)
  return(val)
}

#' returns the log-likelihood for the dirichlet multinomial
#' this mirrors the TMB code, parameterised with theta (negate it for the negative log-likelihood).
ddirichmult_alt = function(obs, theta, fitted) {
  N_eff = sum(obs)
  obs_prop = obs / N_eff
  sum1 = sum2 = 0
  for(i in 1:length(obs)) {
    sum1 = sum1 + lgamma(N_eff * obs_prop[i] + 1);
    sum2 = sum2 + lgamma(N_eff * obs_prop[i] + theta * N_eff * 
                           fitted[i]) - lgamma(theta * N_eff * fitted[i])
  }
  ll = (lgamma(N_eff + 1) - sum1 + lgamma(theta * N_eff) - 
          lgamma(N_eff + theta * N_eff) + sum2)
  return(ll)
}

## run an example
set.seed(123)
n = 1000 ## sample size
pi = rnorm(10, 50, 10)
pi_fitted = rnorm(10,  50, 10)
pi = pi / sum(pi)
pi_fitted = pi_fitted / sum(pi_fitted)
size = n 
x = pi * n
beta = 7
theta = beta / n
## evaluate different functions
ddirichmult_alt(obs = x, theta = beta / sum(x), fitted = pi_fitted)
ddirichmult(obs = x / sum(x), n = sum(x), fitted = pi_fitted,  beta = beta, log = T)
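Using `theta` and `n` from above, we can also look at the effective sample size implied by the linear parameterisation (a small addition, not in the original example):

```r
## effective sample size, n_eff = (1 + theta * n) / (1 + theta)
(1 + theta * n) / (1 + theta)
## equivalently, intercept + slope * n
1 / (1 + theta) + n * theta / (1 + theta)
```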

## Logistic Normal

A key data type in age- and length-structured stock assessment models is composition data. The default error structure for normalised (sum-to-one) composition data sets is the multinomial [@francis2014replacing]. However, there are often complex correlations in the residuals when applying the multinomial distribution [@berg2016accounting; @francis2014replacing]. This makes the additive logistic normal and the centred log-ratio transform attractive, because the transformed compositions are treated as multivariate normal random variables, which allows complex correlation structures to be explored (see the next section for details on correlation structures).

Normalised composition data can be viewed as points on a simplex. Consider a specific simplex, the so-called $(D - 1)$ standard simplex, a subset of $\mathbb{R}^D$, which is defined by

\begin{equation}\label{eq:simplex} \tilde{S}^D = \left\{ \boldsymbol{x} = (x_1, x_2, \dots, x_D) \in \mathbb{R}^D \ \middle| \ x_i \geq 0, \ \sum\limits_{i = 1}^D x_i = 1 \right\} \end{equation}

The additive log ratio (ALR) transformation maps the simplex $\tilde{S}^D$ to $\mathbb{R}^{D-1}$; the result for an observation $\boldsymbol{x} \in \tilde{S}^D$ is the coordinate vector $\boldsymbol{y} \in \mathbb{R}^{D - 1}$ with

\begin{equation}\label{eq:alr_transform} \boldsymbol{y} = (y_1,y_2,\dots,y_{D - 1})' = alr(\boldsymbol{x}) = \left(\ln\frac{x_1}{x_j}, \dots, \ln\frac{x_{j-1}}{x_j}, \ln\frac{x_{j + 1}}{x_j}, \dots, \ln\frac{x_D}{x_j}\right)' \end{equation}

The index $j \in \{1,\dots, D\}$ refers to the variable chosen as the ratioing variable in the coordinates. This choice usually depends on the context, but also on the suitability of the results for data exploration. The main disadvantage of the ALR is the subjective choice of the ratioing variable.

The centred log ratio (CLR) transformation maps the simplex $\tilde{S}^D$ to $\mathbb{R}^{D}$; the result for an observation $\boldsymbol{x} \in \tilde{S}^D$ is the coordinate vector $\boldsymbol{y} \in \mathbb{R}^{D}$ with

\begin{equation}\label{eq:clr_transform} \boldsymbol{y} = clr(\boldsymbol{x}) = (y_1, \dots,y_D)'= \left(ln\frac{x_1}{\sqrt[D]{\prod\limits_{k = 1}^Dx_k}} , \dots, ln\frac{x_D}{\sqrt[D]{\prod\limits_{k = 1}^Dx_k}}\right) \end{equation}
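A minimal sketch of the two transforms in R (the helper functions `alr()` and `clr()` are my own, not from the package):

```r
alr <- function(x, j = length(x)) log(x[-j] / x[j])   # maps to R^(D-1)
clr <- function(x) log(x / exp(mean(log(x))))         # divide by the geometric mean
x <- c(0.1, 0.2, 0.3, 0.4)  # an example composition on the simplex
alr(x)                      # D - 1 coordinates
clr(x)                      # D coordinates
sum(clr(x))                 # CLR coordinates sum to ~0
```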

@francis2014replacing (Appendix) suggests defining a multivariate normal variable $\boldsymbol{X} \sim \mathcal{MVN} \left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right)$, following @schnute2007compositional, where $\boldsymbol{\mu} = \log(\boldsymbol{E})$ for expected compositions $\boldsymbol{E}$ and $\boldsymbol{\Sigma}$ is the covariance, with the relationship to the observed composition $\boldsymbol{O} = (O_1,\dots, O_D)'$ being

\begin{equation}\label{eq:schnute} O_i = \frac{e^{X_i}}{\sum_j e^{X_j}} \end{equation}

Francis suggests using the two transformations. The ALR (Equation~\ref{eq:alr_transform}) defines a new variable $\boldsymbol{Y} = alr(\boldsymbol{X})$, with $\boldsymbol{Y} \sim \mathcal{MVN}\left(alr(\boldsymbol{E}), \boldsymbol{V}\right)$, where

\[ \boldsymbol{V} = \boldsymbol{K}\boldsymbol{\Sigma}\boldsymbol{K}' \]

and $\boldsymbol{K}$ has dimension $(D - 1) \times D$: the identity matrix $\boldsymbol{I}_{D-1}$ with an additional column of $-1$s appended on the right.

The other transformation is the centred log ratio (Equation~\ref{eq:clr_transform}), with variable $\boldsymbol{Z} \sim \mathcal{MVN}\left(clr(\boldsymbol{E}), \boldsymbol{\Gamma}\right)$, where

\[ \boldsymbol{\Gamma} = \boldsymbol{F}'\boldsymbol{H}^{-1}\boldsymbol{V}\boldsymbol{H}^{-1} \boldsymbol{F} \]

and $\boldsymbol{F}$ has the same structure as $\boldsymbol{K}$ but with a column of $+1$s instead of $-1$s, and $\boldsymbol{H} = \boldsymbol{I}_{D-1} + \boldsymbol{1}$ (the identity plus a matrix of ones).
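A sketch of constructing $\boldsymbol{K}$, $\boldsymbol{F}$, $\boldsymbol{H}$ and the transformed covariances (the $\boldsymbol{\Sigma}$ used here is an arbitrary made-up covariance):

```r
D <- 4
Sigma <- diag(D) * 0.2 + 0.05               # arbitrary D x D covariance
K <- cbind(diag(D - 1), rep(-1, D - 1))     # identity with a column of -1s appended
V <- K %*% Sigma %*% t(K)                   # ALR-scale covariance
F_mat <- cbind(diag(D - 1), rep(1, D - 1))  # same structure as K but with +1s
H <- diag(D - 1) + 1                        # identity plus a matrix of ones
Gamma <- t(F_mat) %*% solve(H) %*% V %*% solve(H) %*% F_mat  # CLR-scale covariance
```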

The benefit of using the additive logistic normal is that the covariance can be estimated with respect to the composition; the drawback is that the transformed random variable $\boldsymbol{Y}$ has $D - 1$ bins, so relating residuals back to the original bins can be confusing to interpret. By comparison, the CLR transformation gives $\boldsymbol{Z}$ with $D$ bins for the transformed random variable, but the covariance $\boldsymbol{\Gamma}$ is singular when constructed using the formulas above. The other option is to construct $\boldsymbol{\Gamma}$ directly from estimated parameters, but then it is hard to interpret the variance and correlation parameters in terms of the original composition data set.

## Correlation structures

### Estimating correlation structures {-}

We are interested in estimating correlation and covariance structures for a variable $X(t)$ that is a function of time $t$. The aim is to estimate the covariance over multiple time lags, $cov(X(t),X(s))$, denoted $\boldsymbol{\Sigma}_X$. We look at several correlation structures: independence, AR(1), compound symmetry, unstructured, and Toeplitz, and how their parameters are parameterised and estimated. Standard deviations are estimated separately from the correlation structures. A good resource for TMB, specifically for glmmTMB, is here. Another good resource regarding different correlation structures can be found here.

### Independence {-}

The simplest structure, with only a single estimable parameter,

\begin{equation} cov(X(t),X(s)) = \sigma^2 \quad \text{for } t = s, \ \text{and } 0 \text{ otherwise} \end{equation}

### AR(1) {-}

This is a first-order autoregressive structure with homogeneous variances. \begin{equation} cov(X(t),X(s)) = \sigma^2 \rho^{\lvert t - s\rvert} \end{equation}

so the correlation structure is

\begin{equation} \boldsymbol{\Sigma}_{\rho} = \begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 \\ & 1 & \rho & \rho^2 \\ & & 1 & \rho \\ & & & 1 \end{pmatrix} \end{equation}

(symmetric, lower triangle omitted), where $\rho$ is bounded between -1 and 1. To enforce this, an unbounded transformed parameter $\phi \in \mathbb{R}$ is estimated, with the relationship to $\rho$ being

\begin{equation}\label{eq:rho_transform} \rho = \phi / \sqrt{1 + \phi^2} \end{equation}

\begin{figure}[H] \centering \includegraphics[scale=0.6]{../Figures/rho_transform.jpeg} \caption{Example of the relationship of transformed and untransformed parameter from Equation~\ref{eq:rho_transform}} \label{fig:rho_transform} \end{figure}
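A small sketch of this transform and the implied AR(1) correlation matrix (`phi` and the number of time steps are arbitrary):

```r
phi <- 0.8
rho <- phi / sqrt(1 + phi^2)                  # maps R to (-1, 1)
n_t <- 4
Sigma_rho <- rho^abs(outer(1:n_t, 1:n_t, "-"))
Sigma_rho
```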

### Compound symmetry {-}

This covariance structure has heterogeneous variances and constant correlation between elements. The correlation parameter is defined by an unbounded estimated parameter $\phi$ and mapped back to the valid range $\left(-\frac{1}{n-1}, 1\right)$ using the inverse logistic function (Equation~\ref{eq:invlogit}); with $n$ time points, $\rho$ is generated as

\begin{align} a &= \frac{1}{n - 1}\\ \rho &= logit^{-1}\left(\phi\right) \left(1 + a\right) - a \end{align}

\begin{equation}\label{eq:invlogit} logit^{-1}\left(x\right) = \frac{1}{1 + e^{-x}} \end{equation}

so the correlation structure looks like \begin{equation} \boldsymbol{\Sigma}_{\rho} = \begin{pmatrix} 1 & \rho & \rho & \rho \\ & 1 & \rho & \rho \\ & & 1 & \rho \\ & & & 1 \end{pmatrix} \end{equation}

and with the heterogeneous variances the covariance is \begin{equation} \boldsymbol{\Sigma}_{X} = \begin{pmatrix} \sigma^2_1 & \sigma_1\sigma_2\rho & \sigma_1\sigma_3\rho & \sigma_1\sigma_4\rho \\ & \sigma^2_2 & \sigma_2\sigma_3\rho & \sigma_2\sigma_4\rho \\ & & \sigma^2_3 & \sigma_3\sigma_4\rho \\ & & & \sigma^2_4 \end{pmatrix} \end{equation}
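A small sketch of the compound symmetry mapping and correlation matrix (`phi` and the number of time steps are arbitrary):

```r
invlogit <- function(x) 1 / (1 + exp(-x))
n_t <- 4
a   <- 1 / (n_t - 1)
phi <- 0.5
rho <- invlogit(phi) * (1 + a) - a   # maps R to (-1/(n_t - 1), 1)
Sigma_rho <- matrix(rho, n_t, n_t)
diag(Sigma_rho) <- 1
Sigma_rho
```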

### Unstructured correlations {-}

This structure has $n(n - 1)/2$ parameters, which fill the lower triangle of a Cholesky-type factor $L$ of the correlation matrix,

\begin{equation}\label{eq:us_corre} \boldsymbol{\Sigma}_{\rho} = D^{-0.5}LL'D^{-0.5} \end{equation}

where, \begin{equation} L = \begin{pmatrix} 1 & & & \\ \theta_{1,1} & 1 & & \\ \theta_{2,1} & \theta_{2,2} & 1 & \\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3} & 1 \end{pmatrix} \end{equation}

with \begin{equation}\label{eq:D_dag} D = diag(LL') \end{equation}

In TMB, $\boldsymbol{\theta}$ is parameterised as a vector of unbounded parameters. For the case of a single parameter $\theta_i$ in $L$ (a $2 \times 2$ matrix), expanding Equation~\ref{eq:us_corre} gives \begin{equation}\label{eq:us_rho} \rho = \frac{\theta_i}{\sqrt{1 + \theta_i^2}} \end{equation}
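A sketch of building the unstructured correlation matrix from unbounded parameters (the `theta` values are arbitrary):

```r
n_t   <- 4
theta <- c(0.3, -0.2, 0.5, 0.1, -0.4, 0.2)  # n_t * (n_t - 1) / 2 parameters
L <- diag(n_t)
L[lower.tri(L)] <- theta
D <- diag(diag(L %*% t(L)))
Sigma_rho <- solve(sqrt(D)) %*% L %*% t(L) %*% solve(sqrt(D))
Sigma_rho   # unit diagonal, valid correlations off the diagonal
```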

### Toeplitz structure {-}

The next natural step is to reduce the number of parameters by sharing a correlation parameter within each off-diagonal, i.e. $cov(X(t),X(s)) = \sigma_t \sigma_s \rho_{\lvert t - s \rvert}$. This amounts to $n - 1$ correlation parameters and $n$ variance parameters. (In the simulated example this note refers to, the diagonal elements are all approximately equal to the true total variance, $2\sigma^2_{AR}+2\sigma_0^2=2$, and the off-diagonal elements are approximately equal to the expected value, $0.7/2=0.35$.)
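A small sketch of a Toeplitz correlation matrix with one parameter per off-diagonal (the values are arbitrary):

```r
n_t  <- 4
rhos <- c(0.5, 0.3, 0.1)          # n_t - 1 correlation parameters
Sigma_rho <- toeplitz(c(1, rhos))
Sigma_rho
```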

## Censored likelihoods

This follows from work by @cadigan2015state, who used a censored likelihood to account for under- and over-reported catch. This is in the context of fitting to catch observations when estimating fishing mortality rates ($F$) from the Baranov catch equation.

#'
#' Follows Cadigan (2016) "A state-space stock assessment model for northern cod
#' including under-reported catches and variable natural mortality rates", Equation~7. 
#' This function shows the log-likelihood surface between bounds for a range of
#' parameters and catches

posfun = function(x, eps) {
  denom = 2;
  denom = denom - x/eps;
  ans = ifelse(x > eps, x, eps/denom);
  penalty = ifelse(x > eps, 0.0, 0.01 * (x - eps) * (x - eps));
  return(list(new_x = ans, pen = penalty))
}

#' catches<vector> catches to look at the profile over
#' upper<scalar> upper bound for censored likelihood
#' lower<scalar> lower bound for censored likelihood
#' sigma<scalar> standard deviation for censored likelihood
#' delta<scalar> the minimum value allowed for the probability difference (passed to posfun as eps)

catch_ll = function(catches, upper, lower, sigma, delta) {
  ll = vector()
  for(i in 1:length(catches)) {
    diff = pnorm(log(upper / catches[i]) / sigma) - 
      pnorm(log(lower / catches[i]) / sigma)
    ad_buffer = posfun(diff, delta)
    ll[i] = log(ad_buffer$new_x) + ad_buffer$pen
  }
  return(ll)
}

## run through
L_y = 3000
U_y = 4000
sigma = 0.01
delta = 0.025
catches = seq(2000,5000, by = 1)
ll = catch_ll(catches, upper = U_y, lower = L_y, sigma, delta)

plot(catches, ll, type = "l", lwd = 3, lty = 1, xlab = "Catches", ylab = "Log-likelihood")


#' Issues with the above "The log-likelihood can be very small even for values of Cy that are slightly
#' outside of the bounds, and this can lead to problems when optimising the likelihood."
#' Add a mixture
#'
#'
catch_ll_mixture = function(catches, upper, lower, sigma, prob, sigma_astrix, delta) {
  ll = vector()
  for(i in 1:length(catches)) {
    diff = (prob * pnorm(log(upper / catches[i]) / sigma) + (1 - prob) * pnorm(log(upper / catches[i]) / sigma_astrix)) - 
      (prob * pnorm(log(lower / catches[i]) / sigma) + (1 - prob) * pnorm(log(lower / catches[i]) / sigma_astrix))
    ad_buffer = posfun(diff, delta)
    ll[i] = log(ad_buffer$new_x) + ad_buffer$pen
  }
  return(ll)
}

sigma_astrix = 0.075
prob = 0.9
ll_mix = catch_ll_mixture(catches, upper = U_y, lower = L_y, sigma, prob, sigma_astrix, delta)


par(mfrow= c(2,1), oma = c(3,3,1,1), mar = c(2,1,1,1), cex.lab = 2, cex.axis = 1.5)
plot(catches, ll, type = "l", lwd = 5, lty = 1, xlab = "", ylab = "", ylim = c(-5, 0.5), xlim = c(2250, 4800))
lines(catches, ll_mix, lty = 2, lwd = 5, col = "red")
legend('topright', cex = 1.2, legend = c(expression(paste(sigma[c], " = ", 0.01)), expression(paste(sigma[c], "* = ", 0.075)),
                              expression(paste(L[c], " = ", 3000)), expression(paste(U[c], " = ", 4000)),
                              expression(paste(p, " = ", 0.9)), expression(paste(epsilon, " = ", 0.025))))
#legend("topright", legend = c("Original", "Mixture"), col = c("black","red"), lwd = 5, lty = c(1,2))
## is the top of the likelihood flat?
catch_ndx = catches > (L_y + 200) & catches < (U_y - 300)
polygon(x = c(min(catches[catch_ndx]), min(catches[catch_ndx]), max(catches[catch_ndx]),max(catches[catch_ndx])), y = c(-0.4, 0.4, 0.4, -0.4), lwd = 3, lty = 2)
plot(catches[catch_ndx], ll[catch_ndx], type = "l", lwd = 5, ylim = c(-0.04, 0.01), ylab = "", xlab = "", xlim = c(min(catches[catch_ndx]), max(catches[catch_ndx])))
lines(catches[catch_ndx], ll_mix[catch_ndx], lwd = 5, col = "red", lty = 2)
legend("bottomright", legend = c("Original", "Mixture"), col = c("black","red"), lwd = 5, lty = c(1,2))
mtext(side = 2, outer = T, adj = 0.5, las = 3, text = "Log-likelihood", line = 1.5, cex = 2)
mtext(side = 1, outer = T, adj = 0.5, las = 1, text = "Catches", line = 1.5, cex = 2)

