Likelihood Theory for Linear Regression

Standard theory (unweighted) with correct model specification

Consider the following linear regression model: \[ \boldsymbol{Y} \sim \mathcal N(\boldsymbol{X} \boldsymbol{\beta}_0, \sigma_0^2 \boldsymbol{I}) \quad \text{ or for an individual observation } \quad y \sim \mathcal N(\boldsymbol{x}^T \boldsymbol{\beta}_0, \sigma_0^2) \]

where $\boldsymbol{x} = \left( 1, x_1, x_2, \ldots, x_m\right)^T \in \mathbb{R}^{m + 1}$ is a column vector of covariates, $\boldsymbol{X} = \left( \boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n\right)^T$ is an $n \times (m + 1)$ matrix of covariates, $\boldsymbol{Y} = \left( y_1, y_2, \ldots, y_n\right)^T \in \mathbb{R}^n$ is the vector of observed outcomes, $\boldsymbol{\beta}_0 = \left( \beta_{0, 0}, \beta_{1, 0}, \beta_{2, 0}, \ldots, \beta_{m, 0}\right) \in \mathbb{R}^{m + 1}$ is a vector of regression coefficients, and $\sigma_0^2 \in \mathbb{R}_+ \setminus \{0\}$ is the variance of the error term, with $\boldsymbol{I}$ being the $n \times n$ identity matrix.

The goal is to estimate the true set of parameters $\boldsymbol{\theta}_0 = \left(\boldsymbol{\beta}_0, \sigma_0^2 \right)$ to gain insight into the population.

Since the true parameters are unknown, the parametric family of models is characterised in a parametric likelihood framework and the parameter vector is estimated by maximising the likelihood (Maximum Likelihood Estimation, or MLE).

Writing $\ell_i(\boldsymbol{\theta})$ for the log-likelihood contribution of observation $i$ under a candidate parameter $\boldsymbol{\theta} = (\boldsymbol{\beta}, \sigma^2)$, the overall log-likelihood function is given by,

\[ \begin{aligned} \mathcal L(\boldsymbol{\theta}) &= \sum_{i = 1}^n \ell_i(\boldsymbol{\theta}) = - \frac{n}{2} \log(2 \pi) - \frac{n}{2} \log(\sigma^2) - \frac{(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta})}{2\sigma^2} \\ &=: l^{[1]} + l^{[2]} + l^{[3]} \end{aligned} \]
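As a quick numerical check of this closed form, the sketch below compares it with a direct sum of normal log-densities. The data, parameter values and object names are invented purely for illustration.

```r
# Minimal sketch: closed-form Gaussian log-likelihood versus a direct sum of
# normal log-densities (simulated data; names and values are illustrative).
set.seed(1)
n <- 100
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))  # design matrix including the intercept
beta   <- c(2, 0.5, -1)                      # a candidate beta
sigma2 <- 1.5                                # a candidate sigma^2
Y <- as.vector(X %*% beta + rnorm(n, sd = sqrt(sigma2)))

loglik <- function(beta, sigma2, Y, X) {
  n   <- length(Y)
  eps <- Y - as.vector(X %*% beta)           # candidate unexplained variation
  -n / 2 * log(2 * pi) - n / 2 * log(sigma2) - sum(eps^2) / (2 * sigma2)
}

c(closed_form = loglik(beta, sigma2, Y, X),
  direct      = sum(dnorm(Y, mean = as.vector(X %*% beta), sd = sqrt(sigma2), log = TRUE)))
```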

For convenience, define the unexplained variation as $\boldsymbol{\varepsilon}_0 = \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}_0 \sim \mathcal N(\boldsymbol{0}, \sigma_0^2 \boldsymbol{I})$. The candidate unexplained variation takes the form $\boldsymbol{\varepsilon} = \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta} \sim \mathcal N(\boldsymbol{X}(\boldsymbol{\beta}_0 - \boldsymbol{\beta}), \sigma_0^2 \boldsymbol{I})$.

Maximising the log-likelihood function with respect to $\boldsymbol{\beta}$ and $\sigma^2$ yields the maximum likelihood estimates of the regression coefficients and the variance of the error term, respectively. These are obtained by setting the score vector $\mathcal{U}(\boldsymbol{\theta}) = \frac{\partial \mathcal L}{\partial \boldsymbol{\theta}}$ to zero; the corresponding Hessian matrix is $\mathcal H(\boldsymbol{\theta}) = \frac{\partial^2 \mathcal L}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}$. Explicitly, \[ \begin{aligned} \mathcal{U}(\boldsymbol{\theta}) &= \begin{pmatrix} \boldsymbol{X}^T \boldsymbol{\varepsilon}/ \sigma^2 \\ - \frac{1}{2\sigma^2} \left(n - \boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}/\sigma^2\right)\end{pmatrix} \\ \mathcal{H}(\boldsymbol{\theta}) &= \begin{pmatrix} - \frac{\boldsymbol{X}^T\boldsymbol{X}}{\sigma^2} & -\frac{\boldsymbol{X}^T\boldsymbol{\varepsilon}}{(\sigma^2)^2} \\ -\frac{\boldsymbol{\varepsilon}^T\boldsymbol{X}}{(\sigma^2)^2} & \frac{n}{2(\sigma^2)^2} - \frac{\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}}{(\sigma^2)^3} \end{pmatrix}. \end{aligned} \]

Proof: Consider the partial derivatives term by term.

\[ \begin{aligned} \frac{\partial \mathcal L}{\partial \boldsymbol{\beta}} &= \frac{\partial l^{[1]}}{\partial \boldsymbol{\beta}} + \frac{\partial l^{[2]}}{\partial \boldsymbol{\beta}} + \frac{\partial l^{[3]}}{\partial \boldsymbol{\beta}}; \quad \frac{\partial{l^{[1]}}}{\partial \boldsymbol{\beta}} = \frac{\partial{l^{[2]}}}{\partial \boldsymbol{\beta}} = \boldsymbol{0} \\ l^{[3]} &= - (\boldsymbol{Y}^T\boldsymbol{Y} - \boldsymbol{Y}^T\boldsymbol{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T \boldsymbol{X}^T \boldsymbol{Y} + \boldsymbol{\beta}^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{\beta})/(2 \sigma^2) \\ &\Downarrow \\ \frac{\partial l^{[3]}}{\partial \boldsymbol{\beta}} &= \boldsymbol{X}^T \boldsymbol{\varepsilon}/ \sigma^2. \end{aligned} \]

For the $\sigma^2$ component,

\[ \begin{aligned} \frac{\partial \mathcal L}{\partial \sigma^2} &= \frac{\partial l^{[1]}}{\partial \sigma^2} + \frac{\partial l^{[2]}}{\partial \sigma^2} + \frac{\partial l^{[3]}}{\partial \sigma^2}; \quad \frac{\partial{l^{[1]}}}{\partial \sigma^2} = 0 \\ &\Downarrow \\ \frac{\partial l^{[2]}}{\partial \sigma^2} &= - \frac{n}{2\sigma^2}; \quad \frac{\partial l^{[3]}}{\partial \sigma^2} = - \frac{\partial}{\partial \sigma^2} \left(\frac{\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}}{2 \sigma^2}\right) = \frac{\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}}{2 (\sigma^2)^2}. \end{aligned} \]

Note that the score vector is zero at the MLE, giving, \[ \begin{aligned} \mathcal U(\widehat{\boldsymbol{\theta}}) = \boldsymbol{0} \Rightarrow \begin{cases} \boldsymbol{X}^T (\boldsymbol{Y} - \boldsymbol{X}\widehat{\boldsymbol{\beta}})/ \widehat{\sigma}^2 = \boldsymbol{0} \Rightarrow \widehat{\boldsymbol{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}; \\ - \frac{1}{2\widehat{\sigma}^2} \left(n - (\boldsymbol{Y} - \boldsymbol{X}\widehat{\boldsymbol{\beta}})^T(\boldsymbol{Y} - \boldsymbol{X}\widehat{\boldsymbol{\beta}})/\widehat{\sigma}^2\right) = 0 \Rightarrow \widehat{\sigma}^2 = (\boldsymbol{Y} - \widehat{\boldsymbol{Y}})^T(\boldsymbol{Y} - \widehat{\boldsymbol{Y}})/n = \widehat{\boldsymbol{\varepsilon}}^T\widehat{\boldsymbol{\varepsilon}}/n. \end{cases} \end{aligned} \]
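These closed forms can be checked numerically against `lm()`, which fits the same model; the sketch below continues the simulated data from the previous snippet, and the object names are illustrative.

```r
# Minimal sketch: closed-form MLEs versus lm() (continuing the simulated data).
beta_hat   <- solve(crossprod(X), crossprod(X, Y))    # (X'X)^{-1} X'Y
resid_hat  <- Y - as.vector(X %*% beta_hat)
sigma2_hat <- sum(resid_hat^2) / n                    # MLE divides by n, not n - p

fit <- lm(Y ~ X - 1)                                  # X already contains the intercept column
all.equal(unname(drop(beta_hat)), unname(coef(fit)))  # TRUE
all.equal(sigma2_hat, mean(residuals(fit)^2))         # TRUE
```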

The Hessian matrix is thus computed as, \[ \begin{aligned} \mathcal{H}(\boldsymbol{\theta}) &= \begin{pmatrix} \frac{\partial^2 \mathcal L}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} & \frac{\partial^2 \mathcal L}{\partial \boldsymbol{\beta} \partial \sigma^2} \\ \frac{\partial^2 \mathcal L}{\partial \sigma^2 \partial \boldsymbol{\beta}^T} & \frac{\partial^2 \mathcal L}{\partial \sigma^2 \partial \sigma^2} \end{pmatrix} \\ \frac{\partial^2\mathcal L}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^T} &= \frac{\partial}{\partial\boldsymbol{\beta}^T} \boldsymbol{X}^T (\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}) / \sigma^2 = - \frac{\boldsymbol{X}^T\boldsymbol{X}}{\sigma^2} \\ \frac{\partial^2\mathcal L}{\partial\boldsymbol{\beta}\partial\sigma^2} &= \frac{\partial}{\partial\sigma^2} \boldsymbol{X}^T \boldsymbol{\varepsilon} / \sigma^2 = - \frac{\boldsymbol{X}^T\boldsymbol{\varepsilon}}{(\sigma^2)^2} \Rightarrow \frac{\partial^2\mathcal L}{\partial\sigma^2\partial\boldsymbol{\beta}^T} = - \frac{\boldsymbol{\varepsilon}^T\boldsymbol{X}}{(\sigma^2)^2} \\ \frac{\partial^2\mathcal L}{\partial\sigma^2\partial\sigma^2} &= \frac{\partial}{\partial\sigma^2} \left(-\frac{n}{2\sigma^2} + \frac{\boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon}}{2(\sigma^2)^2}\right) = \frac{n}{2(\sigma^2)^2} - \frac{\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}}{(\sigma^2)^3} \end{aligned} \] The theory of $d$AIC given in [@dAIC] considers an extension of the above unweighted likelihood and uses the weighted likelihood function in its construction.

Weighted likelihood for multiple regression under sampling design

In the complex survey design approach, the observations $\left\{ (y_i, \boldsymbol{x}_i): i \in s\right\}$ are assumed to be derived from a sample $s$ of $n$ units drawn from a finite population of size $N$ using some probability sampling design. The weight associated with the $i^{\mathrm{th}}$ sampled unit is denoted $w_i$.

The data from the finite population are assumed to be generated from some distribution with joint density $g(y, \boldsymbol{x})$. The desired modelling approach for the data is a parametric model, $\left\{ f_{\boldsymbol{\theta}}(y|\boldsymbol{x}), \boldsymbol{\theta} \in \boldsymbol{\Theta}\right\}$. Generally speaking, the parametric family induced by $f_{\boldsymbol{\theta}}(y|\boldsymbol{x})$ need not contain the true conditional distribution $g(y|\boldsymbol{x})$; the parametric model would be considered misspecified in this case. The likelihood approach can nevertheless still be used; the objective function being maximised is then the quantity, \[ \ell(\boldsymbol{\theta}) = \mathbb{E}_g \left( \log f_{\boldsymbol{\theta}}(y|\boldsymbol{x})\right). \]

This is motivated by the Kullback-Leibler divergence of the candidate parametric model from the true model, \[ \mathrm{KL}(f_{\boldsymbol{\theta}}, g) = \mathbb{E}_g \left( \log g(y| \boldsymbol{x})\right) - \ell(\boldsymbol{\theta}). \]

The target parameter $\boldsymbol{\theta}^*$ is the 'least false' parameter value, that is, the value that minimises the Kullback-Leibler divergence above (or equivalently, maximises $\ell(\boldsymbol{\theta})$). In terms of the classical likelihood equations, this occurs at the new score equation,

\[ \mathcal U(\boldsymbol{\theta}^*) = \mathbb{E}_g \left( \left.\frac{\partial \log f_{\boldsymbol{\theta}}(y|\boldsymbol{x})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}^*}\right) = \boldsymbol{0}. \]

The population quantity $\ell(\boldsymbol{\theta})$, being a population mean, is estimated using the weighted estimator, \[ \widehat{\ell}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i \in s} w_i \ell_i(\boldsymbol{\theta}) \] where $\ell_i(\boldsymbol{\theta}) = \log f_{\boldsymbol{\theta}}(y_i|\boldsymbol{x}_i)$ is the contribution to the log-likelihood by unit $i$, and the weights are assumed to aggregate to the population total, $\sum_{i \in s}w_i = N$. To estimate the maximiser of $\ell(\boldsymbol{\theta})$ in the context of the sampling design, use the weighted pseudo-score, \[ \widehat{\mathcal U}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i \in s} w_i \frac{\partial}{\partial \boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}). \]

Then $\boldsymbol{\theta}^*$ is consistently estimated by the $\widehat{\boldsymbol{\theta}}$ that satisfies the weighted pseudo-score equation, $\widehat{\mathcal U}(\widehat{\boldsymbol{\theta}}) = \boldsymbol{0}$. This is shown later using results of [@Fuller-2009].
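As a concrete illustration of the weighted estimator $\widehat{\ell}(\boldsymbol{\theta})$ for the normal linear model, the sketch below continues the simulated data from the earlier snippets; the weights are invented for the example.

```r
# Minimal sketch: the weighted estimator of the population log-likelihood
# evaluated at a candidate theta (simulated weights; names are illustrative).
w     <- runif(n, 1, 10)   # made-up sampling weights for the units in s
N_hat <- sum(w)            # the weights aggregate to the population size N

ell_i   <- dnorm(Y, mean = as.vector(X %*% beta), sd = sqrt(sigma2), log = TRUE)
ell_hat <- sum(w * ell_i) / N_hat   # (1/N) * sum over s of w_i * l_i(theta)
ell_hat
```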

Before computing the required terms in this analysis, the link between $\widehat{\boldsymbol{\theta}}$ and the AIC strategy of model assessment is covered. The model that best matches the population likelihood will minimise the Kullback-Leibler divergence above. The AIC strategy is to estimate the quantity $Q_n = \mathbb{E}_g( \ell(\widehat{\boldsymbol{\theta}}))$ for each candidate model and select the model with the largest estimated value.

Theorem 1 of [@dAIC] gives the statistical properties of the naive estimator, $\widehat{\ell}(\widehat{\boldsymbol{\theta}})$, with, \[ \mathbb{E}_g(\widehat{\ell}(\widehat{\boldsymbol{\theta}})) = Q_n + \frac{1}{n} \mathrm{tr} \left\{ \boldsymbol{\Delta} \right\} + o_p(n^{-1}) \] where $\boldsymbol{\Delta} = \mathcal I(\boldsymbol{\theta}^*)\boldsymbol{V}(\boldsymbol{\theta}^*)$, $\mathcal I(\boldsymbol{\theta}) = \mathbb{E}\,\widehat{\mathcal J}(\boldsymbol{\theta})$, and $\boldsymbol{V}(\boldsymbol{\theta}^*)$ is the asymptotic covariance matrix of $\sqrt{n}\,\widehat{\boldsymbol{\theta}}$.

In particular, their proof utilises the regularity conditions and results of Theorem 1.3.9 of [@Fuller-2009] under the typical framework in survey estimation, where it is supposed that there is a sequence of finite populations indexed by $\nu$ and a sequence of samples of size $n_\nu$ drawn from the $N_\nu$ units in the $\nu^{\mathrm{th}}$ population under a well-defined probability scheme. Asymptotic normality and consistency are then established with:

  1. $\widehat{\boldsymbol{\theta}}_\nu$ is asymptotically normal with $\sqrt{n} (\widehat{\boldsymbol{\theta}}_{\nu} - \boldsymbol{\theta}^*) \stackrel{d}{\longrightarrow} \mathcal N(\boldsymbol{0}, \boldsymbol{V}(\boldsymbol{\theta}^*))$.
  2. If a sequence $\left\{\boldsymbol{\theta}_\nu \right\}$ of estimators is consistent for $\boldsymbol{\theta}^*$, that is, $\boldsymbol{\theta}_\nu \stackrel{p}{\longrightarrow} \boldsymbol{\theta}^*$, then $\widehat{\mathcal J}_\nu(\boldsymbol{\theta}_\nu) \stackrel{p}{\longrightarrow} \mathcal I(\boldsymbol{\theta}^*)$.

Consider now the precise form of the likelihood equations in this weighted sampling design context. For convenience define $\boldsymbol{W} = \mathrm{diag}(w_i)_{i \in s}$ as the diagonal matrix of weights.

Leveraging the earlier results from the unweighted likelihood, the weighted log-likelihood takes the form, \[ \begin{aligned} \widehat{\ell}(\boldsymbol{\theta}) &= - \frac{1}{2} \log(2 \pi) - \frac{1}{2} \log \sigma^2 - \frac{1}{N}\frac{(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta})^T\boldsymbol{W}(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta})}{2\sigma^2} \\ &= - \frac{1}{2} \log(2 \pi) - \frac{1}{2} \log \sigma^2 - \frac{1}{N}\frac{\boldsymbol{\varepsilon}^T\boldsymbol{W}\boldsymbol{\varepsilon}}{2\sigma^2} \\ &= \widehat{\ell}_1 + \widehat{\ell}_2 + \widehat{\ell}_3. \end{aligned} \] Maximising the log-likelihood function in a similar manner to the unweighted case gives the weighted score vector $\widehat{\mathcal{U}}(\boldsymbol{\theta}) = \frac{\partial \widehat{\ell}}{\partial \boldsymbol{\theta}} = \frac{1}{N} \sum_{i \in s} w_i \frac{\partial \ell_i(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$ and corresponding negative Hessian $\widehat{\mathcal J}(\boldsymbol{\theta}) = - \frac{\partial^2 \widehat{\ell}}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} = - \frac{1}{N} \sum_{i \in s} w_i \frac{\partial^2\ell_i(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\partial\boldsymbol{\theta}^T}$. Evaluating these leads to the equations, \[ \widehat{\mathcal{U}}(\boldsymbol{\theta}) = \begin{pmatrix} \boldsymbol{X}^T\boldsymbol{W} \boldsymbol{\varepsilon}/ (N\sigma^2) \\ - \frac{1}{2N\sigma^2} \left(N - \boldsymbol{\varepsilon}^T\boldsymbol{W}\boldsymbol{\varepsilon}/\sigma^2\right)\end{pmatrix} = \frac{1}{N} \sum_{i \in s} w_i \begin{pmatrix} \boldsymbol{x}_i \frac{(y_i - \boldsymbol{x}_i^T\boldsymbol{\beta})}{\sigma^2} \\ - \frac{1}{2\sigma^2} + \frac{(y_i - \boldsymbol{x}_i^T \boldsymbol{\beta})^2}{2 (\sigma^2)^2}\end{pmatrix} =: \begin{pmatrix} U_{\boldsymbol{\beta}} \\ U_{\sigma^2}\end{pmatrix}. \] \[ \widehat{\mathcal J}(\boldsymbol{\theta}) = \begin{pmatrix} \frac{\boldsymbol{X}^T \boldsymbol{W} \boldsymbol{X}}{N\sigma^2} & \frac{\boldsymbol{X}^T \boldsymbol{W} \boldsymbol{\varepsilon}}{N(\sigma^2)^2} \\ \frac{\boldsymbol{\varepsilon}^T \boldsymbol{W} \boldsymbol{X}}{N(\sigma^2)^2} & -\frac{1}{2(\sigma^2)^2} + \frac{\boldsymbol{\varepsilon}^T \boldsymbol{W}\boldsymbol{\varepsilon}}{N(\sigma^2)^3} \end{pmatrix}, \qquad \mathcal J_N(\boldsymbol{\theta}) = \begin{pmatrix} \frac{\boldsymbol{X}^T \boldsymbol{X}}{N\sigma^2} & \frac{\boldsymbol{X}^T \boldsymbol{\varepsilon}}{N(\sigma^2)^2} \\ \frac{\boldsymbol{\varepsilon}^T \boldsymbol{X}}{N(\sigma^2)^2} & -\frac{1}{2(\sigma^2)^2} + \frac{\boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon}}{N(\sigma^2)^3} \end{pmatrix}; \] where $\mathcal J_N(\boldsymbol{\theta})$ is the negative Hessian without the weights applied, which is used in later results.

Solving the pseudo-score equation above yields the pseudo maximum likelihood parameter estimates, \[ \widehat{\boldsymbol{\beta}} = \left(\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{Y}, \qquad \widehat{\sigma}^2 = \frac{1}{N} \widehat{\boldsymbol{\varepsilon}}^T\boldsymbol{W}\widehat{\boldsymbol{\varepsilon}}, \] where $\widehat{\boldsymbol{\varepsilon}} = \boldsymbol{Y} - \boldsymbol{X}\widehat{\boldsymbol{\beta}}$. From (1.3.78) in [@Fuller-2009], $\mathcal I(\boldsymbol{\theta}) = \mathbb{E}\,\widehat{\mathcal J}(\boldsymbol{\theta})$ can be estimated by $\widehat{\mathcal J}(\boldsymbol{\theta})$, with \[ \widehat{\mathcal J}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{i \in s} w_i \frac{\partial^2 \ell_i(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} = -\frac{1}{N}\sum_{i \in s} \frac{\partial^2 \ell_i(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} + O_p(n^{-1/2}) =: \mathcal J_N(\boldsymbol{\theta}) + O_p(n^{-1/2}) \stackrel{a.s.}{\longrightarrow} \mathcal J(\boldsymbol{\theta}) \] where $\mathcal J(\boldsymbol{\theta})$ is non-singular.
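The weighted closed forms can be checked against `survey::svyglm()`, which produces the same pseudo maximum likelihood coefficient estimates; the sketch continues the simulated data and weights from the earlier snippets, and all object names are illustrative.

```r
# Minimal sketch: weighted pseudo-MLE via the closed forms, checked against
# survey::svyglm() (continuing X, Y, w, N_hat from the earlier snippets).
library(survey)

W        <- diag(w)
beta_w   <- solve(t(X) %*% W %*% X, t(X) %*% W %*% Y)   # (X'WX)^{-1} X'WY
resid_w  <- Y - as.vector(X %*% beta_w)
sigma2_w <- sum(w * resid_w^2) / N_hat                   # eps'W eps / N

dat   <- data.frame(y = Y, x1 = X[, "x1"], x2 = X[, "x2"], w = w)
des   <- svydesign(ids = ~1, weights = ~w, data = dat)
fit_w <- svyglm(y ~ x1 + x2, design = des)

all.equal(unname(drop(beta_w)), unname(coef(fit_w)))     # TRUE: same weighted coefficients
```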

$d$AIC

Following the theory of [@dAIC], the $d$AIC is constructed using the weighted likelihood together with an adjustment based on the "design effect matrix" of [@Rao-Scott-1984], as follows,

\[ d\mathrm{AIC} = - 2n\widehat{\ell}(\widehat{\boldsymbol{\theta}}) + 2p \widehat{\overline{\delta}} \] where $\widehat{\ell}$ is the weighted log-likelihood and $\widehat{\boldsymbol{\theta}}$ the weighted (pseudo) MLE. The number of parameters is $p = m + 2$ (the $m$ slope parameters, the intercept $\beta_0$, and the variance parameter $\sigma^2$). The design effect adjustment value $\widehat{\overline{\delta}}$ is explained in the next section.

Estimating the design effect matrix, $\boldsymbol{\Delta}$

Recall the design effect matrix $\boldsymbol{\Delta} = \mathcal I(\boldsymbol{\theta}^*) \boldsymbol{V}(\boldsymbol{\theta}^*)$, where $\boldsymbol{V}(\boldsymbol{\theta}^*)$ denotes the asymptotic covariance matrix of $\sqrt{n}\,\widehat{\boldsymbol{\theta}}$. The estimate of $\mathcal I(\boldsymbol{\theta})$ is considered first. \[ \mathcal I(\boldsymbol{\theta}) = \mathbb{E}\,\widehat{\mathcal J}(\boldsymbol{\theta}) \Rightarrow \widehat{\mathcal I}(\widehat{\boldsymbol{\theta}}) = \begin{pmatrix} \frac{\boldsymbol{X}^T\boldsymbol{X}}{N\widehat{\sigma}^2} & \boldsymbol{0} \\ \boldsymbol{0}^T & \frac{1}{2(\widehat{\sigma}^2)^2}\end{pmatrix} =: \begin{pmatrix} \mathcal I_{\boldsymbol{\beta}} & \boldsymbol{0} \\ \boldsymbol{0}^T & \mathcal I_{\sigma^2} \end{pmatrix}, \] where $\boldsymbol{0}$ is a column vector of zeros of dimension $m + 1$. Note that the regression parameters, $\boldsymbol{\beta}$, and the variance nuisance parameter, $\sigma^2$, are orthogonal, in the sense that the information matrix is block diagonal.

The design effect adjustment is computed using $\boldsymbol{\Delta}$ with, \[ \overline{\delta} = \mathrm{trace}(\boldsymbol{\Delta}) / p. \] The blockwise diagonal form of $\mathcal I(\boldsymbol{\theta})$ allows the trace to be computed separately for the regression parameters and the variance nuisance parameter. Indeed, due to the form of the negative Hessian $\widehat{\mathcal J}(\boldsymbol{\theta})$, it follows that $\boldsymbol{V}(\boldsymbol{\theta})$ will be blockwise diagonal with, \[ \boldsymbol{V}(\boldsymbol{\theta}) = \begin{pmatrix} V_{\boldsymbol{\beta}} & \boldsymbol{0} \\ \boldsymbol{0}^T & V_{\sigma^2} \end{pmatrix} \Rightarrow \mathrm{trace}(\boldsymbol{\Delta}) = \mathrm{trace}(\mathcal I_{\boldsymbol{\beta}}\boldsymbol{V}_{\boldsymbol{\beta}}) + \mathrm{trace}(\mathcal I_{\sigma^2}V_{\sigma^2}) = \mathrm{trace}(\mathcal I_{\boldsymbol{\beta}}\boldsymbol{V}_{\boldsymbol{\beta}}) + \mathcal I_{\sigma^2}V_{\sigma^2}. \] The trace calculation therefore splits into two components: one for the regression parameters and one for the variance nuisance parameter. The trace term for the regression parameters is estimated routinely using the standard variance-covariance matrices of the regression parameters in $\mathcal I_{\boldsymbol{\beta}}$ and $V_{\boldsymbol{\beta}}$. That is, define $\boldsymbol{V}_{\boldsymbol{\beta}, 0}$ to be the variance-covariance matrix of $\widehat{\boldsymbol{\beta}}$ under a simple random design and $\boldsymbol{V}_{\boldsymbol{\beta}}$ the variance-covariance matrix under the sampling design. Then simply $\mathcal I_{\boldsymbol{\beta}}\boldsymbol{V}_{\boldsymbol{\beta}} = \boldsymbol{V}_{\boldsymbol{\beta}, 0}^{-1}\boldsymbol{V}_{\boldsymbol{\beta}}$.

The $\sigma^2$ component of the calculation involves estimating the scalars $\mathcal I_{\sigma^2}$ and $V_{\sigma^2}$. The $\mathcal I_{\sigma^2}$ component can be estimated directly with $\widehat{\mathcal{I}}_{\sigma^2} = n /(2(\widehat{\sigma}^2)^2)$. To estimate $V_{\sigma^2}$, the $\sigma^2$ element of the information matrix under the sampling weights is estimated and its reciprocal is taken. To estimate that element, use the standard estimating-equation identity that the information is the variance of the score; since the score is zero at the MLE, this only requires the second moment of the score contributions. The contribution of each observation to the $\sigma^2$ element of the score equation is, \[ U_{\sigma^2, i}(\boldsymbol{\theta}) := \frac{\partial \ell_i(\boldsymbol{\theta})}{\partial \sigma^2} = - \frac{1}{2\sigma^2} + \frac{(y_i - \boldsymbol{x}_i^T \boldsymbol{\beta})^2}{2(\sigma^2)^2}. \] Therefore an estimate of the second moment, and consequently the variance of this component (since the score is zero at the MLE), is given by, \[ \widehat{\mathbb{V}\text{ar}}\, U_{\sigma^2} = \frac{1}{N}\sum_{i \in s} w_i U_{\sigma^2, i}^2(\widehat{\boldsymbol{\theta}}) = \frac{1}{N}\sum_{i \in s} w_i \left( - \frac{1}{2\widehat{\sigma}^2} + \frac{(y_i - \boldsymbol{x}_i^T \widehat{\boldsymbol{\beta}})^2}{2(\widehat{\sigma}^2)^2}\right)^2. \] Thus the covariance element $V_{\sigma^2}$ can be estimated with, \[ \widehat V_{\sigma^2} = \left(\widehat{\mathbb{V}\text{ar}}\, U_{\sigma^2}\right)^{-1}. \] Finally, $\widehat{\overline{\delta}} = \left(\widehat{\delta}_{\boldsymbol{\beta}} + \widehat{\delta}_{\sigma^2}\right)/p = \left(\mathrm{trace}(\widehat{\boldsymbol{V}}_{\boldsymbol{\beta}, 0}^{-1} \widehat{\boldsymbol{V}}_{\boldsymbol{\beta}}) + \widehat{\mathcal{I}}_{\sigma^2}\widehat{V}_{\sigma^2}\right)/p$.
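Putting the pieces together, the sketch below assembles $\widehat{\overline{\delta}}$ and the $d$AIC by transcribing the formulas of this section as written, continuing the objects from the previous snippets. The choice of $\widehat{\boldsymbol{V}}_{\boldsymbol{\beta}, 0}$ below is one plausible simple-random-sampling covariance, and the scaling of the $\sigma^2$ block is taken at face value from the text, so the resulting numbers need not match the flipRegression or survey internals exactly.

```r
# Minimal sketch of the dAIC assembly, following this section's formulas as
# written (continuing X, Y, w, N_hat, fit_w, resid_w, sigma2_w from above).
p <- ncol(X) + 1                              # m + 2: slopes, intercept and sigma^2

# Beta block: trace(V_{beta,0}^{-1} V_beta)
V_beta  <- vcov(fit_w)                        # design-based covariance of beta-hat
V_beta0 <- sigma2_w * solve(crossprod(X))     # assumed SRS covariance of beta-hat
delta_beta <- sum(diag(solve(V_beta0, V_beta)))

# sigma^2 block: I_{sigma^2} * V_{sigma^2}, with V_{sigma^2} the reciprocal of
# the weighted second moment of the per-unit score contributions
U_i   <- -1 / (2 * sigma2_w) + resid_w^2 / (2 * sigma2_w^2)
var_U <- sum(w * U_i^2) / N_hat
delta_sigma <- (n / (2 * sigma2_w^2)) * (1 / var_U)

# Weighted log-likelihood at the pseudo-MLE, delta-bar, and the dAIC
ell_hat_mle <- -0.5 * log(2 * pi) - 0.5 * log(sigma2_w) -
  sum(w * resid_w^2) / (2 * N_hat * sigma2_w)
delta_bar <- (delta_beta + delta_sigma) / p
dAIC <- -2 * n * ell_hat_mle + 2 * p * delta_bar
dAIC
```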

Implementation

This result is implemented using the survey package [@survey] to fit the model, with the $d$AIC computed manually.

Note that in the implementation in [@survey] the rescaled weights $w_i^*$ are used, so that $\sum_{i \in s} w_i^* = N^* = n$. That is, $w_i^* = \frac{w_i}{\overline w}$ where $\overline w = \frac{1}{n}\sum_{i \in s} w_i$, which has the implication that $N$ is replaced by $N^* = n$ in the formulas above.
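A small sketch of the rescaling, continuing the data frame from the earlier snippets. The coefficient estimates are unchanged by rescaling the weights; recent versions of the survey package also expose a design-based AIC for `svyglm` fits via `AIC()`, which can serve as a cross-check for a manual construction.

```r
# Minimal sketch: rescaled weights as used in the implementation
# (continuing dat from the earlier snippet).
dat$w_star <- dat$w / mean(dat$w)            # w_i* = w_i / w-bar
all.equal(sum(dat$w_star), nrow(dat))        # TRUE: rescaled weights sum to n

des_star <- svydesign(ids = ~1, weights = ~w_star, data = dat)
fit_star <- svyglm(y ~ x1 + x2, design = des_star)

coef(fit_star)   # identical to the coefficients from the unrescaled design
AIC(fit_star)    # the survey package's own design-based AIC, for comparison
```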

References


