svrep: Tools for Creating, Updating, and Analyzing Survey Replicate Weights

Documented in as_gen_boot_design make_gen_boot_factors

#' @title Factors for the Generalized Survey Bootstrap
#' @description Creates replicate factors for the generalized survey bootstrap method.
#' The generalized survey bootstrap is a method for forming bootstrap replicate weights
#' from a textbook variance estimator, provided that the variance estimator
#' can be represented as a quadratic form whose matrix is positive semidefinite
#' (this covers a large class of variance estimators).
#' @param Sigma The matrix of the quadratic form used to represent the variance estimator.
#' Must be positive semidefinite.
#' @param num_replicates The number of bootstrap replicates to create.
#' @param tau Either \code{"auto"}, or a single number; the default value is 1. 
#' This is the rescaling constant
#' used to avoid negative weights through the transformation \eqn{\frac{w + \tau - 1}{\tau}},
#' where \eqn{w} is the original weight and \eqn{\tau} is the rescaling constant \code{tau}. \cr
#' If \code{tau="auto"}, the rescaling factor is determined automatically as follows:
#' if all of the adjustment factors are nonnegative, then \code{tau} is set equal to 1;
#' otherwise, \code{tau} is set to the smallest value needed to rescale
#' the adjustment factors such that they are all at least \code{0.01}.
#' Instead of using \code{tau="auto"}, the user can instead use the function
#' \code{rescale_replicates()} to rescale the replicates later.
#' @param exact_vcov If \code{exact_vcov=TRUE}, the replicate factors will be generated
#' such that their variance-covariance matrix exactly matches the target variance estimator's
#' quadratic form (within numeric precision).
#' This is desirable as it causes variance estimates for totals to closely match
#' the values from the target variance estimator.
#' This requires that \code{num_replicates} exceeds the rank of \code{Sigma}.
#' The replicate factors are generated by applying PCA-whitening to a collection of draws
#' from a multivariate Normal distribution, then applying a coloring transformation
#' to the whitened collection of draws.
#'
#' @section Statistical Details:
#' Let \eqn{v( \hat{T_y})} be the textbook variance estimator for an estimated population total \eqn{\hat{T}_y} of some variable \eqn{y}.
#' The base weight for case \eqn{i} in our sample is \eqn{w_i}, and we let \eqn{\breve{y}_i} denote the weighted value \eqn{w_iy_i}.
#' Suppose we can represent our textbook variance estimator as a quadratic form: \eqn{v(\hat{T}_y) = \breve{y}\Sigma\breve{y}^T},
#' for some \eqn{n \times n} matrix \eqn{\Sigma}.
#' The only constraint on \eqn{\Sigma} is that, for our sample, it must be symmetric and positive semidefinite.
#'
#' The bootstrapping process creates \eqn{B} sets of replicate weights, where the \eqn{b}-th set of replicate weights is a vector of length \eqn{n} denoted \eqn{\mathbf{a}^{(b)}}, whose \eqn{k}-th value is denoted \eqn{a_k^{(b)}}.
#' This yields \eqn{B} replicate estimates of the population total, \eqn{\hat{T}_y^{*(b)}=\sum_{k \in s} a_k^{(b)} \breve{y}_k}, for \eqn{b=1, \ldots B}, which can be used to estimate sampling variance.
#'
#' \deqn{
#'   v_B\left(\hat{T}_y\right)=\frac{\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2}{B}
#' }
#'
#' This bootstrap variance estimator can be written as a quadratic form:
#'
#'   \deqn{
#'     v_B\left(\hat{T}_y\right) =\mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}
#'   }
#'   where
#'   \deqn{
#'     \boldsymbol{\Sigma}_B = \frac{\sum_{b=1}^B\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}}{B}
#'   }
#'
#' Note that if the vector of adjustment factors \eqn{\mathbf{a}^{(b)}} has expectation \eqn{\mathbf{1}_n} and variance-covariance matrix \eqn{\boldsymbol{\Sigma}},
#' then we have the bootstrap expectation \eqn{E_{*}\left( \boldsymbol{\Sigma}_B \right) = \boldsymbol{\Sigma}}. Since the bootstrap process takes the sample values \eqn{\breve{y}} as fixed, the bootstrap expectation of the variance estimator is \eqn{E_{*} \left( \mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}\right)= \mathbf{\breve{y}}^{\prime}\Sigma \mathbf{\breve{y}}}.
#' Thus, we can produce a bootstrap variance estimator with the same expectation as the textbook variance estimator simply by randomly generating \eqn{\mathbf{a}^{(b)}} from a distribution with the following two conditions:
#' \cr \cr
#'     \strong{Condition 1}: \eqn{\quad \mathbf{E}_*(\mathbf{a})=\mathbf{1}_n}
#' \cr
#'     \strong{Condition 2}: \eqn{\quad \mathbf{E}_*\left(\mathbf{a}-\mathbf{1}_n\right)\left(\mathbf{a}-\mathbf{1}_n\right)^{\prime}=\mathbf{\Sigma}}
#' \cr \cr
#' While there are multiple ways to generate adjustment factors satisfying these conditions,
#' the simplest general method is to simulate from a multivariate normal distribution: \eqn{\mathbf{a} \sim MVN(\mathbf{1}_n, \boldsymbol{\Sigma})}.
#' This is the method used by this function.
#'
#' @section Details on Rescaling to Avoid Negative Adjustment Factors:
#' Let \eqn{\mathbf{A} = \left[ \mathbf{a}^{(1)} \cdots \mathbf{a}^{(b)} \cdots \mathbf{a}^{(B)} \right]} denote the \eqn{(n \times B)} matrix of bootstrap adjustment factors.
#' To eliminate negative adjustment factors, Beaumont and Patak (2012) propose forming a rescaled matrix of nonnegative replicate factors \eqn{\mathbf{A}^S} by rescaling each adjustment factor \eqn{a_k^{(b)}} as follows:
#' \deqn{
#'    a_k^{S,(b)} = \frac{a_k^{(b)} + \tau - 1}{\tau}
#'  }
#' where \eqn{\tau \geq 1 - a_k^{(b)} \geq 1} for all \eqn{k} in \eqn{\left\{ 1,\ldots,n \right\}} and all \eqn{b} in \eqn{\left\{1, \ldots, B\right\}}.
#'
#' The value of \eqn{\tau} can be set based on the realized adjustment factor matrix \eqn{\mathbf{A}} or by choosing \eqn{\tau} prior to generating the adjustment factor matrix \eqn{\mathbf{A}} so that \eqn{\tau} is likely to be large enough to prevent negative bootstrap weights.
#'
#' If the adjustment factors are rescaled in this manner, it is important to adjust the scale factor used in estimating the variance with the bootstrap replicates, which becomes \eqn{\frac{\tau^2}{B}} instead of \eqn{\frac{1}{B}}.
#' \deqn{
#'  \textbf{Prior to rescaling: } v_B\left(\hat{T}_y\right) = \frac{1}{B}\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2
#'  }
#' \deqn{
#'  \textbf{After rescaling: } v_B\left(\hat{T}_y\right) = \frac{\tau^2}{B}\sum_{b=1}^B\left(\hat{T}_y^{S*(b)}-\hat{T}_y\right)^2
#' }
#' When sharing a dataset that uses rescaled weights from a generalized survey bootstrap, the documentation for the dataset should instruct the user to use replication scale factor \eqn{\frac{\tau^2}{B}} rather than \eqn{\frac{1}{B}} when estimating sampling variances.
#'
#' @references
#' The generalized survey bootstrap was first proposed by Bertail and Combris (1997).
#' See Beaumont and Patak (2012) for a clear overview of the generalized survey bootstrap.
#' The generalized survey bootstrap represents one strategy for forming replication variance estimators
#' in the general framework proposed by Fay (1984) and Dippo, Fay, and Morganstein (1984).
#' \cr \cr
#' - Beaumont, Jean-François, and Zdenek Patak. 2012. "On the Generalized Bootstrap for Sample Surveys with Special Attention to Poisson Sampling: Generalized Bootstrap for Sample Surveys." International Statistical Review 80 (1): 127-48. https://doi.org/10.1111/j.1751-5823.2011.00166.x.
#' \cr \cr
#' - Bertail, and Combris. 1997. "Bootstrap Généralisé d'un Sondage." Annales d'Économie Et de Statistique, no. 46: 49. https://doi.org/10.2307/20076068.
#' \cr \cr
#' - Dippo, Cathryn, Robert Fay, and David Morganstein. 1984. "Computing Variances from Complex Samples with Replicate Weights." In, 489-94. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_094.pdf.
#' \cr \cr
#' - Fay, Robert. 1984. "Some Properties of Estimates of Variance Based on Replication Methods." In, 495-500. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_095.pdf.
#' \cr
#'
#' @seealso The function \code{\link[svrep]{make_quad_form_matrix}} can be used to
#' represent several common variance estimators as a quadratic form's matrix,
#' which can then be used as an input to \code{make_gen_boot_factors()}.
#'
#' @return A matrix with the same number of rows as \code{Sigma}, and the number of columns
#' equal to \code{num_replicates}. The object has an attribute named \code{tau} which can be retrieved
#' by calling \code{attr(which = 'tau')} on the object. The value \code{tau} is a rescaling factor
#' which was used to avoid negative weights.
#'
#' In addition, the object has attributes named \code{scale} and \code{rscales} which can be
#' passed directly to \link[survey]{svrepdesign}. Note that the value of \code{scale} is \eqn{\tau^2/B},
#' while the value of \code{rscales} is vector of length \eqn{B}, with every entry equal to \eqn{1}.
#'
#' @export
#'
#' @examples
#' \donttest{
#'   library(survey)
#'
#' # Load an example dataset that uses unequal probability sampling ----
#'   data('election', package = 'survey')
#'
#' # Create matrix to represent the Horvitz-Thompson estimator as a quadratic form ----
#'   n <- nrow(election_pps)
#'   pi <- election_jointprob
#'   horvitz_thompson_matrix <- matrix(nrow = n, ncol = n)
#'   for (i in seq_len(n)) {
#'     for (j in seq_len(n)) {
#'       horvitz_thompson_matrix[i,j] <- 1 - (pi[i,i] * pi[j,j])/pi[i,j]
#'     }
#'   }
#'
#'   ## Equivalently:
#'
#'   horvitz_thompson_matrix <- make_quad_form_matrix(
#'     variance_estimator = "Horvitz-Thompson",
#'     joint_probs = election_jointprob
#'   )
#'
#' # Make generalized bootstrap adjustment factors ----
#'
#'   bootstrap_adjustment_factors <- make_gen_boot_factors(
#'     Sigma = horvitz_thompson_matrix,
#'     num_replicates = 80,
#'     tau = 'auto'
#'   )
#'
#' # Determine replication scale factor for variance estimation ----
#'   tau <- attr(bootstrap_adjustment_factors, 'tau')
#'   B <- ncol(bootstrap_adjustment_factors)
#'   replication_scaling_constant <- tau^2 / B
#'
#' # Create a replicate design object ----
#'   election_pps_bootstrap_design <- svrepdesign(
#'     data = election_pps,
#'     weights = 1 / diag(election_jointprob),
#'     repweights = bootstrap_adjustment_factors,
#'     combined.weights = FALSE,
#'     type = "other",
#'     scale = attr(bootstrap_adjustment_factors, 'scale'),
#'     rscales = attr(bootstrap_adjustment_factors, 'rscales')
#'   )
#'
#' # Compare estimates to Horvitz-Thompson estimator ----
#'
#'   election_pps_ht_design <- svydesign(
#'     id = ~1,
#'     fpc = ~p,
#'     data = election_pps,
#'     pps = ppsmat(election_jointprob),
#'     variance = "HT"
#'   )
#'
#' svytotal(x = ~ Bush + Kerry,
#'          design = election_pps_bootstrap_design)
#' svytotal(x = ~ Bush + Kerry,
#'          design = election_pps_ht_design)
#' }
make_gen_boot_factors <- function(Sigma, num_replicates, tau = 1, exact_vcov = FALSE) {

  n <- nrow(Sigma)

  # Generate replicate factors by simulating from a multivariate normal distribution
  if (!exact_vcov) {
    replicate_factors <- t(
      mvtnorm::rmvnorm(n = num_replicates,
                       mean = rep(1, times = n),
                       sigma = Sigma,
                       checkSymmetry = FALSE,
                       method = "eigen")
    )
  }
  if (exact_vcov) {
    qr_rank <- qr(Sigma)$rank
    if (num_replicates <= qr_rank) {
      err_msg <- paste(
        "`exact_vcov=TRUE` only works if `num_replicates` exceeds the rank of `Sigma`,",
        sprintf("which is %s", qr_rank)
      )
      stop(err_msg)
    }
    # Draw from standard multivariate normal
    draws <- mvtnorm::rmvnorm(n = num_replicates,
                              mean = rep(0, times = n),
                              sigma = diag(n),
                              checkSymmetry = FALSE,
                              method = "eigen")
    # Whiten the draws
    draws <- scale(draws, center = TRUE, scale = FALSE)
    draws <- draws %*% svd(draws, nu = 0, nv = n)$v
    draws <- scale(draws, FALSE, TRUE)
    # Color the draws
    replicate_factors <- with(svd(Sigma, nu = 0, nv = n), {
      tcrossprod(sweep(v, 2, sqrt(pmax(d, 0)), "*"), draws)
    })
    # Adjust the scale of the replicates,
    # since the generalized bootstrap uses the scale (1/B),
    # but the coloring process gives the exact result only for scale (1/(B-1))
    scale_adjustment <- sqrt((num_replicates)/(num_replicates-1))
    replicate_factors <- replicate_factors * scale_adjustment
    # Make sure that draws have the exact specified mean
    replicate_factors <- rep(1, times = n) + replicate_factors
  }

  attr(replicate_factors, 'scale') <- 1/num_replicates

  # (Potentially) rescale to avoid negative weights
  if (missing(tau)) {
    replicate_factors <- replicate_factors
    attr(replicate_factors, 'tau') <- 1 
  } else {
    if (tau == "auto") {
      tau <- NULL
      new_scale <- NULL
      min_wgt <- 0.01
    } else {
      tau <- tau
      new_scale <- attr(replicate_factors, 'scale') * (tau^2)
      min_wgt <- NULL
    }
    replicate_factors <- rescale_replicates(
      x = replicate_factors,
      new_scale = new_scale,
      min_wgt = min_wgt
    )
    selected_tau <- sqrt(
      attr(replicate_factors, 'scale')*num_replicates
    )
    attr(replicate_factors, 'tau') <- selected_tau
  }

  attr(replicate_factors, 'rscales') <- rep(1, times = num_replicates)

  # Set column names
  colnames(replicate_factors) <- sprintf("REP_%s", seq_len(num_replicates))

  # Return result
  return(replicate_factors)
}

#' @title Convert Survey Design to Generalized Bootstrap Replicate Design
#' @description Converts a survey design object to a replicate design object
#' with replicate weights formed using the generalized bootstrap method.
#' The generalized survey bootstrap is a method for forming bootstrap replicate weights
#' from a textbook variance estimator, provided that the variance estimator
#' can be represented as a quadratic form whose matrix is positive semidefinite
#' (this covers a large class of variance estimators).
#' @param design A survey design object created using the 'survey' (or 'srvyr') package,
#' with class \code{'survey.design'} or \code{'svyimputationList'}.
#' @param variance_estimator The name of the variance estimator
#' whose quadratic form matrix should be created.
#' See \link[svrep]{variance-estimators} for a
#' detailed description of each variance estimator.
#' Options include:
#' \itemize{
#'   \item \strong{"Yates-Grundy"}: \cr The Yates-Grundy variance estimator based on
#'     first-order and second-order inclusion probabilities.
#'   \item \strong{"Horvitz-Thompson"}: \cr The Horvitz-Thompson variance estimator based on
#'     first-order and second-order inclusion probabilities.
#'   \item \strong{"Poisson Horvitz-Thompson"}: \cr The Horvitz-Thompson variance estimator
#'     based on assuming Poisson sampling, with first-order inclusion probabilities
#'     inferred from the sampling probabilities of the survey design object.
#'   \item \strong{"Stratified Multistage SRS"}: \cr The usual stratified multistage variance estimator
#'     based on estimating the variance of cluster totals within strata at each stage.
#'   \item \strong{"Ultimate Cluster"}: \cr The usual variance estimator based on estimating
#'     the variance of first-stage cluster totals within first-stage strata.
#'   \item \strong{"Deville-1"}: \cr A variance estimator for unequal-probability
#'     sampling without replacement, described in Matei and Tillé (2005)
#'     as "Deville 1".
#'   \item \strong{"Deville-2"}: \cr A variance estimator for unequal-probability
#'     sampling without replacement, described in Matei and Tillé (2005) as "Deville 2".
#'   \item \strong{"Deville-Tille": } \cr A variance estimator useful
#'     for balanced sampling designs, proposed by Deville and Tillé (2005).
#'   \item \strong{"SD1"}: \cr The non-circular successive-differences variance estimator described by Ash (2014),
#'     sometimes used for variance estimation for systematic sampling.
#'   \item \strong{"SD2"}:  \cr The circular successive-differences variance estimator described by Ash (2014).
#'     This estimator is the basis of the "successive-differences replication" estimator commonly used
#'     for variance estimation for systematic sampling.
#'   \item \strong{"BOSB"}: \cr The kernel-based variance estimator proposed by
#'     Breidt, Opsomer, and Sanchez-Borrego (2016) for use with systematic samples
#'     or other finely stratified designs. Uses the Epanechnikov kernel
#'     with the bandwidth automatically chosen to result in the smallest possible
#'     nonempty kernel window.
#'   \item\strong{"Beaumont-Emond"}: \cr The variance estimator of Beaumont and Emond (2022)
#'     for multistage unequal-probability sampling without replacement.
#' }
#' @param aux_var_names (Only used if \code{variance_estimator = "Deville-Tille")}.
#' A vector of the names of auxiliary variables used in sampling.
#' @param replicates Number of bootstrap replicates (should be as large as possible, given computer memory/storage limitations).
#' A commonly-recommended default is 500.
#' @param tau Either \code{"auto"}, or a single number; the default value is 1. 
#' This is the rescaling constant
#' used to avoid negative weights through the transformation \eqn{\frac{w + \tau - 1}{\tau}},
#' where \eqn{w} is the original weight and \eqn{\tau} is the rescaling constant \code{tau}. \cr
#' If \code{tau="auto"}, the rescaling factor is determined automatically as follows:
#' if all of the adjustment factors are nonnegative, then \code{tau} is set equal to 1;
#' otherwise, \code{tau} is set to the smallest value needed to rescale
#' the adjustment factors such that they are all at least \code{0.01}.
#' Instead of using \code{tau="auto"}, the user can instead use the function
#' \code{rescale_replicates()} to rescale the replicates later.
#' @param exact_vcov If \code{exact_vcov=TRUE}, the replicate factors will be generated
#' such that variance estimates for totals exactly match the results from the target variance estimator.
#' This requires that \code{num_replicates} exceeds the rank of \code{Sigma}.
#' The replicate factors are generated by applying PCA-whitening to a collection of draws
#' from a multivariate Normal distribution, then applying a coloring transformation
#' to the whitened collection of draws.
#' @param psd_option Either \code{"warn"} (the default) or \code{"error"}.
#' This option specifies what will happen if the target variance estimator
#' has a quadratic form matrix which is not positive semidefinite. This
#' can occasionally happen, particularly for two-phase designs. \cr
#' If \code{psd_option="error"}, then an error message will be displayed. \cr
#' If \code{psd_option="warn"}, then a warning message will be displayed,
#' and the quadratic form matrix will be approximated by the most similar
#' positive semidefinite matrix.
#' This approximation was suggested by Beaumont and Patak (2012),
#' who note that this is conservative in the sense of producing
#' overestimates of variance.
#' Beaumont and Patak (2012) argue that this overestimation is expected to be
#' small in magnitude. See \code{\link[svrep]{get_nearest_psd_matrix}}
#' for details of the approximation.
#' @param compress This reduces the computer memory required to represent the replicate weights and has no
#' impact on estimates.
#' @param mse If \code{TRUE}, compute variances from sums of squares around the point estimate from the full-sample weights.
#' If \code{FALSE}, compute variances from sums of squares around the mean estimate from the replicate weights.
#' @return
#' A replicate design object, with class \code{svyrep.design}, which can be used with the usual functions,
#' such as \code{svymean()} or \code{svyglm()}.
#'
#' Use \code{weights(..., type = 'analysis')} to extract the matrix of replicate weights.
#'
#' Use \code{as_data_frame_with_weights()} to convert the design object to a data frame with columns
#' for the full-sample and replicate weights.
#' @export
#' @seealso Use \code{\link[svrep]{estimate_boot_reps_for_target_cv}} to help choose the number of bootstrap replicates. \cr
#'
#' For greater customization of the method, \code{\link[svrep]{make_quad_form_matrix}} can be used to
#' represent several common variance estimators as a quadratic form's matrix,
#' which can then be used as an input to \code{\link[svrep]{make_gen_boot_factors}}.
#' The function \code{\link[svrep]{rescale_replicates}} is used to implement
#' the rescaling of the bootstrap adjustment factors.
#'
#' See \link[svrep]{variance-estimators} for a
#' description of each variance estimator.
#' @section Statistical Details:
#' Let \eqn{v( \hat{T_y})} be the textbook variance estimator for an estimated population total \eqn{\hat{T}_y} of some variable \eqn{y}.
#' The base weight for case \eqn{i} in our sample is \eqn{w_i}, and we let \eqn{\breve{y}_i} denote the weighted value \eqn{w_iy_i}.
#' Suppose we can represent our textbook variance estimator as a quadratic form: \eqn{v(\hat{T}_y) = \breve{y}\Sigma\breve{y}^T},
#' for some \eqn{n \times n} matrix \eqn{\Sigma}.
#' The only constraint on \eqn{\Sigma} is that, for our sample, it must be symmetric and positive semidefinite.
#'
#' The bootstrapping process creates \eqn{B} sets of replicate weights, where the \eqn{b}-th set of replicate weights is a vector of length \eqn{n} denoted \eqn{\mathbf{a}^{(b)}}, whose \eqn{k}-th value is denoted \eqn{a_k^{(b)}}.
#' This yields \eqn{B} replicate estimates of the population total, \eqn{\hat{T}_y^{*(b)}=\sum_{k \in s} a_k^{(b)} \breve{y}_k}, for \eqn{b=1, \ldots B}, which can be used to estimate sampling variance.
#'
#' \deqn{
#'   v_B\left(\hat{T}_y\right)=\frac{\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2}{B}
#' }
#'
#' This bootstrap variance estimator can be written as a quadratic form:
#'
#'   \deqn{
#'     v_B\left(\hat{T}_y\right) =\mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}
#'   }
#'   where
#'   \deqn{
#'     \boldsymbol{\Sigma}_B = \frac{\sum_{b=1}^B\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}}{B}
#'   }
#'
#' Note that if the vector of adjustment factors \eqn{\mathbf{a}^{(b)}} has expectation \eqn{\mathbf{1}_n} and variance-covariance matrix \eqn{\boldsymbol{\Sigma}},
#' then we have the bootstrap expectation \eqn{E_{*}\left( \boldsymbol{\Sigma}_B \right) = \boldsymbol{\Sigma}}. Since the bootstrap process takes the sample values \eqn{\breve{y}} as fixed, the bootstrap expectation of the variance estimator is \eqn{E_{*} \left( \mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}\right)= \mathbf{\breve{y}}^{\prime}\Sigma \mathbf{\breve{y}}}.
#' Thus, we can produce a bootstrap variance estimator with the same expectation as the textbook variance estimator simply by randomly generating \eqn{\mathbf{a}^{(b)}} from a distribution with the following two conditions:
#' \cr
#'     \strong{Condition 1}: \eqn{\quad \mathbf{E}_*(\mathbf{a})=\mathbf{1}_n}
#' \cr
#'     \strong{Condition 2}: \eqn{\quad \mathbf{E}_*\left(\mathbf{a}-\mathbf{1}_n\right)\left(\mathbf{a}-\mathbf{1}_n\right)^{\prime}=\mathbf{\Sigma}}
#' \cr \cr
#' While there are multiple ways to generate adjustment factors satisfying these conditions,
#' the simplest general method is to simulate from a multivariate normal distribution: \eqn{\mathbf{a} \sim MVN(\mathbf{1}_n, \boldsymbol{\Sigma})}.
#' This is the method used by this function.
#' @section Details on Rescaling to Avoid Negative Adjustment Factors:
#' Let \eqn{\mathbf{A} = \left[ \mathbf{a}^{(1)} \cdots \mathbf{a}^{(b)} \cdots \mathbf{a}^{(B)} \right]} denote the \eqn{(n \times B)} matrix of bootstrap adjustment factors.
#' To eliminate negative adjustment factors, Beaumont and Patak (2012) propose forming a rescaled matrix of nonnegative replicate factors \eqn{\mathbf{A}^S} by rescaling each adjustment factor \eqn{a_k^{(b)}} as follows:
#' \deqn{
#'    a_k^{S,(b)} = \frac{a_k^{(b)} + \tau - 1}{\tau}
#'  }
#' where \eqn{\tau \geq 1 - a_k^{(b)} \geq 1} for all \eqn{k} in \eqn{\left\{ 1,\ldots,n \right\}} and all \eqn{b} in \eqn{\left\{1, \ldots, B\right\}}.
#'
#' The value of \eqn{\tau} can be set based on the realized adjustment factor matrix \eqn{\mathbf{A}} or by choosing \eqn{\tau} prior to generating the adjustment factor matrix \eqn{\mathbf{A}} so that \eqn{\tau} is likely to be large enough to prevent negative bootstrap weights.
#'
#' If the adjustment factors are rescaled in this manner, it is important to adjust the scale factor used in estimating the variance with the bootstrap replicates, which becomes \eqn{\frac{\tau^2}{B}} instead of \eqn{\frac{1}{B}}.
#' \deqn{
#'  \textbf{Prior to rescaling: } v_B\left(\hat{T}_y\right) = \frac{1}{B}\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2
#'  }
#' \deqn{
#'  \textbf{After rescaling: } v_B\left(\hat{T}_y\right) = \frac{\tau^2}{B}\sum_{b=1}^B\left(\hat{T}_y^{S*(b)}-\hat{T}_y\right)^2
#' }
#' When sharing a dataset that uses rescaled weights from a generalized survey bootstrap, the documentation for the dataset should instruct the user to use replication scale factor \eqn{\frac{\tau^2}{B}} rather than \eqn{\frac{1}{B}} when estimating sampling variances.
#' This rescaling method does not affect variance estimates for linear statistics,
#' but its impact on non-smooth statistics such as quantiles is unclear.

#' @section Two-Phase Designs:
#' For a two-phase design, \code{variance_estimator} should be a list of variance estimators' names,
#' with two elements, such as \code{list('Ultimate Cluster', 'Poisson Horvitz-Thompson')}.
#' In two-phase designs, only the following estimators may be used for the second phase:
#' \itemize{
#'   \item "Ultimate Cluster"
#'   \item "Stratified Multistage SRS"
#'   \item "Poisson Horvitz-Thompson"
#' }
#' For statistical details on the handling of two-phase designs,
#' see the documentation for \link[svrep]{make_twophase_quad_form}.
#' @references
#' The generalized survey bootstrap was first proposed by Bertail and Combris (1997).
#' See Beaumont and Patak (2012) for a clear overview of the generalized survey bootstrap.
#' The generalized survey bootstrap represents one strategy for forming replication variance estimators
#' in the general framework proposed by Fay (1984) and Dippo, Fay, and Morganstein (1984).
#' \cr \cr
#' - Ash, S. (2014). "\emph{Using successive difference replication for estimating variances}."
#' \strong{Survey Methodology}, Statistics Canada, 40(1), 47-59.
#' \cr \cr
#' - Bellhouse, D.R. (1985). "\emph{Computing Methods for Variance Estimation in Complex Surveys}."
#' \strong{Journal of Official Statistics}, Vol.1, No.3.
#' \cr \cr
#' - Breidt, F. J., Opsomer, J. D., & Sanchez-Borrego, I. (2016). 
#' "\emph{Nonparametric Variance Estimation Under Fine Stratification: An Alternative to Collapsed Strata}." 
#' \strong{Journal of the American Statistical Association}, 111(514), 822-833. https://doi.org/10.1080/01621459.2015.1058264
#' \cr \cr
#' - Beaumont, Jean-François, and Zdenek Patak. 2012. "On the Generalized Bootstrap for Sample Surveys with Special Attention to Poisson Sampling: Generalized Bootstrap for Sample Surveys." International Statistical Review 80 (1): 127-48. https://doi.org/10.1111/j.1751-5823.2011.00166.x.
#' \cr \cr
#' - Bertail, and Combris. 1997. "Bootstrap Généralisé d'un Sondage." Annales d'Économie Et de Statistique, no. 46: 49. https://doi.org/10.2307/20076068.
#' \cr \cr
#' - Deville, J. C., and Tillé, Y. (2005). "\emph{Variance approximation under balanced sampling.}"
#' \strong{Journal of Statistical Planning and Inference}, 128, 569-591.
#' \cr \cr
#' - Dippo, Cathryn, Robert Fay, and David Morganstein. 1984. "Computing Variances from Complex Samples with Replicate Weights." In, 489-94. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_094.pdf.
#' \cr \cr
#' - Fay, Robert. 1984. "Some Properties of Estimates of Variance Based on Replication Methods." In, 495-500. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_095.pdf.
#' \cr \cr
#' - Matei, Alina, and Yves Tillé. (2005).
#' "\emph{Evaluation of Variance Approximations and Estimators
#' in Maximum Entropy Sampling with Unequal Probability and Fixed Sample Size.}"
#' \strong{Journal of Official Statistics}, 21(4):543-70.
#' @export
#'
#' @examples
#' \donttest{
#' library(survey)
#'
#'# Example 1: Bootstrap based on the Yates-Grundy estimator ----
#'    set.seed(2014)
#'
#'    data('election', package = 'survey')
#'
#'    ## Create survey design object
#'    pps_design_yg <- svydesign(
#'      data = election_pps,
#'      id = ~1, fpc = ~p,
#'      pps = ppsmat(election_jointprob),
#'      variance = "YG"
#'    )
#'
#'    ## Convert to generalized bootstrap replicate design
#'    gen_boot_design_yg <- pps_design_yg |>
#'      as_gen_boot_design(variance_estimator = "Yates-Grundy",
#'                         replicates = 1000, tau = "auto")
#'
#'    svytotal(x = ~ Bush + Kerry, design = pps_design_yg)
#'    svytotal(x = ~ Bush + Kerry, design = gen_boot_design_yg)
#'
#'# Example 2: Bootstrap based on the successive-difference estimator ----
#'
#'    data('library_stsys_sample', package = 'svrep')
#'
#'    ## First, ensure data are sorted in same order as was used in sampling
#'    library_stsys_sample <- library_stsys_sample |>
#'      sort_by(~ SAMPLING_SORT_ORDER)
#'
#'    ## Create a survey design object
#'    design_obj <- svydesign(
#'      data = library_stsys_sample,
#'      strata = ~ SAMPLING_STRATUM,
#'      ids = ~ 1,
#'      fpc = ~ STRATUM_POP_SIZE
#'    )
#'
#'    ## Convert to generalized bootstrap replicate design
#'    gen_boot_design_sd2 <- as_gen_boot_design(
#'      design = design_obj,
#'      variance_estimator = "SD2",
#'      replicates = 2000
#'    )
#'
#'    ## Estimate sampling variances
#'    svytotal(x = ~ TOTSTAFF, na.rm = TRUE, design = gen_boot_design_sd2)
#'    svytotal(x = ~ TOTSTAFF, na.rm = TRUE, design = design_obj)
#'
#' # Example 3: Two-phase sample ----
#' # -- First stage is stratified systematic sampling,
#' # -- second stage is response/nonresponse modeled as Poisson sampling
#'
#'   nonresponse_model <- glm(
#'     data = library_stsys_sample,
#'     family = quasibinomial('logit'),
#'     formula = I(RESPONSE_STATUS == "Survey Respondent") ~ 1,
#'     weights = 1/library_stsys_sample$SAMPLING_PROB
#'   )
#'
#'   library_stsys_sample[['RESPONSE_PROPENSITY']] <- predict(
#'     nonresponse_model,
#'     newdata = library_stsys_sample,
#'     type = "response"
#'   )
#'
#'   twophase_design <- twophase(
#'     data = library_stsys_sample,
#'     # Identify cases included in second phase sample
#'     subset = ~ I(RESPONSE_STATUS == "Survey Respondent"),
#'     strata = list(~ SAMPLING_STRATUM, NULL),
#'     id = list(~ 1, ~ 1),
#'     probs = list(NULL, ~ RESPONSE_PROPENSITY),
#'     fpc = list(~ STRATUM_POP_SIZE, NULL),
#'   )
#'
#'   twophase_boot_design <- as_gen_boot_design(
#'     design = twophase_design,
#'     variance_estimator = list(
#'       "SD2", "Poisson Horvitz-Thompson"
#'     )
#'   )
#'
#'   svytotal(x = ~ LIBRARIA, design = twophase_boot_design)
#'
#' }
as_gen_boot_design <- function(design, variance_estimator = NULL,
                               aux_var_names = NULL,
                               replicates = 500, tau = 1, exact_vcov = FALSE,
                               psd_option = "warn",
                               mse = getOption("survey.replicates.mse"),
                               compress = TRUE) {
  UseMethod("as_gen_boot_design", design)
}

#' @export
as_gen_boot_design.twophase2 <- function(design, variance_estimator = NULL,
                                         aux_var_names = NULL,
                                         replicates = 500, tau = 1,
                                         exact_vcov = FALSE, psd_option = "warn",
                                         mse = getOption("survey.replicates.mse"),
                                         compress = TRUE) {

  Sigma <- get_design_quad_form(design, variance_estimator, aux_var_names = aux_var_names)

  if (!is_psd_matrix(Sigma)) {
    problem_msg <- paste0(
      "The sample quadratic form matrix",
      " for this design and variance estimator",
      " is not positive semidefinite."
    )
    if (psd_option == "warn") {

      warning_msg <- paste0(
        problem_msg,
        " It will be approximated by the nearest",
        " positive semidefinite matrix."
      )
      warning(warning_msg)
      Sigma <- get_nearest_psd_matrix(Sigma)

    } else {
      error_msg <- paste0(
        problem_msg,
        " This can be handled using the `psd_option` argument."
      )
      stop(error_msg)
    }
  }

  adjustment_factors <- make_gen_boot_factors(
    Sigma = Sigma,
    num_replicates = replicates,
    tau = tau,
    exact_vcov = exact_vcov
  )

  rep_design <- survey::svrepdesign(
    variables = design$phase1$full$variables[design$subset,,drop=FALSE],
    weights = stats::weights(design, type = "sampling"),
    repweights = as.matrix(adjustment_factors),
    combined.weights = FALSE,
    compress = compress, mse = mse,
    scale = attr(adjustment_factors, 'scale'),
    rscales = attr(adjustment_factors, 'rscales'),
    type = "other"
  )

  rep_design$tau <- attr(adjustment_factors, 'tau')

  if (inherits(design, 'tbl_svy') && ('package:srvyr' %in% search())) {
    rep_design <- srvyr::as_survey_rep(
      rep_design
    )
  }

  rep_design$call <- sys.call(which = -1)

  return(rep_design)
}

#' @export
as_gen_boot_design.survey.design <- function(design, variance_estimator = NULL,
                                             aux_var_names = NULL,
                                             replicates = 500, tau = 1, exact_vcov = FALSE,
                                             psd_option = 'warn',
                                             mse = getOption("survey.replicates.mse"),
                                             compress = TRUE) {

  # Produce a (potentially) compressed survey design object
  compressed_design_structure <- compress_design(design, vars_to_keep = aux_var_names)

  # Get the quadratic form of the variance estimator,
  # for the compressed design object
  Sigma <- get_design_quad_form(
    compressed_design_structure$design_subset,
    variance_estimator,
    aux_var_names = aux_var_names
  )

  # Check that the matrix is positive semidefinite
  if (!is_psd_matrix(Sigma)) {
    problem_msg <- "The sample quadratic form matrix for this design and variance estimator is not positive semidefinite."
    if (psd_option == "warn") {

      warning_msg <- paste0(
        problem_msg,
        " It will be approximated by the nearest positive semidefinite matrix."
      )
      warning(warning_msg)
      Sigma <- get_nearest_psd_matrix(Sigma)

    } else {
      error_msg <- paste0(
        problem_msg, " This can be handled using the `psd_option` argument."
      )
      stop(error_msg)
    }
  }

  # Generate adjustment factors for the compressed design object
  adjustment_factors <- make_gen_boot_factors(
    Sigma = Sigma,
    num_replicates = replicates,
    tau = tau,
    exact_vcov = exact_vcov
  )

  tau <- attr(adjustment_factors, 'tau')
  scale <- attr(adjustment_factors, 'scale')
  rscales <- attr(adjustment_factors, 'rscales')

  # Uncompress the adjustment factors
  adjustment_factors <- distribute_matrix_across_clusters(
    cluster_level_matrix = adjustment_factors,
    cluster_ids = compressed_design_structure$index,
    rows = TRUE, cols = FALSE
  )

  attr(adjustment_factors, 'tau') <- tau
  attr(adjustment_factors, 'scale') <- scale
  attr(adjustment_factors, 'rscales') <- rscales

  # Create the survey design object
  rep_design <- survey::svrepdesign(
    variables = design$variables,
    weights = stats::weights(design, type = "sampling"),
    repweights = as.matrix(adjustment_factors),
    combined.weights = FALSE,
    compress = compress, mse = mse,
    scale = scale,
    rscales = rscales,
    type = "other"
  )

  rep_design$tau <- tau

  if (inherits(design, 'tbl_svy') && ('package:srvyr' %in% search())) {
    rep_design <- srvyr::as_survey_rep(
      rep_design
    )
  }

  rep_design$call <- sys.call(which = -1)

  return(rep_design)
}

#' @export
as_gen_boot_design.DBIsvydesign <- function(design, variance_estimator = NULL,
                                            aux_var_names = NULL,
                                            replicates = 500, tau = 1,
                                            exact_vcov = FALSE, psd_option = "warn",
                                            mse = getOption("survey.replicates.mse"),
                                            compress = TRUE) {

  # Produce a (potentially) compressed survey design object
  compressed_design_structure <- compress_design(design, vars_to_keep = aux_var_names)

  # Get the quadratic form of the variance estimator,
  # for the compressed design object
  Sigma <- get_design_quad_form(
    compressed_design_structure$design_subset,
    variance_estimator,
    aux_var_names = aux_var_names
  )

  # Check that the matrix is positive semidefinite
  if (!is_psd_matrix(Sigma)) {
    problem_msg <- "The sample quadratic form matrix for this design and variance estimator is not positive semidefinite."
    if (psd_option == "warn") {

      warning_msg <- paste0(
        problem_msg,
        " It will be approximated by the nearest positive semidefinite matrix."
      )
      warning(warning_msg)
      Sigma <- get_nearest_psd_matrix(Sigma)

    } else {
      error_msg <- paste0(
        problem_msg, " This can be handled using the `psd_option` argument."
      )
      stop(error_msg)
    }
  }

  # Generate adjustment factors for the compressed design object
  adjustment_factors <- make_gen_boot_factors(
    Sigma = Sigma,
    num_replicates = replicates,
    tau = tau,
    exact_vcov = exact_vcov
  )

  tau <- attr(adjustment_factors, 'tau')
  scale <- attr(adjustment_factors, 'scale')
  rscales <- attr(adjustment_factors, 'rscales')

  # Uncompress the adjustment factors
  adjustment_factors <- distribute_matrix_across_clusters(
    cluster_level_matrix = adjustment_factors,
    cluster_ids = compressed_design_structure$index,
    rows = TRUE, cols = FALSE
  )

  attr(adjustment_factors, 'tau') <- tau
  attr(adjustment_factors, 'scale') <- scale
  attr(adjustment_factors, 'rscales') <- rscales

  # Create the survey design object
  rep_design <- survey::svrepdesign(
    data = data.frame(DUMMY = seq_len(nrow(adjustment_factors))),
    weights = stats::weights(design, type = "sampling"),
    repweights = as.matrix(adjustment_factors),
    combined.weights = FALSE,
    compress = compress, mse = mse,
    scale = scale,
    rscales = rscales,
    type = "other"
  )
  rep_design$tau <- tau

  # Replace 'variables' with a database connection
  # and make the object have the appropriate class
  rep_design$variables <- NULL
  if (design$db$dbtype == "ODBC") {
    stop("'RODBC' no longer supported. Use the odbc package")
  } else {
    db <- DBI::dbDriver(design$db$dbtype)
    dbconn <- DBI::dbConnect(db, design$db$dbname)
  }
  rep_design$db <- list(
    dbname = design$db$dbname, tablename = design$db$tablename,
    connection = dbconn,
    dbtype = design$db$dbtype
  )
  class(rep_design) <- c("DBIrepdesign", "DBIsvydesign", class(rep_design))

  if (inherits(design, 'tbl_svy') && ('package:srvyr' %in% search())) {
    rep_design <- srvyr::as_survey_rep(
      rep_design
    )
  }

  rep_design$call <- sys.call(which = -1)

  return(rep_design)

}
bschneidr/svrep documentation built on June 14, 2025, 10 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
bschneidr/svrep
Tools for Creating, Updating, and Analyzing Survey Replicate Weights

R/generalized_bootstrap.R
In bschneidr/svrep: Tools for Creating, Updating, and Analyzing Survey Replicate Weights

Defines functions as_gen_boot_design.DBIsvydesign as_gen_boot_design.survey.design as_gen_boot_design.twophase2 as_gen_boot_design make_gen_boot_factors

Documented in as_gen_boot_design make_gen_boot_factors

R Package Documentation

Browse R Packages

We want your feedback!

bschneidr/svrep Tools for Creating, Updating, and Analyzing Survey Replicate Weights

R/generalized_bootstrap.R In bschneidr/svrep: Tools for Creating, Updating, and Analyzing Survey Replicate Weights

Defines functions as_gen_boot_design.DBIsvydesign as_gen_boot_design.survey.design as_gen_boot_design.twophase2 as_gen_boot_design make_gen_boot_factors

Documented in as_gen_boot_design make_gen_boot_factors

R Package Documentation

Browse R Packages

We want your feedback!

bschneidr/svrep
Tools for Creating, Updating, and Analyzing Survey Replicate Weights

R/generalized_bootstrap.R
In bschneidr/svrep: Tools for Creating, Updating, and Analyzing Survey Replicate Weights