boot_auto_mrp: Bootstrapping wrapper for auto_mrp

View source: R/boot_auto_mrp.R

boot_auto_mrp R Documentation

Bootstrapping wrapper for auto_mrp

Description

boot_auto_mrp estimates uncertainty for auto_mrp via bootstrapping.

Usage

boot_auto_mrp(
  y,
  L1.x,
  L2.x,
  mrp.L2.x,
  L2.unit,
  L2.reg,
  L2.x.scale,
  pcs,
  folds,
  bin.proportion,
  bin.size,
  survey,
  census,
  ebma.size,
  k.folds,
  cv.sampling,
  loss.unit,
  loss.fun,
  best.subset,
  lasso,
  pca,
  gb,
  svm,
  mrp,
  deep.mrp,
  best.subset.L2.x,
  lasso.L2.x,
  pca.L2.x,
  pc.names,
  gb.L2.x,
  svm.L2.x,
  svm.L2.unit,
  svm.L2.reg,
  gb.L2.unit,
  gb.L2.reg,
  deep.L2.x,
  deep.L2.reg,
  deep.splines,
  lasso.lambda,
  lasso.n.iter,
  gb.interaction.depth,
  gb.shrinkage,
  gb.n.trees.init,
  gb.n.trees.increase,
  gb.n.trees.max,
  gb.n.minobsinnode,
  svm.kernel,
  svm.gamma,
  svm.cost,
  ebma.tol,
  boot.iter,
  cores
)

Arguments

y

Outcome variable. A character scalar containing the column name of the outcome variable in survey.

L1.x

Individual-level covariates. A character vector containing the column names of the individual-level variables in survey and census used to predict outcome y. Note that geographic unit is specified in argument L2.unit.

L2.x

Context-level covariates. A character vector containing the column names of the context-level variables in survey and census used to predict outcome y. To exclude context-level variables, set L2.x = NULL.

mrp.L2.x

MRP context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the MRP classifier. The character vector should be empty if no context-level variables should be used by the MRP classifier. If NULL and mrp is set to TRUE, then MRP uses the variables specified in L2.x. Default is NULL. Note: for the empty MRP model, set L2.x = NULL and mrp.L2.x = "".

L2.unit

Geographic unit. A character scalar containing the column name of the geographic unit in survey and census at which outcomes should be aggregated.

L2.reg

Geographic region. A character scalar containing the column name of the geographic region in survey and census by which geographic units are grouped (L2.unit must be nested within L2.reg). Default is NULL.

L2.x.scale

Scale context-level covariates. A logical argument indicating whether the context-level covariates should be normalized. Default is TRUE. Note that if set to FALSE, then the context-level covariates should be normalized prior to calling auto_MrP().

pcs

Principal components. A character vector containing the column names of the principal components of the context-level variables in survey and census. Default is NULL.

folds

EBMA and cross-validation folds. A character scalar containing the column name of the variable in survey that specifies the fold to which an observation is allocated. The variable should contain integers running from 1 to k + 1, where k is the number of cross-validation folds. Value k + 1 refers to the EBMA fold. Default is NULL. Note: if folds is NULL, then ebma.size, k.folds, and cv.sampling must be specified.
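
The coding scheme for folds can be illustrated with a small hand-built example. This is only a sketch: the data frame survey_toy, the column name fold, and the purely random assignment of respondents are assumptions made for illustration, not the package's own fold-creation routine.

```r
set.seed(1)
k <- 5L                                     # number of cross-validation folds
n <- 300L                                   # hypothetical number of respondents
survey_toy <- data.frame(id = seq_len(n))   # hypothetical survey data

# Allocate one third of respondents to the EBMA fold, coded k + 1
ebma_idx <- sample(n, size = round(n / 3))
survey_toy$fold <- NA_integer_
survey_toy$fold[ebma_idx] <- k + 1L

# Spread the remaining respondents evenly over cross-validation folds 1..k
cv_idx <- setdiff(seq_len(n), ebma_idx)
survey_toy$fold[cv_idx] <- sample(rep_len(seq_len(k), length(cv_idx)))

table(survey_toy$fold)
```

With a column like this in survey, the argument would be given as folds = "fold".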

bin.proportion

Proportion of ideal types. A character scalar containing the column name of the variable in census that indicates the proportion of individuals by ideal type and geographic unit. Default is NULL. Note: if bin.proportion is NULL, then bin.size must be specified.

bin.size

Bin size of ideal types. A character scalar containing the column name of the variable in census that indicates the bin size of ideal types by geographic unit. Default is NULL. Note: ignored if bin.proportion is provided, but must be specified otherwise.
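
The relationship between the two census columns can be sketched as follows, assuming that the proportion of an ideal type is its bin size divided by the total bin size of its geographic unit. The data frame census_toy and its column names are hypothetical.

```r
# Hypothetical census: one row per ideal type within each geographic unit
census_toy <- data.frame(
  state = c("A", "A", "A", "B", "B"),
  age   = c(1, 2, 3, 1, 2),
  n     = c(50, 30, 20, 10, 30)    # bin size of each ideal type
)

# Proportion of each ideal type within its geographic unit
# (assumed definition: bin size / total bin size of the unit)
census_toy$prop <- census_toy$n / ave(census_toy$n, census_toy$state, FUN = sum)

census_toy
```

Under this assumption, either bin.size = "n" or bin.proportion = "prop" would describe the same census.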

survey

Survey data. A data.frame whose column names include y, L1.x, L2.x, L2.unit, and, if specified, L2.reg, pcs, and folds.

census

Census data. A data.frame whose column names include L1.x, L2.x, L2.unit, and, if specified, L2.reg and pcs, as well as either bin.proportion or bin.size.

ebma.size

EBMA fold size. A number in the open unit interval indicating the proportion of respondents to be allocated to the EBMA fold. Default is 1/3. Note: ignored if folds is provided, but must be specified otherwise.

k.folds

Number of cross-validation folds. An integer-valued scalar indicating the number of folds to be used in cross-validation. Default is 5. Note: ignored if folds is provided, but must be specified otherwise.

cv.sampling

Cross-validation sampling method. A character-valued scalar indicating whether cross-validation folds should be created by sampling individual respondents (individuals) or geographic units (L2 units). Default is L2 units. Note: ignored if folds is provided, but must be specified otherwise.

loss.unit

Loss function unit. A character-valued scalar or vector indicating whether performance loss should be evaluated at the level of individual respondents (individuals), geographic units (L2 units), or both. Default is c("individuals", "L2 units"). With multiple loss units, parameter combinations are ranked for each loss unit and the combination with the lowest rank sum is chosen. Ties are broken according to the order in the search grid.

loss.fun

Loss function. A character-valued scalar or vector indicating whether prediction loss should be measured by the mean squared error (MSE), the mean absolute error (MAE), binary cross-entropy (cross-entropy), mean squared false error (msfe), the F1 score (f1), or a combination thereof. Default is c("MSE", "cross-entropy", "msfe", "f1"). With multiple loss functions, parameter combinations are ranked for each loss function and the combination with the lowest rank sum is chosen. Ties are broken according to the order in the search grid.
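
The rank-sum rule with grid-order tie-breaking can be sketched as below. The loss values are invented, the column names are hypothetical, and the sketch assumes that lower values are better for every criterion; it is not the package's internal code.

```r
# Hypothetical losses for four parameter combinations (rows follow grid order)
losses <- data.frame(
  mse           = c(0.20, 0.18, 0.25, 0.19),
  cross_entropy = c(0.55, 0.65, 0.60, 0.58)
)

# Rank the combinations separately under each loss (1 = lowest loss)
ranks <- sapply(losses, rank, ties.method = "min")

# Sum the ranks across losses; the smallest sum wins
rank_sum <- rowSums(ranks)

# which.min() returns the first minimum, so ties go to the earlier grid row
best <- which.min(rank_sum)
best
```

Here rows 1 and 4 share the lowest rank sum, and row 1 is selected because it comes first in the grid.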

best.subset

Best subset classifier. A logical argument indicating whether the best subset classifier should be used for predicting outcome y. Default is TRUE.

lasso

Lasso classifier. A logical argument indicating whether the lasso classifier should be used for predicting outcome y. Default is TRUE.

pca

PCA classifier. A logical argument indicating whether the PCA classifier should be used for predicting outcome y. Default is TRUE.

gb

GB classifier. A logical argument indicating whether the GB classifier should be used for predicting outcome y. Default is TRUE.

svm

SVM classifier. A logical argument indicating whether the SVM classifier should be used for predicting outcome y. Default is TRUE.

mrp

MRP classifier. A logical argument indicating whether the standard MRP classifier should be used for predicting outcome y. Default is FALSE.

deep.mrp

Deep MRP classifier. A logical argument indicating whether the deep MRP classifier should be used for predicting outcome y. Default is FALSE.

best.subset.L2.x

Best subset context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the best subset classifier. If NULL and best.subset is set to TRUE, then best subset uses the variables specified in L2.x. Default is NULL.

lasso.L2.x

Lasso context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the lasso classifier. If NULL and lasso is set to TRUE, then lasso uses the variables specified in L2.x. Default is NULL.

pca.L2.x

PCA context-level covariates. A character vector containing the column names of the context-level variables in survey and census whose principal components are to be used by the PCA classifier. If NULL and pca is set to TRUE, then PCA uses the principal components of the variables specified in L2.x. Default is NULL.

pc.names

A character vector of the principal component variable names in the data.

gb.L2.x

GB context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the GB classifier. If NULL and gb is set to TRUE, then GB uses the variables specified in L2.x. Default is NULL.

svm.L2.x

SVM context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the SVM classifier. If NULL and svm is set to TRUE, then SVM uses the variables specified in L2.x. Default is NULL.

svm.L2.unit

SVM L2.unit. A logical argument indicating whether L2.unit should be included in the SVM classifier. Default is FALSE.

svm.L2.reg

SVM L2.reg. A logical argument indicating whether L2.reg should be included in the SVM classifier. Default is FALSE.

gb.L2.unit

GB L2.unit. A logical argument indicating whether L2.unit should be included in the GB classifier. Default is FALSE.

gb.L2.reg

GB L2.reg. A logical argument indicating whether L2.reg should be included in the GB classifier. Default is FALSE.

deep.L2.x

Deep MRP context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the deep MRP classifier. If NULL and deep.mrp is set to TRUE, then deep MRP uses the variables specified in L2.x. Default is NULL.

deep.L2.reg

Deep MRP L2.reg. A logical argument indicating whether L2.reg should be included in the deep MRP classifier. Default is TRUE.

deep.splines

Deep MRP splines. A logical argument indicating whether splines should be used in the deep MRP classifier. Default is TRUE.

lasso.lambda

Lasso penalty parameter. A numeric vector of non-negative values. The penalty parameter controls the shrinkage of the context-level variables in the lasso model. Default is a sequence with minimum 0.1 and maximum 250 that is equally spaced on the log-scale. The number of values is controlled by the lasso.n.iter parameter.

lasso.n.iter

Lasso number of lambda values. An integer-valued scalar specifying the number of lambda values to search over. Default is 100. Note: ignored if a vector of lasso.lambda values is provided.
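
A sequence that is equally spaced on the log-scale, as described for lasso.lambda, can be generated as in the following sketch. The variable names are hypothetical, and the package may construct its default grid differently.

```r
lasso_n_iter <- 100   # number of lambda values

# Equal spacing on the log-scale: space the logs evenly, then exponentiate
lambda_grid <- exp(seq(log(0.1), log(250), length.out = lasso_n_iter))

range(lambda_grid)    # endpoints are 0.1 and 250
```

Consecutive values in such a grid differ by a constant ratio rather than a constant difference, so small penalties are searched more densely than large ones.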

gb.interaction.depth

GB interaction depth. An integer-valued vector whose values specify the interaction depth of GB. The interaction depth defines the maximum depth of each tree grown (i.e., the maximum level of variable interactions). Default is c(1, 2, 3).

gb.shrinkage

GB learning rate. A numeric vector whose values specify the learning rate or step-size reduction of GB. Values between 0.001 and 0.1 usually work, but a smaller learning rate typically requires more trees. Default is c(0.04, 0.01, 0.008, 0.005, 0.001).

gb.n.trees.init

GB initial total number of trees. An integer-valued scalar specifying the initial number of total trees to fit by GB. Default is 50.

gb.n.trees.increase

GB increase in total number of trees. An integer-valued scalar specifying by how many trees the total number of trees to fit should be increased (until gb.n.trees.max is reached). Default is 50.

gb.n.trees.max

GB maximum number of trees. An integer-valued scalar specifying the maximum number of trees to fit by GB. Default is 1000.
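
Taken together, the three tree-count arguments describe a schedule of candidate totals. A minimal sketch with the default values, assuming the total simply grows by gb.n.trees.increase from gb.n.trees.init until gb.n.trees.max is reached (variable names are hypothetical):

```r
gb_n_trees_init     <- 50
gb_n_trees_increase <- 50
gb_n_trees_max      <- 1000

# Candidate totals: 50, 100, ..., 1000
n_trees <- seq(from = gb_n_trees_init, to = gb_n_trees_max,
               by = gb_n_trees_increase)

length(n_trees)
```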

gb.n.minobsinnode

GB minimum number of observations in the terminal nodes. An integer-valued scalar specifying the minimum number of observations that each terminal node of the trees must contain. Default is 20.

svm.kernel

SVM kernel. A character-valued scalar specifying the kernel to be used by SVM. The possible values are linear, polynomial, radial, and sigmoid. Default is radial.

svm.gamma

SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log-scale.

svm.cost

SVM cost parameter. A numeric vector whose values specify the cost of constraints violation in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log-scale.
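
The default svm.gamma and svm.cost sequences can be reproduced approximately as follows. The helper log_seq and the use of a full Cartesian grid are assumptions made for illustration; the sketch only shows how many gamma and cost pairs such a grid would contain.

```r
# Hypothetical helper: sequence equally spaced on the log-scale
log_seq <- function(min, max, length) {
  exp(seq(log(min), log(max), length.out = length))
}

svm_gamma <- log_seq(1e-5, 1e-1, 20)
svm_cost  <- log_seq(0.5, 10, 5)

# All gamma and cost pairs (assumed full grid)
svm_grid <- expand.grid(gamma = svm_gamma, cost = svm_cost)

nrow(svm_grid)
```

With the defaults this yields 20 x 5 = 100 candidate pairs, which is worth keeping in mind because every pair is refit in each bootstrap iteration if tuning is repeated.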

ebma.tol

EBMA tolerance. A numeric vector containing the tolerance values for improvements in the log-likelihood before the EM algorithm stops optimization. Values should range at least from 0.01 to 0.001. Default is c(0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001).

boot.iter

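Number of bootstrap iterations. An integer-valued scalar specifying the number of bootstrap iterations to compute. The uncertainty argument referred to here belongs to auto_MrP() and is not part of this wrapper's signature: boot.iter is ignored unless uncertainty = TRUE. Default is 200 if uncertainty = TRUE and NULL if uncertainty = FALSE.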
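
The general idea behind boot.iter can be sketched with a plain nonparametric bootstrap of a sample mean. This is a generic illustration on simulated data, not the resampling scheme used by boot_auto_mrp; all names below are hypothetical.

```r
set.seed(42)
y <- rbinom(500, size = 1, prob = 0.4)   # simulated binary outcome
boot_iter <- 200                         # number of bootstrap iterations

# Each iteration resamples the data with replacement and re-estimates the mean
boot_est <- replicate(boot_iter, mean(sample(y, replace = TRUE)))

# Percentile interval from the bootstrap distribution
ci <- quantile(boot_est, probs = c(0.025, 0.975))
ci
```

More iterations give a smoother bootstrap distribution at a proportionally higher computational cost, which is where the cores argument becomes relevant.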

cores

The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1.


autoMrP documentation built on May 29, 2024, 6:40 a.m.