auto_MrP | R Documentation |
This package improves the prediction performance of multilevel regression with post-stratification (MrP) by combining a number of machine learning methods through ensemble Bayesian model averaging (EBMA).
auto_MrP(
y,
L1.x,
L2.x,
L2.unit,
L2.reg = NULL,
L2.x.scale = TRUE,
pcs = NULL,
folds = NULL,
bin.proportion = NULL,
bin.size = NULL,
survey,
census,
ebma.size = 1/3,
cores = 1,
k.folds = 5,
cv.sampling = "L2 units",
loss.unit = c("individuals", "L2 units"),
loss.fun = c("msfe", "cross-entropy", "f1", "MSE"),
best.subset = TRUE,
lasso = TRUE,
pca = TRUE,
gb = TRUE,
svm = TRUE,
mrp = FALSE,
deep.mrp = FALSE,
oversampling = FALSE,
best.subset.L2.x = NULL,
lasso.L2.x = NULL,
pca.L2.x = NULL,
gb.L2.x = NULL,
svm.L2.x = NULL,
mrp.L2.x = NULL,
gb.L2.unit = TRUE,
gb.L2.reg = FALSE,
svm.L2.unit = TRUE,
svm.L2.reg = FALSE,
deep.L2.x = NULL,
deep.L2.reg = TRUE,
deep.splines = TRUE,
lasso.lambda = NULL,
lasso.n.iter = 100,
gb.interaction.depth = c(1, 2, 3),
gb.shrinkage = c(0.04, 0.01, 0.008, 0.005, 0.001),
gb.n.trees.init = 50,
gb.n.trees.increase = 50,
gb.n.trees.max = 1000,
gb.n.minobsinnode = 20,
svm.kernel = c("radial"),
svm.gamma = NULL,
svm.cost = NULL,
ebma.n.draws = 100,
ebma.tol = c(0.01, 0.005, 0.001, 5e-04, 1e-04, 5e-05, 1e-05),
verbose = FALSE,
uncertainty = FALSE,
boot.iter = NULL
)
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
L2.x.scale |
Scale context-level covariates. A logical argument
indicating whether the context-level covariates should be normalized.
Default is |
pcs |
Principal components. A character vector containing the column
names of the principal components of the context-level variables in
|
folds |
EBMA and cross-validation folds. A character scalar containing
the column name of the variable in |
bin.proportion |
Proportion of ideal types. A character scalar
containing the column name of the variable in |
bin.size |
Bin size of ideal types. A character scalar containing the
column name of the variable in |
survey |
Survey data. A |
census |
Census data. A |
ebma.size |
EBMA fold size. A number in the open unit interval
indicating the proportion of respondents to be allocated to the EBMA fold.
Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
k.folds |
Number of cross-validation folds. An integer-valued scalar
indicating the number of folds to be used in cross-validation. Default is
|
cv.sampling |
Cross-validation sampling method. A character-valued
scalar indicating whether cross-validation folds should be created by
sampling individual respondents ( |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
best.subset |
Best subset classifier. A logical argument indicating
whether the best subset classifier should be used for predicting outcome
|
lasso |
Lasso classifier. A logical argument indicating whether the
lasso classifier should be used for predicting outcome |
pca |
PCA classifier. A logical argument indicating whether the PCA
classifier should be used for predicting outcome |
gb |
GB classifier. A logical argument indicating whether the GB
classifier should be used for predicting outcome |
svm |
SVM classifier. A logical argument indicating whether the SVM
classifier should be used for predicting outcome |
mrp |
MRP classifier. A logical argument indicating whether the standard
MRP classifier should be used for predicting outcome |
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for predicting outcome |
oversampling |
Oversample to create balance on the dependent variable.
A logical argument. Default is |
best.subset.L2.x |
Best subset context-level covariates. A character
vector containing the column names of the context-level variables in
|
lasso.L2.x |
Lasso context-level covariates. A character vector
containing the column names of the context-level variables in
|
pca.L2.x |
PCA context-level covariates. A character vector containing
the column names of the context-level variables in |
gb.L2.x |
GB context-level covariates. A character vector containing the
column names of the context-level variables in |
svm.L2.x |
SVM context-level covariates. A character vector containing
the column names of the context-level variables in |
mrp.L2.x |
MRP context-level covariates. A character vector containing
the column names of the context-level variables in |
gb.L2.unit |
GB L2.unit. A logical argument indicating whether
|
gb.L2.reg |
GB L2.reg. A logical argument indicating whether
|
svm.L2.unit |
SVM L2.unit. A logical argument indicating whether
|
svm.L2.reg |
SVM L2.reg. A logical argument indicating whether
|
deep.L2.x |
Deep MRP context-level covariates. A character vector
containing the column names of the context-level variables in |
deep.L2.reg |
Deep MRP L2.reg. A logical argument indicating whether
|
deep.splines |
Deep MRP splines. A logical argument indicating whether
splines should be used in the deep MRP classifier. Default is |
lasso.lambda |
Lasso penalty parameter. A numeric |
lasso.n.iter |
Lasso number of lambda values. An integer-valued scalar
specifying the number of lambda values to search over. Default is
|
gb.interaction.depth |
GB interaction depth. An integer-valued vector
whose values specify the interaction depth of GB. The interaction depth
defines the maximum depth of each tree grown (i.e., the maximum level of
variable interactions). Default is |
gb.shrinkage |
GB learning rate. A numeric vector whose values specify
the learning rate or step-size reduction of GB. Values between |
gb.n.trees.init |
GB initial total number of trees. An integer-valued
scalar specifying the initial number of total trees to fit by GB. Default
is |
gb.n.trees.increase |
GB increase in total number of trees. An
integer-valued scalar specifying by how many trees the total number of
trees to fit should be increased (until |
gb.n.trees.max |
GB maximum number of trees. An integer-valued scalar
specifying the maximum number of trees to fit by GB. Default is |
gb.n.minobsinnode |
GB minimum number of observations in the terminal
nodes. An integer-valued scalar specifying the minimum number of
observations that each terminal node of the trees must contain. Default is
|
svm.kernel |
SVM kernel. A character-valued scalar specifying the kernel
to be used by SVM. The possible values are |
svm.gamma |
SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log-scale. |
svm.cost |
SVM cost parameter. A numeric vector whose values specify the cost of constraint violations in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log-scale. |
ebma.n.draws |
EBMA number of samples. An integer-valued scalar
specifying the number of bootstrapped samples to be drawn from the EBMA
fold and used for tuning EBMA. Default is |
ebma.tol |
EBMA tolerance. A numeric vector containing the
tolerance values for improvements in the log-likelihood before the EM
algorithm stops optimization. Values should range at least from |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
uncertainty |
Uncertainty estimates. A logical argument indicating
whether uncertainty estimates should be computed. Default is |
boot.iter |
Number of bootstrap iterations. An integer argument
indicating the number of bootstrap iterations to be computed. Will be
ignored unless |
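The default tuning grids described for svm.gamma, svm.cost and the GB tree counts can be reproduced in base R. The following is a minimal sketch, not autoMrP's internal code. It assumes that "equally spaced on the log-scale" means equal steps in log(x), and that the candidate tree counts run from gb.n.trees.init in steps of gb.n.trees.increase up to gb.n.trees.max. The helper log_grid() and all variable names are hypothetical.

```r
# Hypothetical helper: n values from lo to hi, equally spaced on the log scale.
log_grid <- function(lo, hi, n) exp(seq(log(lo), log(hi), length.out = n))

# Grids matching the documented defaults of svm.gamma and svm.cost.
svm_gamma_grid <- log_grid(1e-5, 1e-1, 20)
svm_cost_grid  <- log_grid(0.5, 10, 5)

# Assumed candidate tree counts for GB: start at gb.n.trees.init (50),
# add gb.n.trees.increase (50) each step, stop at gb.n.trees.max (1000).
gb_n_trees <- seq(from = 50, to = 1000, by = 50)

print(signif(svm_cost_grid, 3))
```

Vectors built this way can be passed to svm.gamma and svm.cost to widen or narrow the search; a shorter grid reduces tuning time at the cost of a coarser search.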
Bootstrapping resamples the level-two units, a procedure sometimes referred to as the cluster bootstrap. The bootstrapped median level-two predictions will differ from the level-two predictions obtained without bootstrapping; this holds, for example, for the multilevel model when running MrP only. We recommend assessing the difference by running autoMrP once without bootstrapping and once with bootstrapping, and then comparing the level-two predictions from the first run to the median level-two predictions from the second.
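To illustrate what resampling level-two units means, here is a minimal base R sketch of a single cluster bootstrap draw. It is not autoMrP's implementation; the toy data frame and all object names are hypothetical. Whole units are drawn with replacement, and every respondent of a drawn unit enters the bootstrap sample.

```r
set.seed(1)

# Toy survey: 6 respondents nested in 3 level-two units (hypothetical data).
toy_survey <- data.frame(
  state = c("A", "A", "B", "B", "C", "C"),
  y     = c(1, 0, 1, 1, 0, 0)
)

# Draw level-two units with replacement ...
units <- unique(toy_survey$state)
drawn <- sample(units, size = length(units), replace = TRUE)

# ... and keep all respondents of each drawn unit (a unit drawn twice
# contributes its respondents twice).
boot_sample <- do.call(
  rbind,
  lapply(drawn, function(u) toy_survey[toy_survey$state == u, ])
)

print(table(boot_sample$state))
```

Because units rather than respondents are resampled, some units are typically absent from a given draw while others appear more than once, which is why summaries over bootstrap iterations need not coincide with the estimate from the full sample.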
To ensure reproducibility of the results, use the set.seed()
function to specify a seed.
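The effect of fixing the seed can be checked with any random operation in base R. The short sketch below uses sample() as a stand-in for the random fold assignment and bootstrapping steps; the object names are illustrative only.

```r
# Same seed, same random draws.
set.seed(123)
first <- sample(1:100, 5)

set.seed(123)
second <- sample(1:100, 5)

identical(first, second)
```

Calling set.seed() immediately before each auto_MrP() call, as in the first example of this page, follows the same pattern.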
The context-level predictions. A list with two elements. The first
element, EBMA, contains the post-stratified ensemble Bayesian model
averaging (EBMA) predictions. The second element, classifiers,
contains the post-stratified predictions from all estimated classifiers.
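As a rough illustration of what "post-stratified" means here, the sketch below aggregates cell-level predictions to the level-two unit in base R. It assumes that the bin.proportion column holds the share of each ideal type within its level-two unit; the toy census, the pred column and all object names are hypothetical and do not come from autoMrP.

```r
# Toy census: one row per ideal type, with its population share within the
# state and a model prediction for that type (hypothetical values).
toy_census <- data.frame(
  state      = c("A", "A", "B", "B"),
  proportion = c(0.25, 0.75, 0.5, 0.5),
  pred       = c(0.2, 0.6, 0.4, 0.8)
)

# Post-stratification: population-weighted mean of the predictions per state.
post_strat <- sapply(
  split(toy_census, toy_census$state),
  function(d) sum(d$pred * d$proportion) / sum(d$proportion)
)

print(post_strat)  # A = 0.5, B = 0.6
```

Dividing by the summed proportions keeps the result a proper weighted mean even if the shares within a unit do not add up to exactly one.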
# An MrP model without machine learning
set.seed(123)
m <- auto_MrP(
y = "YES",
L1.x = c("L1x1"),
L2.x = c("L2.x1", "L2.x2"),
L2.unit = "state",
bin.proportion = "proportion",
survey = taxes_survey,
census = taxes_census,
ebma.size = 0,
cores = 2,
best.subset = FALSE,
lasso = FALSE,
pca = FALSE,
gb = FALSE,
svm = FALSE,
mrp = TRUE
)
# summarize and plot results
summary(m)
plot(m)
# An MrP model without context-level predictors
m <- auto_MrP(
y = "YES",
L1.x = "L1x1",
L2.x = NULL,
mrp.L2.x = "",
L2.unit = "state",
bin.proportion = "proportion",
survey = taxes_survey,
census = taxes_census,
ebma.size = 0,
cores = 1,
best.subset = FALSE,
lasso = FALSE,
pca = FALSE,
gb = FALSE,
svm = FALSE,
mrp = TRUE
)
# Predictions with machine learning
# detect number of available cores
max_cores <- parallel::detectCores()
# autoMrP with machine learning
ml_out <- auto_MrP(
y = "YES",
L1.x = c("L1x1", "L1x2", "L1x3"),
L2.x = c("L2.x1", "L2.x2", "L2.x3", "L2.x4", "L2.x5", "L2.x6"),
L2.unit = "state",
L2.reg = "region",
bin.proportion = "proportion",
survey = taxes_survey,
census = taxes_census,
gb.L2.reg = TRUE,
svm.L2.reg = TRUE,
cores = min(2, max_cores)
)