Model averaging (and model selection) after multiple imputation

Description

Performs model selection/averaging on multiply imputed data and combines the resulting estimates. The package also provides access to less frequently used model averaging techniques and offers integrated bootstrap estimation.

Usage

mami(X, method = c("criterion.average", "criterion.selection", "MMA", "LASSO/LAE"), criterion = c("AIC", "BIC", "BIC+", "CV", "GCV"),
 B = 20, X.org = NULL, inference = c("standard", "+bootstrapping"), missing.data = c("imputed", "none", "CC"), var.remove = NULL, 
 user.weights = NULL, candidate.models = c("all", "restricted", "very restricted"), model.family = c("gaussian", "binomial", "poisson",
 "coxph"), add.factor = NULL, add.interaction = NULL, add.transformation = NULL, ycol = 1, CI = 0.95, kfold = 5, id = NULL,
 use.stratum = NULL, report.exp = FALSE, print.time = FALSE, print.warnings = TRUE, ...)

Arguments

X

Either a list of multiply imputed datasets (each of them a dataframe), or an object of class ‘amelia’ created by Amelia II, or a single dataframe.

method

A character string specifying the model selection or model averaging technique: "criterion.average" for model averaging based on exponential AIC or BIC weights, "criterion.selection" for stepwise variable selection based on AIC, BIC, or cross validation, "MMA" for Mallows' model averaging (linear models only), and "LASSO/LAE" for shrinkage estimation and averaging with the LASSO (linear models only).

criterion

A character string specifying the model selection criterion used for criterion-based model selection/averaging; currently either "AIC" for Akaike's Information Criterion, "BIC" for the Bayes criterion of Schwarz, "BIC+" for the Bayes criterion of Schwarz with quicker model averaging based on the leaps algorithm of the BMA package, "CV" for the cross validation error (based on the squared loss function), or "GCV" for generalized cross validation.

B

An integer indicating the number of bootstrap replications to use (when inference = "+bootstrapping" is chosen).

X.org

A dataframe consisting of the original unimputed data (which needs to be specified when inference = "+bootstrapping" is chosen).

inference

A character string, either "standard" for applying multiple imputation combining rules to the model selection/averaging estimates, or "+bootstrapping" if additional bootstrap-based inference is required. See the References section for more information.

missing.data

A character string, typically "imputed" when multiply imputed data are provided under X, or "CC" if a complete case analysis is desired, or "none" if there is no missing data.

var.remove

Either a vector of character strings or integers, specifying the variables or columns which are part of the data but not to be considered in the model selection/averaging procedure.

user.weights

A weight vector that is relevant to the analysis model.

candidate.models

A character string specifying whether criterion-based model selection/averaging should consider all possible candidate models ("all"), or only candidate models with a limited number of variables ("restricted", "very restricted").

model.family

A character string specifying the model family, either "gaussian" for linear regression models, "binomial" for logistic regression models, "poisson" for Poisson regression models, or "coxph" for Cox's proportional hazards models.

add.factor

Either a vector of character strings or integers, specifying the variables which should be treated as categorical/factors in the analysis. Variables which are already defined to be factors in the data are detected automatically and do not necessarily need to be specified with this option.

add.interaction

A list of vectors of either character strings or integers, specifying the variables which should be added as interactions in the analysis model.

add.transformation

A vector of character strings, specifying transformations of variables which should be added to the analysis models.

ycol

A vector or integer specifying the variable(s) or column(s) which should be treated as the outcome variable. For Cox models, the Examples pass two variables, the follow-up time and the event indicator.

CI

A value greater than 0 and less than 1 specifying the confidence of the confidence interval.

kfold

An integer specifying the number of folds for k-fold cross validation; used when applying shrinkage estimation (method="LASSO/LAE") or criterion="CV".

id

A character vector or integer specifying the variable or column to be used for a random intercept in the analysis model.

use.stratum

A character vector or integer specifying the variable used as a stratum in Cox regression analysis.

report.exp

A logical value specifying whether exponentiated coefficients should be reported or not.

print.time

A logical value specifying whether analysis time and anticipated estimation time for bootstrap estimation should be printed.

print.warnings

A logical value specifying whether warnings and any other output from the function should be printed or not.

...

Further arguments to be passed on, e.g. to the functions lae, dredge (from the MuMIn package), or bic.glm and bic.surv (from the BMA package).

Details

Model selection/averaging is performed on each imputed dataset. The results are then combined according to formulae (7)-(10) in Schomaker and Heumann (CSDA, 2014); see References below for more details. If inference="+bootstrapping" is chosen, the procedure described in Table 1 of that reference is performed in addition to standard MI inference. For longitudinal data (specified via id), the bootstrap resamples at the subject/person/id level. To obtain insightful results from bootstrap estimation, B should be large (B>200); plot.mami may be used to inspect the bootstrap results.
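The combining step can be illustrated with a minimal sketch. It assumes that the pooled estimate is the mean over the M imputed datasets and that the variance follows the usual within/between decomposition known from multiple imputation (Rubin's rules). The exact formulae (7)-(10) are not reproduced on this page, so this is an approximation of the idea and not the package's implementation. The function combine_mi and all numbers are hypothetical.

```r
# Hypothetical estimates of one coefficient from M = 5 imputed datasets
# (values are made up for illustration only).
est <- c(0.52, 0.47, 0.55, 0.50, 0.46)   # point estimate per imputed dataset
se  <- c(0.10, 0.11, 0.09, 0.10, 0.12)   # standard error per imputed dataset

# combine_mi is a hypothetical helper, not part of the package.
combine_mi <- function(est, se) {
  M       <- length(est)
  pooled  <- mean(est)                    # combined point estimate
  within  <- mean(se^2)                   # average within-imputation variance
  between <- var(est)                     # between-imputation variance
  total   <- within + (1 + 1 / M) * between
  c(estimate = pooled, se = sqrt(total))
}

res <- combine_mi(est, se)
print(res)
```

The pooled standard error is never smaller than the root of the average within-imputation variance, which reflects the extra uncertainty due to the missing data.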

Note that a variable is formally selected if it is selected (by means of either model selection or averaging) in at least one imputed dataset, but its overall impact depends on how often it is chosen. As a result, effects of variables which are not supported throughout imputed datasets and candidate models will simply be less pronounced. Variable importance measures based on model averaging weights are calculated for each imputed dataset and then averaged.
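To make the notion of weight-based variable importance concrete, the following sketch computes exponential AIC weights and the resulting importance measure for a single complete dataset. It uses the built-in mtcars data and three arbitrarily chosen predictors, both of which are assumptions made for illustration. It relies on base R only and does not reproduce the MuMIn-based implementation used by the package; with multiply imputed data, the importance vector would be computed per imputed dataset and then averaged.

```r
# All-subsets candidate models for a small linear regression (built-in mtcars data).
vars <- c("wt", "hp", "qsec")

# Each row of 'inclusion' is one candidate model (TRUE = variable included).
inclusion <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(vars))))
colnames(inclusion) <- vars

# AIC of every candidate model; the empty model contains the intercept only.
aic <- apply(inclusion, 1, function(inc) {
  rhs <- if (any(inc)) paste(vars[inc], collapse = " + ") else "1"
  AIC(lm(as.formula(paste("mpg ~", rhs)), data = mtcars))
})

# Exponential AIC weights: w_k proportional to exp(-0.5 * (AIC_k - min AIC)).
delta   <- aic - min(aic)
weights <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))

# Importance of a variable = sum of the weights of all models containing it.
importance <- colSums(inclusion * weights)
print(round(importance, 3))
```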

If method="criterion.average" is chosen and the number of variables is large, computation time may become a burden and obtaining results can even become infeasible. The reason is that criterion-based model averaging uses the implementation of package MuMIn, which considers all possible candidate models, that is, 2^p different candidate models for p parameters to estimate. If it is clear that only a subset of variables is relevant, the options candidate.models="restricted" or candidate.models="very restricted" may be useful; they specify that only up to a half or a fourth, respectively, of the provided variables can be added to a single candidate model. However, these options should be used with caution. Alternatively, criterion="BIC+" can be used, which utilizes efficient Bayesian Model Averaging based on the leaps algorithm of package BMA. One may also consider a model selection or averaging strategy not implemented here and combine estimates "by hand" according to formulae (7)-(10) in the reference cited below.
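The growth of the candidate set can be checked with a few lines of base R. The sketch below counts all 2^p candidate models and, for comparison, the models containing at most half or at most a fourth of the variables. The helper names n_all and n_upto are hypothetical, and rounding p/2 and p/4 down is an assumption; the package may round differently.

```r
# Number of candidate models when each of p variables can be in or out.
n_all <- function(p) 2^p

# Number of candidate models containing at most 'kmax' of the p variables.
n_upto <- function(p, kmax) sum(choose(p, 0:kmax))

p <- 20
n_all(p)                   # 1048576 models in total
n_upto(p, floor(p / 2))    # 616666 models with at most half of the variables
n_upto(p, floor(p / 4))    # 21700 models with at most a fourth of the variables
```

In this example, the "at most half" restriction removes less than half of the candidate models, whereas the "at most a fourth" restriction reduces the set by roughly a factor of 48.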

The function provides access to linear, logistic, Poisson and Cox proportional hazards models; a random intercept can be added to each of these models with the id option. Other models are not yet supported. Variables used for the imputation model but not needed for the analysis model can be removed with the option var.remove.

Value

Returns an object of class ‘mami’:

coefficients.ma

A matrix of coefficients, standard errors and confidence intervals for model averaging estimators.

coefficients.ma.boot

A matrix of coefficients and bootstrap results (confidence intervals, mean, standard error) for model averaging.

coefficients.s

A matrix of coefficients, standard errors and confidence intervals for model selection estimators.

coefficients.boot.s

A matrix of coefficients and bootstrap results (confidence intervals, mean, standard error) for model selection.

variable.importance

A vector containing the variable importance for each variable based on model averaging weights.

boot.results

A list of detailed estimation results for each bootstrap sample. The first list element contains the results from model selection, the second the results from model averaging.

Author(s)

Michael Schomaker

References

Schomaker, M., Heumann, C. (2014) Model Selection and Model Averaging after Multiple Imputation, Computational Statistics & Data Analysis, 71:758-770

See Also

plot.mami to visualize bootstrap results, lae and mma for model averaging techniques.

Examples

####################################################
# Example 1: Freetrade example from Amelia package #
#            Cross-Section-Time-Series Data        #
#            Linear and linear mixed model         #
####################################################
set.seed(24121980)
library(Amelia)
data(freetrade)
freetrade$pop <- log(freetrade$pop) # in line with original publication
freetrade_imp <- amelia(freetrade, ts = "year", cs = "country", noms="signed", polytime = 2, 
                        intercs = TRUE, empri = 2)

# AIC based model averaging and model selection in a linear model after MI 
# (with and without bootstrapping)
mami(freetrade_imp, method="criterion.selection", ycol="tariff",add.factor=c("country"))
mami(freetrade_imp, method="criterion.average", ycol="tariff",add.factor=c("country"))
m1 <- mami(freetrade_imp, method="criterion.selection", ycol="tariff",add.factor=c("country"),
           inference="+bootstrapping",B=25,X.org=freetrade,print.time=TRUE)
m1         # be patient with bootstrapping, increase B>=200 for better results
plot(m1, plots.p.page="4")

# For comparison: Mallows' model averaging (MMA) and complete case analysis 
mami(freetrade_imp, method="MMA", ycol="tariff",add.factor=c("country"))
mami(freetrade, method="criterion.selection", missing.data="CC", ycol="tariff",
     add.factor=c("country")) #Note the difference to imputed analysis  (e.g. usheg)
     

# Use linear mixed model with random intercept for country as alternative, just specify "id" option
# Note: same procedure for random intercepts (frailty) in logistic, Poisson, or Cox Model 
mami(freetrade_imp, ycol="tariff", id="country")

####################################################################
# Example 2: HIV treatment data, linear model and Cox model        #
####################################################################

# Impute with Amelia
data(HIV)
HIV_imp <-  amelia(HIV, m=5, idvars="patient",noms=c("hospital","sex","dead","tb","cm"),
                   ords=c("period","stage"),logs=c("futime","cd4"),
                   bounds=matrix(c(3,7,9,11,0,0,0,0,3000,5000,200,150),ncol=3,nrow=4))

# i)  Cox PH model
# Model selection (with AIC) to select risk factors for the hazard of death, 
# reported as hazard ratios
# Also: add transformations and interaction terms to candidate models 
mami(HIV_imp, method="criterion.selection",model.family="coxph", ycol=c("futime","dead"), 
     add.factor=c("hospital","stage","period"), add.transformation=c("cd4^2","age^2"), 
     add.interaction=list(c("cd4","age")), report.exp=TRUE, var.remove=c("patient","cd4slope6")) 
# Similar to the above (same model, but no hazard ratios reported, no interaction, hospitals as stratum),
# but with bootstrap CI and visualization of bootstrap distribution (be patient...it's worth it)
m2 <- mami(HIV_imp, method="criterion.selection",model.family="coxph", inference="+bootstrapping",
           X.org=HIV, ycol=c("futime","dead"), add.factor=c("stage","period"),
           add.transformation=c("cd4^2","age^2"), use.stratum="hospital", B=25, 
           var.remove=c("patient","cd4slope6"),print.time=TRUE,print.warnings=FALSE) 
m2
plot(m2)

# ii) Linear model
# Model selection and averaging to identify predictors for immune recovery 6 months
# after starting antiretroviral therapy, presented as CD4 slope which is the average
# change in number of CD4 cells per week (deaths are ignored for this example)

# AIC based model selection (stepAIC) after multiple imputation 
mami(HIV_imp, method="criterion.selection", ycol="cd4slope6", 
     add.factor=c("hospital","stage","period"), var.remove=c("patient","dead","futime"))
# Model averaging (AIC weights) for variables typically captured
mami(HIV_imp,ycol="cd4slope6", add.factor=c("hospital","stage","period"),
     var.remove=c("patient","dead","futime","tb","cm","haem"))
# Mallows' model averaging
mami(HIV_imp, method="MMA", ycol="cd4slope6", add.factor=c("hospital","stage","period"),
     var.remove=c("patient","dead","futime"))


#########################################################################################
# Example 3:   Model selection/averaging with no missing data, using shrinkage          #
# Example from Tibshirani, R. (1996) Regression shrinkage and selection via the lasso,  #
# Journal of the Royal Statistical Society, Series B 58(1), 267-288.                    #
# Useful to use mami to obtain Bootstrap CI after model selection/averaging             #
#########################################################################################

library(lasso2)
data(Prostate)
mami(Prostate,method="LASSO/LAE",missing.data="none",ycol="lpsa"
     ,kfold=10) # LASSO (selection/averaging) based on 10-fold CV
mami(Prostate,missing.data="none",ycol="lpsa")  # AIC based averaging
m3 <- mami(Prostate,missing.data="none",ycol="lpsa", inference="+bootstrapping",
           B=50,print.time=TRUE) # with Bootstrap CI
m3
plot(m3) # a few bimodal distributions: effect or not?


###################################################
# Example 4: use utilities from other packages    #
###################################################

# Model Averaging with AIC: use restrictions as done in "dredge"
# Example: Candidate models cannot contain "svi" and "lcp" at the same time
mami(Prostate,missing.data="none",ycol="lpsa", subset = !(svi && lcp))   
# Example: Make Occam's Window smaller 
mami(Prostate,missing.data="none",ycol="lpsa",criterion="BIC+", OR=5)