To deal with missing data, multiple imputation is the golden standard (Schafer & Graham, 2002). With GLMs, the models fitted on each imputed dataset can then be pooled. For non-parameteric methods like prediction rule ensembles, such pooling is more difficult. Little research has been performed on how to best deal with missing data in fitting prediction rule ensembles, but there are currently three options:
Listwise deletion. Although the default in pre()
, this is certainly the least favorable option.
Single imputation: Perform only a single imputation and fit a prediction rule ensemble on this single dataset. This is likely better than listwise deletion, but likely inferior to multiple imputation, but easy to implement.
Multiple imputation approach by aggregating ensembles: Create multiple imputed datasets; fit a separate prediction rule ensemble to each of the imputed datasets; aggregate the ensembles into a single final ensemble. In terms of predictive accuracy, this approach will work well. It will, however, yield more complex ensembles than the former two approaches, and the next approach.
Combining mean imputation with the Missing-In-Attributes approach. According to Josse et al. (2019), mean imputation and the Missing-In-Attributes approaches perform well from a prediction perspective. Furthermore, they are computationally inexpensive.
Below, we provide examples for the first three approaches described above. In future versions of package pre
, the mean imputation combined with MIA approach will be implemented.
For the examples, we will be predicting Wind speels using the airquality
dataset (we focus on predicting the wind
variable, because it does not have missing values, while variables Ozone
and Solar.R
do):
head(airquality) nrow(airquality) library("mice") md.pattern(airquality, rotate.names = TRUE)
This option is the default of function pre()
:
library("pre") set.seed(43) airq.ens <- pre(Wind ~., data = airquality) airq.ens
With listwise deletion, only r sum(complete.cases(airquality))
out of r nrow(airquality)
observations are retained. We obtain a rather sparse ensemble.
Here we apply single imputation by replacing missing values with the mean:
imp0 <- airquality imp0$Solar.R[is.na(imp0$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE) imp0$Ozone[is.na(imp0$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE) set.seed(43) airq.ens.imp0 <- pre(Wind ~., data = imp0) airq.ens.imp0
We obtain a larger number of rules, and slightly lower cross-validated mean squared error. However, this model cannot really be compared with the listwise deletion model, because they are computed over different sets of observations.
We perform multiple imputation by chained equations, using the predictive mean matching method. We generate 5 imputed datasets:
set.seed(42) imp <- mice(airquality, m = 5)
We create a list
with imputed datasets:
imp1 <- complete(imp, action = "all", include = FALSE)
We load the pre
library:
library("pre")
We create a custom function that fits PREs to several datasets contained in a list:
pre.agg <- function(datasets, ...) { result <- list() for (i in 1:length(datasets)) { result[[i]] <- pre(datasets[[i]], ...) } result }
We apply the new function:
set.seed(43) airq.agg <- pre.agg(imp1, formula = Wind ~ .)
Note that we can used the ellipsis (...
) to pass arguments to pre
(see ?pre
for an overview of arguments that can be specified).
We now define print
, summary
, predict
and coef
methods to extract results from the fitted ensemble. Again, we can use the ellipsis (...
) to pass arguments to the print
, summary
, predict
and coef
methods of function pre
(see e.g., ?pre:::print.pre
for more info):
print.agg <- function(object, ...) { result <- list() sink("NULL") for (i in 1:length(object)) { result[[i]] <- print(object[[i]], ...) } sink() print(result) } print.agg(airq.agg) ## results suppressed for space considerations summary.agg <- function(object, ...) { for (i in 1:length(object)) summary(object[[i]], ...) } summary.agg(airq.agg) ## results suppressed for space considerations
For averaging over predictions, there is only one option for continuous outcomes. For non-continuous outcomes, we can average over the linear predictor, or over the predicted values on the scale of the response. I am not sure which would be more appropriate; the resulting predicted values will not be identical, but highly correlated, though.
predict.agg <- function(object, newdata, ...) { rowMeans(sapply(object, predict, newdata = newdata, ...)) } agg_preds <- predict.agg(airq.agg, newdata = airquality[1:4, ]) agg_preds
Finally, the coef
method should return the averaged / aggregated final PRE. That is, it returns:
1) One averaged intercept;
2) All rules and linear terms, with their coefficients scaled by the number of datasets;
3) In presence of identical rules and linear terms, it aggregates those rules and their coefficients into one rule / term, and adds together the scaled coefficients.
Note that linear terms that do not have the same winsorizing points will not be aggregated. Note that the labels of rules and variables may overlap between different datasets (e.g., the label rule 12
may appear multiple times in the aggregated ensemble, but each rule 12
will have different conditions).
coef.agg <- function(object, ...) { coefs <- coef(object[[1]], ...) coefs <- coefs[coefs$coefficient != 0, ] for (i in 2:length(object)) { coefs_tmp <- coef(object[[i]], ...) coefs <- rbind(coefs, coefs_tmp[coefs_tmp$coefficient != 0, ]) } ## Divide coefficients by the number of datasets: coefs$coefficient <- coefs$coefficient / length(object) ## Identify identical rules: duplicates <- which(duplicated(coefs$description)) for (i in duplicates) { first_match <- which(coefs$description == coefs$description[i])[1] ## Add the coefficients: coefs$coefficient[first_match] <- coefs$coefficient[first_match] + coefs$coefficient[i] } ## Remove duplicates: coefs <- coefs[-duplicates, ] ## Check if there are- duplicate linear terms left and repeat: duplicates <- which(duplicated(coefs$rule)) for (i in duplicates) { first_match <- which(coefs$rule == coefs$rule[i])[1] coefs$coefficient[first_match] <- coefs$coefficient[first_match] + coefs$coefficient[i] } coefs <- coefs[-duplicates, ] ## Return results: coefs } coef.agg(airq.agg)
We have obtained a final ensemble of r nrow(coef.agg(airq.agg))-1
terms.
We compare performance using 10-fold cross validation. We evaluate predictive accuracy and the number of selected rules. We only evaluate accuracy for observations that have no missing values.
k <- 10 set.seed(43) fold_ids <- sample(1:k, size = nrow(airquality), replace = TRUE) observed <- c() for (i in 1:k) { ## Separate training and test data test <- airquality[fold_ids == i, ] test <- test[!is.na(test$Ozone), ] test <- test[!is.na(test$Solar.R), ] observed <- c(observed, test$Wind) } preds <- data.frame(observed) preds$LWD <- preds$SI <- preds$MI <- preds$observed nterms <- matrix(nrow = k, ncol = 3) colnames(nterms) <- c("LWD", "SI", "MI") row <- 1 for (i in 1:k) { if (i > 1) row <- row + nrow(test) ## Separate training and test data train <- airquality[fold_ids != i, ] test <- airquality[fold_ids == i, ] test <- test[!is.na(test$Ozone), ] test <- test[!is.na(test$Solar.R), ] ## Fit and evaluate listwise deletion premod <- pre(Wind ~ ., data = train) preds$LWD[row:(row+nrow(test)-1)] <- predict(premod, newdata = test) tmp <- print(premod) nterms[i, "LWD"] <- nrow(tmp) - 1 ## Fit and evaluate single imputation imp0 <- train imp0$Solar.R[is.na(imp0$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE) imp0$Ozone[is.na(imp0$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE) premod.imp0 <- pre(Wind ~., data = imp0) imp1 <- test imp1$Solar.R[is.na(imp1$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE) imp1$Ozone[is.na(imp1$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE) preds$SI[row:(row+nrow(test)-1)] <- predict(premod.imp0, newdata = imp1) tmp <- print(premod.imp0) nterms[i, "SI"] <- nrow(tmp) - 1 ## Perform multiple imputation imp <- mice(train, m = 5) imp1 <- complete(imp, action = "all", include = FALSE) airq.agg <- pre.agg(imp1, formula = Wind ~ .) preds$MI[row:(row+nrow(test)-1)] <- predict.agg(airq.agg, newdata = test) nterms[i, "MI"] <- nrow(coef.agg(airq.agg)) - 1 }
save(preds, nterms, file = "Missing_data_results.Rda")
load("Missing_data_results.Rda")
sapply(preds, function(x) mean((preds$observed - x)^2)) ## MSE sapply(preds, function(x) sd((preds$observed - x)^2)/sqrt(nrow(preds))) ## SE of MSE var(preds$observed) ## benchmark: Predict mean for all
Interestingly, we see that all three methods yield similar predictions and accuracy, and explain about 20% of variance in the response. Multiple imputation performed best, followed by listwise deletion, followed by single imputation. Taking into account the standard errors, however, these differences are not significant. Also, this simple evaluation on only a single dataset should not be taken too seriously. The better performance of multiple imputation does come at the cost of increased complexity:
boxplot(nterms, main = "Number of selected terms \nper missing-data method", cex.main = .8)
In line with findings of Josse et al. (2019), we expect MIA to work better for rules than mean imputation. In future versions of package pre
, we plan to implement MIA (for the rules) and combine it with mean imputation (for the linear terms).
In case you obtained different results, these results were obtained using the following:
sessionInfo()
Josse, J., Prost, N., Scornet, E., & Varoquaux, G. (2019). On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931. https://arxiv.org/abs/1902.06931
Miles, A. (2016). Obtaining predictions from models fit to multiply imputed data. Sociological Methods & Research, 45(1), 175-185. https://doi.org/10.1177/0049124115610345
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147. https://doi.org/10.1037/1082-989X.7.2.147
# Austin et al. (2019) refer to Wood et al. (2008) for stacking with multiply imputed data: # # "Stacked Imputed Datasets With Weighted Regressions (W1, W2, and W3) # # This approach entails stacking the M imputed datasets into 1 large dataset and then conducting variable selection in this single stacked dataset. To account for the multiple observations for each subject, weights are incorporated into the regression model when conducting variable selection. Wood et al proposed 3 different sets of weights that could be used: W1: w=1/M, in which each subject is weighted using the reciprocal of the number of imputed datasets; W2: w=(1−f)/M, where f denotes the proportion of missing data across all variables; W3: wj=(1−fj)/M, where fj denotes the proportion of missing data for variable Xj. Using the third approach, a different set of weights is used when assessing the statistical significance of a given candidate predictor variable." # # Austin, P. C., Lee, D. S., Ko, D. T., & White, I. R. (2019). Effect of variable selection strategy on the performance of prognostic models when using multiple imputation. Circulation: Cardiovascular Quality and Outcomes, 12(11), e005927. # # Wood, A. M., White, I. R., & Royston, P. (2008). How should variable selection be performed with multiply imputed data?. Statistics in medicine, 27(17), 3227-3246. ## Does glmnet allow for such an approach? Seems not, as weights are scaled automatically, see also https://stats.stackexchange.com/a/196615/173546
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.