Tools for analyzing healthcare cost data with zero-inflation and skewed distributions via a two-stage super learner
Author: Ziyue Wu
The study of healthcare expenditures has become an important area in epidemiological and public health research as healthcare utilization and associated costs have increased rapidly in recent years. Estimating healthcare expenditures is challenging due to heavily skewed distributions and zero-inflation. Myriad methods have been developed for analyzing cost data; however, a priori determination of an appropriate method is difficult. Super learning, a technique that considers an ensemble of methods for cost estimation, provides an attractive alternative for modeling healthcare expenditures, as it addresses the difficulty of choosing the correct model form. The super learner has shown benefits over single methods in recent studies. Nevertheless, no healthcare cost study has yet explored the use of the super learner for zero-inflated cost data. Here we propose a new method, the two-stage super learner, which implements super learning within a two-part model framework. The two-stage super learner naturally accommodates cases of zero healthcare utilization and extends the standard super learner with an additional layer of ensembling, allowing greater flexibility and improving cost prediction.
To deal with zero-inflation in positively skewed data, we designed a method that implements the super learner within the two-part model framework (the two-stage super learner). Using the decomposition E[Y|X] = Pr(Y>0|X) E[Y|Y>0,X], the objective of estimating E[Y|X] can be broken into two separate pieces: estimating Pr(Y>0|X) and estimating E[Y|Y>0,X]. This is done by adding an additional layer of ensembling to handle the point mass at zero. Rather than specifying a single library of prediction algorithms as in the standard super learner, we specify two separate libraries, one for each of the two stages. Specifically, the first-stage library is a collection of estimators predicting the probability that the outcome is positive, and the second-stage library is another collection of estimators predicting the mean of the positive outcomes. If the first-stage library includes K1 methods and the second-stage library includes K2 methods, the two-stage super learner's "whole library" contains K1 × K2 candidate estimators, each representing a specific pairing of a first-stage and a second-stage method. Like the standard super learner, the two-stage super learner is an ensemble built from the combination of algorithms that best fits the data.
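As a minimal sketch of this decomposition (an illustration only, not the package's ensemble; the data-generating model and the two glm specifications here are hypothetical stand-ins for the stage libraries), the two stages can be fit separately and their predictions multiplied:

```r
## Illustration of E[Y|X] = Pr(Y>0|X) * E[Y|Y>0,X] with two plain glm fits
set.seed(1)
n <- 500
x <- rnorm(n)
## zero with some probability, skewed positive values otherwise
y <- ifelse(rbinom(n, 1, plogis(x)) == 1, exp(1 + 0.5 * x + rnorm(n)), 0)
dat <- data.frame(x = x, y = y)

## Stage 1: logistic regression for Pr(Y > 0 | X)
fit1 <- glm(I(y > 0) ~ x, data = dat, family = binomial())
## Stage 2: Gamma regression for E[Y | Y > 0, X], fit on positive outcomes only
fit2 <- glm(y ~ x, data = subset(dat, y > 0), family = Gamma(link = "log"))

## Combined prediction: product of the two stage-specific predictions
p_pos <- predict(fit1, newdata = dat, type = "response")
m_pos <- predict(fit2, newdata = dat, type = "response")
yhat <- p_pos * m_pos
```

The two-stage super learner replaces each single glm above with a cross-validation-weighted ensemble over the corresponding stage library.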
twostageSL is an R package that provides tools for analyzing healthcare cost data using the super learning technique in the two-stage framework.
twostageSL can be installed from GitHub using the following command:
# install.packages("devtools")
library(devtools)
devtools::install_github("wuziyueemory/twostageSL")
library(twostageSL)
To show how to use this package, we simulate a training set of 1000 observations and a test set of 100 observations, each with 5 predictors drawn from a standard normal distribution. The outcome is continuous and non-negative with a relatively high proportion of zeros.
Generate the training and test sets:
library(twostageSL)
set.seed(121)
## training set
n <- 1000
p <- 5
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- rep(NA,n)
## probability of outcome being positive
prob <- plogis(1 + X[,1] + X[,2] + X[,1]*X[,2])
g <- rbinom(n,1,prob)
## assign zero outcome
ind <- g==0
Y[ind] <- 0
## assign non-zero outcome
ind <- g==1
Y[ind] <- 10 + X[ind, 1] + sqrt(abs(X[ind, 2] * X[ind, 3])) + X[ind, 2] - X[ind, 3] + rnorm(sum(ind))
## test set
m <- 100
newX <- matrix(rnorm(m*p), nrow = m, ncol = p)
colnames(newX) <- paste("X", 1:p, sep="")
newX <- data.frame(newX)
newY <- rep(NA,m)
## probability of outcome being positive
newprob <- plogis(1 + newX[,1] + newX[,2] + newX[,1]*newX[,2])
newg <- rbinom(m,1,newprob)
## assign zero outcome
newind <- newg==0
newY[newind] <- 0
## assign non-zero outcome
newind <- newg==1
newY[newind] <- 10 + newX[newind, 1] + sqrt(abs(newX[newind, 2] * newX[newind, 3])) + newX[newind, 2] - newX[newind, 3] + rnorm(sum(newind))
The first few rows of the training data:
#> X1 X2 X3 X4 X5
#> 1 -0.25536047 0.5210512 0.07889016 1.5794606 -0.2476091
#> 2 0.10837472 0.8525208 0.26945885 -0.4203506 0.5246956
#> 3 0.12778056 -1.9302377 -0.33684853 -0.9691443 -0.6481140
#> 4 -0.08107345 -1.4007554 0.23814700 -0.2929321 -1.9521166
#> 5 -0.71368372 0.4402426 -1.15347200 -0.5136347 -1.3213276
#> 6 1.61533208 0.2069617 -0.04666425 0.2745282 -2.6145711
The proportion of zeros in the outcome Y:
#> [1] 0.368
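Assuming the training set generated above is in scope, this proportion and the shape of the distribution can be inspected directly with base R (nothing package-specific):

```r
## proportion of exact zeros in the outcome
mean(Y == 0)
## histogram of the zero-inflated, right-skewed outcome
hist(Y, breaks = 30)
```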
The distribution of the outcome Y is zero-inflated and heavily skewed.
twostageSL is the core function for fitting the two-stage super learner. At a minimum, you need to specify the predictor matrix X, the outcome variable Y, the libraries of prediction algorithms for the two stages, a boolean selecting the two-stage or the standard super learner, the outcome distributions at the two stages, and the number of folds for cross-validation. The package includes all of the prediction algorithms from the SuperLearner package, plus additional prediction algorithms designed specifically for the heavily skewed data modeled at stage 2.
Generate the libraries and run the two-stage super learner:
## generate the Library
twostage.library <- list(stage1=c("SL.mean","SL.glm","SL.earth"),
stage2=c("SL.mean","SL.glm","SL.earth","SL.coxph"))
onestage.library <- c("SL.mean","SL.glm","SL.earth")
## run the twostage super learner
two <- twostageSL(Y=Y,
X=X,
newX = newX,
library.2stage = twostage.library,
library.1stage = onestage.library,
twostage = TRUE,
cvControl = list(V = 5))
two
#>
#> Call:
#> twostageSL(Y = Y, X = X, newX = newX, library.2stage = twostage.library,
#>     library.1stage = onestage.library, twostage = TRUE,
#>     cvControl = list(V = 5))
#>
#>
#> Risk Coef
#> S1: SL.mean_All + S2: SL.mean_All 30.99158 0.0000000
#> S1: SL.mean_All + S2: SL.glm_All 26.02499 0.0000000
#> S1: SL.mean_All + S2: SL.earth_All 26.34461 0.0000000
#> S1: SL.mean_All + S2: SL.coxph_All 26.25908 0.0000000
#> S1: SL.glm_All + S2: SL.mean_All 22.91554 0.0000000
#> S1: SL.glm_All + S2: SL.glm_All 20.80390 0.0000000
#> S1: SL.glm_All + S2: SL.earth_All 20.97040 0.0000000
#> S1: SL.glm_All + S2: SL.coxph_All 20.80807 0.2111024
#> S1: SL.earth_All + S2: SL.mean_All 21.59218 0.0000000
#> S1: SL.earth_All + S2: SL.glm_All 19.64819 0.4474539
#> S1: SL.earth_All + S2: SL.earth_All 19.64437 0.1748323
#> S1: SL.earth_All + S2: SL.coxph_All 19.65313 0.0000000
#> Single: SL.mean_All 30.99155 0.0000000
#> Single: SL.glm_All 21.34758 0.0000000
#> Single: SL.earth_All 20.13396 0.1666114
When setting twostage to FALSE, we obtain the result from the standard super learner:
## run the standard super learner
one <- twostageSL(Y=Y,
X=X,
newX = newX,
library.2stage = twostage.library,
library.1stage = onestage.library,
twostage = FALSE,
cvControl = list(V = 5))
one
#>
#> Call:
#> SuperLearner(Y = Y, X = X, family = family.single, SL.library = library.1stage,
#> method = method.CC_LS.scale, verbose = verbose, control = list(saveCVFitLibrary = T),
#> cvControl = cvControl)
#>
#>
#> Risk Coef
#> SL.mean_All 30.99155 0.01539633
#> SL.glm_All 21.34758 0.27146007
#> SL.earth_All 20.13396 0.71314359
We can also use CV.twostageSL to obtain V-fold cross-validated risk estimates for the two-stage super learner, the standard super learner, and all prediction algorithms in the library. Here we set the number of folds to 2.
two_cv <- CV.twostageSL(
Y = Y, X = X,
library.2stage = twostage.library,
library.1stage = onestage.library,
cvControl = list(V = 2),
innerCvControl = list(list(V = 5),
list(V = 5))
)
two_cv
#>
#> Call:
#> CV.twostageSL(Y = Y, X = X, library.2stage = twostage.library, library.1stage = onestage.library,
#> cvControl = list(V = 2), innerCvControl = list(list(V = 5), list(V = 5)))
#>
#>
#>
#> Cross-validated predictions from the two-stage Super Learner: SL.predict
#>
#> Cross-validated predictions from the discrete super learner (cross-validation selector): discreteSL.predict
#>
#> Which library algorithm was the discrete super learner: whichDiscreteSL
#>
#> Cross-validated prediction for all algorithms in the library and standard super learner: library.predict
The summary is shown below, with the two-stage super learner outperforming all other prediction algorithms.
#>
#> Call:
#> CV.twostageSL(Y = Y, X = X, library.2stage = twostage.library, library.1stage = onestage.library,
#> cvControl = list(V = 2), innerCvControl = list(list(V = 5), list(V = 5)))
#>
#> Risk is based on: Mean Squared Error
#>
#> All risk estimates are based on V = 2
#>
#> Algorithm Ave se Min Max
#> Two-stage Super Learner 19.514 0.72115 19.013 20.015
#> Discrete SL 19.895 0.84434 18.929 20.861
#> S1: SL.mean_All + S2: SL.mean_All 30.991 0.64242 30.953 31.030
#> S1: SL.mean_All + S2: SL.glm_All 26.036 0.49851 25.657 26.415
#> S1: SL.mean_All + S2: SL.earth_All 26.168 0.50502 25.790 26.545
#> S1: SL.mean_All + S2: SL.coxph_All 26.206 0.48504 25.714 26.697
#> S1: SL.glm_All + S2: SL.mean_All 23.099 0.67940 22.807 23.391
#> S1: SL.glm_All + S2: SL.glm_All 20.889 0.71529 20.883 20.894
#> S1: SL.glm_All + S2: SL.earth_All 20.952 0.71781 20.923 20.981
#> S1: SL.glm_All + S2: SL.coxph_All 20.919 0.70313 20.908 20.929
#> S1: SL.earth_All + S2: SL.mean_All 21.466 0.79182 19.988 22.945
#> S1: SL.earth_All + S2: SL.glm_All 19.921 0.84741 18.837 21.005
#> S1: SL.earth_All + S2: SL.earth_All 19.895 0.84434 18.929 20.861
#> S1: SL.earth_All + S2: SL.coxph_All 19.888 0.83303 18.854 20.923
#> Single: SL.mean_All 30.991 0.64242 30.953 31.030
#> Single: SL.glm_All 21.451 0.71995 20.942 21.961
#> Single: SL.earth_All 21.004 0.78972 19.124 22.884
#> Standard SuperLearner 20.184 0.69532 19.268 21.101
We can use predict to obtain predictions on a new dataset from the twostageSL fit. The onlySL argument controls whether predictions are computed for all algorithms or only for those with non-zero coefficients in the two-stage super learner fit.
newdata <- newX[1:50, ]
prediction <- predict(two, newdata = newdata, onlySL = TRUE)
#> Predictions from the two-stage Super Learner: pred
#>
#> Prediction for all algorithms in the library: library.predict
This returns a list with predictions from the two-stage super learner (pred) and predictions for each algorithm in the library (library.predict).