stride.estimator: STRIDE: Robust powerful mixture models


View source: R/main.R

Description

STRIDE estimators are nonparametric estimates of the distribution function for mixture data where the population identifiers are unknown, and the probability of belonging to a population is known (typically estimated with external data). The distribution functions are evaluated at time points tval. All STRIDE estimators can adjust for dynamic landmark prediction. The NPNA, NPNA_avg and NPNA_wrong estimators can adjust for one discrete covariate (zz) and one continuous covariate (ww). See details below.

Usage

stride.estimator(
  n,
  m,
  p,
  qvs,
  q,
  x,
  delta,
  ww,
  zz,
  run.NPMLEs,
  run.NPNA,
  run.NPNA_avg,
  run.NPNA_wrong,
  run.OLS,
  run.WLS,
  run.EFF,
  run.EMPAVA,
  tval,
  tval0,
  z.use,
  w.use,
  update.qs,
  know.true.groups = FALSE,
  true.group.identifier = NULL,
  run.prediction.accuracy,
  do_cross_validation_AUC_BS
)

Arguments

n

sample size, must be at least 1.

m

number of different mixture proportions, must be at least 2.

p

number of populations, must be at least 2.

qvs

a numeric matrix of size p by m containing all possible mixture proportions (i.e., the probability of belonging to each population k, k=1,...,p).

q

a numeric matrix of size p by n containing the mixture proportions for each person in the sample.
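
As a rough illustration of the shapes expected for qvs and q, the sketch below builds both matrices by hand in base R. The proportion values and the sampling scheme are assumptions made for illustration only; in the package's own example, qvs comes from qvs.values(p,m) and q from GenerateData.

```r
# Hypothetical values for illustration; not the output of qvs.values().
p <- 2   # number of populations
m <- 4   # number of distinct mixture proportions
n <- 10  # sample size

# Each column of qvs is one possible mixture-proportion vector (sums to 1).
# matrix() fills column by column, so each pair below becomes one column.
qvs <- matrix(c(1.0, 0.0,
                0.7, 0.3,
                0.4, 0.6,
                0.0, 1.0),
              nrow = p, ncol = m)

# Assign each person one of the m columns, giving a p-by-n matrix q.
set.seed(1)
col.id <- sample(seq_len(m), size = n, replace = TRUE)
q <- qvs[, col.id, drop = FALSE]

dim(qvs)  # p by m
dim(q)    # p by n
```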

x

a numeric vector of length n containing the observed event times for each person in the sample.

delta

a numeric vector of length n that denotes censoring (1 denotes event is observed, 0 denotes event is censored).
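
The relationship between x and delta can be sketched as follows, assuming ordinary right censoring. The distributions and the names event.time and censor.time are hypothetical; the package's GenerateData function produces x and delta in its own way.

```r
set.seed(1)
n <- 8

# Hypothetical latent event and censoring times (illustrative only).
event.time  <- rexp(n, rate = 1 / 50)
censor.time <- runif(n, min = 20, max = 80)

# Observed time is whichever comes first.
x <- pmin(event.time, censor.time)

# delta = 1 when the event is observed, 0 when it is censored.
delta <- as.numeric(event.time <= censor.time)

cbind(x, delta)
```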

ww

a numeric vector of length n containing the values of the continuous covariate for each person in the sample. Can be NULL.

zz

a numeric vector of length n containing the values of the discrete covariate for each person in the sample. Can be NULL.

run.NPMLEs

a logical indicator. If TRUE, then the output includes the estimated distribution function for mixture data based on the type I and type II nonparametric maximum likelihood estimators. The type I nonparametric maximum likelihood estimator is referred to as the "Kaplan-Meier" estimator in Garcia and Parast (2020). Neither the type I nor the type II estimator adjusts for covariates.

run.NPNA

a logical indicator. If TRUE, then the output includes the estimated distribution function for mixture data that accounts for covariates and dynamic landmarking. This estimator is called "NPNA" in Garcia and Parast (2020).

run.NPNA_avg

a logical indicator. If TRUE, then the output includes the estimated distribution function for mixture data that averages out over the observed covariates. This is referred to as NPNA_marg in Garcia and Parast (2020).

run.NPNA_wrong

a logical indicator. If TRUE, then the output includes the estimated distribution function for mixture data that adjusts for covariates, but ignores landmarking. This is referred to as NPNA_t_0=0 in Garcia and Parast (2020).

run.OLS

a logical indicator. If TRUE, then the output includes the estimated distribution function computed using an ordinary least squares influence function. The estimator adjusts for censoring using inverse probability weighting (IPW), augmented inverse probability weighting (AIPW), and imputation (IMP). See details in Wang et al. (2012). These estimators do not adjust for covariates.

run.WLS

a logical indicator. If TRUE, then the output includes the estimated distribution function computed using a weighted least squares influence function. The estimator adjusts for censoring using inverse probability weighting (IPW), augmented inverse probability weighting (AIPW), and imputation (IMP). See details in Wang et al. (2012). These estimators do not adjust for covariates.

run.EFF

a logical indicator. If TRUE, then the output includes the estimated distribution function computed using the efficient influence function based on Hilbert space projection theory results. The estimator adjusts for censoring using inverse probability weighting (IPW), augmented inverse probability weighting (AIPW), and imputation (IMP). See details in Wang et al. (2012). These estimators do not adjust for covariates.

run.EMPAVA

a logical indicator. If TRUE, then the output includes the estimated distribution function for mixture data based on an expectation-maximization (EM) algorithm that uses the pool adjacent violators algorithm (PAVA) from isotone regression to yield a non-negative and monotone estimator. This estimator does not adjust for covariates. See details in Qing et al. (2014).

tval

numeric vector of time points at which the distribution function is evaluated, all values must be non-negative.

tval0

numeric vector of time points representing the landmark times. All values must be non-negative and smaller than the maximum of tval.
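
The constraints on tval and tval0 can be checked before calling the estimator. The helper below is a minimal sketch; check.time.grid is a hypothetical name and is not part of the package.

```r
# Hypothetical helper (not part of the package): verifies the documented
# constraints on the evaluation times tval and the landmark times tval0.
check.time.grid <- function(tval, tval0) {
  stopifnot(is.numeric(tval), is.numeric(tval0))
  stopifnot(all(tval >= 0), all(tval0 >= 0))   # non-negative times
  stopifnot(all(tval0 < max(tval)))            # landmarks below max(tval)
  invisible(TRUE)
}

# The grids used in the Examples section satisfy the constraints.
check.time.grid(seq(0, 80, by = 5), c(0, 20, 30, 40, 50))
```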

z.use

numeric vector of values at which to evaluate the discrete covariate Z in the estimated distribution function. The values of z.use must be in the range of the observed zz. Can be NULL.

w.use

numeric vector of values at which to evaluate the continuous covariate W in the estimated distribution function. The values of w.use must be in the range of the observed ww. Can be NULL.

update.qs

logical indicator. If TRUE, the mixture proportions q will be updated. This is currently not implemented.

know.true.groups

logical indicator. If TRUE, then we know the population identifier for each person in the sample. This option is only used for simulation studies to check prediction accuracy. Default is FALSE.

true.group.identifier

numeric vector of length n denoting the population identifier for each person in the sample. Default is NULL.

run.prediction.accuracy

logical indicator. If TRUE, then we compute the prediction accuracy measures, including the area under the receiver operating characteristic curve (AUC) and the Brier Score (BS). Prediction accuracy is only valid in simulation studies where know.true.groups=TRUE and true.group.identifier is available.

do_cross_validation_AUC_BS

logical indicator. If TRUE, then we compute the prediction accuracy measures, including the area under the receiver operating characteristic curve (AUC) and the Brier Score (BS) using cross-validation. Prediction accuracy is only valid in simulation studies where know.true.groups=TRUE and true.group.identifier is available.

Value

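stride.estimator returns a list. The components accessed in the Examples section are Ft.estimate and St.estimate (the estimates) and Ft.AUC.BS and St.AUC.BS (the AUC and BS prediction accuracy measures).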

Details

We estimate nonparametric distribution functions for mixture data where the population identifiers are unknown, and the probability of belonging to a population is known (typically estimated with external data). The distribution functions are evaluated at time points tval. All estimators adjust for dynamic landmark prediction. Dynamic landmark prediction means that the distribution function is computed knowing that the survival time, T, satisfies T > t_0, where t_0 are the time points in tval0. The NPNA, NPNA_avg, and NPNA_wrong estimators adjust for one discrete covariate (zz) and one continuous covariate (ww).
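
To make the landmark conditioning concrete, the sketch below computes an empirical version of P(T <= t | T > t_0) on simulated, uncensored, single-population data. It ignores censoring, the mixture structure, and covariates, so it is not a STRIDE estimator; the function name landmark.cdf and the simulated times are assumptions made for illustration.

```r
set.seed(1)
# Hypothetical uncensored event times from a single population.
event.time <- rexp(5000, rate = 1 / 40)

# Empirical P(T <= t | T > t0): restrict to subjects still event-free
# at the landmark time t0, then take the proportion with T <= t.
landmark.cdf <- function(times, t, t0) {
  at.risk <- times > t0
  mean(times[at.risk] <= t)
}

t0 <- 20
t  <- 50
direct <- landmark.cdf(event.time, t, t0)

# Same quantity from the unconditional empirical distribution function:
# (F(t) - F(t0)) / (1 - F(t0)), valid for t >= t0.
Fhat  <- ecdf(event.time)
ratio <- (Fhat(t) - Fhat(t0)) / (1 - Fhat(t0))

all.equal(direct, ratio)
```

With t_0 = 0 and positive event times, the conditional distribution reduces to the unconditional one, which is why a landmark time of 0 corresponds to no landmarking.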

References

Garcia, T.P. and Parast, L. (2020). Dynamic landmark prediction for mixture data. Biostatistics, doi:10.1093/biostatistics/kxz052.

Garcia, T.P., Marder, K. and Wang, Y. (2017). Statistical modeling of Huntington disease onset. In Handbook of Clinical Neurology, vol 144, 3rd Series, editors Andrew Feigin and Karen E. Anderson.

Qing, J., Garcia, T.P., Ma, Y., Tang, M.X., Marder, K., and Wang, Y. (2014). Combining isotonic regression and EM algorithm to predict genetic risk under monotonicity constraint. Annals of Applied Statistics, 8(2), 1182-1208.

Wang, Y., Garcia, T.P., and Ma, Y. (2012). Nonparametric estimation for censored mixture data with application to the Cooperative Huntington's Observational Research Trial. Journal of the American Statistical Association, 107, 1324-1338.

Examples

# Setup parameters to generate the data
set.seed(1)
censoring.rate <- 40
p <- 2
n <- 2000
m <- 4

simu.setting <- "HD-With-Covariates"
qvs <- qvs.values(p,m)

## generate the data
data.gen <- GenerateData(n,p,m,qvs,censoring.rate,simu.setting)
x <- data.gen$x
delta <- data.gen$delta
q <- data.gen$q
ww <- data.gen$ww
zz <- data.gen$zz


## Estimation procedures to run to estimate F(t|t0,z,w)
update.qs <- FALSE
run.NPMLEs <- TRUE
run.NPNA <- TRUE
run.NPNA_avg <- FALSE
run.NPNA_wrong <- FALSE
run.OLS <- FALSE
run.WLS <- FALSE
run.EFF <- FALSE
run.EMPAVA <- FALSE


## The distribution function we are estimating is F(t|t0,z,w).
tval <- seq(0,80,by=5)  ## tval refers to "t" in F(t|t0,z,w)
tval0 <- c(0,20,30,40,50) ##tval0 refers to "t0" in F(t|t0,z,w)
z.use <- c(0,1)  ## z.use refers to "z" in  F(t|t0,z,w)
w.use <- seq(35,55,by=1)  ## w.use refers to "w" in F(t|t0,z,w)

## Setup to compute AUC/BS as in Garcia and Parast (2020). Only for simulated data.
run.prediction.accuracy <- TRUE
do_cross_validation_AUC_BS <- FALSE
know.true.groups <- TRUE
true.group.identifier <- data.gen$true.group.identifier


## Perform the estimation
estimators.out <- stride.estimator(n,m,p,qvs,q,
                                   x,delta,ww,zz,
                                   run.NPMLEs,
                                   run.NPNA,
                                   run.NPNA_avg,
                                   run.NPNA_wrong,
                                   run.OLS,
                                   run.WLS,
                                   run.EFF,
                                   run.EMPAVA,
                                   tval,tval0,
                                   z.use,w.use,
                                   update.qs,
                                   know.true.groups,
                                   true.group.identifier,
                                   run.prediction.accuracy,
                                   do_cross_validation_AUC_BS)

## Show results for the estimates
## estimators.out$Ft.estimate
## estimators.out$St.estimate

## Show results for prediction accuracy AUC and BS measures (only valid for simulated data
##  where we know the true.group.identifiers.)
## estimators.out$Ft.AUC.BS
## estimators.out$St.AUC.BS


## NOT RUN
## Do bootstrap variance
#nboot <- 100
#variance.estimation <- TRUE

#varboot <- stride.bootstrap.variance(
#           nboot,n,m,p,qvs,q,
#           x,delta,ww,zz,
#           run.NPMLEs,
#           run.NPNA,
#           run.NPNA_avg,
#           run.NPNA_wrong,
#           run.OLS,
#           run.WLS,
#           run.EFF,
#           run.EMPAVA,
#           tval,tval0,
#           z.use,w.use,
#           update.qs,
#           know.true.groups,
#           true.group.identifier,
#           estimator_Ft=estimators.out$Ft.estimate,
#           estimator_St=estimators.out$St.estimate,
#           AUC_BS_Ft=estimators.out$Ft.AUC.BS,
#           AUC_BS_St=estimators.out$St.AUC.BS,
#           run.prediction.accuracy,
#           do_cross_validation_AUC_BS=FALSE)

## Show results for the bootstrap variances of the estimates
## varboot$Ft.estimate.boot
## varboot$St.estimate.boot


## Show results for the bootstrap variances of the prediction accuracy measures, AUC and BS
## varboot$Ft.AUC.BS.boot
## varboot$St.AUC.BS.boot

tpgarcia/stride documentation built on March 18, 2021, 3:42 p.m.