tmle: Targeted Maximum Likelihood Estimation

View source: R/tmle.R

tmleR Documentation

Targeted Maximum Likelihood Estimation

Description

Targeted maximum likelihood estimation of parameters of a marginal structural model, and of marginal treatment effects of a binary point treatment on an outcome. In addition to the additive treatment effect, risk ratio and odds ratio estimates are reported for binary outcomes. The tmle function is generally called with arguments (Y,A,W), where Y is a continuous or binary outcome variable, A is a binary treatment variable, (A=1 for treatment, A=0 for control), and W is a matrix or dataframe of baseline covariates. The population mean outcome is calculated when there is no variation in A. If values of binary mediating variable Z are supplied, estimates are returned at each level of Z. Missingness in the outcome is accounted for in the estimation procedure if missingness indicator Delta is 0 for some observations. Repeated measures can be identified using the id argument. Option to adjust for biased sampling using the obsWeights argument. Targeted bootstrap inference can be obtained in addition to IC-based inference by setting B to a value greater than 1 (10,000 recommended for analyses requiring high precision).

Usage

tmle(Y, A, W, Z=NULL, Delta = rep(1,length(Y)), Q = NULL, Q.Z1 = NULL, Qform = NULL, 
     Qbounds = NULL, Q.SL.library = c("SL.glm", "tmle.SL.dbarts2", "SL.glmnet"), 
     cvQinit = TRUE, g1W = NULL, gform = NULL, 
     gbound = NULL,  pZ1=NULL,
     g.Zform = NULL, pDelta1 = NULL, g.Deltaform = NULL, 
     g.SL.library = c("SL.glm", "tmle.SL.dbarts.k.5", "SL.gam"),
     g.Delta.SL.library =  c("SL.glm", "tmle.SL.dbarts.k.5", "SL.gam"),
     family = "gaussian", fluctuation = "logistic", alpha = 0.9995, id=1:length(Y), 
     V.Q = 10, V.g=10, V.Delta=10, V.Z = 10,
     verbose = FALSE, Q.discreteSL=FALSE, g.discreteSL=FALSE, g.Delta.discreteSL=FALSE,
     prescreenW.g=TRUE, min.retain = 5, target.gwt = TRUE, automate=FALSE,
     obsWeights = NULL, alpha.sig = 0.05, B = 1)

Arguments

Y

continuous or binary outcome variable

A

binary treatment indicator, 1 - treatment, 0 - control

W

vector, matrix, or dataframe containing baseline covariates

Z

optional binary indicator for intermediate covariate for controlled direct effect estimation

Delta

indicator of missing outcome or treatment assignment. 1 - observed, 0 - missing

Q

optional n \times 2 matrix of initial values for Q portion of the likelihood, (E(Y|A=0,W), E(Y|A=1,W))

Q.Z1

optional n \times 2 matrix of initial values for Q portion of the likelihood, (E(Y|Z=1,A=0,W), E(Y|Z=1,A=1,W)). (When specified, values for E(Y|Z=0,A=0,W), E(Y|Z=0,A=1,W) are passed in using the Q argument

Qform

optional regression formula for estimation of E(Y|A,W), suitable for call to glm

Qbounds

vector of upper and lower bounds on Y and predicted values for initial Q. Defaults to the range of Y, widened by 1% of the min and max values.

Q.SL.library

optional vector of prediction algorithms to use for SuperLearner estimation of initial Q

cvQinit

logical, if TRUE, estimates cross-validated predicted values, default=TRUE

g1W

optional vector of conditional treatment assingment probabilities, P(A=1|W)

gform

optional regression formula of the form A~W, if specified this overrides the call to SuperLearner

gbound

value between (0,1) for truncation of predicted probabilities. See Details section for more information

pZ1

optionaln \times 2 matrix of conditional probabilities P(Z=1|A=0,W), P(Z=1|A=1,W)

g.Zform

optional regression formula of the form Z~A+W, if specified this overrides the call to SuperLearner

pDelta1

optional matrix of conditional probabilities for missingness mechanism, n \times 2 when Z is NULL P(Delta=1|A=0,W), P(Delta=1|A=1,W). n \times 4 otherwise, P(Delta=1|Z=0,A=0,W), P(Delta=1|Z=0,A=1,W), P(Delta=1|Z=1,A=0,W), P(Delta=1|Z=1,A=1,W)

g.Deltaform

optional regression formula of the form Delta~A+W, if specified this overrides the call to SuperLearner

g.SL.library

optional vector of prediction algorithms to use for SuperLearner estimation of g1W

g.Delta.SL.library

optional vector of prediction algorithms to use for SuperLearner estimation of pDelta1

family

family specification for working regression models, generally ‘gaussian’ for continuous outcomes (default), ‘binomial’ for binary outcomes

fluctuation

‘logistic’ (default), or ‘linear’

alpha

used to keep predicted initial values bounded away from (0,1) for logistic fluctuation

id

optional subject identifier

V.Q

Number of cross-validation folds for super learner estimation of Q

V.g

Number of cross-validation folds for super learner estimation of g

V.Delta

Number of cross-validation folds for super learner estimation of missingness mechanism

V.Z

Number of cross-validation folds for super learner estimation of intermediate variable

verbose

status messages printed if set to TRUE (default=FALSE)

Q.discreteSL

if TRUE, discreteSL is used instead of ensemble SL. Ignored when SL not used to estimate Q

g.discreteSL

if TRUE, discreteSL is used instead of ensemble SL. Ignored when SL not used to estimate g1W

g.Delta.discreteSL

if TRUE, discreteSL is used instead of ensemble SL. Ignored when SL not used to estimate P(Delta = 1 | A,W)

prescreenW.g

Option to screen covariates before estimating g in order to retain only those associated with the outcome (Recommend FALSE in low dimensional datasets)

min.retain

Minimum number of covariates to retain when prescreening covariates for g. Ignored when prescreenW.g=FALSE

target.gwt

When TRUE, move g from denominator of clever covariate to the weight when fitting epsilon

automate

When TRUE, all tuning parameters are set to their default values. Number of cross validation folds, truncation level for g, and decision to prescreen covariates for modeling g are set data-adaptively based on sample size (see details).

obsWeights

Optional observation weights to account for biased sampling

alpha.sig

significance level for constructing 1-alpha.sig confidence intervals

B

Number of boostrap iterations. Set B>1 to obtain targeted bootstrap based inference in addition to IC-based inference (see Details).

Details

gbounds Lower bound defaults to lb = 5/sqrt(n)/log(n). For treatment effect estimates and population mean outcome the upper bound defaults to 1. For ATT and ATC, the upper bound defaults to 1- lb.

W may contain factors. These are converted to indicators via a call to model.matrix.

Controlled direct effects are estimated when binary covariate Z is non-null. The tmle function returns an object of class tmle.list, a list of two items of class tmle. The first corresponds to estimates obtained when Z is fixed at 0, the second corresponds to estimates obtained when Z is fixed at 1.

When automate = TRUE the sample size determines the number of cross validation folds, V based on the effective sample size. When Y is continuous n.effective = n. When Y is binary n.effective = 5 * size of minority class. When n.effective <= 30 V= n.effective; When n.effective <= 500 V= 20; When 500 < n <=1000 V=10; When 1000 < n <= 10000 V=5; Otherwise V=2. Bounds on g set to (5/sqrt(n)/log(n), 1), except for ATT and ATE, where upper bound is 1-lower bound. Wretain.g set to TRUE when number of covariates >= n.effective / 5.

Set B = 10,000 to obtain high precision targeted bootstrap quantile-based confidence intervals and variance of bootstrap point estimates. Set B = 1,000 for rough approximation, and B = 1 for IC-based inference only.

Value

estimates

list with elements EY1 (population mean), ATE (additive treatment effect), ATT (additive treatment effect among the treated), ATC (additive treatment effect among the controls), RR (relative risk), OR (odds ratio). Each element in the estimates of these is itself a list containing

  • psi - parameter estimate

  • pvalue - two-sided p-value

  • CI - 95% confidence interval

  • var.psi - Influence-curve based variance of estimate (ATE parameter only)

  • log.psi - Parameter estimate on log scale (RR and OR parameters)

  • var.log.psi - Influence-curve based variance of estimate on log scale (RR and OR parameters)

  • bs.var - Variance of bootstrap point estimates (when B > 1)

  • bs.CI.twosided - Quantile-based 2-sided confidence interval bounds

  • bs.CI.onesided.lower - Quantile-based 1-sided lower confidence interval bounds

  • bs.CI.onesided.upper - Quantile-based 1-sided upper confidence interval bounds

Qinit

initial estimate of Q. Qinit$coef are the coefficients for a glm model for Q, if applicable. Qinit$Q is an n \times 2 matrix, where n is the number of observations. Columns contain predicted values for Q(0,W),Q(1,W) using the initial fit. Qinit$type is method for estimating Q. Qinit$Rsq is Rsq for initial estimate of Q. Qinit$Rsq.type empirical or cross-validated (depends on value of cvQinit), Rsq or pseudo-Rsq when Y is binary.

Qstar

targeted estimate of Q, an n \times 2 matrix with predicted values for Q(0,W), Q(1,W) using the updated fit

g

treatment mechanism estimate. A list with four items: g$g1W contains estimates of P(A=1|W) for each observation, g$coef the coefficients for the model for g when glm used, g$type estimation procedure, g$discreteSL flag, g$AUC empirical AUC if ROCR package is available

g.Z

intermediate covariate assignment estimate (when applicable). A list with four items: g.Z$g1W an n \times 2 matrix containing values of P(Z=1|A=1,W), P(Z=1|A=0,W) for each observation, g.Z$coef the coefficients for the model for g when glm used, g.Z$type estimation procedure, g.Z$discreteSL flag

g.Delta

missingness mechanism estimate. A list with four items: g.Delta$g1W an n \times 4 matrix containing values of P(Delta=1|Z,A,W) for each observation, with (Z=0,A=0), (Z=0,A=1), (Z=1,A=0),(Z=1,A=1). (When Z is NULL, columns 3 and 4 are duplicates of 1 and 2.) g.Delta$coef the coefficients for the model for g when glm used, g.Delta$type estimation procedure, g.Delta$discreteSL flag

gbound

bounds used to truncate g

gbound.ATT

bounds used to truncated g for ATT and ATC estimation

W.retained

names of covariates used to model the components of g

Author(s)

Susan Gruber sgruber@cal.berkeley.edu, in collaboration with Mark van der Laan.

References

1. Gruber, S. and van der Laan, M.J. (2012), tmle: An R Package for Targeted Maximum Likelihood Estimation. Journal of Statistical Software, 51(13), 1-35. https://www.jstatsoft.org/v51/i13/

2. Gruber, S. and van der Laan, M.J. (2009), Targeted Maximum Likelihood Estimation: A Gentle Introduction. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 252. https://biostats.bepress.com/ucbbiostat/paper252/

3. Gruber, S. and van der Laan, M.J. (2010), A Targeted Maximum Likelihood Estimator of a Causal Effect on a Bounded Continuous Outcome. The International Journal of Biostatistics, 6(1), 2010.

4. Rosenblum, M. and van der Laan, M.J. (2010).Targeted Maximum Likelihood Estimation of the Parameter of a Marginal Structural Model. The International Journal of Biostatistics, 6(2), 2010.

5. van der Laan, M.J. and Rubin, D. (2006), Targeted Maximum Likelihood Learning. The International Journal of Biostatistics, 2(1). https://biostats.bepress.com/ucbbiostat/paper252/

6. van der Laan, M.J., Rose, S., and Gruber,S., editors, (2009) Readings in Targeted Maximum Likelihood Estimation . U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 254. https://biostats.bepress.com/ucbbiostat/paper254/

7. van der Laan, M.J. and Gruber S. (2016), One-Step Targeted Minimum Loss-based Estimation Based on Universal Least Favorable One-Dimensional Submodels. The International Journal of Biostatistics, 12 (1), 351-378.

8. Gruber, S., Phillips, R.V., Lee, H., van der Laan, M.J. Data-Adaptive Selection of the Propensity Score Truncation Level for Inverse Probability Weighted and Targeted Maximum Likelihood Estimators of Marginal Point Treatment Effects. American Journal of Epidemiology 2022; 191(9), 1640-1651.

See Also

summary.tmle, estimateQ, estimateG, calcParameters, oneStepATT, tmleMSM, calcSigma

Examples

library(tmle)
  set.seed(1)
  n <- 250
  W <- matrix(rnorm(n*3), ncol=3)
  A <- rbinom(n,1, 1/(1+exp(-(.2*W[,1] - .1*W[,2] + .4*W[,3]))))
  Y <- A + 2*W[,1] + W[,3] + W[,2]^2 + rnorm(n)

# Example 1. Simplest function invocation 
# SuperLearner called to estimate Q, g
# Delta defaults to 1 for all observations   
## Not run: 
  result1 <- tmle(Y,A,W)
  summary(result1)

## End(Not run)
# Example 2: 
# User-supplied regression formulas to estimate Q and g
# binary outcome
  n <- 250
  W <- matrix(rnorm(n*3), ncol=3)
  colnames(W) <- paste("W",1:3, sep="")
  A <- rbinom(n,1, plogis(0.6*W[,1] +0.4*W[,2] + 0.5*W[,3]))
  Y <- rbinom(n,1, plogis(A + 0.2*W[,1] + 0.1*W[,2] + 0.2*W[,3]^2 ))
  result2 <- tmle(Y,A,W, family="binomial", Qform="Y~A+W1+W2+W3", gform="A~W1+W2+W3")
  summary(result2)

## Not run: 
# Example 3:
# Incorporate sampling weights and
# request targeted bootstrap-based inference along with IC-based results
  pi <- .25 + .5*W[,1] > 0
  enroll <- sample(1:n, size = n/2, p = pi)
  result3 <- tmle(Y[enroll],A[enroll],W[enroll,], family="binomial", Qform="Y~A+W1+W2+W3",
             gform="A~W1+W2+W3", obsWeights = 1/pi[enroll],B=1000)
  summary(result3)

# Example 4: Population mean outcome
# User-supplied (misspecified) model for Q, 
# Super learner called to estimate g, g.Delta
# V set to 2 for demo, not recommended at this sample size
# approx. 20
  Y <- W[,1] + W[,2]^2 + rnorm(n)
  Delta <- rbinom(n, 1, 1/(1+exp(-(1.7-1*W[,1]))))
  result4 <- tmle(Y,A=NULL,W, Delta=Delta, Qform="Y~A+W1+W2+W3", V.g=2, V.Delta=2)
  print(result4)

# Example 5: Controlled direct effect
# User-supplied models for g, g.Z
# V set to 2 for demo, not recommended at this sample size
  A <- rbinom(n,1,.5)
  Z <- rbinom(n, 1, plogis(.5*A + .1*W[,1]))
  Y <- 1 + A + 10*Z + W[,1]+ rnorm(n)
  
  CDE <- tmle(Y,A,W, Z, gform="A~1", g.Zform = "Z ~ A + W1", V.Q=2, V.g=2)
  print(CDE)
  total.effect <- tmle(Y,A, W,  gform="A~1")
  print(total.effect)

## End(Not run)

tmle documentation built on May 29, 2024, 6:38 a.m.