tmleCommunity: Estimate Marginal Treatment Effects For Arbitrary...

Description Usage Arguments Details Value IPTW estimator TMLE estimator GCOMP estimator Modeling P(A | W, E) for covariates (A, W, E) Three methods for choosing bin (interval) locations for a univariate and continuous variable A See Also Examples

View source: R/tmleCommunity.R

Description

Estimate the marginal treatment effect among independent communities (or i.i.d units if no hierarchical structure) using TMLE (targeted maximum likelihood estimation). It supports two different TMLEs that are based on community-level and individual-level analysis, respectively. The individual-level TMLE cooperates with additional working assumptions and has potential efficiency gain. It also provide corresponding IPTW (the inverse-probability-of-treatment or Horvitz-Thompson) and GCOMP (parametric G-computation).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
tmleCommunity(data, Ynode, Anodes, WEnodes, YnodeDet = NULL,
  obs.wts = c("equal.within.pop", "equal.within.community"),
  community.step = c("NoCommunity", "community_level",
  "individual_level", "perCommunity"), communityID = NULL,
  community.wts = c("size.community", "equal.community"),
  pooled.Q = FALSE, f_g0 = NULL, f_gstar1, f_gstar2 = NULL,
  Qform = NULL, Qbounds = NULL, alpha = 0.995,
  fluctuation = "logistic", hform.g0 = NULL, hform.gstar = NULL,
  lbound = 0.005, h.g0_GenericModel = NULL,
  h.gstar_GenericModel = NULL, TMLE.targetStep = c("tmle.intercept",
  "tmle.covariate"), n_MCsims = 1, CI_alpha = 0.05, rndseed = NULL,
  verbose = getOption("tmleCommunity.verbose"))

Arguments

data

Observed data, data.frame with named columns, containing WEnodes, Anode, Ynode and possibly communityID, YnodeDet. See "Details".

Ynode

Column name or index in data of outcome variable. Outcome can be either binary or continuous (could be beyond 0 and 1). If Ynode undefined, the left-side of the regression formula in argument Qform will be treated as Ynode.

Anodes

Column names or indices in data of exposure (treatment) variables

WEnodes

Column names or indices in data of individual-level (and possibly community-level) baseline covariates. Factors are not allowed.

YnodeDet

Optional column name or index in data of indicators of deterministic values of outcome Ynode, coded as (TRUE / FALSE) or (1 / 0). If TRUE or 1, value of Ynode is given deterministically / constant.

obs.wts

Optional choice to provide/ construct a vector of individual-level observation (sampling) weights (of length nrow(data)). Currently supports a non-negative numeric vector, "equal.within.pop" (Default) and equal.within.community. If "equal.within.pop", weigh individuals in the entire dataset equally (weigh to be all 1); If "equal.within.community", weigh individuals within the same community equally (i.e., 1 / (number of individuals in each community)).

community.step

Methods to deal with hierarchical data, one of "NoCommunity" (Default), "community_level", "individual_level" and "PerCommunity". If "NoCommunity", claim that no hirerachical structure in data; If "community_level", use community-level TMLE; If "individual_level", use individual-level TMLE cooperating with the assumption of no covariate interference. If "perCommunity", use stratified TMLE. If communityID = NULL, then automatically pool over all communities (i.e., treated it as "NoCommunity"). See "Details".

communityID

Optional column name or index in data of community identifier variable. If known, it can support the three options within community.step: "community_level", "individual_level" and "PerCommunity" (See details for community.step).

community.wts

Optional choice to provide/ construct a matrix of community-level observation weights (where dimension = J\times2, where J = the number of communities). The first column contains the identifiers / names of communities (ie., data[, communityID]) and the second column contains the corresponding non-negative weights. Currently only support a numeric matrix with 2 columns, "size.community" (Default) and "equal.community". If setting community.wts = "size.community", treat the number of individuals within each community as its weight, respectively. If community.wts = "equal.community", assumed weights to be all 1;

pooled.Q

Logical for incorporating hierarchical data to estimate the outcome mechanism. If TRUE, use a pooled individual-level regression for initial estimation of the mean outcome (i.e., outcome mechanism). Default to be FASLE. See "Details".

f_g0

Optional function used to specify model knowledge about value of Anodes. It estimates P(A | W, E) under g0 by sampling a large vector/ data frame of Anode (of length nrow(data)*n_MCsims or number of rows if a data frame) from f_g0 function.

f_gstar1

Either a function or a vector or a matrix/ data frame of counterfactual exposures, dependin on the number of exposure variables. If a matrix/ data frame, its number of rows must be either nrow(data) or 1 (constant exposure assigned to all observations), and its number of columns must be length(Anodes). Note that the column names should match with the names in Anodes. If a vector, it must be of length nrow(data) or 1. If a function, it must return a vector or a data frame of counterfactual exposures sampled based on Anodes, WEnodes (and possibly communityID) passed as a named argument "data". Thus, the function must include "data" as one of its argument names. The interventions defined by f_gstar1 can be static, dynamic or stochastic. See Exmaples below.

f_gstar2

Either a function or a vector or a matrix/ data frame of counterfactual exposures, dependin on the number of exposure variables. It has the same components and requirements as f_gstar1 has.

Qform

Character vector of regression formula for Ynode. If not specified (i.e., NULL), the outcome variable is regressed on all covariates included in Anodes and WEnodes (i.e., Ynode ~ Anodes + WEnodes). See "Details".

Qbounds

Vector of upper and lower bounds on Y and predicted value for initial Q. Default to the range of Y, widened by 10% of the min and max values. See "Details".

alpha

Used to keep predicted values for initial Q bounded away from (0,1) for logistic fluctuation (set Qbounds to (1 - alpha), alpha).

fluctuation

Default to "logistic", it could also be "linear" (for targeting step).

hform.g0

Character vector of regression formula for estimating the conditional density of P(A | W, E) under observed treatment mechanism g0. If not specified, its form will be Anodes ~ WEnodes. If there are more than one expsosure, it fits a joint probability. See section "Modeling P(A | W, E) for covariates (A, W, E)".

hform.gstar

Character vector of regression formula for estimating the conditional density P(A | W, E) under user-supplied interventions f_gstar1 or f_gstar2. If not specified, it use the same regression formula as used in hform.g0.

lbound

Value between (0,1) for truncation of predicted P(A | W, E). Default to 0.005

h.g0_GenericModel

An object of GenericModel R6 class containing the previously fitted models for P(A | W, E) under observed treatment mechanism g0, one of the returns of tmleCommunity function. If known, predictions for P(A=a | W=w, E=e) under g0 are based on the fitted models in h.g0_GenericModel.

h.gstar_GenericModel

An object of GenericModel R6 class containing the previously fitted models for P(A^* | W, E) under intervention gstar, one of the returns of tmleCommunity function. If known, predictions for P(A=a | W=w, E=e) under gstar are based on the fitted models in h.gstar_GenericModel.

TMLE.targetStep

TMLE targeting step method, either "tmle.intercept" (Default) or "tmle.covariate". See "Details".

n_MCsims

Number of simulations for Monte-Carlo analysis. Each simulation generates new exposures under f_gstar1 or f_gstar2 (if specified) or f_g0 (if specified), with a sample size of nrow(data). Then these generated expsosures are used when fitting the conditional densities P(A | W, E). and estimating for IPTW and GCOMP under intervention f_gstar1 or f_gstar2. Note that deterministic intervention only needs one simulation and stochastic intervention could use more simulation times such as 10 (Default to 1).

CI_alpha

Significance level (alpha) used in constructing a confidence interval. Default to 0.05.

rndseed

Random seed for controlling sampling A under f_gstar1 or f_gstar2 (for reproducibility of Monte-Carlo simulations)

verbose

Flag. If TRUE, print status messages. Default to getOption("tmleCommunity.verbose") (Default to FALSE). It can be turned on by setting options(tmleCommunity.verbose = TRUE).

Details

The estimates returned by tmleCommunity are of a treatment-specific mean, E[Y_{g^*}], the expected community-level outcome if all communities in the target population received the exposures generated under the user-supplied (deterministic or stochastic) intervention g^*.

data must be a data frame (no matrix accepted) and does not support factor values (considering removing or recording such columns as strings), data includes the following (optional) columns:

community.step specifies how to read the structure of the input data (as hierarchical or non-hierarchical) and how to analyze the data. community_level TMLE ("community_level") is exactly analogous to the TMLE of the treatment specific individual-level mean outcome ("NoCommunity") with the trivial modification that the community rather than the individual serves as the unit of analysis. communityID is needed when using "community_level", "individual_level" and "perCommunity". Lack of communityID forces the algorithm to automatically pool data over all communities and treat it as non-hierarchical dataset (so force community.step = "NoCommunity"). If community.step = "individual_level", it incorporates with working models that assume that each individual's outcome is known not to be affected by the covariates of other individuals in the same community (i.e., "no covariate interference"). This strong assumption can be relaxed by integrating knowledge of the dependence structure among individuals within communities (i.e., "weak covariate interference"). But currrently only the "no covariate interference" assumption is implemented. If community.step = "perCommunity"), run a single TMLE on each community and calculate a (weighted) mean outcome for the J communities.

pooled.Q is in regard to incorporate the working model of "no covariate inference" in community-level TMLE (and the corresponding IPTW and GCOMP) although the working model is not assumed to hold. In other words, when community.step = "community_level", if pooled.Q = TRUE, add pooled individual-level regressions as candidates in the Super Learner library for initial estimation of the outcome mechanism. If pooled.Q = FALSE, both outcome and treatment mechanisms are estimated on the community-level (no use of individual-level information).

Qform should be NULL, in which cases all parent nodes of Y node will be used as regressors, or a character vector that can be coerced to class "formula". If Qestimator (an argument in tmleCom_Options) is "speedglm__glm" (or "glm__glm"), then speedglm (or glm) will be called using the components of Qform. If Qestimator is "SuperLearner", then SuperLearner will be called after a data frame is created using Qform, based on the specified algorithms in SL.library (an argument in tmleCom_Options); If Qestimator is "h2o__ensemble", then h2o and h2oEnsemble will be called after a H2OFrame dataset is creating using Qform, based on specified algorithms in h2olearner and h2ometalearner. See "Arguments" in tmleCom_Options.

hform.g0 and hform.gstar should also be NULL, in which cases all parent nodes of A node(s) will be used as regressors, or a character vector that can be coerced to class "formula". It follows the same rules applied to Qform except it's now based on gestimator (an argument in tmleCom_Options). Besides, gestimator can be "sl3_pipelines". If gestimator is "sl3_pipelines", then learner$train will be called after a sl3 learner has been created by using make_learner. See "Arguments" in tmleCom_Options.

Qbounds can be used to bound continuous outcomes Y. If Qbounds not specified (NULL), it will be set to the range of Y, widened by 10% of the minimum and maximum. That is, [0.9*min(Y), 1.1*max(Y)]. If specified, then Y will be truncated at the min max values of Qbounds, and then scaled to be in [0, 1] by (Y - min(Qbound))/(diff(Qbound)). Statistical inferences and for the transformed outcome can be directly translated back into statistical inference for the unscaled outcome. Once Qbounds finish bounding the observed outcomes, it will be set to (1 - alpha, alpha) and used to bound the predicted values for the initial outcome mechanism. Thus, alpha needs to be between (0, 1), otherwise reset to 0.995. Besides, lbound can be used to truncate the weights h_gstar/h_gN, that is, [0, 1/lbound].

TMLE.targetStep specifies how to use weights h_gstar/h_gN in the TMLE targeting step. If tmle.intercept (default), it performs the weighted intercept-based TMLE that runs a intercept-only weighted logistic regression using offsets logit(Qstar) and weights h_gstar/h_gN and so no covariate. If tmle.covariate, it performs the unweighted covariate-based TMLE that run a unweighted logistic regression using offsets logit(Qstar) and a clever covariate h_gstar/h_gN.

Value

tmleCommunity returns an object of class "tmleCommunity", which is a named list containing the estimation results stored by the following 3 elements:

Each element in the returned tmleCommunity object is itself a list containing the following 8 items:

Estimations are based on either community-level or individual-level analysis. Each analysis currently implements 3 estimators:

IPTW estimator

The IPTW estimator is based on the TMLE weights h_{g^*}(A^*,W,E)/h_{g}(A,W,E), which is equivalent to P_{g^*}(A^*|W,E)/P_{g}(A|W,E) and is defined as the the weighted average of the observed outcomes Y. The following algorithm shows a general template of the community-level IPTW:

For individual-level IPTW, it reads the input data as O_{i,j} = (E_j, W_{i,j}, A_j, Y_{i,j}: j = 1, ..., J; i = 1, ..., n_j) and incorporates working model that assumes no covariate interference, weighing each individual within one community by α_{i,j}, where the IPTW estimator is given by:

ψ^{II}_{IPTW, n}=\frac{1}{J}∑_{j=1}^{J}∑_{i=1}^{n_j}α_{i,j}Y_{i,j}\frac{P_{g^{*}}(A^*_j|W_{i,j}, E_j)} {P_{g^{*}}(A_j | W_{i,j}, E_j)}

TMLE estimator

The TMLE estimator is based on the updated model prediction \bar{Q}^*(A, W, E) and is defined by the G-formula. The following algorithm shows a general template of the community-level TMLE:

For individual-level TMLE, its estimator is obtained as:

ψ^{II}_{TMLE,n}=\frac{1}{J}∑_{j=1}^{J}∑_{i=1}^{n_j}α_{i,j}\int_{a}\hat{\bar{Q}}^*(a, W_{i,j}, E_j)g^*(a|W_{i,j}, E_j)dμ(a)

GCOMP estimator

The GCOMP estimator is similar to the the TMLE estimator except it uses the untargeted (initial) model \hat{\bar{Q}}^c(A|W,E) instead of its targeted version \hat{\bar{Q}}^{c*}(A | W, E), for example the community-level GCOMP estimator is given by:

ψ^{I}_{GCOMP,n}=\frac{1}{J}∑_{j=1}^{J}\int_{a}\hat{\bar{Q}}^{c}(a, \textbf{W}_j, E_j)g^{c*}(a|\textbf{W}_j, E_j)dμ(a)

Modeling P(A | W, E) for covariates (A, W, E)

For simplicity (and without loss of generality), we now suppose that there is no hierarchical structure in data and are interested in finding an non-parametric estimator of the common (in i) individual-level exposure mechanism g_0(A|W), or the commom multivariate joint conditional probability model P_{g_0}(A|W), where the exposures and baseline covariates ((A,W)=(A_i, W_i: i=1,...,n)) denote the random variables drawn jointly from distribution H_0(A, W) with denisty h_0(a, w) \equiv g_0(a|w)q_{W,0}(w) and q_{W,0}(W) denotes the marginal density of the baseline covariates W specified by the regression formula hform.g0 (Notice that an non-parametric estimator of the model P_{g^*_0}(A|W)) is similar, except that now the exposures and baseline covariates ((A^*,W)=(A^*_i, W_i: i=1,...,n)) are randomly drawn from H^*_0(A, W) with density h^*_0(a, w) \equiv g^*_0(a|w)q_{W,0}(w), where A^* is determined by the user-supplied (stochastic) intervention f_gstar1 or f_gstar2 and q_{W,0}(W) denotes the marginal density of the baseline covariates W specified by the regression formula hform.gstar. Thus, the fitting algorithm for P_{g^*_0}(A|W)) is equivalent for P_{g_0}(A|W)).

Note that A can be multivariate (i.e., (A(1), ..., A(M))) and each of its components A(m) can be either binary, categorical or continuous. The joint probability model for P(A|W)=P(A(1),...,A(M)|W) can be factorized as a sequence P(A(1)|W) \times P(A(2)|W,A(1)) \times ... \times P(A(M)|W, A(1),...,A(M-1)), where each of these conditional probability models P(A(m)|W,A(1),...,A(m-1)) is fit separately, depending on the type of the m-specific outcome variable A(m). For binary A(m), the conditional probability P(A(m)|W,A(1),...,A(m-1)) will be esimtated by a user-specific library of candidate algorithms, including parametric estimators such as logistic model with only main terms, and data-adaptive estimator such as super-learner algorithms. For continuous (or categorical) A(m), consider a sequence of values δ_1, δ_2,...,δ_{K+1} that span the range of A and define K bins and the corresponding K bin indicators (B_1,...,B_K), in which case each bin indicator B_k \equiv [δ_k, δ_{k+1}) is used as an binary outcome in a seperate user-specific library of candidate algorithms, with predictors given by (W, A(1),..., A(m-1)). That is how the joint probability P(A|W) is factorized into such an entire tree of binary regression models.

For simplicity (and without loss of generality), we now suppose A is univariate (i.e., M=1) and continuous and a general template of an fitting algorithm for P_{g_0}(A|W) is summrized below:

  1. Consider the usual setting in which we observe n independently and identically distributed copies o_i=(w_i, a_i, y_i: i=1,..,n) of the random variable O=(W, A, Y), where the observed (a_i: i=1...,n) are continuous.

  2. As described above, consider a sequence of K+1 values that span the support of A values into K bin intervals Δ = (δ_1, δ_2,...,δ_{K+1}) so that any observed data point a_i belongs to one interval within R, in other words, for each possible value a \in A (even if it is not in the observed (a_i:i)), there always exists a k \in {1, ...,K} such that δ_{k}≤q a<δ_{k+1}, and the length (bandwidth) of the interval can be defined as bw_{k}=δ_{k+1}-δ_{k}. Then let the mapping S(a)\in \{1,2,..,K\} denote a unique index of the indicator in Λ that a falls in, where S(a)=k if a\in [δ_{k},δ_{k+1}), namely, δ_{S(a)} ≤q a < δ_{S(a)+1}. Moreover, we use b_k to denote a binary indicator of whether the observed a belongs to bin k (i.e., b_k\equiv I(S(a)=k) for all k≤q S(a); b_k\equiv NA for all k>S(a)). This is similar to methods for censored longitudinal data, which code observations as NA (censored or missing) once the indicator b_k jumps from 0 to 1. Since a is a realization of the random variable A for one individual, the corresponding random binary indicators of whether A belongs to bin k can be denoted by B_k:k=1,..,=K where B_k \equiv I(S(A)=k) for all k≤q S(A); B_k\equivNA for all k>S(A).

  3. Then for each k = 1,...,K, a binary nonparametric regression is used to estimate the conditional probability P(B_k=1|B_{k-1}=0,W), which corresponds to the probability of B_k jumping from 0 to 1, given B_{k-1}=0 and tbe baseline covariates W. Note tha for each k, the corresponding nonparametric regression model is fit only among observations that are uncensored (i.e., still at risk of getting B_{k}=1 with B_{k-1}=0). Note the above conditional probability P(B_k=1|B_{k-1}=0,W) is equivalent to P(A\in [δ_{k}, δ_{k+1}) | A≥q δ_{k}, W), which is the probability of A belongs to the interval [δ_{k},δ_{k+1}), conditional on A does not belong to any intervals before [δ_{k}, δ_{k+1}) and W. Then the discrete conditional hazard function for each k is defined as a normalization of the conditional probability using the corresponding interval bandwidth bw_{k}: λ_k(A,W)=\frac{P(B_k=1|B_{k-1}=0,W)}{bw_k}=\frac{P(A\in[δ_{k},δ_{k+1})|A≥q δ_{k+1},W)}{bw_k}

  4. Finally, for any given observation (a,w), we first find out the interval index k to which a belongs (i.e., k=S(a)\in{1,...,K}). Then the discretized conditional density of P(A=a|W=w) can be evaluated by λ_k(A, W){\times}[∏_{j=1}^{k-1}(1-λ_j(A, W))], which corresponds to the conditional probability of a belongs to the interval [δ_{k},δ_{k+1}) and does not belong to any intervals before, given W.

Three methods for choosing bin (interval) locations for a univariate and continuous variable A

Note that the choice of the values δ_k (k=1,..,K) implies defining the number and positions of the bins. First, a cross- validation selector can be applied to data-adaptively select the candidate number of bins, which minimizes variance and maximizes precision (Do not recommend too many bins due to easily violating the positivity assumption). Then, we need to choose the most convenient locations (cuttoffs) for the bins (for fixed K). There are 3 alternative methods that use the histogram as a graphical descriptive tool to define the bin cutoffs Δ=(δ_1,...,δ_K,δ_{K+1}) for a continuous variable A. In tmleCommunity package, the choice of methods bin.method together with the other discretization arguments in function tmleCom_Options() such as nbins (total number of bins) and maxNperBin (the maximum number of observations in each bin), can be used to define the values of bin cutoffs. See help(tmleCom_Options) for more argument details.

See Also

tmleCommunity-package for the general description of the package,

tmleCom_Options for additional parameters that control the estimation in tmleCommunity,

DatKeepClass for details about storing, managing, subsetting and manipulating the input data,

indSample.iid.cA.cY_list for an example of a continuous exposure and its estimation,

BinaryOutModel, RegressionClass, GenericModel, MonteCarloSimClass

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
## Not run: 
#***************************************************************************************
# Example 1: Hierarchical example, with one binary A and bianry Y 
# True ATE of the community-based treatment is approximately 0.103716
data(comSample.wmT.bA.bY_list)  # load the sample data 
comSample.wmT.bA.bY <- comSample.wmT.bA.bY_list$comSample.wmT.bA.bY
N <- NROW(comSample.wmT.bA.bY)
Qform.corr <- "Y ~ E1 + E2 + W2 + W3 + A" # correct Q form
gform.corr <- "A ~ E1 + E2 + W1"  # correct g
#***************************************************************************************

#***************************************************************************************
# 1.1 Estimating the additive treatment effect (ATE) for two deterministic interventions
# (f_gstar1 = 1 vs f_gstar2 = 0) via community-level / individual-level analysis.
# speed.glm using correctly specified Qform, hform.g0 and hform.gstar;
#***************************************************************************************
# Setting global options that may be used in tmleCommunity(), e.g., using speed.glm
tmleCom_Options(Qestimator = "speedglm__glm", gestimator = "speedglm__glm", maxNperBin = N)

# Community-level analysis without a pooled individual-level regression on outcome
tmleCom_wmT.bA.bY.1a_sglm <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "community_level", communityID = "id", pooled.Q = FALSE, 
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)

# Examples of estimates under f_gstar1 = 1:
tmleCom_wmT.bA.bY.1a_sglm$EY_gstar1$estimates
tmleCom_wmT.bA.bY.1a_sglm$EY_gstar1$vars
tmleCom_wmT.bA.bY.1a_sglm$EY_gstar1$CIs

# Examples of estimates under f_gstar0 = 0:
tmleCom_wmT.bA.bY.1a_sglm$EY_gstar2$estimates
tmleCom_wmT.bA.bY.1a_sglm$EY_gstar2$vars
tmleCom_wmT.bA.bY.1a_sglm$EY_gstar2$CIs

# Examples of estimates for ATE under f_gstar1 - f_gstar0:
tmleCom_wmT.bA.bY.1a_sglm$ATE$estimates
tmleCom_wmT.bA.bY.1a_sglm$ATE$vars
tmleCom_wmT.bA.bY.1a_sglm$ATE$CIs
head(tmleCom_wmT.bA.bY.1a_sglm$ATE$IC)

# Community-level analysis with a pooled individual-level regression on outcome
tmleCom_wmT.bA.bY.1b_sglm <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "community_level", communityID = "id", pooled.Q = TRUE, 
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleCom_wmT.bA.bY.1b_sglm$ATE$estimates

# Individual-level analysis with both individual-level outcome and treatment mechanisms
tmleCom_wmT.bA.bY.2_sglm <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "individual_level", communityID = "id", 
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleCom_wmT.bA.bY.2_sglm$ATE$estimates

# Failing to provide communityID will automatically set community.step to "NoCommunity"
tmleCom_wmT.bA.bY.NoC_sglm <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "individual_level", communityID = NULL, 
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleCom_wmT.bA.bY.NoC_sglm$ATE$estimates

# Stratification analysis that run separate outcome (exposure) mechanism for each community
# use glm since only around 50 observations per community, speed.glm easily fails
# takes longer time than the tests above since doing 1000 TMLEs (one TMLE per community)
# so set verbose to TRUE to track running progress
tmleCom_Options(Qestimator = "glm__glm", gestimator = "glm__glm", maxNperBin = N)
tmleCom_wmT.bA.bY.str_sglm <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "perCommunity", communityID = "id", verbose = TRUE,
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleCom_wmT.bA.bY.str_sglm$ATE$estimates

#***************************************************************************************
# 1.2 Same as above but for different Qestimator and gestimator through tmleCom_Options()
# via community-level analysis with a pooled individual-level regression on outcome.
# (See more details in examples in tmleCom_Options())
#***************************************************************************************
# SuperLearner for both outcome and treatment (clever covariate) regressions
# using all parent nodes (of Y and A) as regressors (respectively)
require("SuperLearner")
tmleCom_Options(Qestimator = "SuperLearner", gestimator = "SuperLearner", 
                maxNperBin = N, SL.library = c("SL.glm", "SL.step", "SL.bayesglm"))
tmleCom_wmT.bA.bY.2_SL <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "community_level", communityID = "id", pooled.Q = TRUE, 
                Qform = NULL, hform.g0 = NULL, hform.gstar = NULL)
tmleCom_wmT.bA.bY.2_SL$ATE$estimates

# SuperLearner for outcome regressions and glm for treatment regressions
# using all regressors in the correctly specified Qform and 
# all regressors in the misspecified hform.g0 and hform.gstar
tmleCom_Options(Qestimator = "SuperLearner", gestimator = "glm__glm", 
                maxNperBin = N, SL.library = c("SL.mean", "SL.stepAIC", "SL.bayesglm"))
tmleCom_wmT.bA.bY.2_SL.glm <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "community_level", communityID = "id", pooled.Q = TRUE,
                Qform = NULL, hform.g0 = "A ~ W1", hform.gstar = "A ~ E1 + W2")
tmleCom_wmT.bA.bY.2_SL.glm$ATE$estimates
  
# Speedglm for outcome regressions and sl3 for treatment regressions
# using all regressors in the correctly specified Qform and 
# all regressors in the misspecified hform.g0 and hform.gstar  
tmleCom_Options(Qestimator = "speedglm__glm", gestimator = "sl3_pipelines", maxNperBin = N, 
                sl3_learners = list(glm_fast = sl3::make_learner(sl3::Lrnr_glm_fast), 
                                    xgb = sl3::make_learner(sl3::Lrnr_xgboost),), 
                sl3_metalearner = sl3::make_learner(
                  sl3::Lrnr_optim, loss_function = sl3::loss_loglik_binomial,
                  learner_function = sl3::metalearner_logistic_binomial))
tmleCom_wmT.bA.bY.2_speedglm.sl3 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "community_level", communityID = "id", pooled.Q = TRUE,
                Qform = NULL, hform.g0 = "A ~ W1", hform.gstar = "A ~ E1 + W2")
tmleCom_wmT.bA.bY.2_speedglm.sl3$ATE$estimates  

#***************************************************************************************
# 1.3 Evaluating mean population outcome under static intervention A = 0
# with different community-level and individual-level weight choices 
#***************************************************************************************
tmleCom_Options(Qestimator = "speedglm__glm", gestimator = "speedglm__glm", maxNperBin = N)
# weigh individuals in data equally & weigh community by its number of individuals
tmleCom_wmT.bA.bY.1a_w1 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, 
                obs.wts = "equal.within.pop", community.wts = "size.community", 
                community.step = "community_level", communityID = "id")
tmleCom_wmT.bA.bY.1a_w1$EY_gstar1$estimates

# same as above but weigh individuals within the same community equally
tmleCom_wmT.bA.bY.1a_w2 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, 
                obs.wts = "equal.within.community", community.wts = "size.community", 
                community.step = "community_level", communityID = "id")
tmleCom_wmT.bA.bY.1a_w2$EY_gstar1$estimates

# weigh individuals within the same community equally & weigh community equally
tmleCom_wmT.bA.bY.1a_w3 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, 
                obs.wts = "equal.within.community", community.wts = "equal.community", 
                community.step = "community_level", communityID = "id")
tmleCom_wmT.bA.bY.1a_w3$EY_gstar1$estimates

#***************************************************************************************
# 1.4 Specifying user-supplied stochastic or deterministic intervention function
#***************************************************************************************
# Intervention function that will sample A with probability P(A=1) = prob.val
define_f.gstar <- function(prob.val, rndseed = NULL) {
  eval(prob.val) 
  f.gstar <- function(data, ...) {
    print(paste0("probability of selection: ", prob.val))
    rbinom(n = NROW(data), size = 1, prob = prob.val)
  }
  return(f.gstar)
}
# Stochastically set 50% of the population to A=1 
f.gstar_stoch.0.5 <- define_f.gstar(prob.val = 0.5)
# Deterministically set 100% of the population to A=1 
f.gstar_determ.1 <- define_f.gstar(prob.val = 1)
# Deterministically set 100% of the population to A=0
f.gstar_determ.0 <- define_f.gstar(prob.val = 0)

#***************************************************************************************
# 1.5 Equivalent ways of specifying user-supplied (static) intervention (f_gstar1 = 1)
#***************************************************************************************
# Alternative 1: via intervention functoin that sets every invidual's A to constant 1
tmleCom_wmT.bA.bY.1a_fgtar1 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = f.gstar_determ.1, 
                community.step = "community_level", communityID = "id")
tmleCom_wmT.bA.bY.1a_fgtar1$EY_gstar1$estimates

# Alternative 2: by simply setting f_gstar1 to 1
tmleCom_wmT.bA.bY.1a_fgtar2 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L,
                community.step = "community_level", communityID = "id")
tmleCom_wmT.bA.bY.1a_fgtar2$EY_gstar1$estimates

# Alternative 3: by setting f_gstar1 to a vector of 1's of length NROW(data)
tmleCom_wmT.bA.bY.1a_fgtar3 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), 
                f_gstar1 = rep(1L, NROW(comSample.wmT.bA.bY)), 
                community.step = "community_level", communityID = "id")
tmleCom_wmT.bA.bY.1a_fgtar1$EY_gstar1$estimates

#***************************************************************************************
# 1.6 Running exactly the same estimator as 1.1 but using h_gstar/h_gN as a coviariate 
# in the targeting step (default to use weighted intercept-based TMLE)
#***************************************************************************************
# unweighted covariate-based TMLE
tmleCom_wmT.bA.bY.1a_covTMlE <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "community_level", communityID = "id", pooled.Q = FALSE, 
                TMLE.targetStep = "tmle.covariate",  # default as "tmle.intercept"
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleCom_wmT.bA.bY.1a_covTMlE$ATE$estimates

#***************************************************************************************
# 1.7 Equivalent ways of specifying the regression formulae 
# (if Ynode is specified as "Y" and WEnodes = c("E1", "E2", "W1", "W2", "W3"))
#***************************************************************************************
# For outcome regression, the left side of Qform will be ignored if Ynode is specified,
# with dependent variable being set to Ynode.
Qform1 <- "Y ~ E1 + E2 + W2 + W3 + A" 
Qform2 <- "AnythingIsFine ~ E1 + E2 + W2 + W3 + A" 
Qform3 <- NULL  # since all parent nodes of Y will be used as regressors

# For treatment regressions, if hform.gstar unspecified, it uses the same regression
# formula as hform.g0 does. 
# Alternative 1: specify hform.g0 and hform.gstar respectively
hform.g0 <- "A ~ E1 + E2 + W1"
hform.gstar <- "A ~ E1 + E2 + W1"

# Alternative 2: specify hform.g0 only
hform.g0 <- "A ~ E1 + E2 + W1"
hform.gstar <- NULL

#***************************************************************************************
# 1.8 Equivalent ways of allowing printing status message
#***************************************************************************************
# Controlling the global setting 
options(tmleCommunity.verbose = TRUE)
tmleCom_wmT.bA.bY.print1 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "community_level", communityID = "id")

# Alternative: using the verbose argument in tmleCommunity()
tmleCom_wmT.bA.bY.print2 <- 
  tmleCommunity(data = comSample.wmT.bA.bY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("E1", "E2", "W1", "W2", "W3"), f_gstar1 = 1L, f_gstar2 = 0L,
                community.step = "community_level", communityID = "id", verbose = TRUE)

#***************************************************************************************
# Example 2: Non-hierarchical example, with one continuous A and continuous Y 
# True mean population outcome under stochastic intervention (specified below)
# is approximately 3.50856
data(indSample.iid.cA.cY_list)  # load the sample data 
indSample.iid.cA.cY <- indSample.iid.cA.cY_list$indSample.iid.cA.cY
true.shift <- indSample.iid.cA.cY_list$shift.val  # 2
true.truncBD <- indSample.iid.cA.cY_list$truncBD  # 10
N <- NROW(indSample.iid.cA.cY)
Qform.corr <- "Y ~ W1 + W2 + W3 + W4 + A" # correct Q form
gform.corr <- "A ~ W1 + W2 + W3 + W4"  # correct g
#***************************************************************************************

#***************************************************************************************
# 2.1 Specifying stochastic intervention function that could represent the true 
# shifted version of the current treatment mechanism
#***************************************************************************************
define_f.gstar <- function(shift.val, truncBD, rndseed = NULL) {
  shift.const <- shift.val
  trunc.const <- truncBD
  f.gstar <- function(data, ...) {
    print(paste0("shift.const: ", shift.const))
    set.seed(rndseed)
    A.mu <- 0.86 * data[,"W1"] + 0.41 * data[,"W2"] - 0.34 * data[,"W3"] + 0.93 * data[,"W4"]
    untrunc.A <- rnorm(n = nrow(data), mean = A.mu + shift.const, sd = 1)
    r.new.A <- exp(0.8 * shift.const * (untrunc.A - A.mu - shift.const / 3))
    trunc.A <- ifelse(r.new.A > trunc.const, untrunc.A - shift.const, untrunc.A)
    return(trunc.A)
  }
  return(f.gstar)
}
# correctly specified stochastic intervention with true shift value and truncated bound
f.gstar.corr <- define_f.gstar(shift = true.shift, truncBD = true.truncBD)
# Misspecified specified stochastic intervention 
f.gstar.mis <- define_f.gstar(shift = 5, truncBD = 8)

#***************************************************************************************
# 2.2 Estimating mean population outcome under different stochastic interventions
# speed.glm using correctly specified Qform, hform.g0 and hform.gstar
#***************************************************************************************
tmleCom_Options(Qestimator = "speedglm__glm", gestimator = "speedglm__glm", maxNperBin = N)
# correctly specified stochastic intervention 
tmleind_iid.cA.cY_true.fgstar <- 
  tmleCommunity(data = indSample.iid.cA.cY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("W1", "W2", "W3", "W4"), f_gstar1 = f.gstar.corr,
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleind_iid.cA.cY_true.fgstar$EY_gstar1$estimates

# misspecified stochastic intervention
tmleind_iid.cA.cY_mis.fgstar <- 
  tmleCommunity(data = indSample.iid.cA.cY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("W1", "W2", "W3", "W4"), f_gstar1 = f.gstar.mis,
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleind_iid.cA.cY_mis.fgstar$EY_gstar1$estimates

#***************************************************************************************
# 2.3 Same as above but using larger number of Monte-Carlo simulations
# using all parent nodes (of Y and A) as regressors (respectively)
#***************************************************************************************
# A will be sampled 10 times (for a total sample size of NROW(data)*10 under f_gstar1)
tmleind_iid.cA.cY_10MC <- 
  tmleCommunity(data = indSample.iid.cA.cY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("W1", "W2", "W3", "W4"), f_gstar1 = f.gstar.corr, n_MCsims = 10)
tmleind_iid.cA.cY_10MC$EY_gstar1$estimates

#***************************************************************************************
# 2.4 Running exactly the same estimator as 2.1 but defining different values of bin cutoffs 
#***************************************************************************************
# using equal-length method with 10 bins 
tmleCom_Options(bin.method = "equal.len", nbins = 10, maxNperBin = N)
tmleind_iid.cA.cY_len <- 
  tmleCommunity(data = indSample.iid.cA.cY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("W1", "W2", "W3", "W4"), f_gstar1 = f.gstar.corr,
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleind_iid.cA.cY_len$EY_gstar1$estimates

# using combination of equal-length and equal-mass method with 20 bins 
tmleCom_Options(bin.method = "dhist", nbins = 20, maxNperBin = N)
tmleind_iid.cA.cY_dhist <- 
  tmleCommunity(data = indSample.iid.cA.cY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("W1", "W2", "W3", "W4"), f_gstar1 = f.gstar.corr,
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)
tmleind_iid.cA.cY_dhist$EY_gstar1$estimates

#***************************************************************************************
# 2.5 Estimating the additive treatment effect (ATE) for two stochastic interventions
#***************************************************************************************
# Intervention function that will shift A by constant rate (shift.rate)
# (A special case of stochastic intervention with constant shift)
define_f.gstar <- function(shift.rate) {
  eval(shift.rate) 
  f.gstar <- function(data, ...) {
    print(paste0("rate of shift: ", shift.rate))
    data[, "A"] * shift.rate
  }
  return(f.gstar)
}
f.gstar_shift0.8 <- define_f.gstar(shift.rate = 0.8)
f.gstar_shift0.5 <- define_f.gstar(shift.rate = 0.6)

tmleCom_Options(bin.method = "equal.mass", nbins = 5, maxNperBin = N)
tmleind_iid.cA.cY_ATE <- 
  tmleCommunity(data = indSample.iid.cA.cY, Ynode = "Y", Anodes = "A", 
                WEnodes = c("W1", "W2", "W3", "W4"),
                f_gstar1 = f.gstar_shift0.8, f_gstar2 = f.gstar_shift0.5,
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr)

# ATE estimates for f_gstar1-f_gstar2:
tmleind_iid.cA.cY_ATE$ATE$estimates
tmleind_iid.cA.cY_ATE$ATE$vars
tmleind_iid.cA.cY_ATE$ATE$CIs   

#***************************************************************************************
# Example 3: Non-Hierarchical example, with one binary A and one rare bianry Y 
# (Independent case-control)  True ATE is approximately 0.012662
data(indSample.iid.bA.bY.rareJ1_list)
indSample.iid.bA.bY.rareJ1 <- indSample.iid.bA.bY.rareJ1_list$indSample.iid.bA.bY.rareJ1
obs.wt.J1 <- indSample.iid.bA.bY.rareJ1_list$obs.wt.J1
Qform.corr <- "Y ~ W1 + W2*A + W3 + W4" # correct Q form
gform.corr <- "A ~ W1 + W2 + W3 + W4"  # correct g
tmleCom_Options(maxNperBin = NROW(indSample.iid.bA.bY.rareJ1))
#***************************************************************************************

#***************************************************************************************
# 3.1 Estimating ATE for f_gstar1 = 1 vs f_gstar2 = 0
# using correct observation weights and correctly specified Qform & gform 
#***************************************************************************************
tmleind_iid.bA.bY_corrWT <- 
  tmleCommunity(data = indSample.iid.bA.bY.rareJ1, Ynode = "Y", Anodes = "A",
                WEnodes = c("W1", "W2", "W3", "W4"), f_gstar1 = 1, f_gstar2 = 0,
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr,
                obs.wts = obs.wt.J1, verbose = TRUE)
tmleind_iid.bA.bY_corrWT$ATE$estimates["tmle", ]  # 0.01220298, good estimate

#***************************************************************************************
# 3.2 Same as above but not specifying the observation weights
# obs.wts = NULL is equivalent to obs.wts = "equal.within.pop"
#***************************************************************************************
tmleind_iid.bA.bY_misWT <- 
  tmleCommunity(data = indSample.iid.bA.bY.rareJ1, Ynode = "Y", Anodes = "A",
                WEnodes = c("W1", "W2", "W3", "W4"), f_gstar1 = 1, f_gstar2 = 0,
                Qform = Qform.corr, hform.g0 = gform.corr, hform.gstar = gform.corr,
                obs.wts = NULL, verbose = TRUE)
tmleind_iid.bA.bY_misWT$ATE$estimates["tmle", ]  # 0.2466575, bad estimate

## End(Not run)

chizhangucb/tmleCommunity documentation built on May 20, 2019, 3:34 p.m.