knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(ltmle)
Consider the observed data with time ordering W -> A -> Y. W is a continuous baseline covariate that affects A and Y. A is a binary treatment variable that affects Y. Y is a binary outcome variable.
W ~ N(0, 1)
A ~ binomial with P(A = 1) = expit(W)
Y ~ binomial with P(Y = 1) = expit(W + A)
where $expit(z) = \frac{1}{1 + e^{-z}}$
We want to know $E[Y_1]$ the expected value of Y, intervening to set A to 1.
rexpit <- function(x) rbinom(n=length(x), size=1, prob=plogis(x)) n <- 10000 W <- rnorm(n) A <- rexpit(W) Y <- rexpit(W + A) data <- data.frame(W, A, Y) head(data)
We could try to use the simple mean of Y or the mean of Y when A = 1, but these will be biased.
mean(Y) mean(Y[A == 1])
Now we use ltmle()
.
result <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 1) result
Because we're using simulated data, we can calculate the true value of $E[Y_1]$
n.large <- 1e6 W <- rnorm(n.large) A <- 1 Y <- rexpit(W + A) mean(Y)
Time ordering of data is W1 -> W2 -> W3 -> A -> Y
n <- 1000 W1 <- rnorm(n) W2 <- rbinom(n, size=1, prob=0.3) W3 <- rnorm(n) A <- rexpit(-1 + 2 * W1 + W3) Y <- rexpit(-0.5 + 2 * W1^2 + 0.5 * W2 - 0.5 * A + 0.2 * W3 * A - 1.1 * W3) data <- data.frame(W1, W2, W3, A, Y)
True value of $E[Y_1]$ is approximately 0.5939.
SuperLearner semiparametric estimation using all parents as regressors
result <- ltmle(data, Anodes="A", Lnodes=NULL, Ynodes="Y", abar=1, SL.library="default")
TMLE estimate:
summary(result)
IPTW estimate:
summary(result, estimator="iptw")
SuperLearner semiparametric estimation using correctly specified regressors. This passes W1^2, W2, W3, A and W3:A as columns of matrix X to SuperLearner for the Q regression and W1 and W3 as columns of matrix X to SuperLearner for the g regression.
result <- ltmle(data, Anodes="A", Lnodes=NULL, Ynodes="Y", Qform=c(Y="Q.kplus1 ~ I(W1^2) + W2 + W3*A"), gform="A ~ W1 + W3", abar=1, SL.library="default") summary(result)
glm using correctly specified Qform and gform
result <- ltmle(data, Anodes="A", Lnodes=NULL, Ynodes="Y", Qform=c(Y="Q.kplus1 ~ I(W1^2) + W2 + W3*A"), gform="A ~ W1 + W3", abar=1, SL.library=NULL) summary(result)
Get summary measures (additive treatment effect, odds ratio, relative risk) for abar=1 vs abar=0
result <- ltmle(data, Anodes="A", Lnodes=NULL, Ynodes="Y", abar=list(1, 0), SL.library="default") summary(result)
Time ordering of data is W -> C -> Y. C is a censoring node. The parameter of interest is Y, intervening to set C to be uncensored. Censoring nodes are very similar to treatment nodes. The main difference is that after censoring, other data is expected to be NA. It is assumed that the parameter of interest always intervenes to set censoring nodes to uncensored, so this is not specified in abar
. g and Q regressions always stratify on censoring nodes, whereas g and Q regressions can either stratify or pool over treatment nodes (by using the stratify
argument.)
Censoring nodes in data
should be factors with two levels - "censored" and "uncensored". The utility function BinaryToCensoring
can be used to facilitate this.
n <- 100000 W <- rnorm(n) C <- BinaryToCensoring(is.censored = rexpit(W)) summary(C) Y <- rep(NA, n) Y[C == "uncensored"] <- rexpit(W[C == "uncensored"]) data <- data.frame(W, C, Y) head(data, 20) result <- ltmle(data, Anodes = NULL, Cnodes = "C", Ynodes = "Y", abar = NULL) summary(result)
The naive estimate is biased (the true value is 0.5):
mean(data$Y) mean(data$Y, na.rm = T)
Time ordering of data is W -> A1 -> L -> A2 -> Y. L is time-dependent covariate because it occurs after a treatment (or censoring) variable, A1. We indicate this using the Lnodes
argument.
n <- 1000 W <- rnorm(n) A1 <- rexpit(W) L <- 0.3 * W + 0.2 * A1 + rnorm(n) A2 <- rexpit(W + A1 + L) Y <- rexpit(W - 0.6 * A1 + L - 0.8 * A2) data <- data.frame(W, A1, L, A2, Y) head(data)
Treatment regime of interest: set A1 to 0, set A2 to 0
ltmle(data, Anodes=c("A1", "A2"), Lnodes="L", Ynodes="Y", abar=c(0, 0))
W -> A1 -> C -> L -> A2 -> Y
n <- 1000 W <- rnorm(n) A1 <- rexpit(W) C <- BinaryToCensoring(is.censored = rexpit(0.6 * W - 0.5 * A1)) uncensored <- C == "uncensored" L <- A2 <- Y <- rep(NA, n) L[uncensored] <- (0.3 * W[uncensored] + 0.2 * A1[uncensored] + rnorm(sum(uncensored))) A2[uncensored] <- rexpit(W[uncensored] + A1[uncensored] + L[uncensored]) Y[uncensored] <- rexpit(W[uncensored] - 0.6 * A1[uncensored] + L[uncensored] - 0.8 * A2[uncensored]) data <- data.frame(W, A1, C, L, A2, Y) head(data)
Treatment regime of interest: set A1 to 1, set A2 to 0, set C to uncensored:
ltmle(data, Anodes=c("A1", "A2"), Cnodes = "C", Lnodes="L", Ynodes="Y", abar=c(1, 0))
Treatment regime of interest is: Always treat at time 1 (A1 = 1), treat at at time 2 (A2 = 1) if L > 0
abar <- matrix(nrow=n, ncol=2) abar[, 1] <- 1 abar[, 2] <- L > 0 result.abar <- ltmle(data, Anodes=c("A1", "A2"), Cnodes = "C", Lnodes="L", Ynodes="Y", abar=abar) result.abar
The regime can also be specified as a rule
function. rule
is a function applied to each row of data
which returns a numeric vector of the same length as Anodes
.
rule <- function(row) c(1, row["L"] > 0) result.rule <- ltmle(data, Anodes=c("A1", "A2"), Cnodes = "C", Lnodes="L", Ynodes="Y", rule=rule) result.rule
Specfifying the regime using abar
and using rule
give the same result:
summary(result.abar) summary(result.rule)
Consider a simple point treatment problem with observed data $W, A, Y$. But there is a positivity problem - for small values of $W$, $Prob(A = 1)$ is very small.
The true parameter value, $E[Y_1]$ is approximately 0.697.
The true TMLE standard deviation (the standard deviation of the TMLE estimate if we ran it many times on many sets of data) is approximately 0.056.
The true IPTW standard deviation (the standard deviation of the IPTW estimate if we ran it many times on many sets of data) is approximately 0.059.
n <- 1000 W <- rnorm(n) A <- rexpit(4 * W) Y <- rexpit(W + A) df <- data.frame(W, A, Y)
The default variance.method
is "tmle"
- use TMLE in order to approximate the variance of the TMLE estimator. The estimated standard deviation is close to the true TMLE standard deviation.
r1 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE) print(summary(r1))
If variance.method
is "ic"
, variance is estimated using the estimated Influence Curve. This is fast to compute, but may be significantly anticonservative in data with positivity violations.
r2 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE, variance.method="ic") print(summary(r2))
If variance.method
is "iptw"
, then use IPTW in order to approximate the variance of the TMLE estimator. This is faster to compute than variance.method = "tmle"
but less accurate (and slower to compute than variance.method = "ic"
but more accurate).
r3 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE, variance.method="iptw") print(summary(r3))
If we use the IPTW estimator, variance.method
does not change the estimated standard deviation (it only affects the estimated standard deviation of the TMLE estimator).
print(summary(r1, estimator="iptw")) print(summary(r2, estimator="iptw")) #the same - variance.method only affects TMLE print(summary(r3, estimator="iptw")) #the same - variance.method only affects TMLE
We can see that the values of g are very small.
summary(r1$cum.g) summary(r1$cum.g.unbounded) head(data.frame(df, g = r1$cum.g, g.unbounded = r1$cum.g.unbounded), 20)
The id
argument can be used to specify hierarchical data, such as people in a household.
num.households <- 500 people.in.household <- round(runif(num.households, min = 1, max = 10)) length(people.in.household) n <- sum(people.in.household) n W.household <- rnorm(num.households) length(W.household) W.household.expanded <- rep(W.household, times = people.in.household) W.indiv <- rnorm(n) length(W.indiv) A <- rexpit(1.5 * W.household.expanded + 0.4 * W.indiv) Y <- rexpit(-1 + 2.3 * W.household.expanded - 0.6 * W.indiv + 1.2 * A)
id can be an integer, factor, or character (or any type that can be coerced to factor),
id <- 1:num.households id.expanded <- rep(id, times = people.in.household) data <- data.frame(W.household.expanded, W.indiv, A, Y) head(cbind(id.expanded, data), 20)
result.without.id <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 0) result.with.id <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 0, id = id.expanded)
Omitting the id argument makes the individuals seem more independent than they are, which gives artificially low variance estimates.
summary(result.without.id) summary(result.with.id)
The influence curve is a vector with length equal to the number of independent units.
length(result.without.id$IC$tmle) length(result.with.id$IC$tmle)
age -> gender -> A1 -> L1a -> L1b -> Y1 -> A2 -> L2a -> L2b -> Y2
n <- 1000 age <- rbinom(n, 1, 0.5) gender <- rbinom(n, 1, 0.5) A1 <- rexpit(age + gender) L1a <- 2*age - 3*gender + 2*A1 + rnorm(n) L1b <- rexpit(age + 1.5*gender - A1) Y1 <- plogis(age - gender + L1a + 0.7*L1b + A1 + rnorm(n)) A2 <- rexpit(age + gender + A1 - L1a - L1b) L2a <- 2*age - 3*gender + 2*A1 + A2 + rnorm(n) L2b <- rexpit(age + 1.5*gender - A1 - A2) Y2 <- plogis(age - gender + L1a + L1b + A1 + 1.8*A2 + rnorm(n)) data <- data.frame(age, gender, A1, L1a, L1b, Y1, A2, L2a, L2b, Y2)
Also show some different ways of specifying the nodes (either names or indexes works):
result <- ltmle(data, Anodes=c(3, 7), Lnodes=c("L1a", "L1b", "L2a", "L2b"), Ynodes=grep("^Y", names(data)), abar=c(1, 0)) summary(result)
Usually you would specify a Qform
for all of the Lnodes and Ynodes but in this case L1a, L1b, Y1 is a "block" of L/Y nodes not separated by Anodes or Cnodes (the same is true for L2a, L2b, Y2). Only one regression is required at the first L/Y node in a block. You can pass regression formulas for the other L/Y nodes, but they will be ignored.
result <- ltmle(data, Anodes=c(3, 7), Lnodes=c("L1a", "L1b", "L2a", "L2b"), Ynodes=grep("^Y", names(data)), abar=c(1, 0), Qform=c(L1a="Q.kplus1 ~ 1", L2a="Q.kplus1 ~ 1")) summary(result)
Gives the same result but prints a message saying some regression formulas will be dropped:
result <- ltmle(data, Anodes=c(3, 7), Lnodes=c("L1a", "L1b", "L2a", "L2b"), Ynodes=grep("^Y", names(data)), abar=c(1, 0), Qform=c(L1a="Q.kplus1 ~ 1", L1b="Q.klus1~A1", Y1="Q.kplus1~L1a", L2a="Q.kplus1 ~ 1", L2b="Q.klus1~A1", Y2="Q.kplus1~A2 + gender")) summary(result)
If there were a Anode or Cnode between L1b and Y1, Y1 would also need a Q regression formula.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.