Description

Carries out the targeted minimum loss estimation (TMLE) of a non-parametric variable importance measure of a continuous exposure.
Usage

tmle.npvi(obs, f = identity, nMax = 30L,
          flavor = c("learning", "superLearning"), lib = list(), nodes = 1L,
          cvControl = NULL, family = c("parsimonious", "gaussian"),
          cleverCovTheta = FALSE, bound = 1, B = 1e+05, trueGMu = NULL,
          iter = 5L, stoppingCriteria = list(mic = 0.01, div = 0.01, psi = 0.1),
          gmin = 0.05, gmax = 0.95,
          mumin = quantile(f(obs[obs[, "X"] != 0, "X"]), type = 1, probs = 0.01),
          mumax = quantile(f(obs[obs[, "X"] != 0, "X"]), type = 1, probs = 0.99),
          verbose = FALSE, tabulate = TRUE, exact = TRUE, light = TRUE)
Arguments

obs: A matrix of observations (see the Examples section).

f: A function involved in the definition of the parameter of interest; it must satisfy f(0)=0 (see Details). Defaults to identity.

nMax: An integer. Defaults to 30L.

flavor: Indicates whether the construction of the relevant features of P_n^0 and P_n^k, the (not yet targeted) initial and (targeted) successively updated estimators of the true distribution of (W,X,Y), relies on the Super Learning methodology (option "superLearning") or not (option "learning", the default value). In the former case, the SuperLearner package is required (see Details).

lib: A list specifying the library of algorithms (see Details). Defaults to list(), in which case the default library for the chosen flavor is used.

nodes: An integer. Defaults to 1L.

cvControl: NULL (the default value) or an argument controlling cross-validation.

family: Indicates whether the simulation of the conditional distribution of X given W under P_n^k (the initial estimator if k=0 or its kth update if k ≥ 1) should be based on a weighted version of the empirical measure (case "parsimonious", the default value and faster execution) or on a Gaussian model (case "gaussian").

cleverCovTheta: A logical. Defaults to FALSE.

bound: A positive numeric. Defaults to 1.

B: An integer. Defaults to 1e+05.

trueGMu: Either NULL (the default value) or the true 'g' and 'mu' features.

iter: An integer, the maximal number of iterations of the TMLE procedure. Defaults to 5L.

stoppingCriteria: A list of convergence criteria with entries 'mic', 'div' and 'psi' (see Details). Defaults to list(mic = 0.01, div = 0.01, psi = 0.1).

gmin: A positive numeric. Defaults to 0.05.

gmax: A positive numeric. Defaults to 0.95.

mumin: A numeric. Defaults to the 0.01-quantile of f(X) over the observations with X not equal to the reference value.

mumax: A numeric. Defaults to the 0.99-quantile of f(X) over the observations with X not equal to the reference value.

verbose: Prescribes the amount of information output by the function. Defaults to FALSE.

tabulate: A logical. Defaults to TRUE.

exact: A logical. Defaults to TRUE.

light: A logical. Defaults to TRUE.
Details

The parameter of interest is defined as ψ = Ψ(P) with

Ψ(P) = E_P[f(X) * (θ(X,W) - θ(0,W))] / E_P[f(X)^2],

where P is the distribution of the random vector (W,X,Y), θ(X,W) = E_P[Y|X,W], 0 is the reference value for X, and f is a user-supplied function such that f(0)=0 (e.g., f=identity, the default value).
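For instance, a user-supplied f can replace the default provided it vanishes at the reference value. A minimal sketch, with atan chosen purely for illustration:

## Any candidate 'f' must satisfy f(0) = 0, like the default f = identity.
f <- function(x) atan(x)                  # illustrative choice only
stopifnot(isTRUE(all.equal(f(0), 0)))     # sanity check before calling tmle.npvi
## The function would then be passed as tmle.npvi(obs, f = f, ...).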
The TMLE procedure stops when the maximal number of iterations, iter, is reached, or when at least one of the following criteria is met:

- The empirical mean P_n effIC(P_n^{k+1}) of the efficient influence curve at P_n^{k+1}, scaled by the estimated standard deviation of the efficient influence curve at P_n^{k+1}, is smaller in absolute value than mic.

- The total variation (TV) distance between P_n^k and P_n^{k+1} is smaller than div.

- The change between the successive values Ψ(P_n^k) and Ψ(P_n^{k+1}) is smaller than psi.
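A minimal sketch, assuming obs has been prepared as in the Examples section, that tightens the convergence criteria and allows more iterations than the defaults:

## Stricter stopping criteria and a larger iteration budget than the defaults.
npvi <- tmle.npvi(obs, f = identity, flavor = "learning", B = 5e4,
                  iter = 10L,
                  stoppingCriteria = list(mic = 0.001, div = 0.001, psi = 0.01))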
If lib is an empty list (list(), the default value) then the default algorithms for the chosen flavor are loaded: learningLib when flavor is set to "learning", or superLearningLib when flavor is set to "superLearning". A valid lib argument must mimic the structure of either learningLib or superLearningLib, depending on flavor.
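One way to start building a custom lib is to inspect the structure of the default library for the chosen flavor. The sketch below assumes that learningLib lives in the tmle.npvi namespace and looks it up defensively in case it is not exported:

library(tmle.npvi)
## Inspect the default 'learning' library so a custom 'lib' can mimic its structure.
## Whether 'learningLib' is exported or internal is an assumption here.
ns <- asNamespace("tmle.npvi")
if (exists("learningLib", envir = ns)) {
  str(get("learningLib", envir = ns), max.level = 1)
}
## A custom library (hypothetical 'myLib') would then be passed as:
## npvi <- tmle.npvi(obs, flavor = "learning", lib = myLib)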
The "superLearning" flavor requires the SuperLearner package and, by default, the e1071, gam, glmnet, polspline and randomForest packages. If family is set to "parsimonious" (recommended) then the packages sgeostat and geometry are required.
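A minimal sketch that checks whether these suggested packages are available before the corresponding options are selected:

## Collect the packages needed for the chosen flavor and family, then
## verify that each of them is installed.
chosenFlavor <- "superLearning"   # or "learning"
chosenFamily <- "parsimonious"    # or "gaussian"
needed <- character(0)
if (chosenFlavor == "superLearning") {
  needed <- c(needed, "SuperLearner", "e1071", "gam", "glmnet", "polspline", "randomForest")
}
if (chosenFamily == "parsimonious") {
  needed <- c(needed, "sgeostat", "geometry")
}
notInstalled <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
if (length(notInstalled)) {
  stop("Please install: ", paste(notInstalled, collapse = ", "))
}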
Value

Returns an object of class "NPVI" summarizing the different steps of the TMLE procedure. The method getHistory outputs the "history" of the procedure (see getHistory). The object notably includes the following information:
obs: The matrix of observations used by the procedure (as returned by getObs).

psi: The TMLE of the parameter of interest.

psi.sd: The estimated standard deviation of the TMLE of the parameter of interest.
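A minimal sketch, assuming a fitted object npvi as in the Examples section, that reads the final estimate and its standard deviation off the history to form a 95% Wald confidence interval, using only the accessors demonstrated in the Examples:

hh <- getHistory(npvi)
psi.hat <- hh[nrow(hh), "psi"]   # final TMLE of the parameter
sic <- hh[nrow(hh), "sic"]       # standard deviation estimate used for CIs (as in the Examples)
n <- nrow(getObs(npvi))
ci <- psi.hat + c(-1, 1) * qnorm(0.975) * sic / sqrt(n)
ci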
Author(s)

Antoine Chambaz, Pierre Neuvial
References

Chambaz, A., Neuvial, P., & van der Laan, M. J. (2012). Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic Journal of Statistics, 6, 1059–1099.

Chambaz, A., & Neuvial, P. (2015). tmle.npvi: targeted, integrative search of associations between DNA copy number and gene expression, accounting for DNA methylation. To appear in Bioinformatics Applications Notes.
See Also

getSample, getHistory
Examples

set.seed(12345)
##
## Simulating a data set and computing the true value of the parameter
##
## Parameters for the simulation (case 'f=identity')
O <- cbind(W=c(0.05218652, 0.01113460),
X=c(2.722713, 9.362432),
Y=c(-0.4569579, 1.2470822))
O <- rbind(NA, O)
lambda0 <- function(W) {-W}
p <- c(0, 1/2, 1/2)
omega <- c(0, 3, 3)
S <- matrix(c(10, 1, 1, 0.5), 2, 2)
## Simulating a data set of 200 i.i.d. observations
sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
obs <- sim$obs
## Adding (dummy) baseline covariates
V <- matrix(runif(3*nrow(obs)), ncol=3)
colnames(V) <- paste("V", 1:3, sep="")
obs <- cbind(V, obs)
## Caution! MAKING '0' THE REFERENCE VALUE FOR 'X'
X0 <- O[2,2]
obsC <- obs
obsC[, "X"] <- obsC[, "X"] - X0
obs <- obsC
## True psi and confidence intervals (case 'f=identity')
sim <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
truePsi <- sim$psi
confInt0 <- truePsi + c(-1, 1)*qnorm(.975)*sqrt(sim$varIC/nrow(sim$obs))
confInt <- truePsi + c(-1, 1)*qnorm(.975)*sqrt(sim$varIC/nrow(obs))
cat("\nCase f=identity:\n")
msg <- paste("\ttrue psi is: ", signif(truePsi, 3), "\n", sep="")
msg <- paste(msg, "\t95%-confidence interval for the approximation is: ",
signif(confInt0, 3), "\n", sep="")
msg <- paste(msg, "\toptimal 95%-confidence interval is: ",
signif(confInt, 3), "\n", sep="")
cat(msg)
##
## TMLE procedure
##
## Running the TMLE procedure
npvi <- tmle.npvi(obs, f=identity, flavor="learning", B=5e4, nMax=10)
## Summarizing its results
npvi
setConfLevel(npvi, 0.9)
npvi
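## Inspecting the convergence history of the procedure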
history <- getHistory(npvi)
print(round(history, 4))
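## Plotting the successive estimates of psi with pointwise confidence intervals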
hp <- history[, "psi"]
hs <- history[, "sic"]
hs[1] <- NA
ics <- c(-1,1) %*% t(qnorm(0.975)*hs/sqrt(nrow(getObs(npvi))))
pch <- 20
ylim <- range(c(confInt, hp, ics+hp), na.rm=TRUE)
xs <- (1:length(hs))-1
plot(xs, hp, ylim=ylim, pch=pch, xlab="Iteration", ylab=expression(psi[n]),
xaxp=c(0, length(hs)-1, length(hs)-1))
dummy <- sapply(seq(along=xs), function(x) lines(c(xs[x],xs[x]), hp[x]+ics[, x]))
abline(h=confInt, col=4)
abline(h=confInt0, col=2)