getSample: Generates Simulated Data
In tmle.npvi: Targeted Learning of a NP Importance of a Continuous Exposure

Description Usage Arguments Details Value Author(s) References Examples

Generates a run of simulated observations of the form (W,X,Y) to investigate the "effect" of X on Y taking W into account.

1
2
3

getSample(n, O, lambda0, p = rep(1/3, 3), omega = rep(1/2, 3), 
    Sigma1 = matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2), sigma2 = omega[2]/5, 
    Sigma3 = Sigma1, f = identity, verbose = FALSE)

`n`	An `integer`, the number of observations to be generated.
`O`	A 3x3 numeric `matrix` or `data.frame`. Rows are 3 baseline observations used as "class centers" for the simulation. Columns are: W, baseline covariate (e.g. DNA methylation), or "confounder" in a causal model. X, continuous exposure variable (e.g. DNA copy number), or "cause" in a causal model , with a reference value x_0 equal to `O[2, "X"]`. Y, outcome variable (e.g. gene expression level), or "effect" in a causal model.
`lambda0`	A `function` that encodes the relationship between W and Y in observations with levels of X equal to the reference value x_0.
`p`	A `vector` of length 3 whose entries sum to 1. Entry `p[k]` is the probability that each observation belong to class `k` with class center `O[k, ]`. Defaults to `rep(1/3, 3)`.
`omega`	A `vector` of length 3 with positive entries. Entry `omega[k]` is the standard deviation of W in class `k` (on a logit scale). Defaults to `rep(1/2, 3)`.
`Sigma1`	A 2x2 covariance `matrix` of the random vector (X,Y) in class `k=1`, assumed to be bivariate Gaussian with mean `O[1, c("X", "Y")]`. Defaults to `matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2)`.
`sigma2`	A positive `numeric`, the variance of the random variable Y in class `k=2`, assumed to be univariate Gaussian with mean `O[2, "Y"]`. Defaults to `omega[2]/5`.
`Sigma3`	A 2x2 covariance `matrix` of the random vector (X,Y) in class `k=3`, assumed to be bivariate Gaussian with mean `O[3, c("X", "Y")]`. Defaults to `Sigma1`.
`f`	A `function` involved in the definition of the parameter of interest ψ, which must satisfy f(0)=0 (see Details). Defaults to `identity`.
`verbose`	Prescribes the amount of information output by the function. Defaults to `FALSE`.

The parameter of interest is defined as ψ=Ψ(P) with

Ψ(P) = E_P[f(X-x_0) * (θ(X,W) - θ(x_0,W))] / E_P[f(X-x_0)^2],

with P the distribution of the random vector (W,X,Y), θ(X,W) = E_P[Y|X,W], x_0 the reference value for X, and f a user-supplied function such that f(0)=0 (e.g., f=identity, the default value). The value ψ is obtained using the known θ and the joint empirical distribution of (X,W) based on the same run of observations as in obs. Seeing W, X, Y as DNA methylation, DNA copy number and gene expression, respectively, the simulation scheme implements the following constraints:

There are two or three copy number classes: normal regions (k=2), and regions of copy number gains and/or losses (k=1 and/or k=3).
In normal regions, gene expression levels are negatively correlated with DNA methylation.
In regions of copy number alteration, copy number and expression are positively correlated.

Returns a list with the following tags:

`obs`	A `matrix` of `n` observations. The `c(W,X,Y)` colums of `obs` have the same interpretation as the columns of the input argument `O`.
`psi`	A `numeric`, approximated value of the true parameter ψ obtained using the known θ and the joint empirical distribution of (X,W) based on the same run of observations as in `obs`. The larger the sample size, the more accurate the approximation.
`g`	A `function`, the known positive conditional probability P(X=x_0\|W).
`mu`	A `function`, the known conditional expectation E_P(X\|W).
`muAux`	A `function`, the known conditional expectation E_P(X\|X\neq x_0, W).
`theta`	A `function`, the known conditional expectation E_P(Y\|X,W).
`theta0`	A `function`, the known conditional expectation E_P(Y\|X=x_0,W).
`sigma2`	A positive `numeric`, the known expectation E_P(f(X-x_0)^2).
`effIC`	A `function`, the known efficient influence curve of the functional Ψ at P, assuming that the reference value x_0=0.
`varIC`	A positive `numeric`, approximated value of the variance of the efficient influence curve of the functional Ψ at P and evaluated at O, obtained using the same run of observations as in `obs`. The larger the sample size, the more accurate the approximation.

Antoine Chambaz, Pierre Neuvial

Chambaz, A., Neuvial, P., & van der Laan, M. J. (2012). Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic journal of statistics, 6, 1059–1099.

## Parameters for the simulation (case 'f=identity')
O <- cbind(W=c(0.05218652, 0.01113460),
           X=c(2.722713, 9.362432),
           Y=c(-0.4569579, 1.2470822))
O <- rbind(NA, O)
lambda0 <- function(W) {-W}
p <- c(0, 1/2, 1/2)
omega <- c(0, 3, 3)
S <- matrix(c(10, 1, 1, 0.5), 2 ,2)

## Simulating a data set of 200 i.i.d. observations
sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
str(sim)

obs <- sim$obs
head(obs)
pairs(obs)

## Adding (dummy) baseline covariates
V <- matrix(runif(3*nrow(obs)), ncol=3)
colnames(V) <- paste("V", 1:3, sep="")
obsV <- cbind(V, obs)
head(obsV)

## True psi and confidence intervals (case 'f=identity')      
sim01 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
truePsi1 <- sim01$psi

confInt01 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(sim01$obs))
confInt1 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(obs))
msg <- "\nCase f=identity:\n"
msg <- c(msg, "\ttrue psi is: ", paste(signif(truePsi1, 3)), "\n")
msg <- c(msg, "\t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt01, 3)), "\n")
msg <- c(msg, "\toptimal 95%-confidence interval is: ", 
         paste(signif(confInt1, 3)), "\n")
cat(msg)

## True psi and confidence intervals (case 'f=atan')
f2 <- function(x) {1*atan(x/1)}
sim02 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S, f=f2);
truePsi2 <- sim02$psi;

confInt02 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(sim02$obs))
confInt2 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(obs))

msg <- "\nCase f=atan:\n"
msg <- c(msg, "\ttrue psi is: ", paste(signif(truePsi2, 3)), "\n")
msg <- c(msg, "\t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt02, 3)), "\n")
msg <- c(msg, "\toptimal 95%-confidence interval is: ", 
         paste(signif(confInt2, 3)), "\n")
cat(msg)