getSample: Generates Simulated Data

Description Usage Arguments Details Value Author(s) References Examples

Description

Generates a run of simulated observations of the form (W,X,Y) to investigate the "effect" of X on Y taking W into account.

Usage

1
2
3
getSample(n, O, lambda0, p = rep(1/3, 3), omega = rep(1/2, 3), 
    Sigma1 = matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2), sigma2 = omega[2]/5, 
    Sigma3 = Sigma1, f = identity, verbose = FALSE)

Arguments

n

An integer, the number of observations to be generated.

O

A 3x3 numeric matrix or data.frame. Rows are 3 baseline observations used as "class centers" for the simulation. Columns are:

  • W, baseline covariate (e.g. DNA methylation), or "confounder" in a causal model.

  • X, continuous exposure variable (e.g. DNA copy number), or "cause" in a causal model , with a reference value x_0 equal to O[2, "X"].

  • Y, outcome variable (e.g. gene expression level), or "effect" in a causal model.

lambda0

A function that encodes the relationship between W and Y in observations with levels of X equal to the reference value x_0.

p

A vector of length 3 whose entries sum to 1. Entry p[k] is the probability that each observation belong to class k with class center O[k, ]. Defaults to rep(1/3, 3).

omega

A vector of length 3 with positive entries. Entry omega[k] is the standard deviation of W in class k (on a logit scale). Defaults to rep(1/2, 3).

Sigma1

A 2x2 covariance matrix of the random vector (X,Y) in class k=1, assumed to be bivariate Gaussian with mean O[1, c("X", "Y")]. Defaults to matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2).

sigma2

A positive numeric, the variance of the random variable Y in class k=2, assumed to be univariate Gaussian with mean O[2, "Y"]. Defaults to omega[2]/5.

Sigma3

A 2x2 covariance matrix of the random vector (X,Y) in class k=3, assumed to be bivariate Gaussian with mean O[3, c("X", "Y")]. Defaults to Sigma1.

f

A function involved in the definition of the parameter of interest ψ, which must satisfy f(0)=0 (see Details). Defaults to identity.

verbose

Prescribes the amount of information output by the function. Defaults to FALSE.

Details

The parameter of interest is defined as ψ=Ψ(P) with

Ψ(P) = E_P[f(X-x_0) * (θ(X,W) - θ(x_0,W))] / E_P[f(X-x_0)^2],

with P the distribution of the random vector (W,X,Y), θ(X,W) = E_P[Y|X,W], x_0 the reference value for X, and f a user-supplied function such that f(0)=0 (e.g., f=identity, the default value). The value ψ is obtained using the known θ and the joint empirical distribution of (X,W) based on the same run of observations as in obs. Seeing W, X, Y as DNA methylation, DNA copy number and gene expression, respectively, the simulation scheme implements the following constraints:

Value

Returns a list with the following tags:

obs

A matrix of n observations. The c(W,X,Y) colums of obs have the same interpretation as the columns of the input argument O.

psi

A numeric, approximated value of the true parameter ψ obtained using the known θ and the joint empirical distribution of (X,W) based on the same run of observations as in obs. The larger the sample size, the more accurate the approximation.

g

A function, the known positive conditional probability P(X=x_0|W).

mu

A function, the known conditional expectation E_P(X|W).

muAux

A function, the known conditional expectation E_P(X|X\neq x_0, W).

theta

A function, the known conditional expectation E_P(Y|X,W).

theta0

A function, the known conditional expectation E_P(Y|X=x_0,W).

sigma2

A positive numeric, the known expectation E_P(f(X-x_0)^2).

effIC

A function, the known efficient influence curve of the functional Ψ at P, assuming that the reference value x_0=0.

varIC

A positive numeric, approximated value of the variance of the efficient influence curve of the functional Ψ at P and evaluated at O, obtained using the same run of observations as in obs. The larger the sample size, the more accurate the approximation.

Author(s)

Antoine Chambaz, Pierre Neuvial

References

Chambaz, A., Neuvial, P., & van der Laan, M. J. (2012). Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic journal of statistics, 6, 1059–1099.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
## Parameters for the simulation (case 'f=identity')
O <- cbind(W=c(0.05218652, 0.01113460),
           X=c(2.722713, 9.362432),
           Y=c(-0.4569579, 1.2470822))
O <- rbind(NA, O)
lambda0 <- function(W) {-W}
p <- c(0, 1/2, 1/2)
omega <- c(0, 3, 3)
S <- matrix(c(10, 1, 1, 0.5), 2 ,2)

## Simulating a data set of 200 i.i.d. observations
sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
str(sim)

obs <- sim$obs
head(obs)
pairs(obs)

## Adding (dummy) baseline covariates
V <- matrix(runif(3*nrow(obs)), ncol=3)
colnames(V) <- paste("V", 1:3, sep="")
obsV <- cbind(V, obs)
head(obsV)

## True psi and confidence intervals (case 'f=identity')      
sim01 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
truePsi1 <- sim01$psi

confInt01 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(sim01$obs))
confInt1 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(obs))
msg <- "\nCase f=identity:\n"
msg <- c(msg, "\ttrue psi is: ", paste(signif(truePsi1, 3)), "\n")
msg <- c(msg, "\t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt01, 3)), "\n")
msg <- c(msg, "\toptimal 95%-confidence interval is: ", 
         paste(signif(confInt1, 3)), "\n")
cat(msg)

## True psi and confidence intervals (case 'f=atan')
f2 <- function(x) {1*atan(x/1)}
sim02 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S, f=f2);
truePsi2 <- sim02$psi;

confInt02 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(sim02$obs))
confInt2 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(obs))

msg <- "\nCase f=atan:\n"
msg <- c(msg, "\ttrue psi is: ", paste(signif(truePsi2, 3)), "\n")
msg <- c(msg, "\t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt02, 3)), "\n")
msg <- c(msg, "\toptimal 95%-confidence interval is: ", 
         paste(signif(confInt2, 3)), "\n")
cat(msg)

tmle.npvi documentation built on May 1, 2019, 6:50 p.m.