node: Create Node Object(s)

Description Usage Arguments Value Details Node formulas (parameters of the distribution) Different types of evaluation for node function arguments Multivariate random variables (multivariate nodes) References Examples

View source: R/node.R

Description

Define a single DAG node and its distribution or define many repeated-measure/time-varying nodes by using argument t. The node distribution is allowed to vary as a function of time (t). Conditionaing on past nodes is accomplished by using the syntactic sugar, such as, NodeName[t]. After all the nodes have been added to the DAG, call set.DAG, a DAG object constructor, and add.action, an action (intervention) constructor.

Usage

1
node(name, t, distr, EFU, order, ..., params = list(), asis.params = list())

Arguments

name

Character node name or a vector of names when specifying a multivariate node. For time-dependent nodes the names will be automatically expanded to a scheme "name_t" for each t provided specified.

t

Node time-point(s). Allows specification of several time-points when t is a vector of positive integers, in which case the output will consist of a named list of length(t) nodes, corresponding to each value in t.

distr

Character name of the node distribution, can be a standard distribution R function, s.a. rnorm, rbinom, runif or user defined. The function must accept a named argument "n" to specify the total sample size. Distributional parameters (arguments) must be passed as either named arguments to node or as a named list of parameters "params".

EFU

End-of-Follow Up flag for designating a survival/censoring type node, only applies to Bernoulli nodes. When EFU=TRUE this node becomes an indicator for the end of follow-up event (censoring, end of study, death, etc). When simulated variable with this node distribution evaluates to value 1 subsequent nodes with higher temporal order values will be set to NA by default (or imputed with carry forward imputation, depending on the settings of the sim function). This can only be set to TRUE and should be omitted otherwise.

order

An optional integer parameter specifying the order in which these nodes will be sampled. The value of order has to start at 1 and be unique for each new node, can be specified as a range / vector and has to be of the same length as the argument t above. When order is left unspecified it will be automatically inferred based on the order in which the node(s) were added in relation to other nodes. See Examples and Details below.

...

Named arguments specifying distribution parameters that are accepted by the distr function. The parameters can be R expressions that are themselves formulas of the past node names.

params

A list of additional named parameters to be passed on to the distr function. The parameters have to be either constants or character strings of R expressions of the past node names.

asis.params

(ADVANCED USE) A list of additional named distributional parameters that will be evaluated "as is", inside the currently simulated data.frame + the calling environment, without any modifications to the R expression strings inside the asis.params list. There is no error-checking for existing node names and no parent node name extraction (the arrows from parents will not appear in plotDAG). Time varying nodes should be referenced by their names as they appear in the simulated data, as in "TVar_t". See details and examples 7 and 8 below.

Value

A list containing node object(s) (expanded to several nodes if t is an integer vector of length > 1)

Details

The combination of name and t must uniquely identify each node in the DAG. Use argument t to identify measurements of the same attribute (e.g. 'A1c') at various time points. The collection of all unique t values, across all nodes, should consist of non-negative integers (i.e., starting at 0).

The optional order argument can be specified, used for determining the sampling order of each node. When order not specified, it is automatically inferred based on the actual order in which the nodes were added to the DAG (earlier added nodes get lower order value and are sampled first)

All node calls that share the same generic name name must also share the same EFU value (if any is specified in at least one of them). A value of TRUE for the EFU indicates that if a simulated value for a measurement of the attribute represented by node is 1 then all the following nodes with that measurement (in terms of higher t values) in the DAG will be unobserved (i.e., their simulated value will be set to NA).

Node formulas (parameters of the distribution)

Each user-supplied argument to the node function is an evaluable R expression, their evaluation is delayed until the actual simulation time. These arguments can refer to standard or user-specified R functions that must only apply to the values of parent nodes, i.e. a subset of the node(s) with an order value strictly lower than that of the node characterized by the formula. Formulas must reference the parent nodes with unique name identifiers, employing the square bracket vector subsetting name[t] for referencing a parent node at a particular time point t (if any time-points were specified). The square bracket notation is used to index a generic name with the relevant time point as illustrated in the examples. When an input node is used to define several nodes (i.e., several measurement of the same attribute, t=0:5), the formula(s) specified in that node can apply to each node indexed by a given time point denoted by t. This generic expression t can then be referenced within a formula to simultaneously identify a different set of parent nodes for each time point as illustrated below. Note that the parents of each node represented by a given node object are implicitly defined by the nodes referenced in formulas of that node call.

Different types of evaluation for node function arguments

There is quite a bit of flexibility in the way in which the node function arguments can be evaluated. By default, the named arguments specified as expressions are first captured by delayed-evaluation and then modified by simcausal to enable the special types of functional syntax. For example, simcausal will over-ride the subsetting operators '[...]' (for time varying nodes) and '[[...]]' (for networks), implying that these operators can no longer be used in their typical R way. Furthermore, simcausal will over-ride the standard 'c' function, with its own definition. Similarly, it will over-ride any calls to sum and mean functions with their row-matrix counterpart functions rowSums and rowMeans. When programming with simcausal (such as passing node arguments inside a function, prior to defining the node), it may be helpful to instead pass such node arguments as character strings, rather than as R expressions. In this case one should use the argument params by adopting the following syntax node(...,params = list(mean="A+B")), which in this case is equivalent to: node(..., mean = A+B).

There are also instances when it might be desirable to retain the original behavior of all R expressions and functions and evaluate a particular node argument "as is". For example, the user may wish to retain the original R functionality of all its operators, including those of [...] and [[...]]. In this case the node argument (or a specific part of the node argument) should be wrapped in .() or eval(). Note that once the expression has been wrapped with .(...) (or eval(...)), the simcausal definitions of operators [...] and [[...]] no longer apply to these expressions and no error checking for "correctness" of these node arguments will be performed.

The forced-evaluation operator .() can be also used as part of an expression, which will prevent the typical simcausal evaluation on only that specific part of the expression. Example 8 below demonstrates the following use case for the expression .(coefAi[t]) * A[t-1], which will look for vector coefAi and then subset it by current value of t (and return a scalar), while A[t-1] will evaluate to the entire column vector of variable A for time point t-1. Such an expression will multiply the entire time-varying vector A[t-1] by scalar value determined by current value of t and the previously defined vector coefAi.

Furthermore, even when a vector or a matrix is wrapped in .(...) it still will be automatically re-parsed into K column matrix with n rows. When this is not desired, for example, when defining a multivariate node distribution, the user may pass such vector or matrix node arguments as a character string in a list argument asis.params. See Example 7 and 8 below for additional details.

Multivariate random variables (multivariate nodes)

Starting from v.0.5, a single node call can be used for defining a multivariate (and possibly correlated) random vector. To define a random vector that has more than 1 dimension, use the argument name to specify a vector with names for each dimension, e.g.,

node(c("X1","X2"), distr = "mvtnorm::rmvnorm", asis.params = list(mean = "c(0,1)", sigma = "matrix(c(1,0.75,0.75,1), ncol=2)"))

will define a bi-variate (correlated) normally distributed node, the simulated data set will contain this bi-variately distributed random variable in columns "X1" and "X2". Note that the multivariate sampling distribution function (such as function rmvnorm from the package mvtnorm) must return a matrix of n rows (number of observations) and length(name) columns (dimensionality). See additional examples below.

Note that one can also define time-varying multivariate nodes, e.g.,

node(c("X1","X2"), t=0:5, distr = "mvtnorm::rmvnorm", asis.params = list(mean = "c(0,1)")).

References

Sofrygin O, van der Laan MJ, Neugebauer R (2017). "simcausal R Package: Conducting Transparent and Reproducible Simulation Studies of Causal Effect Estimation with Complex Longitudinal Data." Journal of Statistical Software, 81(2), 1-47. doi: 10.18637/jss.v081.i02.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
#---------------------------------------------------------------------------------------
# EXAMPLE 1A: Define some Bernoulli nodes, survival outcome Y and put it together in a
# DAG object
#---------------------------------------------------------------------------------------
W1node <- node(name = "W1", distr = "rbern",
	prob = plogis(-0.5), order = 1)
W2node <- node(name = "W2", distr = "rbern",
	prob = plogis(-0.5 + 0.5 * W1), order = 2)
Anode <- node(name = "A", distr = "rbern",
	prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2), order = 3)
Ynode <- node(name = "Y", distr = "rbern",
	prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2), order = 4)
D1Aset <- set.DAG(c(W1node,W2node,Anode,Ynode))

#---------------------------------------------------------------------------------------
# EXAMPLE 1B: Same as 1A using +node interface and no order argument
#---------------------------------------------------------------------------------------
D1B <- DAG.empty()
D1B <- D1B +
  node(name = "W1", distr = "rbern", prob = plogis(-0.5)) +
  node(name = "W2", distr = "rbern", prob = plogis(-0.5 + 0.5 * W1)) +
  node(name = "A", distr = "rbern", prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2)) +
  node(name = "Y", distr = "rbern", prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2))
D1Bset <- set.DAG(D1B)

#---------------------------------------------------------------------------------------
# EXAMPLE 1C: Same as 1A and 1B using add.nodes interface and no order argument
#---------------------------------------------------------------------------------------
D1C <- DAG.empty()
D1C <- add.nodes(D1C, node(name = "W1", distr = "rbern", prob = plogis(-0.5)))
D1C <- add.nodes(D1C, node(name = "W2", distr = "rbern",
	prob = plogis(-0.5 + 0.5 * W1)))
D1C <- add.nodes(D1C, node(name = "A", distr = "rbern",
	prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2)))
D1C <- add.nodes(D1C, node(name = "Y", distr = "rbern",
	prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2)))
D1C <- set.DAG(D1C)

#---------------------------------------------------------------------------------------
# EXAMPLE 1D: Add a uniformly distributed node and redefine outcome Y as categorical
#---------------------------------------------------------------------------------------
D_unif <- DAG.empty()
D_unif <- D_unif +
 node("W1", distr = "rbern", prob = plogis(-0.5)) +
 node("W2", distr = "rbern", prob = plogis(-0.5 + 0.5 * W1)) +
 node("W3", distr = "runif", min = plogis(-0.5 + 0.7 * W1 + 0.3 * W2), max = 10) +
 node("An", distr = "rbern", prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2 - 0.2 * sin(W3)))
# Categorical syntax 1 (probabilities as values):
D_cat_1 <- D_unif + node("Y", distr = "rcat.b1", probs = {0.3; 0.4})
D_cat_1 <- set.DAG(D_cat_1)
# Categorical syntax 2 (probabilities as formulas):
D_cat_2 <- D_unif +
node("Y", distr = "rcat.b1",
	probs={plogis(-0.1 + 1.2 * An + 0.3 * W1 + 0.3 * W2 + 0.2 * cos(W3));
			   plogis(-0.5 + 0.7 * W1)})
D_cat_2 <- set.DAG(D_cat_2)

#---------------------------------------------------------------------------------------
# EXAMPLE 2A: Define Bernoulli nodes using R rbinom() function, defining prob argument
# for L2 as a function of node L1
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
node("L1", t = 0, distr = "rbinom", prob = 0.05, size = 1) +
node("L2", t = 0, distr = "rbinom", prob = ifelse(L1[0] == 1, 0.5, 0.1), size = 1)
Dset <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 2B: Equivalent to 2A, passing argument size to rbinom inside a named list
# params
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
node("L1", t = 0, distr = "rbinom", prob = 0.05, params = list(size = 1)) +
node("L2", t = 0, distr = "rbinom",
	prob = ifelse(L1[0] == 1,0.5,0.1), params = list(size = 1))
Dset <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 2C: Equivalent to 2A and 2B, define Bernoulli nodes using a wrapper "rbern"
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
node("L1", t = 0, distr = "rbern", prob = 0.05) +
node("L2", t = 0, distr = "rbern", prob = ifelse(L1[0] == 1, 0.5, 0.1))
Dset <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 3: Define node with normal distribution using rnorm() R function
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D + node("L2", t = 0, distr = "rnorm", mean = 10, sd = 5)
Dset <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 4: Define 34 Bernoulli nodes, or 2 Bernoulli nodes over 17 time points,
#---------------------------------------------------------------------------------------
t_end <- 16
D <- DAG.empty()
D <- D +
node("L2", t = 0:t_end, distr = "rbinom",
	prob = ifelse(t == t_end, 0.5, 0.1), size = 1) +
node("L1", t = 0:t_end, distr = "rbinom",
	prob = ifelse(L2[0] == 1, 0.5, 0.1), size = 1)
Dset <- set.DAG(D)
sim(Dset, n=10)

#---------------------------------------------------------------------------------------
# EXAMPLE 5: Defining new distribution function 'rbern', defining and passing a custom
# vectorized node function 'customfun'
#---------------------------------------------------------------------------------------
rbern <- function(n, prob) { # defining a bernoulli wrapper based on R rbinom function
  rbinom(n = n, prob = prob, size = 1)
}
customfun <- function(arg, lambda) {
  res <- ifelse(arg == 1, lambda, 0.1)
  res
}
D <- DAG.empty()
D <- D +
node("W1", distr = "rbern", prob = 0.05) +
node("W2", distr = "rbern", prob = customfun(W1, 0.5)) +
node("W3", distr = "rbern", prob = ifelse(W1 == 1, 0.5, 0.1))
D1d <- set.DAG(D, vecfun = c("customfun"))
sim(D1d, n = 10, rndseed = 1)

#---------------------------------------------------------------------------------------
# EXAMPLE 6: Defining latent variables I and U.Y (will be hidden from simulated data)
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
  node("I",
    distr = "rcat.b1",
    probs = c(0.1, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1)) +
  node("W1",
    distr = "rnorm",
    mean = ifelse(I == 1, 0, ifelse(I == 2, 3, 10)) + 0.6 * I,
    sd = 1) +
  node("W2",
    distr = "runif",
    min = 0.025*I, max = 0.7*I) +
  node("W3",
    distr = "rbern",
    prob = plogis(-0.5 + 0.7*W1 + 0.3*W2 - 0.2*I)) +
  node("A",
    distr = "rbern",
    prob = plogis(+4.2 - 0.5*W1 + 0.2*W2/2 + 0.2*W3)) +
	node("U.Y", distr = "rnorm", mean = 0, sd = 1) +
  node("Y",
    distr = "rconst",
    const = -0.5 + 1.2*A + 0.1*W1 + 0.3*W2 + 0.2*W3 + 0.2*I + U.Y)
Dset1 <- set.DAG(D, latent.v = c("I", "U.Y"))
sim(Dset1, n = 10, rndseed = 1)

#---------------------------------------------------------------------------------------
# EXAMPLE 7: Multivariate random variables
#---------------------------------------------------------------------------------------
if (requireNamespace("mvtnorm", quietly = TRUE)) {
  D <- DAG.empty()
  # 2 dimensional normal (uncorrelated), using rmvnorm function from rmvnorm package:
  D <- D +
    node(c("X1","X2"), distr = "mvtnorm::rmvnorm",
      asis.params = list(mean = "c(0,1)")) +
  # Can define a degenerate (rconst) multivariate node:
    node(c("X1.copy", "X2.copy"), distr = "rconst", const = c(X1, X2))
  Dset1 <- set.DAG(D, verbose = TRUE)
  sim(Dset1, n = 10)
}

# On the other hand this syntax wont work,
# since simcausal will parse c(0,1) into a two column matrix:
## Not run: 
D <- DAG.empty()
D <- D + node(c("X1","X2"), distr = "mvtnorm::rmvnorm", mean = c(0,1))
Dset1 <- set.DAG(D, verbose = TRUE)

## End(Not run)

if (requireNamespace("mvtnorm", quietly = TRUE)) {
  D <- DAG.empty()
  # Bivariate normal (correlation coef 0.75):
  D <- D +
    node(c("X1","X2"), distr = "mvtnorm::rmvnorm",
      asis.params = list(mean = "c(0,1)",
        sigma = "matrix(c(1,0.75,0.75,1), ncol=2)")) +
  # Can use any component of such multivariate nodes when defining new nodes:
    node("A", distr = "rconst", const = 1 - X1)
  Dset2 <- set.DAG(D, verbose = TRUE)
  sim(Dset2, n = 10)
}

# Time-varying multivar node (3 time-points, 2 dimensional normal)
# plus changing the mean over time, as as function of t:
if (requireNamespace("mvtnorm", quietly = TRUE)) {
  D <- DAG.empty()
  D <- D +
    node(c("X1", "X2"), t = 0:2, distr = "mvtnorm::rmvnorm",
      asis.params = list(
        mean = "c(0,1) + t",
        sigma = "matrix(rep(0.75,4), ncol=2)"))
  Dset5b <- set.DAG(D)
  sim(Dset5b, n = 10)
}

# Two ways to define the same bivariate uniform copula:
if (requireNamespace("copula", quietly = TRUE)) {
  D <- DAG.empty()
  D <- D +
  # with a warning since normalCopula() returns an object unknown to simcausal:
    node(c("X1","X2"), distr = "copula::rCopula",
      copula = eval(copula::normalCopula(0.75, dim = 2))) +
  # same, as above:
    node(c("X3","X4"), distr = "copula::rCopula",
      asis.params = list(copula = "copula::normalCopula(0.75, dim = 2)"))
  vecfun.add("qbinom")
  # Bivariate binomial derived from previous copula, with same correlation:
  D <- D +
    node(c("A.Bin1", "A.Bin2"), distr = "rconst",
      const = c(qbinom(X1, 10, 0.5), qbinom(X2, 15, 0.7)))
  Dset3 <- set.DAG(D)
  sim(Dset3, n = 10)
}

# Same as "A.Bin1" and "A.Bin2", but directly using rmvbin function in bindata package:
if (requireNamespace("bindata", quietly = TRUE)) {
  D <- DAG.empty()
  D <- D +
    node(c("B.Bin1","B.Bin2"), distr = "bindata::rmvbin",
      asis.params = list(
        margprob = "c(0.5, 0.5)",
        bincorr = "matrix(c(1,0.75,0.75,1), ncol=2)"))
  Dset4b <- set.DAG(D)
  sim(Dset4b, n = 10)
}

#---------------------------------------------------------------------------------------
# EXAMPLE 8: Combining simcausal non-standard evaluation with eval() forced evaluation
#---------------------------------------------------------------------------------------
coefAi <- 1:10
D <- DAG.empty()
D <- D +
  node("A", t = 1, distr = "rbern", prob = 0.7) +
  node("A", t = 2:10, distr = "rconst", const = eval(coefAi[t]) * A[t-1])
Dset8 <- set.DAG(D)
sim(Dset8, n = 10)

#---------------------------------------------------------------------------------------
# TWO equivalent ways of creating a multivariate node (combining nodes W1 and W2):
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D + node("W1", distr = "rbern", prob = 0.45)
D <- D + node("W2", distr = "rconst", const = 1)

# option 1:
D <- D + node(c("W1.copy1", "W2.copy1"), distr = "rconst", const = c(W1, W2))

# equivalent option 2:
create_mat <- function(W1, W2) cbind(W1, W2)
vecfun.add("create_mat")
D <- D + node(c("W1.copy2", "W2.copy2"), distr = "rconst", const = create_mat(W1, W2))

Dset <- set.DAG(D)
sim(Dset, n=10, rndseed=1)

simcausal documentation built on Oct. 9, 2017, 1:03 a.m.