node_mixture: Generate Data from a Mixture of Node Definitions
In simDAG: Simulate Data from a DAG and Associated Node Information

node_mixture

R Documentation

Description

This node type allows users to apply different node definitions to different subsets of the already generated data, making it possible to generate data from arbitrary mixture distributions. It is similar to node_conditional_distr and node_conditional_prob. The main difference is that those two functions only allow univariate distributions conditional on categorical variables, while this function allows any kind of node definition and any kind of condition. This makes it possible, for example, to generate a variable from different regression models for different subsets of simulated individuals.

Usage

node_mixture(data, parents, name, distr, default=NA)

Arguments

`data`	A `data.table` (or something that can be coerced to a `data.table`) containing all columns specified by `parents`.
`parents`	A character vector specifying the names of the parents that this particular child node has. This vector should include all nodes that are used in the conditions and the `node` calls specified in `distr`.
`name`	A single character string specifying the name of the node.
`distr`	An unnamed list that specifies both the conditions and the `node` definitions. It should be specified in a similar way as in the `fcase` function, in pairs of conditions (coded as strings) and `node` definitions. This means that a condition comes first, for example `"A==0"`, followed by a call to `node`, then the next condition, and so on. Any number of such pairs is allowed, with no restrictions on what can be specified in the `node` calls. The `name` argument has to be specified in all `node` calls, but it does not matter which value is used, because it is ignored in further processing. Currently only time-fixed nodes defined using the `node` function are supported, not time-dependent nodes defined using the `node_td` function. See examples.
`default`	A single value of some kind, used as a default value for those individuals not covered by any of the conditions defined in `distr`. Defaults to `NA`.
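
As a minimal sketch of how `default` might be used, the following DAG defines a condition for only one group and sets `default=0` for everyone else. It assumes that `default` can be passed through the `node` call in the same way as `distr`; the seed and the value 0 are arbitrary choices made for illustration.

```r
library(simDAG)

set.seed(2025)

# only individuals with A == 0 are covered by a condition;
# everyone else should receive the default value of 0 instead of NA
dag <- empty_dag() +
  node("A", type="rbernoulli") +
  node("B", type="rnorm") +
  node("Y", type="mixture", parents=c("A", "B"),
       distr=list(
         "A==0", node(".", type="gaussian", formula= ~ -2 + B*2, error=1)
       ),
       default=0)
data <- sim_from_dag(dag, n_sim=100)

# inspect the values of Y among individuals not covered by the condition
table(data$Y[data$A == 1])
```

Leaving `default` at `NA` instead makes uncovered individuals easy to spot afterwards, for example with `is.na(data$Y)`.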

Details

Internally, the data is generated by extracting only the part of the already generated data that satisfies a condition and using the corresponding node definition to generate the new values for that part. The conditions are processed in the order in which they appear in distr, meaning that the first condition is handled first and so on. There are no safeguards to guarantee that the conditions do not overlap. For example, users are free to set the first condition to something like A > 10 and the next one to A > 11, in which case the value for every individual with A > 11 is generated twice (first with the first specification, then with the second). In this case, only the last generated value is retained.
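
The sequential logic described above can be illustrated with a small base R sketch. The helper `mixture_sketch` and its `generators` argument are hypothetical and not part of simDAG; they only mimic the order of operations (fill with the default, then overwrite condition by condition), not the actual implementation.

```r
# Hypothetical helper, NOT part of simDAG: mimics the sequential logic only.
mixture_sketch <- function(data, conditions, generators, default = NA) {
  # everyone starts with the default value
  out <- rep(default, nrow(data))
  for (i in seq_along(conditions)) {
    # evaluate the condition string using the columns of data
    hit <- eval(parse(text = conditions[[i]]), envir = data)
    hit <- !is.na(hit) & hit
    # overwrite whatever earlier conditions produced for these rows
    out[hit] <- generators[[i]](data[hit, , drop = FALSE])
  }
  out
}

data <- data.frame(A = c(9, 10.5, 11.5, 12))

# overlapping conditions: rows with A > 11 match both of them
y <- mixture_sketch(
  data,
  conditions = list("A > 10", "A > 11"),
  generators = list(
    function(d) rep(1, nrow(d)),  # stands in for the first specification
    function(d) rep(2, nrow(d))   # stands in for the second specification
  )
)
y
# NA  1  2  2
```

The first row matches no condition and keeps the default, the second row matches only the first condition, and the last two rows end up with the value from the second condition because it is applied last.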

Note that it is also possible to use the mixture node itself inside the conditions or node calls in distr, because it is directly added to the data before the first condition is applied (by setting everyone to the default value). See examples.

Additionally, because the outputs of the individual parts of the mixture are combined into a single vector, they might be coerced from one class to another, depending on the input to distr and the order used. Users need to take care of this themselves, for example by making sure that all node definitions in distr return values of the same class.
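
The kind of coercion meant here is ordinary R vector behaviour and can be reproduced without simDAG. The following base R sketch is only an illustration of that behaviour, with made-up values, and does not call any package code.

```r
# start with a vector filled with the default value NA (class "logical")
y <- rep(NA, 4)
class(y)

# assigning numeric values to one part turns the whole vector numeric
y[1:2] <- c(1.5, 2.5)
class(y)

# assigning character values to another part turns everything into character,
# including the numbers generated before
y[3:4] <- c("low", "high")
class(y)
y
# "1.5"  "2.5"  "low"  "high"
```

If the parts of a mixture are meant to be numeric, it is therefore worth checking the class of the resulting column after simulation.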

Value

Returns a vector of length nrow(data). The class of the vector is determined by what is specified in distr.

Author(s)

Robin Denz

Examples

library(simDAG)

set.seed(1234)

## different linear regression models per level of a different covariate
# here, A is the group that is used for the conditioning, B is a predictor
# and Y is the mixture distributed outcome
dag <- empty_dag() +
  node("A", type="rbernoulli") +
  node("B", type="rnorm") +
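  # parents lists the condition variable (A) and the predictor used in the
  # node calls (B), as described for the parents argument
  node("Y", type="mixture", parents=c("A", "B"),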
       distr=list(
         "A==0", node(".", type="gaussian", formula= ~ -2 + B*2, error=1),
         "A==1", node(".", type="gaussian", formula= ~ 3 + B*5, error=1)
       ))
data <- sim_from_dag(dag, n_sim=100)
head(data)

# conditions may also combine multiple variables
# individuals with A==0 & C==0 match no condition and receive the default (NA)
dag <- empty_dag() +
  node(c("A", "C"), type="rbernoulli") +
  node("B", type="rnorm") +
  node("Y", type="mixture", parents=c("A", "C"),
    distr=list(
      "A==0 & C==1", node(".", type="gaussian", formula= ~ -2 + B*2, error=1),
      "A==1", node(".", type="gaussian", formula= ~ 3 + B*5, error=1)
    ))
data <- sim_from_dag(dag, n_sim=100)
head(data)

# using the mixture node itself in the condition
# see cookbook vignette, section on outliers for more info
dag <- empty_dag() +
  node(c("A", "B", "C"), type="rnorm") +
  node("Y", type="mixture", parents=c("A", "B", "C"),
       distr=list(
         "TRUE", node(".", type="gaussian", formula= ~ -2 + A*0.1 + B*1 + C*-2,
                      error=1),
         "Y > 2", node(".", type="rnorm", mean=10000, sd=500)
       ))
data <- sim_from_dag(dag, n_sim=100)
head(data)

simDAG documentation built on June 25, 2025, 1:07 a.m.