In this small vignette, we give a detailed explanation of how to define custom functions that can be used in the type argument of node() or node_td() calls. Although simDAG includes a large number of different node types that can be used in this argument directly, it also allows the user to pass any function to this argument, as long as that function meets some limited criteria (as described below). This is an advanced feature that most users probably don't need for standard simulation studies. We strongly recommend reading the documentation and the other vignettes first, because this vignette assumes that the reader is already familiar with the simDAG syntax and general features.
The support for custom functions in type allows users to create root nodes, child nodes or time-dependent nodes that are not directly implemented in this package. By doing so, users may create data with any functional dependence they can think of. The requirements for each node type are listed below, and each section includes some simple examples. If you think that your custom node type might be useful to others, please contact the maintainer of this package via the supplied e-mail address or GitHub and we might add it to this package.
library(simDAG)
set.seed(1234)
Any function that generates a vector of size n, with n == nrow(data), or a data.frame() with as many rows as the current data can be used as a root node. The only requirement is:
1. The function must have an argument called n, which controls how many samples to generate.
Some examples that are already implemented in R outside of this package are stats::rnorm(), stats::rgamma() and stats::rbeta(). The function may take any number of further arguments, which will be passed through the three-dot (...) syntax. Note that whenever the supplied function produces a data.frame() (or similar object), the user has to ensure that the included columns are named properly.
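Before passing an existing function as a root node, it can be useful to confirm that it actually has an argument called n. The following is a minimal sketch in base R; the helper has_n_argument() is hypothetical and not part of simDAG.

```r
# Hypothetical helper (not part of simDAG): checks whether a function,
# given directly or by name, has an argument called "n".
has_n_argument <- function(fun) {
  fun <- match.fun(fun)
  # formals() returns NULL for primitive functions, in which case
  # the check below simply evaluates to FALSE
  "n" %in% names(formals(fun))
}

has_n_argument(stats::rgamma)  # TRUE
has_n_argument("rbeta")        # TRUE
has_n_argument(mean)           # FALSE, mean() has no argument called n
```

The check only looks at the argument name. It does not verify that the function returns a vector of length n, so it should be treated as a quick sanity check only.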
Functions that fulfill this requirement and are already defined by some other package can be used this way:
dag <- empty_dag() +
  node("A", type="rgamma", shape=0.1, rate=2) +
  node("B", type="rbeta", shape1=2, shape2=0.3)
Of course users may also define an appropriate root node function themselves. The code below defines a function that takes the sum of a normally distributed random number and a uniformly distributed random number for each simulated individual:
custom_root <- function(n, min=0, max=1, mean=0, sd=1) {
  out <- runif(n, min=min, max=max) + rnorm(n, mean=mean, sd=sd)
  return(out)
}

# the function may be supplied as a string
dag <- empty_dag() +
  node("A", type="custom_root", min=0, max=10, mean=5, sd=2)

# equivalently, the function can also be supplied directly
# This is the recommended way!
dag <- empty_dag() +
  node("A", type=custom_root, min=0, max=10, mean=5, sd=2)

data <- sim_from_dag(dag=dag, n_sim=100)
head(data)
Again, almost any function may be used to generate a child node. Only four things are required for this to work properly:
1. Its name should start with node_ (if you want to use a string to define it in type).
2. It should contain an argument called data (contains the already generated data).
3. It should contain an argument called parents (contains a vector of the child node's parents).
4. It should return either a vector of length n_sim or a data.frame() (or similar object) with any number of columns and n_sim rows.
The function may include any number of additional arguments specified by the user.
Below we define a custom child node type that is basically just a Gaussian node with some (badly done) truncation, limiting the range of the resulting variable to lie between left and right.
node_gaussian_trunc <- function(data, parents, betas, intercept, error,
                                left, right) {
  out <- node_gaussian(data=data, parents=parents, betas=betas,
                       intercept=intercept, error=error)
  out <- ifelse(out <= left, left,
                ifelse(out >= right, right, out))
  return(out)
}
Please note that this is a terrible form of truncation in most cases, because it artificially distorts the resulting normal distribution at the left and right values. It is only meant as an illustration. Here is another example of a custom child node function, which simply returns the sum of its parents:
parents_sum <- function(data, parents, betas=NULL) {
  out <- rowSums(data[, parents, with=FALSE])
  return(out)
}
We can use both of these functions in a DAG like this:
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("custom_1", type="gaussian_trunc", parents=c("sex", "age"),
       betas=c(1.1, 0.4), intercept=-2, error=2, left=10, right=25) +
  node("custom_2", type=parents_sum, parents=c("age", "custom_1"))

data <- sim_from_dag(dag=dag, n_sim=100)
head(data)
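Because a custom child node is an ordinary R function, it can be checked by hand on a small toy data set before it is used in a DAG. The sketch below assumes a hypothetical node type, node_weighted_sum(), with hypothetical arguments weights and sd; none of these names come from simDAG.

```r
# Hypothetical custom child node: weighted sum of the parents plus
# optional normally distributed noise.
node_weighted_sum <- function(data, parents, weights, sd=0) {
  # as.data.frame() makes the column selection behave the same way for
  # data.frame and data.table inputs
  x <- as.matrix(as.data.frame(data)[, parents, drop=FALSE])
  out <- as.vector(x %*% weights) + rnorm(n=nrow(x), mean=0, sd=sd)
  return(out)
}

# quick manual check on toy data; with sd=0 the result is deterministic
toy <- data.frame(age=c(40, 50, 60), sex=c(0, 1, 1))
res <- node_weighted_sum(toy, parents=c("age", "sex"), weights=c(0.5, 2))
res  # 20 27 32
```

Following the pattern shown above, such a function could then be passed to type directly, for example node("custom_3", type=node_weighted_sum, parents=c("age", "sex"), weights=c(0.5, 2)).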
By time-dependent nodes we mean nodes that are created using the node_td() function. In general, this works in essentially the same way as for simple root nodes or child nodes. The requirements are:
1. Its name should start with node_ (if you want to use a string to define it in type).
2. It should contain an argument called data (contains the already generated data).
3. If it is a child node, it should contain an argument called parents (contains a vector of the child node's parents). This is not necessary for nodes that are independently generated.
4. It should return either a vector of length n_sim or a data.frame() (or similar object) with any number of columns and n_sim rows.
Again, any number of additional arguments is allowed and will be passed through the three-dot syntax. Additionally, there are two built-in arguments that users may specify in custom time-dependent nodes, which are then used internally. First, users may add an argument called sim_time to the function. If it is included in the function definition, the current time of the simulation will be passed to the function on every call made to it. Second, the argument past_states may be added. In that case, a list containing all previous states of the simulation (as saved using the save_states argument of the sim_discrete_time() function) will be passed to it internally, giving the user access to the data generated at previous points in time.
An example for a custom time-dependent root node is given below:
node_custom_root_td <- function(data, n, mean=0, sd=1) {
  return(rnorm(n=n, mean=mean, sd=sd))
}
This function simply draws a new value from a normal distribution at each point in time of the simulation. A DAG using this node type could look like this:
n_sim <- 100

dag <- empty_dag() +
  node_td(name="Something", type=node_custom_root_td, n=n_sim,
          mean=10, sd=5)
Below is an example for a function that can be used to define a custom time-dependent child node:
node_custom_child <- function(data, parents) {
  out <- numeric(nrow(data))
  out[data$other_event] <- rnorm(n=sum(data$other_event), mean=10, sd=3)
  out[!data$other_event] <- rnorm(n=sum(!data$other_event), mean=5, sd=10)
  return(out)
}

dag <- empty_dag() +
  node_td("other", type="time_to_event", prob_fun=0.1) +
  node_td("whatever", type="custom_child", parents="other_event")
This function takes a random draw from a normal distribution with different specifications based on whether a previously updated time-dependent node called other is currently TRUE or FALSE.
sim_time Argument
Below we give an example of how the sim_time argument may be used. The following function simply returns the square of the current simulation time as output:
node_square_sim_time <- function(data, sim_time, n_sim) {
  # rep() has no argument called n, so times= is used to repeat the value
  return(rep(sim_time^2, times=n_sim))
}

dag <- empty_dag() +
  node_td("unclear", type=node_square_sim_time, n_sim=100)
Note that we did not (and should not!) actually define the sim_time argument in the node_td() definition of the node, because it will be passed internally, just like data is. As long as sim_time is a named argument of the function the user is passing, it will be handled automatically. In real simulation studies this feature may be used to create time-scale dependent risks or effects for some time-dependent events of interest.
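As a rough illustration of a time-scale dependent risk, the following sketch defines a node whose event probability grows linearly with the simulation time. The function node_time_risk() and its arguments base_p and slope are hypothetical names chosen for this example, and the linear form of the risk is an assumption made purely for simplicity.

```r
# Hypothetical time-dependent node: the event probability increases
# linearly with the current simulation time.
node_time_risk <- function(data, sim_time, base_p=0.01, slope=0.02) {
  p <- min(1, base_p + slope * sim_time)  # cap the probability at 1
  out <- rbinom(n=nrow(data), size=1, prob=p) == 1
  return(out)
}

# manual check outside of a simulation, passing sim_time by hand
toy <- data.frame(id=1:5)
late <- node_time_risk(toy, sim_time=100)  # probability is capped at 1
late
```

In a DAG, this function would be passed as node_td("event", type=node_time_risk, base_p=0.01, slope=0.02), without specifying sim_time, since that argument is filled in internally.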
past_states Argument
As stated earlier, another special kind of argument is the past_states argument, which allows users direct access to past states of the simulation. Below is an example of how this might be used:
node_prev_state <- function(data, past_states, sim_time) {
  if (sim_time < 3) {
    return(rnorm(n=nrow(data)))
  } else {
    return(past_states[[sim_time-2]]$A + rnorm(n=nrow(data)))
  }
}

dag <- empty_dag() +
  node_td("A", type=node_prev_state, parents="A")
This function simply returns the value used two simulation time steps ago plus a normally distributed random value. To make this happen, we actually use both the sim_time argument and the past_states argument. Note that, again, we do not (and cannot!) define these arguments in the node_td() definition of the node. They are simply used internally.
A crucial thing to make the previous code work in an actual simulation is the save_states argument of the sim_discrete_time() function. This argument controls which states should be saved internally. If users want to use previous states, these need to be saved, so the argument should in almost all cases be set to save_states="all", as shown below:
sim <- sim_discrete_time(dag, n_sim=100, max_t=10, save_states="all")
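The same mechanism can be used to summarize several earlier states at once. The sketch below is a hypothetical node, node_lagged_mean(), that averages a variable over the last few saved states. It assumes, in line with the indexing used in node_prev_state() above, that past_states[[t]] holds the data at simulation time t. The arguments var and window are names introduced for this example only, and the function is tested here on a hand-made list rather than inside a real simulation.

```r
# Hypothetical node: mean of a variable over the previous `window` states.
# Assumes past_states[[t]] contains the data at simulation time t.
node_lagged_mean <- function(data, past_states, sim_time, var, window=2) {
  if (sim_time <= window) {
    # not enough history yet, return a placeholder value
    return(rep(0, nrow(data)))
  }
  idx <- seq(sim_time - window, sim_time - 1)
  vals <- sapply(past_states[idx], function(state) state[[var]])
  # matrix() keeps one row per individual even if nrow(data) is 1
  out <- rowMeans(matrix(vals, nrow=nrow(data)))
  return(out)
}

# hand-made stand-in for the internally saved states at t = 1 and t = 2
toy <- data.frame(id=1:3)
past <- list(data.frame(A=c(1, 2, 3)), data.frame(A=c(3, 4, 5)))
res <- node_lagged_mean(toy, past_states=past, sim_time=3, var="A")
res  # 2 3 4
```

As with node_prev_state(), such a function only works in a real simulation if the required states are actually saved, for example with save_states="all".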
Users may also use the enhanced formula interface directly with custom child nodes and custom time-dependent nodes. This is described in detail in the vignette on specifying formulas (see vignette(topic="v_using_formulas", package="simDAG")).
Using custom functions as node types is an advanced technique to obtain specialized simulated data. It is sadly impossible to cover all use cases here, but we would like to give some general recommendations nonetheless:
1. Pass the function to type directly, do not use a string. This might avoid some weird scoping issues, depending on which environment the simulation is performed in.
2. Consider whether node_identity() might be used instead. In many cases, it is a lot easier to just use a node of type identity instead of defining a new function.