R/simulate.R

Defines functions simulateInterventions

Documented in simulateInterventions

#' Simulate data of a causal (possibly cyclic model) under interventions.
#' 
#' @description Simulate data of a causal (possibly cyclic model) under interventions.
#' 
#' @param n Number of observations.
#' @param p Number of variables.
#' @param df Degrees of freedom in t-distribution of noise and interventions. 
#' @param rhoNoise Correlation between noise terms to model hidden variabkes. 
#' Set to 0 for independent noise.
#' @param snrPar Signal-to-noise parameter: steers what proportion of the variance stems from 
#' the signal resp.\ from the noise: The SNR is given by $SNR = (1-\code{snrPar})/\code{snrPar}$), 
#' see details. Only holds when \code{cyclic = FALSE}.
#' @param sparse Probability that an entry \eqn{i,j} in adjacency matrix is 1.
#' @param doInterv Set to TRUE if interventions should be do-interventions; otherwise
#' noise interventions (also called shift interventions) are generated.
#' @param numberInt Total number of settings. 
#' @param strengthInt Regulates the strength of the interventions, see details.
#' @param cyclic Set to TRUE is resulting graph should contain a cycle.
#' @param strengthCycle Steers strength of feedback, see details.
#' @param modelMis Add a model misspecification that applies \code{tanh(modelMisPar*x)/modelMisPar)}
#' morginally to each variable after having generated X from the causal DAG.
#' @param modelMisPar Parameter steering the strength of the model misspecification. 
#' @param seed Random seed.
#' @return A list with the following elements: 
#' \itemize{
#'   \item \code{X} \eqn{n x p}-dimensional data matrix
#'   \item \code{environment} Indicator of the experiment or the intervention type an 
#'   observation belongs to. A numeric vector of length \eqn{n}.  
#'   \item \code{interventions} A list of length \eqn{n}. Indicates location of interventions
#'   for each data point.
#'   \item \code{whereInt} A list of length  \code{numberInt}. Indicates location of interventions
#'   in each setting.
#'   \item \code{noise}
#'   \item \code{configs} A list with the generated adjacency matrix (\code{trueA})
#'   as well as all input arguments.
#' }
#' 
#' @details The adjacency matrix \eqn{A} is generated as follows. Assume the variables 
#' with indices \eqn{{1, \ldots, p}} are causally ordered. For each edge from node 
#' \eqn{i} to node \eqn{j} where \eqn{i} precedes \eqn{j} in the causal ordering, 
#' we draw a sample from Bin(\code{sparse}) to determine whether to add an edge 
#' from node \eqn{i} to node \eqn{j}. After having sampled the non-zero entries 
#' of \eqn{A} in this fashion, we sample the coefficients from Unif(-1,1). 
#' As described below, the edge weights are later rescaled to achieve a specified 
#' signal-to-noise ratio. We exclude the possibility of \eqn{A = 0}, 
#' i.e. we resample until \eqn{A} contains at least one non-zero entry.
#' 
#' Second, the interventions are generated as follows. \code{numberInt} denotes the total 
#' number of (interventional and observational) settings that are generated. 
#' For each variable, we sample uniformly at random with replacement one setting 
#' in which this variable is intervened on. In other words, each variable is 
#' intervened on in exactly one setting. Hence it is possible that there are 
#' settings where no interventions take place which then correspond to the 
#' observational case. Similarly, there may be settings where interventions 
#' are performed on multiple variables at once. After defining the settings, 
#' we sample (uniformly at random with replacement) what setting each data point 
#' belongs to. So for each setting we generate approximately the same number of 
#' samples. In one generated data set, the interventions are all of the same 
#' type, i.e. they are either all shift interventions (when \code{doInterv = FALSE}) 
#' or do-interventions (when \code{doInterv = TRUE}). In both cases, an intervention 
#' on \eqn{X_j} is modelled by generating \eqn{Z_j} as \eqn{Z_j ~} \code{strengthInt} \eqn{* t}(\code{dfNoise}). 
#' If \code{strengthInt} = 0, all interventional settings correspond to purely 
#' observational data.
#' 
#' Third, the noise terms \eqn{\epsilon} are generated by first sampling from 
#' \eqn{N(0,\Sigma)} where \eqn{\Sigma_{i,i} = 1} and 
#' \eqn{\Sigma_{i,j} =} \code{rhoNoise}. To steer the signal-to-noise ratio, 
#' we set the variance of the noise terms of all nodes except source nodes 
#' to \code{snrPar} where \eqn{0 < }\code{snrPar}\eqn{ \le 1}. Stepping through the 
#' variables in causal order, for each variable \eqn{X_j} that has parents, we 
#' uniformly rescale the edge weights \eqn{\beta_{j,k}} for \eqn{k = 1, \ldots, p} 
#' in the structural equation of variable \eqn{X_j} such that the variance of 
#' the sum \eqn{\sum_{k=1}^p  \beta_{j,k} X_k + \epsilon_j} is approximately 
#' 1 in the observational setting. In other words, the parameter \code{snrPar}
#'  steers what proportion of the variance stems from the signal given by  
#'  \eqn{\sum_{k=1}^p  \beta_{j,k} X_k} and what proportion stems from the 
#'  noise \eqn{\epsilon_j}. The signal-to-noise ratio can then be computed 
#'  as SNR = (1-\code{snrPar})/\code{snrPar}.
#'  
#' Forth, a cycle is added to the causal graph if \code{cyclic = TRUE}. If the 
#'  causal graph shall contain a cycle, we sample two nodes \eqn{i} and \eqn{j} 
#'  such that adding an edge between them creates a cycle in the causal graph. 
#'  We then compute the largest possible coefficient for this edge such that the 
#'  cycle product is smaller than 1. Subsequently, we sample the sign of the 
#'  coefficient and set the magnitude by scaling the largest possible coefficient 
#'  by \code{strengthCycle} where \eqn{0 < }\code{strengthCycle}\eqn{< 1}.
#'  
#' Fifth, we rescale the noise variables to obtain a \eqn{t}-distribution with 
#'  \code{dfNoise} degrees of freedom. \eqn{X} is then generated as 
#'  \eqn{X  = (I-A)^{-1}\epsilon} in the observational case; under a shift 
#'  interventions \eqn{X} can be generated as \eqn{X  = (I-A)^{-1}(\epsilon + Z)} 
#'  where the coordinates of \eqn{Z} are only non-zero for the variables 
#'  that are intervened on. Under a do-intervention on \eqn{X_j}, \eqn{\beta_{j,k}}
#'   for \eqn{k = 1, \ldots, p} are set to 0 to yield \eqn{A'} and \eqn{\epsilon_j}
#'   is set to \eqn{Z_j} to yield \eqn{\epsilon_j'}. We then obtain \eqn{X} as 
#'   \eqn{X  = (I-A')^{-1}\epsilon'}.
#'   
#'  Lastly, if \code{modelMis = TRUE} a model misspecification is added to the 
#'  data by marginally transforming all variables as \code{tanh(modelMisPar*x)/modelMisPar)}.
#'  
simulateInterventions <- function(n, p, df, rhoNoise, snrPar,
                                  sparse, doInterv,
                                  numberInt, strengthInt,
                                  cyclic, strengthCycle,
                                  modelMis = FALSE, modelMisPar = 1,
                                  seed =1){
  ###### set seed
  set.seed(seed)
  
  if(snrPar == 0 | snrPar > 1){
    stop("snrPar needs to be larger than 0 and smaller or equal to 1.")
  }
  
  if(cyclic & strengthCycle >= 1){
    stop("strengthCycle needs to be smaller than 1.")
  }
  
  if(numberInt <= 0){
    stop("numberInt needs to be at least 1.")
  }
  
  # generate A
  ## skeleton
  cont <- TRUE
  while(cont){
    AS <- 0*diag(p)
    for (k in 1:p){
      for (k2 in 1:p){
        if(k2>k) AS[k,k2] <- rbinom(1,1,sparse)
      }
    }
    if(sum(AS)!=0) cont <- FALSE
  }
    
  A <- AS * matrix( runif(p^2,-1,1),nrow=p)
  
  
  ####### initialize
  Perturb <- matrix(0,nrow=n,ncol=p)

  ### simulate environments
  whereInt <- list()
  if(numberInt == 1){
    whereInt[[1]] <- integer(0)
    environment <- rep(1, n)
    interventions <- lapply(1:n, function(i) integer(0))
  }else{
    environment_var <- sample(1:numberInt, p, replace=TRUE)
    
    for(nic in 1:numberInt){
      whereInt[[nic]] <- which(environment_var == nic)  
    }
    
    environment <- sample(1:numberInt, n, replace=TRUE)
    interventions <- list()
    for (i in 1:n){
      interventions[[i]] <- whereInt[[environment[i]]]
    }
  }
  
  
  ###### simulate shift interventions 
  for(intervEnv in 1:numberInt){
    rowsSelect <- environment == intervEnv
    colsSelect <- whereInt[[intervEnv]]
    Perturb[rowsSelect,  whereInt[[intervEnv]]] <-  
      strengthInt*rt( length(colsSelect) * sum(rowsSelect) ,df=df)
  }
  
  ###### simulate noise
  noise <- matrix(rnorm(n*p),nrow=n)
  Sigma <- matrix(rhoNoise,nrow=p,ncol=p)
  diag(Sigma) <- 1
  CC <- chol(Sigma)
  noise <- noise %*% CC
  noise[,which(colSums(A) != 0)] <- sqrt(snrPar)*noise[,which(colSums(A) != 0)] 
  
  # change A to achieve given SNR in observational case
  A <- setSNR(noise = noise, A)
    
  if(cyclic){
    cont <- TRUE
    Atmp <- A
  
    while(cont){
      # add cycle to A
      # sample uniformly at random one connection backwards
      backConnect <- sample(1:p, 2, replace = FALSE)
      Atmp[max(backConnect), min(backConnect)] <- 1
      
      if(sum(A) == 0){
        Atmp[min(backConnect), max(backConnect)] <- A[min(backConnect), max(backConnect)] <- runif(1,-1,1)
      }
      
      # find cyclic path from max(backConnect) back to max(backConnect)
      # and get largest coefficient path can have (if it exists)
      cpmat <- getLargestWeightForCycle(Atmp)$cpMat
      if(cpmat[max(backConnect), max(backConnect)] != 0){
        cont <- FALSE
      }else{
        Atmp <- A
      }
    }
    
    # sample sign of weight of backward connection
    signEntry <- sample(c(-1,1),1)
    # adjust edge weight by strengthCycle parameter
    A[max(backConnect), min(backConnect)] <- signEntry*strengthCycle/
        cpmat[max(backConnect), max(backConnect)]
  }
  
  # apply t-distribution for noise
  noise <- apply(noise, 2, function(x) qt(pnorm(x), df = df))
  
  if(doInterv & strengthInt > 0){
    X <- matrix(0, nrow = n, ncol = p)
    # for do interventions cycle through environments and adjust A 
    # as well as noise + Perturb accordingly
    for(env in 1:numberInt){
      # data points from setting env
      inds <- which(environment == env) 
      # modify A: intervention on var j means to cut all incoming connections
      Amod <- A
      Amod[,whereInt[[env]]] <- 0
      # get inverse
      inv <- solve(diag(p) - Amod) 
      # replace columns of noise where intervention was applied
      noiseMat <- noise[inds,,drop = FALSE]
      noiseMat[,whereInt[[env]]] <- Perturb[inds,whereInt[[env]], drop = FALSE]
      X[inds,] <- noiseMat%*%inv
    }
    # noise <- noiseMat
  }else{
    inv <- solve(diag(p) - A)
    X <- (noise + Perturb)%*%inv
  }
  
  
  if(modelMis){
    X <- apply(X,2,function(x) tanh(modelMisPar*x)/modelMisPar)
  }
  
  
  config.options <- list(trueA = A, 
                         n = n, 
                         p = p,
                         df = df, 
                         rhoNoise = rhoNoise, 
                         snrPar = snrPar,
                         sparse = sparse,
                         doInterv = doInterv,
                         numberInt = numberInt,
                         strengthInt = strengthInt,
                         cyclic = cyclic, 
                         strengthCycle = strengthCycle,
                         modelMis = modelMis,
                         modelMisPar = modelMisPar,
                         seed = seed)
                         

  list(X = X, 
       environment = environment,
       interventions = interventions,
       whereInt = whereInt,
       noise = noise,
       configs = config.options)
}
christinaheinze/CompareCausalNetworks documentation built on Feb. 22, 2020, 12:37 p.m.