R/simulations.R
In fake: Flexible Data Simulation Using the Multivariate Normal Distribution

Documented in HugeAdjacency SimulateAdjacency SimulateClustering SimulateComponents SimulateCorrelation SimulateGraphical SimulatePrecision SimulateRegression SimulateStructural

#' Data simulation for Gaussian Graphical Modelling
#'
#' Simulates data from a Gaussian Graphical Model (GGM).
#'
#' @param n number of observations in the simulated dataset.
#' @param pk vector of the number of variables per group in the simulated
#'   dataset. The number of nodes in the simulated graph is \code{sum(pk)}. With
#'   multiple groups, the simulated (partial) correlation matrix has a block
#'   structure, where blocks arise from the integration of the \code{length(pk)}
#'   groups. This argument is only used if \code{theta} is not provided.
#' @param theta optional binary and symmetric adjacency matrix encoding the
#'   conditional independence structure.
#' @param implementation function for simulation of the graph. By default,
#'   algorithms implemented in \code{\link[huge]{huge.generator}} are used.
#'   Alternatively, a user-defined function can be used. It must take \code{pk},
#'   \code{topology} and \code{nu} as arguments and return a
#'   \code{(sum(pk)*(sum(pk)))} binary and symmetric matrix for which diagonal
#'   entries are all equal to zero. This function is only applied if
#'   \code{theta} is not provided.
#' @param topology topology of the simulated graph. If using
#'   \code{implementation=HugeAdjacency}, possible values are listed for the
#'   argument \code{graph} of \code{\link[huge]{huge.generator}}. These are:
#'   "random", "hub", "cluster", "band" and "scale-free".
#' @param nu_within probability of having an edge between two nodes belonging to
#'   the same group, as defined in \code{pk}. If \code{length(pk)=1}, this is
#'   the expected density of the graph. If \code{implementation=HugeAdjacency},
#'   this argument is only used for \code{topology="random"} or
#'   \code{topology="cluster"} (see argument \code{prob} in
#'   \code{\link[huge]{huge.generator}}). Only used if \code{nu_mat} is not
#'   provided.
#' @param nu_between probability of having an edge between two nodes belonging
#'   to different groups, as defined in \code{pk}. By default, the same density
#'   is used for within and between blocks (\code{nu_within}=\code{nu_between}).
#'   Only used if \code{length(pk)>1}. Only used if \code{nu_mat} is not
#'   provided.
#' @param nu_mat matrix of probabilities of having an edge between nodes
#'   belonging to a given pair of node groups defined in \code{pk}.
#' @param output_matrices logical indicating if the true precision and (partial)
#'   correlation matrices should be included in the output.
#' @param v_within vector defining the (range of) nonzero entries in the
#'   diagonal blocks of the precision matrix. These values must be between -1
#'   and 1 if \code{pd_strategy="min_eigenvalue"}. If \code{continuous=FALSE},
#'   \code{v_within} is the set of possible precision values. If
#'   \code{continuous=TRUE}, \code{v_within} is the range of possible precision
#'   values.
#' @param v_between vector defining the (range of) nonzero entries in the
#'   off-diagonal blocks of the precision matrix. This argument is the same as
#'   \code{v_within} but for off-diagonal blocks. It is only used if
#'   \code{length(pk)>1}.
#' @param v_sign vector of possible signs for precision matrix entries. Possible
#'   inputs are: \code{-1} for positive partial correlations, \code{1} for
#'   negative partial correlations, or \code{c(-1, 1)} for both positive and
#'   negative partial correlations.
#' @param continuous logical indicating whether to sample precision values from
#'   a uniform distribution between the minimum and maximum values in
#'   \code{v_within} (diagonal blocks) or \code{v_between} (off-diagonal blocks)
#'   (if \code{continuous=TRUE}) or from proposed values in \code{v_within}
#'   (diagonal blocks) or \code{v_between} (off-diagonal blocks) (if
#'   \code{continuous=FALSE}).
#' @param pd_strategy method to ensure that the generated precision matrix is
#'   positive definite (and hence can be a covariance matrix). If
#'   \code{pd_strategy="diagonally_dominant"}, the precision matrix is made
#'   diagonally dominant by setting the diagonal entries to the sum of absolute
#'   values on the corresponding row and a constant u. If
#'   \code{pd_strategy="min_eigenvalue"}, diagonal entries are set to the sum of
#'   the absolute value of the smallest eigenvalue of the precision matrix with
#'   zeros on the diagonal and a constant u.
#' @param ev_xx expected proportion of explained variance by the first Principal
#'   Component (PC1) of a Principal Component Analysis. This is the largest
#'   eigenvalue of the correlation (if \code{scale_ev=TRUE}) or covariance (if
#'   \code{scale_ev=FALSE}) matrix divided by the sum of eigenvalues. If
#'   \code{ev_xx=NULL} (the default), the constant u is chosen by maximising the
#'   contrast of the correlation matrix.
#' @param scale_ev logical indicating if the proportion of explained variance by
#'   PC1 should be computed from the correlation (\code{scale_ev=TRUE}) or
#'   covariance (\code{scale_ev=FALSE}) matrix. If \code{scale_ev=TRUE}, the
#'   correlation matrix is used as parameter of the multivariate normal
#'   distribution.
#' @param u_list vector with two numeric values defining the range of values to
#'   explore for constant u.
#' @param tol accuracy for the search of parameter u as defined in
#'   \code{\link[stats]{optimise}}.
#' @param scale logical indicating if the true mean is zero and true variance is
#'   one for all simulated variables. The observed mean and variance may be
#'   slightly off by chance.
#' @param ... additional arguments passed to the graph simulation function
#'   provided in \code{implementation}.
#'
#' @seealso \code{\link{SimulatePrecision}}, \code{\link{MakePositiveDefinite}}
#'
#' @family simulation functions
#'
#' @return A list with: \item{data}{simulated data with \code{n} observation and
#'   \code{sum(pk)} variables.} \item{theta}{adjacency matrix of the simulated
#'   graph.} \item{omega}{simulated (true) precision matrix. Only returned if
#'   \code{output_matrices=TRUE}.} \item{phi}{simulated (true) partial
#'   correlation matrix. Only returned if \code{output_matrices=TRUE}.}
#'   \item{sigma}{simulated (true) covariance matrix. Only returned if
#'   \code{output_matrices=TRUE}.} \item{u}{value of the constant u used for the
#'   simulation of \code{omega}. Only returned if \code{output_matrices=TRUE}.}
#'
#' @details The simulation is done in two steps with (i) generation of a graph,
#'   and (ii) sampling from multivariate Normal distribution for which nonzero
#'   entries in the partial correlation matrix correspond to the edges of the
#'   simulated graph. This procedure ensures that the conditional independence
#'   structure between the variables corresponds to the simulated graph.
#'
#'   Step 1 is done using \code{\link{SimulateAdjacency}}.
#'
#'   In Step 2, the precision matrix (inverse of the covariance matrix) is
#'   simulated using \code{\link{SimulatePrecision}} so that (i) its nonzero
#'   entries correspond to edges in the graph simulated in Step 1, and (ii) it
#'   is positive definite (see \code{\link{MakePositiveDefinite}}). The inverse
#'   of the precision matrix is used as covariance matrix to simulate data from
#'   a multivariate Normal distribution.
#'
#'   The outputs of this function can be used to evaluate the ability of a
#'   graphical model to recover the conditional independence structure.
#'
#' @references \insertRef{ourstabilityselection}{fake}
#'
#' @examples
#' oldpar <- par(no.readonly = TRUE)
#' par(mar = rep(7, 4))
#'
#' # Simulation of random graph with 50 nodes
#' set.seed(1)
#' simul <- SimulateGraphical(n = 100, pk = 50, topology = "random", nu_within = 0.05)
#' print(simul)
#' plot(simul)
#'
#' # Simulation of scale-free graph with 20 nodes
#' set.seed(1)
#' simul <- SimulateGraphical(n = 100, pk = 20, topology = "scale-free")
#' plot(simul)
#'
#' # Extracting true precision/correlation matrices
#' set.seed(1)
#' simul <- SimulateGraphical(
#'   n = 100, pk = 20,
#'   topology = "scale-free", output_matrices = TRUE
#' )
#' str(simul)
#'
#' # Simulation of multi-block data
#' set.seed(1)
#' pk <- c(20, 30)
#' simul <- SimulateGraphical(
#'   n = 100, pk = pk,
#'   pd_strategy = "min_eigenvalue"
#' )
#' mycor <- cor(simul$data)
#' Heatmap(mycor,
#'   col = c("darkblue", "white", "firebrick3"),
#'   legend_range = c(-1, 1), legend_length = 50,
#'   legend = FALSE, axes = FALSE
#' )
#' for (i in 1:2) {
#'   axis(side = i, at = c(0.5, pk[1] - 0.5), labels = NA)
#'   axis(
#'     side = i, at = mean(c(0.5, pk[1] - 0.5)),
#'     labels = ifelse(i == 1, yes = "Group 1", no = "Group 2"),
#'     tick = FALSE, cex.axis = 1.5
#'   )
#'   axis(side = i, at = c(pk[1] + 0.5, sum(pk) - 0.5), labels = NA)
#'   axis(
#'     side = i, at = mean(c(pk[1] + 0.5, sum(pk) - 0.5)),
#'     labels = ifelse(i == 1, yes = "Group 2", no = "Group 1"),
#'     tick = FALSE, cex.axis = 1.5
#'   )
#' }
#'
#' # User-defined function for graph simulation
#' CentralNode <- function(pk, hub = 1) {
#'   theta <- matrix(0, nrow = sum(pk), ncol = sum(pk))
#'   theta[hub, ] <- 1
#'   theta[, hub] <- 1
#'   diag(theta) <- 0
#'   return(theta)
#' }
#' simul <- SimulateGraphical(n = 100, pk = 10, implementation = CentralNode)
#' plot(simul) # star
#' simul <- SimulateGraphical(n = 100, pk = 10, implementation = CentralNode, hub = 2)
#' plot(simul) # variable 2 is the central node
#'
#' # User-defined adjacency matrix
#' mytheta <- matrix(c(
#'   0, 1, 1, 0,
#'   1, 0, 0, 0,
#'   1, 0, 0, 1,
#'   0, 0, 1, 0
#' ), ncol = 4, byrow = TRUE)
#' simul <- SimulateGraphical(n = 100, theta = mytheta)
#' plot(simul)
#'
#' # User-defined adjacency and block structure
#' simul <- SimulateGraphical(n = 100, theta = mytheta, pk = c(2, 2))
#' mycor <- cor(simul$data)
#' Heatmap(mycor,
#'   col = c("darkblue", "white", "firebrick3"),
#'   legend_range = c(-1, 1), legend_length = 50, legend = FALSE
#' )
#'
#' par(oldpar)
#'
#' @export
SimulateGraphical <- function(n = 100, pk = 10, theta = NULL,
                              implementation = HugeAdjacency, topology = "random",
                              nu_within = 0.1, nu_between = NULL, nu_mat = NULL,
                              v_within = c(0.5, 1), v_between = c(0.1, 0.2),
                              v_sign = c(-1, 1), continuous = TRUE,
                              pd_strategy = "diagonally_dominant", ev_xx = NULL, scale_ev = TRUE,
                              u_list = c(1e-10, 1), tol = .Machine$double.eps^0.25,
                              scale = TRUE, output_matrices = FALSE, ...) {
  # Defining number of nodes
  p <- sum(pk)
  if (!is.null(theta)) {
    if (ncol(theta) != p) {
      p <- pk <- ncol(theta)
    }
  }

  # Defining the between-block density
  if (is.null(nu_between)) {
    nu_between <- nu_within
  }

  # Building adjacency matrix
  if (is.null(theta)) {
    theta <- SimulateAdjacency(
      pk = pk,
      implementation = implementation, topology = topology,
      nu_within = nu_within, nu_between = nu_between, nu_mat = nu_mat, ...
    )
  }

  # Simulation of a precision matrix
  out <- SimulatePrecision(
    pk = pk, theta = theta,
    v_within = v_within, v_between = v_between,
    v_sign = v_sign, continuous = continuous,
    pd_strategy = pd_strategy, ev_xx = ev_xx, scale = scale_ev,
    u_list = u_list, tol = tol
  )
  omega <- out$omega

  # Computing the covariance matrix
  if (scale) {
    sigma <- stats::cov2cor(solve(omega))
  } else {
    sigma <- solve(omega)
  }

  # Computing the partial correlation matrix
  if (output_matrices) {
    phi <- -stats::cov2cor(omega) + 2 * diag(ncol(omega))
  }

  # Simulating data from multivariate normal distribution
  x <- MASS::mvrnorm(n, rep(0, p), sigma)
  colnames(x) <- paste0("var", 1:ncol(x))
  rownames(x) <- paste0("obs", 1:nrow(x))

  # Defining the class of theta
  class(theta) <- c("matrix", "adjacency_matrix")

  if (output_matrices) {
    out <- list(
      data = x, theta = theta,
      omega = omega, phi = phi, sigma = sigma,
      u = out$u
    )
  } else {
    out <- list(data = x, theta = theta)
  }

  # Defining the class
  class(out) <- "simulation_graphical_model"

  return(out)
}


#' Simulation of a correlation matrix
#'
#' Simulates a correlation matrix. This is done in three steps with (i) the
#' simulation of an undirected graph encoding conditional independence, (ii) the
#' simulation of a (positive definite) precision matrix given the graph, and
#' (iii) the re-scaling of the inverse of the precision matrix.
#'
#' @inheritParams SimulateGraphical
#'
#' @details In Step 1, the conditional independence structure between the
#'   variables is simulated. This is done using \code{\link{SimulateAdjacency}}.
#'
#'   In Step 2, the precision matrix is simulated using
#'   \code{\link{SimulatePrecision}} so that (i) its nonzero entries correspond
#'   to edges in the graph simulated in Step 1, and (ii) it is positive definite
#'   (see \code{\link{MakePositiveDefinite}}).
#'
#'   In Step 3, the covariance is calculated as the inverse of the precision
#'   matrix. The correlation matrix is then obtained by re-scaling the
#'   covariance matrix (see \code{\link[stats]{cov2cor}}).
#'
#' @family simulation functions
#'
#' @seealso \code{\link{SimulatePrecision}}, \code{\link{MakePositiveDefinite}}
#'
#' @return A list with: \item{sigma}{simulated correlation matrix.}
#'   \item{omega}{simulated precision matrix. Only returned if
#'   \code{output_matrices=TRUE}.} \item{theta}{adjacency matrix of the
#'   simulated graph. Only returned if \code{output_matrices=TRUE}.}
#'
#' @examples
#' oldpar <- par(no.readonly = TRUE)
#' par(mar = rep(7, 4))
#'
#' # Random correlation matrix
#' set.seed(1)
#' simul <- SimulateCorrelation(pk = 10)
#' Heatmap(simul$sigma,
#'   col = c("navy", "white", "darkred"),
#'   text = TRUE, format = "f", digits = 2,
#'   legend_range = c(-1, 1)
#' )
#'
#' # Correlation matrix with homogeneous block structure
#' set.seed(1)
#' simul <- SimulateCorrelation(
#'   pk = c(5, 5),
#'   nu_within = 1,
#'   nu_between = 0,
#'   v_sign = -1,
#'   v_within = 1
#' )
#' Heatmap(simul$sigma,
#'   col = c("navy", "white", "darkred"),
#'   text = TRUE, format = "f", digits = 2,
#'   legend_range = c(-1, 1)
#' )
#'
#' # Correlation matrix with heterogeneous block structure
#' set.seed(1)
#' simul <- SimulateCorrelation(
#'   pk = c(5, 5),
#'   nu_within = 0.5,
#'   nu_between = 0,
#'   v_sign = -1
#' )
#' Heatmap(simul$sigma,
#'   col = c("navy", "white", "darkred"),
#'   text = TRUE, format = "f", digits = 2,
#'   legend_range = c(-1, 1)
#' )
#'
#' par(oldpar)
#'
#' @export
SimulateCorrelation <- function(pk = 10, theta = NULL,
                                implementation = HugeAdjacency, topology = "random",
                                nu_within = 0.1, nu_between = NULL, nu_mat = NULL,
                                v_within = c(0.5, 1), v_between = c(0.1, 0.2),
                                v_sign = c(-1, 1), continuous = TRUE,
                                pd_strategy = "diagonally_dominant", ev_xx = NULL, scale_ev = TRUE,
                                u_list = c(1e-10, 1), tol = .Machine$double.eps^0.25,
                                output_matrices = FALSE, ...) {
  # Defining number of nodes
  p <- sum(pk)
  if (!is.null(theta)) {
    if (ncol(theta) != p) {
      p <- pk <- ncol(theta)
    }
  }

  # Defining the between-block density
  if (is.null(nu_between)) {
    nu_between <- nu_within
  }

  # Building adjacency matrix
  if (is.null(theta)) {
    theta <- SimulateAdjacency(
      pk = pk,
      implementation = implementation, topology = topology,
      nu_within = nu_within, nu_between = nu_between, nu_mat = nu_mat, ...
    )
  }

  # Simulation of a precision matrix
  out <- SimulatePrecision(
    pk = pk, theta = theta,
    v_within = v_within, v_between = v_between,
    v_sign = v_sign, continuous = continuous,
    pd_strategy = pd_strategy, ev_xx = ev_xx, scale = scale_ev,
    u_list = u_list, tol = tol
  )
  omega <- out$omega

  # Computing the correlation matrix
  sigma <- stats::cov2cor(solve(omega))

  # Preparing the output
  if (output_matrices) {
    out <- list(
      sigma = sigma,
      omega = omega,
      theta = theta
    )
  } else {
    out <- list(sigma = sigma)
  }

  return(out)
}


#' Data simulation for sparse Principal Component Analysis
#'
#' Simulates data with with independent groups of variables.
#'
#' @inheritParams SimulateGraphical
#' @param adjacency optional binary and symmetric adjacency matrix encoding the
#'   conditional graph structure between observations. The clusters encoded in
#'   this argument must be in line with those indicated in \code{pk}. Edges in
#'   off-diagonal blocks are not allowed to ensure that the simulated orthogonal
#'   components are sparse. Corresponding entries in the precision matrix will
#'   be set to zero.
#'
#' @details The data is simulated from a centered multivariate Normal
#'   distribution with a block-diagonal covariance matrix. Independence between
#'   variables from the different blocks ensures that sparse orthogonal
#'   components can be generated.
#'
#'   The block-diagonal partial correlation matrix is obtained using a graph
#'   structure encoding the conditional independence between variables. The
#'   orthogonal latent variables are obtained from eigendecomposition of the
#'   true correlation matrix. The sparse eigenvectors contain the weights of the
#'   linear combination of variables to construct the latent variable (loadings
#'   coefficients). The proportion of explained variance by each of the latent
#'   variable is computed from eigenvalues.
#'
#'   As latent variables are defined from the true correlation matrix, the
#'   number of sparse orthogonal components is not limited by the number of
#'   observations and is equal to \code{sum(pk)}.
#'
#' @return A list with: \item{data}{simulated data with \code{n} observation and
#'   \code{sum(pk)} variables.} \item{loadings}{loadings coefficients of the
#'   orthogonal latent variables (principal components).} \item{theta}{support
#'   of the loadings coefficients.} \item{ev}{proportion of explained variance
#'   by each of the orthogonal latent variables.} \item{adjacency}{adjacency
#'   matrix of the simulated graph.} \item{omega}{simulated (true) precision
#'   matrix. Only returned if \code{output_matrices=TRUE}.} \item{phi}{simulated
#'   (true) partial correlation matrix. Only returned if
#'   \code{output_matrices=TRUE}.} \item{C}{ simulated (true) correlation
#'   matrix. Only returned if \code{output_matrices=TRUE}.}
#'
#' @seealso \code{\link{MakePositiveDefinite}}
#'
#' @family simulation functions
#'
#' @references \insertRef{ourstabilityselection}{fake}
#'
#' @examples
#' \donttest{
#' # Simulation of 3 components with high e.v.
#' set.seed(1)
#' simul <- SimulateComponents(pk = c(5, 3, 4), ev_xx = 0.4)
#' print(simul)
#' plot(simul)
#' plot(cumsum(simul$ev), ylim = c(0, 1), las = 1)
#'
#' # Simulation of 3 components with moderate e.v.
#' set.seed(1)
#' simul <- SimulateComponents(pk = c(5, 3, 4), ev_xx = 0.25)
#' print(simul)
#' plot(simul)
#' plot(cumsum(simul$ev), ylim = c(0, 1), las = 1)
#'
#' # Simulation of multiple components with low e.v.
#' pk <- sample(3:10, size = 5, replace = TRUE)
#' simul <- SimulateComponents(
#'   pk = pk,
#'   nu_within = 0.3, v_within = c(0.8, 0.5), v_sign = -1, ev_xx = 0.1
#' )
#' plot(simul)
#' plot(cumsum(simul$ev), ylim = c(0, 1), las = 1)
#' }
#' @export
SimulateComponents <- function(n = 100, pk = c(10, 10),
                               adjacency = NULL, nu_within = 1,
                               v_within = c(0.5, 1), v_sign = -1, continuous = TRUE,
                               pd_strategy = "min_eigenvalue", ev_xx = 0.1, scale_ev = TRUE,
                               u_list = c(1e-10, 1), tol = .Machine$double.eps^0.25,
                               scale = TRUE, output_matrices = FALSE) {
  # Using multi-block simulator with unconnected blocks
  out <- SimulateGraphical(
    n = n, pk = pk, theta = adjacency,
    implementation = HugeAdjacency,
    topology = "random",
    nu_within = nu_within, # fully connected components by default
    nu_between = 0, # need unconnected blocks
    v_within = v_within,
    v_between = 0,
    v_sign = v_sign,
    continuous = continuous,
    pd_strategy = pd_strategy, ev_xx = ev_xx, scale_ev = scale_ev,
    u_list = u_list, tol = tol,
    scale = scale, output_matrices = TRUE
  )

  # Eigendecomposition of the covariance
  eig <- eigen(out$sigma)

  # Definition of membership
  membership <- NULL
  for (i in 1:length(pk)) {
    membership <- c(membership, rep(i, each = pk[i]))
  }
  names(membership) <- colnames(out$data)
  out$membership <- membership

  # Re-naming the outputs
  out$adjacency <- out$theta

  # Definition of sparse principal components
  out$loadings <- round(eig$vectors, digits = 10)
  out$theta <- ifelse(out$loadings != 0, yes = 1, no = 0)
  rownames(out$theta) <- rownames(out$loadings) <- colnames(out$adjacency)
  colnames(out$theta) <- colnames(out$loadings) <- paste0("comp", 1:ncol(out$theta))

  # Definition of proportion of explained variance
  ev <- eig$values / sum(eig$values)
  names(ev) <- colnames(out$theta)
  out$ev <- ev

  # Re-arranging the output
  out <- out[c("data", "loadings", "theta", "ev", "membership", "omega", "phi", "C", "u")]
  if (!output_matrices) {
    out <- out[c("data", "loadings", "theta", "ev", "membership")]
  }

  # Defining the class
  class(out) <- "simulation_components"

  return(out)
}


#' Data simulation for multivariate regression
#'
#' Simulates data with outcome(s) and predictors, where only a subset of the
#' predictors actually contributes to the definition of the outcome(s).
#'
#' @param n number of observations in the simulated dataset. Not used if
#'   \code{xdata} is provided.
#' @param pk number of predictor variables. A subset of these variables
#'   contribute to the outcome definition (see argument \code{nu_xy}). Not used
#'   if \code{xdata} is provided.
#' @param xdata optional data matrix for the predictors with variables as
#'   columns and observations as rows. A subset of these variables contribute to
#'   the outcome definition (see argument \code{nu_xy}).
#' @param family type of regression model. Possible values include
#'   \code{"gaussian"} for continuous outcome(s) or \code{"binomial"} for binary
#'   outcome(s).
#' @param q number of outcome variables.
#' @param theta binary matrix with as many rows as predictors and as many
#'   columns as outcomes. A nonzero entry on row \eqn{i} and column \eqn{j}
#'   indicates that predictor \eqn{i} contributes to the definition of outcome
#'   \eqn{j}.
#' @param nu_xy vector of length \code{q} with expected proportion of predictors
#'   contributing to the definition of each of the \code{q} outcomes.
#' @param beta_abs vector defining the range of nonzero regression coefficients
#'   in absolute values. If \code{continuous=FALSE}, \code{beta_abs} is the set
#'   of possible precision values. If \code{continuous=TRUE}, \code{beta_abs} is
#'   the range of possible precision values. Note that regression coefficients
#'   are re-scaled if \code{family="binomial"} to ensure that the desired
#'   concordance statistic can be achieved (see argument \code{ev_xy}) so they
#'   may not be in this range.
#' @param beta_sign vector of possible signs for regression coefficients.
#'   Possible inputs are: \code{1} for positive coefficients, \code{-1} for
#'   negative coefficients, or \code{c(-1, 1)} for both positive and negative
#'   coefficients.
#' @param continuous logical indicating whether to sample regression
#'   coefficients from a uniform distribution between the minimum and maximum
#'   values in \code{beta_abs} (if \code{continuous=TRUE}) or from proposed
#'   values in \code{beta_abs} (if \code{continuous=FALSE}).
#' @param ev_xy vector of length \code{q} with expected goodness of fit measures
#'   for each of the \code{q} outcomes. If \code{family="gaussian"}, the vector
#'   contains expected proportions of variance in each of the \code{q} outcomes
#'   that can be explained by the predictors. If \code{family="binomial"}, the
#'   vector contains expected concordance statistics (i.e. area under the ROC
#'   curve) given the true probabilities.
#'
#' @return A list with: \item{xdata}{input or simulated predictor data.}
#'   \item{ydata}{simulated outcome data.} \item{beta}{matrix of true beta
#'   coefficients used to generate outcomes in \code{ydata} from predictors in
#'   \code{xdata}.} \item{theta}{binary matrix indicating the predictors from
#'   \code{xdata} contributing to the definition of each of the outcome
#'   variables in \code{ydata}.}
#'
#' @family simulation functions
#'
#' @references \insertRef{ourstabilityselection}{fake}
#'
#' @examples
#' \donttest{
#' ## Independent predictors
#'
#' # Univariate continuous outcome
#' set.seed(1)
#' simul <- SimulateRegression(pk = 15)
#' summary(simul)
#'
#' # Univariate binary outcome
#' set.seed(1)
#' simul <- SimulateRegression(pk = 15, family = "binomial")
#' table(simul$ydata)
#'
#' # Multiple continuous outcomes
#' set.seed(1)
#' simul <- SimulateRegression(pk = 15, q = 3)
#' summary(simul)
#'
#'
#' ## Blocks of correlated predictors
#'
#' # Simulation of predictor data
#' set.seed(1)
#' xsimul <- SimulateGraphical(pk = rep(5, 3), nu_within = 0.8, nu_between = 0, v_sign = -1)
#' Heatmap(cor(xsimul$data),
#'   legend_range = c(-1, 1),
#'   col = c("navy", "white", "darkred")
#' )
#'
#' # Simulation of outcome data
#' simul <- SimulateRegression(xdata = xsimul$data)
#' print(simul)
#' summary(simul)
#'
#'
#' ## Choosing expected proportion of explained variance
#'
#' # Data simulation
#' set.seed(1)
#' simul <- SimulateRegression(n = 1000, pk = 15, q = 3, ev_xy = c(0.9, 0.5, 0.2))
#' summary(simul)
#'
#' # Comparing with estimated proportion of explained variance
#' summary(lm(simul$ydata[, 1] ~ simul$xdata))
#' summary(lm(simul$ydata[, 2] ~ simul$xdata))
#' summary(lm(simul$ydata[, 3] ~ simul$xdata))
#'
#'
#' ## Choosing expected concordance (AUC)
#'
#' # Data simulation
#' set.seed(1)
#' simul <- SimulateRegression(
#'   n = 500, pk = 10,
#'   family = "binomial", ev_xy = 0.9
#' )
#'
#' # Comparing with estimated concordance
#' fitted <- glm(simul$ydata ~ simul$xdata,
#'   family = "binomial"
#' )$fitted.values
#' Concordance(observed = simul$ydata, predicted = fitted)
#' }
#' @export
SimulateRegression <- function(n = 100, pk = 10, xdata = NULL,
                               family = "gaussian", q = 1,
                               theta = NULL, nu_xy = 0.2,
                               beta_abs = c(0.1, 1), beta_sign = c(-1, 1), continuous = TRUE,
                               ev_xy = 0.7) {
  # TODO in future versions: introduce more families ("multinomial" and "cox")
  # Checking that either n and pk or xdata are provided
  if (is.null(xdata)) {
    if (is.null(pk) | is.null(n)) {
      stop("Argument 'xdata' must be provided if 'pk' and 'n' are not provided.")
    }
  }

  # Checking other inputs
  if (length(ev_xy) != q) {
    ev_xy <- rep(ev_xy[1], q)
  }
  if (length(nu_xy) != q) {
    nu_xy <- rep(nu_xy[1], q)
  }

  # Creating objects not provided as input
  if (!is.null(xdata)) {
    n <- nrow(xdata)
    p <- ncol(xdata)
  } else {
    p <- sum(pk)
    xsimul <- SimulateGraphical(
      n = n, pk = pk, theta = NULL,
      implementation = HugeAdjacency, topology = "random",
      nu_within = 0, nu_between = 0, nu_mat = NULL,
      v_within = 0, v_between = 0,
      v_sign = c(-1, 1), continuous = TRUE,
      pd_strategy = "diagonally_dominant", ev_xx = NULL, scale_ev = TRUE,
      u_list = c(1e-10, 1), tol = .Machine$double.eps^0.25,
      scale = TRUE, output_matrices = FALSE
    )
    xdata <- xsimul$data
  }

  # Checking theta if provided
  if (!is.null(theta)) {
    if (q == 1) {
      if (is.vector(theta)) {
        theta <- cbind(theta)
      }
    }
    if (ncol(theta) != q) {
      stop("Arguments 'theta' and 'q' are not compatible. Please provide a matrix 'theta' with 'q' columns.")
    }
    if (nrow(theta) != p) {
      stop("Please provide a matrix 'theta' with as many columns as predictors.")
    }
    theta <- ifelse(theta != 0, yes = 1, no = 0)
  }

  # Sampling true predictors
  if (is.null(theta)) {
    theta <- SamplePredictors(pk = p, q = q, nu = nu_xy, orthogonal = FALSE)
  }

  # Sampling regression coefficients
  beta <- theta
  if (continuous) {
    beta <- beta * matrix(stats::runif(n = nrow(beta) * ncol(beta), min = min(beta_abs), max = max(beta_abs)),
      nrow = nrow(beta), ncol = ncol(beta)
    )
  } else {
    beta <- beta * matrix(base::sample(beta_abs, size = nrow(beta) * ncol(beta), replace = TRUE),
      nrow = nrow(beta), ncol = ncol(beta)
    )
  }
  beta <- beta * matrix(base::sample(beta_sign, size = nrow(beta) * ncol(beta), replace = TRUE),
    nrow = nrow(beta), ncol = ncol(beta)
  )

  # Sampling outcome data
  ydata <- matrix(NA, ncol = q, nrow = nrow(xdata))
  if (family == "gaussian") {
    for (j in 1:q) {
      # Linear combination
      ypred <- xdata %*% beta[, j]

      # Calculating standard deviation that achieves desired proportion of explained variance
      sigma <- sqrt((1 - ev_xy[j]) / ev_xy[j] * stats::var(ypred)) # using var(ypred) is a shortcut (could use true variances but not always available)

      # Sampling from Normal distribution
      ydata[, j] <- stats::rnorm(n = n, mean = ypred, sd = sigma)
    }
  }
  if (family == "binomial") {
    for (j in 1:q) {
      # Linear combination
      crude_log_odds <- xdata %*% beta[, j]

      # Identifying a relevant range of scaling factors (logistic distribution)
      s_max <- max(abs(crude_log_odds)) / log(0.51 / 0.49) # scale that gives all probabilities between 0.49 and 0.51 (expected c stat close to 0.5)
      s_min <- min(abs(crude_log_odds)) / log(0.99 / 0.01) # scale that gives all probabilities above 0.99 or below 0.01 (expected c stat close to 1)

      # Finding scaling factor that gives desired AUC (interval required to ensure optimisation works)
      argmax_scaling_factor <- stats::optimise(
        f = TuneCStatisticLogit,
        crude_log_odds = crude_log_odds,
        auc = ev_xy[j],
        lower = 1 / s_max, upper = 1 / s_min
      )
      scaling_factor <- argmax_scaling_factor$minimum

      # Applying scaling factor
      beta[, j] <- beta[, j] * scaling_factor
      log_odds <- crude_log_odds * scaling_factor

      # Calculating probabilities from log-odds (inverse logit)
      proba <- 1 / (1 + exp(-log_odds))

      # Sampling from Bernouilli distribution
      ydata[, j] <- stats::rbinom(n = n, size = 1, prob = proba)
    }
  }

  # Defining row and column names of ydata
  rownames(ydata) <- rownames(xdata)
  colnames(ydata) <- paste0("outcome", 1:q)

  # Defining row and column names of beta and theta
  rownames(beta) <- rownames(theta) <- colnames(xdata)
  colnames(beta) <- colnames(theta) <- colnames(ydata)

  # Preparing the output
  out <- list(xdata = xdata, ydata = ydata, beta = beta, theta = theta)

  # Defining the class
  class(out) <- "simulation_regression"

  return(out)
}


#' Simulation of data with underlying clusters
#'
#' Simulates mixture multivariate Normal data with clusters of items (rows)
#' sharing similar profiles along (a subset of) attributes (columns).
#'
#' @param n vector of the number of items per cluster in the simulated data. The
#'   total number of items is \code{sum(n)}.
#' @param pk vector of the number of attributes in the simulated data.
#' @param sigma optional within-cluster correlation matrix.
#' @param theta_xc optional binary matrix encoding which attributes (columns)
#'   contribute to the clustering structure between which clusters (rows). If
#'   \code{theta_xc=NULL}, variables contributing to the clustering are sampled
#'   with probability \code{nu_xc}.
#' @param nu_xc expected proportion of variables contributing to the clustering
#'   over the total number of variables. Only used if \code{theta_xc} is not
#'   provided.
#' @param ev_xc vector of expected proportion of variance in each of the
#'   contributing attributes that can be explained by the clustering.
#' @param output_matrices logical indicating if the cluster and attribute
#'   specific means and cluster specific covariance matrix should be included
#'   in the output.
#'
#' @details The data is simulated from a Gaussian mixture where for all \eqn{i
#'   \in {1, \dots, n}}:
#'
#'   \eqn{Z_i i.i.d. ~ M(1, \kappa)}
#'
#'   \eqn{X_i | Z_i indep. ~ N_p(\mu_{Z_i}, \Sigma)}
#'
#'   where \eqn{M(1, \kappa)} is the multinomial distribution with parameters 1
#'   and \eqn{\kappa}, the vector of length \eqn{G} (the number of clusters)
#'   with probabilities of belonging to each of the clusters, and
#'   \eqn{N_p(\mu_{Z_i}, \Sigma)} is the multivariate Normal distribution with a
#'   mean vector \eqn{\mu_{Z_i}} that depends on the cluster membership encoded
#'   in \eqn{Z_i} and the same covariance matrix \eqn{\Sigma} within all \eqn{G}
#'   clusters.
#'
#'   The mean vectors \eqn{\mu_{g}, g \in {1, \dots G}} are simulated so that
#'   the desired proportion of variance in each of attributes explained by the
#'   clustering (argument \code{ev_xc}) is reached.
#'
#'   The covariance matrix \eqn{\Sigma} is obtained by re-scaling a correlation
#'   matrix (argument \code{sigma}) to ensure that the desired proportions of
#'   explained variances by the clustering (argument \code{ev_xc}) are reached.
#'
#' @seealso \code{\link{MakePositiveDefinite}}
#' @family simulation functions
#'
#' @return A list with: \item{data}{simulated data with \code{sum(n)}
#'   observation and \code{sum(pk)} variables} \item{theta}{simulated (true)
#'   cluster membership.} \item{theta_xc}{binary vector encoding variables
#'   contributing to the clustering structure.} \item{ev}{vector of marginal
#'   expected proportions of explained variance for each variable.}
#'   \item{mu_mixture}{simulated (true) cluster-specific means. Only returned if
#'   \code{output_matrices=TRUE}.} \item{sigma}{simulated (true) covariance
#'   matrix. Only returned if \code{output_matrices=TRUE}.}
#'
#' @examples
#' oldpar <- par(no.readonly = TRUE)
#' par(mar = rep(7, 4))
#'
#' ## Example with 3 clusters
#'
#' # Data simulation
#' set.seed(1)
#' simul <- SimulateClustering(
#'   n = c(10, 30, 15),
#'   nu_xc = 1,
#'   ev_xc = 0.5
#' )
#' print(simul)
#' plot(simul)
#'
#' # Checking the proportion of explained variance
#' x <- simul$data[, 1]
#' z <- as.factor(simul$theta)
#' summary(lm(x ~ z)) # R-squared
#'
#'
#' ## Example with 2 variables contributing to clustering
#'
#' # Data simulation
#' set.seed(1)
#' simul <- SimulateClustering(
#'   n = c(20, 10, 15), pk = 10,
#'   theta_xc = c(1, 1, rep(0, 8)),
#'   ev_xc = 0.8
#' )
#' print(simul)
#' plot(simul)
#'
#' # Visualisation of the data
#' Heatmap(
#'   mat = simul$data,
#'   col = c("navy", "white", "red")
#' )
#' simul$ev # marginal proportions of explained variance
#'
#' # Visualisation along contributing variables
#' plot(simul$data[, 1:2], col = simul$theta, pch = 19)
#'
#'
#' ## Example with different levels of separation
#'
#' # Data simulation
#' set.seed(1)
#' simul <- SimulateClustering(
#'   n = c(20, 10, 15), pk = 10,
#'   theta_xc = c(1, 1, rep(0, 8)),
#'   ev_xc = c(0.99, 0.5, rep(0, 8))
#' )
#'
#' # Visualisation along contributing variables
#' plot(simul$data[, 1:2], col = simul$theta, pch = 19)
#'
#'
#' ## Example with correlated contributors
#'
#' # Data simulation
#' pk <- 10
#' adjacency <- matrix(0, pk, pk)
#' adjacency[1, 2] <- adjacency[2, 1] <- 1
#' set.seed(1)
#' sigma <- SimulateCorrelation(
#'   pk = pk,
#'   theta = adjacency,
#'   pd_strategy = "min_eigenvalue",
#'   v_within = 0.6, v_sign = -1
#' )$sigma
#' simul <- SimulateClustering(
#'   n = c(200, 100, 150), pk = pk, sigma = sigma,
#'   theta_xc = c(1, 1, rep(0, 8)),
#'   ev_xc = c(0.9, 0.8, rep(0, 8))
#' )
#'
#' # Visualisation along contributing variables
#' plot(simul$data[, 1:2], col = simul$theta, pch = 19)
#'
#' # Checking marginal proportions of explained variance
#' mymodel <- lm(simul$data[, 1] ~ as.factor(simul$theta))
#' summary(mymodel)$r.squared
#' mymodel <- lm(simul$data[, 2] ~ as.factor(simul$theta))
#' summary(mymodel)$r.squared
#'
#' par(oldpar)
#'
#' @export
SimulateClustering <- function(n = c(10, 10), pk = 10, sigma = NULL,
                               theta_xc = NULL, nu_xc = 1, ev_xc = 0.5,
                               output_matrices = FALSE) {
  # Checking the inputs
  if (!is.null(theta_xc)) {
    if (inherits(theta_xc, "numeric")) {
      theta_xc <- rbind(theta_xc)
    }
    if (nrow(theta_xc) == 1) {
      tmp_theta_xc <- theta_xc
      theta_xc <- NULL
      for (i in 1:length(n)) {
        theta_xc <- rbind(theta_xc, tmp_theta_xc)
      }
    }
    if (is.null(pk)) {
      pk <- ncol(theta_xc)
    } else {
      if (sum(pk) != ncol(theta_xc)) {
        # warning("Arguments 'pk' and 'theta_xc' are not compatible. Argument 'pk' has been set to ncol('theta_xc').")
        pk <- ncol(theta_xc)
      }
    }
  }

  # Using identity matrix if correlation is not provided
  out <- list()
  if (is.null(sigma)) {
    sigma <- diag(sum(pk))
  }
  rownames(sigma) <- colnames(sigma) <- paste0("var", 1:ncol(sigma))

  # Simulating data from multivariate normal distribution
  x <- MASS::mvrnorm(sum(n), rep(0, sum(pk)), sigma)
  colnames(x) <- paste0("var", 1:ncol(x))
  rownames(x) <- paste0("obs", 1:nrow(x))
  out$data <- x

  # Defining number of clusters
  nc <- length(n)

  # Defining variables contributing to the clustering
  if (is.null(theta_xc)) {
    theta_xc <- matrix(
      rep(
        SamplePredictors(pk = sum(pk), q = 1, nu = nu_xc, orthogonal = TRUE)[, 1],
        length(n)
      ),
      nrow = length(n), byrow = TRUE
    )
  }
  rownames(theta_xc) <- paste0("cluster", 1:nrow(theta_xc))

  # Simulating marginal proportions of explained variance
  if (is.null(ev_xc)) {
    ev_xc <- stats::runif(n = sum(pk))
  } else {
    if (length(ev_xc) == 1) {
      ev_xc <- rep(ev_xc, sum(pk))
    }
  }

  # Definition of membership
  theta <- NULL
  for (i in 1:length(n)) {
    theta <- c(theta, rep(i, each = n[i]))
  }
  names(theta) <- rownames(out$data)
  out$theta <- theta

  # Simulating the cluster-specific means
  mu_mixture <- matrix(NA, nrow = length(unique(theta)), ncol = sum(pk))
  if (length(n) > 1) {
    mu_mat <- matrix(NA, nrow = sum(n), ncol = sum(pk))
    for (k in 1:ncol(mu_mat)) {
      # Sampling initial values for cluster-specific means
      mu <- stats::rnorm(n = nc, mean = 0, sd = 1)
      mu <- mu * theta_xc[, k]

      # Attributing cluster-specific means to observations
      for (i in 1:nrow(mu_mat)) {
        mu_mat[i, k] <- mu[theta[i]]
      }

      # # Defining variance to reach expected proportion of e.v.
      # var_mu <- ev_xc[k] * 1 / (1 - ev_xc[k])
      #
      # # Scaling to ensure mean of zero and defined variance
      # mu_mat[, k] <- scale(mu_mat[, k])
      # mu_mat[, k] <- mu_mat[, k] * sqrt(var_mu)
      # # Equivalent to: sqrt(var_mu)*(mu_mat[, k]-mean(mu_mat[,k]))/sqrt(1/(nrow(mu_mat)-1)*sum((mu_mat[, k]-mean(mu_mat[, k]))^2))
      # # Equivalent to: sqrt(var_mu)*(mu_mat[, k]-1/nrow(mu_mat)*sum(table(theta)*mu))/sqrt(1/(nrow(mu_mat)-1)*sum((mu_mat[, k]-mean(mu_mat[, k]))^2))
      # # Equivalent to: sqrt(var_mu)*(mu_mat[, k]-1/nrow(mu_mat)*sum(table(theta)*mu))/sqrt(1/(nrow(mu_mat)-1)*sum(table(theta)*(mu-1/nrow(mu_mat)*sum(table(theta)*mu))^2))
      #
      # # Storing cluster-specific means
      # mu_mixture[, k] <- mu_mat[!duplicated(theta), k]
      #
      # # Adding cluster-specific means
      # out$data[, k] <- out$data[, k] + mu_mat[, k]

      # Ensuring that the grouping structure is going to represent desired proportion of variance
      if (any(theta_xc[, k] != 0)) {
        mu_mat[, k] <- scale(mu_mat[, k]) * sqrt(ev_xc[k]) * sqrt(diag(sigma)[k])
        out$data[, k] <- out$data[, k] * sqrt(1 - ev_xc[k])
      }

      # Adding cluster-specific means
      out$data[, k] <- out$data[, k] + mu_mat[, k]

      # Storing cluster-specific means
      mu_mixture[, k] <- mu_mat[!duplicated(theta), k]
    }
    mu_mat[is.na(mu_mat)] <- 0
  }

  # Updating the within-cluster covariance matrix
  sigma <- sigma * sqrt(cbind(1 - ev_xc) %*% rbind(1 - ev_xc))

  # Definition of contributing variables
  colnames(theta_xc) <- colnames(out$data)
  out$theta_xc <- theta_xc
  out$ev <- ev_xc * ifelse(apply(theta_xc, 2, sum) != 0, yes = 1, no = 0)

  # Returning true cluster-specific means
  if (output_matrices) {
    out$mu_mixture <- mu_mixture
    out$sigma <- sigma
  }

  # Defining the class
  class(out) <- "simulation_clustering"

  return(out)
}


#' Data simulation for Structural Causal Modelling
#'
#' Simulates data from a multivariate Normal distribution where relationships
#' between the variables correspond to a Structural Causal Model (SCM). To
#' ensure that the generated SCM is identifiable, the nodes are organised by
#' layers, with no causal effects within layers.
#'
#' @inheritParams SimulateGraphical
#' @param pk vector of the number of (latent) variables per layer.
#' @param theta optional binary adjacency matrix of the Directed Acyclic Graph
#'   (DAG) of causal relationships. This DAG must have a structure with layers
#'   so that a variable can only be a parent of variable in one of the following
#'   layers (see \code{\link{LayeredDAG}} for examples). The layers must be
#'   provided in \code{pk}.
#' @param n_manifest vector of the number of manifest (observed) variables
#'   measuring each of the latent variables. If \code{n_manifest=NULL}, there
#'   are \code{sum(pk)} manifest variables and no latent variables. Otherwise,
#'   there are \code{sum(pk)} latent variables and \code{sum(n_manifest)}
#'   manifest variables. All entries of \code{n_manifest} must be strictly
#'   positive.
#' @param nu_between probability of having an edge between two nodes belonging
#'   to different layers, as defined in \code{pk}. If \code{length(pk)=1}, this
#'   is the expected density of the graph.
#' @param v_between vector defining the (range of) nonzero path coefficients. If
#'   \code{continuous=FALSE}, \code{v_between} is the set of possible values. If
#'   \code{continuous=TRUE}, \code{v_between} is the range of possible values.
#' @param v_sign vector of possible signs for path coefficients. Possible inputs
#'   are: \code{1} for positive coefficients, \code{-1} for negative
#'   coefficients, or \code{c(-1, 1)} for both positive and negative
#'   coefficients.
#' @param continuous logical indicating whether to sample path coefficients from
#'   a uniform distribution between the minimum and maximum values in
#'   \code{v_between} (if \code{continuous=FALSE}) or from proposed values in
#'   \code{v_between} (if \code{continuous=FALSE}).
#' @param ev vector of proportions of variance in each of the (latent) variables
#'   that can be explained by its parents. If there are no latent variables (if
#'   \code{n_manifest=NULL}), these are the proportions of explained variances
#'   in the manifest variables. Otherwise (if \code{n_manifest} is provided),
#'   these are the proportions of explained variances in the latent variables.
#' @param ev_manifest vector of proportions of variance in each of the manifest
#'   variable that can be explained by its latent parent. Only used if
#'   \code{n_manifest} is provided.
#' @param output_matrices logical indicating if the true path coefficients,
#'   residual variances, and precision and (partial) correlation matrices should
#'   be included in the output.
#'
#' @seealso \code{\link{SimulatePrecision}}, \code{\link{MakePositiveDefinite}},
#'   \code{\link{Contrast}}
#'
#' @family simulation functions
#'
#' @references \insertRef{RegSEM}{fake}
#'
#' @return A list with: \item{data}{simulated data with \code{n} observations
#'   for manifest variables.} \item{theta}{adjacency matrix of the simulated
#'   Directed Acyclic Graph encoding causal relationships.}
#'   \item{Amat}{simulated (true) asymmetric matrix A in RAM notation. Only
#'   returned if \code{output_matrices=TRUE}.} \item{Smat}{simulated (true)
#'   symmetric matrix S in RAM notation. Only returned if
#'   \code{output_matrices=TRUE}.} \item{Fmat}{simulated (true) filter matrix F
#'   in RAM notation. Only returned if \code{output_matrices=TRUE}.}
#'   \item{sigma}{simulated (true) covariance matrix. Only returned if
#'   \code{output_matrices=TRUE}.}
#'
#' @examples
#' \donttest{
#' # Simulation of a layered SCM
#' set.seed(1)
#' pk <- c(3, 5, 4)
#' simul <- SimulateStructural(n = 100, pk = pk)
#' print(simul)
#' summary(simul)
#' plot(simul)
#'
#' # Choosing the proportions of explained variances for endogenous variables
#' set.seed(1)
#' simul <- SimulateStructural(
#'   n = 1000,
#'   pk = c(2, 3),
#'   nu_between = 1,
#'   ev = c(NA, NA, 0.5, 0.7, 0.9),
#'   output_matrices = TRUE
#' )
#'
#' # Checking expected proportions of explained variances
#' 1 - simul$Smat["x3", "x3"] / simul$sigma["x3", "x3"]
#' 1 - simul$Smat["x4", "x4"] / simul$sigma["x4", "x4"]
#' 1 - simul$Smat["x5", "x5"] / simul$sigma["x5", "x5"]
#'
#' # Checking observed proportions of explained variances (R-squared)
#' summary(lm(simul$data[, 3] ~ simul$data[, which(simul$theta[, 3] != 0)]))
#' summary(lm(simul$data[, 4] ~ simul$data[, which(simul$theta[, 4] != 0)]))
#' summary(lm(simul$data[, 5] ~ simul$data[, which(simul$theta[, 5] != 0)]))
#'
#' # Simulation including latent and manifest variables
#' set.seed(1)
#' simul <- SimulateStructural(
#'   n = 100,
#'   pk = c(2, 3),
#'   n_manifest = c(2, 3, 2, 1, 2)
#' )
#' plot(simul)
#'
#' # Showing manifest variables in red
#' if (requireNamespace("igraph", quietly = TRUE)) {
#'   mygraph <- plot(simul)
#'   ids <- which(igraph::V(mygraph)$name %in% colnames(simul$data))
#'   igraph::V(mygraph)$color[ids] <- "red"
#'   igraph::V(mygraph)$frame.color[ids] <- "red"
#'   plot(mygraph)
#' }
#'
#' # Choosing proportions of explained variances for latent and manifest variables
#' set.seed(1)
#' simul <- SimulateStructural(
#'   n = 100,
#'   pk = c(3, 2),
#'   n_manifest = c(2, 3, 2, 1, 2),
#'   ev = c(NA, NA, NA, 0.7, 0.9),
#'   ev_manifest = 0.8,
#'   output_matrices = TRUE
#' )
#' plot(simul)
#'
#' # Checking expected proportions of explained variances
#' (simul$sigma_full["f4", "f4"] - simul$Smat["f4", "f4"]) / simul$sigma_full["f4", "f4"]
#' (simul$sigma_full["f5", "f5"] - simul$Smat["f5", "f5"]) / simul$sigma_full["f5", "f5"]
#' (simul$sigma_full["x1", "x1"] - simul$Smat["x1", "x1"]) / simul$sigma_full["x1", "x1"]
#' }
#' @export
SimulateStructural <- function(n = 100,
                               pk = c(5, 5, 5),
                               theta = NULL,
                               n_manifest = NULL,
                               nu_between = 0.5,
                               v_between = c(0.5, 1),
                               v_sign = c(-1, 1),
                               continuous = TRUE,
                               ev = 0.5,
                               ev_manifest = 0.8,
                               output_matrices = FALSE) {
  # Storing the layers
  layers <- pk

  # Simulation of layered directed acyclic graph between latent variables
  if (is.null(theta)) {
    theta <- SimulateAdjacency(
      pk = pk,
      topology = "random",
      nu_within = 0,
      nu_between = nu_between
    )
    theta[lower.tri(theta)] <- 0
    rownames(theta) <- colnames(theta) <- NULL
  } else {
    if (ncol(theta) != sum(pk)) {
      pk <- ncol(theta)
      # stop("Arguments 'theta' and 'pk' are not compatible. Please make sure that 'theta' is the adjacency matrix of a DAG with layers and that 'pk' encodes this layer structure.")
    }
  }

  # Addition of manifest variables for each latent variable
  if (!is.null(n_manifest)) {
    p_latent <- ncol(theta)

    # Expanding the vector if needed
    if (length(n_manifest) != ncol(theta)) {
      n_manifest <- rep(n_manifest[1], ncol(theta))
    }

    # Storing original variable names
    if (!is.null(colnames(theta))) {
      var_names <- colnames(theta)
    }
    if (is.null(colnames(theta)) | (any(paste0("x", 1:sum(n_manifest)) %in% colnames(theta)))) {
      var_names <- paste0("f", 1:ncol(theta))
    }

    # Adding manifest variables in adjacency matrix
    tmpfactor <- as.factor(rep.int(1:length(n_manifest), times = n_manifest))
    submatrix_manifest <- t(stats::model.matrix(~ tmpfactor - 1))
    theta <- cbind(submatrix_manifest, theta)
    theta <- rbind(matrix(0, ncol = ncol(theta), nrow = ncol(theta) - nrow(theta)), theta)
    rownames(theta) <- colnames(theta) <- c(paste0("x", 1:ncol(submatrix_manifest)), var_names)
  } else {
    p_latent <- 0
  }

  # Checking the length of proportions of explained variances
  if (length(ev) != sum(pk)) {
    ev <- rep(ev[1], sum(pk))
  }
  if (!is.null(n_manifest)) {
    if (length(ev_manifest) != sum(n_manifest)) {
      ev_manifest <- rep(ev_manifest[1], sum(n_manifest))
    }
    ev <- c(ev_manifest, ev)
  }

  # Setting p as the total number of variables (latent and manifest)
  p <- pk <- ncol(theta)

  # Defining row and column names
  if (!is.null(n_manifest)) {
    ids_manifest <- seq(1, sum(n_manifest))
    ids_latent <- seq(sum(n_manifest) + 1, ncol(theta))
    if (is.null(colnames(theta))) {
      rownames(theta)[ids_manifest] <- colnames(theta)[ids_manifest] <- paste0("x", seq(1, length(ids_manifest)))
      rownames(theta)[ids_latent] <- colnames(theta)[ids_latent] <- paste0("f", seq(1, length(ids_latent)))
    }
  } else {
    ids_manifest <- seq(1, ncol(theta))
    if (is.null(colnames(theta))) {
      colnames(theta) <- rownames(theta) <- paste0("x", 1:sum(pk))
    }
  }

  # Simulating path coefficients (no need to be p.d.)
  set.seed(1)
  random_mat <- SimulateSymmetricMatrix(
    pk = p,
    v_within = v_between,
    v_between = v_between,
    v_sign = v_sign,
    continuous = continuous
  )
  Amat <- t(random_mat * abs(theta))

  # Defining identity matrix
  Imat <- diag(p)
  rownames(Imat) <- colnames(Imat) <- colnames(Amat)

  # Initialising residual covariance matrix S
  Smat <- diag(p)
  rownames(Smat) <- colnames(Smat) <- colnames(Amat)

  # Defining residual variances to reach desired proportions of explained variances
  if (is.null(n_manifest)) {
    ids_linked <- ids_manifest
  } else {
    ids_linked <- ids_latent
  }
  for (j in ids_linked) {
    tmpsigma <- solve(Imat - Amat) %*% Smat %*% solve(t(Imat - Amat))
    if (sum(theta[, j] > 0)) {
      tmppreds <- which(theta[, j] == 1)
      tmpsigma <- tmpsigma[tmppreds, tmppreds]
      tmpprod <- t(Amat[j, tmppreds, drop = FALSE]) %*% Amat[j, tmppreds, drop = FALSE]
      var_yhat <- sum(diag(tmpprod) * diag(tmpsigma)) + 2 * sum(tmpprod[upper.tri(tmpprod)] * tmpsigma[upper.tri(tmpsigma)])
      Smat[j, j] <- var_yhat * (1 - ev[j]) / ev[j]
    }
  }

  # Defining residual variances of manifest variables (in the presence of latent variables)
  if (!is.null(n_manifest)) {
    # Computing the covariance matrix for latent variables only
    sigma_latent <- solve(Imat[ids_latent, ids_latent] - Amat[ids_latent, ids_latent]) %*% Smat[ids_latent, ids_latent] %*% solve(t(Imat[ids_latent, ids_latent] - Amat[ids_latent, ids_latent]))

    # Defining residual variances of manifest variables
    diag(Smat)[ids_manifest] <- (apply(Amat[ids_manifest, ], 1, sum)^2) * rep.int(diag(sigma_latent), times = n_manifest) * (1 - ev_manifest) / ev_manifest
  }

  # Defining filter matrix
  Fmat <- Imat[seq(1, p - p_latent), ]

  # Computing corresponding covariance matrix (p.d. by definition)
  sigma_full <- solve(Imat - Amat) %*% Smat %*% solve(t(Imat - Amat))

  # Computing corresponding covariance matrix (p.d. by definition)
  sigma <- Fmat %*% sigma_full %*% t(Fmat)

  # Simulating data from multivariate normal distribution
  x <- MASS::mvrnorm(n, rep(0, ncol(sigma)), sigma)
  colnames(x) <- colnames(theta)[ids_manifest]
  rownames(x) <- paste0("obs", 1:nrow(x))

  # Assigning names to vector of proportions of explained variances
  names(ev) <- colnames(theta)

  # Defining the class of theta
  class(theta) <- c("matrix", "adjacency_matrix")

  # Preparing the output
  if (output_matrices) {
    out <- list(
      data = x, theta = theta, ev = ev,
      pk = layers, n_manifest = n_manifest,
      Amat = Amat, Smat = Smat, Fmat = Fmat,
      sigma = sigma
    )
    if (!is.null(n_manifest)) {
      out <- c(out, list(sigma_full = sigma_full))
    }
  } else {
    out <- list(
      data = x, theta = theta, ev = ev,
      pk = layers, n_manifest = n_manifest
    )
  }

  # Defining the class
  class(out) <- "simulation_structural_causal_model"

  return(out)
}


#' Simulation of undirected graph with block structure
#'
#' Simulates the adjacency matrix of an unweighted, undirected graph with no
#' self-loops. If \code{topology="random"}, different densities in diagonal
#' (\code{nu_within}) compared to off-diagonal (\code{nu_between}) blocks can be
#' used.
#'
#' @inheritParams SimulateGraphical
#'
#' @return A symmetric adjacency matrix encoding an unweighted, undirected graph
#'   with no self-loops, and with different densities in diagonal compared to
#'   off-diagonal blocks.
#'
#' @details Random graphs are simulated using the Erdos-Renyi algorithm.
#'   Scale-free graphs are simulated using a preferential attachment algorithm.
#'   More details are provided in \code{\link[huge]{huge.generator}}.
#'
#' @family simulation functions
#'
#' @references \insertRef{ourstabilityselection}{fake}
#'
#' \insertRef{huge}{fake}
#'
#' @examples
#' # Simulation of a scale-free graph with 20 nodes
#' adjacency <- SimulateAdjacency(pk = 20, topology = "scale-free")
#' plot(adjacency)
#'
#' # Simulation of a random graph with three connected components
#' adjacency <- SimulateAdjacency(
#'   pk = rep(10, 3),
#'   nu_within = 0.7, nu_between = 0
#' )
#' plot(adjacency)
#'
#' # Simulation of a random graph with block structure
#' adjacency <- SimulateAdjacency(
#'   pk = rep(10, 3),
#'   nu_within = 0.7, nu_between = 0.03
#' )
#' plot(adjacency)
#'
#' # User-defined function for graph simulation
#' CentralNode <- function(pk, hub = 1) {
#'   theta <- matrix(0, nrow = sum(pk), ncol = sum(pk))
#'   theta[hub, ] <- 1
#'   theta[, hub] <- 1
#'   diag(theta) <- 0
#'   return(theta)
#' }
#' simul <- SimulateAdjacency(pk = 10, implementation = CentralNode)
#' plot(simul) # star
#' simul <- SimulateAdjacency(pk = 10, implementation = CentralNode, hub = 2)
#' plot(simul) # variable 2 is the central node
#' @export
SimulateAdjacency <- function(pk = 10,
                              implementation = HugeAdjacency,
                              topology = "random",
                              nu_within = 0.1,
                              nu_between = 0,
                              nu_mat = NULL,
                              ...) {
  # Storing all arguments
  args <- c(mget(ls()), list(...))

  # Checking the inputs
  if (topology != "random") {
    if (length(pk) > 1) {
      pk <- sum(pk)
      warning(paste0("Multi-block simulations are only allowed with topology='random'. Argument 'pk' has been set to ", pk, "."))
    }
  }

  # Creating the matrix of probabilities
  if (is.null(nu_mat)) {
    nu_mat <- diag(length(pk)) * nu_within
    nu_mat[upper.tri(nu_mat)] <- nu_between
    nu_mat[lower.tri(nu_mat)] <- nu_between
  } else {
    if ((ncol(nu_mat) != length(pk)) & (nrow(nu_mat) != length(pk))) {
      stop("Arguments 'pk' and 'nu_mat' are not compatible. They correspond to different numbers of communities. The number of rows and columns in 'nu_mat' must be equal to the length of the vector 'pk'.")
    }
  }

  # Creating matrix with block indices
  bigblocks <- BlockMatrix(pk)
  bigblocks_vect <- bigblocks[upper.tri(bigblocks)]

  # Making as factor to allow for groups with 1 variable (for clustering)
  bigblocks_vect <- factor(bigblocks_vect, levels = seq(1, max(bigblocks)))
  block_ids <- unique(as.vector(bigblocks))

  # Creating matrix with block structure
  blocks <- BlockStructure(pk)

  # Identifying relevant arguments
  if (!"..." %in% names(formals(implementation))) {
    ids <- which(names(args) %in% names(formals(implementation)))
    args <- args[ids]
  }

  # Simulation of the adjacency matrix
  if ("nu" %in% names(formals(implementation))) {
    if (length(pk) > 1) {
      # Initialising theta
      theta <- matrix(0, nrow = sum(pk), ncol = sum(pk))
      theta_vect <- theta[upper.tri(theta)]

      # # Allowing for different densities in within and between blocks
      # theta_w <- do.call(implementation, args = c(args, list(nu = nu_within)))
      # theta_w_vect <- theta_w[upper.tri(theta_w)]
      # theta_b <- do.call(implementation, args = c(args, list(nu = nu_between)))
      # theta_b_vect <- theta_b[upper.tri(theta_b)]

      # Filling within and between blocks
      for (k in block_ids) {
        tmpids <- which(blocks == k, arr.ind = TRUE)
        i <- tmpids[1]
        j <- tmpids[2]
        theta_w <- do.call(implementation, args = c(args, list(nu = nu_mat[i, j])))
        theta_w_vect <- theta_w[upper.tri(theta_w)]
        theta_vect[bigblocks_vect == k] <- theta_w_vect[bigblocks_vect == k]
        # if (k %in% unique(diag(bigblocks))) {
        #   theta_vect[bigblocks_vect == k] <- theta_w_vect[bigblocks_vect == k]
        # } else {
        #   theta_vect[bigblocks_vect == k] <- theta_b_vect[bigblocks_vect == k]
        # }
      }
      theta[upper.tri(theta)] <- theta_vect
      theta <- theta + t(theta)
    } else {
      theta <- do.call(implementation, args = c(args, list(nu = nu_within)))
    }
  } else {
    theta <- do.call(implementation, args = c(args))
  }

  # Ensuring the adjacency matrix is symmetric (undirected graph) with no self-loops
  theta <- ifelse(theta + t(theta) != 0, yes = 1, no = 0)
  diag(theta) <- 0

  # Setting variable names
  colnames(theta) <- rownames(theta) <- paste0("var", 1:ncol(theta))

  # Defining the class
  class(theta) <- c("matrix", "adjacency_matrix")

  return(theta)
}


#' Simulation of undirected graph
#'
#' Simulates the adjacency matrix encoding an unweighted, undirected graph with
#' no self-loops.
#'
#' @inheritParams SimulateGraphical
#' @param pk number of nodes.
#' @param nu expected density (number of edges over the number of node pairs) of
#'   the graph. This argument is only used for \code{topology="random"} or
#'   \code{topology="cluster"} (see argument \code{prob} in
#'   \code{\link[huge]{huge.generator}}).
#' @param ... additional arguments to be passed to
#'   \code{\link[huge]{huge.generator}}.
#'
#' @return A symmetric adjacency matrix encoding an unweighted, undirected graph
#'   with no self-loops.
#'
#' @keywords internal
HugeAdjacency <- function(pk = 10, topology = "random", nu = 0.1, ...) {
  # Storing extra arguments
  extra_args <- list(...)

  # Extracting relevant extra arguments
  tmp_extra_args <- MatchingArguments(extra_args = extra_args, FUN = huge::huge.generator)
  tmp_extra_args <- tmp_extra_args[!names(tmp_extra_args) %in% c("n", "d", "prob", "graph", "verbose")]

  # Running simulation model
  mymodel <- do.call(huge::huge.generator, args = c(
    list(
      n = 2, d = sum(pk), prob = nu,
      graph = topology, verbose = FALSE
    ),
    tmp_extra_args
  ))
  theta <- as.matrix(mymodel$theta)

  # Re-organising the variables to avoid having centrality related to variable ID (e.g. for scale-free models)
  ids <- sample(ncol(theta))
  theta <- theta[ids, ids]

  return(theta)
}




#' Simulation of precision matrix
#'
#' Simulates a sparse precision matrix from a binary adjacency matrix
#' \code{theta} encoding conditional independence in a Gaussian Graphical Model.
#'
#' @inheritParams SimulateGraphical
#' @param theta binary and symmetric adjacency matrix encoding the conditional
#'   independence structure.
#' @param scale logical indicating if the proportion of explained variance by
#'   PC1 should be computed from the correlation (\code{scale=TRUE}) or
#'   covariance (\code{scale=FALSE}) matrix.
#'
#' @return A list with: \item{omega}{true simulated precision matrix.}
#'   \item{u}{value of the constant u used to ensure that \code{omega} is
#'   positive definite.}
#'
#' @seealso \code{\link{SimulateGraphical}}, \code{\link{MakePositiveDefinite}}
#'
#' @details Entries that are equal to zero in the adjacency matrix \code{theta}
#'   are also equal to zero in the generated precision matrix. These zero
#'   entries indicate conditional independence between the corresponding pair of
#'   variables (see \code{\link{SimulateGraphical}}).
#'
#'   Argument \code{pk} can be specified to create groups of variables and allow
#'   for nonzero precision entries to be sampled from different distributions
#'   between two variables belonging to the same group or to different groups.
#'
#'   If \code{continuous=FALSE}, nonzero off-diagonal entries of the precision
#'   matrix are sampled from a discrete uniform distribution taking values in
#'   \code{v_within} (for entries in the diagonal block) or \code{v_between}
#'   (for entries in off-diagonal blocks). If \code{continuous=TRUE}, nonzero
#'   off-diagonal entries are sampled from a continuous uniform distribution
#'   taking values in the range given by \code{v_within} or \code{v_between}.
#'
#'   Diagonal entries of the precision matrix are defined to ensure positive
#'   definiteness as described in \code{\link{MakePositiveDefinite}}.
#'
#' @references \insertRef{ourstabilityselection}{fake}
#'
#' @examples
#' # Simulation of an adjacency matrix
#' theta <- SimulateAdjacency(pk = c(5, 5), nu_within = 0.7)
#' print(theta)
#'
#' # Simulation of a precision matrix maximising the contrast
#' simul <- SimulatePrecision(theta = theta)
#' print(simul$omega)
#'
#' # Simulation of a precision matrix with specific ev by PC1
#' simul <- SimulatePrecision(
#'   theta = theta,
#'   pd_strategy = "min_eigenvalue",
#'   ev_xx = 0.3, scale = TRUE
#' )
#' print(simul$omega)
#' @export
SimulatePrecision <- function(pk = NULL, theta,
                              v_within = c(0.5, 1), v_between = c(0, 0.1),
                              v_sign = c(-1, 1), continuous = TRUE,
                              pd_strategy = "diagonally_dominant", ev_xx = NULL, scale = TRUE,
                              u_list = c(1e-10, 1), tol = .Machine$double.eps^0.25) {
  # Checking inputs and defining pk
  if (is.null(pk)) {
    pk <- ncol(theta)
  } else {
    if (sum(pk) != ncol(theta)) {
      stop("Arguments 'pk' and 'theta' are not consistent. The sum of 'pk' entries must be equal to the number of rows and columns in 'theta'.")
    }
  }

  # Checking the choice of pd_strategy
  if (!pd_strategy %in% c("diagonally_dominant", "min_eigenvalue")) {
    stop("Invalid input for argument 'pd_strategy'. Possible values are: 'diagonally_dominant' or 'min_eigenvalue'.")
  }

  # Checking other input values
  if (any((v_within < 0) | (v_within > 1))) {
    stop("Invalid input for argument 'v_within'. Values must be between 0 and 1.")
  }
  if (any((v_between < 0) | (v_between > 1))) {
    stop("Invalid input for argument 'v_between'. Values must be between 0 and 1.")
  }
  if (any(!v_sign %in% c(-1, 1))) {
    stop("Invalid input for argument 'v_sign'. Possible values are -1 and 1.")
  }

  # Ensuring that v values are lower than or equal to 1
  if (any(abs(v_within) > 1)) {
    v_within <- v_within / max(abs(v_within))
    message("The values provided in 'v_within' have been re-scaled to be lower than or equal to 1 in absolute value.")
  }

  # Ensuring that diagonal entries of theta are zero
  diag(theta) <- 0

  # Building v matrix
  v <- SimulateSymmetricMatrix(
    pk = pk, v_within = v_within, v_between = v_between,
    v_sign = v_sign, continuous = continuous
  )

  # Filling off-diagonal entries of the precision matrix
  omega_tilde <- theta * v

  # Ensuring positive definiteness
  omega_pd <- MakePositiveDefinite(
    omega = omega_tilde, pd_strategy = pd_strategy,
    ev_xx = ev_xx, scale = scale, u_list = u_list, tol = tol
  )

  # Returning the output
  return(omega_pd)
}
Any scripts or data that you put into this service are public.
fake documentation built on April 14, 2023, 12:37 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
fake
Flexible Data Simulation Using the Multivariate Normal Distribution

R/simulations.R
In fake: Flexible Data Simulation Using the Multivariate Normal Distribution

Defines functions SimulatePrecision HugeAdjacency SimulateAdjacency SimulateStructural SimulateClustering SimulateRegression SimulateComponents SimulateCorrelation SimulateGraphical

Documented in HugeAdjacency SimulateAdjacency SimulateClustering SimulateComponents SimulateCorrelation SimulateGraphical SimulatePrecision SimulateRegression SimulateStructural

Try the fake package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

fake Flexible Data Simulation Using the Multivariate Normal Distribution

R/simulations.R In fake: Flexible Data Simulation Using the Multivariate Normal Distribution

Defines functions SimulatePrecision HugeAdjacency SimulateAdjacency SimulateStructural SimulateClustering SimulateRegression SimulateComponents SimulateCorrelation SimulateGraphical

Documented in HugeAdjacency SimulateAdjacency SimulateClustering SimulateComponents SimulateCorrelation SimulateGraphical SimulatePrecision SimulateRegression SimulateStructural

Try the fake package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

fake
Flexible Data Simulation Using the Multivariate Normal Distribution

R/simulations.R
In fake: Flexible Data Simulation Using the Multivariate Normal Distribution