multisplit: Multi-sample splitting
In hierinf: Hierarchical Inference


View source: R/multisplit.R

The data are randomly split into two halves with respect to the observations, and variable selection using the Lasso is performed on one half. The second half and the selected variables are later used for testing by the function test_only_hierarchy. This is repeated multiple times.

multisplit(x, y, clvar = NULL, B = 50, proportion.select = 1/6,
  standardize = FALSE, family = c("gaussian", "binomial"),
  parallel = c("no", "multicore", "snow"), ncpus = 1L, cl = NULL,
  check.input = TRUE)

`x`	a matrix or list of matrices for multiple data sets. The matrix or matrices have to be of type numeric and are required to have column names / variable names. The rows and the columns represent the observations and the variables, respectively.
`y`	a vector, a matrix with one column, or list of the aforementioned objects for multiple data sets. The vector, vectors, matrix, or matrices have to be of type numeric. For `family = "binomial"`, the response is required to be a binary vector taking values 0 and 1.
`clvar`	a matrix or list of matrices of control variables.
`B`	number of sample splits.
`proportion.select`	number of variables to be selected by the Lasso in the multi-sample splitting step, expressed as a proportion of the number of observations (see the 'Details' section).
`standardize`	a logical value indicating whether the variables should be standardized.
`family`	a character string naming a family of the error distribution; either `"gaussian"` or `"binomial"`.
`parallel`	type of parallel computation to be used. See the 'Details' section.
`ncpus`	number of processes to be run in parallel.
`cl`	an optional parallel or snow cluster used if `parallel = "snow"`. If not supplied, a cluster on the local machine is created.
`check.input`	a logical value indicating whether the function should check the input. This argument is used to call `multisplit` within `test_hierarchy`.

A given data set with nobs observations is randomly split into two halves with respect to the observations, and nobs * proportion.select variables are selected using the Lasso (implemented in glmnet) on one half. Control variables are not penalized if they are supplied using the argument clvar. This is repeated B times for each data set if multiple data sets are supplied. Those splits (i.e. the second halves of the observations) and the corresponding selected variables are used to perform hierarchical testing by the function test_only_hierarchy.
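To make the structure of one repetition concrete, here is a minimal base R sketch of a single sample split. It is not the package's implementation: the selection step uses a simple marginal-correlation ranking as a stand-in for the Lasso, the rounding of nobs * proportion.select with floor() is an assumption, and the names first.half, second.half, n.select, score and selected are illustrative only.

```r
# One sample split, base R only (multisplit repeats this B times).
set.seed(1)
nobs <- 200
p <- 500
proportion.select <- 1/6

# Simulated data with column names, as multisplit requires.
x <- matrix(rnorm(nobs * p), nrow = nobs,
            dimnames = list(NULL, paste0("Var", 1:p)))
y <- x[, 5] + x[, 20] + x[, 46] + rnorm(nobs)

# Randomly split the observations into two halves.
first.half <- sort(sample.int(nobs, size = floor(nobs / 2)))
second.half <- setdiff(seq_len(nobs), first.half)

# Target number of selected variables (the rounding is an assumption).
n.select <- floor(nobs * proportion.select)

# Stand-in selector: rank variables by absolute marginal correlation
# with y on the first half. multisplit itself uses the Lasso via glmnet.
score <- abs(cor(x[first.half, ], y[first.half]))[, 1]
selected <- names(sort(score, decreasing = TRUE))[seq_len(n.select)]

# What is kept for the later testing step: the held-out observations
# and the names of the selected variables.
split.result <- list(second.half = second.half, selected = selected)
```

The two pieces stored at the end correspond to what the testing step needs: the indices of the observations that were not used for selection, and the names of the variables chosen on the other half.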

The multi-sample split step can be run in parallel across the B sample splits by specifying the arguments parallel and ncpus. There are three possible settings for the argument parallel: parallel = "no" for serial evaluation (default), parallel = "multicore" for parallel evaluation using forking, and parallel = "snow" for parallel evaluation using a parallel socket cluster. The optional argument cl is only used if parallel = "snow". To ensure that the parallel computations of the package hierinf are reproducible, it is recommended to select RNGkind("L'Ecuyer-CMRG") and to set a seed. This way each processor gets a different substream of the pseudo-random number generator stream, which makes the results reproducible as long as the arguments (such as sort.parallel and ncpus) remain unchanged. See the vignette or the reference for more details.
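The role of the L'Ecuyer-CMRG substreams can be illustrated with base R's parallel package alone. The sketch below is not hierinf's internal code: draw_split and run_splits are hypothetical helpers, and each "split" only draws the indices of one half of the observations. It shows that seeding the worker streams gives identical results across runs when the seed and the number of workers stay the same.

```r
library(parallel)

# Hypothetical stand-in for one sample split: draw the indices of one half.
draw_split <- function(b, nobs) {
  sort(sample.int(nobs, size = floor(nobs / 2)))
}

# Run B splits on a socket cluster with reproducible random number streams.
run_splits <- function(seed, B = 4, nobs = 20, ncpus = 2) {
  cl <- makeCluster(ncpus)               # parallel socket cluster
  on.exit(stopCluster(cl))               # always shut the workers down
  # Give each worker its own L'Ecuyer-CMRG substream derived from the seed.
  clusterSetRNGStream(cl, iseed = seed)
  parLapply(cl, seq_len(B), draw_split, nobs = nobs)
}

res1 <- run_splits(seed = 84)
res2 <- run_splits(seed = 84)

# Compare the two runs: same seed and same number of workers.
same.result <- identical(res1, res2)
```

With hierinf itself, the analogous pattern is to call RNGkind("L'Ecuyer-CMRG") and set.seed() before calling multisplit with parallel = "multicore" or parallel = "snow" and the desired ncpus.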

The returned value is an object of class "hierM", consisting of a list whose number of elements corresponds to the number of data sets. Each element (corresponding to a data set) contains two matrices. The first matrix contains the indices of the second half of the observations (which were not used to select the variables). The second matrix contains the column names / variable names of the selected variables.

Renaux, C. et al. (2018), Hierarchical inference for genome-wide association studies: a view on methodology with software. (arXiv:1805.02988)

Meinshausen, N., Meier, L. and Buhlmann, P. (2009), P-values for high-dimensional regression, Journal of the American Statistical Association 104, 1671-1681.

cluster_var, cluster_position, test_only_hierarchy, test_hierarchy, and compute_r2.

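library(MASS)

# Simulate n observations of p independent standard normal variables
n <- 200
p <- 500
set.seed(3)
x <- mvrnorm(n, mu = rep(0, p), Sigma = diag(p))
colnames(x) <- paste0("Var", 1:p)  # column names are required

# Only variables 5, 20 and 46 have a nonzero effect on the response
beta <- rep(0, p)
beta[c(5, 20, 46)] <- 1
y <- x %*% beta + rnorm(n)

# Multi-sample splitting with the default B = 50 sample splits
set.seed(84)
res.multisplit <- multisplit(x = x, y = y, family = "gaussian")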