run_hierarchy: Hierarchical Testing
In hierbase: Enabling Hierarchical Multiple Testing

Description Usage Arguments Details Value References See Also Examples

View source: R/run-hierarchy.R

Hierarchical testing for a given user-specified test function.

run_hierarchy(
  x,
  y,
  dendr,
  test.func,
  clvar = NULL,
  arg.all = list(NULL),
  arg.all.fix = NULL,
  compMOD.same = function(...) NULL,
  compMOD.changing = function(...) NULL,
  alpha = 0.05,
  global.test = TRUE,
  hier.adj = FALSE,
  mt.adj = c("dpBF", "none"),
  agg.method = c("Tippett", "Stouffer"),
  verbose = FALSE,
  sort.parallel = TRUE,
  parallel = c("no", "multicore", "snow"),
  ncpus = 1L,
  cl = NULL
)

`x`	a matrix or list of matrices for multiple data sets. The matrix or matrices have to be of type numeric and are required to have column names / variable names. The rows and the columns represent the observations and the variables, respectively.
`y`	a vector, a matrix with one column, or list of the aforementioned objects for multiple data sets. The vector, vectors, matrix, or matrices have to be of type numeric.
`dendr`	the output of one of the functions `cluster_vars` or `cluster_positions`.
`test.func`	a test function for groups of variables. See the 'Details' and 'Examples' sections.
`clvar`	a matrix or list of matrices of control variables.
`arg.all`	a list with arguments which is passed on to each call of the functions inputted to the arguments `test.func`, `compMOD.same`, and `compMOD.changing`. See the 'Details' section.
`arg.all.fix`	a vector or list with arguments which is passed on to each call of the functions inputted to the arguments `test.func`, `compMOD.same`, and `compMOD.changing`. See the 'Details' section.
`compMOD.same`	a function which is called once at the beginning of the hierarchical testing procedure and its output is passed on to each call of `test.func`. See the 'Details' section.
`compMOD.changing`	a function which is called once at the beginning of the hierarchical testing procedure and its purpose is to serve as a cache or memory if some output does not have to be recalculated for multiple group tests but the cache can be updated while testing. See the 'Details' section.
`alpha`	the significant level at which the FWER is controlled.
`global.test`	a logical value indicating whether the global test should be performed.
`hier.adj`	a logical value indicating whether the p-values can only increase when going down some given branch. Strong FWER control holds as well if the argument is set to FALSE which is the default option.
`mt.adj`	type of multiple testing correction to be used; either `"dpBF"` (depth-wise Bonferroni multiple adjustment) or `"none"` (no adjustment). See the 'Details' section.
`agg.method`	a character string naming an aggregation method which aggregates the p-values over the different data sets for a given cluster; either `"Tippett"` (Tippett's rule) or `"Stouffer"` (Stouffer's rule). This argument is only relevant if multiple data sets are specified in the function call.
`verbose`	a logical value indicating whether the progress of the computation should be printed in the console.
`sort.parallel`	a logical indicating whether the blocks should be sorted with respect to the size of the block. This can reduce the run time for parallel computation.
`parallel`	type of parallel computation to be used. See the 'Details' section.
`ncpus`	number of processes to be run in parallel.
`cl`	an optional parallel or snow cluster used if `parallel = "snow"`. If not supplied, a cluster on the local machine is created.

Hierarchical testing is performed by going top down through the hierarchical tree. Testing in some branch only continues if at least one child of a given cluster is significant. The function run_hierarchy requires the output of one of the functions cluster_vars or cluster_positions as an input (argument dendr).

The user can specify which hierarchical multiple testing adjustment for the hierarchical procedure is applied. The default is "dpBF" (depth-wise Bonferroni multiple adjustment). Alternatively, the user can choose "none" (no adjustment). The hierarchical multiple testing adjustment "dpBF" guarantees strong family-wise error control if the group test, which is applied for testing a given group, controls the type I error.

The group test function has to be specified by the user; see argument test.func. It is required to have the following arguments: x, y, clvar, colnames.cluster, arg.all, arg.all.fix, mod.large, mod.small. Although it is fine if not all of them are used inside the function. Most of the arguments are described in the list of arguments above for the function run_hierarchy since they are as well specified by the user. The argument colnames.cluster contains the column names of the group of variables which should be tested. The arguments mod.large and mod.small are used as a cache or memory since one can often reuse, say, one initial lasso fit for all the subsequent group tests. The argument mod.large corresponds to some result of some initial calculation which is reused later. The argument mod.small can be updated continuously through the hierarchical sequential testing procedure. The test function is required to output or return a list with two elements, i.e. the p-value ("pval") calculated for the given group and some object ("mod.small"), which serves as a cache or memory and is passed on to the next function calls in the same branch; see example below. The latter can just be NULL if this feature is not used. Compare as well with the above arguments compMOD.same and compMOD.changing for the function run_hierarchy, and see the example below.

If the argument block was supplied for the building of the hierarchical tree (i.e. in the function call of either cluster_vars or cluster_positions), i.e. the second level of the hierarchical tree was given, the hierarchical testing step can be run in parallel across the different blocks by specifying the arguments parallel and ncpus. There is an optional argument cl if parallel = "snow". There are three possibilities to set the argument parallel: parallel = "no" for serial evaluation (default), parallel = "multicore" for parallel evaluation using forking, and parallel = "snow" for parallel evaluation using a parallel socket cluster. It is recommended to select RNGkind("L'Ecuyer-CMRG") and set a seed to ensure that the parallel computing of the package hierbase is reproducible. This way each processor gets a different substream of the pseudo random number generator stream which makes the results reproducible if the arguments (as sort.parallel and ncpus) remain unchanged. See the vignette or the reference for more details.

Note that if Tippett's aggregation method is applied for multiple data sets, then very small p-values are set to machine precision. This is due to rounding in floating point arithmetic.

The returned value is an object of class "hierBase", consisting a data.frame with the result of the hierarchical testing.

The data.frame has the following columns:

`block`	`NA` or the name of the block if the significant cluster is a subcluster of the block or is the block itself.
`p.value`	The p-value of the significant cluster.
`significant.cluster`	The column names of the members of the significant cluster.

There is a print method for this class; see print.hierBase.

Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika, 95(2), 265-278. Renaux, C., Buzdugan, L., Kalisch, M., and Bühlmann, P. (2020). Hierarchical inference for genome-wide association studies: a view on methodology with software. Computational Statistics, 35(1), 1-40.

cluster_vars, cluster_positions, and advance_hierarchy.

# low-dimensional example with user specified test function
n <- 200
p <- 100
library(MASS)
set.seed(3)
x <- mvrnorm(n, mu = rep(0, p), Sigma = diag(p))
colnames(x) <- paste0("Var", 1:p)
beta <- rep(0, p)
beta[c(5, 20, 46)] <- 1
y <- x %*% beta + rnorm(n)

dendr1 <- cluster_vars(x = x)

# Define own test function: partial F-test
# Many of the arguments of the function below might not be 
# used but the function is required to have those arguments.
my.test <- function(x, y, clvar, colnames.cluster, arg.all, 
                    arg.all.fix, mod.large, mod.small) {
                      
                       # generate design matrices
                       data.large <- cbind(clvar, x)
                       setdiff.cluster <- setdiff(colnames(x), colnames.cluster)
                       data.small <- cbind(clvar, x[, setdiff.cluster]) 
                       # Replace data.small if set of indices setdiff.cluster is empty.
                       if (ncol(data.small) == 0) {data.small <- rep(1, length(y))}
                       
                       # compare the models using a partial F test
                       pval <- anova(lm(y ~ data.small),
                                     lm(y ~ data.large),
                                     test = "F")$P[2]
                       
                       return(list("pval" = pval, "mod.small" = NULL))
                    }

# run hierarchical testing
set.seed(76)
sign.clusters1 <- run_hierarchy(x = x, y = y, dendr = dendr1,
                                test.func = my.test)

## With block
# I.e. second level of the hierarchical tree is specified by 
# the user. This would allow to run the code in parallel; see the 'Details'
# section.
# The column names of the data frame block are optional.
block <- data.frame("var.name" = paste0("Var", 1:p),
                    "block" = rep(c(1, 2), each = p/2))
dendr2 <- cluster_vars(x = x, block = block)
set.seed(76)
sign.clusters2 <- run_hierarchy(x = x, y = y, dendr = dendr2,
                                test.func = my.test)

# Access part of the return object or result
sign.clusters2[, "block"]
sign.clusters2[, "p.value"]
# Column names or variable names of the significant cluster in the first row.
sign.clusters2[[1, "significant.cluster"]]