splitFlip: Resampling-Based Multisplit
In annavesely/splitFlip: Permutation-Based Multisplit

splitFlip

R Documentation

Description

This function computes resampling-based standardized scores for high-dimensional linear regression.

Usage

splitFlip(X, Y, Q = 50, B = 200, target = NULL, varSel = selLasso, varSelArgs = NULL, exact = FALSE, maxRepeat = 20, seed = NULL)

Arguments

`X`	numeric design matrix (including the intercept), where columns correspond to variables, and rows to observations.
`Y`	numeric response vector.
`Q`	number of data splits.
`B`	number of sign flips.
`target`	maximum number of variables to be selected.
`varSel`	a function to perform variable selection. It must have at least three arguments: `X` (design matrix), `Y` (response vector) and `target` (maximum number of selected variables). Additional arguments are passed through `varSelArgs`. Return value is a numeric vector containing the indices of the selected variables.
`varSelArgs`	named list of further arguments for `varSel`.
`exact`	logical, `TRUE` for the exact method, `FALSE` for the approximate method.
`maxRepeat`	maximum number of split trials.
`seed`	seed for random number generation.
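
As an illustration of the `varSel` interface described above, the following is a minimal sketch of a user-defined selector. The function `selMarginal` and its `skipFirst` argument are hypothetical and not part of the package. The sketch assumes that the first column of `X` is the intercept and that the selector returns column indices of `X`.

```r
# Hypothetical selector (not part of the package): keeps the `target`
# variables with the largest absolute marginal correlation with Y.
selMarginal <- function(X, Y, target = NULL, skipFirst = TRUE) {
  m <- ncol(X)
  # optionally leave out the first column, assumed here to be the intercept
  candidates <- if (skipFirst) seq_len(m)[-1] else seq_len(m)
  if (is.null(target)) target <- length(candidates)
  # absolute correlation of each candidate column with the response
  score <- abs(drop(cor(X[, candidates, drop = FALSE], Y)))
  # positions of the best candidates, at most `target` of them
  keep <- order(score, decreasing = TRUE)[seq_len(min(target, length(candidates)))]
  sort(candidates[keep])
}

# toy data: intercept plus five variables, two of which drive the response
set.seed(1)
X <- cbind(1, matrix(rnorm(200 * 5), 200, 5))
Y <- 3 * X[, 3] - 2 * X[, 5] + rnorm(200, sd = 0.1)
sel <- selMarginal(X, Y, target = 2)
```

Given the arguments documented above, such a selector would presumably be passed as `varSel = selMarginal`, with extra arguments supplied as `varSelArgs = list(skipFirst = TRUE)`.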

Details

The data are split Q times into two subsets of equal size. For each split, the first subset is used to perform variable selection, while the second is used to compute the effective scores for each variable and B random sign flips (including the identity). If a variable is not selected, its score is set to zero. For each variable and each sign flip, the standardized score is defined as (an approximation of) the sum of the effective scores over the Q splits, divided by its variance.
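
The procedure for a single split can be sketched as follows. This is a simplified illustration in base R, not the package's implementation: the selection rule is a placeholder, and the score used here (sign-flipped products of `Y` with each selected column, residualized on the other selected columns) is an assumed stand-in for the effective score.

```r
set.seed(1)
n <- 40; m <- 6; B <- 5
X <- cbind(1, matrix(rnorm(n * (m - 1)), n, m - 1))  # first column: intercept
Y <- X[, 2] + rnorm(n)

# 1. split the observations into two halves of equal size
idx <- sample(n)
first <- idx[seq_len(n / 2)]
second <- idx[(n / 2 + 1):n]

# 2. selection on the first half (placeholder rule: intercept plus the two
#    columns most correlated with Y; any selector could be plugged in here)
r <- abs(drop(cor(X[first, -1], Y[first])))
sel <- sort(c(1, 1 + order(r, decreasing = TRUE)[1:2]))

# 3. sign flips for the second half; the first row is the identity
n2 <- length(second)
flips <- rbind(1, matrix(sample(c(-1, 1), (B - 1) * n2, replace = TRUE), B - 1, n2))

# 4. simplified flipped scores on the second half:
#    residualize each selected column on the other selected columns,
#    then sum the sign-flipped products with Y; unselected columns stay at 0
scores <- matrix(0, B, m)
for (j in sel) {
  others <- setdiff(sel, j)
  xj <- X[second, j]
  if (length(others) > 0) xj <- qr.resid(qr(X[second, others, drop = FALSE]), xj)
  scores[, j] <- flips %*% (xj * Y[second])
}
```

Repeating these steps Q times and combining the per-split matrices would give one row per sign flip and one column per variable, matching the layout described under Value.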

If too many variables are selected in a split (more than half the sample size), a warning is issued and the data are randomly split again. After maxRepeat trials in which too many variables are selected, the function stops with an error.
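
The retry behaviour described above might look roughly like the sketch below. The helper `splitWithRetry` and the toy selectors are hypothetical, and the threshold is assumed to be half the total number of observations; the package's internal logic may differ.

```r
# Hypothetical helper, for illustration only: draws an equal-size split and
# repeats it while the selector returns more than half the sample size.
splitWithRetry <- function(X, Y, varSel, target = NULL, maxRepeat = 20) {
  n <- nrow(X)
  for (trial in seq_len(maxRepeat)) {
    first <- sample(n, floor(n / 2))   # observations used for selection
    sel <- varSel(X[first, , drop = FALSE], Y[first], target)
    if (length(sel) <= n / 2) {
      return(list(first = first, second = setdiff(seq_len(n), first), sel = sel))
    }
    warning("too many variables selected, splitting again")
  }
  stop("too many variables selected in ", maxRepeat, " consecutive splits")
}

set.seed(1)
X <- matrix(rnorm(20 * 30), 20, 30)
Y <- rnorm(20)
selFew <- function(X, Y, target) seq_len(min(target, ncol(X)))  # always acceptable
selAll <- function(X, Y, target) seq_len(ncol(X))               # always too many

ok <- splitWithRetry(X, Y, selFew, target = 3)
failed <- tryCatch(
  suppressWarnings(splitWithRetry(X, Y, selAll, maxRepeat = 3)),
  error = function(e) conditionMessage(e)
)
```

Here `ok` holds a valid split, while `failed` holds the error message raised after three unsuccessful trials.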

Value

splitFlip returns a numeric matrix of standardized scores, where columns correspond to variables, and rows to B random sign flips. The first flip is the identity.
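
To show how a matrix with this layout can be used, the sketch below computes p-values from a synthetic score matrix. `G` is simulated here rather than produced by `splitFlip`, and the single-step max-T adjustment is a generic textbook version that is not necessarily what the package's `maxT` function does.

```r
set.seed(1)
B <- 200; m <- 5
# synthetic stand-in for a score matrix: rows are flips, columns are variables
G <- matrix(rnorm(B * m), B, m)
G[1, ] <- c(6, 0.2, -0.1, 5, 0.3)   # pretend row 1 holds the observed scores

# per-variable two-sided p-value: share of flips (identity included)
# whose absolute score is at least as large as the observed one
pRaw <- colMeans(abs(G) >= matrix(abs(G[1, ]), B, m, byrow = TRUE))

# single-step max-T adjustment: compare each observed score with the
# flip-wise maximum over all variables
maxAbs <- apply(abs(G), 1, max)
pAdj <- vapply(abs(G[1, ]), function(t) mean(maxAbs >= t), numeric(1))
```

Because the identity flip is counted among the B flips, no p-value can fall below 1/B, which is one reason to choose B large enough for the intended significance level.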

Author(s)

Anna Vesely.

Examples

# generate linear regression data with 20 variables and 10 observations
res <- simData(m1=2, m=20, n=10, rho=0.5, type="toeplitz", SNR=5, seed=42)
X <- res$X # design matrix
Y <- res$Y # response vector
active <- res$active # indices of active variables

# choose target as twice the number of active variables
target <- 2*length(active)

# standardized scores using the approximate method with Lasso selection of target variables
G1 <- splitFlip(X, Y, target=target, seed=42)

# maxT algorithm
maxT(G1, alpha=0.1)

# standardized scores using the exact method with oracle selection of target variables
G2 <- splitFlip(X, Y, target=target, varSel=selOracle, varSelArgs=list(toSel=active), exact=TRUE, seed=42)

# maxT algorithm
maxT(G2, alpha=0.1)

annavesely/splitFlip documentation built on July 27, 2024, 4:23 a.m.