splitFlip: Resampling-Based Multisplit

splitFlip R Documentation

Description

This function computes resampling-based standardized scores for high-dimensional linear regression.

Usage

splitFlip(X, Y, Q = 50, B = 200, target = NULL, varSel = selLasso, varSelArgs = NULL, exact = FALSE, maxRepeat = 20, seed = NULL)

Arguments

X

numeric design matrix (including the intercept), where columns correspond to variables, and rows to observations.

Y

numeric response vector.

Q

number of data splits.

B

number of sign flips.

target

maximum number of variables to be selected.

varSel

a function that performs variable selection. It must accept at least three arguments: X (design matrix), Y (response vector) and target (maximum number of selected variables). Additional arguments are passed through varSelArgs. It must return a numeric vector containing the indices of the selected variables.

varSelArgs

named list of further arguments for varSel.

exact

logical, TRUE for the exact method, FALSE for the approximate method.

maxRepeat

maximum number of times a split is attempted when too many variables are selected.

seed

seed for the random number generator (NULL by default).
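
As an illustration of the interface that varSel must follow, here is a minimal sketch of a custom selection function in base R. The function selTopCor and its extra argument minCor are hypothetical and not part of the package. The sketch assumes that target = NULL means no upper limit and that constant columns (such as the intercept) are never selected; this page does not specify either behaviour.

```r
# Hypothetical selection function (not part of splitFlip): keeps the
# variables with the largest absolute marginal correlation with Y.
selTopCor <- function(X, Y, target = NULL, minCor = 0) {
  sds <- apply(X, 2, sd)
  candidates <- unname(which(sds > 0))   # skip constant columns (e.g. intercept)
  score <- abs(cor(X[, candidates, drop = FALSE], Y))[, 1]
  ok <- score >= minCor                  # extra argument, set via varSelArgs
  keep <- candidates[ok]
  score <- score[ok]
  # assumption: target = NULL means "no upper limit"
  if (!is.null(target) && length(keep) > target) {
    keep <- keep[order(score, decreasing = TRUE)[seq_len(target)]]
  }
  sort(keep)                             # indices of the selected columns
}

# Toy data: intercept plus 10 variables, only column 3 is active
set.seed(1)
X <- cbind(1, matrix(rnorm(200), 20, 10))
Y <- 2 * X[, 3] + rnorm(20, sd = 0.1)
sel <- selTopCor(X, Y, target = 3)

# With the package loaded, the function could then be supplied as:
# splitFlip(X, Y, target = 3, varSel = selTopCor, varSelArgs = list(minCor = 0.1))
```

The extra argument minCor shows how varSelArgs is meant to be used: anything beyond X, Y and target goes in the named list.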

Details

The data are split Q times into two subsets of equal size. For each split, the first subset is used to perform variable selection, while the second is used to compute the effective scores for each variable and each of B random sign flips (including the identity). If a variable is not selected, its score is set to zero. For each variable and each sign flip, the standardized score is defined as (an approximation of) the sum of the effective scores over the Q splits, divided by its variance.
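
The structure of this procedure can be sketched in base R as follows. This is a simplified illustration, not the package's code: the function splitFlipSketch is hypothetical, selection is replaced by a covariance ranking, the score of a variable is taken to be its residual (after regressing on the other selected variables) multiplied by the response, the same sign flips are assumed to be shared across splits, and the final standardization step is omitted. The data are assumed to have no intercept column.

```r
# Simplified sketch of the split / select / flip loop (hypothetical code).
splitFlipSketch <- function(X, Y, Q = 5, B = 20, target = 2, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(X)
  m <- ncol(X)
  # B x n matrix of random signs; the first row is the identity flip
  flips <- rbind(rep(1, n),
                 matrix(sample(c(-1, 1), (B - 1) * n, replace = TRUE), B - 1, n))
  G <- matrix(0, B, m)                     # unselected variables stay at zero
  for (q in seq_len(Q)) {
    first <- sample(n, floor(n / 2))       # half used for selection
    second <- setdiff(seq_len(n), first)   # half used for scores
    # stand-in selection: largest absolute covariance with Y on the first half
    cv <- drop(crossprod(scale(X[first, , drop = FALSE], scale = FALSE), Y[first]))
    sel <- order(abs(cv), decreasing = TRUE)[seq_len(target)]
    for (j in sel) {
      others <- setdiff(sel, j)
      xj <- X[second, j]
      # residualize variable j on the other selected variables
      rj <- if (length(others) > 0) {
        lm.fit(X[second, others, drop = FALSE], xj)$residuals
      } else {
        xj
      }
      contrib <- rj * Y[second]            # per-observation score contributions
      # flipped scores for this split, accumulated over the Q splits
      G[, j] <- G[, j] + drop(flips[, second, drop = FALSE] %*% contrib)
    }
  }
  G                                        # unstandardized sums, B x m
}

set.seed(3)
Xs <- matrix(rnorm(40 * 6), 40, 6)
Ys <- 2 * Xs[, 1] + rnorm(40)
Gs <- splitFlipSketch(Xs, Ys, Q = 4, B = 10, target = 2, seed = 1)
dim(Gs)
```

The result has one row per sign flip and one column per variable, matching the layout described under Value.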

If too many variables are selected in a split (more than half the sample size), a warning is issued and the data are randomly split again. After maxRepeat trials in which too many variables are selected, the function stops with an error.
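
The retry rule can be sketched as a small loop. The helper splitWithRetry below is hypothetical and only mirrors the behaviour described above; the threshold n / 2 is a reading of "half the sample size", and the wording of the messages is invented for illustration.

```r
# Hypothetical helper: draw a split, re-draw it when selection is too large.
splitWithRetry <- function(X, Y, target, varSel, maxRepeat = 20) {
  n <- nrow(X)
  for (trial in seq_len(maxRepeat)) {
    first <- sample(n, floor(n / 2))             # selection half
    sel <- varSel(X[first, , drop = FALSE], Y[first], target)
    if (length(sel) <= n / 2) {
      return(list(first = first, sel = sel, trials = trial))
    }
    warning("too many variables selected; splitting again")
  }
  stop("too many variables selected in ", maxRepeat, " consecutive splits")
}

set.seed(1)
Xr <- matrix(rnorm(10 * 20), 10, 20)
Yr <- rnorm(10)

# A selector that respects target succeeds at the first trial
ok <- splitWithRetry(Xr, Yr, 2, function(X, Y, target) seq_len(target))

# A selector that always returns every column exhausts maxRepeat and fails
res <- suppressWarnings(tryCatch(
  splitWithRetry(Xr, Yr, 2, function(X, Y, target) seq_len(ncol(X)), maxRepeat = 3),
  error = function(e) "error"
))
```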

Value

splitFlip returns a numeric matrix of standardized scores, where columns correspond to variables, and rows to B random sign flips. The first flip is the identity.
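
Because the first row holds the observed scores and the remaining rows hold their sign-flipped counterparts, a matrix of this shape can be turned into unadjusted p-values by counting how often a flipped score is at least as extreme as the observed one. The helper flipPvalues below is hypothetical, uses a two-sided comparison as an assumption, and does not correct for multiplicity.

```r
# Hypothetical helper: unadjusted two-sided p-values from a B x m score matrix
# whose first row corresponds to the identity flip.
flipPvalues <- function(G) {
  obs <- abs(G[1, ])                        # observed scores
  colMeans(sweep(abs(G), 2, obs, ">="))     # share of flips at least as extreme
}

# Toy matrix with B = 4 flips and m = 2 variables
Gtoy <- rbind(c(3.0,  0.5),
              c(1.0,  1.0),
              c(-2.0, -1.0),
              c(0.5,  2.0))
p <- flipPvalues(Gtoy)
p   # 0.25 1.00
```

The identity flip is counted among the B flips, so the smallest attainable p-value is 1/B.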

Author(s)

Anna Vesely.

Examples

# generate linear regression data with 20 variables and 10 observations
res <- simData(m1=2, m=20, n=10, rho=0.5, type="toeplitz", SNR=5, seed=42)
X <- res$X # design matrix
Y <- res$Y # response vector
active <- res$active # indices of active variables

# choose target as twice the number of active variables
target <- 2*length(active)

# standardized scores using the approximate method with Lasso selection of target variables
G1 <- splitFlip(X, Y, target=target, seed=42)

# maxT algorithm
maxT(G1, alpha=0.1)

# standardized scores using the exact method with oracle selection of target variables
G2 <- splitFlip(X, Y, target=target, varSel=selOracle, varSelArgs=list(toSel=active), exact=TRUE, seed=42)

# maxT algorithm
maxT(G2, alpha=0.1)

annavesely/splitFlip documentation built on July 27, 2024, 4:23 a.m.