grpss: Group screening and selection


Performs grouped variable screening and selection.

grpss(...)

## Default S3 method:
grpss(X, y, group, threshold = NULL,
  scale = c("standardize", "normalize", "none"), criterion = c("gSIS",
  "gHOLP", "gAR2", "gDC"), family = c("gaussian", "binomial", "poisson"),
  select = FALSE, penalty = c("grSCAD", "grLasso", "grMCP", "gel", "cMCP"),
  cross.validation = FALSE, norm = c("L1", "L2", "Linf"), q = 1,
  perm.seed = 1, nfolds = 10, cv.seed = NULL, parallel = FALSE,
  cl = NULL, cores = NULL, ...)

## S3 method for class 'formula'
grpss(formula, data, group, ...)

`...`	Optional arguments passed to `grpreg`.
`X`	A matrix of grouped predictors.
`y`	A numeric vector of responses.
`group`	A vector describing the grouping of the predictors. Numeric and consecutive group indices are recommended.
`threshold`	The number of groups to retain after screening. The default is `NULL`, in which case a data-driven threshold is obtained by permutation. See Details.
`scale`	The type of scaling of the predictors. The default is "`standardize`".
`criterion`	The screening criterion. The default is "`gSIS`".
`family`	A description of the error distribution and link function to be used in the model. The default is "`gaussian`".
`select`	A logical value indicating whether to perform the grouped variable selection. The default is `FALSE`.
`penalty`	The penalty to be applied to the screened model. The default is "`grSCAD`". Only valid when `select = TRUE`.
`cross.validation`	A logical value indicating whether to use k-fold cross-validation in the grouped variable selection. Only valid when `select = TRUE`. The default is `FALSE`.
`norm`	The type of norm used when `criterion = "gSIS"` or `criterion = "gHOLP"`. The default is the `L1` norm.
`q`	The quantile used to compute the data-driven threshold in permutation-based grouped screening. The default is `1`, i.e., the maximum absolute value of the permuted estimates. See Details.
`perm.seed`	The seed of the random number generator used in permutation-based screening to obtain the threshold. See Details.
`nfolds`	The number of folds used in cross-validation. The default is `10`.
`cv.seed`	The seed of the random number generator used for cross-validation.
`parallel`	A logical value indicating whether to use parallel computing. The default is `FALSE`.
`cl`	A cluster object as returned by `makeCluster`, or the number of nodes to be created in the cluster.
`cores`	The number of cores to use for parallel execution. If not specified, the number of cores is set to 3.
`formula`	An object of class "`formula`".
`data`	An optional data frame.
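
Since numeric and consecutive group indices are recommended for `group`, the following base R sketch shows one way to build such a vector from arbitrary labels and to check it against the columns of `X`. The label vector `labels` and the toy matrix are hypothetical and are not part of the package.

```r
# Hypothetical group labels, one per column of X (not from the package)
labels <- c("geneA", "geneA", "geneB", "geneB", "geneB", "geneC")

# Map labels to consecutive integers 1, 2, ..., J in order of first appearance
group <- match(labels, unique(labels))
group
# 1 1 2 2 2 3

# Toy predictor matrix with one column per label
X <- matrix(rnorm(4 * length(labels)), nrow = 4)

# Sanity checks before calling a grouped method
stopifnot(length(group) == ncol(X))                             # one index per predictor
stopifnot(identical(sort(unique(group)), seq_len(max(group))))  # indices are consecutive
table(group)                                                    # group sizes
```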

Grouped variable selection becomes very difficult, and can fail outright, when the number of groups is ultra-high. To address this, we implement a two-stage procedure. In the first stage, a grouped screening procedure reduces the number of groups from ultra-high to a moderate or even small size. In the second stage, grouped variable selection is applied to the screened data, where it no longer faces these difficulties. The sure screening property ensures that the first stage retains all important groups with overwhelming probability.

This function carries out the two-stage procedure. In the first stage, we apply the chosen screening criterion by computing, for each group, a grouped screening value that measures the strength of the relationship between the response and all predictors in that group. See grp.criValues for details on how the grouped screening values are calculated. For family = "gaussian", we keep the `threshold` groups with the largest screening values. In contrast, for family = "binomial" or "poisson", we keep the `threshold` groups with the smallest screening values.
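
To make the idea of a grouped screening value concrete, here is a minimal base R sketch for the Gaussian case. It assumes a simple utility, the absolute marginal correlation of each predictor with the response, aggregated within each group by an L1, L2, or Linf norm, and keeps the `k` groups with the largest values. The helper `group_screen_sketch()` and its arguments are hypothetical; the package's actual computation is in grp.criValues and may differ.

```r
# Hypothetical helper: an illustration only, NOT the grpss implementation
group_screen_sketch <- function(X, y, group, k, norm = c("L1", "L2", "Linf")) {
  norm <- match.arg(norm)
  w <- abs(cor(X, y))                     # |marginal correlation| per column
  agg <- switch(norm,
                L1   = function(v) sum(v),
                L2   = function(v) sqrt(sum(v^2)),
                Linf = function(v) max(v))
  # One screening value per group, named by group index
  values <- vapply(split(as.numeric(w), group), agg, numeric(1))
  ord <- order(values, decreasing = TRUE)
  keep <- as.integer(names(values)[ord[seq_len(k)]])
  list(values = values, group.screen = sort(keep))
}

set.seed(1)
n <- 50; p <- 3; J <- 20
group <- rep(1:J, each = p)
X <- matrix(rnorm(n * p * J), n, p * J)
y <- as.numeric(X[, 1:3] %*% c(3, -2, 2.5) + rnorm(n))  # only group 1 is active

res <- group_screen_sketch(X, y, group, k = 5)
res$group.screen                          # indices of the retained groups
X.screened <- X[, group %in% res$group.screen, drop = FALSE]
dim(X.screened)
```

In this sketch the screened matrix keeps every column of each retained group, so a second-stage method sees whole groups rather than individual predictors.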

If threshold = NULL, we use a random permutation strategy to obtain the threshold, which is then called the data-driven threshold. See Fan, Feng and Song (2011) for details. A larger threshold increases the probability of retaining the truly important groups, but may increase both the computational cost of the grouped variable selection and the false positive rate.
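
The permutation idea can be sketched in base R as follows: permute the response to break its association with the predictors, recompute the screening values, and use the `q`-th quantile of these null values as a cutoff. This is a simplified, single-permutation illustration that assumes the same marginal-correlation utility as a stand-in for the screening criterion. The helper `screen_values()` is hypothetical, and the package's exact procedure may differ.

```r
# Hypothetical utility: L1 norm of absolute marginal correlations per group
screen_values <- function(X, y, group) {
  w <- abs(cor(X, y))                     # one value per column
  vapply(split(as.numeric(w), group), sum, numeric(1))
}

set.seed(1)                               # plays the role of perm.seed
n <- 50; p <- 3; J <- 20
group <- rep(1:J, each = p)
X <- matrix(rnorm(n * p * J), n, p * J)
y <- as.numeric(X[, 1:3] %*% c(3, -2, 2.5) + rnorm(n))  # only group 1 is active

observed <- screen_values(X, y, group)
null.values <- screen_values(X, sample(y), group)       # permuted response

q <- 1                                    # q = 1 uses the maximum null value
cutoff <- quantile(null.values, probs = q, names = FALSE)

# Keep the groups whose observed value exceeds the null cutoff
group.screen <- as.integer(names(observed)[observed > cutoff])
threshold <- length(group.screen)         # number of groups retained
group.screen
```

Lowering `q` in this sketch lowers the cutoff, so more groups pass; this mirrors the trade-off described above between retaining important groups and admitting false positives.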

In the second stage, we use the function grpreg from the grpreg package, developed by Patrick Breheny, to fit a penalized regression model to the groups retained in the first stage. For more on grouped variable selection, see the documentation of grpreg.
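
As a rough illustration of the second stage on its own, the sketch below calls `grpreg` and `cv.grpreg` directly on a simulated, already-screened predictor matrix. It assumes the grpreg package is installed; the data and the `group.screen` indices are hypothetical, and this is not how grpss wires the two stages together internally.

```r
# Runs only if the grpreg package is available
if (requireNamespace("grpreg", quietly = TRUE)) {
  set.seed(1)
  n <- 50; p <- 3
  group.screen <- c(1, 4, 7)              # hypothetical indices kept by screening
  group <- rep(seq_along(group.screen), each = p)   # re-indexed as 1, 2, 3
  X.screened <- matrix(rnorm(n * p * length(group.screen)), n)
  y <- as.numeric(X.screened[, 1:3] %*% c(3, -2, 2.5) + rnorm(n))

  # Penalized fit along a path of lambda values
  fit <- grpreg::grpreg(X.screened, y, group = group,
                        penalty = "grSCAD", family = "gaussian")

  # Cross-validated choice of lambda, analogous to cross.validation = TRUE
  cvfit <- grpreg::cv.grpreg(X.screened, y, group = group,
                             penalty = "grSCAD", family = "gaussian",
                             nfolds = 10, seed = 1)
  coef(cvfit)                             # coefficients at the CV-selected lambda
}
```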

This function also supports parallel computation, through the doParallel package, to improve computational efficiency.

If select = FALSE, a list with class "grpss" containing the following components:

`call`	The function call.
`y`	The response.
`X`	The screened predictors.
`group.screen`	The indices of screened groups.
`threshold`	The threshold.
`criterion`	The screening criterion.

If select = TRUE, a list with class "grpreg", or "cv.grpreg" when cross.validation = TRUE, containing components similar to those returned by grpreg or cv.grpreg, plus the following three elements:

`call`	Same as above.
`group.screen`	Same as above.
`criterion`	Same as above.

Missing values are removed before the analysis.
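
The note does not say how missing values are removed. One common approach, shown here purely as an assumption, is listwise deletion: drop every observation in which the response or any predictor is missing. The toy data are hypothetical.

```r
# Toy data with missing values (hypothetical); the matrix is filled by column
X <- matrix(c(1, NA, 3, 4,
              5, 6, 7, 8,
              9, 10, 11, 12), nrow = 4)
y <- c(0.5, 1.2, NA, 2.0)

# Keep only rows where the response and all predictors are observed
ok <- complete.cases(X, y)
X.clean <- X[ok, , drop = FALSE]
y.clean <- y[ok]
dim(X.clean)
# 2 3
```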

Debin Qiu, Jeongyoun Ahn

Fan J, Feng Y, Song R (2011). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association. 106:544-557.

grpreg, cv.grpreg, grp.criValues

library(MASS)
set.seed(23)
n <- 30 # sample size
p <- 3  # number of predictors in each group
J <- 50  # number of groups
group <- rep(1:J,each = p)  # group indices
## autoregressive correlation
Sigma <- 0.6^abs(matrix(1:(p*J),p*J,p*J) - t(matrix(1:(p*J),p*J,p*J)))
X <- mvrnorm(n,seq(0,5,length.out = p*J),Sigma)
betaTrue <- runif(12,-2,5)
mu <- X%*%matrix(c(betaTrue,rep(0,p*J-12)),ncol = 1)

# normal distribution
y <- mu + rnorm(n)

# only conduct screening procedure
(gss01 <- grpss(X,y,group)) # gSIS

# perform both screening and selection procedures
## use grpss.default with cross-validation
gss11 <- grpss(X,y,group,select = TRUE,cross.validation = TRUE)
summary(gss11)
## without cross-validation
gss12 <- grpss(X,y,group,threshold = 10,select = TRUE,criterion = "gHOLP")
summary(gss12)

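## binomial distribution
y1 <- rbinom(n,1,1/(1 + exp(-mu)))
# family is set explicitly; otherwise the default "gaussian" would be used
(gss21 <- grpss(X,y1,group, criterion = "gAR2", family = "binomial")) # use AIC
(gss22 <- grpss(X,y1,group, criterion = "gDC", family = "binomial"))  # use gDC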

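## poisson distribution
y2 <- rpois(n,lambda = exp(mu))
# family is set explicitly; otherwise the default "gaussian" would be used
(gss31 <- grpss(X,y2,group, criterion = "gAR2", family = "poisson"))
(gss32 <- grpss(X,y2,group, criterion = "gDC", family = "poisson"))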