grpss: Group screening and selection

Description

Performs grouped variable screening and, optionally, grouped variable selection.

Usage

grpss(...)

## Default S3 method:
grpss(X, y, group, threshold = NULL,
  scale = c("standardize", "normalize", "none"), criterion = c("gSIS",
  "gHOLP", "gAR2", "gDC"), family = c("gaussian", "binomial", "poisson"),
  select = FALSE, penalty = c("grSCAD", "grLasso", "grMCP", "gel", "cMCP"),
  cross.validation = FALSE, norm = c("L1", "L2", "Linf"), q = 1,
  perm.seed = 1, nfolds = 10, cv.seed = NULL, parallel = FALSE,
  cl = NULL, cores = NULL, ...)

## S3 method for class 'formula'
grpss(formula, data, group, ...)

Arguments

...

Optional arguments passed to grpreg.

X

A matrix of grouped predictors.

y

A numeric response vector.

group

A vector describing the grouping of the predictors. Numeric and consecutive group indices are recommended.

threshold

The number of groups to retain after screening. The default is NULL, in which case a data-driven threshold is used. See details.

scale

The type of scaling of the predictors. The default is "standardize".

criterion

The screening criterion. The default is "gSIS".

family

A description of the error distribution and link function to be used in the model. The default is "gaussian".

select

A logical value indicating whether to perform grouped variable selection. The default is FALSE.

penalty

The penalty to be applied to the screened model. The default is "grSCAD". Only valid when select = TRUE.

cross.validation

A logical value indicating whether to perform k-fold cross-validation when conducting the grouped variable selection. Only valid when select = TRUE. The default is FALSE.

norm

The type of norm applied when criterion = "gSIS" or criterion = "gHOLP". The default is "L1".

q

A quantile used to calculate the data-driven threshold in the permutation-based grouped screening. The default is 1 (i.e., the maximum absolute value of the permuted estimates). See details for more information.

perm.seed

A seed of the random number generator used for the permutation-based screening to obtain the threshold. See details.

nfolds

The number of folds to perform the cross-validation. The default is 10.

cv.seed

A seed of the random number generator used for the cross-validation.

parallel

A logical value indicating whether to use parallel computing. The default is FALSE.

cl

A cluster object as returned by makeCluster, or the number of nodes to be created in the cluster.

cores

The number of cores to use for parallel execution. If not specified, the number of cores is set to 3.

formula

An object of class "formula".

data

An optional data frame.

Details

Grouped variable selection becomes difficult, and can fail outright, when the number of groups is ultra-high. To address this, we implement a two-stage procedure. In the first stage, a grouped screening procedure reduces the number of groups from ultra-high to a moderate or small size; the sure screening property ensures that this step retains all important groups with overwhelming probability. In the second stage, grouped variable selection is applied to the screened data, where it no longer faces the dimensionality problem.

This function carries out the two-stage procedure. In the first stage, we apply a screening criterion to the grouped variables by calculating grouped screening values, which measure the strength of the relationship between the response and all predictors in each group. See grp.criValues for details on how the grouped screening values are calculated. For family = "gaussian", we keep the threshold groups with the largest screening values. For family = "binomial" or "poisson", by contrast, we keep the threshold groups with the smallest screening values.
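The first stage described above can be illustrated with a minimal base-R sketch. It is not the code used by grpss: the helpers group_screen_values and keep_top_groups are hypothetical names, and the screening value is assumed to be a norm of the absolute marginal correlations within each group, which only mimics a gSIS-style criterion. See grp.criValues for the definitions the package actually uses.

```r
# Hypothetical helper (not part of grpss): one screening value per group.
# Assumption: the value is a norm of the absolute marginal correlations
# between y and the predictors in that group.
group_screen_values <- function(X, y, group, norm = c("L1", "L2", "Linf")) {
  norm <- match.arg(norm)
  r <- abs(drop(cor(X, y)))  # |correlation| of each predictor with y
  agg <- switch(norm,
                L1   = function(v) sum(v),
                L2   = function(v) sqrt(sum(v^2)),
                Linf = function(v) max(v))
  tapply(r, group, agg)      # aggregate within each group
}

# Hypothetical helper: labels of the k groups with the largest values
# (largest = TRUE) or the smallest values (largest = FALSE).
keep_top_groups <- function(values, k, largest = TRUE) {
  ord <- order(values, decreasing = largest)
  names(values)[ord[seq_len(k)]]
}

# Toy data: 5 groups of 2 predictors; only group 1 drives the response.
set.seed(1)
n <- 50
group <- rep(1:5, each = 2)
X <- matrix(rnorm(n * length(group)), n, length(group))
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n, sd = 0.1)

vals <- group_screen_values(X, y, group, norm = "L1")
keep_top_groups(vals, k = 1)
```

The largest argument mirrors the distinction drawn above between criteria where large values indicate a strong relationship and criteria where small values do.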

If threshold = NULL, the threshold is obtained by a random permutation strategy and is called the data-driven threshold; see Fan, Feng and Song (2011) for details. A larger threshold gives a higher probability of retaining the truly important groups, but may also lead to heavier computation in the grouped variable selection and a higher false positive rate.
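The permutation idea can be sketched in base R as follows. This is only an illustration under stated assumptions, not the implementation in grpss: l1_cor and perm_threshold are hypothetical helpers, the screening statistic is assumed to be the L1 norm of absolute marginal correlations per group, and the response is permuted once to break its association with the predictors. How grpss converts the resulting cutoff into a number of retained groups is not specified here.

```r
# Hypothetical screening statistic: L1 norm of |marginal correlations| per group.
l1_cor <- function(X, y, group) {
  tapply(abs(drop(cor(X, y))), group, sum)
}

# Hypothetical helper (not part of grpss): data-driven cutoff from one permutation.
# q is the quantile of the permuted screening values; q = 1 gives their maximum.
perm_threshold <- function(X, y, group, stat, q = 1, seed = 1) {
  set.seed(seed)
  y_perm <- y[sample.int(length(y))]    # permuting y removes any X-y association
  null_values <- stat(X, y_perm, group) # screening values under "no signal"
  unname(quantile(as.numeric(null_values), probs = q))
}

# Toy data: 5 groups of 2 predictors; only group 1 drives the response.
set.seed(2)
n <- 50
group <- rep(1:5, each = 2)
X <- matrix(rnorm(n * length(group)), n, length(group))
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n, sd = 0.1)

obs <- l1_cor(X, y, group)                     # observed screening values
thr <- perm_threshold(X, y, group, l1_cor, q = 1)
kept <- names(obs)[obs > thr]                  # groups exceeding the null cutoff
kept
```

Lowering q lowers the cutoff, so more groups pass the screen; this is the trade-off between retaining important groups and admitting false positives noted above.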

In the second stage, we use the function grpreg from the grpreg package, developed by Patrick Breheny, to fit a penalized regression model to the groups retained by the first stage. For more details on the grouped variable selection, see the documentation of grpreg.

Parallel computation is supported through the doParallel package to improve computational efficiency.

Value

If select = FALSE, a list with class "grpss" containing the following components:

call

The function call.

y

The response.

X

The screened predictors.

group.screen

The indices of screened groups.

threshold

The threshold.

criterion

The screening criterion.

If select = TRUE, a list with class "grpreg" or "cv.grpreg" (when cross.validation = TRUE) containing components similar to those returned by grpreg or cv.grpreg, plus the following three elements:

call

Same as above.

group.screen

Same as above.

criterion

Same as above.

Note

Missing values are removed before the analysis.

Author(s)

Debin Qiu, Jeongyoun Ahn

References

Fan J, Feng Y, Song R (2011). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association. 106:544-557.

See Also

grpreg, cv.grpreg, grp.criValues

Examples

library(MASS)
set.seed(23)
n <- 30 # sample size
p <- 3  # number of predictors in each group
J <- 50  # number of groups
group <- rep(1:J,each = p)  # group indices
## autoregressive correlation
Sigma <- 0.6^abs(matrix(1:(p*J),p*J,p*J) - t(matrix(1:(p*J),p*J,p*J)))
X <- mvrnorm(n,seq(0,5,length.out = p*J),Sigma)
betaTrue <- runif(12,-2,5)
mu <- X%*%matrix(c(betaTrue,rep(0,p*J-12)),ncol = 1)

# normal distribution
y <- mu + rnorm(n)

# only conduct screening procedure
(gss01 <- grpss(X,y,group)) # gSIS

# perform both screening and selection procedures
## use grpss.default with cross-validation
gss11 <- grpss(X,y,group,select = TRUE,cross.validation = TRUE)
summary(gss11)
## without cross-validation
gss12 <- grpss(X,y,threshold = 10,group,select = TRUE,criterion = "gHOLP")
summary(gss12)

## binomial distribution
y1 <- rbinom(n,1,1/(1 + exp(-mu)))
(gss21 <- grpss(X,y1,group, criterion = "gAR2")) # use AIC
(gss22 <- grpss(X,y1,group, criterion = "gDC"))  # use gDC

## poisson distribution
y2 <- rpois(n,lambda = exp(mu))
(gss31 <- grpss(X,y2,group, criterion = "gAR2"))
(gss32 <- grpss(X,y2,group, criterion = "gDC"))

deman007/grpss documentation built on May 15, 2019, 3:22 a.m.