Performs the grouped variable screening and selection.
grpss(...)
## Default S3 method:
grpss(X, y, group, threshold = NULL,
scale = c("standardize", "normalize", "none"), criterion = c("gSIS",
"gHOLP", "gAR2", "gDC"), family = c("gaussian", "binomial", "poisson"),
select = FALSE, penalty = c("grSCAD", "grLasso", "grMCP", "gel", "cMCP"),
cross.validation = FALSE, norm = c("L1", "L2", "Linf"), q = 1,
perm.seed = 1, nfolds = 10, cv.seed = NULL, parallel = FALSE,
cl = NULL, cores = NULL, ...)
## S3 method for class 'formula'
grpss(formula, data, group, ...)
... |
Optional arguments passed to |
X |
A matrix of grouped predictors. |
y |
A numeric vector of response. |
group |
A vector describing the grouping of the predictors. Numeric and consecutive group indices are recommended. |
threshold |
A threshold meaning how many groups are screened out. The default is |
scale |
The type of scaling of the predictors. The default is " |
criterion |
The screening criterion. The default is " |
family |
A description of the error distribution and link function to be used
in the model. The default is " |
select |
A logical value indicating whether to perform the grouped variable selection.
The default is |
penalty |
The penalty to be applied to the screened model. The default is
" |
cross.validation |
A logical value indicating whether to perform the k-fold
cross-validation when conducting the grouped variable selection. Only valid when
|
norm |
The type of norm applied to |
q |
A quantile for calculating the data-driven threshold in the permutation-based
grouped screening. The default is |
perm.seed |
A seed of the random number generator used for the permutation-based screening to obtain the threshold. See details. |
nfolds |
The number of folds to perform the cross-validation. The default is |
cv.seed |
A seed of the random number generator used for the cross-validation. |
parallel |
A logical value indicating whether to use the parallel computing. The
default is |
cl |
A cluster object as returned by makeCluster, or the number of nodes to be created in the cluster. |
cores |
The number of cores to use for parallel execution. If not specified, the number of cores is set to 3. |
formula |
An object of class " |
data |
An optional data frame. |
Grouped variable selection becomes difficult, and can fail outright, when the number of groups is ultra-high. To address this, we implement a two-stage procedure. In the first stage, a grouped screening procedure reduces the number of groups from ultra-high to a moderate or small size. In the second stage, grouped variable selection is applied to the screened data, where it no longer faces that difficulty. The sure screening property ensures that the first stage retains all important groups with overwhelming probability.
This function carries out the two-stage procedure. In the first stage, we apply one of several screening criteria for grouped variables by calculating grouped screening values, which measure the strength of the relationship between the response and all predictors in each group. See grp.criValues for details on how the grouped screening values are calculated.
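To make the idea of a grouped screening value concrete, here is a minimal base R sketch. It is not the package's implementation: the helper `group_score` is hypothetical, and summing absolute marginal correlations within each group (an L1-type norm) is an assumption chosen for illustration. The definitions actually used are those in grp.criValues.

```r
set.seed(1)
n <- 40   # sample size
p <- 3    # predictors per group
J <- 20   # number of groups
group <- rep(1:J, each = p)

# Simulated data in which only the first group affects the response
X <- matrix(rnorm(n * p * J), n, p * J)
y <- as.vector(X[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

# Hypothetical helper (not part of the package): one score per group,
# obtained by summing |cor(x_j, y)| over the predictors in that group
group_score <- function(X, y, group) {
  Xs <- scale(X)
  ys <- as.vector(scale(y))
  omega <- abs(crossprod(Xs, ys)) / (nrow(X) - 1)  # absolute marginal correlations
  tapply(as.vector(omega), group, sum)
}

scores <- group_score(X, y, group)
head(sort(scores, decreasing = TRUE))
```

Each group receives a single non-negative value, so groups can be ranked regardless of how many predictors they contain.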
For the family = "gaussian" case, we keep the threshold groups with the largest screening criterion values. In contrast, for the family = "binomial" or "poisson" case, we keep the threshold groups with the smallest screening criterion values.
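The two ranking rules can be sketched as follows. The screening values below are made up for illustration, and the variable names are hypothetical; only the direction of the ranking is taken from the description above.

```r
# Hypothetical screening values for six groups (made up for illustration)
scores <- c(0.9, 0.1, 0.5, 0.7, 0.2, 0.3)
threshold <- 3  # number of groups to keep

# gaussian-style rule: keep the `threshold` groups with the largest values
keep_largest <- order(scores, decreasing = TRUE)[seq_len(threshold)]

# binomial/poisson-style rule: keep the `threshold` groups with the smallest values
keep_smallest <- order(scores)[seq_len(threshold)]

sort(keep_largest)   # group indices kept under the first rule
sort(keep_smallest)  # group indices kept under the second rule
```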
If threshold = NULL, we use a random permutation strategy to obtain the threshold, which is called the data-driven threshold. See Fan, Feng and Song (2011) for details. A larger threshold increases the probability of retaining the truly important groups, but may make the grouped variable selection more computationally intensive and raise the false positive rate.
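One possible reading of the permutation strategy is sketched below in base R. This is an assumption about the general idea, not the package's code: the response is permuted to break its link with the predictors, the same group scores are recomputed on the permuted data, and the q-th quantile of those null scores serves as a cutoff. The helper `group_score` and the correlation-based score are hypothetical.

```r
set.seed(1)  # plays the role that perm.seed plays in grpss
n <- 40; p <- 3; J <- 20
group <- rep(1:J, each = p)
X <- matrix(rnorm(n * p * J), n, p * J)
y <- as.vector(X[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

# Hypothetical group score: sum of absolute marginal correlations per group
group_score <- function(X, y, group) {
  as.vector(tapply(as.vector(abs(cor(X, y))), group, sum))
}

obs_scores  <- group_score(X, y, group)          # scores on the observed data
null_scores <- group_score(X, sample(y), group)  # scores after permuting y

q <- 1                                   # q = 1 uses the largest null score
cutoff <- quantile(null_scores, probs = q)

# Data-driven threshold: number of groups whose score exceeds the cutoff
threshold <- sum(obs_scores > cutoff)
threshold
```

Lowering q below 1 relaxes the cutoff, so more groups pass the screening step, at the cost of admitting more unimportant ones.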
In the second stage, we use the function grpreg in the grpreg package, developed by Patrick Breheny, to fit the penalized regression model to the grouped variables retained in the first stage. See the documentation of grpreg for more details on the grouped variable selection.
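The second stage can be sketched by hand as follows. The set of screened groups in `keep` is hypothetical and stands in for the output of the first stage. The fit runs only if the grpreg package is installed. This illustrates the workflow and does not reproduce what grpss does internally.

```r
set.seed(1)
n <- 40; p <- 3; J <- 20
group <- rep(1:J, each = p)
X <- matrix(rnorm(n * p * J), n, p * J)
y <- as.vector(X[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

# Hypothetical result of the screening stage: groups 1, 5 and 9 are retained
keep <- c(1, 5, 9)
cols <- group %in% keep
X_screen <- X[, cols]       # predictors from the retained groups only
group_screen <- group[cols] # their group labels

# Fit the group-penalized model on the screened data, if grpreg is available
if (requireNamespace("grpreg", quietly = TRUE)) {
  fit <- grpreg::grpreg(X_screen, y, group = group_screen,
                        penalty = "grSCAD", family = "gaussian")
  print(dim(coef(fit)))  # coefficients along the lambda path
}
```

Because the penalized fit sees only the retained columns, its cost depends on the number of screened groups rather than on the original number of groups.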
This function also supports parallel computation, importing the doParallel package to improve computational efficiency.
If select = FALSE, a list with class "grpss" containing the following components:
call |
The function call. |
y |
The response. |
X |
The screened predictors. |
group.screen |
The indices of screened groups. |
threshold |
The threshold. |
criterion |
The screening criterion. |
If select = TRUE, a list with class "grpreg" or "cv.grpreg" (when cross.validation = TRUE) containing components similar to those returned by grpreg or cv.grpreg, plus the following three elements:
call |
Same as above. |
group.screen |
Same as above. |
criterion |
Same as above. |
Missing values are removed before the analysis.
Debin Qiu, Jeongyoun Ahn
Fan J, Feng Y, Song R (2011). Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association. 106:544-557.
grpreg, cv.grpreg, grp.criValues
library(MASS)
set.seed(23)
n <- 30 # sample size
p <- 3 # number of predictors in each group
J <- 50 # number of groups
group <- rep(1:J,each = p) # group indices
##autoregressive correlation
Sigma <- 0.6^abs(matrix(1:(p*J),p*J,p*J) - t(matrix(1:(p*J),p*J,p*J)))
X <- mvrnorm(n,seq(0,5,length.out = p*J),Sigma)
betaTrue <- runif(12,-2,5)
mu <- X%*%matrix(c(betaTrue,rep(0,p*J-12)),ncol = 1)
# normal distribution
y <- mu + rnorm(n)
# only conduct screening procedure
(gss01 <- grpss(X,y,group)) # gSIS
# perform both screening and selection procedures
## use grpss.default with cross-validation
gss11 <- grpss(X,y,group,select = TRUE,cross.validation = TRUE)
summary(gss11)
## without cross-validation
gss12 <- grpss(X,y,threshold = 10,group,select = TRUE,criterion = "gHOLP")
summary(gss12)
## binomial distribution
y1 <- rbinom(n,1,1/(1 + exp(-mu)))
(gss21 <- grpss(X,y1,group, criterion = "gAR2")) # use AIC
(gss22 <- grpss(X,y1,group, criterion = "gDC")) # use gDC
## poisson distribution
y2 <- rpois(n,lambda = exp(mu))
(gss31 <- grpss(X,y2,group, criterion = "gAR2"))
(gss32 <- grpss(X,y2,group, criterion = "gDC"))