cv.smog: Cross-validation for smog
In smog: Structural Modeling by using Overlapped Group Penalty


cv.smog conducts a greedy search for the optimal lambdas and yields a sparse model under the chosen model selection criterion. When type is "nloglike", the nfolds cross-validation fits can be processed in parallel to speed up the computation.

cv.smog(x, y, g, v, label, type = "nloglike", family = "gaussian",
  lambda.max = NULL, nlambda.max = 20, delta = 0.9, nfolds = 10,
  parallel = FALSE, ncores = NULL, ...)

## S3 method for class 'cv.smog'
print(x, ...)

`x`	a model matrix or a data frame of dimensions n by p, in which the columns represent the predictor variables.
`y`	response variable, which must match the family description. When family is "gaussian" or "binomial", `y` should be a numeric vector of observations of length n; when family is "coxph", `y` is a survival object containing the survival time and the censoring status. See `Surv`.
`g`	a vector of group labels for the predictor variables.
`v`	a vector of binary values indicating whether each predictor variable is penalized: 1 means penalized and 0 means not penalized.
`label`	a character vector giving the type of each predictor, using "t", "prog", and "pred" for treatment, prognostic, and predictive effects, respectively.
`type`	model selection criterion, one of "nloglike", "cAIC", "AIC", "BIC", and "GCV".
`family`	a description of the distribution family for the response variable. For a continuous response, family is "gaussian"; for a multinomial or binary response, family is "binomial"; for a survival response, family is "coxph".
`lambda.max`	the maximum value of the lambdas. If `NULL`, `lambda.max` defaults to 1/λ_{min}(x'x).
`nlambda.max`	the maximum number of lambdas shrunk down from the maximum lambda `lambda.max`. Default is 20.
`delta`	the damping rate for lambda's such that λ_k = δ^kλ_0. Default is 0.9.
`nfolds`	number of folds. One fold of the observations is used for testing, and the remaining folds are used for model training. Default is 10.
`parallel`	whether or not to process the `nfolds` cross-validations in parallel. If `TRUE`, `foreach` is used to run each cross-validation in parallel. Default is `FALSE`.
`ncores`	number of CPUs for parallel computing. See `makeCluster` and `registerDoParallel`. Default is `NULL`.
`...`	other arguments that can be supplied to `smog`.
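
The damping rule λ_k = δ^kλ_0 and the documented default for `lambda.max` can be illustrated in base R. This is only a sketch of the arithmetic, not the search that cv.smog performs internally; the toy matrix and the choice to let k start at 0 are assumptions made for the illustration.

```r
set.seed(1)
x <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)  # toy design matrix (assumption)

# Documented default upper bound: 1 / (smallest eigenvalue of x'x)
eig <- eigen(crossprod(x), symmetric = TRUE, only.values = TRUE)$values
lambda0 <- 1 / min(eig)

delta <- 0.9
nlambda.max <- 20

# Assumption: k runs over 0, 1, ..., nlambda.max - 1
k <- seq_len(nlambda.max) - 1
lambda_path <- lambda0 * delta^k
```

Each value in `lambda_path` is 0.9 times the previous one, so a smaller `delta` shrinks the lambdas faster and covers a wider range in the same number of steps.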

When the type is "nloglike", nfolds-fold cross-validation is required. The whole data set is split evenly into nfolds folds; in each round, one fold of the observations is used as the testing data and the remaining folds are used for model training. After the (deviance) residuals are calculated for each fold of testing data, their average is returned. Note that lambda2=0 is kept fixed during the greedy search for the lambdas.
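
The fold-splitting scheme described above can be sketched in base R. Here `lm` stands in for the penalized fit and the held-out mean squared residual stands in for the (deviance) residual summary; both substitutions, and the simulated data, are assumptions made to keep the example self-contained, not the code cv.smog runs.

```r
set.seed(2018)
n <- 100
x <- matrix(rnorm(n * 3), n, 3)
y <- as.numeric(x %*% c(1, -2, 0.5) + rnorm(n))
nfolds <- 10

# Randomly assign each observation to one of nfolds equally sized folds
fold_id <- sample(rep(seq_len(nfolds), length.out = n))

dat <- data.frame(y = y, x)
cv_error <- numeric(nfolds)
for (f in seq_len(nfolds)) {
  test <- fold_id == f
  fit <- lm(y ~ ., data = dat[!test, ])        # train on the other folds
  pred <- predict(fit, newdata = dat[test, ])  # predict the held-out fold
  cv_error[f] <- mean((dat$y[test] - pred)^2)  # held-out squared residuals
}

# Average over folds, the quantity compared across candidate lambdas
mean_cv_error <- mean(cv_error)
```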

Model selection criteria

Besides nfolds-fold cross-validation, cv.smog provides several AIC-based model selection criteria.

cAIC: \frac{n}{2} \log(|2 \cdot \text{log-likelihood}|) + \frac{n}{2} \left( \frac{1 + k/n}{1 - k + 2/n} \right)
AIC: \log(|2 \cdot \text{log-likelihood}| / n) + 2 \frac{k}{n}
BIC: \log(|2 \cdot \text{log-likelihood}| / n) + 2 \frac{k}{n} \log(n)
GCV: \frac{|2 \cdot \text{log-likelihood}|}{n (1 - k/n)^2}

Here k is the degrees of freedom (DF), which depends on the penalty parameters λ.
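
As a rough illustration, the AIC, BIC, and GCV expressions above can be transcribed into a small base R helper. The function name `selection_criteria` and the input values are hypothetical; cAIC is left out of this sketch because the grouping in its denominator is not spelled out above.

```r
# loglik: log-likelihood of the fitted model
# n: number of observations; k: degrees of freedom (DF)
selection_criteria <- function(loglik, n, k) {
  dev <- abs(2 * loglik)  # |2 * log-likelihood|
  c(
    AIC = log(dev / n) + 2 * k / n,
    BIC = log(dev / n) + 2 * k / n * log(n),
    GCV = dev / (n * (1 - k / n)^2)
  )
}

# Hypothetical inputs chosen only to exercise the formulas
crit <- selection_criteria(loglik = -150, n = 100, k = 5)
```

All three criteria increase with k for a fixed log-likelihood, which is what lets them trade goodness of fit against model size along the path of lambdas.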

A list containing the profile (the path of lambdas with the corresponding model selection criterion values), the optimal lambdas, and the optimal model. The criterion reported is the one selected by type, taken from a list of model selection criteria values that includes the average of the negative log-likelihood values and the corrected AIC for each fold of the data.

cvfit: the fitted model based on the optimal lambdas.
lhat: the optimal lambdas, which attain the minimum of the model selection criterion.
profile: a data frame containing the path of lambdas and the corresponding model selection criterion, which is determined by the type.

Chong Ma, chongma8903@gmail.com.

\insertRef{ma2019structuralsmog}{smog}

smog.default, smog.formula, predict.smog, plot.smog.

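library(smog)

# generate design matrix x
set.seed(2018)
n=100;p=20   # sample size and number of prognostic covariates
s=10         # signal-to-noise variance ratio used to scale the noise below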
x=matrix(0,n,1+2*p)
x[,1]=sample(c(0,1),n,replace = TRUE)
x[,seq(2,1+2*p,2)]=matrix(rnorm(n*p),n,p)
x[,seq(3,1+2*p,2)]=x[,seq(2,1+2*p,2)]*x[,1]

g=c(p+1,rep(1:p,rep(2,p)))  # groups 
v=c(0,rep(1,2*p))           # penalization status
label=c("t",rep(c("prog","pred"),p))  # type of predictor variables

# generate beta
beta=c(rnorm(13,0,2),rep(0,ncol(x)-13))
beta[c(2,4,7,9)]=0

# generate y
data=x%*%beta
noise=rnorm(n)
snr=as.numeric(sqrt(var(data)/(s*var(noise))))
y=data+snr*noise

cvfit=cv.smog(x,y,g,v,label,type = "GCV", family="gaussian")