cv.EBglmnet: Cross Validation (CV) Function to Determine Hyperparameters...

Description

The degree of shrinkage, or equivalently the number of non-zero effects selected by EBglmnet, is controlled by the hyperparameters in the prior distribution, which can be chosen via cross validation (CV). This function performs k-fold CV for hyperparameter selection and outputs the model fit obtained with the optimal hyperparameters. It therefore runs EBglmnet (k x n_parameters + 1) times: once per fold for each candidate hyperparameter setting, plus one final fit. By default, EBlasso-NE tests 20 λs, EBEN tests an additional 10 αs (thus a total of 200 pairs of hyperparameters), and EBlasso-NEG tests up to 25 pairs of (a,b).
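As a rough worked example of the run count described above, the following base R sketch evaluates k x n_parameters + 1 for the default grids listed in this section. The variable names are illustrative and are not part of the package, and the EBlasso-NEG figure is an upper bound because that prior tests up to 25 pairs.

```r
# Number of EBglmnet fits made by one cv.EBglmnet call:
# one fit per fold per candidate hyperparameter setting,
# plus one final fit with the selected setting.
nfolds <- 5  # default number of folds

# Candidate settings per method, taken from the description above.
n_settings <- c(
  "EBlasso-NE"  = 20,       # 20 lambdas
  "EBEN"        = 20 * 10,  # 20 lambdas x 10 alphas
  "EBlasso-NEG" = 25        # at most 25 (a, b) pairs
)

n_fits <- nfolds * n_settings + 1
print(n_fits)
```

The count grows linearly in nfolds, which is why leave-one-out CV becomes expensive on large datasets.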

Usage

cv.EBglmnet(x, y, family=c("gaussian","binomial"),
		prior= c("lassoNEG","lasso","elastic net"), nfolds=5, 
		foldId, Epis = FALSE, group = FALSE, verbose = 0)

Arguments

x

input matrix of dimension n x p; each row is an observation vector, and each column is a candidate variable. When epistasis is considered, users do not need to create a giant matrix including both main and interaction terms. Instead, x should always be the matrix corresponding to the p main effects, and cv.EBglmnet will generate the interaction terms dynamically at run time.

y

response variable. Continuous for family="gaussian", and binary for family="binomial". For a binary response, y can be a Boolean or numeric vector, or a factor.

family

model type taking values of "gaussian" (default) or "binomial".

prior

prior distribution to be used. Takes the values "lassoNEG" (default), "lasso", and "elastic net". All priors produce sparse estimates of the regression coefficients; see Details for choosing priors.

nfolds

number of folds for CV. Typically nfolds >= 3. Although nfolds can be as large as the sample size (leave-one-out CV), this is computationally intensive for large datasets. The default is nfolds=5.

foldId

an optional vector of values between 1 and nfolds identifying which fold each observation is assigned to. If not supplied, each of the n samples is randomly assigned to one of the nfolds folds.

Epis

Boolean parameter for including two-way interactions. By default, Epis = FALSE. When Epis = TRUE, EBglmnet will take all pair-wise interaction effects into consideration. EBglmnet does not create a giant matrix for all the p(p+1)/2 effects. Instead, it dynamically allocates the memory for the nonzero effects identified in the model, and reads the corresponding variables from the original input matrix x.

group

Boolean parameter for group EBlasso (currently only available for the "lassoNEG" prior). This parameter is only valid when Epis = TRUE, and is set to FALSE by default. When Epis = TRUE and group = TRUE, the hyperparameter controlling the degree of shrinkage is further scaled so that the scale hyperparameter for interaction terms differs from that of main effects by a factor of √{p(p-1)/2}. When p is large, e.g., several thousand genetic markers, the total number of effects can easily exceed 10 million, and group EBlasso helps to reduce the interference of spurious correlation and noise accumulation.

verbose

parameter that controls the level of message output from EBglmnet. It takes values from 0 to 5; a larger value displays more messages. 0 is recommended for CV to avoid excessive output. The default, verbose = 0, gives the minimum message output.
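Two of the arguments above lend themselves to a small base R sketch: building a foldId vector by hand, and the effect counts behind Epis and group. This is only one way to construct balanced folds and is not necessarily how cv.EBglmnet assigns them internally; the seed and all variable names are illustrative assumptions.

```r
# --- foldId: assign each of n observations to one of nfolds folds ---
set.seed(1)    # illustrative seed so the assignment is reproducible
n      <- 50   # e.g. the 50 rows of state.x77
nfolds <- 5
# Repeat 1..nfolds out to length n, then shuffle, so fold sizes are balanced.
foldId <- sample(rep(seq_len(nfolds), length.out = n))
print(table(foldId))

# --- Epis / group: effect counts for p main effects ---
p <- 7                                   # e.g. the 7 predictors of state.x77
n_interactions <- p * (p - 1) / 2        # pairwise interaction terms
n_effects      <- p + n_interactions     # equals p * (p + 1) / 2
group_factor   <- sqrt(n_interactions)   # the sqrt(p(p-1)/2) factor for group = TRUE

# With thousands of markers, the total number of effects passes 10 million.
p_large         <- 5000
n_effects_large <- p_large * (p_large + 1) / 2
print(c(n_effects = n_effects, n_effects_large = n_effects_large))
```

A vector built this way can be supplied through the foldId argument together with a matching nfolds, which keeps the fold assignment fixed across repeated calls with different priors.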

Details

The three priors in EBglmnet all contain hyperparameters that control how heavy the tail probabilities are. Different hyperparameter values yield different numbers of non-zero effects retained in the model. Appropriate values are required to obtain optimal results, and CV is the most commonly used method for selecting them. For the Gaussian model, CV determines the hyperparameter values that yield the minimum mean square error. For the binomial model, CV calculates the mean log likelihood in each left-out fold and chooses the values that yield the maximum mean log likelihood over the k folds. See EBglmnet for the details of the hyperparameters in each prior distribution.
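The two CV criteria named above can be illustrated with their standard textbook definitions. The sketch below uses made-up held-out values for a single fold; it assumes the usual mean square error and mean Bernoulli log likelihood, and is not taken from the package source.

```r
# Held-out metrics for one fold (toy values, not package output).

# Gaussian model: mean square error between observed and predicted responses.
y_obs  <- c(1.0, 2.0, 3.0, 4.0)
y_pred <- c(1.5, 2.0, 2.5, 5.0)
mse <- mean((y_obs - y_pred)^2)

# Binomial model: mean Bernoulli log likelihood of the held-out labels,
# given predicted probabilities of the positive class.
y_bin <- c(1, 0, 1, 1)
p_hat <- c(0.9, 0.2, 0.6, 0.7)
mean_logL <- mean(y_bin * log(p_hat) + (1 - y_bin) * log(1 - p_hat))

print(c(mse = mse, mean_logL = mean_logL))
```

Lower is better for the MSE and higher is better for the log likelihood, which is why the selection rule differs between the two families.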

Value

CrossValidation

matrix of CV results with the following columns:
column 1: hyperparameter1
column 2: hyperparameter2
column 3: prediction metrics/Criteria
column 4: standard error in the k-fold CV.

The prediction metric is the mean square error (MSE) for the Gaussian model and the mean log likelihood (logL) for the binomial model.

optimal hyperparameter

the hyperparameters that yield the smallest MSE or the largest logL.

fit

model fit using the optimal parameters computed by CV. See EBglmnet for values in this item.

WaldScore

the Wald score for the posterior distribution. See Huang, Martin, et al. (2014b) for using the Wald score to identify the set of significant effects.

Intercept

model intercept. This parameter is not shrunk (assumes uniform prior).

residual variance

the residual variance if the Gaussian family is assumed in the GLM

logLikelihood

the log Likelihood if the Binomial family is assumed in the GLM

hyperparameters

the hyperparameter(s) used to fit the model

family

the GLM family specified in this function call

prior

the prior used in this function call

call

the call that produced this object

nobs

number of observations

nfolds

number of folds in CV
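The relationship between the CrossValidation matrix and the optimal hyperparameter entry described above can be sketched as follows. The matrices, their column names, and the helper pick_best are hypothetical and exist only for illustration; the sketch assumes the four-column layout listed under CrossValidation.

```r
# Hypothetical helper: pick the hyperparameter pair with the best metric
# from a matrix laid out like the CrossValidation value
# (hyperparameter1, hyperparameter2, metric, standard error).
pick_best <- function(cv, family = c("gaussian", "binomial")) {
  family <- match.arg(family)
  metric <- cv[, 3]
  # Gaussian: smallest MSE wins. Binomial: largest mean logL wins.
  best <- if (family == "gaussian") which.min(metric) else which.max(metric)
  cv[best, 1:2]
}

# Made-up CV results for a Gaussian model (column 3 is MSE).
cv_mse <- cbind(
  hyperparameter1 = c(0.1, 0.5, 1.0),
  hyperparameter2 = c(0.1, 0.1, 0.1),
  metric          = c(0.62, 0.48, 0.55),
  se              = c(0.05, 0.04, 0.06)
)

# Made-up CV results for a binomial model (column 3 is mean logL).
cv_logL <- cv_mse
cv_logL[, "metric"] <- c(-0.70, -0.66, -0.52)

print(pick_best(cv_mse, "gaussian"))
print(pick_best(cv_logL, "binomial"))
```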

Author(s)

Anhui Huang and Dianting Liu
Dept of Electrical and Computer Engineering, Univ of Miami, Coral Gables, FL

References

Cai, X., Huang, A., and Xu, S. (2011). Fast empirical Bayesian LASSO for multiple quantitative trait locus mapping. BMC Bioinformatics 12, 211.

Huang, A., Xu, S., and Cai, X. (2013). Empirical Bayesian LASSO-logistic regression for multiple binary trait locus mapping. BMC Genetics 14(1), 5.

Huang, A., Xu, S., and Cai, X. (2014a). Empirical Bayesian elastic net for multiple quantitative trait locus mapping. Heredity 10.1038/hdy.2014.79

Huang, A., E. Martin, et al. (2014b). Detecting genetic interactions in pathway-based genome-wide association studies. Genet Epidemiol 38(4): 300-309.

Examples

rm(list = ls())
library(EBglmnet)

# Use the R built-in data set state.x77
y <- state.x77[, "Life Exp"]
xNames <- c("Population", "Income", "Illiteracy", "Murder", "HS Grad", "Frost", "Area")
x <- state.x77[, xNames]

# Gaussian model
# lassoNEG prior (the default)
out <- cv.EBglmnet(x, y)
out$fit
# lasso prior
out <- cv.EBglmnet(x, y, prior = "lasso")
out$fit
# elastic net prior
out <- cv.EBglmnet(x, y, prior = "elastic net")
out$fit

# Binomial model
# Create a binary response variable
yy <- y > mean(y)
out <- cv.EBglmnet(x, yy, family = "binomial")
out$fit
# With epistatic effects
out <- cv.EBglmnet(x, yy, family = "binomial", prior = "elastic net", Epis = TRUE)
out$fit

EBglmnet documentation built on May 2, 2019, 2:46 a.m.