Description Usage Arguments Details Value Author(s) References Examples
Computes the entire Elastic-Net solution for the regression coefficients of a Selection Index simultaneously for all values of the penalization parameter via the Coordinate Descent (CD) algorithm.
1 2 3 4 5 6 7 8 9 10 11 | SSI(y, X = NULL, b = NULL, Z = NULL, K, indexK = NULL,
h2 = NULL, trn = seq_along(y), tst = seq_along(y),
subset = NULL, alpha = 1, lambda = NULL, nLambda = 100,
method = c("CD1","CD2"), tol = 1E-4, maxIter = 500,
saveAt = NULL, name = NULL, mc.cores = 1, verbose = TRUE)
SSI_CV(y, X = NULL, b = NULL, Z = NULL, K, indexK = NULL,
h2 = NULL, trn.CV = seq_along(y), alpha = 1,
lambda = NULL, nLambda = 100, nCV = 1, nFolds = 5,
seed = NULL, method = c("CD1","CD2"), tol = 1E-4,
maxIter = 500, name = NULL, mc.cores = 1, verbose = TRUE)
|
y |
Response variable |
X |
Design matrix for the fixed effects. When |
b |
Vector of fixed effects. When |
Z |
Design matrix for the random effects. When |
K |
Kinship relationships matrix. This can be a name of a binary file where the matrix is stored |
indexK |
Vector of integers indicating which columns and rows will be read when |
h2 |
Heritability of the response variable. When |
trn |
Vector of integers indicating which individuals are in training set. Default |
tst |
Vector of integers indicating which individuals are in testing set (prediction set). Default |
trn.CV |
Vector of integers indicating which individuals are to be considered for the cross-validation (CV).
Default is |
subset |
A two-elements numeric vector c(m,M) to fit the model only for the mth
subset out of M subsets that the testing set will be divided into. Results can be automatically saved when |
alpha |
Numeric between 0 and 1 indicating the weights given to the L1 and L2-penalties |
lambda |
Penalization parameter sequence vector. Default is alpha=0 the grid is generated starting from a maximum equal to 5
|
nLambda |
Number of lambdas generated when |
nFolds |
Number of non-overlaping folds in which the data is split into to do cross-validation. Options: 2,3,5,10 or 'n'. When |
seed |
Numeric seed to fix randomization when creating folds for cross-validation. If it is a vector, a number equal to its length of CV repetitions are performed |
nCV |
Number of CV repetitions to be performed. Default is |
method |
One of:
Default is |
mc.cores |
Number of cores used. The analysis is run in parallel when |
tol |
Maximum error between two consecutive solutions of the iterative algorithm to declare convergence |
maxIter |
Maximum number of iterations to run at each lambda step before convergence is reached |
saveAt |
Prefix name that will be added to the output files name to be saved, this may include a path. Regression coefficients
are saved as a binary file (single-precision: 32 bits, 7 significant digits). Default |
name |
Name given to the output for tagging purposes. Default |
verbose |
|
The basic linear mixed model that relates phenotypes with genetic values is of the form
where y is a vector with the response, b is the vector of fixed effects, g is the vector of the genetic values of the genotypes, e is the vector of environmental residuals, and X and Z are design matrices conecting the fixed and genetic effects with replicates. Genetic values are assumed to follow a Normal distribution as g ~ N(0,σ2uK), and environmental terms are assumed e ~ N(0,σ2eI).
The resulting vector of genetic values u = Z g will therefore follow u ~ N(0,σ2uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.
The values utst = (ui), i = 1,2,...,ntst, for a testing set are estimated individual-wise using (as predictors) all available observations in a training set as
where βi is a vector of weights that are found separately for each individual in the testing set, by minimizing the penalized mean squared error function
where Gtrn,tst(i) is the ith column of the sub-matrix of G whose rows correspond to the training set and columns to the testing set, Gtrn,trn is the sub-matrix corresponding to the training set, and λ0 = (1 - h2)/h2 is a shrinkage parameter expressed in terms of the heritability, h2 = σ2u/(σ2u + σ2e), λ is the penalization parameter, and J(βi) is a penalty function given by
where 0 ≤ α ≤ 1, and ||βi||1 = ∑|βij| and ||βi||22 = ∑βij2 are the L1 and (squared) L2-norms, respectively.
Each individual solution is found using the 'solveEN' function (see help(solveEN)
) by setting the parameter P
equal to
Gtrn,trn + λ0I
and cov
equal to
Gtrn,tst(i).
For cross-validation, data specified in trn.CV
is divided into k folds and the SSI is sequentially calculated for (all individuals in) one fold (testing set) using information from the remaining folds (training set).
Function 'SSI' returns a list object of the class 'SSI' for which methods coef
, fitted
, plot
, and summary
exist. Functions plotNet
and plotPath
can be also used. It contains the elements:
b
: solutions for the fixed effects including the intercept.
file_beta
: path where regression coefficients were saved.
alpha
: value for the elastic-net weights used.
lambda
: matrix with the sequence of values of lambda used (for each individual in rows).
df
: degrees of freedom (averaged across individuals), number of non-zero predictors at each solution.
Function 'SSI_CV' returns a list object of the class 'SSI_CV' for which methods plot
and summary
exist, and contains the elements:
b
: solutions for the fixed effects including the intercept for each fold using information from the remainig folds.
h2
: heritability estimated within each fold using information from the remainig folds.
folds
: matrix containing the folds used for the cross-validation.
accuracy
: matrix with the correlation between observed and predicted values (in testing set) within each fold (in rows).
MSE
: matrix with the mean squared error of prediction (in testing set) within each fold (in rows).
lambda
: matrix with the sequence of values of lambda used (averaged across individuals) within each fold (in rows).
df
: matrix with the degrees of freedom (averaged across individuals) within each fold (in rows).
Marco Lopez-Cruz (lopezcru@msu.edu) and Gustavo de los Campos
Efron B, Hastie T, Johnstone I, Tibshirani R (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
Friedman J, Hastie T, Höfling H, Tibshirani R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.
Hoerl AE, Kennard RW (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
Lush JL (1947). Family merit an individual merit as bases for selection. Part I. The American Naturalist, 81(799), 241–261.
Tibshirani R (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B, 58(1), 267–288.
VanRaden PM (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.
Zou H, Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2), 301–320
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | require(SFSI)
data(wheatHTP)
X = scale(X[1:300,]) # Subset and scale markers
G = tcrossprod(X)/ncol(X) # Genomic relationship matrix
y = scale(Y[1:300,"YLD"]) # Subset response variable
# Training and testing sets
tst = sample(seq_along(y),ceiling(0.3*length(y)))
trn = (seq_along(y))[-tst]
# Calculate heritability using all data
fm = fitBLUP(y,K=G)
h2 = fm$varU/(fm$varU + fm$varE)
# Sparse selection index
fm = SSI(y,K=G,h2=h2,trn=trn,tst=tst)
#---------------------------------------------------
# Predicting a testing set using a value of lambda
# obtained from cross-validation in a traning set
#---------------------------------------------------
# Run the cross validation in training set
fm1 = SSI_CV(y,K=G,h2=h2,trn.CV=trn,nFolds=5)
lambda = summary(fm1)$optCOR["mean","lambda"]
# Fit the index with the obtained lambda
fm2 = SSI(y,K=G,h2=h2,trn=trn,tst=tst,lambda=lambda)
summary(fm2)$accuracy # Testing set accuracy
# Compare the accuracy with that of the non-sparse index
fm3 = SSI(y,K=G,h2=h2,trn=trn,tst=tst,lambda=0)
summary(fm3)$accuracy
## Not run:
# Obtain an 'optimal' lambda by repeating the CV several times
fm12 = SSI_CV(y,K=G,h2=h2,trn.CV=trn,nCV=10)
plot(fm12,fm1)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.