SSI: Sparse Selection Index

Description Usage Arguments Details Value Author(s) References Examples

View source: R/SSI.R

Description

Computes the entire Elastic-Net solution for the regression coefficients of a Selection Index simultaneously for all values of the penalization parameter via the Coordinate Descent (CD) algorithm.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SSI(y, X = NULL, b = NULL, Z = NULL, K, indexK = NULL,
    h2 = NULL, trn = seq_along(y), tst = seq_along(y), 
    subset = NULL, alpha = 1, lambda = NULL, nLambda = 100,
    method = c("CD1","CD2"), tol = 1E-4, maxIter = 500, 
    saveAt = NULL, name = NULL, mc.cores = 1, verbose = TRUE)
    
SSI_CV(y, X = NULL, b = NULL, Z = NULL, K, indexK = NULL, 
      h2 = NULL, trn.CV = seq_along(y), alpha = 1, 
      lambda = NULL, nLambda = 100, nCV = 1, nFolds = 5, 
      seed = NULL, method = c("CD1","CD2"), tol = 1E-4,
      maxIter = 500, name = NULL, mc.cores = 1, verbose = TRUE)

Arguments

y

Response variable

X

Design matrix for the fixed effects. When X=NULL, a vector of ones is constructed only for the intercept (default)

b

Vector of fixed effects. When b=NULL, only the intercept is estimated from training data using generalized least squares (default)

Z

Design matrix for the random effects. When Z=NULL an identity matrix is considered (default) thus G = K; otherwise G = Z K Z' is used

K

Kinship relationships matrix. This can be a name of a binary file where the matrix is stored

indexK

Vector of integers indicating which columns and rows will be read when K is the name of a binary file. Default indexK=NULL will read the whole matrix

h2

Heritability of the response variable. When h2=NULL (default), the heritability is calculated from training data from variance components estimated using Maximum Likelihood (ML)

trn

Vector of integers indicating which individuals are in training set. Default trn=seq_along(y) will consider all individuals as training

tst

Vector of integers indicating which individuals are in testing set (prediction set). Default tst=seq_along(y) will consider all individuals as testing

trn.CV

Vector of integers indicating which individuals are to be considered for the cross-validation (CV). Default is trn.CV=seq_along(y) will consider all individuals

subset

A two-elements numeric vector c(m,M) to fit the model only for the mth subset out of M subsets that the testing set will be divided into. Results can be automatically saved when saveAt parameter is provided and can be retrieved later using function collect (see help(collect)). Default is subset=NULL for no subsetting, then the model is fitted for all data

alpha

Numeric between 0 and 1 indicating the weights given to the L1 and L2-penalties

lambda

Penalization parameter sequence vector. Default is lambda=NULL, in this case a decreasing grid of 'nLambda' lambdas will be generated starting from a maximum equal to

max(abs(G[trn,tst])/alpha)
to a minimum equal to zero. If alpha=0 the grid is generated starting from a maximum equal to 5
nLambda

Number of lambdas generated when lambda=NULL

nFolds

Number of non-overlaping folds in which the data is split into to do cross-validation. Options: 2,3,5,10 or 'n'. When nFolds='n' leave-one-out CV is performed

seed

Numeric seed to fix randomization when creating folds for cross-validation. If it is a vector, a number equal to its length of CV repetitions are performed

nCV

Number of CV repetitions to be performed. Default is nCV=1

method

One of:

  • 'CD1': Computes the coefficients for a provided grid of lambdas common to all individuals in testing set.

  • 'CD2': Similar to 'CD1' but using a grid of lambdas specific to each individual in testing set.

Default is method='CD1'

mc.cores

Number of cores used. The analysis is run in parallel when mc.cores is greater than 1. Default is mc.cores=1

tol

Maximum error between two consecutive solutions of the iterative algorithm to declare convergence

maxIter

Maximum number of iterations to run at each lambda step before convergence is reached

saveAt

Prefix name that will be added to the output files name to be saved, this may include a path. Regression coefficients are saved as a binary file (single-precision: 32 bits, 7 significant digits). Default saveAt=NULL will no save any output

name

Name given to the output for tagging purposes. Default name=NULL will give the name of the method used

verbose

TRUE or FALSE to whether printing each step

Details

The basic linear mixed model that relates phenotypes with genetic values is of the form

y = X b + Z g + e

where y is a vector with the response, b is the vector of fixed effects, g is the vector of the genetic values of the genotypes, e is the vector of environmental residuals, and X and Z are design matrices conecting the fixed and genetic effects with replicates. Genetic values are assumed to follow a Normal distribution as g ~ N(02uK), and environmental terms are assumed e ~ N(02eI).

The resulting vector of genetic values u = Z g will therefore follow u ~ N(02uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.

The values utst = (ui), i = 1,2,...,ntst, for a testing set are estimated individual-wise using (as predictors) all available observations in a training set as

ui = β'i (ytrn - Xtrnb)

where βi is a vector of weights that are found separately for each individual in the testing set, by minimizing the penalized mean squared error function

-G'trn,tst(i) βi + 1/2 β'i(Gtrn,trn + λ0I)βi + λ J(βi)

where Gtrn,tst(i) is the ith column of the sub-matrix of G whose rows correspond to the training set and columns to the testing set, Gtrn,trn is the sub-matrix corresponding to the training set, and λ0 = (1 - h2)/h2 is a shrinkage parameter expressed in terms of the heritability, h2 = σ2u/(σ2u + σ2e), λ is the penalization parameter, and J(βi) is a penalty function given by

1/2(1-α)||βi||22 + α||βi||1

where 0 ≤ α ≤ 1, and ||βi||1 = ∑|βij| and ||βi||22 = ∑βij2 are the L1 and (squared) L2-norms, respectively.

Each individual solution is found using the 'solveEN' function (see help(solveEN)) by setting the parameter P equal to Gtrn,trn + λ0I and cov equal to Gtrn,tst(i).

For cross-validation, data specified in trn.CV is divided into k folds and the SSI is sequentially calculated for (all individuals in) one fold (testing set) using information from the remaining folds (training set).

Value

Function 'SSI' returns a list object of the class 'SSI' for which methods coef, fitted, plot, and summary exist. Functions plotNet and plotPath can be also used. It contains the elements:

Function 'SSI_CV' returns a list object of the class 'SSI_CV' for which methods plot and summary exist, and contains the elements:

Author(s)

Marco Lopez-Cruz (lopezcru@msu.edu) and Gustavo de los Campos

References

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.

Friedman J, Hastie T, Höfling H, Tibshirani R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.

Hoerl AE, Kennard RW (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.

Lush JL (1947). Family merit an individual merit as bases for selection. Part I. The American Naturalist, 81(799), 241–261.

Tibshirani R (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B, 58(1), 267–288.

VanRaden PM (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.

Zou H, Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2), 301–320

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
  require(SFSI)
  data(wheatHTP)
  X = scale(X[1:300,])        # Subset and scale markers
  G = tcrossprod(X)/ncol(X)   # Genomic relationship matrix
  y = scale(Y[1:300,"YLD"])   # Subset response variable
  
  # Training and testing sets
  tst = sample(seq_along(y),ceiling(0.3*length(y)))
  trn = (seq_along(y))[-tst]

  # Calculate heritability using all data
  fm = fitBLUP(y,K=G)
  h2 = fm$varU/(fm$varU + fm$varE)
  
  # Sparse selection index
  fm = SSI(y,K=G,h2=h2,trn=trn,tst=tst)
  
  #---------------------------------------------------
  # Predicting a testing set using a value of lambda
  # obtained from cross-validation in a traning set
  #---------------------------------------------------
  # Run the cross validation in training set
  fm1 = SSI_CV(y,K=G,h2=h2,trn.CV=trn,nFolds=5)
  lambda = summary(fm1)$optCOR["mean","lambda"]
  
  # Fit the index with the obtained lambda
  fm2 = SSI(y,K=G,h2=h2,trn=trn,tst=tst,lambda=lambda)
  summary(fm2)$accuracy        # Testing set accuracy

  # Compare the accuracy with that of the non-sparse index
  fm3 = SSI(y,K=G,h2=h2,trn=trn,tst=tst,lambda=0)
  summary(fm3)$accuracy
  
  ## Not run: 
  # Obtain an 'optimal' lambda by repeating the CV several times
  fm12 = SSI_CV(y,K=G,h2=h2,trn.CV=trn,nCV=10)
  plot(fm12,fm1)
  
## End(Not run)

MarcooLopez/SFSI_data documentation built on April 15, 2021, 10:53 a.m.