1) Introduction

The GGEE package implements the "Gene-Gene Eigen Epistasis" approach to detect epistasis at the gene level in genome wide association studies (GWAS). This approach compute interaction variables for each gene pair then uses a penalized regression method based on group lasso to select the significant main or interaction effects.

The two main functions of this package are BuiltEpiVar and GLmodel. BuiltEpiVar allows to calculate interaction variables under four different interaction variable modeling approaches. The Eigen-epistasis approach find for each gene pair a component defined as the linear combination of gene markers (SNPs) having the highest correlation with the phenotype. The three other modeling approaches are inspired by previous literature proposals, they compute interaction variable using: Principal Component Analysis (PCA), Partial Least squares (PLS) or Canonical-Correlation Analysis (CCA). GLmodel fits a group lasso model on the genetic data set enhanced by interaction variables then uses a screen and clean procedure in order to compute p-values for each group. A group is either made with the SNPs from a given gene or of interaction terms relative to a given gene-pair interaction.

Additionally, the package allows to generate genotype and phenotype data under two phenotypic models.

2) Generating genotype and phenotype data

The GGEE package allows to generate gene structured data and associated continuous phenotype according to the model : $$ \boldsymbol{y}= \boldsymbol{X}^T \boldsymbol{\beta} + \boldsymbol{Z}^T \boldsymbol{\gamma} + \boldsymbol{\epsilon} $$ Where $\boldsymbol{y} \in \mathbb{R}^n$ denotes the vector of trait values for $n$ individuals, $\mathbf{X} \in {1,2,3}^{n \times p}$ represents the SNP matrix, $\mathbf{Z}$ the matrix gathering interaction variables and $\boldsymbol{\epsilon} \in \mathbb{R}^n$ a gaussian error term. The columns of $\mathbf{X}$ are structured on $G$ non overlapping genes. Each gene is described by a given number of SNPs $p_g$ where $\sum_g p_g=p$. The matrix of interaction $\mathbf{Z}$ is structured into $G(G-1)/2$ submatrices each submatrice being the group of interaction variables for a specific pair of genes.

The two functions simGeno and simPheno allows to respectively simulate genotype and phenotype data.

library(GGEE)
sizeGenesMain <- rep(6,2) # 2 genes with 6 SNPs 
sizeGenesPair <- rep(6,2) # 2 genes with 6 SNPs 
sizeGenesRemain <- rep(6,4) # 4 genes with 6 SNPs
SameMainPair <- FALSE  # Specify that genes with interaction effects will not have main effects 
N<- 600
causalSNPnb <- 2
corr <- 0.8
MAFcausalSNP=0.2


Geno <- simGeno(N=N, corr=corr, sizeGenesMain=sizeGenesMain, sizeGenesPair=sizeGenesPair,
     sizeGenesRemain=sizeGenesRemain, SameMainPair=SameMainPair, MAFcausalSNP=MAFcausalSNP,
     causalSNPnb=causalSNPnb)

With these parameters the function simGeno simulate a data set of 6 genes, each one composed of 6 SNPs, for 600 individuals. The 2 first genes are considered to have main effects and the gene 3 and gene 4 to have an interaction effect. For the four causal genes their 2 first SNPs are considered as causal variants. Rather than a defined number of causal SNPs by causal gene, it is possible to use a portion of causal SNPs with the option causalSNPportion. In this case the option causalSNPnb has to be NULL. In both cases, the SNPs considered as causal are the first listed in the gene. The MAF of each SNP is randomly set between the values minMAF and maxMaf (by befault minMAF=0.05 and maxMaf=0.5). For the causal SNPs the MAF value correspond to MAFcausalSNP. The correlation between SNPs belonging to the same gene is set to corr=0.8.

The output of Geno contain the following elements :

Geno$X[1:5,1:8]
Geno$listGenesSNP
Geno$MainEff
Geno$GenePair
Geno$MAF

Once the genotype matrix obtained, phenotype values can be simulated through the function simPheno. The function takes as parameters:

where $\mathcal{C}$ and $\mathcal{C}^2$ are respectively the set of causal SNPs and causal interactions, and $\epsilon_i$ a random Gaussian variable. For each causal gene $g$ a coefficient $\beta_{g}$ is assigned to the standardized sum of the causal SNPs. for the interactions, in the first model "SNPproduct", all the causal SNPs from a causal pair $(r,s)$ are pairwise multiplied and the interaction of the causal pair is represented by the standardized sum of the products. In the second model "PCproduct", the interaction is represented by the standardized product of the first PCA component $\boldsymbol{C}{.1}^r$ of gene $r$ and the first PCA component $\boldsymbol{C}{.1}^s$ of gene $s$. The computation of PCA components is realized on the whole gene and not only on the causal SNPs.

# possible values for coef Beta or Gamma
pvBeta <- c(2,2) 
pvGamma <-  c(2,2)
r2 <- 0.4

Pheno <- simPheno(X=Geno$X, listGenes=Geno$listGenesSNP, MainEff=Geno$MainEff, GenePair=Geno$GenePair,
        model="SNPproduct", pvBeta=pvBeta, pvGamma=pvGamma, r2=r2, causalSNPnb=causalSNPnb)

The outputs of the function simPheno includes

head(Pheno$y)
head(Pheno$G)
head(Pheno$GG)
Pheno[c("R2T","R2I","R2S")]
Pheno$caract

3) The G-GEE method

Once genotype and phenotype data are obtained we can apply the G-GEE approach to seek for interaction effects. The first step is to create interaction variables from each gene couple, the second is to test for potential main or interaction effects.

Interaction variable modeling can be done with the function BuiltEpiVar. The function takes as parameters the matrix of genotype $X$, the vector of phenotypic traits $y$, a list listGenesSNP that indicate the names of the SNPs composing each gene and nbcomp the number of components to consider to compute interaction variables.

Subjects need to be randomly divided into two equal-sized sets for training and testing. The training group is used to construct the interaction variables and to estimate the Group LASSO coefficients. The test group is used for the cleaning step to compute the permuted p-values. idSubs is a list containing the indices of the samples used in the test or training set.

Four different methods can be use to create interaction variables :

Here we show an example using "GGEE"option. As this method can compute only one interaction by gene couple, the parameter nbcomp doesn't need to be used.

# Distribution of the samples in training or test set:
n <- dim(Geno$X)[1]
portionTrain <- 1/2
idTrain <- sample(1:n, size=n*portionTrain)
idTest <- (1:n)[-idTrain]
idSubs <- list(idTrain=idTrain[order(idTrain)], idTest=idTest)


Int <- BuiltEpiVar(Geno$X, Pheno$y, method="GGEE", listGenesSNP=Geno$listGenesSNP, idSubs=idSubs)

Int is a list composed of the interaction variable matrices (XIntTrain and XIntTest) and a vector interLength indicating the number of interaction variables for each couple.

head(Int$XIntTrain)
head(Int$XIntTest)
Int$interLength

Test for potential main or interaction effects is done with the function GLmodel. Parameters include nlambda, the length of the grid of possible lambda values, limitLambda the number of the largest lambda values among the grid to consider for the cross validation and lambda.cri the criteria for lambda selection (minimum or oneSE value).

res <- GLmodel(Geno$X, Pheno$y, XIntTrain=Int$XIntTrain, XIntTest=Int$XIntTest,  idSubs =idSubs, 
               interLength=Int$interLength, listGenesSNP=Geno$listGenesSNP, nlambda=100,
               limitLambda=25, lambda.cri="min")

The outputs of GLmodel contain:

res

The GGEE package contains a plot function plot.GLmodel. The function takes as parameter a GLmodel object and provides a representation of cross validation results. It depicts the value of the cross validation error for each lambda considered and thus allows to identify the optimal lambda values depending of the criteria of interest (minimal or oneSE). This plot allows to verify that enough lambda values was considered for the cross validation. If the curve doesn't show a clear minimal value the parameter limitLambda of the GLmodel has to be enlarged.

plotGLmodel(res)


vstanislas/GGEE documentation built on May 28, 2021, 12:50 p.m.