knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This tutorial demonstrates how to use the SEAGLE
package when the genotype data are from GWAS studies (.bed + .bim + .fam) or next generation sequencing studies (.vcf). We recommend using PLINK1.9 available from https://www.cog-genomics.org/plink/1.9/ to generate a .raw file (see https://www.cog-genomics.org/plink/2.0/formats#raw) for each of the SNP sets. The .raw file records the allelic dosage for each SNP.
Examples files containing 1kg_phase1_chr22.bed
, 1kg_phase1_chr22.bim
, 1kg_phase1_chr22.fam
, and 1kg_phase1_chr22.vcf
files and a corresponding gene list in glist-hg19
can be downloaded via the SEAGLE-example3-data.zip
file from http://jocelynchi.com/SEAGLE/SEAGLE-example3-data.zip. To follow along with the PLINK conversion procedures, you will first need to install PLINK1.9 from https://www.cog-genomics.org/plink/1.9/ and unzip the example SEAGLE-example3-data.zip
file. Afterwards, you can type the PLINK commands below in a command prompt or terminal window in the folder where these files are located. This will produce .raw files that you can read into R to obtain the relevant genetic marker matrix ${\bf G}$ for input into SEAGLE
.
For GWAS data in the .bed + .bim + .fam format, we recommend first converting to the .raw format with PLINK1.9. To illustrate the analysis of a SNP set from a single gene, we employ the gene ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from gene ACR.
plink --bfile 1kg_phase1_chr22 --make-set glist-hg19 --gene ACR --out ACR --export A
To illustrate the analysis of a SNP set from multiple genes, we employ the sequence of genes from A4GALT and ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from genes A4GALT and ACR.
plink --bfile 1kg_phase1_chr22 --make-set glist-hg19 --gene A4GALT ACR --out A4GALT_ACR --export A
For next generation sequencing data in the .vcf format, we recommend first converting to the .raw format with PLINK1.9. To illustrate the analysis of a SNP set from a single gene, we employ the gene ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from gene ACR.
plink --vcf 1kg_phase1_chr22.vcf --make-set glist-hg19 --gene ACR --out ACR --export A
To illustrate the analysis of a SNP set from multiple genes, we employ the sequence of genes from A4GALT and ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from genes A4GALT and ACR.
plink --vcf 1kg_phase1_chr22.vcf --make-set glist-hg19 --gene A4GALT ACR --out A4GALT_ACR --export A
Now that we have the genotype data in the .raw format, we can load it into R as follows. We'll first load the SEAGLE
package.
library(SEAGLE)
The following code will load the ACR.raw
file produced from the PLINK conversion for GWAS data for single gene analysis described above. Of course, the actual file location will depend on where you stored the results of the PLINK conversion on your local drive.
acr <- read.delim("ACR.raw", sep=" ")
As an example, we've included the ACR.raw
file in the extdata
folder of this package. The following code loads that file so we can use it in this tutorial.
acr_loc <- system.file("extdata", "ACR.raw", package = "SEAGLE") acr <- read.delim(acr_loc, sep=" ")
We can take a quick look at the resulting acr
object. The first 6 columns correspond to identifying information.
dim(acr) head(acr)
The remaining columns contain our genetic marker matrix ${\bf G}$. We'll extract these to input into SEAGLE
.
G <- as.matrix(acr[,-c(1:6)]) dim(G)
As an example, we'll generate synthetic phenotype, ${\bf y}$, and covariate data, ${\bf X}$ and ${\bf E}$.
# Determine number of individuals and loci n <- dim(G)[1] L <- dim(G)[2] # Generate synthetic phenotype and covariate data set.seed(1) y <- 2 * rnorm(n) set.seed(2) X <- rnorm(n) set.seed(3) E <- rnorm(n)
Now that we have ${\bf y}$, ${\bf X}$, ${\bf E}$, and ${\bf G}$, we can prepare our data for input into SEAGLE
. We will input our ${\bf y}$, ${\bf X}$, ${\bf E}$, and ${\bf G}$ into the prep.SEAGLE
function. The intercept = 0
parameter indicates that the first column of ${\bf X}$ is not the all ones vector for the intercept.
This preparation procedure formats the input data for the SEAGLE
function by checking the dimensions of the input data. It also pre-computes a QR decomposition for $\widetilde{\bf X} = \begin{pmatrix} {\bf 1}{n} & {\bf X} & {\bf E} \end{pmatrix}$, where ${\bf 1}{n}$ denotes the all ones vector of length $n$.
objSEAGLE <- prep.SEAGLE(y=y, X=X, intercept=0, E=E, G=G)
Now we can input the prepared data into the SEAGLE
function to compute the score-like test statistic $T$ and its corresponding p-value. The init.tau
and init.sigma
parameters are the initial values for estimating $\tau$ and $\sigma$ in the REML EM algorithm.
res <- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5) res$T res$pv
The score-like test statistic $T$ for the G$\times$E effect and its corresponding p-value can be found in res$T
and res$pv
, respectively.
Many thanks to Yueyang Huang for his help with generating the example data and PLINK1.9 code for this tutorial.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.