{#id .class width=300 height=300px}
A common problem associated with performing Genome Wide Association Studies (GWAS) is the abundance of false positive and false negative associated with population structure. One way of dealing with this issue is to first calculate Principal Components (PC) and then use those PCs as to account for population structure when running the GWAS. This method has shown to not only reduce false positives but also increase the power of the test. Here we present an R based package that can preform GWAS using PCA and user imputed covariate to calculate SNPs associated with a phenotype called “GWhEAT”. It will also check to see if the PC are in linear dependence to the covariates and remove the PCs if they are. Using the freely availably R studio software this packaged is designed to quickly and efficiently calculate a P-value for every SNP phenotype association. The results from this package are including but not limited to a Manhattan plot, QQ-plot and table with every recorded p-value.
For a full list of functions within GWhEAT see “Reference_Manual.pdf” this file contains not only the full list of functions but also a description of each. The pdf also has each functions arguments listed. And like with all R packages once GWhEAT is installed and loaded you can type ?function_name and that specific function’s full descriptions will appear in the help tab on RStudio.
First if you do not already have R and R studio installed on your computer head over to https://www.r-project.org/ and install the version appropriate for you machine. Once R and R studio are installed you will need to install the GWhEAT package since this is a working package in it’s early stages of development it’s only available through Github. To download files off Github first download and load the library of the packaged “devtools” using the code below.
#code to install package from github #install.packages("devtools") library(devtools)
Next using the code below download and install the package GWhEAT from Github. The bottom two line of code in the chunk below make sure the dependencies GWhEAT relies on are also downloaded and installed.
#install package install_github("stp4freedom/GWhEAT") library(GWhEAT)#package name #Load dependencies if (!require("pacman")) install.packages("pacman") pacman::p_load(ggplot2,knitr,gridExtra)
In order to have an effective tutorial you’ll need some data to play with, this code below downloads and loads into your environment data used for this tutorial.
if (!file.exists("GAPIT_Tutorial_Data.zip")) { download.file("http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip", destfile = "GAPIT_Tutorial_Data.zip") unzip("GAPIT_Tutorial_Data.zip") } download.file("http://zzlab.net/GAPIT/data/CROP545_Covariates.txt", destfile = "CROPS545_Covariates.txt") download.file("http://zzlab.net/GAPIT/data/CROP545_Phenotype.txt", destfile = "CROPS545_Phenotype.txt") # Import the GAPIT demo data genotypes gt_scan <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, stringsAsFactors = F, sep = "\t", nrows = 1)) classes <- sapply(gt_scan, class) genotypes <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, row.names = 1, colClasses = classes, stringsAsFactors = F, sep = "\t")) GM <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information.txt", header = T, stringsAsFactors = F, sep = "\t") CV <- read.table("CROPS545_Covariates.txt", header = T, stringsAsFactors = F, sep = "\t") phenotypes <- read.table("CROPS545_Phenotype.txt", header = T, stringsAsFactors = F, sep = "\t")
There are three data components required to run GWhEAT, the SNP data the phenotype(s) and a map showing the chromosome and positions of each SNP. The SNP data must be numeric coded as “0”, “1” and “2” where the 0 stands for homozygous for parent A, 2 stands for homozygous for parent B and 1 is for heterozygous SNP calls. Below is an example showing the first 5 rows and columns of our tutorial SNP data.
genotypes[1:5,1:5]
The phenotype should be integers, negative or positive that correspond to each individual. Note: Below I have the head of the phenotype printed with the taxa column still included, before running GWhEAT you will need to remove the taxa column leaving only the phenotypic values for the function to work.
phenotypes[1:5,1:2]
The final piece of information needed for GWhEAT to work is a genetic map of the SNPs. This map must have the SNP name and the chromosome it’s on as well as the position on that chromosome. This genetic map is needed in order to produce the Manhattan plot of the results.
GM[1:5,1:3]
The optional imputes for GWhEAT are the covariates which if used can help increases the power of the GWAS. The covariates mush be integers much like the phenotype values. Below is an example from the tutorial data.
CV[1:5,1:3]
Other imputes include number of PCAs to use, with the default being set to three. The significance threshold (Cutoff) which can be set to the exact -log(10) of the p-value you want or the default of 0.05/number of SNPs. There are also some options you can use to suppress the plots automatically generated. To best show these options I will run a few examples below.
Outputs from GWhEAT are a table with information on the PCA, Graphs showing variance explained by the PC, scatter plots of the PC used, a manhattan plot and a QQ-plot. A csv file with all SNPs and associated p-values and a table with the power and type-1 error can also be included as outputs.
phenotypes_n_1=phenotypes[,2]#remember to remove the first column from the phenotype data before running Example1=GWASapply(pheno=phenotypes_n_1, geno=genotypes,Cov=CV, GM=GM, PCA.M = 3, plots=TRUE, messages=TRUE, trait="Example1_phenotype")
In order to fully demonstrate GWhEAT functions, I am going to simulate some QTN effects using the function G2P. All G2Pis doing is generating effects and randomly assigning then to different SNPs. This will allow me to run GWhEAT and be able to determine it’s accuracy. \ \ As we can see in the Manhattan plot and from the printed massages 8 significant SNPs were detected. 4 of them in black are true positives 2 in blue are false positives and the two in red are false negatives. Note some of the lines are very close to each other and can be hard to see which is why it’s nice to have messages = TRUE so that important findings are printed into the R console.
source("http://www.zzlab.net/StaGen/2020/R/G2P.R") NQTN = 6 #specify number of QTN h2 = 0.75 #heritability alpha = 0.6 #alpha value distribution = "normal" #specify distribution set.seed(66) # to get duplicatable results mySim=G2P(genotypes, h2, alpha, NQTN, distribution) Example2=GWASapply(pheno=mySim$y, geno=genotypes, GM=GM, PCA.M=3,QTN.position=mySim$QTN.position, plots=TRUE, messages=TRUE, trait="Example2_phenotype")
As sequencing data becomes more accessible and data sets become larger a necessity of GWAS packages is computational speed. In light of this we created a function within GWhEAT that runs the GWAS in a rapid manor called “GWASapply_rapid”. “GWASapply_rappid” works the same as GWASapply but will not print the Manhattan or other plots and is able to run more quickly.
Example3=GWASapply_rapid(pheno=phenotypes_n_1, geno=genotypes, Cov=CV, PCA.M = 3,cutoff=NULL)
After running GWASapply_rapid you can always create individual plots as shown below. The manhattan and QQ-plot functions allow flexibility in designing your own output.
manhattan_plot(GM,Example3$P.value.res, trait="manhattanPlot_Example3",FP = Example3$False.Positive, TP = Example3$True.Positive) qq_plot(GM,Example3$P.value.res,QTN.position, trait = "qq_plot_Example3")
For more information on individual functions please see the “Reference_Manual.pdf” or type ?FUNCITON_NAME into the R console, this will pull up specific information of each function inside GWhEAT. For example typing ?manhattan_plot will pull of the help page with details about the function that creates the Manhattan plots.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.