knitr::opts_chunk$set(echo = TRUE)
We will need to install the GWhEAT package since this is a working package in it’s early stages of development it’s only available through Github. To download files off Github first download and load the library of the packaged “devtools” using the code below.
#code to install package from github #install.packages("devtools") library(devtools)
Next the code below downloads and installs the package GWhEAT from Github. The bottom two line of code in the chunk below make sure the dependencies GWhEAT relies on are also downloaded and installed.
#install package install_github("stp4freedom/GWhEAT", force =TRUE) library(GWhEAT)#package name #Load dependencies if (!require("pacman")) install.packages("pacman") pacman::p_load(ggplot2,knitr,gridExtra)
The code below downloads and loads into the environment data that can be used for this tutorial.
if (!file.exists("GAPIT_Tutorial_Data.zip")) { download.file("http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip", destfile = "GAPIT_Tutorial_Data.zip") unzip("GAPIT_Tutorial_Data.zip") } download.file("http://zzlab.net/GAPIT/data/CROP545_Covariates.txt", destfile = "CROPS545_Covariates.txt") download.file("http://zzlab.net/GAPIT/data/CROP545_Phenotype.txt", destfile = "CROPS545_Phenotype.txt") # Import the GAPIT demo data genotypes gt_scan <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, stringsAsFactors = F, sep = "\t", nrows = 1)) classes <- sapply(gt_scan, class) genotypes <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, row.names = 1, colClasses = classes, stringsAsFactors = F, sep = "\t")) #import genetic map, phenotypes, and covariates GM <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information.txt", header = T, stringsAsFactors = F, sep = "\t") CV <- read.table("CROPS545_Covariates.txt", header = T, stringsAsFactors = F, sep = "\t") phenotypes <- read.table("CROPS545_Phenotype.txt", header = T, stringsAsFactors = F, sep = "\t") #Create phenotypes with just n by 1 phenotypes_n_1=phenotypes[,2]
The package should contain at least three input: y, X , and C that are R objects of numeric data frame. Their dimensions are n by 1, n by m, and n by t corresponding to phenotype, genotype and covariate data, where n is number of individuals, m is number of markers, and t is number of covariates. The function should return probability values with dimension of 1 by m for the association tests between phenotype and markers. Markers are tested one at a time with covariates in C included as covariates
Our gwas function is called "GWASapply" and the input is "pheno" for the y object, "geno" for the X object, and "Cov" for the C object. They follow the correct dimensions. The function returns various results, but the probability values are contained in the result "P.value.res" and returns an object of 1 by m.
The package should perform PCA and incorporate PCs as cofactors for GWAS. Your package should also automatically exclude the PCs that are in linear dependent to the covariates provided by users.
In our package we use the "prcomp" function to perform PCA on the X "geno" object and use a "fix_Dep" function to test for dependency and then excludethe PCs that are in linear dependence.
Develop a user manual and tutorials. Name your package and create a logo.
Also included is a Reference_Manual.pdf for all our functions in the package which include "fix_Dep", "manhattan_plot", "qq_plot", "power.fdr", and our main functions "GWASapply", and "GWASapply_rapid".
Perform GWAS on the data provided or your own data which must contain cofactors.
Our GWASapply function accepts the tutorial data phenotypes, genotypes, CV, and Genetic Map. It displays the PCA variance table and plots and scatterplots along with a QQ-plot and manhattan plot. The messages displayed give information on certain aspects of our GWAS results. The results returned returned are the probability values, cutoff used, order of snps and a table with the significant SNP information.
For the demo data, our function shows that since there was no cutoff specified, it defaults to a Bonferroni correction that results in a cutoff for significant values of "1.62e-05" with four significant SNPs. The results are shown below in the plots and the information is displayed in table 1. The MAF for each associated SNP ranged from 0.10 to 0.35. If these associated SNP’s MAF were very low like below 0.05 it would be cause of concern, however with the lowest MAF at 0.10 we can have confidence in these SNP associations.
#run gwasapply function and display messages, PCA, and plots p4=GWASapply(phenotypes_n_1,genotypes,CV,GM,PCA.M = 3,plots=TRUE,messages=TRUE,trait="Problem 4")
#final threshold p4$cutoff.final.res
#significant snps table kable(p4$sig.table)
Demonstrate that your method is superior to the competing method (GWASbyCor) through simulation.
In order to accurately compare our GWAS function and "GWASbyCor", we will simulate data and use 10 QTNs in order to calculate false positives and true positives. We will use Manhattan plots and QQ plots to compare our results with the QTNs, false positives, and true positives denoted on the Manhattan plot. Using the replicate function, We replicated the GWAS functions, and then calculated how many false positives and true positives were calculated. We then took the averages and standard deviations of all the replicates. Then we use a t-test for both results in order to use statisitcal inferences and show a significant difference between the two methods. We also plotted ROC curves for Power vs FDR and Power vs Type 1 error.
In comparing the Manhattan plots and qq plots. The red lines display the locations for each QTN simulated. The blue lines show false positives, and the black lines show true positives. If there are red lines, in the results they are technically false negatives since the black lines would cover them if present.Since there was no cutoff specified, it defaults to a Bonferroni correction. We can easily see the large amounts of false positives in the GWASbyCor Manhattan plot, with only one appearing in the GWASapply Manhattan plot. The qqplots also show the GWASapply function results with less deviation from the expected except for the significant SNPS. Both ROC curves both look similar between the two functions. For the replicated results, we see that GWASbyCor results in a very large number of false positives with an averge of 31 and a large standard deviation of 48, whereas, our GWASapply function only results in an average of 4 false positives and a standard deviation of around 6. However, both functions result in a similar average for true positives. The t-tests show a significant difference for false positives but not true positives. The results show the GWASapply function results in a much lower rate of false positives.
source("http://www.zzlab.net/StaGen/2020/R/G2P.R") source("http://www.zzlab.net/StaGen/2020/R/GWASbyCor.R") NQTN = 10 #specify number of QTN h2 = 0.75 #heritability alpha = 0.6 #alpha value distribution = "normal" #specify distribution set.seed(1) #plotting #phenotype simulation mySim=G2P(genotypes, h2, alpha, NQTN, distribution) p=GWASbyCor(X=genotypes,y=mySim$y)#GWAS by using the correlation methods p=t(matrix(p)) #power calculations pres=power.fdr(p, QTN.position=mySim$QTN.position,cutoff=NULL) #gwasapply function p5=GWASapply(mySim$y,genotypes,CV,GM,PCA.M = 3,QTN.position=mySim$QTN.position,message=TRUE)
#create manhattan plots gcm=manhattan_plot(GM,p,cutoff=NULL,mySim$QTN.position,FP=pres$FP.fdr.power,TP=pres$TP.fdr.power, trait = "GWASbyCor",messages=FALSE) gtm=manhattan_plot(GM,p5$P.value.res,cutoff=NULL,mySim$QTN.position,FP=p5$False.Positive,TP=p5$True.Positive, trait = "GWASapply",messages=FALSE) grid.arrange(gcm, gtm, nrow = 2, ncol = 1, top = "Manhattan Plots")
#create qq plots gcq=qq_plot(GM,p,mySim$QTN.position, trait = "GWASbyCor") gtq=qq_plot(GM,p5$P.value.res,mySim$QTN.position, trait = "GWASapply") grid.arrange(gcq,gtq, nrow = 2, ncol = 1, top = "QQ-Plots")
#plot power vs fdr for both functions par(mfrow=c(1,2)) plot(pres$power~pres$fdr,xlab="FDR",ylab="Power",main="ROC Curve: GWASbyCor",type="b") plot(p5$power.fdr.type1error.res$power~p5$power.fdr.type1error.res$fdr,xlab="FDR",ylab="Power",main="ROC Curve: GWASapply",type="b")
#plot power vs type1 for both functions par(mfrow=c(1,2)) plot(pres$power~pres$type1error,xlab="Type 1",ylab="Power",main="ROC Curve: GWASbyCor",type="b") plot(p5$power.fdr.type1error.res$power~p5$power.fdr.type1error.res$type1error,xlab="Type 1",ylab="Power",main="ROC Curve: GWASapply",type="b")
gcrep=replicate (30,{ #simulate phenotype mySim=G2P(genotypes, h2, alpha, NQTN, distribution) #phenotype simulation p= GWASbyCor(X=genotypes,y=mySim$y)#GWAS by using the correlation methods p=data.frame(p) #power calculations pres=power.fdr(p, QTN.position=mySim$QTN.position,cutoff=NULL) p5_pwr=mean(pres$power) p5_fdr=mean(pres$fdr) FalP=length(pres$FP.fdr.power) TruP=length(pres$TP.fdr.power) num_output=c() num_output=c(p5_pwr,p5_fdr,FalP,TruP) }) means=rowMeans(gcrep) stdevs=apply(gcrep,1,sd) out4 <- rbind(means,stdevs) out4 <- as.data.frame(out4) colnames(out4) <- c("Power","FDR","False Positive","True Positives") rownames(out4)<-c("Averages","Standard Deviations") kable(out4)
garep=replicate(30,{ #simulate phenotype mySim=G2P(genotypes, h2, alpha, NQTN, distribution) #run gwasapply function p5_rep=GWASapply(mySim$y,genotypes,CV,GM,PCA.M = 3,QTN.position=mySim$QTN.position) p5_pwr=mean(p5$power.fdr.type1error.res$power) p5_fdr=mean(p5$power.fdr.type1error.res$fdr) FalP=length(p5_rep$False.Positive) TruP=length(p5_rep$True.Positive) num_output=c() num_output=c(p5_pwr,p5_fdr,FalP,TruP) }) means=rowMeans(garep) stdevs=apply(garep,1,sd) out5 <- rbind(means,stdevs) out5 <- as.data.frame(out5) colnames(out5) <- c("Power","FDR","False Positive","True Positives") rownames(out5) <-c("Averages","Standard Deviations") #output results in a table kable(out5)
#t-test for power t.test(gcrep[1,],garep[1,], paired = T, na.rm=T)
#t-test for FDR t.test(gcrep[2,],garep[2,], paired = T, na.rm=T)
#t-test for false positive t.test(gcrep[3,],garep[3,], paired = T, na.rm=T)
#t-test for true positives t.test(gcrep[4,],garep[4,], paired = T, na.rm=T)
Demonstrate that your package is better than BLINK C version (http://zzlab.net/blink) on either statistical power or speed.
In order to accurately compare our GWAS function and Blink, we will simulate data and use 10 QTNs in order to calculate power, FDR, false positives and true positives, and most importantly compare the speed of our functions. We will use Manhattan plots and QQ plots to compare our results with the QTNs, false positives, and true positives denoted on the Manhattan plot. Using the replicate function, We replicated the GWAS functions, and then calculated how many false positives and true positives were calculated. We then took the averages and standard deviations of all the replicates. Then we use a t-test for both results in order to use statisitcal inferences and show a significant difference between the two methods.We did the same thing for Power and FDR. We then tested the replicated functions on two more datasets.
In comparing the Manhattan plots and qq plots. The red lines display the locations for each QTN simulated. The blue lines show false positives, and the black lines show true positives. If there are red lines, in the results they are technically false negatives since the black lines would cover them if present.Since there was no cutoff specified, it defaults to a Bonferroni correction. Both functions display very few if any false positives, with BLINK producing more TRUE positives. This is confirmed by the replicates and averages for the two functions. The one thing that GWASapply has on BLINK is speed. The average speed is lower and is also significant in the t-test. This proved true for all three datasets.
NQTN = 10 #specify number of QTN h2 = 0.75 #heritability alpha = 0.6 #alpha value distribution = "normal" #specify distribution set.seed(1) #plotting #phenotype simulation mySim=G2P(genotypes, h2, alpha, NQTN, distribution)
#run replicate garapid=replicate(30,{ mySim=G2P(genotypes, h2, alpha, NQTN, distribution) #start time for function loop_start <- proc.time() p6_rep=GWASapply_rapid(mySim$y,genotypes,CV,PCA.M = 3) loop_end <- proc.time() #end time for function loop_elapsed <- loop_end[3] - loop_start[3] #power calculations p6_power=power.fdr(p6_rep$P.value.res,mySim$QTN.position) p6_pwr=mean(p6_power$power) p6_fdr=mean(p6_power$fdr) FalP=length(p6_power$FP.fdr.power) TruP=length(p6_power$TP.fdr.power) pwr.fdr=c(p6_pwr,p6_fdr,loop_elapsed,FalP,TruP) #timet=c(loop_elapsed[[1]]) }) means=rowMeans(garapid) out6 <- rbind(means[1],means[2],means[3],means[4],means[5]) out6 <- as.data.frame(out6) colnames(out6) <- c("GWASapply Average") rownames(out6) <-c("Power","FDR","Seconds","False Positive","True Postive")
#write map and numeric genotype files compatible with BLINK myGM=GM colnames(myGM)=c("rs","chr","pos") write.table(t(genotypes),file="myData.dat",quote=F,sep="\t",col.name=F,row.name=F) write.table(myGM,file="myData.map",quote=F,sep="\t",col.name=T,row.name=F) colnames(CV)=c("taxa","CV1","CV2") write.table(CV,file="myData.cov",quote=F,sep="\t",col.name=T,row.name=F) #run replicate blinkrep=replicate(30,{ mySim=G2P(genotypes, h2, alpha, NQTN, distribution) myY=cbind(rownames(mySim$y),mySim$y) colnames(myY)=c("taxa","Sim") write.table(myY,file="myData.txt",quote=F,sep="\t",col.name=T,row.name=F) #start time for function loop_start <- proc.time() system("blink --file myData --numeric --gwas") loop_end <- proc.time() #end time for function loop_elapsed <- loop_end[3] - loop_start[3] #read in results blink.p <- read.table("Sim_GWAS_result.txt", header = T, stringsAsFactors = F, sep = "\t") p=t(blink.p$p_value) #power calculations p6_blink=power.fdr(p, QTN.position=mySim$QTN.position,cutoff=NULL) p6_pwr_blink=mean(p6_blink$power) p6_fdr_blink=mean(p6_blink$fdr) FalP=length(p6_blink$FP.fdr.power) TruP=length(p6_blink$TP.fdr.power) pwr.fdr=c(p6_pwr_blink,p6_fdr_blink,loop_elapsed,FalP,TruP) }) means=rowMeans(blinkrep) out7 <- rbind(means[1],means[2],means[3],means[4],means[5]) out7 <- as.data.frame(out7) colnames(out7) <- c("Blink Average") rownames(out7) <-c("Power","FDR","Seconds","False Positive","True Postive") out1=data.frame(out6,out7)
#output results kable(out1)
#t-test for power t.test(garapid[1,],blinkrep[1,], paired = T, na.rm=T)
#t-test for FDR t.test(garapid[2,],blinkrep[2,], paired = T, na.rm=T)
#t-test for time t.test(garapid[3,],blinkrep[3,], paired = T, na.rm=T)
#t-test for false positve t.test(garapid[4,],blinkrep[4,], paired = T, na.rm=T)
#t-test for true positive t.test(garapid[5,],blinkrep[5,], paired = T, na.rm=T)
#simulate phenotypes mySim=G2P(genotypes, h2, alpha, NQTN, distribution) myY=cbind(rownames(mySim$y),mySim$y) colnames(myY)=c("taxa","Sim") #write phenotype table compatabile with BLINK write.table(myY,file="myData.txt",quote=F,sep="\t",col.name=T,row.name=F) #run BLINK numeric GWAS system("blink --file myData --numeric --gwas") blink.p <- read.table("Sim_GWAS_result.txt", header = T, stringsAsFactors = F, sep = "\t") p=t(blink.p$p_value) pres=power.fdr(p, QTN.position=mySim$QTN.position,cutoff=NULL) #Run GWASapply and power.fdr functions p6_p=GWASapply_rapid(mySim$y,genotypes,CV,PCA.M = 3) p6_plots=power.fdr(p6_p$P.value.res,mySim$QTN.position)
#plot manhattan plots together gcm=manhattan_plot(GM,p,cutoff=NULL,mySim$QTN.position,FP=pres$FP.fdr.power,TP=pres$TP.fdr.power, trait = "BLINK",messages=FALSE) gtm=manhattan_plot(GM,p6_p$P.value.res,cutoff=NULL,mySim$QTN.position,FP=p6_plots$FP.fdr.power,TP=p6_plots$TP.fdr.power, trait = "GWASapply",messages=FALSE) grid.arrange(gcm, gtm, nrow = 2, ncol = 1, top = "Manhattan Plots")
#plot qq plots together gcq=qq_plot(GM,p,mySim$QTN.position, trait = "BLINK") gtq=qq_plot(GM,p6_p$P.value.res,mySim$QTN.position, trait = "GWASapply") grid.arrange(gcq,gtq, nrow = 2, ncol = 1, top = "QQ-Plots")
gt_scan1 <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric1.txt", header = T, stringsAsFactors = F, sep = "\t", nrows = 1)) classes1 <- sapply(gt_scan1, class) genotypes1 <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric1.txt", header = T, row.names = 1, colClasses = classes1, stringsAsFactors = F, sep = "\t")) #import genetic map, phenotypes, and covariates GM1 <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information1.txt", header = T, stringsAsFactors = F, sep = "\t")
#run replicate garapid1=replicate(30,{ mySim=G2P(genotypes1, h2, alpha, NQTN, distribution) #start time for function loop_start <- proc.time() p6_rep=GWASapply_rapid(mySim$y,genotypes1,CV,PCA.M = 3) loop_end <- proc.time() #end time for function loop_elapsed <- loop_end[3] - loop_start[3] #power calculations p6_power=power.fdr(p6_rep$P.value.res,mySim$QTN.position) p6_pwr=mean(p6_power$power) p6_fdr=mean(p6_power$fdr) FalP=length(p6_power$FP.fdr.power) TruP=length(p6_power$TP.fdr.power) pwr.fdr=c(p6_pwr,p6_fdr,loop_elapsed,FalP,TruP) #timet=c(loop_elapsed[[1]]) }) means1=rowMeans(garapid1) out8 <- rbind(means1[1],means1[2],means1[3],means1[4],means1[5]) out8 <- as.data.frame(out8) colnames(out8) <- c("GWASapply Average") rownames(out8) <-c("Power","FDR","Seconds","False Positive","True Postive")
#write map and numeric genotype files compatible with BLINK myGM1=GM1 colnames(myGM1)=c("rs","chr","pos") write.table(t(genotypes1),file="myData1.dat",quote=F,sep="\t",col.name=F,row.name=F) write.table(myGM1,file="myData1.map",quote=F,sep="\t",col.name=T,row.name=F) colnames(CV)=c("taxa","CV1","CV2") write.table(CV,file="myData1.cov",quote=F,sep="\t",col.name=T,row.name=F) #run replicate blinkrep1=replicate(30,{ mySim=G2P(genotypes1, h2, alpha, NQTN, distribution) myY=cbind(rownames(mySim$y),mySim$y) colnames(myY)=c("taxa","Sim") write.table(myY,file="myData1.txt",quote=F,sep="\t",col.name=T,row.name=F) #start time for function loop_start <- proc.time() system("blink --file myData1 --numeric --gwas") loop_end <- proc.time() #end time for function loop_elapsed <- loop_end[3] - loop_start[3] #read in results blink.p <- read.table("Sim_GWAS_result.txt", header = T, stringsAsFactors = F, sep = "\t") p=t(blink.p$p_value) #power calculations p6_blink=power.fdr(p, QTN.position=mySim$QTN.position,cutoff=NULL) p6_pwr_blink=mean(p6_blink$power) p6_fdr_blink=mean(p6_blink$fdr) FalP=length(p6_blink$FP.fdr.power) TruP=length(p6_blink$TP.fdr.power) pwr.fdr=c(p6_pwr_blink,p6_fdr_blink,loop_elapsed,FalP,TruP) }) means1=rowMeans(blinkrep1) out9 <- rbind(means1[1],means1[2],means1[3],means1[4],means1[5]) out9 <- as.data.frame(out9) colnames(out9) <- c("Blink Average") rownames(out9) <-c("Power","FDR","Seconds","False Positive","True Postive") out2=data.frame(out8,out9)
#output results kable(out2)
#t-test for time t.test(garapid1[3,],blinkrep1[3,], paired = T, na.rm=T)
gt_scan2 <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric2.txt", header = T, stringsAsFactors = F, sep = "\t", nrows = 1)) classes2 <- sapply(gt_scan2, class) genotypes2 <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric2.txt", header = T, row.names = 1, colClasses = classes2, stringsAsFactors = F, sep = "\t")) #import genetic map, phenotypes, and covariates GM2 <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information2.txt", header = T, stringsAsFactors = F, sep = "\t")
#run replicate garapid2=replicate(30,{ mySim=G2P(genotypes2, h2, alpha, NQTN, distribution) #start time for function loop_start <- proc.time() p6_rep=GWASapply_rapid(mySim$y,genotypes2,CV,PCA.M = 3) loop_end <- proc.time() #end time for function loop_elapsed <- loop_end[3] - loop_start[3] #power calculations p6_power=power.fdr(p6_rep$P.value.res,mySim$QTN.position) p6_pwr=mean(p6_power$power) p6_fdr=mean(p6_power$fdr) FalP=length(p6_power$FP.fdr.power) TruP=length(p6_power$TP.fdr.power) pwr.fdr=c(p6_pwr,p6_fdr,loop_elapsed,FalP,TruP) #timet=c(loop_elapsed[[1]]) }) means2=rowMeans(garapid2) out10 <- rbind(means2[1],means2[2],means2[3],means2[4],means2[5]) out10 <- as.data.frame(out10) colnames(out10) <- c("GWASapply Average") rownames(out10) <-c("Power","FDR","Seconds","False Positive","True Postive")
#write map and numeric genotype files compatible with BLINK myGM2=GM2 colnames(myGM2)=c("rs","chr","pos") write.table(t(genotypes2),file="myData2.dat",quote=F,sep="\t",col.name=F,row.name=F) write.table(myGM2,file="myData2.map",quote=F,sep="\t",col.name=T,row.name=F) colnames(CV)=c("taxa","CV1","CV2") write.table(CV,file="myData2.cov",quote=F,sep="\t",col.name=T,row.name=F) #run replicate blinkrep2=replicate(30,{ mySim=G2P(genotypes2, h2, alpha, NQTN, distribution) myY=cbind(rownames(mySim$y),mySim$y) colnames(myY)=c("taxa","Sim") write.table(myY,file="myData2.txt",quote=F,sep="\t",col.name=T,row.name=F) #start time for function loop_start <- proc.time() system("blink --file myData2 --numeric --gwas") loop_end <- proc.time() #end time for function loop_elapsed <- loop_end[3] - loop_start[3] #read in results blink.p <- read.table("Sim_GWAS_result.txt", header = T, stringsAsFactors = F, sep = "\t") p=t(blink.p$p_value) #power calculations p6_blink=power.fdr(p, QTN.position=mySim$QTN.position,cutoff=NULL) p6_pwr_blink=mean(p6_blink$power) p6_fdr_blink=mean(p6_blink$fdr) FalP=length(p6_blink$FP.fdr.power) TruP=length(p6_blink$TP.fdr.power) pwr.fdr=c(p6_pwr_blink,p6_fdr_blink,loop_elapsed,FalP,TruP) }) means2=rowMeans(blinkrep2) out11 <- rbind(means2[1],means2[2],means2[3],means2[4],means2[5]) out11 <- as.data.frame(out11) colnames(out11) <- c("Blink Average") rownames(out11) <-c("Power","FDR","Seconds","False Positive","True Postive") out3=data.frame(out10,out11)
#output results kable(out3)
#t-test for time t.test(garapid2[3,],blinkrep2[3,], paired = T, na.rm=T)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.