gwscaR
is a collection of functions that are useful for population genomics analyses. They were primarily written while analyzing two datasets: one RAD-seq and morphological dataset from 12 populations of pipefish (Flanagan et al. 2016) and the other containing RAD-seq data from a single population of pipefish, which was analyzed using selection components analysis (Flanagan & Jones 2017).
Selection components analysis compares allele frequencies among individuals in a population to identify genetic regions associated with different components of selection (e.g. comparing mated and nonmated males to estimate sexual selection). It is an approach originally outlined by Christiansen & Frydenberg in 1973 (for use with allozyme and chromosome inversion data) and has recently been applied to next-generation sequencing data in the Mimulus guttatus plant (Monnahan et al. 2015) and pipefish (Flanagan & Jones 2017). Monnahan et al. (2015) used a maximum likelihood appraoch to identify loci experiencing different forms of selection. In my study of pipefish, I compared the maximum likelihood approach to allele frequency comparisons using Fst values, an approach I had tested with a simulation model a couple years before (Flanagan & Jones 2015). The comparison demonstrated that the two approaches find similar patterns.
gwscaR
does not implement the maximum likelihood method. This package only performs the Fst analysis, as well as some other useful population genomics analyses. Using gwscaR
, you can:
The genome-wide selection components analysis uses a vcf file, so make sure your data are in that format. Then read the file into R.
library(gwscaR) vcf.file<-system.file("extdata", "example.vcf.txt",package = "gwscaR") vcf<-parse.vcf(vcf.file)
To run the program, you need to do a bit of file manipulation. First, it's good to create a unique index for each locus and then create a vector with all of the column names that specify the locus information.
vcf$SNP<-paste(vcf$`#CHROM`,vcf$POS,sep=".") head(vcf$SNP) locus.info<-c("#CHROM","POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT","SNP")
Also, you need to create vectors for the groups that you want to compare. In this case, compare males and females. The females have names with "FEM" and males have names with PRM and NPM.
grp1<-grep("FEM",colnames(vcf),value=T) grp2<-c(grep("PRM",colnames(vcf),value=T),grep("NPM",colnames(vcf),value=T))
Then you calculate Fsts between the two groups with fst.one.vcf and create a data frame.
sel<-do.call(rbind,apply(vcf,1,fst.one.vcf,group1=c(locus.info,grp1),group2=c(locus.info,grp2), cov.thresh=0.5,maf=0.05)) sel<-sel[!is.na(sel$Fst),] #Remove the ones that weren't polymorphic head(sel)
To identify the SNPs that are outliers, compare the value of 2NFst(k-1) to the chi-squared distribution with k-1 degrees of freedom, and then adjust for multiple tests using the Benjamini-Hochberg (1995) false discovery rate.
sel$Chi<-2*((sel$Num1+sel$Num2)/2)*sel$Fst sel$Chi.p<-1-pchisq(sel$Chi,1) sel$Chi.p.adj<-p.adjust(sel$Chi.p,method="BH")
For ease of implementation, all of the above steps are contained in the wrapper function gwsca
.
sel<-gwsca(vcf,locus.info,grp1,grp2,prop.ind.thresh=0.5,maf.cutoff=0.05)
The gwscaR
package includes a function to plot genome-wide statistics.
lgs<-seq(1,22) par(mar=c(2,2,2,0),oma=c(1,1,1,1)) sel.plot<-fst.plot(sel, fst.name="Fst",axis.size=0.6, xlabels=lgs,xlab.indices = seq(1,22), chrom.name="Chrom",bp.name="Pos") #plot without the scaffolds sel$Chrom <-as.numeric(as.character(sel$Chrom)) #the lgs and chrom have to be the same class sel.plot<-fst.plot(sel, fst.name="Fst",axis.size = 0.6, chrom.name="Chrom",bp.name="Pos",xlabels=TRUE, scaffs.to.plot=lgs)
If you're comparing different sets of Fsts from the same species, you might have different loci represented in each comparison. But you might want the graphs to have the chromosomes line up nicely...so you can take advantage of fst.plot
's group.boundaries
option
#create a data.frame with the starts and stops for the chromosomes bounds<-data.frame(levels(as.factor(vcf$`#CHROM`)),tapply(as.numeric(as.character(vcf$POS)),vcf$`#CHROM`,max)) par(mar=c(2,2,2,0),oma=c(1,1,1,1)) sel.plot<-fst.plot(sel, fst.name="Fst",axis.size=0.6, chrom.name="Chrom",bp.name="Pos",xlabels=lgs, scaffold.widths = bounds) #Or if you just want to plot the linkage groups lgs<-seq(1,22) sel.plot<-fst.plot(sel, fst.name="Fst",axis.size = 0.6, chrom.name="Chrom",bp.name="Pos",xlabels=T, scaffold.widths = bounds, scaffs.to.plot=lgs) #And you can highlight the points that have adjusted p-values < 0.05 points(sel.plot$plot.pos[sel.plot$Chi.p.adj<0.05],sel.plot$Fst[sel.plot$Chi.p.adj<0.05], col="cornflowerblue",pch=19,cex=0.75) #only one point is significant in this example
Notice that in each of the examples above, the xlabels have been specified in different ways. The fst.plot
function has a number of ways to specify plotting parameters so please see the documentation for more details.
The gwscaR
package can be used in a broader context than just genome-wide selection components analysis. It also contains useful functions for performing typical population genetics analyses and file manipulations. Here I will highlight some of those functions.
In the above examples you've already seen that gwscaR
allows you to easily read a vcf file into a data.frame format. The package also allows you to merge two vcf files using combine.vcfs
in case you have two groups of individuals that were genotyped at the same loci.
Many population genomics tests have the assumption of linkage equilibrium among loci, an assumption which is violated by using multiple SNPs per RAD locus. Therefore, the function choose.one.snp
allows you to randomly select one SNP per RAD locus from a vcf.
The fst.ibd.byloc
function is helpful because it allows you to identify loci that are significantly affected by isolation by distance and distinguish them from loci that are not significantly isolated by distance. This function simply performs a Mantel test for each SNP. It uses data in ped file format and you must give it a list of the populations you're comparing (this ensures that you're comparing the same pairwise relationships represented in your distance matrices).
ibd.by.loc<-fst.ibd.byloc(sub.ped,dist,pop.list)
The popular population structure program, STRUCTURE (Pritchard et al 2000), assigns proportions of each individual's genome to population clusters. A companion program allows you to plot the graphs, but that program didn't allow me the customization options that I desired (including embedding a structure plot in a multi-figure graph; see Flanagan et al 2016 for an example), so I wrote my own plotting function, plotting.structure
Sometimes you will have both genetic data and phenotypic data, and you might be interested in comparing population differentiation at both levels. One way to do that is to calculate Pst, which is analagous to Fst but with regards to phenotypes. gwscaR
contains a funcion to calculate this metric for you, pairwise.pst
. Once you've calculated Pst, you can compare it to Fsts using the funtion fst.pst.byloc
and to a distance matrix with pst.mantel
.
fem.psts<-apply(fem.phenotype.data[,3:10],2,function(x){ pst<-pairwise.pst(data.frame(fem.phenotype.data[,1],x),pop.list) return(pst) }) fem.pst.fst.loc<-fst.pst.byloc(sub.ped,fem.phenotype.data,pop.list,1) fem.dist<-.pst.mantel(fem.phenotype.data,dist,1)
For more detail on this method and to see an example, please see Flanagan et al. (2016).
Christiansen FB and Frydenberg O. 1973. Selection component analysis of natural polymorphisms using population samples including mother-offspring combinations. Theoretical Population Biology 4: 425-44.
Flanagan SP and Jones AG. 2015. Identifying signatures of sexual selection using genomewide selection components analysis. Ecology and Evolution 5: 2722-2744.
Flanagan SP and Jones AG. 2017. Genome-wide selection components analysis in a fish with male pregnancy. Evolution 71: 1096 - 1105.
Flanagan SP, Rose E, and Jones AG. 2016. Population genomics reveals multiple drivers of population differentiation in a sex-role-reversed pipefish. Molecular Ecology 25: 5043-5072.
Monnahan PJ, Colicchio J, and Kelly JK. 2015. A genomic selection component analysis characterizes migration-selection balance. Evolution 69: 1713-1727.
Pritchard JK, Stephens M, and Donnelly P. 2000. Inference of population struccture using multilocus genotype data. Genetics 155: 945-959.
If you run into any problems, find any bugs, or have other comments on gwscaR please contact me: spflanagan.phd@gmail.com.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.