In spflanagan/gwscaR: Genome-wide Selection Components Analysis in R

gwscaR is a collection of functions that are useful for population genomics analyses. They were primarily written while analyzing two datasets: one RAD-seq and morphological dataset from 12 populations of pipefish (Flanagan et al. 2016) and the other containing RAD-seq data from a single population of pipefish, which was analyzed using selection components analysis (Flanagan & Jones 2017).

Selection components analysis compares allele frequencies among individuals in a population to identify genetic regions associated with different components of selection (e.g. comparing mated and nonmated males to estimate sexual selection). It is an approach originally outlined by Christiansen & Frydenberg in 1973 (for use with allozyme and chromosome inversion data) and has recently been applied to next-generation sequencing data in the Mimulus guttatus plant (Monnahan et al. 2015) and pipefish (Flanagan & Jones 2017). Monnahan et al. (2015) used a maximum likelihood appraoch to identify loci experiencing different forms of selection. In my study of pipefish, I compared the maximum likelihood approach to allele frequency comparisons using Fst values, an approach I had tested with a simulation model a couple years before (Flanagan & Jones 2015). The comparison demonstrated that the two approaches find similar patterns.

gwscaR does not implement the maximum likelihood method. This package only performs the Fst analysis, as well as some other useful population genomics analyses. Using gwscaR, you can:

Run a Fst-based genome-wide selection components analysis
Plot genome-wide values (e.g., Fsts)
Manipulate vcf and plink files
Calculate pairwise Fst and Pst between two groups
Test for significant isolation by distance at each locus

Genome-wide selection components analysis

The genome-wide selection components analysis uses a vcf file, so make sure your data are in that format. Then read the file into R.

library(gwscaR)
vcf.file<-system.file("extdata", "example.vcf.txt",package = "gwscaR")
vcf<-parse.vcf(vcf.file)

To run the program, you need to do a bit of file manipulation. First, it's good to create a unique index for each locus and then create a vector with all of the column names that specify the locus information.

vcf$SNP<-paste(vcf$`#CHROM`,vcf$POS,sep=".")
head(vcf$SNP)
locus.info<-c("#CHROM","POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT","SNP")

Also, you need to create vectors for the groups that you want to compare. In this case, compare males and females. The females have names with "FEM" and males have names with PRM and NPM.

grp1<-grep("FEM",colnames(vcf),value=T)
grp2<-c(grep("PRM",colnames(vcf),value=T),grep("NPM",colnames(vcf),value=T))

Then you calculate Fsts between the two groups with fst.one.vcf and create a data frame.

sel<-do.call(rbind,apply(vcf,1,fst.one.vcf,group1=c(locus.info,grp1),group2=c(locus.info,grp2),
    cov.thresh=0.5,maf=0.05))
sel<-sel[!is.na(sel$Fst),] #Remove the ones that weren't polymorphic
head(sel)

To identify the SNPs that are outliers, compare the value of 2NFst(k-1) to the chi-squared distribution with k-1 degrees of freedom, and then adjust for multiple tests using the Benjamini-Hochberg (1995) false discovery rate.

sel$Chi<-2*((sel$Num1+sel$Num2)/2)*sel$Fst
sel$Chi.p<-1-pchisq(sel$Chi,1)
sel$Chi.p.adj<-p.adjust(sel$Chi.p,method="BH")

The wrapper function

For ease of implementation, all of the above steps are contained in the wrapper function gwsca.

sel<-gwsca(vcf,locus.info,grp1,grp2,prop.ind.thresh=0.5,maf.cutoff=0.05)

Plot the results

The gwscaR package includes a function to plot genome-wide statistics.

lgs<-seq(1,22)
par(mar=c(2,2,2,0),oma=c(1,1,1,1))
sel.plot<-fst.plot(sel, fst.name="Fst",axis.size=0.6,
                   xlabels=lgs,xlab.indices = seq(1,22),
                   chrom.name="Chrom",bp.name="Pos")

#plot without the scaffolds
sel$Chrom <-as.numeric(as.character(sel$Chrom)) #the lgs and chrom have to be the same class
sel.plot<-fst.plot(sel, fst.name="Fst",axis.size = 0.6,
             chrom.name="Chrom",bp.name="Pos",xlabels=TRUE,
             scaffs.to.plot=lgs)

If you're comparing different sets of Fsts from the same species, you might have different loci represented in each comparison. But you might want the graphs to have the chromosomes line up nicely...so you can take advantage of fst.plot's group.boundaries option

#create a data.frame with the starts and stops for the chromosomes
bounds<-data.frame(levels(as.factor(vcf$`#CHROM`)),tapply(as.numeric(as.character(vcf$POS)),vcf$`#CHROM`,max))

par(mar=c(2,2,2,0),oma=c(1,1,1,1))
sel.plot<-fst.plot(sel, fst.name="Fst",axis.size=0.6,
             chrom.name="Chrom",bp.name="Pos",xlabels=lgs,
             scaffold.widths = bounds)

#Or if you just want to plot the linkage groups
lgs<-seq(1,22)
sel.plot<-fst.plot(sel, fst.name="Fst",axis.size = 0.6,
             chrom.name="Chrom",bp.name="Pos",xlabels=T,
             scaffold.widths = bounds, scaffs.to.plot=lgs)

#And you can highlight the points that have adjusted p-values < 0.05
points(sel.plot$plot.pos[sel.plot$Chi.p.adj<0.05],sel.plot$Fst[sel.plot$Chi.p.adj<0.05],
       col="cornflowerblue",pch=19,cex=0.75) #only one point is significant in this example

Notice that in each of the examples above, the xlabels have been specified in different ways. The fst.plot function has a number of ways to specify plotting parameters so please see the documentation for more details.

Other Useful Functions

The gwscaR package can be used in a broader context than just genome-wide selection components analysis. It also contains useful functions for performing typical population genetics analyses and file manipulations. Here I will highlight some of those functions.

Merging vcf files

In the above examples you've already seen that gwscaR allows you to easily read a vcf file into a data.frame format. The package also allows you to merge two vcf files using combine.vcfs in case you have two groups of individuals that were genotyped at the same loci.

Choosing a single SNP per RAD locus

Many population genomics tests have the assumption of linkage equilibrium among loci, an assumption which is violated by using multiple SNPs per RAD locus. Therefore, the function choose.one.snp allows you to randomly select one SNP per RAD locus from a vcf.

Comparing each SNP to a distance matrix

The fst.ibd.byloc function is helpful because it allows you to identify loci that are significantly affected by isolation by distance and distinguish them from loci that are not significantly isolated by distance. This function simply performs a Mantel test for each SNP. It uses data in ped file format and you must give it a list of the populations you're comparing (this ensures that you're comparing the same pairwise relationships represented in your distance matrices).

ibd.by.loc<-fst.ibd.byloc(sub.ped,dist,pop.list)

Plotting STRUCTURE output

The popular population structure program, STRUCTURE (Pritchard et al 2000), assigns proportions of each individual's genome to population clusters. A companion program allows you to plot the graphs, but that program didn't allow me the customization options that I desired (including embedding a structure plot in a multi-figure graph; see Flanagan et al 2016 for an example), so I wrote my own plotting function, plotting.structure

Calculating Pst and comparing it to Fst

Sometimes you will have both genetic data and phenotypic data, and you might be interested in comparing population differentiation at both levels. One way to do that is to calculate Pst, which is analagous to Fst but with regards to phenotypes. gwscaR contains a funcion to calculate this metric for you, pairwise.pst. Once you've calculated Pst, you can compare it to Fsts using the funtion fst.pst.byloc and to a distance matrix with pst.mantel.

fem.psts<-apply(fem.phenotype.data[,3:10],2,function(x){
    pst<-pairwise.pst(data.frame(fem.phenotype.data[,1],x),pop.list)
    return(pst)
})

fem.pst.fst.loc<-fst.pst.byloc(sub.ped,fem.phenotype.data,pop.list,1)

fem.dist<-.pst.mantel(fem.phenotype.data,dist,1)

For more detail on this method and to see an example, please see Flanagan et al. (2016).

Citations

Christiansen FB and Frydenberg O. 1973. Selection component analysis of natural polymorphisms using population samples including mother-offspring combinations. Theoretical Population Biology 4: 425-44.

Flanagan SP and Jones AG. 2015. Identifying signatures of sexual selection using genomewide selection components analysis. Ecology and Evolution 5: 2722-2744.

Flanagan SP and Jones AG. 2017. Genome-wide selection components analysis in a fish with male pregnancy. Evolution 71: 1096 - 1105.

Flanagan SP, Rose E, and Jones AG. 2016. Population genomics reveals multiple drivers of population differentiation in a sex-role-reversed pipefish. Molecular Ecology 25: 5043-5072.

Monnahan PJ, Colicchio J, and Kelly JK. 2015. A genomic selection component analysis characterizes migration-selection balance. Evolution 69: 1713-1727.

Pritchard JK, Stephens M, and Donnelly P. 2000. Inference of population struccture using multilocus genotype data. Genetics 155: 945-959.

If you run into any problems, find any bugs, or have other comments on gwscaR please contact me: spflanagan.phd@gmail.com.

spflanagan/gwscaR documentation built on Nov. 14, 2019, 9:16 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com