This vignette documents how to import or enter genotypic data for the hierfstat
package. Originally this package was written to estimate and test hierarchical F-statistics, but was then further developed and now include almost all features of the Fstat program (no longer maintained), as well as several others.
The data types that hierfstat
can analyse are haploid or diploid, unphased, multilocus genotypes. Note that each data set must be made of only one ploidy level.
The basic data structure required by most Hierfstat
function is a data frame with the first column containing a population identifier (preferably a number), and the next $nl$ columns the genotype at each of $nl$ loci.
In hierfstat
, alleles are encoded as 1, 2 or 3 digits numbers, and genotypes are encoded as numbers with the two alleles collated (as in pasted together). Other type of data can be imported (see below) but for the time being we focus on the primary data type. Thus imagine that you have an individual genotyped at a microsatellite locus with allele length 120
and 124
, the way to encode it for hierfstat
is either 120124
or 124120
. If the data are SNPs, each allele at a locus could be encoded as 1
and 2
, or you may decide to keep the correspondence between nucleotides and alleles (e.g. 1, 2, 3, 4
for A, C, G, T
). Thus, if the two alleles at a SNP locus are A
and T
and an individual is heterozygote, it could be encoded as 14
or 41
.
Example data sets are included in hierfstat
. For instance:
library(hierfstat) #load the library data(diploid) # info about this data set with ?diploid head(diploid)
The first individual (first row of the diploid data frame) belongs to population 1. Its genotype at loc-1
is 44
, thus homozygote for allele 4
. It is heterozygote for alleles 3
and 4
at both loc-2
and loc-3
, and homozygote for allele 3
at loc-4
and finaly homozygote for allele 4
at loc-5
. In fact, loc-1
and loc-4
are monomorphic, meaning that only one allele is present in all individuals from all populations.
If a genotype is missing, it is encoded as NA
. For instance, the fourth individual has not been typed at loc-3
, nor did the 6th individual for the same locus.
The first column of this dataframe contains the identifier of the population to which the individual belongs. We can find how many individuals were typed in each population by using the command table:
table(diploid[,1])
As another example, we look at dataset cont.isl99
, a data frame where alleles are encoded as 2 digits numbers:
data(cont.isl99) head(cont.isl99)
the first individual is homozygous for allele 74
at the first locus (loc.1
) and heterozygous fore alleles 19
and 55
at the second locus. The genotype could have been written 5519
instead of 1955
, it does not matter. Note the genotype of the 3rd and fourth individual at the first locus. They both carry allele 8
, which is in fact encoded as 08
. When it comes first, the leading 0 disappears, but it must be present in second position. Hence genotype 874
, 0874
and 7408
are the same, but different from genotype 748
who would be understood by hierfstat as an individual heterozygous for alleles 07
and 48
.
Last point: alleles for all loci to be analysed simultaneously must be encoded with the same number of digits.
Often the data to be imported are in a text file. If this is the case, the easiest way to import the file into R
is via one of the workhorse of R, the read.table
function.
If the data are in the FSTAT
format, they can be readily imported using the function read.fstat
:
dip<-hierfstat::read.fstat(system.file("extdata","diploid.dat",package="hierfstat")) head(dip)
adegenet
is another population genetics analysis package, with the ability to import from several data format. hierfstat
can import and work directly with genind
objects generated by adegenet
:
library(adegenet,quietly=TRUE) data(nancycats) head(genind2hierfstat(nancycats)[,1:10]) # only the first 10 loci data(H3N2) head(genind2hierfstat(H3N2,pop=rep(1,dim(H3N2@tab)[1]))[,1:10]) # only the first 10 positions data(eHGDP) head(genind2hierfstat(eHGDP))[,1:11]
The example below shows how to estimate gene diversities (expected heterozygosities), observed heterozygosities, or estimate genetic distances directly from a genind
object:
#basic.stats(nancycats) hierfstat::Hs(nancycats) #mean populations gene diversities hierfstat::Ho(nancycats) # mean populations observed heterozygosities #genet.dist(nancycats)
Genomic datasets are composed of (very) large numbers of bi-allelic loci, called Simgle Nucleotide Polymorhisms, or SNP. A convenient and efficient format for storing SNP data is allelic dosage, the number of alternative alleles and individual carries at a locus. For a diploid locus, this number can be 0 if the individual carries two reference alleles, 1 if he is heterozygote or 2 if he carries 2 alternate alleles. The allelic dosage format is suitable for analyses with several of the hierfstat
functions, as described in the hierfstat
vignette.
To import genlight objects from adegenet
, just wrap the object's name in an as.matrix
:
```{R,message=FALSE} obj <-read.snp(system.file("files/exampleSnpDat.snp",package="adegenet"), chunk=2,quiet=TRUE)
as.matrix(obj)
## importing VCF files [Variant Call Format (VCF)](https://samtools.github.io/hts-specs/) files have become a standard for genomic data. This is the storage format used for the [1000 genomes](https://www.internationalgenome.org/) for instance. * VCF files can be imported in `hierfstat` using the function `read.VCF`, built from the `gaston` package function `read.vcf`: ```r library(gaston,quietly=TRUE) filepath <-system.file("extdata", "LCT.vcf.gz", package="gaston") x1 <- read.VCF( filepath ) x1 dim(x1) with(x1@snps,table(A1,A2))
by default, read.VCF
keeps only the bi-allelic SNPs (whereas gaston::read.vcf
keeps all variants), but you can choose to keep all variant by setting the argument BiAllelic
to FALSE.
as.matrix(x1)
will then provide allelic dosages.
SNPRelate
package, and its function snpgdsVCF2GDS
(As the function writes to a file, the two next code chunks will not be evaluated) library(SNPRelate) snpgdsVCF2GDS(filepath, "test1.gds", method="biallelic.only") snpgdsSummary("test1.gds") f<-snpgdsOpen("test1.gds")
snpgdsVCF2GDS
stores the number of reference alleles,
we want the the number of alternate alleles:
x2<-2-snpgdsGetGeno(f) #check that allele frequencies are the same with the two methods all.equal(colMeans(x2)/2,colMeans(as.matrix(x1)/2),check.names=FALSE)
bed
file using gaston
, and use an external program, plink, to convert the VCF file into a BED file. These are only 3 possibilities that I find convenient and efficient, but many other exist, thus feel free to import dosage data into hierfstat
by your preferred route.
While hierfstat
contains several built-in function to generate genetic data (e.g. sim.genot
), functions exist to import data from external, more sophisticated genetic data simulators, namely QuantiNemo and ms
QuantiNemo is a genetic simulation program for markers and traits. Data generated by Quantinemo
can be imported using the function read.fstat
if the save_genotype
setting in quantinemo
is set to $1$. If the QuantiNemo
output is set to $2$ (extended), 6 extra columns are outputed and these can be read with hierfstat
using the function qn2.read.fstat
. The component $dat
of the object return by this function contains the genotypes of the individuals simulated, while the component $sex
contains its sex, the component $ped
the individuals' pedigree and the component $w
their fitness. For more details about the extended FSTAT
format of QuantiNemo
, see its manual.
dat<-qn2.read.fstat(system.file("extdata","qn2_sex.dat",package="hierfstat")) names(dat) head(dat$sex) head(dat$dat[,1:10]) #sexbias.test(dat[[1]],sex=dat[[2]])
The program ms of Hudson is commonly used to generate genomic data.
I briefly discussed the ms
software. its output looks like this:
ms 200 100 -t 20 -I 2 100 100 40 -n 2 0.01
29161
//
segsites: 23
positions: 0.0689 0.2534 0.3219 0.3350 0.3547 0.3768 0.4339 0.4359 0.4388 0.4694 0.5003 0.5422 0.6575 0.6985 0.7059 0.7147 0.7453 0.7709 0.7891 0.8439 0.8779 0.8857 0.9380
00100001100000000000000
00001001000000000001000
The first line is the ms
command line, and it instructed the program to simulate 2 populations, with $\theta=2N_0\mu=20$. The 2 populations differ in size and the smallest (the second) is a 100th of the first. The two populations exchange $4Nm=40$ migrants per generation. 100 chromosomes are sampled from each population, and this is repeated a 100 times.
The genetic data itself comes as a serie of 0 and 1, collated one to the other. These are the SNP sites, with 0 being the ancestral state and 1 the derived state.
the function ms2bed
will convert ms output to a bed object:
bed<-ms2bed(system.file("extdata","2pops_asspop.txt",package="hierfstat")) head(as.matrix(bed[,1:10])) #first 10 columns of bed matrix
Take some time to explore the structure of the bed
s4 object (str(bed)
), it is very useful. the @ped
slot contains info relevant to the individuals (their names, family id etc...), while the @snps
slot contains info relevant to the SNPs (their name, position, chromosome, etc...). The dosage matrix is extracted with as.matrix(bed)
.This can be used as argument to all functions containing dosage
in their names, such as beta.dosage, pi.dosage, theta.Watt.dosage, TajimaD.dosage
. For instance, fs.dosage
will produce estimates of populations specific $F_{IS}$ and $F_{ST}$.
fs.dosage(as.matrix(bed),pop=rep(1:2,each=50)) # population specific FSTs
As a more interesting example, we can reuse the eHGDP
dataset we've encountered previously (see ?eHGDP
), and after having converted it to dosage data with the fstat2dos
function (there is a total of $8170$ alleles in the data set), we can look at inbreeding within populations and population structure using fs.dosage
:
dat.HGDP<-genind2hierfstat(eHGDP) dos.HGDP<-fstat2dos(dat.HGDP[,-1]) fs.HGDP<-fs.dosage(dos.HGDP,pop=dat.HGDP[,1])
We may be interested in finding which populations have either a large $F_{ST}^P>0.15$ or a low one $F_{ST}^P <0$:
eHGDP@other$popInfo[which(fs.HGDP$Fs[2,]>0.15),] eHGDP@other$popInfo[which(fs.HGDP$Fs[2,]<0.0),]
and you'll see that samples from AFRICA have low Population specific $F_{ST}$, while samples from AMERICA have large population specific $F_{ST}$
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.