Introduction

This vignette documents how to import or enter genotypic data for the hierfstat package. Originally this package was written to estimate and test hierarchical F-statistics, but was then further developed and now include almost all features of the Fstat program (no longer maintained), as well as several others.

Format required by most functions in Hierfstat

The data types that hierfstat can analyse are haploid or diploid, unphased, multilocus genotypes. Note that each data set must be made of only one ploidy level.

The basic data structure required by most Hierfstat function is a data frame with the first column containing a population identifier (preferably a number), and the next $nl$ columns the genotype at each of $nl$ loci.

In hierfstat, alleles are encoded as 1, 2 or 3 digits numbers, and genotypes are encoded as numbers with the two alleles collated (as in pasted together). Other type of data can be imported (see below) but for the time being we focus on the primary data type. Thus imagine that you have an individual genotyped at a microsatellite locus with allele length 120 and 124, the way to encode it for hierfstat is either 120124or 124120. If the data are SNPs, each allele at a locus could be encoded as 1 and 2, or you may decide to keep the correspondence between nucleotides and alleles (e.g. 1, 2, 3, 4 for A, C, G, T). Thus, if the two alleles at a SNP locus are A and T and an individual is heterozygote, it could be encoded as 14or 41.

Example data sets are included in hierfstat. For instance:

library(hierfstat) #load the library
data(diploid) # info about this data set with ?diploid
head(diploid)

The first individual (first row of the diploid data frame) belongs to population 1. Its genotype at loc-1 is 44, thus homozygote for allele 4. It is heterozygote for alleles 3and 4 at both loc-2 and loc-3, and homozygote for allele 3 at loc-4 and finaly homozygote for allele 4 at loc-5. In fact, loc-1 and loc-4 are monomorphic, meaning that only one allele is present in all individuals from all populations.

If a genotype is missing, it is encoded as NA. For instance, the fourth individual has not been typed at loc-3, nor did the 6th individual for the same locus.

The first column of this dataframe contains the identifier of the population to which the individual belongs. We can find how many individuals were typed in each population by using the command table:

table(diploid[,1])

As another example, we look at dataset cont.isl99, a data frame where alleles are encoded as 2 digits numbers:

data(cont.isl99)
head(cont.isl99)

the first individual is homozygous for allele 74 at the first locus (loc.1) and heterozygous fore alleles 19 and 55 at the second locus. The genotype could have been written 5519 instead of 1955, it does not matter. Note the genotype of the 3rd and fourth individual at the first locus. They both carry allele 8, which is in fact encoded as 08. When it comes first, the leading 0 disappears, but it must be present in second position. Hence genotype 874, 0874 and 7408 are the same, but different from genotype 748 who would be understood by hierfstat as an individual heterozygous for alleles 07 and 48.

Last point: alleles for all loci to be analysed simultaneously must be encoded with the same number of digits.

Importing data files

Often the data to be imported are in a text file. If this is the case, the easiest way to import the file into R is via one of the workhorse of R, the read.table function.

Importing FSTAT data files

If the data are in the FSTAT format, they can be readily imported using the function read.fstat:

dip<-hierfstat::read.fstat(system.file("extdata","diploid.dat",package="hierfstat"))
head(dip)

Importing from adegenet: genind objects

adegenet is another population genetics analysis package, with the ability to import from several data format. hierfstat can import and work directly with genind objects generated by adegenet:

library(adegenet,quietly=TRUE)
data(nancycats)
head(genind2hierfstat(nancycats)[,1:10]) # only the first 10 loci
data(H3N2)
head(genind2hierfstat(H3N2,pop=rep(1,dim(H3N2@tab)[1]))[,1:10]) # only the first 10 positions
data(eHGDP)
head(genind2hierfstat(eHGDP))[,1:11] 

The example below shows how to estimate gene diversities (expected heterozygosities), observed heterozygosities, or estimate genetic distances directly from a genind object:

#basic.stats(nancycats)
hierfstat::Hs(nancycats) #mean populations gene diversities
hierfstat::Ho(nancycats) # mean populations observed heterozygosities
#genet.dist(nancycats)

Allelic dosages

Genomic datasets are composed of (very) large numbers of bi-allelic loci, called Single Nucleotide Polymorhisms, or SNP. A convenient and efficient format for storing SNP data is allelic dosage, the number of alternative alleles and individual carries at a locus. For a diploid locus, this number can be 0 if the individual carries two reference alleles, 1 if he is heterozygote or 2 if he carries 2 alternate alleles. The allelic dosage format is suitable for analyses with several of the hierfstat functions, as described in the hierfstat vignette.

Importing from adegenet: genlight objects

To import genlight objects from adegenet, just wrap the object's name in an as.matrix:

```{R,message=FALSE} obj <-read.snp(system.file("files/exampleSnpDat.snp",package="adegenet"), chunk=2,quiet=TRUE)

as.matrix(obj)

## importing VCF files 

[Variant Call Format (VCF)](https://samtools.github.io/hts-specs/) files have become a standard for genomic data. This is the storage format used for the [1000 genomes](https://www.internationalgenome.org/) for instance. 

* VCF files can be imported in `hierfstat`  using the function `read.VCF`, built from the `gaston` package function `read.vcf`:


```r
library(gaston,quietly=TRUE)
filepath <-system.file("extdata", "LCT.vcf.gz", package="gaston")
x1 <- read.VCF( filepath )
x1
dim(x1)
with(x1@snps,table(A1,A2))

by default, read.VCF keeps only the bi-allelic SNPs (whereas gaston::read.vcf keeps all variants), but you can choose to keep all variant by setting the argument BiAllelic to FALSE.

as.matrix(x1) will then provide allelic dosages.

library(SNPRelate)
snpgdsVCF2GDS(filepath, "test1.gds", method="biallelic.only")
snpgdsSummary("test1.gds")
f<-snpgdsOpen("test1.gds")

snpgdsVCF2GDS stores the number of reference alleles, we want the the number of alternate alleles:

x2<-2-snpgdsGetGeno(f) 

#check that allele frequencies are the same with the two methods
all.equal(colMeans(x2)/2,colMeans(as.matrix(x1)/2),check.names=FALSE) 

These are only 3 possibilities that I find convenient and efficient, but many other exist, thus feel free to import dosage data into hierfstat by your preferred route.

Data generated by external genetic data simulators

While hierfstat contains several built-in function to generate genetic data (e.g. sim.genot), functions exist to import data from external, more sophisticated genetic data simulators, namely QuantiNemo and ms

Importing from Quantinemo

QuantiNemo is a genetic simulation program for markers and traits. Data generated by Quantinemo can be imported using the function read.fstat if the save_genotype setting in quantinemo is set to $1$. If the QuantiNemo output is set to $2$ (extended), 6 extra columns are outputed and these can be read with hierfstatusing the function qn2.read.fstat. The component $dat of the object return by this function contains the genotypes of the individuals simulated, while the component $sexcontains its sex, the component $ped the individuals' pedigree and the component $w their fitness. For more details about the extended FSTAT format of QuantiNemo, see its manual.

dat<-qn2.read.fstat(system.file("extdata","qn2_sex.dat",package="hierfstat"))
names(dat)
head(dat$sex)
head(dat$dat[,1:10])
#sexbias.test(dat[[1]],sex=dat[[2]])

Importing from ms

The program ms of Hudson is commonly used to generate genomic data, and the program msprime has a wrapper for ms named mspms.

I very briefly discussed the output of the ms software below, see its documentation for more informations and details of arguments to be passed to ms.

The output looks like this:

ms 200 100 -t 20 -I 2 100 100 40 -n 2 0.01

29161

//

segsites: 23

positions: 0.0689 0.2534 0.3219 0.3350 0.3547 0.3768 0.4339 0.4359 0.4388 0.4694 0.5003 0.5422 0.6575 0.6985 0.7059 0.7147 0.7453 0.7709 0.7891 0.8439 0.8779 0.8857 0.9380

00100001100000000000000

00001001000000000001000

The first line is the ms command line. It instructs the program to simulate 2 populations (-I 2) , with $\theta=2N_0\mu=20$ (-t 20). The 2 populations differ in size and the smallest (the second) is a 100th of the first (-n 2 0.01). The two populations exchange $4Nm=40$ migrants per generation and 100 chromosomes are sampled from each population (-I 2 100 100 40), and this is repeated a 100 times (hence ms 200 100).

The second line is a random number allowing to reapeat exactly the same sequence

The third line is a line of comments

The fourth line contains the positions of the different segregating sites

The fifth and following lines contain the genetic data, as a serie of 0 and 1 (0 is the ancestral and 1 the derived allele), collated one to the other. These are the SNP sites, with 0 being the ancestral state and 1 the derived state.

the function ms2bed will convert ms output to a bed object:

bed<-ms2bed(system.file("extdata","2pops_asspop.txt",package="hierfstat"))
head(as.matrix(bed[,1:10])) #first 10 columns of bed matrix

Take some time to explore the structure of the bed s4 object (str(bed)), it is very useful. the @ped slot contains info relevant to the individuals (their names, family id etc...), while the @snps slot contains info relevant to the SNPs (their name, position, chromosome, etc...). The dosage matrix is extracted with as.matrix(bed). This can be used as argument to all functions containing dosagein their names, such as beta.dosage, pi.dosage, theta.Watt.dosage, TajimaD.dosage. For instance, fs.dosage will produce estimates of populations specific $F_{IS}$ and $F_{ST}$.

fs.dosage(as.matrix(bed),pop=rep(1:2,each=50)) # population specific FSTs

As a more interesting example, we can reuse the eHGDP dataset we've encountered previously (see ?eHGDP), and after having converted it to dosage data with the fstat2dos function (there is a total of $8170$ alleles in the data set), we can look at inbreeding within populations and population structure using fs.dosage:

dat.HGDP<-genind2hierfstat(eHGDP)
dos.HGDP<-fstat2dos(dat.HGDP[,-1])
fs.HGDP<-fs.dosage(dos.HGDP,pop=dat.HGDP[,1])

We may be interested in finding which populations have either a large $F_{ST}^P>0.15$ or a low one $F_{ST}^P <0$:

eHGDP@other$popInfo[which(fs.HGDP$Fs[2,]>0.15),]
eHGDP@other$popInfo[which(fs.HGDP$Fs[2,]<0.0),]

and you'll see that samples from AFRICA have low Population specific $F_{ST}$, while samples from AMERICA have large population specific $F_{ST}$



jgx65/hierfstat documentation built on April 20, 2023, 8:34 a.m.