Documentation already exists for the Analysis of Multilocus Genotypes and Lineages in poppr using the genind and genclone objects. This can be viewed in the poppr package's vignettes.

RShowDoc('mlg', package='poppr')

Data resulting from high throughput sequencing projects require greater performance than these objects were originally designed for. This document is intended to provide a similar analytical path using objects of class genlight and snpclone which have been designed specifically for high throughput applications.

Multilocus genotypes

A multilocus genotype (mlg) is a unique combination of alleles across two or more loci. In populations of clonal or partially clonal organisms the observation of an mlg in a genotyping project may indicate that the same individual was sampled muliple times (e.g., different hyphae of a mycelium). This typically violates assumptions of population genetics analyses. One way to address this is to 'clone correct' the data by selecting one of each mlg for analysis.

Creation of a snpclone object

The creation of snpclone objects is covered elsewhere. Here we use the pinfsc50 dataset read in as a VCF file and converted to an object of class snpclone.

# Load libraries
library(vcfR)
library(pinfsc50)
library(poppr)

# Determine file locations
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz",
                        package = "pinfsc50")

# Read data into memory
vcf <- read.vcfR(vcf_file, verbose = FALSE)

# Convert to genlight object
my.genlight <- vcfR2genlight(vcf)

# Convert to snpclone object
my.snpclone <- poppr::as.snpclone(my.genlight)

We now have an object of class snpclone and can query its contents.

my.snpclone

Setting the strata slot

The strata slot on genlight and snpclone objects is intended to hold hierarchical population information. For example, if you sampled from a plot within a field that is within a county and a state you could have columns for each of these hierarchical levels. This can be seen as a place to store the hierarchical strata you will you as well as strata you may think you want to use but are not sure of yet.

The strata slot is really a sort of holding place for information to be populated into the pop slot. This is intended to allow for differing parameterization of the pop slot to be made in a rapid and easy manner. The pop slot can be set manually by providing a vector that is the same length as the number of individuals in the sample. Or it can be setPop method. We'll go over this step by step.

First we'll need to set the strata slot. By default, it appears to be set to NULL.

strata(my.snpclone)

We'll create a data.frame with the same number of rows as we have samples and as many columns as we have levels of hierarchy. For this example we only have one level of hierarchy.

strata.df <- data.frame( matrix( ncol=1, nrow=length( indNames(my.snpclone) ) ) )
rownames(strata.df) <- indNames(my.snpclone)
colnames(strata.df) <- 'Pop'
strata.df$Pop[c(2,4,8,11,14)] <- 'US1'
strata.df$Pop[c(1,3,6,13,16)] <- 'US'
strata.df$Pop[c(5,9,10,12,17,18)] <- 'Pinf'
strata.df$Pop[c(7)] <- 'MX'
strata.df$Pop[15] <- 'Pmir'

We can now set teh strata slot with this data.frame.

strata(my.snpclone) <- strata.df

Once we've populated the strata slot, we can use a formula, based on the column names of the data.frame used for the strata slot, to populate the pop slot.

setPop(my.snpclone) <- ~Pop

This is a fairly simplistic example with only one level of hierarchy. If we had more levels of hierarchy we could imagine a nested design.

setPop(my.snpclone) <- ~State/County/Field/Plot

Or, perhaps:

setPop(my.snpclone) <- ~Species/Metapopulation/Population

One of the feature of using formulae to populate the pop slot is that it manages the formating of the hierarchical terms for you. More information on formulae in R can be found at its man page.

?formula

High throughput MLG

High throughput genotyping results in thousands, tens of thousands, hundreds of thousand or more variants. This means that even within clones we expect to observe some level of variation. Some fraction of this observed variation may be real somatic mutation. A larger fraction is more likely the result of genotyping error. Two forms of mlg appear relevant to high throughput sequencing assays: a distance threshold ("contracted") or a custom specification defined by the user ("custom").

The mlg slot of the snpclone object contains information on mlg specification for the samples. Information can be queried from this slot using the mll() method.

mll(my.snpclone)

The default behaviour appear to be that each sample is assigned to a unique mlg. Below we'll learn how to set this slot to something that is (hopefully) of biological relevance.

Filtered ("contracted") mlg

A filtered mlg uses a distance based threshold to call all individuals that are less than that threshold of distance apart as the same mlg. There are a few ways to determine this threshold.

A rather simplistic attempt to determine a threshold for mlg calling is to create a histogram of distances and look for modality. Here we use the bitwise.dist() function that was designed specifically to deal with the bitwise encoded data found in genlight and snpclone objects.

hist( bitwise.dist(my.snpclone), col="forestgreen", xlab="Proportion of loci different", breaks=seq(0,0.5,by=0.003) )

We can see a break in the data at difference of 20% of the loci. This threshold can be used to sort the samples into mlgs using the mlg.filter() method. Adjusting the widths of the bins (using the by=0.01 parameter) may elucidate other structure in the dataset.

mlg.filter(my.snpclone, threshold=0.2, distance=bitwise.dist)

Thsi is successful at seperating the one sample that is a different species (P. mirabilis) from the rest of the sample (P. infestans). Note that the sample size is 18, so there are 17 comparisons among this one sample and the 17 other samples. This is why the density above 0.3 is greater than one, even though this threshold only seperates one individual. It also explains why the density is greater than 17 below the threshold of 0.2.

Minimum spanning networks

We can also use minimum spanning networks to visualize how a threshold may affect the inference of structure in our data. The function poppr.msn() includes a threshold to aggregate samples into mlgs.

set.seed(1)
g1 <- poppr.msn(gid=my.snpclone, distmat=bitwise.dist(my.snpclone), threshold = 0.08, include.ties = TRUE, 
                vertex.label.color = "firebrick", vertex.label.font = 2)

In our minimum spanning network each node is a mlg. The diameter of each node is proportional to the number of samples included in the node. The nodes are colored by pie charts that colored according to the population membership of the samples. The widths of the edges are proportional to the similarity among mlgs where mlgs that have a high similarity are connected by a wide edge and mlgs that have a low similarity are connected by a narrow edge.

mlg.filter(my.snpclone) <- 0.08
mlg.table(my.snpclone)

Note that teh threshold used here is sensitive to what measure of distance used. Different distance metrics have different ranges and may scale differently as well. One distance metric shouldbe used in order to provide comparisons among these methods. If a new distance metric is desired, all of teh above analyses should be redone.

Custom mlg

Tie breakers (algorithms)

Underlying mlg.filter are three algorithms that decide what genotypes go together (Kamvar, Brooks, and Grünwald 2015):

These algorithms are documented in the poppr vignette on 'Analysis of Multilocus Genotypes and Lineages in poppr.'

RShowDoc('mlg', package='poppr')

In the subsection titled 'Tie breakers ' of the section 'Filtered (“contracted”).' These algorithms should behave the same for high locus datasets as in the data set in the vignette, so the reader is referred to the original vignette. Here we simply demonstrate that we can recreate the plot produced by filter_stats().

pinf_filtered <- filter_stats(my.snpclone, distance = bitwise.dist, plot = TRUE)

Our snpclone object

my.snpclone

Notice that in the mlg slot it indicates the threshold ([t]), distance metric ([d]) and algorithm ([a]) used.

mlg.id(my.snpclone)

Diversity statistics

tab <- mlg.table(my.snpclone, plot=FALSE)
diversity_stats(tab)

Clone correction

Glossary

Multilocus genotype (MLG): Individuals that are identical in state across a panel of molecular markers.

Multilocus lineage (MLL): clusters of multilocus genotypes that belong to the same genet. This cluster of individuals are related due to their being the product of a, predumedly, single event of sexual reproduction. These may differ from one another due to somatic mutation and/or genotyping error.

Clone correction



knausb/vcfR_docs documentation built on May 20, 2019, 12:52 p.m.