knitr::opts_chunk$set(message = FALSE, warning = FALSE, tidy = FALSE) knitr::opts_chunk$set(fig.align = "center", fig.show = 'asis')
A multilocus genotype is a unique combination of alleles across two or more loci. For organisms such as plant pathogens that often reproduce clonally, multilocus genotypes are invaluable for identifying the mode and spread of an organism. This document will describe in detail how you can define multilocus genotypes/lineages and how you can use them for your analyses of both genclone and snpclone objects.
There are three different ways for constructing multilocus genotypes in poppr.
| Method | Description |
|----|----|
| "original"
| Default MLG definition; strict matching |
| "contracted"
| Collapsing MLGs by genetic distance with mlg.filter
|
| "custom"
| User-defined multilocus genotypes |
The first is a simply naïve construction where all alleles must match to make a
unique multilocus genotype. New in version 2.0 is the ability to use genetic
distance to collapse multilocus genotypes or define custom multilocus lineages
based on other information. You can use the function mll()
to display and
switch between different multilocus genotypes/lineages.
library("poppr") data(monpop) monpop head(mll(monpop, "original"), 20) # Showing the definitions for the first 20 samples
Notice how we got a warning. This is because the monpop
data set was created
before mll()
was implemented. Luckily, this warning gives us information on
how to fix it.
mll(monpop) <- "original" monpop
We can see that the description says
r gsub("original", "**original**", capture.output(monpop)[6])
. This is how we
know what the current MLG definition is for our object. Let' see what happens
when we change it.
mll(monpop) <- "custom" monpop
Now it says r gsub("custom", "**custom**", capture.output(monpop)[6])
. Of
course, if we look at the MLGs, we will see that they appear to be the same as
our original definition:
head(mll(monpop, "custom"), 20) # Showing the definitions for the first 20 samples
All of these functions will work for both genclone and snpclone objects. In this section, we will demonstrate how to construct each of these three multilocus genotype definitions in different data sets.
This is the default way poppr calculates multilocus genotypes. You see it
immediately when you load a genclone object. Let's use the example monpop
from
[@everhart2014finescale]. First, we need to reset the data to
our original MLGs.
mll(monpop) <- "original" monpop
You notice that there are 694 samples, yet only 264 multilocus genotypes in the data set over 13 loci. In this sub-section, I will demonstrate how these MLGs are constructed.
The naïve definition simply takes strings of alleles and compares them for equality. This method is quick and easily interpretable, but means that things like genotyping error, hyper-variable loci, and missing data all contribute to a unique genotype that might not be truly unique [@kamvar2015novel].
To give an example, let's imagine that we have five samples with alleles typed at two loci.
grid_example <- matrix(c(1, 4, 1, 1, 5, 1, 9, 1, 9, 4), ncol = 2, byrow = TRUE) rownames(grid_example) <- LETTERS[1:5] colnames(grid_example) <- c("x", "y") grid_example
We notice how they all have different multilocus genotypes. Indeed, if we analyze them, we see that there are five multilocus genotypes.
library("poppr") x <- as.genclone(df2genind(grid_example, ploidy = 1)) tab(x) # Look at the multilocus genotype table nmll(x) # count the number of multilocus genotypes mll(x) # show the multilocus genotype definitions
What we did above was to analyze the tab slot of the object showing the
counts of alleles per sample across the two loci. We can clearly see by eye that
no two samples are alike. The nmll()
function counts the number of
multilocus lineages in the sample. the mll()
function displays the
assignment of the multilocus lineages in the sample.
Note: we used
mll()
to show us our multilocus lineages. Later on, we will use this same function to switch between different definitions. See ?mll for details.
Now let's say we included another sample with partial missing data. Let's say it was typed as allele "5" at the x locus, but missing the allele at the y locus.
x <- as.genclone(df2genind(rbind(grid_example, new = c(5, NA)), ploidy = 1)) tab(x) # Note the missing data at locus 2. nmll(x) mll(x)
Now we see that we have 6 multilocus genotypes even though one of them contains missing data.
Let's also imagine that we included yet another sample that had a low-frequency allele at locus y, "2", but had the allele "5" at locus x.
grid_new <- rbind(grid_example, new = c(5, NA), mut = c(5, 2) ) x <- as.genclone(df2genind(grid_new, ploidy = 1)) tab(x) nmll(x) mll(x)
Again, we get completely different genotypes, but notice how the genotypes we added are closer to the genotype that shares the 5 allele. The reason this happens is because the naïve algorithm reports the rank of the concatenated alleles like so:
(xt <- apply(tab(x), 1, paste, collapse = "")) rank(xt, ties.method = "first")
Even though we have reason to believe that the samples "new" and "mut" might actually have the genotype x.5 and y.1 (the MLG of sample C), the algorithm sees them as different. If we calculate the pairwise euclidean distances between samples, we see that "new", "mut" and, "C" are very similar to each other:
library("ape") raw_dist <- function(x){ dist(genind2df(x, usepop = FALSE)) } (xdis <- raw_dist(x)) plot.phylo(upgma(xdis))
Therefore, we might want to consider "new", "mut" and, "C" to be the same Multilocus Lineage (MLL). In the next section, you will see how to collapse multilocus genotypes by genetic distance.
To remedy the issues with a naïve definition of multilocus genotypes, we can
utilize genetic distance, which will allow us to collapse multilocus genotypes
that are under a specific distance threshold. The main function we will use in
this section is mlg.filter()
. It will create a dissimiliarity distance matrix
from the data and then filter based off of that matrix. You can also utilize
your own supplied distance matrix or distance function.
To use mlg.filter()
, you need to define a thresold. This threshold defines the
minimum distance to consider two genotypes unique, so anything below that will
be clustered into the same MLG. In practice, it's used like this (using the data
and the euclidean distance described above):
x # We have 7 MLGs before filtering mlg.filter(x, distance = xdis) <- 1 + .Machine$double.eps^0.5 x # Now we have 5 MLGs mll(x) <- "original" # We'll reset to the naive definition
Here, we told mlg.filter()
to set the threshold to 1 plus a very tiny number
(this will be explained in the next section) based off of the distance matrix
xdis. This is a way to manipulate the object in place. In this section,
there will be examples of retrieving the MLGs and other statistics from the
function as well as how to use different distance functions and reset the MLGs.
As we saw above, in order to collapse all three new samples into the original 5
MLGs, we had to set a threshold of just above 1. To show you why, we will use
the function to return to us the MLGs it created and the thresholds that were
passed to cluster MLLs (by using the stats
argument).
mll(x, "original") mlg.filter(x, distance = xdis, threshold = 1, stats = c("mlg", "thresholds"))
In our output, we can see that MLG 5 has collapsed into MLG 3. This occured as soon as our threshold passed 0. Take a look at what happens when we add a very small number to 1.
(e <- .Machine$double.eps^0.5) # A very tiny number mlg.filter(x, distance = xdis, threshold = 1 + e, stats = c("mlg", "thresholds"))
Now that we have set the threshold to just above 1, we have 5 unique MLLs since MLG 3 collapsed into MLG 4. Because of the way clustering happens, it's important to point out what happens when you attempt to use a threshold of 0.
mlg.filter(x, distance = xdis, threshold = 0, stats = c("mlg", "thresholds")) mll(x, "original")
Notice that all the MLGs are the same as the originally defined MLGs. Using a threshold of zero allows two MLGs separated by zero genetic distance to remain mutually unique.
As demonstrated at the beginning of this section, if we wanted to reassign our
sample genotypes to the collapsed version, all we would have to do is use the
mlg.filter()
commands above, but use the <-
operator to specify the
threshold. Note that I will be using mlg.table()
to show the distribution of
the multilocus genotypes before and after filtering.
x mlg.table(x) # Before: 7 MLGs mlg.filter(x, distance = xdis) <- 1 + e x mlg.table(x) # After: 5 MLGs
Notice how the information you see has changed. You can see that it's printed out that you have 5 contracted multilocus genotypes, but you have some cryptic code underneath:
| symbol | meaning | |:-------:|:-------------------------------------| | [t] | threshold | | [d] | distance (can be matrix or function) | | [a] | algorithm (see the next section) |
Genclone and snpclone objects will always remember what parameters were used for filtering multilocus genotypes, but the only catch is that, if you use your own supplied distance (matrix or function), you must be very careful not to delete it or change the object name.
This means that you don't always have to specify the distance when assigning a threshold:
mlg.filter(x) <- 4.51 x mlg.table(x)
DANGER! DANGER! While this is a convenient function, if you use a matrix or function that was created by you, you must not delete it or change its name. Only the name of the matrix/function is stored, so if you delete the matrix/function used to inform your filtering, you will get an error:
rm(xdis) # NOOOOOO! try(mlg.filter(x) <- 1 + e)
cat(" Error: cannot evaluate distance function, it might be missing.")
Basically, your object thinks that there should be a matrix called xdis, but
it can't find it anywhere. But don't worry, we can restore it if we have the
function available. We will use the raw_dist()
function that we defined
earlier.
mlg.filter(x, distance = raw_dist) <- 1 + e x
The safest way, perhaps, is to use a function defined in poppr. For example, we'll use Bruvo's distance since it takes into account the real value of the alleles [@Bruvo:2004].
The arguments to your distance function will be stored in the object as well!
bruvo.dist(x, replen = c(1, 1)) mlg.filter(x, distance = bruvo.dist, replen = c(1, 1)) <- 0.44 x
Of course, our multilocus genotypes are not changed forever, they are
just stored in a different place. We can access the original, naïve multilocus
genotypes by using the mll()
function:
mll(x, "original") mll(x) # contracted mll(x) <- "original" mll(x) # original
Underlying mlg.filter
are three algorithms that decide what genotypes go
together [@kamvar2015novel]:
You can specify which algorithm you want to use in clustering with the
algorithm
argument. Each of these algorithms have different behaviors when it
comes to collapsing multilocus genotypes. In short, farthest neighbor is the
most conservative, nearest neighbor can have a chaining effect, and average
neighbor is somewhere in between. Your choice of algorithm really depends on
the biology of your organism.
To help visualize this, there is the function filter_stats()
, which will plot
the output of the filtering algorithm. For simplicity, we will use the Pinf
data set representing Phytophthora infestans samples from Mexico and South America [@goss2009population].
data(Pinf) Pinf pinfreps <- fix_replen(Pinf, c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2)) pinf_filtered <- filter_stats(Pinf, distance = bruvo.dist, replen = pinfreps, plot = TRUE)
We can see that the different algorithms behave similarly for small thresholds, but begin to differ over larger thresholds.
This function is useful for finding all thresholds at which MLLs are collapsed, which can help with choosing a threshold that collapses putative clones in our sample into lineages.
After you have chosen a genetic distance and a filtering algorithm, you must then decide on the threshold to represent the minimum genetic distance at which two individuals would be considered from different clonal lineages.
One method described in the literature of choosing a threshold is to look for an initial, small peak in the histogram of pairwise genetic distances and set the threshold to be between that peak and the larger peak `[@arnaud2007standardizing, @bailleul2016rclone]. This initial peak likely represents clones differentiated by a small set of random mutations. You can see this in the figure above at a threshold of ~0.11 for the "farthest neighbor" algorithm.
However, if this peak is not obvious, then another method is to look for the largest gap between all putative thresholds. For this, you can use the cutoff_predictor()
function with the output of filter_stats()
. It should be noted that this method is not a perfect solution. If we take the results from above, we can find the threshold for each algorithm:
print(farthest_thresh <- cutoff_predictor(pinf_filtered$farthest$THRESHOLDS)) print(average_thresh <- cutoff_predictor(pinf_filtered$average$THRESHOLDS)) print(nearest_thresh <- cutoff_predictor(pinf_filtered$nearest$THRESHOLDS))
Now we can define multilocus lineages for P. infestans with the following criteria:
| | | |
|:-------:|:----------|:----------|
| [t] | threshold | r signif(farthest_thresh, 3)
|
| [d] | distance | Bruvo's Distance |
| [a] | algorithm | Farthest neighbor |
mlg.filter(Pinf, distance = bruvo.dist, replen = pinfreps, algorithm = "f") <- farthest_thresh Pinf
Note: Please take care to critically evaluate the results and make sure it applies to your data. This function WILL give you an answer whether or not one truely exists. Additionally, For diploid organisms, another method of choosing a threshold is available in the RClone package that involves simulating outcrossing and inbreeding events [@bailleul2016rclone].
Sometimes multilocus genotypes are defined by more than just genetic data and it would be useful to be able to analyze these genotype definitions. Luckily, there is a way to do this. Poppr has support for custom multilocus genotypes. For example, we have a simulated data set that has 50 samples with 26 multilocus genotypes.
data(partial_clone) pc <- as.genclone(partial_clone) mll(pc)
Let's change the MLGs to letters instead of numbers. We will use mll.custom()
to do this.
LETTERS[mll(pc)] # The new MLGs mll.custom(pc) <- LETTERS[mll(pc)] mlg.table(pc)
This even works with minimum spanning networks:
pcpal <- colorRampPalette(c("blue", "gold")) set.seed(9001) pcmsn <- bruvo.msn(pc, replen = rep(1, nLoc(pc)), palette = pcpal, vertex.label.color = "firebrick", vertex.label.font = 2, vertex.label.cex = 1.5)
Let's say that we realized that we have strong evidence for MLG "Q" being the same as MLG "M". If we wanted to change those, we could simply change the factor levels:
mll.levels(pc)[mll.levels(pc) == "Q"] <- "M"
And we can plot again
set.seed(9001) pcmsn <- bruvo.msn(pc, replen = rep(1, nLoc(pc)), palette = pcpal, vertex.label.color = "firebrick", vertex.label.font = 2, vertex.label.cex = 1.5)
Notice how the minimum spanning network architecture stayed the same, but the labels had changed.
In the Data Import and Manipulation vignette, the first command demonstrated
was the poppr()
, command, which produced a table of diversity statistics, most
of which were calculated on counts of multilocus genotypes. These statistics can
be replicated by using the function diversity_stats()
. This function
calculates diversity statistics based off of a table of multilocus genotypes.
Let's analyze the diversity of the peach brown rot pathogen, Monilinia
fructicola, structured by Symptom (Fruit Rot:FR and Blossom Blight:BB) and Year
[@everhart2014finescale]. Let's first look at the distribution of
multilocus genotypes:
data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom montab <- mlg.table(monpop, strata = ~Symptom/Year)
We can see from these plots that the Fruit Rot (FR) have a lot more clones in the canopies than Blossom Blight (BB). This makes biological sense since Fruit Rot infections are clonally propagated, whereas Blossom Blight infections are from sexual propagules [@everhart2014finescale]. Let's look at the diversity metrics.
The function diversity_stats()
is used to get basic diversity statistics:
(monstat <- diversity_stats(montab))
We can get confidence intervals for these statistics using diversity_ci()
:
diversity_ci(montab, n = 100L, raw = FALSE)
You can see that there was a warning about centering the confidence interval.
The boxplots you see are the distribution of the bootstrapped replicates, but
they are known to be biased. We invite users to please read the documentation
for diversity_ci()
by typing ?diversity_ci
. It is very important to be careful
when interpreting these results because sometimes confidence the confidence
intervals exist outside of the possible range, as exemplified by
r paste(rownames(monstat)[monstat[, "E.5"] > 0.9], collapse = " and ")
.
In all of the diversity_*
functions, you can use your own custom diversity
statistics. A common one we get requests for is the clonal fraction,
$\frac{MLG}{N}$ or the number of multilocus genotypes over the number of
samples. You can add it in by writing your own function. Keep in mind, that you
should write it for both a matrix and a vector of counts if you want to be able
to bootstrap it.
myCF <- function(x){ x <- drop(as.matrix(x)) if (length(dim(x)) > 1){ # if it's a matrix res <- rowSums(x > 0)/rowSums(x) } else { # if it's a vector res <- sum(x > 0)/sum(x) } return(res) } (monstat2 <- diversity_stats(montab, CF = myCF))
You can use filtered or custom MLGs to compare diversity. Here, I'm filtering
genotypes in monpop
that are different by only a single mutational step
[@Bruvo:2004].
One mutational step for a single allele in Bruvo's distance is equivalent to 0.5, so a single mutational step for a haploid organism with 13 loci is 0.5/13.
# Repeat lengths are necessary reps <- fix_replen(monpop, c(CHMFc4 = 7, CHMFc5 = 2, CHMFc12 = 4, SEA = 4, SED = 4, SEE = 2, SEG = 6, SEI = 3, SEL = 4, SEN = 2, SEP = 4, SEQ = 2, SER = 4)) # Adding a little bit, so the threshold is included. e <- .Machine$double.eps^0.5 # Using the default farthest neighbor algorithm to collapse genotypes mlg.filter(monpop, distance = bruvo.dist, replen = reps) <- (0.5/13) + e montabf <- mlg.table(monpop, strata = ~Symptom/Year)
monpop@mlg <- new("MLG", monpop@mlg) filts <- c(260L, 179L, 168L, 168L, 167L, 221L, 152L, 133L, 144L, 78L, 78L, 79L, 81L, 44L, 40L, 40L, 40L, 38L, 119L, 120L, 93L, 29L, 10L, 239L, 38L, 93L, 96L, 172L, 114L, 60L, 72L, 82L, 78L, 129L, 138L, 89L, 203L, 120L, 34L, 21L, 21L, 222L, 32L, 104L, 95L, 95L, 203L, 190L, 80L, 95L, 95L, 82L, 82L, 21L, 95L, 95L, 222L, 138L, 51L, 222L, 222L, 222L, 222L, 222L, 104L, 212L, 95L, 222L, 170L, 95L, 251L, 35L, 258L, 151L, 83L, 156L, 25L, 241L, 130L, 210L, 163L, 234L, 196L, 205L, 233L, 159L, 161L, 227L, 216L, 216L, 206L, 161L, 216L, 161L, 161L, 194L, 161L, 47L, 157L, 161L, 70L, 161L, 216L, 161L, 216L, 207L, 204L, 134L, 216L, 204L, 161L, 56L, 136L, 161L, 159L, 216L, 161L, 194L, 161L, 204L, 47L, 227L, 70L, 174L, 161L, 47L, 134L, 70L, 134L, 47L, 216L, 216L, 55L, 70L, 194L, 216L, 161L, 161L, 216L, 216L, 216L, 70L, 216L, 47L, 47L, 110L, 197L, 161L, 42L, 258L, 258L, 235L, 256L, 85L, 18L, 103L, 52L, 14L, 57L, 250L, 213L, 77L, 62L, 195L, 5L, 106L, 53L, 148L, 192L, 112L, 71L, 185L, 19L, 31L, 178L, 153L, 20L, 101L, 96L, 111L, 59L, 54L, 199L, 54L, 99L, 54L, 242L, 212L, 28L, 91L, 65L, 212L, 40L, 175L, 175L, 175L, 184L, 175L, 212L, 176L, 91L, 91L, 122L, 44L, 91L, 91L, 91L, 175L, 91L, 175L, 91L, 28L, 175L, 175L, 65L, 65L, 28L, 63L, 175L, 125L, 91L, 91L, 175L, 126L, 91L, 28L, 91L, 93L, 91L, 91L, 91L, 91L, 27L, 91L, 65L, 91L, 175L, 90L, 184L, 220L, 175L, 175L, 175L, 91L, 91L, 91L, 91L, 65L, 91L, 91L, 93L, 91L, 91L, 91L, 91L, 91L, 28L, 90L, 91L, 222L, 95L, 21L, 95L, 175L, 95L, 95L, 95L, 222L, 122L, 173L, 173L, 222L, 222L, 105L, 222L, 222L, 222L, 222L, 222L, 34L, 222L, 211L, 92L, 80L, 3L, 222L, 92L, 80L, 173L, 222L, 262L, 222L, 261L, 261L, 222L, 95L, 222L, 222L, 222L, 222L, 222L, 222L, 113L, 261L, 73L, 261L, 95L, 261L, 73L, 222L, 172L, 95L, 172L, 80L, 93L, 21L, 95L, 60L, 21L, 21L, 95L, 95L, 95L, 95L, 95L, 95L, 211L, 95L, 80L, 95L, 246L, 211L, 95L, 96L, 95L, 95L, 96L, 124L, 177L, 95L, 222L, 95L, 222L, 82L, 95L, 203L, 120L, 120L, 173L, 222L, 173L, 95L, 173L, 37L, 173L, 124L, 222L, 37L, 173L, 173L, 173L, 173L, 173L, 173L, 173L, 154L, 154L, 173L, 222L, 173L, 95L, 173L, 95L, 222L, 222L, 173L, 120L, 21L, 120L, 120L, 95L, 173L, 173L, 222L, 219L, 104L, 67L, 64L, 67L, 255L, 217L, 255L, 225L, 61L, 258L, 36L, 252L, 181L, 88L, 110L, 50L, 237L, 224L, 33L, 237L, 164L, 110L, 12L, 12L, 68L, 12L, 12L, 25L, 255L, 110L, 76L, 259L, 25L, 164L, 254L, 164L, 258L, 110L, 110L, 25L, 50L, 258L, 110L, 237L, 12L, 68L, 76L, 68L, 12L, 180L, 164L, 76L, 12L, 12L, 255L, 12L, 181L, 251L, 47L, 237L, 67L, 67L, 36L, 237L, 110L, 255L, 255L, 50L, 12L, 47L, 12L, 67L, 237L, 200L, 200L, 68L, 127L, 258L, 237L, 243L, 74L, 187L, 196L, 115L, 209L, 130L, 187L, 187L, 164L, 181L, 100L, 237L, 12L, 196L, 257L, 146L, 140L, 54L, 139L, 242L, 54L, 242L, 87L, 183L, 242L, 149L, 54L, 54L, 59L, 242L, 213L, 2L, 54L, 242L, 139L, 242L, 226L, 59L, 244L, 208L, 242L, 242L, 242L, 242L, 242L, 242L, 22L, 182L, 242L, 242L, 182L, 54L, 87L, 43L, 242L, 242L, 183L, 140L, 140L, 242L, 54L, 199L, 9L, 87L, 242L, 242L, 213L, 242L, 182L, 18L, 18L, 48L, 242L, 242L, 54L, 22L, 191L, 87L, 59L, 242L, 140L, 165L, 19L, 86L, 242L, 242L, 4L, 4L, 54L, 22L, 236L, 54L, 242L, 242L, 242L, 54L, 9L, 96L, 146L, 87L, 87L, 208L, 214L, 18L, 214L, 98L, 189L, 189L, 98L, 98L, 75L, 16L, 189L, 162L, 115L, 110L, 16L, 46L, 110L, 209L, 135L, 25L, 193L, 189L, 84L, 84L, 16L, 245L, 232L, 196L, 196L, 115L, 227L, 115L, 245L, 110L, 110L, 1L, 227L, 196L, 110L, 16L, 187L, 1L, 150L, 196L, 209L, 193L, 16L, 209L, 209L, 193L, 180L, 11L, 196L, 1L, 110L, 16L, 55L, 75L, 115L, 180L, 193L, 1L, 198L, 193L, 110L, 209L, 64L, 110L, 209L, 16L, 209L, 16L, 217L, 209L, 16L, 1L, 189L, 180L, 115L, 110L, 255L, 109L, 16L, 115L, 16L, 115L, 227L, 245L, 110L, 162L, 1L, 189L, 25L, 12L, 187L, 186L, 115L, 115L, 49L, 67L, 209L, 16L, 198L, 110L, 16L, 98L, 127L, 201L, 1L, 16L, 16L, 196L, 74L, 115L, 228L, 1L, 110L, 231L, 110L, 110L, 110L, 215L, 142L, 7L, 218L, 230L, 58L ) monpop@mlg@mlg$contracted <- filts mll(monpop) <- "contracted" montabf <- mlg.table(monpop, strata = ~Symptom/Year)
(monstatf <- diversity_stats(montabf, CF = myCF)) monstat2 - monstatf # Take the difference from the unfiltered
We can see that filtered MLLs tend to be less diverse. This makes intuitive sense as it is creating larger classes of multilocus genotypes.
mll(monpop) <- "original"
The function diversity_ci()
and diversity_boot()
have the option to perform
jack knife rarefaction calculations. This means that your data will be randomly
sub-sampled to either the smallest population size, or whatever is specified
in the parameter n.rare
, whichever is bigger. Here's an example with the
previous data set:
(monrare <- diversity_ci(montab, n = 100L, rarefy = TRUE, raw = FALSE))
This can give you comparable estimates of diversity when not all samples are of equal size.
Clone-correction works hierarchically and only uses the first MLG copy
encountered in the data per population. This is straightforward for naïve MLGs,
but for MLLs collapsed by genetic distance or custom MLLs, this might change the
results very slightly. As an example, let's look at the monpop
filtered
MLLs that we created earlier.
nmll(monpop, "original") nmll(monpop, "contracted") mll(monpop) <- "contracted"
To show how the order of the samples can affect the sampling, we will take the sum of all pairwise distances between clone-corrected samples (corrected without respect to populations):
monpop %>% clonecorrect(strata = NA) %>% # 1. clone correct whole data set dist() %>% # 2. calculate distance sum() # 3. take the sum of the distance
Now, what happens when we randomly sample individuals?
set.seed(999) monpop[sample(nInd(monpop))] %>% # 1. shuffle samples clonecorrect(strata = NA) %>% # 2. clone correct whole data set dist() %>% # 3. calculate distance sum() # 4. take the sum of the distance set.seed(1000) monpop[sample(nInd(monpop))] %>% # 1. shuffle samples clonecorrect(strata = NA) %>% # 2. clone correct whole data set dist() %>% # 3. calculate distance sum() # 4. take the sum of the distance
Notice how we are getting different results based on the order of samples. This does not mean that the procedure doesn't work, it just means that we must be careful when clone-correcting modified multilocus genotypes.
We have demonstrated here new methods for treating multilocus genotypes on
microsatellite data, but it is important to remember that all of these functions
can work with any source of data stored in either a genclone or snpclone
(derived from genlight) object. Especially with mlg.filter()
, these new
functions will allow for a more flexible analysis of WGS data of clonal
organisms where true clones may differ by more than a few
mutations/errors/missing data. As with all analyses, it is important to
understand the algorithms used and take them into account when interpreting
results.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.