require(adegenet) require(irelr) require(data.table) require(moments) require(reshape2) require(ggplot2) require(gridExtra)
irelr
is a package that implements the a Monte Carlo approach for reconstructing
pedigrees from genetic data information. It assumes that unbiased population allele
frequency data is available. In some cases, it is possible to obtain this information
from the avaialble genetic data --- however, this assumes that the sample consists
mainly of unrelated
individuals. If your sample consists of many related
individuals, then independently derived population allele frequency data is
necessary for irelr
to be useful.
The most straight forward way of getting under way in using irelr
is by having
the data in a standard genepop
format, with alleles encoded with 3 digits.
The data format specifications can be found
here.
Ultimately, the input format is a genind
object from the R
package adegenet
(by Thibaud Jombart et al. link.
Information on how to convert several data formats into a genind
object can
be found here.
irelr
assumes a single population.
Below are the first 20 lines of the nancycats
dataset available in the package
adegenet
. Once the adegenet
package is loaded, it is possible to find out
more about the data set by typing ?nancycats
in R
's command prompt.
cat(readLines(system.file("files/nancycats.gen",package="adegenet"), n = 20), file = '', sep = "\n")
If the data is in a genepop
format, it can be easily loaded by using the
function read.genepop()
from the adegenet
package (note that the file
must have the .gen
extention):
# set file path location. substitute this for the actual path of your file # note the need for a .gen extention. otherwise, an error will be generated. file_path <- system.file("files/nancycats.gen", package = "adegenet") # having a look at the file path file_path # loading the data into a variable called data # NOTE: Here, I am using a version of read.genepop that is supplied with # irelr. It is a direct copy of the function found in the development version # of adegenet. The current version of adegenet, found on CRAN, does not # parse missing data correctly. data <- irelr::read.genepop(file = file_path, quiet = T) # having a look at data # notice that data is an S4 class of type 'genind' # it includes 237 individuals, genotyped at 9 loci # there are 108 alleles across the genotyped loci data # number of indivduals n_ind <- adegenet::nInd(data) n_ind # number of loci n_loc <- adegenet::nLoc(data) n_loc # number of alleles per locus n_all <- adegenet::nAll(data) n_all # total number of alleles sum(n_all) # total number of populations # irel will ignore the population division and attempt to estimate the # pedigree of all individuals in the data file n_pop <- adegenet::nPop(data) n_pop # irelr also assumes that individuals have unique identifiers, # if your genepop does not have unique identifiers when you # load your file, a warning will be issued telling you that # there individual idenfiers are NOT unique, and that # row IDs will be used instead. Row IDs will be the # row number. Therefore, the individuals will named with sequential numbers # from 1 to total number of individuals # this is the case for the nancycats data adegenet::indNames(data) # this can be fixed by changing the data in the data file so that each # individual has a unique identifier. # alternatively, a list of names can be manually created and assigned to the # the 'genind' object: data(nancycats, package = "adegenet") ind_identifiers <- adegenet::indNames(nancycats) ind_identifiers # the easiest way to do this would be to prepare a list of individual # in the same order as they appear in genpop file using Excel. The # names should be in a single column of the spreadsheet. # save the list as a Tab delimited file. Once you have do so, # you should have a file that looks like this: # ind1 # ind2 # ... # indN # # you can then load your own list of identifiers in the following way #ind_identifiers <- read.table(file = "/path/to/file/identifier.txt", header = F, quote = F, row.names = F, stringsAsFactors = F) # you can then run the following code: adegenet::indNames(data) <- unlist(ind_identifiers) adegenet::indNames(data) # however, by the far the easiest way would be to code your individual names # into the genpop file.
sim_rel()
functionOnce the data is loaded, we can use the function estimate_rel()
to
estimate relatedness indices for all pairs of individuals in the data set.
# estimate relatedness indices rel_estimates <- irelr::estimate_rel(data = data) # create a data.table with the results # this will make it easier down the line to parse out individual indices rel_est_dt <- data.table(rel_estimates) rel_est_dt
The output of the estimate_rel()
function is a data.frame
, which we transform
in to a data.table
(this requires the data.table
package). The output includes
to columns --- ind1
and ind2
--- that indicate for which two individuals the
indices in the row refer to.
Eight indices are currently calculated:
In addition, two other relevant pieces of information are outputted:
prop_alleles_shared
: proportiono of alleles shared between the dyad.
Thus, for instance, if the dyad are typed at 7 loci, there are 14 potential
alleles to be shared.prop_loci
: proportion of loci that share alleles. For instance, if a dyad
is typed at 7 loci, and it shares alleles at 5, then this variable is 5/7.Finally, the function also reports the number of missing loci (n_missing_loci
),
which are the number of loci that did not have an allele call for at least one
of the pair. Thus, relatedness indices are only calculated based on loci that
are genotyped for both individuals in the dyad. The loci with missing genotype
calls for a particular dyad are listed in the missing_loci
column. If multiple
loci are missing, they are separated by a "." (e.g., loc1.loc2
, would mean that
loc1 and loc2 are missing). This will be used later to assign classification error rates
based on the available data for a dyad.
The distribution of particular index can be displayed as follows:
ggplot(rel_est_dt, aes(x = LR99_avg)) + geom_histogram() + xlab("Lynch and Ritland (1999) index") + ylab("Count")
We can divide up the histgram by missing loci (below). Here, the top left panel
shows the distribution of indices for pairs that had complete datasets. Otherwise,
the panel label indicates the missing loci. As we can see, we have one pair
that is missing loci fca8
, fca23
, fca37
, and fca77
(bottom row, second
panel from the left).
ggplot(rel_est_dt, aes(x = LR99_avg)) + geom_histogram() + xlab("Lynch and Ritland (1999) index") + ylab("Count") + facet_wrap(facets = ~ missing_loci, ncol = 4, scale = 'free')
We can easily examine the joint distribution of two indices as well (below).
As we can see, two estimators can often disagree substantially in their
estimates. Their are pairs of individuals with LR99_avg
values that are
over 0.5 (indicating high relatedness), for which their W02_cor
value is
very close to 0 (indicating low relatedness). While we have not carried out
any experiments to test this, this suggests there might be value in using
information from multiple indices when reconstructing pedigrees.
ggplot(rel_est_dt, aes(x = LR99_avg, y = W02_cor)) + geom_point() + geom_density2d(colour = "red") + geom_abline(slope = 1, intercept = 0, size = 2, colour = 'navyblue') + xlim(c(-1,1)) + ylim(c(-1,1)) + xlab("Lynch and Ritland (1999) index") + ylab("Wang (2002) corrected index")
In irelr
one can simulate any relatedness category by simply figuring out
the k
vector. The k
vector is a vector of 3 numbers, that must sum to 1.0.
Each value gives the proportion of loci for which a pair of individuals share
0, 1, and 2 ancestors. Thus, for a pair of unrelated individuals, there 0.0
loci for which they share 2 ancestors, 0.0 for which they share 1 ancestor, and
1.0 loci for which they share 0 ancestors. Thus, the k
-vector is specified
as c(1.0, 0.0, 0.0)
.
The k
-vectors for the most usual relatedness categoris are:
c(1.0, 0.0, 0.0)
c(0.5, 0.5, 0.0)
c(0.25, 0.50, 0.25)
c(0.0, 1.0, 0.0)
The expected relatedness value for each category can be estimated by the
weighted average of number of ancestors by k
-value, normalized by the
total number of possible ancestors (2). Thus, for parent-offspring dyads:
$$ po = 0.5 \times (0 \times 0.0 + 1 \times 1.0 + 2 \times 0.0) $$
A common case where reconstructing the pedigree of pairs of individuals might
come in handy, is to differentiate between half-sibs and full-sibs in samples
taken from different bird nests. This could be used to get an estimate of
the number of nests with extra-pair chicks, and thus give an estimate of number
of extra-pair matings in a season for a particular bird species. Here, we will
just focus on how to classify the data from the nancycats
dataset into half-sibs
and full-sibs, but if one new which pairs belonged to each nest, it would be
possible to classify the whole dataset, and then examine pairs by nest
grouping.
sim_rel()
functionBelow, we demonstrate the use of the irelr
function sim_rel()
to generate
10,000 half-sib dyads and 10,000 full-sib dyads, and calculate for each simulated dyad the relatedness
estimators defined above.
Note: Here, irelr
uses estimates of allele frequency derived directly
from the dataset. As we see above for the Lynch and Ritland (1999) index,
the great majority of pairs have an index close to 0. Therefore, it is
reasonable to assume that the majority of pairs are unrelated, and we can
obtain an estimate of population allele frequencies that is reasonably
unbiased by a large proportion of highly related pairs. Eventually, irelr
will
allow for inputting user defined allele frequencies that are independently
derived from the dataset. These are expected to generate more a robust
pedigree reconstruction.
# set number of dyads to simulate nreps <- 10000 # create a half-sib k-vector hs_kvector <- c(0.5, 0.5, 0.0) # simulate half-sib dyads, and output relatedness values half_sibs <- irelr::sim_rel(data = data, reps = nreps, k_vector = hs_kvector) half_sibs_dt <- data.table::data.table(half_sibs) half_sibs_dt # add a relatedness category column to be used later half_sibs_dt[, rel_cat := rep('hs', nreps)] half_sibs_dt[, sim_ix := 1:nreps] # create a full-sib k-vector fs_kvector <- c(0.25, 0.5, 0.25) # simulate full-sib dyads, and output relatedness values full_sibs <- irelr::sim_rel(data = data, reps = nreps, k_vector = fs_kvector) full_sibs_dt <- data.table::data.table(full_sibs) full_sibs_dt # add a relatedness category column to be used later full_sibs_dt[, rel_cat := rep('fs', nreps)] full_sibs_dt[, sim_ix := 1:nreps] # create a single data.table with all the simulations sims <- rbindlist(list(half_sibs_dt, full_sibs_dt)) sims
Below, we demonstrate how to examine the distribution of the Lynch and
Ritland (1999) index for both simulated relatedness categories. Two options
for visualisation are shown, a density
plot and a boxplot
option. As we
can see, there is substantial overlap in the distribution of these two
indices. This is expected to contribute to a high mis-classification rate.
ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) + geom_density(alpha = 0.5) + xlab("Lynch and Ritland (1999) index") + ylab("Density") + scale_fill_discrete(name = "Relatedness categories", labels = c("Full-Sibs", "Half-Sibs")) + theme(legend.justification=c(1,0), legend.position=c(1,0.75))
ggplot(sims, aes(y = LR99_avg, x = rel_cat)) + geom_boxplot() + ylab("Lynch and Ritland (1999) index") + xlab("Relatedness Category")
Below, we plot the joint distribution of index values for the full-sibs
(y
axis) and half-sibs
(x
axis). The bold navy blue line is the
equal index value line. The red contour lines indicate the density distribution,
and the grey points are the simulated values. The vertical and horizontal black
lines indicate the expected relatedness index value for the half-sib
and
full-sib
categories, respectively. The figure demonstrates that full-sibs
are expected to have a larger index value than half-sibs
, which is what
one woud expect. It also shows that there is a non-trivial proportion of
half-sibs
that have a larger value than full-sib
pair.
# transform output to a new data.table that only has Lynch and Ritland 1999 values sims_lr99 <- dcast.data.table(data = sims, formula = sim_ix ~ rel_cat, value.var = 'LR99_avg') sims_lr99
ggplot(sims_lr99, aes(x = hs, y = fs)) + geom_point(colour = 'gray50') + ylab("Full-sibs") + xlab("Half-sibs") + xlim(c(-1,1)) + ylim(c(-1,1)) + geom_abline(intercept = 0, slope = 1, size = 2, colour = 'navyblue') + stat_density2d(colour = 'red', n = 40, size = 1.5) + geom_vline(xintercept = 0.25) + geom_hline(yintercept = 0.50)
We can calculate the probability that the index value for a full-sib
dyad
is smaller than the index of a half-sib
dyad by estimating the proportion
of simulations for which $LR99_{fs} > LR99_{hs}$:
sims_lr99[,sum(fs > hs)]/nrow(sims_lr99)
So, we can see that there is a ~0.8 probability that the full-sib
index is
larger than the half-sib
index.
We can also easily calculate summary stats for each relatedness category, such as the mean, median, and 2.5% and 97.5% quantiles:
# mean relatedness values per relatedness category sims_lr99[,list(fs_mean = mean(fs), hs_mean = mean(hs))] # distribution summary per relatedness category sims_lr99[,list(quantile = c(0.025, 0.50, 0.975), fs_summ = quantile(fs, p = c(0.025, 0.5, 0.975)), hs_summ = quantile(hs, p = c(0.025, 0.5, 0.975)))]
The Blouin classifier attempts to set a threshold index value to distinguish among different relatedness categories. The classifier specifies that the threshold be point where the probability of an index value given one category is equal to the 1 - the probability of the index value given the other category. This will become clearer below.
The probability of observing a index value that is at most x given that the
relatedness category is full-sib
can be calculated as:
# pick a random threshold threshold_value = 0.40 # the probability that the Lynch and Ritland (1999) index is at most 0.40 # given that the relatedness category is full-sibs for the nancycats # dataset is: sims_lr99[,sum(fs <= threshold_value)] / nrow(sims_lr99) # so, we count how many simulated values under full-sibs relationship # category were 0.4 or less, and divide by the total number of simulations
Now, we ask what is the probability of observing an index value as large as x
or larger given that the relatedness category is half-sibs
:
# the probability that the Lynch and Ritland (1999) index is as large as 0.40 # or larger given that the relatedness category is half-sibs for the nancycats # dataset is: sims_lr99[,sum(hs >= threshold_value)] / nrow(sims_lr99) # so, we count how many simulated values under full-sibs relationship # category were 0.4 or less, and divide by the total number of simulations
As we can see P(x $\le$ 0.4 | cat = fs) > P (x $\ge$ 0.4 | cat = hs) (0.34 > 0.21).
This tells us that this threshold value (0.4) is likely to be too conservative
(i.e., we are mis-classifying too many full-sib
pairs incorrectly as half-sib
pairs). Given that in our hypothetical example we are looking for strong
evidence for half-sibs
in order to determine a minimum level of extra-pair
paternity such a threshold might not be desireable.
The Blouin classifier specifies that the optimal threshold value is where P(X $\le$ x | cat = fs) = P (X $\ge$ x | cat = hs). We can easily build an optmizer function that will give us that value:
blouin_classifier <- function(data, rel_cats) { total_sims = nrow(data) minimize_function <- function(threshold, data, rel_cats, n) { cat1 = rel_cats[1] cat2 = rel_cats[2] p1 = data[,sum(eval(as.symbol(cat1)) <= threshold)] / n p2 = data[,sum(eval(as.symbol(cat2)) >= threshold)] / n return((p1 - p2)^2) } blouin_threshold <- optimise(f = minimize_function, interval = c(0,1), data = data, rel_cats = rel_cats, n = total_sims) return(blouin_threshold$minimum) } blouin_threshold <- blouin_classifier(sims_lr99,rel_cats = c('fs', 'hs')) # checking the value blouin_threshold # ensuring that they are indeed (nearly) equal (to within a reasonable # expected difference given the simulation) sims_lr99[,sum(fs <= blouin_threshold)] / nrow(sims_lr99) sims_lr99[,sum(hs >= blouin_threshold)] / nrow(sims_lr99) all.equal(sims_lr99[,sum(fs <= blouin_threshold)] / nrow(sims_lr99), sims_lr99[,sum(hs >= blouin_threshold)] / nrow(sims_lr99))
We can then plot the simulated values with the estimated densities for the index in question, along with the threshold.
ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) + stat_density(kernel = 'cosine', alpha = 0.5, position = 'dodge') + xlab("Lynch and Ritland (1999) index") + ylab("Density") + scale_fill_discrete(name = "Relatedness categories", labels = c("Full-Sibs", "Half-Sibs")) + theme(legend.justification=c(1,0), legend.position=c(1,0.75)) + geom_vline(xintercept = blouin_threshold, colour = 'red', size = 2)
Given a particular threshold, it is then possible to calculate the mis-classification error:
calc_error_rate <- function(threshold, data, rel_cats) { cat1 <- rel_cats[1] cat2 <- rel_cats[2] total_sims <- nrow(data) p1 <- data[, sum(eval(as.symbol(cat1)) <= threshold)] / total_sims p2 <- data[, sum(eval(as.symbol(cat2)) >= threshold)] / total_sims res <- matrix(rep(0,4), ncol = 2) res[1,1] <- 1 - p1 res[1,2] <- p1 res[2,1] <- p2 res[2,2] <- 1 - p2 row.names(res) <- c(cat1, cat2) colnames(res) <- c(cat1, cat2) return(res) } calc_error_rate(blouin_threshold, data = sims_lr99, rel_cats = c('fs', 'hs')) # because the Blouin classifier finds the point that minimizes the error, the # error rate is symmetric. In other words, it is the same for both relatedness # categories. Thus, with the Blouin classifier, the proportion of full-sibs # that are expected to be correctly classified is ~0.73, and ~0.27 will be # incorrectly classified as half-sibs. # THIS OF COURSE ASSUMES THAT OUR DATASET IS ENTIRELY COMPRISED OF FULL-SIB AND # HALF-SIB PAIRS. # Furthermore, this is an approximate error estimate. The precision of the estimate # can be increased by simulating a larger number of dyads. The accuracy, however, # depends on the number of sampled loci, and how unbiased the estimate of the # population allele frequency really is.
We could decide, for instance, that the error rate is still too large in favour
of full-sibs
being incorrectly classified as half-sibs
. In this case,
one can examine how the error rate changes with the threshold. In the figure
below, we plot the proportion of simulated half-sib
dyads (in black) and
the proportion of full-sib
dyads (in red) correctly classified given a
certain threshold level. The vertical navy blue line give the Blouin threshold.
We can see that the closer we get to a 0
threshold, the more likely that the
number of correctly classified full-sibs
grows to almost 100%. Meanwhile, the
proportion of correctly classified half-sibs
goes to almost zero. At this level,
we would be the most confident that any pairs classified as half-sibs
are
likely to be truly half-sibs
. In this case, we have set the bar for the
evidence required to classify a pair as half-sibs
as very high.
Again, it should be stressed that this assumes that the sample in the dataset is solely comprised of full-sibs and potential half-sibs.
thresholds = seq(0,1,length.out = 200) change_thresholds <- sims_lr99[, list(thresholds, half_sibs = sapply(thresholds, function(th) sum(hs <= th)) / nrow(sims_lr99), full_sibs = sapply(thresholds, function(th) sum(fs >= th)) / nrow(sims_lr99))] ggplot(change_thresholds, aes(thresholds, half_sibs)) + geom_line() + geom_line(aes(thresholds, full_sibs), colour = 'red') + geom_vline(xintercept = blouin_threshold, colour = 'navyblue') + xlab("Thresholds") + ylab("Proportion correctly classified")
Now that we have set on a threshold (e.g., the Blouin threshold). It is easy to classify the observed values into relatedness categories:
# use the cut function to classify pairs using the Lynch and Ritland (1999) index. # the function will give label 'hs' to all pairs that have an LR99avg index # value that is -1 < index <= Blouin Threshold. And, it will label as 'fs' pairs # with LR99_avg index value that are Blouin Threshold < index <= 1. # it automatically adds the results to the estimate data.table, which can then # facilitate comparisons across estimators rel_est_dt[,lr99_class := cut(LR99_avg, breaks = c(-1,blouin_threshold,1), labels = c("hs", "fs"), include.lowest = T)] # tally the number of pairs classified in each class table(rel_est_dt[,lr99_class]) # plot the number of pais classfied in each class ggplot(rel_est_dt,aes(lr99_class)) + geom_bar() + xlab("Pairs classified using the Lynch and Ritland (1999) index using the Blouin threshold.") + ylab("Count")
For completeness, we can add information on for an unrelated
category,
assuming, perhaps more realistically, that our dataset has not only
full-sib
and half-sib
pairs.
# create a unrelated k-vector un_kvector <- c(1.0, 0.0, 0.0) # simulate unrelated dyads, and output relatedness values unrelated <- irelr::sim_rel(data = data, reps = nreps, k_vector = un_kvector) unrelated_dt <- data.table(unrelated) unrelated_dt # add a relatedness category column to be used later unrelated_dt[, rel_cat := rep('un', nreps)] unrelated_dt[, sim_ix := 1:nreps] # create a single data.table with all the simulations sims <- rbindlist(list(half_sibs_dt, full_sibs_dt, unrelated_dt)) sims
ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) + geom_density(alpha = 0.5) + xlab("Lynch and Ritland (1999) index") + ylab("Density") + scale_fill_discrete(name = "Relatedness categories", labels = c("Full-Sibs", "Half-Sibs", "Unrelated")) + theme(legend.justification=c(1,0), legend.position=c(1,0.75))
ggplot(sims, aes(y = LR99_avg, x = rel_cat)) + geom_boxplot() + ylab("Lynch and Ritland (1999) index") + xlab("Relatedness Category")
# transform output to a new data.table that only has Lynch and Ritland 1999 values sims_lr99 <- dcast.data.table(data = sims, formula = sim_ix ~ rel_cat, value.var = 'LR99_avg') sims_lr99
# here, three plots are created comparing each of the three possible # pairs of three relatedness categories taken in groups of two # plot of full-sibs vs half-sibs simulations p1 = ggplot(sims_lr99, aes(x = hs, y = fs)) + geom_point(colour = 'gray50') + ylab("Full-sibs") + xlab("Half-sibs") + xlim(c(-1,1)) + ylim(c(-1,1)) + geom_abline(intercept = 0, slope = 1, size = 2, colour = 'navyblue') + stat_density2d(colour = 'red', n = 40, size = 1.5) + geom_vline(xintercept = 0.25) + geom_hline(yintercept = 0.50) # plot of full-sibs vs unrelated simulations p2 = ggplot(sims_lr99, aes(x = un, y = fs)) + geom_point(colour = 'gray50') + ylab("Full-sibs") + xlab("Unrelated") + xlim(c(-1,1)) + ylim(c(-1,1)) + geom_abline(intercept = 0, slope = 1, size = 2, colour = 'navyblue') + stat_density2d(colour = 'red', n = 40, size = 1.5) + geom_vline(xintercept = 0.0) + geom_hline(yintercept = 0.50) # plot of half-sibs vs unrelated simulations p3 = ggplot(sims_lr99, aes(x = un, y = hs)) + geom_point(colour = 'gray50') + ylab("Half-sibs") + xlab("Unrelated") + xlim(c(-1,1)) + ylim(c(-1,1)) + geom_abline(intercept = 0, slope = 1, size = 2, colour = 'navyblue') + stat_density2d(colour = 'red', n = 40, size = 1.5) + geom_vline(xintercept = 0.0) + geom_hline(yintercept = 0.25) #plotting the the three plots above in a single figure. grid.arrange(p1, p2, p3, ncol = 2)
The probability that full-sib
pair has an index larger than an unrelated
pair is:
sims_lr99[,sum(fs > un)]/nrow(sims_lr99)
The probability that half-sib
pair has an index larger than an unrelated
pair is:
sims_lr99[,sum(hs > un)]/nrow(sims_lr99)
Here, we apply the Blouin classifier function that we created above to identify the two thresholds needed to differentiate the three simulated relatedness categories of interest:
The density plot of the simulated values is then used to visualize the Blouin thresholds.
rel_cats <- c('fs','hs','un') n_thresholds <- 2 blouin_thresholds <- numeric(n_thresholds) for(i in 1:n_thresholds) { blouin_thresholds[i] <- blouin_classifier(sims_lr99, rel_cats = c(rel_cats[i], rel_cats[(i+1)])) } ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) + geom_density(alpha = 0.5) + xlab("Lynch and Ritland (1999) index") + ylab("Density") + scale_fill_discrete(name = "Relatedness categories", labels = c("Full-Sibs", "Half-Sibs", "Unrelated")) + theme(legend.justification=c(1,0), legend.position=c(1,0.75)) + geom_vline(xintercept = blouin_thresholds, colour = 'red', size = 2)
Once the thresholds have been identified, we can use them to classify observed
pairs, and estimate error probabilities. As we can see in the bar plot below,
adding the unrelated
class dramatically changed the number of identified
half-sibs
.
# we first start be creating a vector of thresholds to separate across # categories # because the thresholds need to be from lowest to highest, we use the # sort() function to make sure they are in increasing order. # then, add the bottom (-1) and top (1) thresholds, as relatedness indices # should not be less than -1 or greater than 1. blouin_thresholds <- c(-1, sort(blouin_thresholds), 1) # we then again use the cut function to classify pairs using # the Lynch and Ritland (1999) index --- this can be repeated with any # simulated index. # the function will give label 'hs' to all pairs that have an LR99avg index # value that is -1 < index <= Blouin Threshold. And, it will label as 'fs' pairs # with LR99_avg index value that are Blouin Threshold < index <= 1. # it automatically adds the results to the estimate data.table, which can then # facilitate comparisons across estimators rel_est_dt[,lr99_class := cut(LR99_avg, breaks = blouin_thresholds, labels = c("un","hs", "fs"), include.lowest = T)] # tally the number of pairs in each class table(rel_est_dt[,lr99_class]) # plot the number of pairs in each class ggplot(rel_est_dt,aes(lr99_class)) + geom_bar() + xlab("Pairs classified using the Lynch and Ritland (1999) index using the Blouin threshold.") + ylab("Count")
Estimating the error becomes a little more complex when there are more than two classes. But, we can use the thresholds vector to our advantage here.
calc_error_rate_general <- function(thresholds, data, rel_cats) { n_cats <- length(rel_cats) total_sims <- nrow(data) res = matrix(rep(0,n_cats^2), ncol = n_cats) for (i in 1:n_cats) { class_sims <- data[, cut(eval(as.symbol(rel_cats[i])), breaks = thresholds, labels = rel_cats, include.lowest = T)] res[i,] <- rev(table(class_sims)/total_sims) } rownames(res) <- rel_cats colnames(res) <- rel_cats return(res) } calc_error_rate_general(blouin_thresholds, data = sims_lr99, rel_cats = c('fs', 'hs', 'un')) # the resuls are a table, where the rows indicate the relatedness category # that was simulated, and the columns the relatedness category that the # simulated dyad was classified given the thresholds used. So, about 0.73 # of the full-sib dyads were correctly classified as full-sibs, while a # little over 0.25 were classified as half-sibs, and less than 0.02 were # classified as unrelated.
Again, we might not be satisfied with these rates. For instance, one might be
interested in identifying pairs of dyads that have high support for being
full-sibs
. To do this, we could say that there should be fewer than 10%
false full-sibs
, where as at the moment, close to 28% (0.2691 + 0.0105) of
pairs identified as full-sibs
are not full-sibs
. To do this, we can
modify our threshold calculator function to find the optimal threshold that
minimizes the missclassification error (i.e., that minimizes the number of
pairs simulated in other categories than full-sibs
, but mistakenly classified
as full-sibs
.
# a function to find the threshold that minimizes the error rate of # full-sibs. It takes as input the simulated data, the minimum error rate # the relatedness category of interest, and a list of other categories. find_threshold <- function(data, error_rate, for_cat, other_cats) { total_sims <- nrow(data) minimize_function <- function(threshold, data, error_rate, for_cat, other_cats, n) { n_cats <- length(other_cats) prop_missclass = 0 for (i in 1:n_cats) { class_sims <- data[, cut(eval(as.symbol(other_cats[i])), breaks = c(-1,threshold,1), labels = c(other_cats[i],for_cat), include.lowest = T)] prop_missclass <- prop_missclass + (table(class_sims)/total_sims)[for_cat] } return((prop_missclass - error_rate)^2) } new_threshold <- optimise(f = minimize_function, interval = c(0,1), data = data, error_rate = error_rate, for_cat = for_cat, other_cats = other_cats, n = total_sims) return(new_threshold$minimum) } new_threshold <- find_threshold(sims_lr99, 0.1, 'fs', c('un','hs')) # the new threshold new_threshold ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) + geom_density(alpha = 0.5) + xlab("Lynch and Ritland (1999) index") + ylab("Density") + scale_fill_discrete(name = "Relatedness categories", labels = c("Full-Sibs", "Half-Sibs", "Unrelated")) + theme(legend.justification=c(1,0), legend.position=c(1,0.75)) + geom_vline(xintercept = new_threshold, colour = 'red', size = 2)
Inferences about a pedigree sould be made using the index that minimizes miss-classification errors. The variance of the distribution of an index can be used to compare across indices. As shown below, we can see that different indices have very different properties depending on the relatedness category. And, as we saw above, the same pair can have very different index values across indices. However, the variance is just one aspect of the distribution of indices. The skewness and kurtosis can also give us valuable information.
Ultimately, the index of choice, will depend on what is wanted. If one wants
to identify full-sibs
, then the index should be such that the bias and
variance is smallest for the full-sib
category, but that it is also
relatively symmetrical about the expected relatedness value
(i.e., has skewness close to 0), and is peaked around the expected
relatedness value (i.e., most of the density is on or just around the
expected value, thus, the kurtosis is highly positive).
# create a vector of possible indices indices <- names(sims)[1:8] indices # create a new data.table sims_dt = list() for (i in indices) { sims_trans <- dcast.data.table(data = sims, formula = sim_ix ~ rel_cat, value.var = i) sims_trans[,index := rep(i,nrow(sims_trans))] sims_dt[[i]] <- sims_trans } # bind the individual index lists together sims_dt <- rbindlist(sims_dt) # rescale by the expected index value for each simulated relatedness category sims_dt[, (rel_cats) := list(fs - 0.5, hs - 0.25, un)] sims_dt # set the simulated related categories, and the ones for which there is # an interest rel_cats <- c('fs', 'hs', 'un') focus_cats <- c('fs','hs','un') # examining bias # as we can see, the index that minimizes the bias changes depending on # relationship category sims_bias_dt <- sims_dt[, lapply(.SD, mean), .SDcols = rel_cats, by = index] sims_bias_dt # indices ordered from best to worst under each relatedness category order_bias <- sapply(focus_cats, function(ix) sims_bias_dt[order(abs(eval(as.symbol(ix))), decreasing = F), index]) order_bias # the mean rank for each index across relatedness categories mean_rank_bias <- sort( sapply(indices, function(ix) mean(apply(order_bias, 2, function(col) which(col == ix))))) # examining variance # as we can see, the index that minimizes the variance changes depending on # relationship category sims_var_dt <- sims_dt[, lapply(.SD, var), .SDcols = rel_cats, by = index] sims_var_dt # indices ordered from best to worst under each relatedness category order_var <- sapply(focus_cats, function(ix) sims_var_dt[order(eval(as.symbol(ix)), decreasing = F), index]) order_var # the mean rank for each index across relatedness categories mean_rank_var <- sort( sapply(indices, function(ix) mean(apply(order_var, 2, function(col) which(col == ix))))) # examining skewness # as we can see, the index that minimizes the skewness changes depending on # relationship category # values closer to zero are indicatives of more symmetric distributions sims_skew_dt <- sims_dt[, lapply(.SD, skewness), .SDcols = rel_cats, by = index] sims_skew_dt # indices ordered from best to worst under each relatedness category order_skew <- sapply(focus_cats, function(ix) sims_skew_dt[order(abs(eval(as.symbol(ix))), decreasing = F), index]) order_skew # the mean rank for each index across relatedness categories mean_rank_skew <- sort( sapply( indices, function(ix) mean(apply(order_skew, 2, function(col) which(col == ix))))) # examining kurtosis # as we can see, the index that maximizes kurtosis changes depending on # relationship category # values larger than zero indicate more peaked distributions sims_kurt_dt <- sims_dt[, lapply(.SD, kurtosis), .SDcols = rel_cats, by = index] sims_kurt_dt # indices ordered from best to worst under each relatedness category order_kurt <- sapply(focus_cats, function(ix) sims_kurt_dt[order(eval(as.symbol(ix)), decreasing = T),index]) order_kurt # the mean rank for each index across relatedness categories mean_rank_kurt <- sort( sapply(indices, function(ix) mean(apply(order_kurt,2, function(col) which(col == ix))))) # the mean rank across the four different distribution descriptors sort( sapply(indices, function(ix) mean( c(mean_rank_bias[ix], mean_rank_var[ix], mean_rank_skew[ix], mean_rank_kurt[ix]))))
From the above, we can see that perhaps QG89_rsxy
is the best relatedness
index given our data. In the mean ranking, it is tied with QG89_xy
, but
the latter is not a symmetrical relatedness index, so we prefer the former.
However, none of the relatedness indices was a clear winner, with the top
ranking indices having a mean rank close to 4.
Thus, in addition to QG89_rsxy
, it is might be useful to also examine
W02_cor
, LR99_avg
, and hk08
. These give us a breadth of different
indices to examine the error rates.
Below, we rank the relatedness indices bases on their bias, variance,
skewness, and kurtosis. We then use the Blouin classifier to find
optimal thresholds to separate unrelated
, half-sib
, and full-sib
pairs.
# creating a sims data.table with all simulated values sims_dt = list() for (i in indices) { sims_trans <- dcast.data.table(data = sims, formula = sim_ix ~ rel_cat, value.var = i) sims_trans[,index := rep(i,nrow(sims_trans))] sims_dt[[i]] <- sims_trans } # bind the individual index lists together sims_dt <- rbindlist(sims_dt) # list of indexes to evaluate indices <- indices[c(4,7,5,8)] indices n_indices <- length(indices) rel_cats <- c('fs','hs','un') n_thresholds <- 2 blouin_thresholds <- matrix(0,ncol = n_thresholds, nrow = n_indices) for(i in 1:n_indices) { for(t in 1:n_thresholds) { blouin_thresholds[i,t] <- blouin_classifier( sims_dt[index == indices[i],list(sim_ix, fs, hs, un)], rel_cats = c(rel_cats[t], rel_cats[(t+1)])) } } rownames(blouin_thresholds) <- indices blouin_thresholds sims_dt_long <- data.table:::melt.data.table(sims_dt, id.vars = c("index", "sim_ix")) ggplot(sims_dt_long[index %in% indices,], aes(x = value, fill = variable)) + geom_density(alpha = 0.5) + xlab("Index") + ylab("Density") + scale_fill_discrete(name = "Relatedness categories", labels = c("Full-Sibs", "Half-Sibs", "Unrelated")) + theme(legend.justification=c(1,0), legend.position=c(1,0.75)) + geom_vline(data = data.frame( index = rownames(blouin_thresholds), blouin_thresholds), aes(xintercept = X1, group = index), colour = 'red', size = 2) + geom_vline(data = data.frame( index = rownames(blouin_thresholds), blouin_thresholds), aes(xintercept = X2, group = index), colour = 'red', size = 2) + facet_wrap(facets = ~ index, ncol = 2, scales = 'free')
Now that we have thresholds, we can estimate the error rates for each
index. As we can see from the figure below, hk08
is the most variable index,
whileQG89_rsxy
is the best index for identifying full-sibs
and half-sibs
.
Although, this comes at a slight disadvantage when identifying unrelated
individuals.
# our error function calc_error_rate_general <- function(thresholds, data, rel_cats) { n_cats <- length(rel_cats) total_sims <- nrow(data) res = matrix(rep(0,n_cats^2), ncol = n_cats) for (i in 1:n_cats) { class_sims <- data[, cut(eval(as.symbol(rel_cats[i])), breaks = thresholds, labels = rel_cats, include.lowest = T)] res[i,] <- rev(table(class_sims)/total_sims) } rownames(res) <- rel_cats colnames(res) <- rel_cats return(res) } # adding the limits and re-ordering the thresholds blouin_thresholds <- cbind( rep(-1,nrow(blouin_thresholds)), blouin_thresholds[,2], blouin_thresholds[,1], rep(1,nrow(blouin_thresholds))) error_rates <- lapply(indices, function(ix) calc_error_rate_general(threshold = blouin_thresholds[ix,], data = sims_dt[index == ix,list(sim_ix,fs, hs, un)], rel_cats = c("fs", "hs", "un")) ) names(error_rates) <- indices prop_class <- sapply(error_rates, diag) prop_class <- melt(prop_class) ggplot(prop_class, aes(x = Var1, y = value, fill = Var2)) + geom_bar(stat = 'identity', position = 'dodge') + xlab("Relatedness category") + ylab("Proportion correctly classified") + scale_fill_discrete(name = "Relatedness Index")
parent-offspring
from full-sibs
Separating parent-offspring
pairs from full-sibs
is not generally possible
with relatedness indices only. At the moment, the best option is to examine
patterns of proportion of loci with at least one shared allele.
More on this topic is forthcoming
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.