In andersgs/irelr: Pedigree Inference using Molecular Marker Data

require(adegenet)
require(irelr)
require(data.table)
require(moments)
require(reshape2)
require(ggplot2)
require(gridExtra)

Introduction

irelr is a package that implements the a Monte Carlo approach for reconstructing pedigrees from genetic data information. It assumes that unbiased population allele frequency data is available. In some cases, it is possible to obtain this information from the avaialble genetic data --- however, this assumes that the sample consists mainly of unrelated individuals. If your sample consists of many related individuals, then independently derived population allele frequency data is necessary for irelr to be useful.

The data

Data format

The most straight forward way of getting under way in using irelr is by having the data in a standard genepop format, with alleles encoded with 3 digits. The data format specifications can be found here.

Ultimately, the input format is a genind object from the R package adegenet (by Thibaud Jombart et al. link. Information on how to convert several data formats into a genind object can be found here.

irelr assumes a single population.

Below are the first 20 lines of the nancycats dataset available in the package adegenet. Once the adegenet package is loaded, it is possible to find out more about the data set by typing ?nancycats in R's command prompt.

cat(readLines(system.file("files/nancycats.gen",package="adegenet"), 
              n = 20), 
    file = '', 
    sep = "\n")

Loading the data

If the data is in a genepop format, it can be easily loaded by using the function read.genepop() from the adegenet package (note that the file must have the .gen extention):

# set file path location. substitute this for the actual path of your file
# note the need for a .gen extention. otherwise, an error will be generated.
file_path <- system.file("files/nancycats.gen", package = "adegenet")
# having a look at the file path
file_path
# loading the data into a variable called data
# NOTE: Here, I am using a version of read.genepop that is supplied with
# irelr. It is a direct copy of the function found in the development version 
# of adegenet. The current version of adegenet, found on CRAN, does not
# parse missing data correctly.
data <- irelr::read.genepop(file = file_path, quiet = T)
# having a look at data
# notice that data is an S4 class of type 'genind'
# it includes 237 individuals, genotyped at 9 loci
# there are 108 alleles across the genotyped loci
data

# number of indivduals
n_ind <- adegenet::nInd(data)
n_ind

# number of loci
n_loc <- adegenet::nLoc(data)
n_loc

# number of alleles per locus
n_all <- adegenet::nAll(data)
n_all

# total number of alleles
sum(n_all)

# total number of populations
# irel will ignore the population division and attempt to estimate the 
# pedigree of all individuals in the data file
n_pop <- adegenet::nPop(data)
n_pop

# irelr also assumes that individuals have unique identifiers,
# if your genepop does not have unique identifiers when you 
# load your file, a warning will be issued telling you that
# there individual idenfiers are NOT unique, and that
# row IDs will be used instead. Row IDs will be the 
# row number. Therefore, the individuals will named with sequential numbers
# from 1 to total number of individuals
# this is the case for the nancycats data
adegenet::indNames(data)

# this can be fixed by changing the data in the data file so that each 
# individual has a unique identifier. 
# alternatively, a list of names can be manually created and assigned to the 
# the 'genind' object:
data(nancycats, package = "adegenet")
ind_identifiers <- adegenet::indNames(nancycats)
ind_identifiers
# the easiest way to do this would be to prepare a list of individual
# in the same order as they appear in genpop file using Excel. The
# names should be in a single column of the spreadsheet.
# save the list as a Tab delimited file. Once you have do so, 
# you should have a file that looks like this:
# ind1
# ind2
# ...
# indN
#
# you can then load your own list of identifiers in the following way
#ind_identifiers <- read.table(file = "/path/to/file/identifier.txt", header = F, quote = F, row.names = F, stringsAsFactors = F)
# you can then run the following code:
adegenet::indNames(data) <- unlist(ind_identifiers)
adegenet::indNames(data)

# however, by the far the easiest way would be to code your individual names
# into the genpop file.

Estimating relatedness indices

The `sim_rel()` function

Once the data is loaded, we can use the function estimate_rel() to estimate relatedness indices for all pairs of individuals in the data set.

# estimate relatedness indices
rel_estimates <- irelr::estimate_rel(data = data)
# create a data.table with the results
# this will make it easier down the line to parse out individual indices
rel_est_dt  <- data.table(rel_estimates)
rel_est_dt

The output of the estimate_rel() function is a data.frame, which we transform in to a data.table (this requires the data.table package). The output includes to columns --- ind1 and ind2 --- that indicate for which two individuals the indices in the row refer to.

Eight indices are currently calculated:

Four forms of the Queller and Goodnight (1989) index:
- QG89_xy: Assymetrical form in which individual 1 is the reference (denominator), and individual 2 is the numerator.
- QG89_yx: same as above, but now individual 2 is the reference.
- QG89_avg: mean of the two above estimators.
- QG89_rsxy: ratio of the sum of the numerators and the sum of the denominators of the first two indices above.
One form of the Lynch and Ritland (1999) estimator:
- LR99_avg: The mean of the for each individual as the reference in turn.
Two forms of the Wang (2002) estimator:
- W02_cor: The corrected estimator.
- W02_unc: The uncorrected estimator.
One form of the Heg and Konovalov (2008) estimator:
- HK08: Unlike the others, is based on population heterozygosity and not allele frequencies.

In addition, two other relevant pieces of information are outputted:

prop_alleles_shared: proportiono of alleles shared between the dyad. Thus, for instance, if the dyad are typed at 7 loci, there are 14 potential alleles to be shared.
prop_loci: proportion of loci that share alleles. For instance, if a dyad is typed at 7 loci, and it shares alleles at 5, then this variable is 5/7.

Finally, the function also reports the number of missing loci (n_missing_loci), which are the number of loci that did not have an allele call for at least one of the pair. Thus, relatedness indices are only calculated based on loci that are genotyped for both individuals in the dyad. The loci with missing genotype calls for a particular dyad are listed in the missing_loci column. If multiple loci are missing, they are separated by a "." (e.g., loc1.loc2, would mean that loc1 and loc2 are missing). This will be used later to assign classification error rates based on the available data for a dyad.

Plotting the results

Distribution of individual indices

The distribution of particular index can be displayed as follows:

ggplot(rel_est_dt, aes(x = LR99_avg)) +
  geom_histogram() +
  xlab("Lynch and Ritland (1999) index") +
  ylab("Count")

We can divide up the histgram by missing loci (below). Here, the top left panel shows the distribution of indices for pairs that had complete datasets. Otherwise, the panel label indicates the missing loci. As we can see, we have one pair that is missing loci fca8, fca23, fca37, and fca77 (bottom row, second panel from the left).

ggplot(rel_est_dt, aes(x = LR99_avg)) +
  geom_histogram() +
  xlab("Lynch and Ritland (1999) index") +
  ylab("Count") +
  facet_wrap(facets = ~ missing_loci, ncol = 4, scale = 'free')

Joint distribution of indices

We can easily examine the joint distribution of two indices as well (below). As we can see, two estimators can often disagree substantially in their estimates. Their are pairs of individuals with LR99_avg values that are over 0.5 (indicating high relatedness), for which their W02_cor value is very close to 0 (indicating low relatedness). While we have not carried out any experiments to test this, this suggests there might be value in using information from multiple indices when reconstructing pedigrees.

ggplot(rel_est_dt, aes(x = LR99_avg, y = W02_cor)) +
  geom_point() +
  geom_density2d(colour = "red") +
  geom_abline(slope = 1, intercept = 0, size = 2, colour = 'navyblue') +
  xlim(c(-1,1)) +
  ylim(c(-1,1)) +
  xlab("Lynch and Ritland (1999) index") +
  ylab("Wang (2002) corrected index")

Simulating data to classify pairs

In irelr one can simulate any relatedness category by simply figuring out the k vector. The k vector is a vector of 3 numbers, that must sum to 1.0. Each value gives the proportion of loci for which a pair of individuals share 0, 1, and 2 ancestors. Thus, for a pair of unrelated individuals, there 0.0 loci for which they share 2 ancestors, 0.0 for which they share 1 ancestor, and 1.0 loci for which they share 0 ancestors. Thus, the k-vector is specified as c(1.0, 0.0, 0.0).

The k-vectors for the most usual relatedness categoris are:

Unrelated: c(1.0, 0.0, 0.0)
Half-sibs: c(0.5, 0.5, 0.0)
Full-sibs: c(0.25, 0.50, 0.25)
Parent-offspring: c(0.0, 1.0, 0.0)

The expected relatedness value for each category can be estimated by the weighted average of number of ancestors by k-value, normalized by the total number of possible ancestors (2). Thus, for parent-offspring dyads:

$$ po = 0.5 \times (0 \times 0.0 + 1 \times 1.0 + 2 \times 0.0) $$

Differentiating between full-sibs and half-sibs

A common case where reconstructing the pedigree of pairs of individuals might come in handy, is to differentiate between half-sibs and full-sibs in samples taken from different bird nests. This could be used to get an estimate of the number of nests with extra-pair chicks, and thus give an estimate of number of extra-pair matings in a season for a particular bird species. Here, we will just focus on how to classify the data from the nancycats dataset into half-sibs and full-sibs, but if one new which pairs belonged to each nest, it would be possible to classify the whole dataset, and then examine pairs by nest grouping.

The `sim_rel()` function

Below, we demonstrate the use of the irelr function sim_rel() to generate 10,000 half-sib dyads and 10,000 full-sib dyads, and calculate for each simulated dyad the relatedness estimators defined above.

Note: Here, irelr uses estimates of allele frequency derived directly from the dataset. As we see above for the Lynch and Ritland (1999) index, the great majority of pairs have an index close to 0. Therefore, it is reasonable to assume that the majority of pairs are unrelated, and we can obtain an estimate of population allele frequencies that is reasonably unbiased by a large proportion of highly related pairs. Eventually, irelr will allow for inputting user defined allele frequencies that are independently derived from the dataset. These are expected to generate more a robust pedigree reconstruction.

# set number of dyads to simulate
nreps <- 10000

# create a half-sib k-vector
hs_kvector <- c(0.5, 0.5, 0.0)

# simulate half-sib dyads, and output relatedness values
half_sibs <- irelr::sim_rel(data = data, reps = nreps, k_vector = hs_kvector)
half_sibs_dt  <- data.table::data.table(half_sibs)
half_sibs_dt

# add a relatedness category column to be used later
half_sibs_dt[, rel_cat := rep('hs', nreps)]
half_sibs_dt[, sim_ix := 1:nreps]

# create a full-sib k-vector
fs_kvector <- c(0.25, 0.5, 0.25)

# simulate full-sib dyads, and output relatedness values
full_sibs <- irelr::sim_rel(data = data, reps = nreps, k_vector = fs_kvector)
full_sibs_dt  <- data.table::data.table(full_sibs)
full_sibs_dt

# add a relatedness category column to be used later
full_sibs_dt[, rel_cat := rep('fs', nreps)]
full_sibs_dt[, sim_ix := 1:nreps]

# create a single data.table with all the simulations
sims <- rbindlist(list(half_sibs_dt, full_sibs_dt))
sims

Examining the simulated values for a single index

Below, we demonstrate how to examine the distribution of the Lynch and Ritland (1999) index for both simulated relatedness categories. Two options for visualisation are shown, a density plot and a boxplot option. As we can see, there is substantial overlap in the distribution of these two indices. This is expected to contribute to a high mis-classification rate.

ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) +
  geom_density(alpha = 0.5) +
  xlab("Lynch and Ritland (1999) index") +
  ylab("Density") +
  scale_fill_discrete(name = "Relatedness categories", 
                      labels = c("Full-Sibs", "Half-Sibs")) +
  theme(legend.justification=c(1,0), legend.position=c(1,0.75))

ggplot(sims, aes(y = LR99_avg, x = rel_cat)) +
  geom_boxplot() +
  ylab("Lynch and Ritland (1999) index") +
  xlab("Relatedness Category")

The joint distribution of indices of different relatedness categories

Below, we plot the joint distribution of index values for the full-sibs (y axis) and half-sibs (x axis). The bold navy blue line is the equal index value line. The red contour lines indicate the density distribution, and the grey points are the simulated values. The vertical and horizontal black lines indicate the expected relatedness index value for the half-sib and full-sib categories, respectively. The figure demonstrates that full-sibs are expected to have a larger index value than half-sibs, which is what one woud expect. It also shows that there is a non-trivial proportion of half-sibs that have a larger value than full-sib pair.

# transform output to a new data.table that only has Lynch and Ritland 1999 values
sims_lr99 <- dcast.data.table(data = sims, 
                              formula = sim_ix ~ rel_cat, 
                              value.var = 'LR99_avg')
sims_lr99

ggplot(sims_lr99, aes(x = hs, y = fs)) +
  geom_point(colour = 'gray50') +
  ylab("Full-sibs") +
  xlab("Half-sibs") +
  xlim(c(-1,1)) + 
  ylim(c(-1,1)) +
  geom_abline(intercept = 0, slope = 1, size = 2, colour = 'navyblue') +
  stat_density2d(colour = 'red', n = 40, size = 1.5) +
  geom_vline(xintercept = 0.25) +
  geom_hline(yintercept = 0.50)

We can calculate the probability that the index value for a full-sib dyad is smaller than the index of a half-sib dyad by estimating the proportion of simulations for which $LR99_{fs} > LR99_{hs}$:

sims_lr99[,sum(fs > hs)]/nrow(sims_lr99)

So, we can see that there is a ~0.8 probability that the full-sib index is larger than the half-sib index.

We can also easily calculate summary stats for each relatedness category, such as the mean, median, and 2.5% and 97.5% quantiles:

# mean relatedness values per relatedness category
sims_lr99[,list(fs_mean = mean(fs), hs_mean = mean(hs))]

# distribution summary per relatedness category
sims_lr99[,list(quantile = c(0.025, 0.50, 0.975),
                fs_summ = quantile(fs, p = c(0.025, 0.5, 0.975)), 
                hs_summ = quantile(hs, p = c(0.025, 0.5, 0.975)))]

The Blouin classifier

The Blouin classifier attempts to set a threshold index value to distinguish among different relatedness categories. The classifier specifies that the threshold be point where the probability of an index value given one category is equal to the 1 - the probability of the index value given the other category. This will become clearer below.

The probability of observing a index value that is at most x given that the relatedness category is full-sib can be calculated as:

# pick a random threshold
threshold_value = 0.40

# the probability that the Lynch and Ritland (1999) index is at most 0.40 
# given that the relatedness category is full-sibs for the nancycats 
# dataset is:

sims_lr99[,sum(fs <= threshold_value)] / nrow(sims_lr99)

# so, we count how many simulated values under full-sibs relationship 
# category were 0.4 or less, and divide by the total number of simulations

Now, we ask what is the probability of observing an index value as large as x or larger given that the relatedness category is half-sibs:

# the probability that the Lynch and Ritland (1999) index is as large as 0.40 
# or larger given that the relatedness category is half-sibs for the nancycats 
# dataset is:

sims_lr99[,sum(hs >= threshold_value)] / nrow(sims_lr99)

# so, we count how many simulated values under full-sibs relationship 
# category were 0.4 or less, and divide by the total number of simulations

As we can see P(x $\le$ 0.4 | cat = fs) > P (x $\ge$ 0.4 | cat = hs) (0.34 > 0.21). This tells us that this threshold value (0.4) is likely to be too conservative (i.e., we are mis-classifying too many full-sib pairs incorrectly as half-sib pairs). Given that in our hypothetical example we are looking for strong evidence for half-sibs in order to determine a minimum level of extra-pair paternity such a threshold might not be desireable.

The Blouin classifier specifies that the optimal threshold value is where P(X $\le$ x | cat = fs) = P (X $\ge$ x | cat = hs). We can easily build an optmizer function that will give us that value:

blouin_classifier <- function(data, rel_cats) {
  total_sims = nrow(data)
  minimize_function <- function(threshold, data, rel_cats, n) {
    cat1 = rel_cats[1]
    cat2 = rel_cats[2]
    p1 = data[,sum(eval(as.symbol(cat1)) <= threshold)] / n
    p2 = data[,sum(eval(as.symbol(cat2)) >= threshold)] / n
    return((p1 - p2)^2)
  }
  blouin_threshold  <- optimise(f = minimize_function, 
           interval = c(0,1), 
           data = data, 
           rel_cats = rel_cats, 
           n = total_sims)
  return(blouin_threshold$minimum)
}
blouin_threshold <- blouin_classifier(sims_lr99,rel_cats = c('fs', 'hs'))
# checking the value
blouin_threshold

# ensuring that they are indeed (nearly) equal (to within a reasonable
# expected difference given the simulation)
sims_lr99[,sum(fs <= blouin_threshold)] / nrow(sims_lr99)
sims_lr99[,sum(hs >= blouin_threshold)] / nrow(sims_lr99)

all.equal(sims_lr99[,sum(fs <= blouin_threshold)] / nrow(sims_lr99),
  sims_lr99[,sum(hs >= blouin_threshold)] / nrow(sims_lr99))

We can then plot the simulated values with the estimated densities for the index in question, along with the threshold.

ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) +
  stat_density(kernel = 'cosine', alpha = 0.5, position = 'dodge') +
  xlab("Lynch and Ritland (1999) index") +
  ylab("Density") +
  scale_fill_discrete(name = "Relatedness categories", 
                      labels = c("Full-Sibs", "Half-Sibs")) +
  theme(legend.justification=c(1,0), legend.position=c(1,0.75)) +
  geom_vline(xintercept = blouin_threshold, colour = 'red', size = 2)

Calculate misclassification errors

Given a particular threshold, it is then possible to calculate the mis-classification error:

calc_error_rate <- function(threshold, data, rel_cats) {
  cat1 <- rel_cats[1]
  cat2 <- rel_cats[2]
  total_sims <- nrow(data)
  p1 <- data[, sum(eval(as.symbol(cat1)) <= threshold)] / total_sims
  p2 <- data[, sum(eval(as.symbol(cat2)) >= threshold)] / total_sims
  res <- matrix(rep(0,4), ncol = 2)
  res[1,1] <- 1 - p1
  res[1,2] <- p1
  res[2,1] <- p2
  res[2,2] <- 1 - p2
  row.names(res) <- c(cat1, cat2)
  colnames(res) <- c(cat1, cat2)
  return(res)
}
calc_error_rate(blouin_threshold, data = sims_lr99, rel_cats = c('fs', 'hs'))
# because the Blouin classifier finds the point that minimizes the error, the 
# error rate is symmetric. In other words, it is the same for both relatedness
# categories. Thus, with the Blouin classifier, the proportion of full-sibs 
# that are expected to be correctly classified is ~0.73, and ~0.27 will be 
# incorrectly classified as half-sibs. 
# THIS OF COURSE ASSUMES THAT OUR DATASET IS ENTIRELY COMPRISED OF FULL-SIB AND
# HALF-SIB PAIRS.
# Furthermore, this is an approximate error estimate. The precision of the estimate
# can be increased by simulating a larger number of dyads. The accuracy, however,
# depends on the number of sampled loci, and how unbiased the estimate of the 
# population allele frequency really is.

We could decide, for instance, that the error rate is still too large in favour of full-sibs being incorrectly classified as half-sibs. In this case, one can examine how the error rate changes with the threshold. In the figure below, we plot the proportion of simulated half-sib dyads (in black) and the proportion of full-sib dyads (in red) correctly classified given a certain threshold level. The vertical navy blue line give the Blouin threshold. We can see that the closer we get to a 0 threshold, the more likely that the number of correctly classified full-sibs grows to almost 100%. Meanwhile, the proportion of correctly classified half-sibs goes to almost zero. At this level, we would be the most confident that any pairs classified as half-sibs are likely to be truly half-sibs. In this case, we have set the bar for the evidence required to classify a pair as half-sibs as very high.

Again, it should be stressed that this assumes that the sample in the dataset is solely comprised of full-sibs and potential half-sibs.

thresholds = seq(0,1,length.out = 200)
change_thresholds <- sims_lr99[,
                               list(thresholds, 
                                    half_sibs = sapply(thresholds, 
                                                       function(th) sum(hs <= th)) / nrow(sims_lr99), 
                                    full_sibs = sapply(thresholds, 
                                                       function(th) sum(fs >= th)) / nrow(sims_lr99))]
ggplot(change_thresholds, aes(thresholds, half_sibs)) + 
  geom_line() + 
  geom_line(aes(thresholds, full_sibs), colour = 'red') +
  geom_vline(xintercept = blouin_threshold, colour = 'navyblue') + 
  xlab("Thresholds") +
  ylab("Proportion correctly classified")

Classifying individuals

Now that we have set on a threshold (e.g., the Blouin threshold). It is easy to classify the observed values into relatedness categories:

# use the cut function to classify pairs using the Lynch and Ritland (1999) index.
# the function will give label 'hs' to all pairs that have an LR99avg index
# value that is -1 < index <= Blouin Threshold. And, it will label as 'fs' pairs
# with LR99_avg index value that are Blouin Threshold < index <= 1.
# it automatically adds the results to the estimate data.table, which can then
# facilitate comparisons across estimators
rel_est_dt[,lr99_class := cut(LR99_avg, 
                              breaks = c(-1,blouin_threshold,1), 
                              labels = c("hs", "fs"), 
                              include.lowest = T)]

# tally the number of pairs classified in each class
table(rel_est_dt[,lr99_class])

# plot the number of pais classfied in each class
ggplot(rel_est_dt,aes(lr99_class)) + 
  geom_bar() +
  xlab("Pairs classified using the Lynch and Ritland (1999) index using the Blouin threshold.") +
  ylab("Count")

Adding an unrelated category

For completeness, we can add information on for an unrelated category, assuming, perhaps more realistically, that our dataset has not only full-sib and half-sib pairs.

Simulate unrelated individuals

# create a unrelated k-vector
un_kvector <- c(1.0, 0.0, 0.0)

# simulate unrelated dyads, and output relatedness values
unrelated <- irelr::sim_rel(data = data, reps = nreps, k_vector = un_kvector)
unrelated_dt  <- data.table(unrelated)
unrelated_dt

# add a relatedness category column to be used later
unrelated_dt[, rel_cat := rep('un', nreps)]
unrelated_dt[, sim_ix := 1:nreps]

# create a single data.table with all the simulations
sims <- rbindlist(list(half_sibs_dt, full_sibs_dt, unrelated_dt))
sims

Plotting the distribution of relatedness values

ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) +
  geom_density(alpha = 0.5) +
  xlab("Lynch and Ritland (1999) index") +
  ylab("Density") +
  scale_fill_discrete(name = "Relatedness categories", 
                      labels = c("Full-Sibs", "Half-Sibs", "Unrelated")) +
  theme(legend.justification=c(1,0), legend.position=c(1,0.75))

ggplot(sims, aes(y = LR99_avg, x = rel_cat)) +
  geom_boxplot() +
  ylab("Lynch and Ritland (1999) index") +
  xlab("Relatedness Category")

Checking the joint distribution

# transform output to a new data.table that only has Lynch and Ritland 1999 values
sims_lr99 <- dcast.data.table(data = sims, 
                              formula = sim_ix ~ rel_cat, 
                              value.var = 'LR99_avg')
sims_lr99

# here, three plots are created comparing each of the three possible 
# pairs of three relatedness categories taken in groups of two

# plot of full-sibs vs half-sibs simulations
p1 = ggplot(sims_lr99, aes(x = hs, y = fs)) +
  geom_point(colour = 'gray50') +
  ylab("Full-sibs") +
  xlab("Half-sibs") +
  xlim(c(-1,1)) + 
  ylim(c(-1,1)) +
  geom_abline(intercept = 0, slope = 1, size = 2, colour = 'navyblue') +
  stat_density2d(colour = 'red', n = 40, size = 1.5) +
  geom_vline(xintercept = 0.25) +
  geom_hline(yintercept = 0.50)

# plot of full-sibs vs unrelated simulations
p2 = ggplot(sims_lr99, aes(x = un, y = fs)) +
  geom_point(colour = 'gray50') +
  ylab("Full-sibs") +
  xlab("Unrelated") +
  xlim(c(-1,1)) + 
  ylim(c(-1,1)) +
  geom_abline(intercept = 0, slope = 1, size = 2, colour = 'navyblue') +
  stat_density2d(colour = 'red', n = 40, size = 1.5) +
  geom_vline(xintercept = 0.0) +
  geom_hline(yintercept = 0.50)

# plot of half-sibs vs unrelated simulations
p3 = ggplot(sims_lr99, aes(x = un, y = hs)) +
  geom_point(colour = 'gray50') +
  ylab("Half-sibs") +
  xlab("Unrelated") +
  xlim(c(-1,1)) + 
  ylim(c(-1,1)) +
  geom_abline(intercept = 0, slope = 1, size = 2, colour = 'navyblue') +
  stat_density2d(colour = 'red', n = 40, size = 1.5) +
  geom_vline(xintercept = 0.0) +
  geom_hline(yintercept = 0.25)

#plotting the the three plots above in a single figure.
grid.arrange(p1, p2, p3, ncol = 2)

The probability that full-sib pair has an index larger than an unrelated pair is:

sims_lr99[,sum(fs > un)]/nrow(sims_lr99)

The probability that half-sib pair has an index larger than an unrelated pair is:

sims_lr99[,sum(hs > un)]/nrow(sims_lr99)

Classify pairs using the Blouin classifier

Here, we apply the Blouin classifier function that we created above to identify the two thresholds needed to differentiate the three simulated relatedness categories of interest:

fs --- hs threshold
hs --- unrelated threshold

The density plot of the simulated values is then used to visualize the Blouin thresholds.

rel_cats <- c('fs','hs','un')
n_thresholds <- 2
blouin_thresholds <- numeric(n_thresholds)
for(i in 1:n_thresholds) {
  blouin_thresholds[i] <- blouin_classifier(sims_lr99,
                                            rel_cats = c(rel_cats[i], 
                                                         rel_cats[(i+1)]))
}

ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) +
  geom_density(alpha = 0.5) +
  xlab("Lynch and Ritland (1999) index") +
  ylab("Density") +
  scale_fill_discrete(name = "Relatedness categories", 
                      labels = c("Full-Sibs", "Half-Sibs", "Unrelated")) +
  theme(legend.justification=c(1,0), legend.position=c(1,0.75)) +
  geom_vline(xintercept = blouin_thresholds, colour = 'red', size = 2)

Once the thresholds have been identified, we can use them to classify observed pairs, and estimate error probabilities. As we can see in the bar plot below, adding the unrelated class dramatically changed the number of identified half-sibs.

# we first start be creating a vector of thresholds to separate across 
# categories
# because the thresholds need to be from lowest to highest, we use the 
# sort() function to make sure they are in increasing order. 
# then, add the bottom (-1) and top (1) thresholds, as relatedness indices
# should not be less than -1 or greater than 1.
blouin_thresholds <- c(-1, sort(blouin_thresholds), 1)

# we then again use the cut function to classify pairs using 
# the Lynch and Ritland (1999) index --- this can be repeated with any 
# simulated index.
# the function will give label 'hs' to all pairs that have an LR99avg index
# value that is -1 < index <= Blouin Threshold. And, it will label as 'fs' pairs
# with LR99_avg index value that are Blouin Threshold < index <= 1.
# it automatically adds the results to the estimate data.table, which can then
# facilitate comparisons across estimators
rel_est_dt[,lr99_class := cut(LR99_avg, 
                              breaks = blouin_thresholds, 
                              labels = c("un","hs", "fs"), 
                              include.lowest = T)]

# tally the number of pairs in each class
table(rel_est_dt[,lr99_class])

# plot the number of pairs in each class
ggplot(rel_est_dt,aes(lr99_class)) + 
  geom_bar() +
  xlab("Pairs classified using the Lynch and Ritland (1999) index using the Blouin threshold.") +
  ylab("Count")

Estimating the error becomes a little more complex when there are more than two classes. But, we can use the thresholds vector to our advantage here.

calc_error_rate_general <- function(thresholds, data, rel_cats) {
  n_cats <- length(rel_cats)
  total_sims <- nrow(data)
  res = matrix(rep(0,n_cats^2), ncol = n_cats)
  for (i in 1:n_cats) {
    class_sims <- data[, cut(eval(as.symbol(rel_cats[i])), 
               breaks = thresholds, 
               labels = rel_cats, 
               include.lowest = T)]
    res[i,] <- rev(table(class_sims)/total_sims)
  }
  rownames(res) <- rel_cats
  colnames(res) <- rel_cats
  return(res)
}

calc_error_rate_general(blouin_thresholds, 
                        data = sims_lr99, 
                        rel_cats = c('fs', 'hs', 'un'))
# the resuls are a table, where the rows indicate the relatedness category
# that was simulated, and the columns the relatedness category that the 
# simulated dyad was classified given the thresholds used. So, about 0.73
# of the full-sib dyads were correctly classified as full-sibs, while a 
# little over 0.25 were classified as half-sibs, and less than 0.02 were 
# classified as unrelated.

Again, we might not be satisfied with these rates. For instance, one might be interested in identifying pairs of dyads that have high support for being full-sibs. To do this, we could say that there should be fewer than 10% false full-sibs, where as at the moment, close to 28% (0.2691 + 0.0105) of pairs identified as full-sibs are not full-sibs. To do this, we can modify our threshold calculator function to find the optimal threshold that minimizes the missclassification error (i.e., that minimizes the number of pairs simulated in other categories than full-sibs, but mistakenly classified as full-sibs.

# a function to find the threshold that minimizes the error rate of 
# full-sibs. It takes as input the simulated data, the minimum error rate
# the relatedness category of interest, and a list of other categories.
find_threshold <- function(data, error_rate, for_cat, other_cats) {
  total_sims <- nrow(data)
  minimize_function <- function(threshold, 
                                data, 
                                error_rate, 
                                for_cat, 
                                other_cats, 
                                n) {
    n_cats <- length(other_cats)
    prop_missclass = 0
    for (i in 1:n_cats) {
      class_sims <- data[, cut(eval(as.symbol(other_cats[i])), 
                 breaks = c(-1,threshold,1), 
                 labels = c(other_cats[i],for_cat), 
                 include.lowest = T)]
      prop_missclass <- prop_missclass + 
        (table(class_sims)/total_sims)[for_cat]
    }
    return((prop_missclass - error_rate)^2)
  }
  new_threshold  <- optimise(f = minimize_function, 
           interval = c(0,1), 
           data = data, 
           error_rate = error_rate,
           for_cat = for_cat,
           other_cats = other_cats,
           n = total_sims)
  return(new_threshold$minimum)
}

new_threshold <- find_threshold(sims_lr99, 0.1, 'fs', c('un','hs'))

# the new threshold
new_threshold

ggplot(sims, aes(x = LR99_avg, fill = rel_cat)) +
  geom_density(alpha = 0.5) +
  xlab("Lynch and Ritland (1999) index") +
  ylab("Density") +
  scale_fill_discrete(name = "Relatedness categories", 
                      labels = c("Full-Sibs", "Half-Sibs", "Unrelated")) +
  theme(legend.justification=c(1,0), legend.position=c(1,0.75)) +
  geom_vline(xintercept = new_threshold, colour = 'red', size = 2)

Comparing indices

Inferences about a pedigree sould be made using the index that minimizes miss-classification errors. The variance of the distribution of an index can be used to compare across indices. As shown below, we can see that different indices have very different properties depending on the relatedness category. And, as we saw above, the same pair can have very different index values across indices. However, the variance is just one aspect of the distribution of indices. The skewness and kurtosis can also give us valuable information.

Ultimately, the index of choice, will depend on what is wanted. If one wants to identify full-sibs, then the index should be such that the bias and variance is smallest for the full-sib category, but that it is also relatively symmetrical about the expected relatedness value (i.e., has skewness close to 0), and is peaked around the expected relatedness value (i.e., most of the density is on or just around the expected value, thus, the kurtosis is highly positive).

# create a vector of possible indices
indices <- names(sims)[1:8]
indices

# create a new data.table
sims_dt = list()
for (i in indices) {
  sims_trans <- dcast.data.table(data = sims, 
                                 formula = sim_ix ~ rel_cat, 
                                 value.var = i)
  sims_trans[,index := rep(i,nrow(sims_trans))]
  sims_dt[[i]] <- sims_trans
}

# bind the individual index lists together
sims_dt <- rbindlist(sims_dt)

# rescale by the expected index value for each simulated relatedness category
sims_dt[, (rel_cats) := list(fs - 0.5, hs - 0.25, un)]
sims_dt
# set the simulated related categories, and the ones for which there is 
# an interest
rel_cats <- c('fs', 'hs', 'un')
focus_cats <- c('fs','hs','un')

# examining bias
# as we can see, the index that minimizes the bias changes depending on 
# relationship category
sims_bias_dt <- sims_dt[,
                        lapply(.SD, mean), 
                        .SDcols = rel_cats, 
                        by = index]
sims_bias_dt
# indices ordered from best to worst under each relatedness category
order_bias <- sapply(focus_cats, 
                     function(ix) sims_bias_dt[order(abs(eval(as.symbol(ix))), 
                                                     decreasing = F), 
                                               index])
order_bias
# the mean rank for each index across relatedness categories
mean_rank_bias <- sort(
  sapply(indices, 
         function(ix) mean(apply(order_bias, 2, 
                                 function(col) which(col == ix)))))

# examining variance
# as we can see, the index that minimizes the variance changes depending on 
# relationship category
sims_var_dt <- sims_dt[,
                       lapply(.SD, var), 
                       .SDcols = rel_cats, 
                       by = index]
sims_var_dt
# indices ordered from best to worst under each relatedness category
order_var <- sapply(focus_cats, 
                    function(ix) 
                      sims_var_dt[order(eval(as.symbol(ix)), 
                                        decreasing = F),
                                  index])
order_var
# the mean rank for each index across relatedness categories
mean_rank_var <- sort(
  sapply(indices,
         function(ix) mean(apply(order_var, 2, 
                                 function(col) which(col == ix)))))

# examining skewness
# as we can see, the index that minimizes the skewness changes depending on 
# relationship category
# values closer to zero are indicatives of more symmetric distributions
sims_skew_dt <- sims_dt[,
                        lapply(.SD, skewness), 
                        .SDcols = rel_cats, 
                        by = index]
sims_skew_dt
# indices ordered from best to worst under each relatedness category
order_skew <- sapply(focus_cats, 
                     function(ix) 
                       sims_skew_dt[order(abs(eval(as.symbol(ix))), 
                                          decreasing = F), 
                                    index])
order_skew
# the mean rank for each index across relatedness categories
mean_rank_skew <- sort(
  sapply(
    indices, 
    function(ix) 
      mean(apply(order_skew, 2, 
                            function(col) 
                              which(col == ix)))))

# examining kurtosis
# as we can see, the index that maximizes kurtosis changes depending on 
# relationship category
# values larger than zero indicate more peaked distributions
sims_kurt_dt <- sims_dt[,
                        lapply(.SD, kurtosis), 
                        .SDcols = rel_cats, 
                        by = index]
sims_kurt_dt
# indices ordered from best to worst under each relatedness category
order_kurt <- sapply(focus_cats, 
                     function(ix) 
                       sims_kurt_dt[order(eval(as.symbol(ix)), 
                                          decreasing = T),index])
order_kurt
# the mean rank for each index across relatedness categories
mean_rank_kurt <- sort(
  sapply(indices, 
         function(ix) 
           mean(apply(order_kurt,2, function(col) which(col == ix)))))

# the mean rank across the four different distribution descriptors
sort(
  sapply(indices, 
         function(ix) 
           mean(
             c(mean_rank_bias[ix], 
               mean_rank_var[ix], 
               mean_rank_skew[ix], 
               mean_rank_kurt[ix]))))

From the above, we can see that perhaps QG89_rsxy is the best relatedness index given our data. In the mean ranking, it is tied with QG89_xy, but the latter is not a symmetrical relatedness index, so we prefer the former. However, none of the relatedness indices was a clear winner, with the top ranking indices having a mean rank close to 4. Thus, in addition to QG89_rsxy, it is might be useful to also examine W02_cor, LR99_avg, and hk08. These give us a breadth of different indices to examine the error rates.

Below, we rank the relatedness indices bases on their bias, variance, skewness, and kurtosis. We then use the Blouin classifier to find optimal thresholds to separate unrelated, half-sib, and full-sib pairs.

# creating a sims data.table with all simulated values
sims_dt = list()
for (i in indices) {
  sims_trans <- dcast.data.table(data = sims, 
                                 formula = sim_ix ~ rel_cat, 
                                 value.var = i)
  sims_trans[,index := rep(i,nrow(sims_trans))]
  sims_dt[[i]] <- sims_trans
}

# bind the individual index lists together
sims_dt <- rbindlist(sims_dt)

# list of indexes to evaluate
indices <- indices[c(4,7,5,8)]
indices
n_indices <- length(indices)
rel_cats <- c('fs','hs','un')
n_thresholds <- 2
blouin_thresholds <- matrix(0,ncol = n_thresholds, nrow = n_indices)
for(i in 1:n_indices) {
  for(t in 1:n_thresholds) {
    blouin_thresholds[i,t] <- blouin_classifier(
      sims_dt[index == indices[i],list(sim_ix, fs, hs, un)], 
      rel_cats = c(rel_cats[t], rel_cats[(t+1)]))
  }
}
rownames(blouin_thresholds) <- indices
blouin_thresholds

sims_dt_long <- data.table:::melt.data.table(sims_dt, 
                                             id.vars = c("index", "sim_ix"))

ggplot(sims_dt_long[index %in% indices,], aes(x = value, fill = variable)) +
  geom_density(alpha = 0.5) +
  xlab("Index") +
  ylab("Density") +
  scale_fill_discrete(name = "Relatedness categories", 
                      labels = c("Full-Sibs", "Half-Sibs", "Unrelated")) +
  theme(legend.justification=c(1,0), legend.position=c(1,0.75)) +
  geom_vline(data = 
               data.frame(
                 index = rownames(blouin_thresholds), 
                 blouin_thresholds), 
             aes(xintercept = X1, group = index), 
             colour = 'red', 
             size = 2) +
  geom_vline(data = 
               data.frame(
                 index = rownames(blouin_thresholds), 
                 blouin_thresholds), 
             aes(xintercept = X2, 
                 group = index), 
             colour = 'red', 
             size = 2) +
  facet_wrap(facets = ~ index, ncol = 2, scales = 'free')

Now that we have thresholds, we can estimate the error rates for each index. As we can see from the figure below, hk08 is the most variable index, whileQG89_rsxy is the best index for identifying full-sibs and half-sibs. Although, this comes at a slight disadvantage when identifying unrelated individuals.

# our error function
calc_error_rate_general <- function(thresholds, data, rel_cats) {
  n_cats <- length(rel_cats)
  total_sims <- nrow(data)
  res = matrix(rep(0,n_cats^2), ncol = n_cats)
  for (i in 1:n_cats) {
    class_sims <- data[, cut(eval(as.symbol(rel_cats[i])), 
               breaks = thresholds, 
               labels = rel_cats, 
               include.lowest = T)]
    res[i,] <- rev(table(class_sims)/total_sims)
  }
  rownames(res) <- rel_cats
  colnames(res) <- rel_cats
  return(res)
}
# adding the limits and re-ordering the thresholds
blouin_thresholds <- cbind(
  rep(-1,nrow(blouin_thresholds)), 
  blouin_thresholds[,2], 
  blouin_thresholds[,1], 
  rep(1,nrow(blouin_thresholds)))

error_rates <- lapply(indices, function(ix)
  calc_error_rate_general(threshold = blouin_thresholds[ix,],
                  data = sims_dt[index == ix,list(sim_ix,fs, hs, un)], 
                  rel_cats = c("fs", "hs", "un"))
  )
names(error_rates) <- indices
prop_class <- sapply(error_rates, diag)
prop_class <- melt(prop_class)

ggplot(prop_class, aes(x = Var1, y = value, fill = Var2)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  xlab("Relatedness category") +
  ylab("Proportion correctly classified") +
  scale_fill_discrete(name = "Relatedness Index")

Separating `parent-offspring` from `full-sibs`

Separating parent-offspring pairs from full-sibs is not generally possible with relatedness indices only. At the moment, the best option is to examine patterns of proportion of loci with at least one shared allele.

More on this topic is forthcoming

andersgs/irelr documentation built on May 12, 2019, 2:41 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

andersgs/irelr
Pedigree Inference using Molecular Marker Data

In andersgs/irelr: Pedigree Inference using Molecular Marker Data

Introduction

The data

Data format

Loading the data

Estimating relatedness indices

The `sim_rel()` function

Plotting the results

Distribution of individual indices

Joint distribution of indices

Simulating data to classify pairs

Differentiating between full-sibs and half-sibs

The `sim_rel()` function

Examining the simulated values for a single index

The joint distribution of indices of different relatedness categories

The Blouin classifier

Calculate misclassification errors

Classifying individuals

Adding an unrelated category

Simulate unrelated individuals

Plotting the distribution of relatedness values

Checking the joint distribution

Classify pairs using the Blouin classifier

Comparing indices

Separating `parent-offspring` from `full-sibs`

R Package Documentation

Browse R Packages

We want your feedback!

andersgs/irelr Pedigree Inference using Molecular Marker Data

In andersgs/irelr: Pedigree Inference using Molecular Marker Data

Introduction

The data

Data format

Loading the data

Estimating relatedness indices

The sim_rel() function

Plotting the results

Distribution of individual indices

Joint distribution of indices

Simulating data to classify pairs

Differentiating between full-sibs and half-sibs

The sim_rel() function

Examining the simulated values for a single index

The joint distribution of indices of different relatedness categories

The Blouin classifier

Calculate misclassification errors

Classifying individuals

Adding an unrelated category

Simulate unrelated individuals

Plotting the distribution of relatedness values

Checking the joint distribution

Classify pairs using the Blouin classifier

Comparing indices

Separating parent-offspring from full-sibs

R Package Documentation

Browse R Packages

We want your feedback!

andersgs/irelr
Pedigree Inference using Molecular Marker Data

The `sim_rel()` function

The `sim_rel()` function

Separating `parent-offspring` from `full-sibs`