genedroppeR: conduct single-locus gene-drop analyses through pedigrees.

Susan Johnston, May 2019

Updated Sep 2023

A "gene-drop" simulation approach can model expected changes in allele frequencies due to genetic drift, given an identical pedigree structure.

To install this package, you can use the install_github command in library(devtools):

library(devtools)
install_github("susjoh/genedroppeR")

NB This is a work in progress. More description will be given in time. The functions are also slower than they could be - we are working on it.

Do you have comments, polite criticism, feedback, helpful suggestions? Please contact me at Susan.Johnston (at) ed.ac.uk.

Loading the library and examining the example dataset.

library(genedroppeR)

The library comes with an example dataset from a long-term study of unicorns on the island of Áiteigin off the South coast of Scotland (Figure 1). They have been genotyped for three loci: HornSNP, a major quantitative trait locus for horn length; MHC, encompassing haplotype variation at the Magic Histocompatibility Locus; and ColourSNP, a SNP responsible for a rare glitter colour polymorphism.

Figure 1: A typical unicorn.

data(unicorn)
str(unicorn)

Unicorns have a promiscuous mating system, in common with other magical beasts. We can see an example of the pedigree below. The numbers at the side indicate the cohorts of when each unicorn was born. This can be any sequential series of numbers.

# library(devtools)
# install_github("susjoh/ggpedigree")

library(ggpedigree)

ggpedigree(pedigree = unicorn[,1:3],
           cohort = unicorn$cohort,
           line.col.mother = "black",
           line.col.father = "black",
           point.size = 0, line.alpha = 0.1)

We can summarise this data in terms of the allele frequency changes using the function summary_cohort(). Let's start with the HornSNP locus.

unicorn.summ <- summary_cohort(id = unicorn$id,
                               mother = unicorn$mother,
                               father = unicorn$father,
                               cohort = unicorn$cohort,
                               genotype = unicorn$HornSNP)

unicorn.summ

This output provides the allele frequencies for the A and G alleles per cohort, as well as a count of the number of genotyped individuals and the total number of individuals per cohort, the proportion of genotyped individuals, the numbers of non founders and founders, and the proportion of founders per cohort. A founder is defined as an individual where neither the mother or father is known.

The frequency of the A allele, which confers a larger horn, seems to be increasing in the population:

library(ggplot2)

ggplot(unicorn.summ, aes(cohort, A)) +
  geom_line() +
  stat_smooth(method = "lm") +
  ggtitle("Temporal dynamics of A allele")

However, given the complex structure of the pedigree, it may be that this increase in allele frequency is more likely to be due to drift than selection. One approach to model this is to simulate allele frequency changes given a pedigree of the same structure, and see if the observed allele frequency change is significantly different from what we would expect by chance.

Example 1: The Horns locus.

In early cohorts, many individuals are likely to be founders. To run the gene-drop, we select some cohorts at the start to be the "founder" cohorts, where allele frequencies are sampled from the observed frequencies in each year. To determine which cohorts we define as founder cohorts, we need to explore the proportion of genotyped and founder individuals to determine the best way to set up our simulation.

ggplot(unicorn.summ, aes(cohort, PropFounders)) +
  geom_line() +
  ggtitle("Proportion of Founder IDs per cohort")

ggplot(unicorn.summ, aes(cohort, PropGenotyped)) +
  geom_line()  +
  ggtitle("Proportion of Genotyped IDs per cohort")

The number of cohorts declines rapidly until around 5 cohorts, so lets define cohorts 1:5 as founder cohorts, and 6:20 as simulated cohorts i.e. we will investigate how allele frequencies change in the second 15 year period from year 6 to year 20.

As Horns is a biallelic locus, we can model the gene-drop simulation using the function genedrop_snp(). We define the id, mother, father, cohort and genotype using columns from the unicorn dataset. We then define the following:

Let's try it:

unicorn.UF <- genedrop_snp(id = unicorn$id,
                           mother = unicorn$mother,
                           father = unicorn$father,
                           cohort = unicorn$cohort,
                           genotype = unicorn$HornSNP,
                           nsim = 1000,
                           n_founder_cohorts = 5,
                           fix_founders = F,
                           verbose = T,
                           interval = 200)
str(unicorn.UF)

This outputs a data frame that includes the simulated genotypes for each individual. For computational reasons, the genotypes have been reported as allele dosages, e.g. 0 = AA, 1 = AG, 2 = GG.

This is a very large dataset, but we can summarise the allele frequencies per cohort with the function summary_genedrop:

unicorn.UF.summ <- summary_genedrop(unicorn.UF)
str(unicorn.UF.summ)

This object contains two lists: the observed frequencies at Horn and the simulated frequencies.

unicorn.UF.summ$observed_frequencies

To visualise how the observed and simulated frequencies compare, we can use the plot_genedrop_results() function:

plot_genedrop_results(unicorn.UF.summ)

We can see that the founder cohorts (1:5) were sampled based on the observed frequencies, but this constraint is then removed after 5 generations. The red line shows the observed allele frequency change.

We can then use the function plot_genedrop_lm_slopes() to look at how the linear regression slopes of the allele frequencies compare to the observed slope (red vertical line). This function also outputs the number of simulated slopes that were higher or lower than the observed slope.

plot_genedrop_lm_slopes(unicorn.UF.summ,
                        n_founder_cohorts = 5,
                        remove_founders = T)

There is some suggestion that the increase in frequency of the A allele could be due to selection in this population.

One apparent shortcoming of this approach is that we can only detect strong, directional selection. This is a valid criticism. However, one could potentially identify signature of balancing selection at a locus if there is less cumulative change in allele frequency over time than expected due to change (NB. At this stage this is just an idea - any feedback on this approach is welcome). The function plot_genedrop_cumulative_change() plots and reports the cumulated change over the non-founder cohorts:

plot_genedrop_cumulative_change(unicorn.UF.summ,
                                n_founder_cohorts = 5,
                                remove_founders = T)

This also suggests that the observed allele frequency change from year to year is less than expected by chance.

unicorn.UF.resample <- genedrop_snp(id = unicorn$id,
                           mother = unicorn$mother,
                           father = unicorn$father,
                           cohort = unicorn$cohort,
                           genotype = unicorn$HornSNP,
                           nsim = 1000,
                           n_founder_cohorts = 5,
                           fix_founders = F,
                           verbose = T,
                           interval = 200,
                           resample_offspring = T)



unicorn.UF.resample.summ <- summary_genedrop(unicorn.UF.resample)
plot_genedrop_results(unicorn.UF.resample.summ)

plot_genedrop_lm_slopes(unicorn.UF.resample.summ,
                        n_founder_cohorts = 5,
                        remove_founders = T)

plot_genedrop_cumulative_change(unicorn.UF.resample.summ,
                                n_founder_cohorts = 5,
                                remove_founders = T)

Example 2: The Magic Histocompatibility Locus (MHC)

This locus characterises haplotype variation at a moderately variable locus controlling magic compatibility within unicorns.

table(unicorn$MHC)

As you can see, there are 4 alleles at this locus. Because there are more than two alleles, we can conduct gene-dropping simulations with the function genedrop_multi(). NB. If you are running this on genotypes with a delimiter e.g. "A/C" or "Lo_Hi", you will need to define it using genotype_delim = "/" or genotype_delim = "_" within the function.

unicorn.summ <- summary_cohort(id = unicorn$id,
                               mother = unicorn$mother,
                               father = unicorn$father,
                               cohort = unicorn$cohort,
                               genotype = unicorn$MHC)

unicorn.summ

unicorn.MHC.UF <- genedrop_multi(id = unicorn$id,
                                 mother = unicorn$mother,
                                 father = unicorn$father,
                                 cohort = unicorn$cohort,
                                 genotype = unicorn$MHC,
                                 nsim = 1000,
                                 n_founder_cohorts = 5,
                                 fix_founders = F,
                                 verbose = T,
                                 interval = 1)

unicorn.MHC.UF.summ <- summary_genedrop(unicorn.MHC.UF)

plot_genedrop_results(unicorn.MHC.UF.summ)

plot_genedrop_lm_slopes(unicorn.MHC.UF.summ)

plot_genedrop_cumulative_change(unicorn.MHC.UF.summ)

Example 3: The Colour Locus

unicorn.summ <- summary_cohort(id = unicorn$id,
                               mother = unicorn$mother,
                               father = unicorn$father,
                               cohort = unicorn$cohort,
                               genotype = unicorn$ColourSNP)

unicorn.summ

unicorn.colour.UF <- genedrop_snp(id = unicorn$id,
                                  mother = unicorn$mother,
                                  father = unicorn$father,
                                  cohort = unicorn$cohort,
                                  genotype = unicorn$ColourSNP,
                                  nsim = 1000,
                                  n_founder_cohorts = 5,
                                  fix_founders = F,
                                  verbose = T,
                                  interval = 200)

unicorn.colour.UF.summ <- summary_genedrop(unicorn.colour.UF)

plot_genedrop_results(unicorn.colour.UF.summ)

plot_genedrop_lm_slopes(unicorn.colour.UF.summ)

plot_genedrop_cumulative_change(unicorn.colour.UF.summ)


susjoh/genedroppeR documentation built on Sept. 9, 2024, 3:19 a.m.