knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(tidyverse) library(CKMRpop)
For this first example, we use the hypothetical life history of species 2 --- a teleost. First we have to set spip up to run with that life history.
spip
has a large number of demographic parameters. Typically spip
is run as
a command-line program in Unix. In CKMRpop, all that action goes on under the
hood, but you still have to use the spip
parameters. This vignette is not about
using spip
. For a short listing of all the spip
options, do this:
library(CKMRpop) system(paste0(spip_binary(), " --help"))
If you want a full, complete, long listing of all the spip
options, then you
can do:
library(CKMRpop) system(paste0(spip_binary(), " --help-full"))
All of the "long-form" options to spip
are given on the Unix command line starting
with two dashes, like --fem-surv-probs
. To set parameters within CKMRpop
to send
to spip, you simply make a named list of input values. The names of the items in the
list are the long-format option names without the leading two dashes. For an example,
see the package data object species_1_life_history
, as described below.
These parameters are included in the package in the variable species_1_life_history
.
It is named list of parameters to send to spip. The list names are the names of the
spip options. It looks like this:
species_2_life_history
We must note here that we have done something reasonable for computational efficiency. The rate of survivla from age 0 to age 1 is almost certainly much lower than 100%, but 0-year olds do not contribute to the next generation (0 prob of reproducing), and we arent't really keeping track of their numbers, so we just include them in there, knowing that we aren't really caring about individuals until they are at least 1 year old. And with that, we can set the cohort size to whatever we need it to be to have the right number of 2+ year-olds, for example. There isn't much point to simulating a bunch of newborns that end up dying before we even care about them.
Now, we want to add instructions to those life-history parameters, telling spip how long to run the simulation, and what the initial census sizes should be.
So, first, we copy species_2_life_history
to a new variable, SPD
:
SPD <- species_2_life_history
Now, we can add things to SPD.
The number of new fish added each year is called the "cohort-size". Once we know that, we can figure out what the stable age distribution would be given the survival rates, and we can use that as our starting point. There is a function in the package that helps with that:
# before we tell spip what the cohort sizes are, we need to # tell it how long we will be running the simulation SPD$`number-of-years` <- 60 # run the sim forward for 50 years # This is our cohort size that gives us a stable age distribution # and population size of 2+ individuals of about half a million #cohort_size <- 370000 cohort_size <- 370 # Do some matrix algebra to compute starting values from the # stable age distribution: L <- leslie_from_spip(SPD, cohort_size) sum(L$stable_age_distro_fem[-(1:2)]) # then we add those to the spip parameters SPD$`initial-males` <- floor(L$stable_age_distro_fem) SPD$`initial-females` <- floor(L$stable_age_distro_male) # tell spip to use the cohort size SPD$`cohort-size` <- paste("const", cohort_size, collapse = " ") # how many 2, 3, and 4 year olds? sum(L$stable_age_distro_fem[2:4]) + sum(L$stable_age_distro_male[2:4])
Spip let's you specify what fraction of fish of different ages should be sampled in different years. We want about 5000 samples a year so let's sample the 2 to 4 year olds at 1.5%. We will sample this many for 3 years, from years 51 to 53. That will give us enough time that we shouldn't have founders less than 3 generations back.
First off, note that these are going to be lethal samples, so:
SPD$`lethal-sampling` <- 1.00
samp_frac <- 0.01 samp_start_year <- 51 samp_stop_year <- 52 SPD$`discard-all` <- 0 SPD$`gtyp-ppn-fem-pre` <- paste( samp_start_year, "-", samp_stop_year, " 0 ", # don't sample any 0/1 year-olds samp_frac, " ", samp_frac, " ", samp_frac, " ", # sample 1.5% of the 2, 3, and 4, year olds paste(rep(0, SPD$`max-age` - 4), collapse = " "), sep = "" ) SPD$`gtyp-ppn-male-pre` <- SPD$`gtyp-ppn-fem-pre`
There is a function that does all this for you. It runs spip in a temporary directory. After running spip, it also processes the output with a few shell scripts. The function returns the path to the temporary directory. You will need that to slurp all the results back in.
set.seed(5) spip_dir <- run_spip(pars = SPD) # now read that in and find relatives within the grandparental range slurped <- slurp_spip(spip_dir, 2)
Although during massive production simulations, you might not go back to every run and summarize it to see what it looks like, when you are parameterizing demographic simulations you will want to be able to quickly look at observed demographic rates and things. There are a few functions in CKMRpop that make this quick and easy to do.
This is just a convenience function to make a pretty plot so you can check to see what the population demographics look like:
ggplot_census_by_year_age_sex(slurped$census_prekill)
This shows that the function leslie_from_spip()
does a good job of finding the
initial population numbers that accord with the stable age distribution.
We can compute the survival rates like this:
surv_rates <- summarize_survival_from_census(slurped$census_prekill)
That returns a list. One part of the list is a tibble with observed survival fractions. The first 40 rows look like this:
surv_rates$survival_tibble %>% slice(1:40)
The second part of the list holds a plot with histograms of age-specific, observed survival rates across all years. The blue line is the mean over all years.
surv_rates$plot_histos_by_age_and_sex
To compare these values to the parameter values for the simulation, you must pass those to the function:
surv_rates2 <- summarize_survival_from_census( census = slurped$census_prekill, fem_surv_probs = SPD$`fem-surv-probs`, male_surv_probs = SPD$`male-surv-probs` ) # print the plot surv_rates2$plot_histos_by_age_and_sex
Here, the red dashed line is the value chosen as the parameter for the simulations. The means are particularly different for the older age classes, which makes sense because there the total number of individuals in each of those year classes is smaller.
It makes sense to check that your simulation is delivering a reasonable distribution of offspring per year. This is the number of offspring that survive to just before the first prekill census. Keep in mind that, for super high-fecundity species, we won't model every single larva, we just don't start "keeping track of them" until they reach a stage that is recognizable in some way.
We make this summary from the pedigree information. In order to get the number of adults that were present, but did not produce any offspring, we also need to pass in the postkill census information. Also, to get lifetime reproductive output, we need to know how old individuals were when they died, so we also pass in the information about deaths.
To make all the summaries, we do:
offs_and_mates <- summarize_offspring_and_mate_numbers( census_postkill = slurped$census_postkill, pedigree = slurped$pedigree, deaths = slurped$deaths, lifetime_hexbin_width = c(1, 2) )
Note that we are setting the lifetime reproductive output hexbin width to be suitable for this example.
The function above returns a list of plots, as follows:
offs_and_mates$plot_age_specific_number_of_offspring
Especially when dealing with viviparous species (like sharks and mammals) it is worth checking this to make sure that there aren't some females having far too many offspring.
Especially with long-lived organisms, it can be instructive to see how lifetime reproductive output varies with age at death.
offs_and_mates$plot_lifetime_output_vs_age_at_death
Yep, many individuals have no offspring, and you have more kids if you live longer.
Out of all the offspring born each year, we can tabulate the fraction that were born to males (or females) of each age. This summary shows a histogram of those values. The represent the distribution of the fractional contribution of each age group each year.
offs_and_mates$plot_fraction_of_offspring_from_each_age_class
The blue vertical lines show the means over all years.
Some of the parameters in spip
affect the distribution of the number
of mates that each individual will have. We can have a quick look
at whether the distribution of number of mates (that produced at least
one offspring) appears to be what we might hope it to be.
mates <- count_and_plot_mate_distribution(slurped$pedigree)
That gives us a tibble with a summary, like this:
head(mates$mate_counts)
And also a plot:
mates$plot_mate_counts
First, let's count up how many samples of different ages we have gotten. We are sampling before the episode of death in a year.
samps <- slurped$samples %>% mutate( first_samp = map_int(samp_years_list, function(x) x[1]), # just record the first year they were sampled age = first_samp - born_year )
Now, get the total number of distinct individuals sampled each year
samps %>% count(first_samp)
samps %>% count(sex, first_samp, age)
From the samples that we slurped up from the spip output we
compile all the related pairs that we found in there with a single
function. It is important to note that this finds all the related pairs
that share ancestors back within num_generations
generations. Recall, that
we ran slurp_spip()
with num_generations = 2
which means that we
check for matching ancestors up to and including the grandparents of the sample.
crel <- compile_related_pairs(slurped$samples)
The result that comes back has a single row for each pair. The individuals appear in each pair such that the first name comes before the second name, alphabetically, in each pair. There is information in list columns about the year(s) that each member of the pair was sampled in and also years in which they were born, and also the indices of the populations they were sampled from. (Knowing the population will become useful when/if we start simulating multiple populations connected by gene flow). Here we show the first 10 pairs in the samples:
crel %>% slice(1:10)
Because some pairs might be related in multiple ways (i.e., they might be paternal half-sibs,
but, through their mother's lineage, they might also be half-first cousins), things can
get complicated. However, CKMRpop
has an algorithm to categorize pairs according to
their "most important" relationship.
The column dom_relat
gives the "dominant" or "closest" relationship between the pair.
The possibilities, when considering up to two generations of ancestors are:
Se
: self. PO
: parent-offspringSi
: sibling. GP
: grandparentalA
: avuncular (aunt-neice)FC
: first cousin. The max_hit
column can be interpreted as the number of shared ancestors at the
level of the dominant relationship. For example a pair of half-siblings are of
category Si
and have max_hit = 1
, because they share one parent. On the other
hand, a pair with category Si
and max_hit = 2
would be full siblings.
Likewise, A
with max_hit = 1
is a half-aunt-neice or half-uncle-nephew pair,
while A
with max_hit = 2
would be a full-aunt-neice or full-uncle-nephew pair.
The column dr_hits
gives the number of shared ancestors on the upper vs lower
diagonals of the ancestry match matrices (see below). These are meaningful primarily
for understanding the "directionality" of non-symmetrical relationships. Some explanation
is in order: some relationships, like Se, Si, and FC are symmetrical relationships, because,
if, for example Greta is your sibling, then you are also Greta's sibling. Likewise, if you are
Milton's first cousin, then Milton is also your first cousin. Other relationships, like
PO, A, and GP, are not symmetrical: If Chelsea is your mother, then you are not Chelsea's
mother, you are Chelsea's child. In the non-symmetrical relationships there is always one
member who is typically expected to be older than the other. This is a requirement
in a direct-descent relationship (like parent-offspring, or grandparent-grandchild),
but is not actually required in avuncular relationships (i.e. it is possible to have an
aunt that is younger than the nephew...). We refer to the "typically older" member
of non-symmetrical pairs as the "upper member" and the upper_member
column of the output
above tells us whether id_1
or id_2
is the upper member in such relationships,
when upper_member
is 1 or 2, respectively. upper_member
is NA for symmetrical
relationships and it can be 0 for weird situations that should rarely arise where,
for example a pair A and B is related such that A is B's half-uncle, but B is A's half-aunt.
Often the dominant relationship is the only relationship between the pair. However,
if you want to delve deeper into the full relationship out to num_generations
generations, you can analyze the ancestry match matrix for the pair, which is stored in the
the anc_match_matrix
column. This matrix holds a TRUE
for each shared ancestor in the two individual's ancestry (out to num_generations
).
If this seems obtuse, it should become more understandable when we look at some figures, later.
Here, we count up the number of pairs that fall into different relationship types:
relat_counts <- count_and_plot_ancestry_matrices(crel)
The first component of the return list is a tibble of the relationship counts in a highly summarized form
tabulating just the dom_relat
and max_hit
over all the pairs.
relat_counts$highly_summarised
This is telling us there are 230 half-first-cousin pairs, 161 half-avuncular (aunt/uncle with neice/nephew) pairs, and so forth.
As noted before, the table able just lists the dominant relationship of
each pair. If you want to quickly assess, within those dominant categories,
how many specific ancestry match matrices underlie them, you can
look at the dr_counts
component of the output:
relat_counts$dr_counts
Each of these distinct ancestry match matrices for each dominant relationship can be visualized in a series of faceted plots, which are also returned. For example, the ancestry match matrices seen amongst the FC relationships are:
relat_counts$dr_plots$FC
Within each dominant relationship, the distinct ancestry matrices in each
separate panel are named according to their number
(001
, 002
, 003
, etc), relationship and the
dr_hits
vector, (FC[1,2]
) and the number of times this ancestry match matrix
was observed amongst pairs in the sample (like - 4
). So 006-FC[1,1] - 24
was
observed in 24 of the sampled pairs.
It is worth pointing out that the 014-FC[2,2] - 2
plot shows two pairwise relationships
in which individual 2 is inbred, because its father's father (pp) and mother's father
(mp) are the same individual.
Let's look at the distinct ancestry match matrices from the siblings:
relat_counts$dr_plots$Si
Here in 005
we see an interesting case where ind_1 and ind_2 are maternal half sibs,
put also individual 2's father is also his/her uncle.
At times, for example, when looking for the more bizarre relationships, you might just want to visualize all the ancestry match matrices in order of the number of times that they occur. The number of different ancestry match matrices (and the matrices themselves) can be accessed with:
relat_counts$anc_mat_counts
But more useful for visualizing things is relat_counts$anc_mat_plots
which is a list that
holds a series of pages/plots showing all the different
ancestry matrices seen. Here are the first 30:
relat_counts$anc_mat_plots[[1]]
And here are the remaining 15 relationship types:
These are worth staring at for a while, and making sure you understand what they are saying. I spent a lot of time staring at these, which is how I settled upon a decent algorithm for identifying the dominant relationship in each.
When using spip
within CKMRpop
you have to specify the fraction of
individuals in the population that you want to sample at any particular time.
You must set those fractions so that, given the population size, you end up with
roughly the correct number of samples for the situation you are trying to
simulate. Sometimes, however, you might want to have sampled exactly 5,000
fish. Or some other number. The function downsample_pairs
lets you randomly
discard specific instances in which an individual was sampled so that the
number of individuals (or sampling instances) that remains is the exact number
you want.
For example, looking closely at slurped$samples
shows that 168 distinct individuals were sampled:
nrow(slurped$samples)
However, those 168 individuals represent 170 distinct sampling instances, because two of the individuals were each sampled twice, as, in this simulation scenario, sampling the individuals does not remove them from the population.
SS2 <- slurped$samples %>% filter(map_int(samp_years_list, length) > 1) %>% select(ID, samp_years_list) SS2
And the years that these individuals were sampled are as follows:
# first indiv: SS2$samp_years_list[[1]]
Great! Now, imagine that we wanted to see how many kin pairs we found when our sampling was such that we had only 100 instances of sampling (i.e., it could have been 98 individuals sampled in total, but two of them were sampled in two different years). We do like so:
subsampled_pairs <- downsample_pairs( S = slurped$samples, P = crel, n = 100 )
Now there are only 179 pairs instead of 513.
We can do a little calculation to see if that makes sense: because the number of pairs varies roughly quadratically, we would expect that if we go from 170 instances to 100, we would expect the number of pairs to decrease by a factor of $(100/170)^2 \approx 0.346$. Let's check that: $179/513 = \approx 0.348$. Yep, that makes sense.
Finally, in order to visually summarize all the kin pairs that were found, with specific reference to their age, time of sampling, and sex, I find it helpful to use what I have named the "Uncooked Spaghetti Plot". There are multiple subpanels on this plot. Here is how to read/view these plots:
F->F
, F->M
, M->F
, and M->M
. These
refer to the different possible combinations of sexes of the individuals in the pair.F
for female or M
for male) denoting the sex of the "upper_member"
of the relationship. That is, if it is PO, then the sex of the parent is the first letter.
The sex of the non-upper-member is the second letter. Thus a PO
pair that consists of
a father and a daughter would appear in a plot that is in the PO
row in the M->F
column.F->M
subpanel.max_hits
) at the level
of the dominant relationship. This is how you can distinguish full-sibs from half-sibs, etc.We crunch out the data and make the plot like this:
# because we jitter some points, we can set a seed to get the same # result each time set.seed(22) spag <- uncooked_spaghetti( Pairs = crel, Samples = slurped$samples )
Now, the plot can be printed like so:
spag$plot
Looking at this, it is clear that I will also want to add a function to tally up the number of pairwise relationships that are shared between members---basically just compiling the connected components for all of those.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.