Introduction

knitr::opts_chunk$set(fig.width = 7)

First, the library is loaded:

library(malan)

For reproducibility, the seed for the (pseudo) random number generator is set:

set.seed(1)

Population simulation

A standard Wright-Fisher population can be simulated (hiding progress information) as follows:

sim_res <- sample_geneology(population_size = 10, generations = 10, progress = FALSE)

Building the pedigrees

Until pedigrees are built/inferred, there is not much information available (e.g. about children). So let us infer the pedigrees:

pedigrees <- build_pedigrees(sim_res$population, progress = FALSE)
pedigrees
pedigrees_count(pedigrees)
pedigrees_table(pedigrees)
pedigree_size(pedigrees[[1]])
pedigree_size(pedigrees[[2]])
#pedigree_size(pedigrees[[3]]) # error as there are only 2 pedigrees

The pedigrees can be plotted all at once:

plot(pedigrees)

Or just one at a time:

plot(pedigrees[[1]])
plot(pedigrees[[2]])

Some information about the population can be obtained. For example, the individuals in the final generation can be saved:

str(sim_res, 1)
live_individuals <- sim_res$end_generation_individuals

And a live individual is printed:

print_individual(live_individuals[[1]])

We can also print another individual (from the entire population):

indv <- get_individual(sim_res$population, 22)
print_individual(indv)

Run a mutation process

set.seed(1)

mutrts <- c(0.5, 0.5)
pedigrees_all_populate_haplotypes(pedigrees = pedigrees, 
                                  loci = length(mutrts), 
                                  mutation_rates = mutrts, progress = FALSE)

Individual pedigrees can now be plotted with haplotype information:

plot(pedigrees[[1]], haplotypes = TRUE)

And the individual id can be removed to only display the haplotype:

plot(pedigrees[[1]], ids = FALSE, haplotypes = TRUE)

And one or more individuals can be marked/highlighted:

plot(pedigrees[[1]], ids = TRUE, haplotypes = TRUE, mark_pids = c(14, 16))

More than 1 full generation

By standard, only the last generation contains $N$ individuals. If the 3 last generations should be full, this can be done by specifying generations_full = 3 as follows:

set.seed(1)
sim_res <- sample_geneology(population_size = 10, 
                            generations = 5, 
                            generations_full = 3,
                            progress = FALSE)
pedigrees <- build_pedigrees(sim_res$population, progress = FALSE)
plot(pedigrees)

And to obtain the complete history, generations_full is set to generations:

set.seed(1)
sim_res <- sample_geneology(population_size = 10, 
                            generations = 5, 
                            generations_full = 5,
                            progress = FALSE)
pedigrees <- build_pedigrees(sim_res$population, progress = FALSE)
plot(pedigrees)

Now, there are 10 individuals in all generations.

Simulate to one founder

By standard, the number of generations are specified. Instead, it can be specified to continue simulating until one common founder is reached by specifying generations = -1:

set.seed(1)
sim_res <- sample_geneology(population_size = 10, 
                            generations = -1, 
                            progress = FALSE)
pedigrees <- build_pedigrees(sim_res$population, progress = FALSE)
plot(pedigrees)

The number of generations needed can be obtained as follows:

sim_res$generations

Counting matches

Simulating the population

Let's try to simulate a larger population with 3 full generations (the additional generations_return is to get all individuals in the last 3 generations returned in the individuals_generations slot, cf. below):

set.seed(1)
sim_res <- sample_geneology(population_size = 1e3, 
                            generations = 200, 
                            generations_full = 3,
                            generations_return = 3, # default value
                            progress = FALSE)

And build the pedigrees:

pedigrees <- build_pedigrees(sim_res$population, progress = FALSE)
pedigrees_table(pedigrees)
pedigrees_count(pedigrees)

So there are r pedigrees_count(pedigrees) pedigrees. Let's try to plot the largest one:

ped_sizes <- sapply(1L:pedigrees_count(pedigrees), function(i) pedigree_size(pedigrees[[i]]))
ped_sizes
largest_i <- which.max(ped_sizes)
plot(pedigrees[[largest_i]])

And the impose mutations from a 20 locus haplotype with mutation rate 0.001 per locus:

set.seed(10)
mutrts <- rep(0.001, 20)
pedigrees_all_populate_haplotypes(pedigrees = pedigrees, 
                                  loci = length(mutrts), 
                                  mutation_rates = mutrts, progress = FALSE)

The haplotypes at the live individuals (3 generations) can be inspected:

live_individuals <- sim_res$individuals_generations
length(live_individuals)
haps <- get_haplotypes_individuals(individuals = live_individuals)
head(haps)

Lets look at the spectrum:

haps_str <- apply(haps, 1, paste0, collapse = ";")
haps_tab <- table(haps_str)
sort(haps_tab, decreasing = TRUE)[1:10]
spectrum <- table(haps_tab)
spectrum

Drawing an individual and counting matches

set.seed(100)
Q_index <- sample.int(n = length(live_individuals), size = 1)
Q <- live_individuals[[Q_index]]
Q_hap <- get_haplotype(Q)
Q_hap

First, identify $Q$'s pedigree:

Q_ped <- get_pedigree_from_individual(Q)

Now, count matches in pedigree and in live population:

count_haplotype_occurrences_pedigree(pedigree = Q_ped, haplotype = Q_hap, generation_upper_bound_in_result = 2)
count_haplotype_occurrences_individuals(individuals = live_individuals, haplotype = Q_hap)

We can also inspect pedigree matches information about number of meioses and $L_1$ distances:

path_details <- pedigree_haplotype_matches_in_pedigree_meiosis_L1_dists(suspect = Q, 
                                                                        generation_upper_bound_in_result = 2)
nrow(path_details)
head(path_details)

Look at the distribution of number of meioses between $Q$ and the matches (there are 0 meioses between Q and himself):

meioses <- path_details[, 1]
hist(meioses)

On the path between $Q$ and the match, the maximum $L_1$ difference between $Q$'s haplotype and the haplotypes of the individuals on the path is recorded (0 means that that no mutations have occured on the path between $Q$ and the match):

L1_max <- path_details[, 2]
table(L1_max)
mean(L1_max == 0)

Mixtures

Draw true contributors:

set.seed(100)
U_indices <- sample.int(n = length(live_individuals), size = 2, replace = FALSE)
U1 <- live_individuals[[U_indices[1]]]
U2 <- live_individuals[[U_indices[2]]]
H1 <- get_haplotype(U1)
H2 <- get_haplotype(U2)

View haplotypes:

rbind(H1, H2)

Now, find those haplotype in live individuals (those haplotypes are in haps from before) that are included in (or compatible with) the mixture:

#mixres <- indices_in_mixture_by_haplotype_matrix(haplotypes = haps, H1 = H1, H2 = H2)
mixres <- mixture_info_by_individuals_2pers(live_individuals, U1, U2)
str(mixres, 1)

Compare with matches:

length(mixres$pids_matching_donor1)
count_haplotype_occurrences_individuals(individuals = live_individuals, haplotype = H1)

length(mixres$pids_matching_donor2)
count_haplotype_occurrences_individuals(individuals = live_individuals, haplotype = H2)

In mixture that are not H1 and not H2:

length(mixres$pids_others_included)

Inspect these (only unique ones):

others_haps <- get_haplotypes_pids(sim_res$population, mixres$pids_others_included)
others_haps <- others_haps[!duplicated(others_haps), ]
others_haps

And get their population counts:

others_haps_counts <- unlist(lapply(seq_len(nrow(others_haps)), function(hap_i) {
            count_haplotype_occurrences_individuals(individuals = live_individuals, 
                                                    haplotype = others_haps[hap_i, ])
          }))
sum(others_haps_counts)
length(mixres$pids_others_included)

Compare with profiles of true contributors:

rbind(H1, H2)

Other functions

Variance in number of children

Let $\alpha$ be the parameter of a symmetric Dirichlet distribution specifying each man's probability to be the father of an arbitrary male in the next generation. When $\alpha=5$, a man's relative probability to be the father has 95\% probability to lie between 0.32 and 2.05, compared with a constant 1 under the standard Wright-Fisher model and the standard deviation in the number of male offspring per man is 1.10 (standard Wright-Fisher = 1).

This symmetric Dirichlet distribution is implemented by drawing father (unscaled) probabilities from a Gamma distribution with parameters shape and scale that are then normalised to sum to 1. To obtain a symmetric Dirichlet distribution with parameter $\alpha$, the shape must be $\alpha$ and scale $1/\alpha$. This is simulated as follows (note the enable_gamma_variance_extension parameter):

dirichlet_alpha <- 5

set.seed(1)
sim_res <- sample_geneology(population_size = 10, 
                            generations = 10, 
                            enable_gamma_variance_extension = TRUE,
                            gamma_parameter_shape = dirichlet_alpha,
                            gamma_parameter_scale = 1 / dirichlet_alpha,
                            progress = FALSE)
pedigrees <- build_pedigrees(sim_res$population, progress = FALSE)
plot(pedigrees)

Let us verify the claim that the standard deviation in the number of male offspring per man is 1.10. The easiest way is to get information about father id's, which is done by asking for a verbose_result:

N <- 1000

set.seed(1)
sim_res <- sample_geneology(population_size = N, 
                            generations = 2, 
                            enable_gamma_variance_extension = TRUE,
                            gamma_parameter_shape = dirichlet_alpha,
                            gamma_parameter_scale = 1/dirichlet_alpha,
                            progress = FALSE, verbose_result = TRUE)

tbl_fathers_with_children <- table(sim_res$father_pids[, 1])
tbl_fathers_no_children <- rep(0, N - length(tbl_fathers_with_children))

number_of_children <- c(tbl_fathers_with_children, tbl_fathers_no_children)
number_of_children <- as.numeric(number_of_children)

mean(number_of_children)
sd(number_of_children)

Let os get estimates in parallel:

get_number_children <- function(N) {
  sim_res <- sample_geneology(population_size = N, 
                              generations = 2, 
                              enable_gamma_variance_extension = TRUE,
                              gamma_parameter_shape = dirichlet_alpha,
                              gamma_parameter_scale = 1 / dirichlet_alpha,
                              progress = FALSE, verbose_result = TRUE)

  tbl_fathers_with_children <- table(sim_res$father_pids[, 1])
  tbl_fathers_no_children <- rep(0, N - length(tbl_fathers_with_children))

  number_of_children <- c(tbl_fathers_with_children, tbl_fathers_no_children)
  number_of_children <- as.numeric(number_of_children)

  return(number_of_children)
}

library(parallel)
options(mc.cores = 2)
RNGkind("L'Ecuyer-CMRG") # for mclapply
set.seed(1)
x <- mclapply(1:100, function(i) get_number_children(100))
sds <- unlist(lapply(x, sd))
mean(sds)

Population growth

Population growth can be simulated by specifying the population size at each generation by the population_sizes vector, where the length thus specifies the number of generations:

set.seed(1)
sim_res_growth <- sample_geneology_varying_size(population_sizes = c(10, 20, 10), 
                                                generations_full = 3, 
                                                progress = FALSE)

Note how generations_full was used to obtain all individuals in the generations (and not just those with descendants in the last two):

pedigrees_growth <- build_pedigrees(sim_res_growth$population, progress = FALSE)
plot(pedigrees_growth)


Try the malan package in your browser

Any scripts or data that you put into this service are public.

malan documentation built on July 2, 2020, 3:01 a.m.