Case study: Partial profiles
In malan: MAle Lineage ANalysis

knitr::opts_chunk$set(fig.width = 7)

First, the library is loaded:

library(malan)

For reproducibility, the seed for the (pseudo) random number generator is set:

set.seed(1)

Population simulation

First, the population sizes are determined:

population_sizes <- round(c(rep(100, 10), 100*1.02^(1:10)))
plot(population_sizes, type = "b")

A population can be simulated (hiding progress information) as follows:

set.seed(1) # For reproducibility
sim_res_growth <- sample_geneology_varying_size(
  population_sizes = population_sizes, 

  # VRS = 0.2:
  enable_gamma_variance_extension = TRUE,
  gamma_parameter_shape = 5,
  gamma_parameter_scale = 1/5,

  # Live population: 
  # 3 generations
  generations_full = 3, 
  generations_return = 3,

  progress = FALSE)

Live population:

live_pop <- sim_res_growth$individuals_generations

Building the pedigrees

Until pedigrees are build/infered, there is not much information available (e.g. about children). So let us infer the pedigrees:

pedigrees <- build_pedigrees(sim_res_growth$population, progress = FALSE)
pedigrees
pedigrees_count(pedigrees)
pedigrees_table(pedigrees)
pedigree_size(pedigrees[[1]])

We can look at the population as a (tidy)graph:

g <- as_tbl_graph(pedigrees)
g

This can be plotted:

if (requireNamespace("ggraph", quietly = TRUE)) {
  library(ggraph)
  p <- ggraph(g, layout = 'tree') +
    geom_edge_link() +
    geom_node_point(size = 8) +
    geom_node_text(aes(label = name), color = "white") +
    facet_nodes(~ ped_id) +
    theme_graph() 

  print(p)
}

This is rather difficult to make any sense of. Let's instead plot only pedigree 1:

PED_ID <- 1

g_ped2 <- g %>% 
  activate(nodes) %>% 
  filter(ped_id == PED_ID)

if (requireNamespace("ggraph", quietly = TRUE)) {
  library(ggraph)
  p <- ggraph(g_ped2, layout = 'tree') +
      geom_edge_link() +
      geom_node_point(size = 8) +
      geom_node_text(aes(label = name), color = "white") +
      theme_graph() 
  print(p)
}

Run a mutation process

Up until now, only the genealogy has been simulated. Now, we run a mutational process, i.e. assign haplotypes to founders and let haplotypes flow down the individuals.

We use realistic data. In the package, there is information about the individual markers:

ystr_markers

Note, that MutProb is the point estimate given by MutProb = Mutations / Meioses. Information about which markers that are in which kit is also provided:

ystr_kits
ystr_kits %>% count(Kit)

Let us take all PowerPlex Y23 markers and assume that we only have a partial profile where DYS437 and DYS448 dropped out. At the same time, we also filter out the integer alleles (for generating random founder haplotypes in a minute):

partial_kit <- ystr_kits %>% 
  filter(Kit == "PowerPlex Y23") %>% 
  inner_join(ystr_markers, by = "Marker") %>% 
  filter(!(Marker %in% c("DYS437", "DYS448"))) %>% 
  rowwise() %>% # To work on each row
  mutate(IntegerAlleles = list(Alleles[Alleles == round(Alleles)]),
         MinIntAllele = min(IntegerAlleles),
         MaxIntAllele = max(IntegerAlleles)) %>% 
  ungroup() %>% 
  select(-Kit, -Alleles)
partial_kit

This "partial kit" has the following mutation probabilities:

mu <- partial_kit %>% pull(MutProb)
mu

We can make a founder haplotype generator as follows (sampling alleles randomly is not how Y-STR works, but it may work fine for founder haplotypes):

generate_random_haplotype <- function() {
  partial_kit %>% 
    rowwise() %>% 
    mutate(Allele = IntegerAlleles[sample.int(length(IntegerAlleles), 1)]) %>% 
    pull(Allele)
}

Now, a new haplotype is created everytime the function is called (with no arguments):

generate_random_haplotype()
generate_random_haplotype()

Of course such generator can also be created for a reference database with Y-STR profiles.

Now, we are ready to assign haplotypes to the genealogy:

set.seed(1)
pedigrees_all_populate_haplotypes_custom_founders(
  pedigrees = pedigrees, 
  get_founder_haplotype = generate_random_haplotype,
  mutation_rates = mu,
  progress = FALSE)

We can now plot pedigrees with haplotype information (note that as_tbl_graph needs to be called again):

g_ped2 <- as_tbl_graph(pedigrees) %>% 
  activate(nodes) %>% 
  filter(ped_id == PED_ID) %>%
  group_by(name) %>% 
  mutate(haplotype_str = paste0(haplotype[[1]], collapse = ";"))
  #mutate(haplotype_str = map(haplotype, paste0, collapse = ";")[[1]])

if (requireNamespace("ggraph", quietly = TRUE)) {
  library(ggraph)
  p <- ggraph(g_ped2, layout = 'tree') +
      geom_edge_link() +
      geom_node_point(aes(color = haplotype_str), size = 8) +
      geom_node_text(aes(label = name), color = "white") +
      theme_graph() 
  print(p)
}

Counting matches

We have live_pop from the population.

Drawing an individual and counting matches

set.seed(5)
Q_index <- sample.int(n = length(live_pop), size = 1)
Q <- live_pop[[Q_index]]
print_individual(Q)
Q_hap <- get_haplotype(Q)
Q_hap

Now, count matches in live part of pedigree and in live part of population:

Q_ped <- get_pedigree_from_individual(Q)
ped_live_matches <- count_haplotype_occurrences_pedigree(
  pedigree = Q_ped, 
  haplotype = Q_hap, 
  generation_upper_bound_in_result = 2) # gen 0, 1, 2
pop_live_matches <- count_haplotype_occurrences_individuals(
  individuals = live_pop, 
  haplotype = Q_hap)

ped_live_matches
pop_live_matches

We can also inspect pedigree matches information about number of meioses and $L_1$ distances:

path_details <- pedigree_haplotype_matches_in_pedigree_meiosis_L1_dists(
  suspect = Q, 
  generation_upper_bound_in_result = 2)

nrow(path_details)
head(path_details)

This can of course be repeated to many populations (genealogies, haplotype processes, suspects etc.). Also note that variability can be put on mutation rates, e.g. by a Bayesian approach.