knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(tidyverse) library(gdropR)
The gdropR package provides a very simple, lightweight wrapper that sets up input files to the software program Mendel to allow simulation of linked markers on a user-defined pedigree.
Unfortunately, Mendel does not seem to be distributed any longer. So, if you need it, ask Eric.
The purpose of this vignette is to show how to use the functions in gdropR to:
This example leans toward a current use case, so I won't discuss much, more generally.
We want 1000 SNPs chosen to have fairly intermediate allele frequencies. We will
simulate their freqs from a beta(10, 10) distribution. We need to put these in the
tibble format described in sprinkle_markers_into_genome()
.
set.seed(5) L <- 1000 m1 <- tibble( Chrom = "Unk", Locus = paste0("Unk-", 1:L), Pos = NA_real_, LocIdx = 1:L, a1 = rbeta(L, 10, 10), a2 = 1 - a1 ) %>% pivot_longer( cols = c(a1, a2), names_to = "Allele", values_to = "Freq" ) %>% group_by(Locus) %>% mutate(AlleIdx = 1:n()) %>% ungroup() %>% select(Chrom, Locus, Pos, Allele, LocIdx, AlleIdx, Freq) markers <- reindex_markers(m1)
One of the data objects in the package is chinook_chromosomes
. We
load that up and then get it into the required format for using
sprinkle_markers_into_genome()
. We need special columns and we
also need to report the length of each chromosomes as a fraction of the
longest chromosome.
genome <- chinook_chromosomes %>% mutate( idx = 1:n(), chrom = name1, num_bases = length, scaled_length = num_bases / max(num_bases) ) %>% select(idx, chrom, scaled_length, num_bases)
Once we have that we can sprinkle our markers randomly into the genome. And after we are done with that we rename the markers in the Locus column to correspond to their positions on the chromosomes.
mapped_markers <- sprinkle_markers_into_genome( markers, genome ) %>% group_by(Chrom) %>% mutate(Locus = paste0(Chrom, "-", rep(1:(n()/2), each = 2))) %>% ungroup()
As part of the package data I have a pedigree that has 5 generations of individuals born into a population. Each of those 5 generations, 60 males and 60 females were born. I simulated this pedigree to by largely acyclic.
To use it here, we need to add a column giving the sex of the individuals, and
we must explicitly add rows for the births of the founders with
the parents set to "0". Note that we know the sex of anyone who appears
in the Ma or Pa columns. But for the final generation, we don't know that.
We set the fem_prob
parameter to make all those final generation individuals
females. (We will be pretending that we can only sample females from
this population.)
pedigree <- FourGen_555_pedigree %>% add_explicit_founder_parents() %>% add_pedigree_sex_column(fem_prob = 1.0)
We do this to get a wide data frame back:
wide_genos <- simulate_linked_genotypes( mapped_markers, pedigree )
Look at the first 5 rows and 5 columns of wide_genos
:
wide_genos[1:5, 1:5]
I only have a single genotyping error function, and it only deals with
biallelic markers that have genotypes of 1/1
, 1/2
, 2/1
or 2/2
.
Heterozygotes are miscalled as homozygotes with rate het_miscall
, or,
one allele in the homozygote genotype gets changed to the other at a rate of
hom_miscall
. We never mistake one homozygote for the other.
We need to get the genotypes in long format for this function to work, so:
long_genos <- wide_genos %>% pivot_longer( cols = c(-indiv, -sex), names_to = "Locus", values_to = "Genotype" )
Now, that they are in long format we use the function
biallelic_geno_error()
, like this:
err_genos <- long_genos %>% mutate( obs_geno = biallelic_geno_error( genos = Genotype, het_miscall = 0.02, hom_miscall = 0.005 ) )
Let's count up all those genotypes to makes sure our genotype error is correct:
err_genos %>% count(Genotype, obs_geno) %>% group_by(Genotype) %>% mutate(fract = n / sum(n))
Yep, that looks correct.
From this point, I suspect it would be pretty easy to turn this into pedFac output format. The one thing we would like to add would be the generation level of each individual in the pedigree. We have that for most of the individuals, but not the additional founders we had to put in there.
Look at the pedigree:
head(FourGen_555_pedigree)
The "generation" column there effectively gives the generation at which
each of the Kids was born, with the original generation being called
0. Additionally, the ID of the individuals has the year the individual
was born (with the initial generation counted as 1). For example,
F_1_36_f
was born in year 1. (Also it is a female "F" and a founder "_f").
So, our generation levels for everyone can be obtained from our processed
pedigree
:
ped_with_gens <- pedigree %>% mutate( generation = case_when( !is.na(generation) ~ as.integer(generation), (Pa == "0" | Ma == "0") ~ as.integer(str_match(Kid, "[MF]_([0-9]+)_")[,2]) - 1L, TRUE ~ NA_integer_ ) )
So, those are the generation levels of the individuals. To get the genotypes in 0/1/2 wide format, it would be something like this:
genos012 <- err_genos %>% mutate(obs012 = case_when( obs_geno == "1/1" ~ 0L, obs_geno %in% c("1/2", "2/1") ~ 1L, obs_geno == "2/2" ~ 2L, TRUE ~ NA_integer_ )) %>% select(-Genotype, -obs_geno) %>% pivot_wider( names_from = Locus, values_from = obs012 ) # look at them genos012[1:8, 1:7]
Let's confirm that the IDs of the individuals did not get messed up. First, make sure that there is a one to one correspondence with the names in the original pedigree:
# get all the IDs found in the original pedigree orig_IDs <- FourGen_555_pedigree %>% select(Kid, Pa, Ma) %>% unlist() %>% unname() %>% unique() # get all the IDs occurring in the genotypes geno_IDs <- genos012$indiv # see if any are duplicates: length(geno_IDs) == length(unique(geno_IDs)) # no duplicates. setdiff(geno_IDs, orig_IDs)
OK. MENDEL has an 8 character limit on the IDs of individuals. The rest get truncated. The correct way to deal with that would be to rename the individuals and send them to Mendel, but I won't hassle with that for now.
Instead, for this case, it is sufficient to just add an f
on the end of any
IDs that end with an underscore.
genos012_fixed <- genos012 %>% mutate(indiv = case_when( str_detect(indiv, "_$") ~ str_c(indiv, "f"), TRUE ~ indiv )) # now check: setequal(genos012_fixed$indiv, orig_IDs)
From that, you can easily select the individuals that are to be observed.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.