knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) RunBottom <- FALSE
The first tutorial included a rather complex pedigree, mostly for testing
purposes. Here we use a simpler pedigree, but apply it across multiple population
pairs in a more realistic example of the sort someone might want to do with gscramble
.
This example comes out of Tim's work with understanding movement and introgression in
feral pig populations in Missouri.
We start off by loading up some libraries.
library(gscramble) library(tidyverse)
gsp3
: This vignettes' genomic permutation pedigreeThe genomic permutation pedigree we will be using for this exercise is in the
R data object gsp3
. The package file at:
system.file("extdata/gsp3.csv", package = "gscramble")
holds a CSV file that gives the GSP
tibble when read in using readr::read_csv()
.
In picture form, it looks like this:
And it looks like this as an R-object:
gsp3
We use the builtin package genotype data from pigs. The relevant meta data from
the individuals in that data are in I_meta
. The first part of those meta data
look like:
head(I_meta)
The number of different individuals in each group/population are as follows:
I_meta %>% count(group)
Now, some of the samples from some of the groups are quite small, so we would not want to consume too many individuals from each of those. In order to get many simulated samples of hybrid individuals it will be necessary to do multiple simulations, but we can make each simulation count as much as possible by ensuring that one individual is consumed from each group.
Here we set up a RepPop that samples 2 individuals each from Pop2 and Pop4 to be from population A and 1 individual from each of Pop10 and Pop3 to be from popuation B. And we add to that 2 individuals from Pop2 and 4 to be from A and 1 from even group from B. This means we consume 3 individuals from each population.
rp1 <- bind_rows( tibble( rep = 1:6, pop = "A", group = str_c("Pop", rep(c(2, 4), 3)) ), tibble( rep = 1:6, pop = "B", group = str_c("Pop", rep(c(10, 3), 3)) ), tibble( rep = 7:12, pop = "B", group = str_c("Pop", rep(c(2, 4), 3)) ), tibble( rep = 7:12, pop = "A", group = str_c("Pop", rep(c(10, 3), 3)) ) ) %>% arrange(rep, pop) rp1
Such an arrangement will produce 6 simulations that include 3 hybrid individuals from each pair of populations in the same "rep". But it will consume no more than 2 individuals from each even population and no more than one individual from each odd population. And no hybrids will be formed by any individuals from Pop13.
To segregate all those we would do:
Segs <- segregate(tibble(gpp = list(gsp3), reppop = list(rp1)), RR = RecRates, MM = M_meta)
Then we might want to compute the admixture fraction for each individual, just to check it out and make sure that we got it right:
computeQs_from_segments(Segs)
That all checks out.
Since we know the simulation-specific founder haplo index for everything, all we need now is convert each of those into an absolute index in a matrix somewhere. To deal with that we need to formalize our system for putting individuals into a rows (or columns?) in a matrix.
This is still in development, and the following steps will not be seen, but this is how it is shaping out.
First, we make a single data structure with the geno matrix (reformatted into halflotypes) and the meta data.
GS <- rearrange_genos(G = Geno, Im = I_meta, Mm = M_meta)
Now, do everything else in one fell swoop:
BW <- big_wrapper(GS, Segs, M_meta)
This gives us matrices and tibbles in a list
names(BW) # this is the genotype matrix with everyone in it dim(BW$ret_geno)
Here are is the ID file that goes with the indivs in the geno-matrix:
BW$ret_ids
The naming convention is:
h-gpp-rep-ped_sample_id-samp_index-matrix_row
Note that the scrambled (but not hybridized) individuals that were not consumed for making hybrids look like this:
BW$ret_ids %>% slice(100:110)
Finally, the true Q values are there with the indiv ID for the hybrids as they key.
BW$hyb_Qs
We have to change the missing data into something else, 'cuz we end up with more missing data when we scramble. So, we change missing data to "M" and then run it:
GenoM <- Geno GenoM[GenoM== 0] <- "M" GSM <- rearrange_genos(G = GenoM, Im = I_meta, Mm = M_meta)
Before we proceed, I will do a quick test here to make sure that the rearranged genos still have the correct number of alleles in each individual.
# first check the total number of alleles table(as.vector(GenoM)) table(as.vector(GSM$G[[1]]))
So, that is all good. Now, check the number of alleles by population
# first, the original ones pop_counts_orig <- I_meta %>% mutate(idx = 1:n()) %>% group_by(group) %>% summarize(idxs = list(c(idx))) %>% mutate( alle_counts = map(.x = idxs, .f = function(x) table(as.vector(GenoM[x, ]))) ) # then the rearranged ones pop_counts_rearr <- GSM$I[[1]] %>% group_by(group) %>% summarise(idxs = list(c(abs_column))) %>% mutate( alle_counts = map(.x = idxs, .f = function(x) table(as.vector(GSM$G[[1]][, x]))) ) # now join those and compare check_it <- left_join( pop_counts_orig, pop_counts_rearr, by = "group" ) %>% mutate(all_good = map2_lgl(.x = alle_counts.x, .y = alle_counts.y, .f = function(x, y) all.equal(x, y))) check_it
So, that totally checks out.
BWM <- big_wrapper(GSM, Segs, M_meta) # now count up the number of different alleles GMc <- table(as.vector(GenoM)) GMc GSMc <- table(as.vector(GSM$G[[1]])) GSMc BWMc <- table(as.vector(BWM$ret_geno)) BWMc GSMc - GMc BWMc - GMc sum(BWMc - GMc)
Nope! That is not checking out. So there must be bug somewhere in the part that sprinkles
markers onto the segments. Bummer.... HOLD ON! THIS CHECKS OUT IF WE ARE NOT SCRAMBLING GENOTYPES,
SO I SUSPECT THATH THE PROBLEM IS IN perm_gs_by_pops()
Gonna check that out now:
GS_input <- GSM GS <- perm_gs_by_pops(GS_input) table(as.vector(GSM$G[[1]])) table(as.vector(GS$G[[1]])) table(as.vector(GS$G_permed[[1]]))
NO! That check out. But somehow it is interacting with passing segments around. Fuuuuuuuuuuuuh!
Let's see if we can track down which loci these discrepancies are occurring in:
# first, make a big tibble that has the variant_id and allele on it, and then count them up BWMa <- tibble( variant_id = rep(M_meta$variant_id, each = 329 * 2), allele = as.vector(BWM$ret_geno) ) %>% count(variant_id, allele) GMa <- tibble( variant_id = rep(M_meta$variant_id, each = 329 * 2), allele = as.vector(GenoM) ) %>% count(variant_id, allele) %>% rename(nGMa = n) comp <- left_join(GMa, BWMa) %>% group_by(variant_id) %>% mutate(diff = nGMa - n, diff_sum = sum(diff)) comp
So, the number of gene copies is the same at each marker, but the allele counts are off, but balance within every variant.
That makes me think that we are, perhaps, grabbing the same haflotype from each founder. Because then the heterozygotes will be wonked like this. i.e., since we are segregating each half of a founder into the scrambled individuals, if it is a heterozygote and we only take one of them, each time, then we will have too many of one allele, and exactly two few of the other one.
Look at the distribution of the diffs there:
ggplot(comp, aes(x = diff)) + geom_histogram() max(comp$diff, na.rm = TRUE)
So, that is weird. But let's confirm that it is only an issue amongst the ones that were segregated through the pedigree, and not just from the permed ones.
# first get the IDs of all the permed ones simply_permed_ids <- BWM$ret_ids %>% filter(str_detect(indiv, "permed")) %>% mutate(ID = str_replace(indiv, "^permed_", "")) %>% pull(ID) spO <- table(as.vector(GenoM[which(I_meta$indiv %in% simply_permed_ids), ])) spGM <- table(as.vector(GSM$G[[1]][, GSM$I[[1]]$indiv %in% simply_permed_ids ])) spBWM <- table(as.vector(BWM$ret_geno[which(str_detect(BWM$ret_ids$indiv, "^permed")), ])) spO spGM spBWM
That is interesting. It looks like there is a discrepancy in the BWM individuals that have not been segregated through the pedigree, as well. So, I need to track that down! Wait, no! We don't expect these things to be the same, necessarily, because we have scrambled everything before doing it. (NOTE! If we don't scramble things, they all come out just like they should!) So, the only thing that we can expect is that the total number of alleles will be correct. So, it must really be coming down to how we are assigning alleles to segregated segments... Hmm....
Everything below here is left for an explanation of what is going on, but all of it is wrapped up in a single function called big_wrapper(). None of the code below here gets evaluated when knitting, at the moment.
# then update that with a scrambled version in $GS_perm[[1]] # NOTE": This assumes missing data is coded as a character "0" GS <- perm_gs_by_pops(GS)
Then, to each row in the Segs data frame we need to join onto it the absolute column of the GS$G matrix that corresponds to the founder from which that haplotype was taken.
Segs2 <- GS$I[[1]] %>% select(group, gs_column, abs_column) %>% left_join(Segs, ., by = c("group_origin" = "group", "sim_level_founder_haplo" = "gs_column"))
Now we must give each marker an absolute index, but then break those, along with the positions into a list of tibbles/vectors that can be easily and quickly accessed. Once we have that we can do a summarize over each GPP sample that returns a list column that holds a 2-column matrix subscriptor
Here is the first step:
m_list <- M_meta %>% mutate(chrom_f = factor(chrom, levels = unique(chrom)), idx = 1:n()) %>% select(-variant_id) %>% split(., f = .$chrom_f)
Now, we just need a function that we will call as a summarise function on a grouped tibble. With things being grouped by GPP sample (and ordered by chrom_f and segment start). It will return a matrix whose first column is the position of the marker and the second column is the absolute column of the founder haplotype that the markers are being copied from. The function will look something like this:
make_subscript_matrix <- function(n, chrom, start, end, abs_column, m_list, num_markers) { ret <- lapply(1:n, function(i) { the_chrom <- chrom[i] idxs <- m_list[[the_chrom]]$idx[ m_list[[the_chrom]]$pos > start[i] & m_list[[the_chrom]]$pos <= end[i] ] cbind(idxs, abs_column[i]) }) # here is a quick hack to deal with empty segments (i.e. those that have no markers in them) ret <- ret[sapply(ret, ncol) == 2] ret <- do.call(rbind, args = ret) # now do a quick check to make sure that every position is in there in the # correct order stopifnot(all(ret[,1] == 1:num_markers)) ret }
So, to operate on Seg2 with this we do:
pick_tib <- Segs2 %>% group_by(gpp, rep, ped_sample_id, samp_index, gamete_index) %>% summarise(m_subscript_matrix = list(make_subscript_matrix(n = n(), chrom = chrom, start = start, end = end, abs_column = abs_column, m_list = m_list, num_markers = nrow(GS$G[[1]]))))
For testing:
Segs2 %>% summarise(m_subscript_matrix = list(make_subscript_matrix(n = n(), chrom = chrom, start = start, end = end, abs_column = abs_column, m_list = m_list, num_markers = nrow(GS$G[[1]]))))
Now, we can do some more error checking here to make sure that we haven't selected the same SNP from the sample founder haflotype twice:
full_mat <- do.call(rbind, pick_tib$m_subscript_matrix) strs <- paste(full_mat[,1], full_mat[,2], sep = "--") tab <- table(strs) tab[tab>1] # now positions are used twice # then also check that each position was used 72 times tab2 <- table(full_mat[, 1]) table(tab2) # yep, that is what it should be # now, check that each founder halflotype subscript is seen the same # number of times tab3 <- table(full_mat[, 2]) tab3 table(tab3) # Cool! Everything checks out
And then for the last hurrah we grab all those alleles out of the GS$G matrix. I think we can do it quickly like this:
ped_sampled_indivs <- GS$G[[1]][full_mat] %>% matrix(nrow = dim(GS$G[[1]]))
That is freakishly fast too. Good.
Now, we just need to deal with missing data.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.