In eriqande/gscramble: Simulating Admixed Genotypes Without Replacement

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
RunBottom <- FALSE

The first tutorial included a rather complex pedigree, mostly for testing purposes. Here we use a simpler pedigree, but apply it across multiple population pairs in a more realistic example of the sort someone might want to do with gscramble. This example comes out of Tim's work with understanding movement and introgression in feral pig populations in Missouri.

We start off by loading up some libraries.

library(gscramble)
library(tidyverse)

`gsp3`: This vignettes' genomic permutation pedigree

The genomic permutation pedigree we will be using for this exercise is in the R data object gsp3. The package file at:

system.file("extdata/gsp3.csv", package = "gscramble")

holds a CSV file that gives the GSP tibble when read in using readr::read_csv(). In picture form, it looks like this:

And it looks like this as an R-object:

gsp3

The individual meta data

We use the builtin package genotype data from pigs. The relevant meta data from the individuals in that data are in I_meta. The first part of those meta data look like:

head(I_meta)

The number of different individuals in each group/population are as follows:

I_meta %>%
  count(group)

Now, some of the samples from some of the groups are quite small, so we would not want to consume too many individuals from each of those. In order to get many simulated samples of hybrid individuals it will be necessary to do multiple simulations, but we can make each simulation count as much as possible by ensuring that one individual is consumed from each group.

Here we set up a RepPop that samples 2 individuals each from Pop2 and Pop4 to be from population A and 1 individual from each of Pop10 and Pop3 to be from popuation B. And we add to that 2 individuals from Pop2 and 4 to be from A and 1 from even group from B. This means we consume 3 individuals from each population.

rp1 <- bind_rows(
  tibble(
    rep = 1:6,
    pop = "A",
    group = str_c("Pop", rep(c(2, 4), 3))
  ),
  tibble(
    rep = 1:6,
    pop = "B",
    group = str_c("Pop", rep(c(10, 3), 3))
  ),
  tibble(
    rep = 7:12,
    pop = "B",
    group = str_c("Pop", rep(c(2, 4), 3))
  ),
  tibble(
    rep = 7:12,
    pop = "A",
    group = str_c("Pop", rep(c(10, 3), 3))
  )
) %>%
  arrange(rep, pop)

rp1

Such an arrangement will produce 6 simulations that include 3 hybrid individuals from each pair of populations in the same "rep". But it will consume no more than 2 individuals from each even population and no more than one individual from each odd population. And no hybrids will be formed by any individuals from Pop13.

To segregate all those we would do:

Segs <- segregate(tibble(gpp = list(gsp3), reppop = list(rp1)), RR = RecRates, MM = M_meta)

Then we might want to compute the admixture fraction for each individual, just to check it out and make sure that we got it right:

computeQs_from_segments(Segs)

That all checks out.

Tying it back to the genos

Since we know the simulation-specific founder haplo index for everything, all we need now is convert each of those into an absolute index in a matrix somewhere. To deal with that we need to formalize our system for putting individuals into a rows (or columns?) in a matrix.

This is still in development, and the following steps will not be seen, but this is how it is shaping out.

First, we make a single data structure with the geno matrix (reformatted into halflotypes) and the meta data.

GS <- rearrange_genos(G = Geno, Im = I_meta, Mm = M_meta)

Now, do everything else in one fell swoop:

BW <- big_wrapper(GS, Segs, M_meta)

This gives us matrices and tibbles in a list

names(BW)

# this is the genotype matrix with everyone in it
dim(BW$ret_geno)

Here are is the ID file that goes with the indivs in the geno-matrix:

BW$ret_ids

The naming convention is:

h-gpp-rep-ped_sample_id-samp_index-matrix_row

Note that the scrambled (but not hybridized) individuals that were not consumed for making hybrids look like this:

BW$ret_ids %>%
  slice(100:110)

Finally, the true Q values are there with the indiv ID for the hybrids as they key.

BW$hyb_Qs

Let's make sure we have the same number of alleles

We have to change the missing data into something else, 'cuz we end up with more missing data when we scramble. So, we change missing data to "M" and then run it:

GenoM <- Geno
GenoM[GenoM== 0] <- "M"
GSM <- rearrange_genos(G = GenoM, Im = I_meta, Mm = M_meta)

Check/Test on rearrange_genos()

Before we proceed, I will do a quick test here to make sure that the rearranged genos still have the correct number of alleles in each individual.

# first check the total number of alleles
table(as.vector(GenoM))
table(as.vector(GSM$G[[1]]))

So, that is all good. Now, check the number of alleles by population

# first, the original ones
pop_counts_orig <- I_meta %>%
  mutate(idx = 1:n()) %>%
  group_by(group) %>%
  summarize(idxs = list(c(idx))) %>%
  mutate(
    alle_counts = map(.x = idxs, .f = function(x) table(as.vector(GenoM[x, ])))
  )

# then the rearranged ones
pop_counts_rearr <- GSM$I[[1]] %>%
  group_by(group) %>%
  summarise(idxs = list(c(abs_column))) %>%
  mutate(
    alle_counts = map(.x = idxs, .f = function(x) table(as.vector(GSM$G[[1]][, x])))
  )

# now join those and compare
check_it <- left_join(
  pop_counts_orig,
  pop_counts_rearr,
  by = "group"
) %>%
  mutate(all_good = map2_lgl(.x = alle_counts.x, .y = alle_counts.y, .f = function(x, y) all.equal(x, y)))

check_it

So, that totally checks out.

BWM <- big_wrapper(GSM, Segs, M_meta)

# now count up the number of different alleles
GMc <- table(as.vector(GenoM))
GMc

GSMc <- table(as.vector(GSM$G[[1]]))
GSMc

BWMc <- table(as.vector(BWM$ret_geno))
BWMc

GSMc - GMc

BWMc - GMc

sum(BWMc - GMc)

Nope! That is not checking out. So there must be bug somewhere in the part that sprinkles markers onto the segments. Bummer.... HOLD ON! THIS CHECKS OUT IF WE ARE NOT SCRAMBLING GENOTYPES, SO I SUSPECT THATH THE PROBLEM IS IN perm_gs_by_pops()

Gonna check that out now:

GS_input <- GSM
GS <- perm_gs_by_pops(GS_input)

table(as.vector(GSM$G[[1]]))
table(as.vector(GS$G[[1]]))
table(as.vector(GS$G_permed[[1]]))

NO! That check out. But somehow it is interacting with passing segments around. Fuuuuuuuuuuuuh!

Let's see if we can track down which loci these discrepancies are occurring in:

# first, make a big tibble that has the variant_id and allele on it, and then count them up
BWMa <- tibble(
  variant_id = rep(M_meta$variant_id, each = 329 * 2),
  allele = as.vector(BWM$ret_geno)
) %>%
  count(variant_id, allele)

GMa <- tibble(
  variant_id = rep(M_meta$variant_id, each = 329 * 2),
  allele = as.vector(GenoM)
) %>%
  count(variant_id, allele) %>%
  rename(nGMa = n)

comp <- left_join(GMa, BWMa) %>%
  group_by(variant_id) %>% 
  mutate(diff = nGMa - n, diff_sum = sum(diff))

comp

So, the number of gene copies is the same at each marker, but the allele counts are off, but balance within every variant.

That makes me think that we are, perhaps, grabbing the same haflotype from each founder. Because then the heterozygotes will be wonked like this. i.e., since we are segregating each half of a founder into the scrambled individuals, if it is a heterozygote and we only take one of them, each time, then we will have too many of one allele, and exactly two few of the other one.

Look at the distribution of the diffs there:

ggplot(comp, aes(x = diff)) +
  geom_histogram()

max(comp$diff, na.rm = TRUE)

So, that is weird. But let's confirm that it is only an issue amongst the ones that were segregated through the pedigree, and not just from the permed ones.

# first get the IDs of all the permed ones
simply_permed_ids <- BWM$ret_ids %>%
  filter(str_detect(indiv, "permed")) %>%
  mutate(ID = str_replace(indiv, "^permed_", "")) %>%
  pull(ID)

spO <- table(as.vector(GenoM[which(I_meta$indiv %in% simply_permed_ids), ]))
spGM <- table(as.vector(GSM$G[[1]][, GSM$I[[1]]$indiv %in% simply_permed_ids ]))
spBWM <- table(as.vector(BWM$ret_geno[which(str_detect(BWM$ret_ids$indiv, "^permed")), ]))

spO
spGM
spBWM

That is interesting. It looks like there is a discrepancy in the BWM individuals that have not been segregated through the pedigree, as well. So, I need to track that down! Wait, no! We don't expect these things to be the same, necessarily, because we have scrambled everything before doing it. (NOTE! If we don't scramble things, they all come out just like they should!) So, the only thing that we can expect is that the total number of alleles will be correct. So, it must really be coming down to how we are assigning alleles to segregated segments... Hmm....

Subsumed into the big_wrapper() function

Everything below here is left for an explanation of what is going on, but all of it is wrapped up in a single function called big_wrapper(). None of the code below here gets evaluated when knitting, at the moment.

# then update that with a scrambled version in $GS_perm[[1]]
# NOTE": This assumes missing data is coded as a character "0"
GS <- perm_gs_by_pops(GS)

Then, to each row in the Segs data frame we need to join onto it the absolute column of the GS$G matrix that corresponds to the founder from which that haplotype was taken.

Segs2 <- GS$I[[1]] %>%
  select(group, gs_column, abs_column) %>%
  left_join(Segs, ., by = c("group_origin" = "group", "sim_level_founder_haplo" = "gs_column"))

Now we must give each marker an absolute index, but then break those, along with the positions into a list of tibbles/vectors that can be easily and quickly accessed. Once we have that we can do a summarize over each GPP sample that returns a list column that holds a 2-column matrix subscriptor

Here is the first step:

m_list <- M_meta %>% mutate(chrom_f = factor(chrom, levels = unique(chrom)), idx = 1:n()) %>% select(-variant_id) %>% split(., f = .$chrom_f)

Now, we just need a function that we will call as a summarise function on a grouped tibble. With things being grouped by GPP sample (and ordered by chrom_f and segment start). It will return a matrix whose first column is the position of the marker and the second column is the absolute column of the founder haplotype that the markers are being copied from. The function will look something like this:

make_subscript_matrix <- function(n, chrom, start, end, abs_column, m_list, num_markers) {
  ret <- lapply(1:n, function(i) {
    the_chrom <- chrom[i]
    idxs <- m_list[[the_chrom]]$idx[ m_list[[the_chrom]]$pos > start[i] & m_list[[the_chrom]]$pos <= end[i] ]
    cbind(idxs, abs_column[i])
  }) 

  # here is a quick hack to deal with empty segments (i.e. those that have no markers in them)
  ret <- ret[sapply(ret, ncol) == 2]

  ret <- do.call(rbind, args = ret)

  # now do a quick check to make sure that every position is in there in the 
  # correct order
  stopifnot(all(ret[,1] == 1:num_markers))

  ret
}

So, to operate on Seg2 with this we do:

pick_tib <- Segs2 %>%
  group_by(gpp, rep, ped_sample_id, samp_index, gamete_index) %>%
  summarise(m_subscript_matrix = list(make_subscript_matrix(n = n(),
                                                            chrom = chrom,
                                                            start = start,
                                                            end =  end,
                                                            abs_column = abs_column,
                                                            m_list = m_list, 
                                                            num_markers = nrow(GS$G[[1]]))))

For testing:

Segs2 %>%
  summarise(m_subscript_matrix = list(make_subscript_matrix(n = n(),
                                                            chrom = chrom,
                                                            start = start,
                                                            end =  end,
                                                            abs_column = abs_column,
                                                            m_list = m_list, 
                                                            num_markers = nrow(GS$G[[1]]))))

Now, we can do some more error checking here to make sure that we haven't selected the same SNP from the sample founder haflotype twice:

full_mat <- do.call(rbind, pick_tib$m_subscript_matrix)

strs <- paste(full_mat[,1], full_mat[,2], sep = "--")
tab <- table(strs)
tab[tab>1]  # now positions are used twice

# then also check that each position was used 72 times
tab2 <- table(full_mat[, 1])
table(tab2)
# yep, that is what it should be

# now, check that each founder halflotype subscript is seen the same
# number of times
tab3 <- table(full_mat[, 2])
tab3
table(tab3)

# Cool! Everything checks out

And then for the last hurrah we grab all those alleles out of the GS$G matrix. I think we can do it quickly like this:

ped_sampled_indivs <- GS$G[[1]][full_mat] %>%
  matrix(nrow = dim(GS$G[[1]]))

That is freakishly fast too. Good.

Now, we just need to deal with missing data.

eriqande/gscramble documentation built on March 5, 2024, 4:22 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

eriqande/gscramble
Simulating Admixed Genotypes Without Replacement

In eriqande/gscramble: Simulating Admixed Genotypes Without Replacement

`gsp3`: This vignettes' genomic permutation pedigree

The individual meta data

Tying it back to the genos

Let's make sure we have the same number of alleles

Check/Test on rearrange_genos()

Subsumed into the big_wrapper() function

R Package Documentation

Browse R Packages

We want your feedback!

eriqande/gscramble Simulating Admixed Genotypes Without Replacement

In eriqande/gscramble: Simulating Admixed Genotypes Without Replacement

gsp3: This vignettes' genomic permutation pedigree

The individual meta data

Tying it back to the genos

Let's make sure we have the same number of alleles

Check/Test on rearrange_genos()

Subsumed into the big_wrapper() function

R Package Documentation

Browse R Packages

We want your feedback!

eriqande/gscramble
Simulating Admixed Genotypes Without Replacement

`gsp3`: This vignettes' genomic permutation pedigree