R/segregate.R

Defines functions segregate

Documented in segregate

#' Segregate segments down genomic simulation pedigrees
#'
#' Given a collection of genomic simulation pedigrees and requests
#' for how many simulations should be done (in the `request` input),
#' as well as recombination rates, this simulates the segregation
#' of segments down through the pedigrees
#' @param request a tibble with list columns "gpp" and "reppop".
#' Each element of the gpp column is a tibble giving a genomic simulation
#' pedigree as documented as the input for `prep_gsp_for_hap_dropping()`.
#' Each element of the "reppop" column is a tibble with columns
#' `index`, `pop`, `group`, to indicate which of the founding
#' populations ("A", "B", etc.) correspond to the different groups
#' (from the `group` column in, for example, the meta data for individuals
#' in your genotype data set, like the data object `I_meta`).
#' Because it is quite likely that you might wish to iterate
#' the segregation procedure multiple
#' times in a single simulation, you can specify that by doing multiple
#' "reps" (replicates) of the procedure.  **BIG NOTE**: The values in the index column that you
#' choose must start at 1 and should be dense within.  In other words, if
#' the max value in the index column is N, then every integer from 1 to N
#' must be in there.
#' @param RR the recombination rates in the format of the package data
#' @param MM the marker meta data tibble (like M_meta).  If this is NULL (the default) that
#' is fine.  If not, then it uses the order of the markers in MM to define
#' the levels of a chrom_f column so that we can sort the rows of the output
#' correctly, with respect to markers in the Genotype data frame.  This will
#' let us more efficiently subscript the markers out of the matrix. If MM is
#' not present, then the function will create `chrom_f` by using the
#' order of the chromosomes from RR. If MM is not NULL, then the function
#' will also check to make sure that the markers are within the extent of the
#' recombination rate bins, giving an error otherwise.
#' @return The output from this function is a tibble.  Each row represents one segment of genetic
#' material amongst the sampled individuals from the genomic permutation pedigrees. The columns give
#' information about the provenance and destination of that segment as follows.
#' Each segment exists in one of the samples (`samp_index`) from a sampled individual
#' with a `ped_sample_id` on a given `gpp` (the index giving the row of the request input tibble)
#' in a given `index` within the individual.  Further, it is on one of two gametes
#' (`gamete_index`) that segregated into the individual, and it came from a certain founding
#' population (`pop_origin`) that corresponds to the named groups in the genotype file (`group_origin`).
#' And, of course, the segment occupies the space from `start` to `end` on a chromosome `chrom`.
#' Finally, the index of the founder haplotype on the given gpp that this segement descended from is
#' given in `rs_founder_haplotype` which is short for "rep-specific founder haplotype". This final
#' piece of information is crucial for segregating variation from the individuals in the `Geno` file
#' onto these segments. Finally, the column `sim_level_founder_haplo` assigns a unique index for each founder
#' haplotype. This is necessary because any simulation can involve multiple gpps and/or indexes of gpps,
#' and the founders in each of those must all be unique within a simulation. so that those haplotypes
#' can all, eventually, be accessed easily out of the genotype matrix.
#' @export
#' @examples
#' # We construct an example here where we will request segregation
#' # down a GSP with two F1s and F1B backcrosses between two hypothetical
#' # populations, A and B.
#' gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE)
#'
#' # We will imagine that in our marker data there are three groups
#' # labelled "grp1", "grp2", and "grp3", and we want to create the F1Bs with backcrossing
#' # only to grp3.
#' reppop <- tibble::tibble(
#'     index = as.integer(c(1, 1, 2, 2)),
#'     pop = c("A", "B", "A", "B"),
#'     group = c("grp3", "grp1", "grp3", "grp2")
#' )
#'
#' # combine those into a request
#' request <- tibble::tibble(
#'    gpp = list(gsp_f1f1b),
#'    reppop = list(reppop)
#' )
#'
#'
#' result1 <- segregate(request, RecRates)
#'
#' # here we pass it some markers, too
#' result2 <- segregate(request, RecRates, M_meta)
#'
#' result1
#'
#' result2
segregate <- function(request, RR, MM = NULL) {

  # ERROR CHECKING
  # first make sure that RR is formatted properly
  check_rec_rates_formatting(RR)

  # then, if MM is not NULL check to make sure that no marker
  # positions exceed the chrom lengths
  if(!is.null(MM)) {
    check_chrom_lengths(MM, RR)
  }


  # just lapply over the rows of request and bind_rows() the outputs together
  idx <- 1:nrow(request)
  names(idx) <- idx

  ret <- lapply(idx, function(i) {

    # check that the reppop is correctly formatted
    check_reppop(request$reppop[[i]], i)


    # number of reps to do is the maximum number listed in the request
    Reps <- max(request$reppop[[i]]$index)
    tmp <- drop_segs_down_gsp(GSP = request$gpp[[i]],
                       RR = RR,
                       Reps = Reps)

    # now, we join the group specification on there
    left_join(tmp, request$reppop[[i]], by = c("index" = "index", "pop_origin" = "pop")) %>%
      rename(group_origin = group)
  }) %>%
    bind_rows(.id = "gpp") %>%
    mutate(gpp = as.integer(gpp))

  # ERROR CHECKING on the result
    # any missing groups? etc.

  ret2 <- ret %>%
    ungroup() %>%
    sim_level_founder_haplos()  # on this line we also compute the simulation-level founder haplos!

  if(!is.null(MM)) {
    ret_tmp <- ret2 %>%
      mutate(chrom_f = factor(chrom, levels = unique(MM$chrom)))
  } else {
    ret_tmp <- ret2 %>%
      mutate(chrom_f = factor(chrom, levels = unique(RR$chrom)))
  }

  # arrange and return
  ret_tmp %>%
      select(chrom_f, everything()) %>%
      arrange(gpp, index, ped_sample_id, samp_index, gamete_index, chrom_f, start)
}
eriqande/gscramble documentation built on March 5, 2024, 4:22 p.m.