R/AdmixLocal.R
In AdmixPoly: Global and Local Admixture Inference in Polyploids

Documented in AdmixLocal

#' Forward-Backward algorithm for local admixture inference
#'
#' @param Geno
#' List of genotying matrices of size `M` (number of markers),
#' with each matrix of dimension `N`(number of individuals) x `L`(number of alleles)
#' whose elements consists of discrete or continuous dosages
#' @param ResAdmixGlobal
#' A list generated by the \code{\link{AdmixGlobal}} function
#' @param Ind
#' Name of the individual to be processed
#' @param P
#' Ploidy level of the individual (positive integer superior or equal to 2)
#' @param GeneticMap
#' Dataframe with three columns: `Chromosome`, `Marker`, and `Distance`,
#' with `M` rows. Distances must be in centiMorgan
#' @param SmoothParam
#' Smoothing parameter corresponding to the number of generations since admixture
#' @param MaxRec
#' Maximum number of recombination between adjacent markers,
#' used as an approximation for transition probabilities, 
#' in order to speed up calculations and limit memory usage.
#' The approximation is disabled by default. 
#' Specify a positive integer between 1 and `P` to activate it, 
#' which overrides `DistIntBounds`.
#' @param DistIntBounds
#' Vector of bounds for genetic distance intervals,
#' used as an approximation for transition probabilities, 
#' in order to speed up calculations and limit memory usage.
#' To deactivate the approximation, set `DistIntBounds=NULL`.
#' @param MinProp
#' Minimum value for admixture proportion below which the contribution of the
#' group is not considered for the individual
#' (positive numeric value inferior to 1/K where K is the number of groups)
#' @param EmisProbMaxPermut
#' Maximum number of permutations involved in computing exact emission probability,
#' otherwise an approximate emission probability is calculated,
#' used as an approximation to speed up calculations. 
#' When continuous dosages are informed, the approximate emission probability 
#' is calculated.
#' @param NbThreads
#' Number of threads to be used (positive integer) with a default value of 0
#' setting automatically all threads available
#' @param Verbose
#' A boolean describing if detailed information should be printed
#'
#' @return
#' A list of four items: a list of size `C` (number of chromosomes) with ancestry
#' dosage matrices obtained from the posterior distribution (`Posterior`),
#' a list of size `C` with ancestry dosages obtained from the Viterbi 
#' algorithm (`Viterbi`), the log-likelihood value (`LogLik`),
#' and a list of size `C` of vectors indicating what
#' emission probability was computed  at each marker (`EmisProbType`)
#'
#' @description
#' This function performs local admixture inference for a selected individual
#' 
#' @details
#' The function `AdmixLocal()` performs local admixture inference from:
#' (i) genotyping data (`Geno`) formatted as a list of matrices 
#' (one for each marker), 
#' (ii) a list (`ResAdmixGlobal`) generated by the global admixture function 
#' [AdmixGlobal()],
#' (iii) a genetic map dataframe (`GeneticMap`),
#' (iV) the name of the individual to analyse (`Ind`),
#' (v) the ploidy level of the individual (`P`).
#' 
#' The inference of the local admixture hidden Markov model
#' is performed by calculating posterior probabilities and the
#' Viterbi path, both available from the results.
#' By default, all available threads/CPU cores are used but the number
#' can be chosen using `NbThreads`.
#' 
#' The rate at which breakpoints between blocks of homogeneous 
#' ancestry occur per unit of genetic distance is defined as a smoothing
#' parameter controlling the size of ancestry blocks (`SmoothParam`). The larger
#' the value, the smaller the expected ancestry block size.
#' 
#' Two approximations exist for transition probabilities to speedup calculations:
#' (i) the first consists of a simplification of genetic distances between 
#' adjacent marker where genetic distances (in cM) comprised within each of 
#' these intervals are averaged (`DistIntBounds`),
#' (ii) the second consists of limiting the number of recombination authorized 
#' to move from one ancestry state to another between adjacent markers (`MaxRec`).
#' When both arguments are specified as `NULL`, exact transition probabilities
#' are calculated.
#' 
#' When allele dosages are discrete, the exact emission probability distribution 
#' function is calculated when the number of phased genotypes (i.e. permutations)
#' that can be derived from the unphased allele dosage does not exceed a 
#' threshold defined by `EmisProbMaxPermut`. 
#' When number of permutations exceeds the threshold or when allele dosages are 
#' continuous (e.g. obtained from read depth ratios), an approximation of the
#' emission probability distribution is calculated, which helps 
#' speeding up calculations.
#' 
#' When an individual has a very low global (genome-wide) admixture proportion 
#' for an ancestral group, this contribution can be considered negligible and
#' discarded from the local admixture inference to speedup calculations. 
#' The minimum threshold can be set using `MinProp`.
#' 
#' @importFrom magrittr %>%
#' @importFrom magrittr set_names
#' @importFrom purrr map_dfr
#' @importFrom utils tail
#' @importFrom Matrix sparseMatrix
#' 
#' @seealso
#' * [SimulatePop()] to simulate a polyploid admixed population.
#' * [AdmixGlobal()] to perform global (genome-wide) admixture inference and 
#' generate the `ResAdmixGlobal` object for the `AdmixLocal()` function.
#' * [LocalPlot()] to generate a local admixture plot using the results from
#' the `AdmixLocal()` function.
#'
#' @export
#'
#' @examples
#' ## Simulate Simulate a polyploid admixed population
#' DataSim <- SimulatePop(K=3L, N=10L, P=6L, M=50L, C=5L, L=10L, Seed=123, NbThreads=1)
#'
#' ## Perform global admixture inference
#' ResAdmixGlobal <- AdmixGlobal(Geno=DataSim$Geno, K=3, Verbose=FALSE, NbThreads=1)
#'
#' ## Perform local admixture inference for one individual
#' ResAdmixLocal <- AdmixLocal(Geno=DataSim$Geno, ResAdmixGlobal=ResAdmixGlobal, 
#'                             Ind="Ind4", P=6L, DataSim$GeneticMap, 
#'                             Verbose=FALSE, NbThreads=1)
#'
#' ## Posterior ancestry dosages for Chr1
#' head(ResAdmixLocal$Posterior$Chr1)
#'
#' ## Viterbi ancestry dosages for Chr1
#' head(ResAdmixLocal$Viterbi$Chr1)
#'
#' ## Log-likelihood
#' ResAdmixLocal$LogLik
#'
#' ## Type of emission probability for Chr1
#' head(ResAdmixLocal$EmisProbType$Chr1)
AdmixLocal <- function(Geno, ResAdmixGlobal, Ind, P, GeneticMap,SmoothParam=1,
                       MaxRec=NULL, DistIntBounds=c(0.001,0.005,0.01,0.05,0.1,0.5,1),
                       MinProp=1e-6, EmisProbMaxPermut=100L, NbThreads=0L,
                       Verbose=TRUE){
  ## Initial checks
  stopifnot("Geno must be a list of numeric matrices" =
              is.list(Geno) && is.matrix(Geno[[1]]) && is.numeric(Geno[[1]]))
  stopifnot("Geno must be a named list (marker names)" =
              !is.null(names(Geno)))
  stopifnot("Geno matrices must have column names (allele names at markers)" =
              !is.null(colnames(Geno[[1]])))
  stopifnot("Geno matrices must have row names (individual names)" =
              !is.null(rownames(Geno[[1]])))
  stopifnot("ResAdmixGlobal must be generated by the AdmixGlobal function" =
              inherits(ResAdmixGlobal, "AdmixGlobal"))
  stopifnot("ResAdmixGlobal must have been generated from Geno" =
              all(names(ResAdmixGlobal$Freq)==names(Geno)) &&
              all(rownames(ResAdmixGlobal$Prop)==rownames(Geno[[1]])))
  stopifnot("Ind must be the name of a single individual included in the dataset" =
              Ind%in%rownames(ResAdmixGlobal$Prop) && length(Ind)==1)
  stopifnot("P must be a positive integer" =
              P>=1L)
  stopifnot("GeneticMap must be a dataframe with three named columns
             (Marker, Chromosome and Distance)" =
              is.data.frame(GeneticMap) &&
              all(c("Marker","Chromosome","Distance")%in%colnames(GeneticMap)))
  stopifnot("Geno list names and Marker column of GeneticMap must have identical elements" =
              all(names(Geno)%in%GeneticMap$Marker))
  stopifnot("SmoothParam must be a positive numeric value" =
              SmoothParam>0)
  stopifnot("MaxRec must range between 1 and P, or be deactivated by specifying NULL" =
              is.null(MaxRec) || MaxRec%in%c(1:P))
  stopifnot("DistIntBounds must be a vector of positive numeric values,
            or be deactivated by specifying NULL" =
              is.null(DistIntBounds) || (is.numeric(DistIntBounds) && is.vector(DistIntBounds)))
  stopifnot("NbThreads should be a non-negative integer" =
              NbThreads>=0)
  stopifnot("Verbose should be a boolean" =
              is.logical(Verbose))

  ## Define global variable for tidy operations
  . <- NULL
  Chromosome <- NULL
  Marker_1 <- NULL
  Marker_2 <- NULL
  Dist <- NULL

  ## Get quantities
  Prop_i <- ResAdmixGlobal$Prop[Ind,]
  Freq <- ResAdmixGlobal$Freq
  Marker_names <- names(Geno) %>% set_names(.,.)
  GeneticMap <- GeneticMap[match(Marker_names,GeneticMap$Marker),] 
  Chromosome_names <- unique(GeneticMap$Chromosome) %>% set_names(.,.)
  K <- length(Prop_i)
  M <- length(Geno)
  C <- length(Chromosome_names)
  Lm <- sapply(Geno, ncol)

  ## Additional checks
  stopifnot("MinProp cannot be negative" =
              MinProp>=0)
  stopifnot("MinProp is too high relative to number of groups" =
              MinProp<1/K)

  ## Print information
  if(Verbose){
    cat("#####################################################\n",
        "#####################################################\n",
        "####                  AdmixPoly                  ####\n",
        "####   Inference of local admixture proportions  ####\n",
        "#####################################################\n",
        "#####################################################\n\n",sep="")
    cat("Name of the individual = ",Ind,"\n",
        "Ploidy level of the individual = ",P,"\n",
        "Number of markers = ",M,"\n",
        "Number of chromosomes = ",C,"\n",
        "Number of alleles = ",ifelse(length(table(Lm))==1,Lm,
                                      paste0("[",min(Lm),",",max(Lm),"]")),"\n",
        "Number of groups = ",K,"\n\n",sep = "")
    cat("Options:\n",
        " - SmoothParam = ",SmoothParam,"\n",
        " - MinProp = ",MinProp,"\n",
        " - EmisProbMaxPermut = ",EmisProbMaxPermut,"\n",
        sep = "")
    if(is.null(DistIntBounds)){
      cat(" - DistIntBounds (cM) = NULL\n",sep = "")
    }else{
      cat(" - DistIntBounds (cM) =",DistIntBounds,"\n")
    }
    if(is.null(MaxRec)){
      cat(" - MaxRec = NULL\n",sep = "")
    }else{
      cat(" - MaxRec =",MaxRec,"\n")
    }
  }

  ## Identify active groups
  K_active <- which(Prop_i>MinProp)

  ## Test if more than one active group
  if(length(K_active)==1){
    if(Verbose) cat(Ind,"is not admixed but considered as 100% from",names(K_active),"\n")
    Ancestry <- lapply(Chromosome_names,function(chr){
      GeneticMap_c <- GeneticMap %>% filter(Chromosome==chr)
      res_c <- matrix(0,nrow = nrow(GeneticMap_c),ncol = K, 
                      dimnames = list(GeneticMap_c$Marker,colnames(ResAdmixGlobal$Prop)))
      res_c[,K_active] <- P
      return(res_c)
    })
    res <- list(Posterior=Ancestry, Viterbi=Ancestry)
    return(res)
  }else{

    ## Format data
    Prop_i <- Prop_i[K_active]
    Prop_i <- Prop_i/sum(Prop_i)
    K <- length(Prop_i)
    ResFormat <- .FormatLocal(Geno = Geno, Freq = Freq, GeneticMap = GeneticMap,
                              Ind = Ind, K_active = K_active, P = P, NbThreads = NbThreads)
    Geno <- ResFormat$Geno
    Freq <- ResFormat$Freq
    GeneticMap <- ResFormat$GeneticMap
    Chromosome_names <- unique(GeneticMap$Chromosome) %>% set_names(.,.)

    ## Calculate distances between marker pairs
    PairwiseDist <- lapply(Chromosome_names,function(chr){
      GeneticMap_chr <- GeneticMap %>% filter(Chromosome==chr)
      res <- tibble(Marker_1 = GeneticMap_chr$Marker[-nrow(GeneticMap_chr)],
                    Marker_2 = GeneticMap_chr$Marker[-1],
                    Chromosome = chr,
                    Dist = GeneticMap_chr$Distance[-1] -
                      GeneticMap_chr$Distance[-nrow(GeneticMap_chr)])
      return(res)
    })

    ## Function to simplify genetic distances
    SimplifyDist = function(Dist,Interval){
      MeanDist <- mean(Dist[Dist>=Interval[1]&Dist<Interval[2]])
      res <- sapply(Dist,function(d)ifelse(d>=Interval[1]&d<Interval[2],MeanDist,d))
      return(res)
    }

    ## Simplify genetic distances between consecutive markers
    DistMin <- sapply(PairwiseDist,function(chr)min(chr$Dist))
    DistMax <- sapply(PairwiseDist,function(chr)max(chr$Dist))
    if(!is.null(DistIntBounds)){
      ## Checks
      if(min(DistIntBounds)>min(DistMin)){
        DistIntBounds <- c(0,DistIntBounds)
      }
      if(max(DistIntBounds)<max(DistMax)){
        DistIntBounds <- c(DistIntBounds,Inf)
      }
      ## List of intervals
      DistIntervals <- lapply(1:(length(DistIntBounds)-1),function(i)c(DistIntBounds[i],DistIntBounds[i+1]))
      ## Calculate mean distances per interval
      PairwiseDistAll <- map_dfr(PairwiseDist, ~.x) %>% 
        { mutate(., Dist = Reduce(function(d, i) 
          SimplifyDist(Dist = d, Interval = i), x = DistIntervals, init = .$Dist))} %>% 
      # PairwiseDistAll <- map_dfr(PairwiseDist,~.x) %>% 
        # mutate(Dist = Reduce(function(d,i)
          # SimplifyDist(Dist = d,Interval = i),x = append(list(Dist),DistIntervals))) %>%
        mutate(Index = as.numeric(as.factor(Dist)))
      PairwiseDist <- lapply(Chromosome_names,function(chr) PairwiseDistAll %>% filter(Chromosome==chr))

      ## Print information
      if(Verbose){
        cat("\nAverage genetic distances are considered between consecutive markers:","\n")
        for(a in 1:max(PairwiseDistAll$Index)){
          cat(" - Interval (cM) = [",DistIntervals[[a]][1],";",DistIntervals[[a]][2],
              "[  -  Number of marker pairs = ",length(which(PairwiseDistAll$Index==a)),
              "  -  Average distance = ",round(mean(PairwiseDistAll$Dist[PairwiseDistAll$Index==a]),3),
              " cM","\n",sep = "")
        }
      }
    }else{
      if(Verbose) cat("\nExact genetic distances are considered for the HMM\n")
      PairwiseDistAll <- map_dfr(PairwiseDist,~.x) %>%
        mutate(Index = as.numeric(1:length(Dist)))
      PairwiseDist <- lapply(Chromosome_names,function(chr) PairwiseDistAll %>% filter(Chromosome==chr))
    }
    ## Get unique set of distances
    DistVal <- PairwiseDistAll$Dist[match(1:max(PairwiseDistAll$Index),PairwiseDistAll$Index)]

    ## Local ancestry inference
    if(Verbose) cat("\nInferring local admixture\n")
    if(!is.null(MaxRec)){
      res <- .ForwardBackward(Geno = Geno, Prop = Prop_i, P = P, Freq = Freq,
                              PairwiseDist = PairwiseDist, DistVal = DistVal,
                              SmoothParam = SmoothParam, 
                              MaxRec = MaxRec, #DistInt = FALSE,
                              EmisProbMaxPermut = EmisProbMaxPermut,
                              NbThreads = NbThreads)
    # }else if(!is.null(DistIntBounds)){
    #   res <- .ForwardBackward(Geno = Geno, Prop = Prop_i, P = P, Freq = Freq,
    #                           PairwiseDist = PairwiseDist, DistVal = DistVal,
    #                           SmoothParam = SmoothParam, 
    #                           MaxRec = -1, #DistInt = TRUE,
    #                           EmisProbMaxPermut = EmisProbMaxPermut,
    #                           NbThreads=NbThreads)
    }else{
      res <- .ForwardBackward(Geno = Geno, Prop = Prop_i, P = P, Freq = Freq,
                              PairwiseDist = PairwiseDist, DistVal = DistVal,
                              SmoothParam = SmoothParam, 
                              MaxRec = -1, #DistInt = FALSE,
                              EmisProbMaxPermut = EmisProbMaxPermut,
                              NbThreads = NbThreads)
    }

    ## Formatting of results
    if(length(K_active)<ncol(ResAdmixGlobal$Prop)){
      res$Posterior <- lapply(Chromosome_names,function(chr){
        res_c <- matrix(0,nrow = nrow(res$Posterior[[chr]]),ncol = ncol(ResAdmixGlobal$Prop), 
                        dimnames = list(rownames(res$Posterior[[chr]]),colnames(ResAdmixGlobal$Prop)))
        res_c[,colnames(res$Posterior[[chr]])] <- res$Posterior[[chr]]
        return(res_c)
      })
      res$Viterbi <- lapply(Chromosome_names,function(chr){
        res_c <- matrix(0,nrow = nrow(res$Viterbi[[chr]]),ncol = ncol(ResAdmixGlobal$Prop), 
                        dimnames = list(rownames(res$Viterbi[[chr]]),colnames(ResAdmixGlobal$Prop)))
        res_c[,colnames(res$Viterbi[[chr]])] <- res$Viterbi[[chr]]
        return(res_c)
      })
    }
    return(res)
  }
}