R/edec_aux_functions.R

Defines functions find_best_perm estimate_stability

Documented in estimate_stability find_best_perm

#' This chunk of codes are taken from EDec paper
#' Estimate stability of EDec models
#'
#' This function runs EDec Stage 1 for a series of random subsets of methylation
#' profiles of bulk tissue samples, with varying numbers of constituent cell
#' types. It then computes the similarity of estimated methylation profiles and
#' proportions of constituent cell types across subsets of data for models with
#' each number of constituent cell types. Stability of the model across subsets
#' of the data is generally a good indicator of which number of cell types is an
#' appropriate choice for that dataset.
#'
#' A specified number of subsets (\code{num_subsets}) of the samples with
#' methylation profiles will be generated by randomly selecting a fraction
#' (\code{subset_prop}) of the columns of \code{meth_bulk_samples}. For each of
#' those subsets of samples, EDec Stage 1 will be run using all possible number
#' of cell types (\code{possible_num_ct}). Since different runs of EDec Stage 1
#' with the same parameters can give different results, there is also the option
#' of running EDec Stage 1 multiple times (\code{reps_per_subset}) with each
#' number of cell types in each subset of the data, and keeping the best fitting
#' model. Once all runs of EDec Stage 1 are complete, the estimated methylation
#' profiles and proportions of constituent cell types for each given number of
#' constituent cell types will be compared across data subsets. Such comparisons
#' will be made by computing the Pearson correlation between methylation
#' profiles or proportion estimates for the same cell type in each pair of data
#' subsets. To determine which methylation profiles or proportion estimates
#' correspond to the same cell type in two runs of EDec, this function will
#' compute the correlation between every pair of estimated methylation profiles,
#' and find the permutation of the correlation matrix that is most similar to
#' the identity matrix.
#'
#' @param meth_bulk_samples Matrix with methylation profiles of bulk tissue
#'   samples. Rows correspond to loci/probes and columns correspond to different
#'   samples.
#' @param informative_loci A vector containing names (strings) of rows
#'   corresponding to loci/probes that are informative for distinguishing cell
#'   types.
#' @param possible_num_ct A vector of containing the possible numbers of cell
#'   types to be used in EDec Stage 1
#' @param subset_prop Proportion of samples from the full dataset to be included
#' in each subset of the data.
#' @param num_subsets Number of random subsets of the data on which EDec Stage 1
#'   with different numbers of cell types will be tested.
#' @param reps_per_subset How many times to run EDec Stage 1 with each number of
#' cell types in each subset of the data.
#' @param max_its Maximum number of iterations after which the EDec Stage 1
#'   algorithm will stop.
#' @param rss_diff_stop Maximum difference between the residual sum of squares
#'   of the model in two consecutive iterations for the EDec Stage 1 algorithm
#'   to converge.
#'
#' @return A list with the following components:
#' @return \describe{
#'  \item{\code{most_stable_num_ct}}{The number of cell types giving the most
#'   stable models across the data subsets. Minimum Pearson correlation between
#'   either methylation or proportion estimates across all data subsets is used
#'   to determine most stable model.}
#'  \item{\code{methylation_estimates}}{A list containing matrices of average
#'   methylation profiles of constituent cell types for each data subset and
#'   number of cell types.}
#'  \item{\code{proportion_estimates}}{A list containing matrices of proportions
#'   of constituent cell types in each input sample for each data subset and
#'   number of cell types.}
#'  \item{\code{stability_metric_meth}}{A matrix containing 0 to 100th
#'   quantiles, with 5% steps, of Pearson correlations between estimated
#'   methylation profiles of constituent cell types across subsets of the data
#'   for models with each possible number of cell types. Rows represent different
#'   number of cell types. Columns represent different quantiles.}
#'  \item{\code{stability_metric_props}}{A matrix containing 0 to 100th
#'   quantiles, with 5% steps, of Pearson correlations between estimated
#'   proportions of constituent cell types across subsets of the data for models
#'   with each possible number of cell types. Rows represent different number of
#'   cell types. Columns represent different quantiles.}
#'  \item{\code{stability_metric_comb}}{A matrix containing 0 to 100th
#'   quantiles, with 5% steps, of Pearson correlations between estimated
#'   proportions of constituent cell types and between methylation profiles of
#'   constituent cell types across subsets of the data for models with each
#'   possible number of cell types. Rows represent different number of cell
#'   types. Columns represent different quantiles.}
#'  }
#'
#' @export
estimate_stability = function(meth_bulk_samples,
                              informative_loci,
                              possible_num_ct,
                              subset_prop = 0.8,
                              num_subsets = 5,
                              reps_per_subset = 1,
                              max_its = 1000,
                              rss_diff_stop = 1e-8) {

  # ---------------------------------------------------------------------------
  # Check for NA values in input methylation profiles
  # ---------------------------------------------------------------------------

  if (sum(is.na(meth_bulk_samples)) > 0) {
    warning("Your input methylation profiles contain NA values.
            Loci with NA values in any samples will not be included
            in the analysis, and will not be present in cell type
            specific methylation profiles.")
  }

  # ---------------------------------------------------------------------------
  # Prepare data for analysis and declare output variables
  # ---------------------------------------------------------------------------

  meth_bulk_samples = as.matrix(stats::na.omit(meth_bulk_samples))
  all_prop_estimates <- list()
  all_meth_estimates <- list()

  total_num_samples <- ncol(meth_bulk_samples)
  num_samples_per_subset <- round(subset_prop * total_num_samples)

  ##for (s in 1:num_subsets) {
  for (s in seq_len(num_subsets)) {
    # -------------------------------------------------------------------------
    # Create random data subsets
    # -------------------------------------------------------------------------
    selected_samples <- sample(colnames(meth_bulk_samples), num_samples_per_subset)
    meth_bulk_samples_subset = meth_bulk_samples[ ,selected_samples]

    for (num_ct in possible_num_ct) {
      # -----------------------------------------------------------------------
      # Run EDec Stage 1 on the data subset
      # -----------------------------------------------------------------------
      stage1_result = run_edec_stage_1(meth_bulk_samples = meth_bulk_samples_subset,
                                       informative_loci = informative_loci,
                                       num_cell_types = num_ct,
                                       max_its = max_its,
                                       rss_diff_stop = rss_diff_stop)

      # ---------------------------------------------------------------------------
      # If reps_per_subset > 1, run EDec Stage 1 multiple times and keep best
      # fitting model
      # ---------------------------------------------------------------------------
      min_rss <- stage1_result$res.sum.squares
      rep <- 2
      while (rep <= reps_per_subset) {
        new_stage1_result <- run_edec_stage_1(meth_bulk_samples = meth_bulk_samples_subset,
                                              informative_loci = informative_loci,
                                              num_cell_types = num_ct,
                                              max_its = max_its,
                                              rss_diff_stop = rss_diff_stop)
        if(min_rss > new_stage1_result$res.sum.squares){
          stage1_result <- new_stage1_result
        }
        rep <- rep + 1
      }

      # -----------------------------------------------------------------------
      # Use methylation profiles of constituent cell types estimated from the
      # data subset to estimate proportions of constituent cell types in every
      # sample
      # -----------------------------------------------------------------------
      prop_estimates_all_samples <- estimate_props_qp(meth_bulk_samples[informative_loci,],
                                                      stage1_result$methylation[informative_loci,])

      # -----------------------------------------------------------------------
      # Save methylation profiles and proportion estimates to output lists
      # -----------------------------------------------------------------------
      all_meth_estimates[[paste("subset.",s,"_nct.",num_ct,sep="")]] <- stage1_result$methylation
      all_prop_estimates[[paste("subset.",s,"_nct.",num_ct,sep="")]] <- prop_estimates_all_samples
      print(paste("Num.CT=",num_ct," Subset=",s,sep=""))
    }
  }

  # -----------------------------------------------------------------------
  # Compute correlations between methylation profiles and proportion
  # estimates for models with each possible number of cell types across
  # data subsets.
  # -----------------------------------------------------------------------
  stab_metrics_meth <- NULL
  stab_metrics_prop <- NULL
  stab_metrics_comb <- NULL
  for (num_ct in possible_num_ct) {
    cor_vec_meth = NULL
    cor_vec_prop = NULL
    cor_vec_comb = NULL
    #for (s1 in 1:(num_subsets - 1)) {
    for (s1 in seq_len(num_subsets - 1)) {
      for(s2 in (s1 + 1):num_subsets){
        cors_meth <- stats::cor(all_meth_estimates[[paste("subset.",s1,"_nct.",num_ct,sep="")]],
                                all_meth_estimates[[paste("subset.",s2,"_nct.",num_ct,sep="")]])
        best_perm <- find_best_perm(cors_meth)
        cor_vec_meth <- c(cor_vec_meth, diag(best_perm$pmatrix))
        corv_vec_comb <- c(cor_vec_comb,diag(best_perm$pmatrix))

        cors_props <- stats::cor(all_prop_estimates[[paste("subset.",s1,"_nct.",num_ct,sep="")]],
                                 all_prop_estimates[[paste("subset.",s2,"_nct.",num_ct,sep="")]])
        cors_props <- cors_props[best_perm$pvec,]
        cor_vec_prop <- c(cor_vec_prop, diag(cors_props))
        cor_vec_comb <- c(cor_vec_comb, diag(cors_props))
      }
    }
    stab_metrics_meth <- rbind(stab_metrics_meth,stats::quantile(cor_vec_meth,seq(0,1,0.05)))
    stab_metrics_prop <- rbind(stab_metrics_prop,stats::quantile(cor_vec_prop,seq(0,1,0.05)))
    stab_metrics_comb <- rbind(stab_metrics_comb,stats::quantile(cor_vec_comb,seq(0,1,0.05)))
  }

  rownames(stab_metrics_meth) <- paste("num_ct_",possible_num_ct)

  # ---------------------------------------------------------------------------
  # Identify most number of cell types giving the most stable models
  # ---------------------------------------------------------------------------
  most_stable_num_ct <- possible_num_ct[which.max(stab_metrics_comb[,1])]

  # -----------------------------------------------------------------------
  # Create output list and return
  # -----------------------------------------------------------------------
  result <- list(most_stable_num_ct = most_stable_num_ct,
                 methylation_estimates = all_meth_estimates,
                 proportion_estimates = all_prop_estimates,
                 stability_metric_meth = stab_metrics_meth,
                 stability_metric_props = stab_metrics_prop,
                 stability_metric_comb = stab_metrics_comb)
  return(result)
}

#' Find best permutation of correlation matrix
#'
#' This function finds the permutation of a correlation matrix that is most
#' similar to the identity matrix.
#'
#' @param cor_matrix A matrix of pairwise correlations between a set of numeric
#'   vectors
#'
#' @return A list containing the following components:
#' @return \describe{
#'  \item{\code{pmatrix}}{The permuted correlation matrix.}
#'  \item{\code{pvec}}{The permutation vector, such that \code{pmatrix =
#'   cor_matrix[pvec, ]}.}
#' }
find_best_perm <- function(cor_matrix) {
  # finds the permutation P of A such that ||PA - B|| is minimum
  # in Frobenius norm
  # Uses the linear-sum assignment problem (LSAP) solver
  # in the "clue" package
  # Returns P%*%A and the permutation vector `pvec' such that
  # A[pvec, ] is the permutation of A closest to B
  n <- nrow(cor_matrix)
  id_matrix <- diag(1, nrow(cor_matrix))
  D <- matrix(NA, n, n)
  ##for (i in 1:n) {
  for (i in seq_len(n)) {
    ##for (j in 1:n) {
    for (j in seq_len(n)) {
      D[j, i] <- (sum((id_matrix[j, ] - cor_matrix[i, ])^2))
    }
  }
  vec <- c(clue::solve_LSAP(D))
  list(pmatrix=cor_matrix[vec,], pvec=vec)
}
boseb/CTDPathSim2.0 documentation built on April 14, 2022, 12:39 a.m.