R/signatureFit_pipeline.R

Defines functions signatureFit_pipeline

Documented in signatureFit_pipeline

# Andrea Degasperi ad923@cam.ac.uk, Serena Nik-Zainal group, University of Cambridge, UK, 2022

#' Signature fit pipeline
#'
#' This function is the main interface for computing signature fit using the signature.tools.lib R package.
#'
#' The pipeline will produce some feedback in the form or info, warning, and error messages.
#' Please check the output to see whether everything worked as planned.
#'
#' @param catalogues catalogues matrix, samples as columns, channels as rows. The mutation type of the catalogue will be inferred automatically by checking the rownames.
#' @param genome.v either "hg38" (will load BSgenome.Hsapiens.UCSC.hg38), "hg19" (will load BSgenome.Hsapiens.1000genomes.hs37d5), mm10 (will load BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10) or canFam3 (will load BSgenome.Cfamiliaris.UCSC.canFam3::BSgenome.Cfamiliaris.UCSC.canFam3)
#' @param organ If signatures is not specified, then use this parameter to provide an organ name to automatically select appropriate signatures. Organ names and signature selection depends on the signature_version provided. When using RefSigv1 or RefSigv2 as signature_version organ-specific signatures will be used. Use one of the following organs: "Biliary", "Bladder", "Bone_SoftTissue", "Breast", "Cervix" (v1 only), "CNS", "Colorectal", "Esophagus", "Head_neck", "Kidney", "Liver", "Lung", "Lymphoid", "NET" (v2 only), "Oral_Oropharyngeal" (v2 only), "Ovary", "Pancreas", "Prostate", "Skin", "Stomach", "Uterus". Alternatively, set this to "Other" to use a curated set of common and rare signatures. If COSMICv2 or COSMICv3.2 are used, signatures are selected if the were found in the given organ/dataset. The mutation type is automatically inferred from the catalogue.
#' @param SNV_vcf_files list of file names corresponding to SNV VCF files to be used to construct 96-channel substitution catalogues. This should be a named vector, where the names indicate the sample name.
#' @param SNV_tab_files list of file names corresponding to SNV TAB files to be used to construct 96-channel substitution catalogues. This should be a named vector, where the names indicate the sample name. The files should contain a header in the first line with the following columns: chr, position, REF, ALT.
#' @param DNV_vcf_files list of file names corresponding to SNV/DNV VCF files to be used to construct 96-channel substitution catalogues. Adjacent SNVs will be combined into DNVs. This should be a named vector, where the names indicate the sample name.
#' @param DNV_tab_files list of file names corresponding to SNV/DNV TAB files to be used to construct 96-channel substitution catalogues. Adjacent SNVs will be combined into DNVs. This should be a named vector, where the names indicate the sample name. The files should contain a header in the first line with the following columns: chr, position, REF, ALT.
#' @param SV_bedpe_files list of file names corresponding to SV (Rearrangements) BEDPE files to be used to construct 32-channel rearrangement catalogues. This should be a named vector, where the names indicate the sample name. The files should contain a rearrangement for each row (two breakpoint positions should be on one row as determined by a pair of mates of paired-end sequencing) and should already be filtered according to the user preference, as all rearrangements in the file will be used and no filter will be applied. The files should contain a header in the first line with the following columns: "chrom1", "start1", "end1", "chrom2", "start2", "end2" and "sample" (sample name). In addition, either two columns indicating the strands of the mates, "strand1" (+ or -) and "strand2" (+ or -), or one column indicating the structural variant class, "svclass": translocation, inversion, deletion, tandem-duplication. The column "svclass" should correspond to (Sanger BRASS convention): inversion (strands +/- or -/+ and mates on the same chromosome), deletion (strands +/+ and mates on the same chromosome), tandem-duplication (strands -/- and mates on the same chromosome), translocation (mates are on different chromosomes).
#' @param signatures signatures should be a matrix or dataframe, signatures as columns, channels as rows. The mutation type of the signatures will be inferred automatically by checking the rownames. Use this parameter only if you want to use your own signatures. Leave NULL if you want to use the signatures provided by the package, for example by specifying a specific organ or signature_version.
#' @param rare_signatures used only when fit_method=FitMS, and the signature parameter is also given. The parameter rare_signatures should be a matrix or dataframe, signatures as columns, channels as rows. The mutation type of the signatures will be inferred automatically by checking the rownames.
#' @param signature_version either "COSMICv2", "COSMICv3.2", "RefSigv1" or "RefSigv2". If not specified, "RefSigv2 will be used. The mutation type is automatically inferred from the catalogue.
#' @param signature_names if no signatures have been provided using the signatures and rare_signatures parameters, and if no organ is specified, then signature_names can be used to specify a list of signature names, which should match the corresponding mutation type (inferred automatically) and reference signatures requested using the signature_version parameter.
#' @param fit_method either Fit or FitMS. Notice that automatic selection of signatures in FitMS is currently available only for SNV mutations or catalogues, signature_version=RefSigv2 and specifying an organ. Alternatively, FitMS can be used by specifying both signatures (which will be considered common signatures) and rare_signatures parameters.
#' @param optimisation_method KLD or NNLS
#' @param useBootstrap set to TRUE to use bootstrap
#' @param nboot number of bootstraps to use, more bootstraps more accurate results
#' @param exposureFilterType use either fixedThreshold or giniScaledThreshold. When using fixedThreshold, exposures will be removed based on a fixed percentage with respect to the total number of mutations (threshold_percent will be used). When using giniScaledThreshold each signature will used a different threshold calculated as (1-Gini(signature))*giniThresholdScaling
#' @param threshold_percent threshold in percentage of total mutations in a sample, only exposures larger than threshold are considered. Set it to -1 to deactivate.
#' @param threshold_nmuts threshold in number of mutations in a sample, only exposures larger than threshold are considered.Set it to -1 to deactivate.
#' @param giniThresholdScaling scaling factor for the threshold type giniScaledThreshold, which is based on the Gini score of a signature. The threshold is computed as (1-Gini(signature))*giniThresholdScaling, and will be used as a percentage of mutations in a sample that the exposure of "signature" need to be larger than. Set it to -1 to deactivate.
#' @param giniThresholdScaling_nmuts scaling factor for the threshold type giniScaledThreshold, which is based on the Gini score of a signature. The threshold is computed as (1-Gini(signature))*giniThresholdScaling_nmuts, and will be used as number of mutations in a sample that the exposure of "signature" need to be larger than. Set to -1 to deactivate.
#' @param multiStepMode this is a FitMS parameter. Use one of the following: "constrainedFit", "partialNMF", "errorReduction", or "cossimIncrease".
#' @param residualNegativeProp maximum proportion of mutations (w.r.t. total mutations in a sample) that can be in the negative part of a residual when using the constrained least squares fit
#' when using multiStepMode=constrainedFit
#' @param minResidualMutations minimum number of mutations in a residual when using constrainedFit or partialNMF. Deactivated by default.
#' @param minCosSimRareSig minimum cosine similarity between a residual and a rare signature for considering the rare signature as a candidate for a sample when using constrainedFit or partialNMF
#' @param minErrorReductionPerc minimum percentage of error reduction for a signature to be considered as candidate when using the errorReduction method. The error is computed as mean absolute deviation
#' @param minCosSimIncrease minimum cosine similarity increase for a signature to be considered as candidate when using the cossimIncrease method
#' @param threshold_p.value p-value to determine whether an exposure is above the threshold_percent. In other words, this is the empirical probability that the exposure is lower than the threshold
#' @param commonSignatureTier is either T1, T2 or T3. The default option is T1. For each organ, T1 indicates to use the common organ-specific signatures, while T2 indicates to use the corresponding reference signatures. In general, T1 should be more appropriate for organs where there are no mixed organ-specific signatures, e.g. GEL-Ovary_common_SBS1+18, while T2 might be more suitable for when such mixed signatures are present, so that each signature can be fitted, e.g. fitting the two signatures SBS1 and SBS18, instead of a single GEL-Ovary_common_SBS1+18. T3 is an intermediate option between T1 and T2, where only the mixed organ signatures are replaced with the corresponding reference signatures. This parameter affects both the organ signatures used in Fit and the common signatures used in FitMS
#' @param rareSignatureTier is either T0, T1, T2, T3 or T4. The default option is T2. For each organ, T0 are rare signatures that were observed in the requested organ, including low quality signatures (QC amber and red signatures). T1 are high quality (QC green) rare signatures that were observed in the requested organ. T2-T4 signatures extend the rare signatures set to what has been observed also in other organs. T2 includes all QC green signatures found in other organs, with the additional restriction in the case of SBS that the additional signatures were classified as rare at least twice in Degasperi et al. 2022 Science. T3 includes all QC green signatures (if not SBS, T3=T2). T4 includes all signatures including QC amber and red. In general we advise to use the rare T2 tier.
#' @param maxRareSigsPerSample maximum number of rare signatures that should be searched in each sample. In most situations, leaving this at 1 should be enough.
#' @param rareCandidateSelectionCriteria MaxCosSim or MinError. FitMS parameter. Whenever there is more than one rare signature that passes the multiStepMode criteria, then the best candidate rare signature is automatically selected using the rareCandidateSelectionCriteria. Candidate rare signatures can be manually selected using the function fitMerge. The parameter rareCandidateSelectionCriteria is set to MinError by default. Error is computed as the mean absolute deviation of channels.
#' @param nparallel to use parallel specify >1
#' @param noFit if TRUE, terminate the pipeline early without running signature Fit. This is useful if one only wants to generate catalogues from mutation lists.
#' @param randomSeed set an integer random seed
#' @param verbose use FALSE to suppress messages
#' @return returns the fit object with activities/exposures of the signatures in the given sample and other information
#' @keywords mutational signatures fit
#' @export
#' @examples
#' res <- signatureFit_pipeline(catalogues,"Breast")
#' plotFitResults(res$fitResults,"results/")
signatureFit_pipeline <- function(catalogues=NULL,
                                  genome.v="hg19",
                                  organ=NULL,
                                  SNV_vcf_files=NULL,
                                  SNV_tab_files=NULL,
                                  DNV_vcf_files=NULL,
                                  DNV_tab_files=NULL,
                                  SV_bedpe_files=NULL,
                                  signatures=NULL,
                                  rare_signatures=NULL,
                                  signature_version="RefSigv2",
                                  signature_names=NULL,
                                  fit_method="FitMS", # either FitMS or Fit
                                  optimisation_method = "KLD",
                                  useBootstrap = FALSE,
                                  nboot = 200,
                                  exposureFilterType = "fixedThreshold", # or "giniScaledThreshold"
                                  threshold_percent = 5,
                                  threshold_nmuts= -1,
                                  giniThresholdScaling = 10,
                                  giniThresholdScaling_nmuts = -1,
                                  multiStepMode = "errorReduction", # or "partialNMF", or "errorReduction", or "cossimIncrease"
                                  threshold_p.value = 0.05,
                                  commonSignatureTier = "T1",
                                  rareSignatureTier = "T2",  # either T1 for observed in organ or T2 for extended
                                  residualNegativeProp = 0.003,
                                  minResidualMutations = NULL,
                                  minCosSimRareSig = 0.8,
                                  minErrorReductionPerc = 15,
                                  minCosSimIncrease = 0.02,
                                  maxRareSigsPerSample = 1,
                                  rareCandidateSelectionCriteria="MinError",
                                  noFit = FALSE,
                                  nparallel = 1,
                                  randomSeed = NULL,
                                  verbose = FALSE){

  message("[info signatureFit_pipeline] signatureFit pipeline starting!")

  #if multiple parallel cores are used, set it here
  doParallel::registerDoParallel(nparallel)

  if(!is.null(randomSeed)){
    set.seed(randomSeed)
  }

  # let's see if a catalogue was provided and whether we can infer the mutation type.
  mtype_catalogues <- NULL
  sampleNames <- NULL
  if(!is.null(catalogues)){
    mtype_catalogues <- getTypeOfMutationsFromChannels(catalogues)
    sampleNames <- colnames(catalogues)
  }

  # now check whether we need to build some catalogues
  specified_files <- ! c(is.null(SNV_vcf_files),
                         is.null(SNV_tab_files),
                        is.null(DNV_vcf_files),
                        is.null(DNV_tab_files),
                        is.null(SV_bedpe_files))
  names(specified_files) <- c("SNV_vcf_files",
                              "SNV_tab_files",
                              "DNV_vcf_files",
                              "DNV_tab_files",
                              "SV_bedpe_files")
  # In case the function that builds catalogues returns annotated mutations
  annotated_mutations <- NULL
  mtype_mutations <- NULL
  catalogues_mutations <- NULL
  clustering_regions_mutations <- NULL

  if(sum(specified_files)>1){
    # too many mutation file types passed
    message("[error signatureFit_pipeline] too many mutation file types specified: ",paste(names(specified_files)[specified_files],collapse = ","),". Please specify only one.")
    return(NULL)
  }else if(sum(specified_files)==0){
    # If no mutation files were used, then just make sure the catalgues var is not NULL
    if(is.null(catalogues)){
      message("[error signatureFit_pipeline] no mutation files nor catalogues file specified. Nothing to run the analysis on.")
      return(NULL)
    }
  }else{
    # now here we have exaclty one mutation file to use, so build the catalogue accordingly
    if(!is.null(SNV_vcf_files)){
      message("[info signatureFit_pipeline] reading ",length(SNV_vcf_files)," SNV vcf mutation files and building catalogues.")

      cat_list <- foreach::foreach(i=1:length(SNV_vcf_files)) %dopar% {
        sample <- names(SNV_vcf_files)[i]
        res <- vcfToSNVcatalogue(SNV_vcf_files[sample],genome.v = genome.v)
        colnames(res$catalogue) <- sample
        res$muts <- cbind(data.frame(sampleName=rep(sample,nrow(res$muts)),stringsAsFactors = F),res$muts)
        res
      }
      catalogues_mutations <- data.frame(row.names = rownames(cat_list[[1]]$catalogue),stringsAsFactors = F)
      for(i in 1:length(cat_list)){
        catalogues_mutations <- cbind(catalogues_mutations,cat_list[[i]]$catalogue)
        annotated_mutations <- rbind(annotated_mutations,cat_list[[i]]$muts)
      }

    }else if(!is.null(SNV_tab_files)){
      message("[info signatureFit_pipeline] reading ",length(SNV_tab_files)," SNV tab mutation files and building catalogues.")

      cat_list <- foreach::foreach(i=1:length(SNV_tab_files)) %dopar% {
        sample <- names(SNV_tab_files)[i]
        subs <- read.table(file = SNV_tab_files[sample],
                           sep = "\t",header = TRUE,check.names = FALSE,stringsAsFactors = FALSE)
        res <- tabToSNVcatalogue(subs,genome.v = genome.v)
        colnames(res$catalogue) <- sample
        res$muts <- cbind(data.frame(sampleName=rep(sample,nrow(res$muts)),stringsAsFactors = F),res$muts)
        res
      }
      catalogues_mutations <- data.frame(row.names = rownames(cat_list[[1]]$catalogue),stringsAsFactors = F)
      for(i in 1:length(cat_list)){
        catalogues_mutations <- cbind(catalogues_mutations,cat_list[[i]]$catalogue)
        annotated_mutations <- rbind(annotated_mutations,cat_list[[i]]$muts)
      }

    }else if(!is.null(DNV_vcf_files)){
      message("[info signatureFit_pipeline] reading ",length(DNV_vcf_files)," DNV vcf mutation files and building catalogues.")

      cat_list <- foreach::foreach(i=1:length(DNV_vcf_files)) %dopar% {
        sample <- names(DNV_vcf_files)[i]
        res <- vcfToDNVcatalogue(DNV_vcf_files[sample],genome.v = genome.v)
        colnames(res$DNV_catalogue) <- sample
        res$DNV_catalogue <- convertToAlexandrovChannels(res$DNV_catalogue)
        if(!is.null(res$annotated_DNVs)) res$annotated_DNVs <- cbind(data.frame(sampleName=rep(sample,nrow(res$annotated_DNVs)),stringsAsFactors = F),res$annotated_DNVs)
        res
      }
      catalogues_mutations <- data.frame(row.names = rownames(cat_list[[1]]$DNV_catalogue),stringsAsFactors = F)
      for(i in 1:length(cat_list)){
        catalogues_mutations <- cbind(catalogues_mutations,cat_list[[i]]$DNV_catalogue)
        if(!is.null(cat_list[[i]]$annotated_DNVs)) annotated_mutations <- rbind(annotated_mutations,cat_list[[i]]$annotated_DNVs)
      }

    }else if(!is.null(DNV_tab_files)){
      message("[info signatureFit_pipeline] reading ",length(DNV_tab_files)," DNV tab mutation files and building catalogues.")

      cat_list <- foreach::foreach(i=1:length(DNV_tab_files)) %dopar% {
        sample <- names(DNV_tab_files)[i]
        muts <- read.table(file = DNV_tab_files[sample],
                           sep = "\t",header = TRUE,check.names = FALSE,stringsAsFactors = FALSE)
        res <- tabToDNVcatalogue(muts)
        colnames(res$DNV_catalogue) <- sample
        res$DNV_catalogue <- convertToAlexandrovChannels(res$DNV_catalogue)
        if(!is.null(res$annotated_DNVs)) res$annotated_DNVs <- cbind(data.frame(sampleName=rep(sample,nrow(res$annotated_DNVs)),stringsAsFactors = F),res$annotated_DNVs)
        res
      }
      catalogues_mutations <- data.frame(row.names = rownames(cat_list[[1]]$DNV_catalogue),stringsAsFactors = F)
      for(i in 1:length(cat_list)){
        catalogues_mutations <- cbind(catalogues_mutations,cat_list[[i]]$DNV_catalogue)
        if(!is.null(cat_list[[i]]$annotated_DNVs)) annotated_mutations <- rbind(annotated_mutations,cat_list[[i]]$annotated_DNVs)
      }

    }else if(!is.null(SV_bedpe_files)){
      message("[info signatureFit_pipeline] reading ",length(SV_bedpe_files)," SV bedpe mutation files and building catalogues.")

      cat_list <- foreach::foreach(i=1:length(SV_bedpe_files)) %dopar% {
        sample <- names(SV_bedpe_files)[i]
        sv_bedpe <- read.table(SV_bedpe_files[sample],sep = "\t",header = TRUE,
                               stringsAsFactors = FALSE,check.names = FALSE,comment.char = "")
        reslist <- bedpeToRearrCatalogue(sv_bedpe)
        if(!is.null(reslist$clustering_regions)) reslist$clustering_regions$sampleName <- sample
        if(!is.null(reslist$annotated_bedpe)) reslist$annotated_bedpe <-  cbind(data.frame(sampleName=rep(sample,nrow(reslist$annotated_bedpe)),stringsAsFactors = F),reslist$annotated_bedpe)
        # check that only one catalogue is generated. If not, take the one with more mutatations and raise a warning
        resncol <- ncol(reslist$rearr_catalogue)
        if(resncol>1){
          rescolsum <- apply(reslist$rearr_catalogue,2,sum)
          sampletouse <- which.max(rescolsum)[1]
          message(paste0("[warning signatureFit_pipeline] BEDPE file for sample ",sample," contained ",resncol," sample names: ",paste(colnames(reslist$rearr_catalogue),collapse = ", "),". This could be due to germline rearrangements that should be removed. Using only the sample with the largest number of rearrangements (",colnames(reslist$rearr_catalogue)[sampletouse],"). Please double check and rerun if necessary with only one sample for each BEDPE file."))
          reslist$rearr_catalogue <- reslist$rearr_catalogue[,sampletouse,drop=F]
          reslist$annotated_bedpe <- reslist$annotated_bedpe[reslist$annotated_bedpe$sample==colnames(reslist$rearr_catalogue)[sampletouse],,drop=F]
        }
        colnames(reslist$rearr_catalogue) <- sample
        reslist
      }
      catalogues_mutations <- data.frame(row.names = rownames(cat_list[[1]]$rearr_catalogue),stringsAsFactors = F)
      bedpecolumns <- c("sampleName","chrom1", "start1", "end1", "chrom2", "start2", "end2" , "sample","svclass","id", "is.clustered", "length","catalogue.label")
      for(i in 1:length(cat_list)){
        catalogues_mutations <- cbind(catalogues_mutations,cat_list[[i]]$rearr_catalogue)
        if(!is.null(cat_list[[i]]$annotated_bedpe)) annotated_mutations <- rbind(annotated_mutations,cat_list[[i]]$annotated_bedpe[,bedpecolumns,drop=F])
        if(!is.null(cat_list[[i]]$clustering_regions)) clustering_regions_mutations <- rbind(clustering_regions_mutations,cat_list[[i]]$clustering_regions)
      }
    }

    # now we have built a catalogue table from the mutations
    mtype_mutations <- getTypeOfMutationsFromChannels(catalogues_mutations)
    if(is.null(catalogues)){
      catalogues <- catalogues_mutations
      mtype_catalogues <- mtype_mutations
      sampleNames <- colnames(catalogues)
    }else{
      # we already have a catalogue
      # let's check it is the same mutation type as the catalogue provided
      if(mtype_catalogues==mtype_mutations){
        # check for overlapping sample names and give precedence to the catalogue provided
        conflict_samples <- intersect(colnames(catalogues),colnames(catalogues_mutations))
        if(length(conflict_samples)>0){
          message("[warning signatureFit_pipeline] sample names conflict. The same sample names have been found in the catalogue provided and in the list of mutation files provided. Catalogues already provided will be used and mutations will be ignored for samples: ",paste(conflict_samples,collapse = ","),".")
          catalogues_mutations <- catalogues_mutations[,setdiff(colnames(catalogues_mutations),conflict_samples),drop=F]
        }
        # combine the catalogues:
        catalogues <- cbind(catalogues,catalogues_mutations)
        sampleNames <- colnames(catalogues)
      }else{
        message("[error signatureFit_pipeline] mutation files provided (",names(specified_files)[specified_files],
                ") has a mutation type (",mtype_mutations,") different from the mutation type of the catalogue (",mtype_catalogues,
                "). They need to be the same.")
        return(NULL)
      }
    }
  }

  # now we have the catalogues to fit and the annotated mutations ready
  # we should save these now in the return obj
  returnObj <- list()
  returnObj$catalogues <- catalogues
  returnObj$mtype_catalogues <- mtype_catalogues
  returnObj$annotated_mutations <- annotated_mutations
  returnObj$clustering_regions_mutations <- clustering_regions_mutations

  # if no signature fit was requested then finish early and return catalogue and annotated mutations
  if(noFit){
    message("[info signatureFit_pipeline] Finishing pipeline early as no signature fit requested (noFit=TRUE). ",
            "Check the returned object for generated catalogues.")
    return(returnObj)
  }

  # find out what signatures you need, start from checking whether signatures were provided

  if(!is.null(signatures)){
    # check that the tumour type is the same as the catalogues'
    mtype_signatures <- getTypeOfMutationsFromChannels(signatures)
    if(mtype_catalogues!=mtype_signatures){
      message("[error signatureFit_pipeline] mutation files or catalogues provided have a mutation type (",
              mtype_catalogues,") different from the mutation type of the signatures file provided (",
              mtype_signatures,"). They need to be the same.")
      return(returnObj)
    }
  }

  if(!is.null(rare_signatures)){
    # check that the tumour type is the same as the catalogues'
    mtype_rare_signatures <- getTypeOfMutationsFromChannels(rare_signatures)
    if(mtype_catalogues!=mtype_rare_signatures){
      message("[error signatureFit_pipeline] mutation files or catalogues provided have a mutation type (",
              mtype_catalogues,") different from the mutation type of the rare signatures file provided (",
              mtype_rare_signatures,"). They need to be the same.")
      return(returnObj)
    }
  }

  # now, let's consider the following order/priorities:
  # 1. If the user provided signatures, then use those and ignore all other options
  #    If the user chose FitMS but did not provide rare signatures nor an organ, then we should warn the user and switch to Fit, while suggesting to provide the rare signatures or organ if FitMS is needed
  # 2. If the user did not provide signatures, let's check if the user provided an organ.
  #    With an organ, we can select signatures automatically, depending on the signature_version parameter
  #    a) if signature_version is NULL, then assume we are using the latest organ-specific signatures
  #       (RefSigv2 if subs and dbs or RefSigv1 if rearr)
  #    b) if signature version is RefSigv1 or RefSigv2, then use organ specific signatures
  #    c) if signature version is COSMICv2 or COSMICv3.2, use a subset of signatures found in a given organ
  # 3. If no signatures are provided and no organ is specified, then use the COSMIC or RefSigs reference signatures.
  #    If a list of signatures is provided with signature_names, then use only the signatures in the list,
  #    while if signature_names is NULL then use all the signatures at once

  if(!is.null(signatures)){
    if(is.null(rare_signatures) & fit_method=="FitMS" & is.null(organ)){
      message("[warning signatureFit_pipeline] user selected FitMS which requires both common and rare signatures. ",
              "Common signatures were provided using the signatures parameter, but the parameter rare_signatures is missing and no organ was provided to select it automatically. ",
              "The fit_method parameters has been changed to Fit so that the provided signatures can be fitted. If FitMS was ",
              "the intended method, please either provide the rare_signatures parameter or provide an organ for automatic signature selection.")
      fit_method <- "Fit"
    }
    if(!is.null(organ)){
      if(fit_method=="FitMS"){
        message("[warning signatureFit_pipeline] The organ parameter ",organ," will not be used to automatically select the common signatures to fit, because the user specified the common signatures using the signatures parameter.")
      }else if(fit_method=="Fit"){
        message("[warning signatureFit_pipeline] The organ parameter ",organ," will not be used to automatically select the signatures to fit, because the user specified the signatures using the signatures parameter.")
      }
    }
    if(!is.null(signature_version) & fit_method=="Fit"){
      message("[warning signatureFit_pipeline] signature_version parameter ",signature_version," will be ignored because the user has provided the signatures parameter.")
    }
  }else{
    # user has not provided the signatures parameter, so let's see what was requested

    # let's check whether a specific signature_version was requested, and if not set to the latest
    if(is.null(signature_version)){
      message("[warning signatureFit_pipeline] signature_version parameter unspecified, using latest RefSigv2.")
      signature_version <- "RefSigv2"
    }
    # if no organ was provided and fit_method=FitMS, we need to switch to Fit at this point
    if(fit_method=="FitMS" & is.null(organ)){
      message("[warning signatureFit_pipeline] user selected FitMS, but did not provide an organ for the automatic ",
              "selection of signatures and did not provide signatures via the signatures and rare_signatuers parameters. ",
              "The fit_method parameters has been changed to Fit so that we can still fit some signatures according to the ",
              "signature_version parameter and the mutation type of the catalogues.")
      fit_method <- "Fit"
    }

    # Need to avoid code duplication and errors by fixing some cases using warnings
    if(signature_version=="RefSigv2" & mtype_catalogues=="rearr"){
      message("[warning signatureFit_pipeline] rearrangements RefSig mutational signatures only available in RefSigv1, ",
              "switching to signature_version=RefSigv1")
      signature_version <- "RefSigv1"
    }
    if(fit_method=="FitMS" & mtype_catalogues=="rearr"){
      message("[warning signatureFit_pipeline] FitMS does not support rearrangement signatures yet, ",
              "switching to fit_method=Fit")
      fit_method <- "Fit"
    }
    if(fit_method=="FitMS" & mtype_catalogues=="subs" & signature_version=="RefSigv1"){
      message("[warning signatureFit_pipeline] FitMS with signature version RefSigv1 requested for substitutions, ",
              "switching to fit_method=Fit. If you intended to use FitMS, then set signature_version to RefSigv2")
      fit_method <- "Fit"
    }
    if(signature_version=="RefSigv1" & mtype_catalogues=="DNV"){
      message("[warning signatureFit_pipeline] DBS RefSig mutational signatures only available in RefSigv2, ",
              "switching to signature_version=RefSigv2")
      signature_version <- "RefSigv2"
    }
    if((signature_version=="COSMICv3.2" | signature_version=="COSMICv2") &  fit_method=="FitMS"){
      message("[warning signatureFit_pipeline] FitMS does not support COSMIC signatures, ",
              "switching to fit_method=Fit")
      fit_method <- "Fit"
    }

    if(signature_version=="RefSigv2"){
      if(!is.null(organ)){
        # an organ was requested. If FitMS was requested, then leave it to FitMS to get the appropriate signatures
        if(fit_method=="Fit"){
          if(mtype_catalogues %in% c("subs","DNV")){
            signatures <- getOrganSignatures(organ = organ,version = 2,typemut = mtype_catalogues)
            # need to check that the getOrganSignatures functions returned something
            if(is.null(signatures)){
              message("[error signatureFit_pipeline] RefSigv2 signatures not available for mutation type ",mtype_catalogues,
                      " and organ ",organ,".")
              return(returnObj)
            }
            if(commonSignatureTier=="T2"){
              # need to convert
              convertedNames <- convertSigNamesFromOrganToRefSigs(colnames(signatures),typemut = mtype_catalogues)
              if(mtype_catalogues == "subs"){
                signatures <- referenceSignaturesSBSv2.03[,convertedNames,drop=F]
              }else {
                signatures <- referenceSignaturesDBSv1.01[,convertedNames,drop=F]
              }
            }else if(commonSignatureTier=="T3"){
              convertedNames <- convertSigNamesFromOrganToRefSigs(colnames(signatures),typemut = mtype_catalogues,mixedOnly = TRUE)
              if(mtype_catalogues == "subs"){
                combinedSigs <- cbind(referenceSignaturesSBSv2.03,organSignaturesSBSv2.03)
                signatures <- combinedSigs[,convertedNames,drop=F]
              }else {
                combinedSigs <- cbind(referenceSignaturesDBSv1.01,organSignaturesDBSv1.01)
                signatures <- combinedSigs[,convertedNames,drop=F]
              }
            }
          }else if(mtype_catalogues == "rearr"){
            # ---
          }else{
            message("[error signatureFit_pipeline] mutation type ",mtype_catalogues," not available for automatic selection of signatures for signature_version RefSigv2. ",
                    "Please provide your own signatures using the signatures parameter, and also rare_signatures if using FitMS.")
            return(returnObj)
          }
        }
      }else{
        # user did not provide an organ, use reference signatures and perhaps signature_names
        message("[info signatureFit_pipeline] no organ provided, so using the reference signatures appropriate for the signature version ",signature_version," and mutation type ",mtype_catalogues,".")
        if(mtype_catalogues == c("subs")){
          signatures <- referenceSignaturesSBSv2.03
        }else if(mtype_catalogues == c("DNV")){
          signatures <- referenceSignaturesDBSv1.01
        }else if(mtype_catalogues == c("rearr")){
          # --- 
        }else{
          message("[error signatureFit_pipeline] mutation type ",mtype_catalogues," not available for automatic selection of signatures for signature_version RefSigv2. ",
                  "Please provide your own signatures using the signatures parameter, and also rare_signatures if using FitMS.")
          return(returnObj)
        }
      }
    }else if(signature_version=="RefSigv1"){
      if(!is.null(organ)){
        # an organ was requested.
        if(fit_method=="Fit"){
          if(mtype_catalogues %in% c("subs","rearr")){
            signatures <- getOrganSignatures(organ = organ,version = 1,typemut = mtype_catalogues)
            # need to check that the getOrganSignatures functions returned something
            if(is.null(signatures)){
              message("[error signatureFit_pipeline] RefSigv1 signatures not available for mutation type ",mtype_catalogues,
                      " and organ ",organ,".")
              return(returnObj)
            }
            if(commonSignatureTier=="T2"){
              # need to convert
              convertedNames <- convertSigNamesFromOrganToRefSigs(colnames(signatures),typemut = mtype_catalogues)
              if(mtype_catalogues == "subs"){
                signatures <- RefSigv1_subs[,convertedNames,drop=F]
              }else {
                signatures <- RefSigv1_rearr[,convertedNames,drop=F]
              }
            }else if(commonSignatureTier=="T3"){
              convertedNames <- convertSigNamesFromOrganToRefSigs(colnames(signatures),typemut = mtype_catalogues,mixedOnly = TRUE)
              if(mtype_catalogues == "subs"){
                combinedSigs <- cbind(RefSigv1_subs,all_organ_sigs_subs)
                signatures <- combinedSigs[,convertedNames,drop=F]
              }else {
                combinedSigs <- cbind(RefSigv1_rearr,all_organ_sigs_rearr)
                signatures <- combinedSigs[,convertedNames,drop=F]
              }
            }

          }else if(mtype_catalogues == "DNV"){
            # ---
          }else{
            message("[error signatureFit_pipeline] mutation type ",mtype_catalogues," not available for automatic selection of signatures. ",
                    "Please provide your own signatures using the signatures parameter, and also rare_signatures if using FitMS.")
            return(returnObj)
          }
        }
      }else{
        # user did not provide an organ, use reference signatures and perhaps signature_names
        message("[info signatureFit_pipeline] no organ provided, so using the reference signatures appropriate for the signature version ",signature_version," and mutation type ",mtype_catalogues,".")
        if(mtype_catalogues == c("subs")){
          signatures <- RefSigv1_subs
        }else if(mtype_catalogues == c("DNV")){
          # ---
        }else if(mtype_catalogues == c("rearr")){
          signatures <- RefSigv1_rearr
        }else{
          message("[error signatureFit_pipeline] mutation type ",mtype_catalogues," not available for automatic selection of signatures. ",
                  "Please provide your own signatures using the signatures parameter, and also rare_signatures if using FitMS.")
          return(returnObj)
        }
      }
    }else if(signature_version=="COSMICv2"){
      if(!is.null(organ)){
        # TODO
        message("[error signatureFit_pipeline] organ signatures selection for COSMICv2 signatures not implemented yet. ",
                "Leave organ=NULL and select signatures manually with the signature_names parameter.")
        return(returnObj)

      }else{
        if(mtype_catalogues == c("subs")){
          signatures <- cosmic30
        }else{
          message("[error signatureFit_pipeline] using COSMICv2 signatures for ",mtype_catalogues," currenty not supported. ",
                  "Please provide your own signatures using the signatures parameter, and also rare_signatures if using FitMS.")
          return(returnObj)
        }

      }
    }else if(signature_version=="COSMICv3.2"){
      if(!is.null(organ)){
        # TODO
        message("[error signatureFit_pipeline] organ signatures selection for COSMICv3.2 signatures not implemented yet. ",
                "Leave organ=NULL and select signatures manually with the signature_names parameter.")
        return(returnObj)

      }else{
        if(mtype_catalogues == c("subs")){
          signatures <- COSMIC_v3.2_SBS_GRCh37
        }else{
          message("[error signatureFit_pipeline] using COSMICv2 signatures for ",mtype_catalogues," currenty not supported. ",
                  "Please provide your own signatures using the signatures parameter, and also rare_signatures if using FitMS.")
          return(returnObj)
        }
      }
    }else{
      message("[error signatureFit_pipeline] signature version ",signature_version," unrecognised.")
      return(returnObj)
    }
  }


  # signature selection using signature_names
  if(!is.null(signature_names) & !is.null(signatures) & is.null(organ)){
    sigsToUseNames <- colnames(signatures)
    if(fit_method=="Fit"){
      # check that the names make sense
      if(!all(signature_names %in% sigsToUseNames)){
        message("[error signatureFit_pipeline] using reference signatures for ",mtype_catalogues," and signature_version ",signature_version,
                ", but some signature names provided ",
                "with the signature_names parameters do not seem to match the reference signatures. This is the list of ",
                "reference signatures available: ",paste(sigsToUseNames,collapse = ", "),".")
        return(returnObj)
      }
      # selected signatures only
      signatures <- signatures[,signature_names,drop=F]
    }else if(fit_method=="FitMS"){
      message("[warning signatureFit_pipeline] using FitMS, parameter signature_names will be ignored.")
    }

  }


  # Now, if FitMS was requested, and signature_type==NULL, then I think I can just call FitMS,
  # FitMS itself will take care of whether some of the parameters are invalid
  if(fit_method=="Fit"){
    # when running Fit, the parameter signatures cannot be NULL or an empty table, so check again
    if(is.null(signatures)){
      message("[error signatureFit_pipeline] Trying to run Fit but could not determine the signatures to use. Please check your settings.")
      return(returnObj)
    }else{
      if(ncol(signatures)==0){
        message("[error signatureFit_pipeline] Trying to run Fit but could not determine the signatures to use. Please check your settings.")
        return(returnObj)
      }
    }
    message("[info signatureFit_pipeline] all set, running Fit.")
    fitRes <- Fit(catalogues = catalogues,
                  signatures = signatures,
                  exposureFilterType = exposureFilterType,
                  giniThresholdScaling = giniThresholdScaling,
                  giniThresholdScaling_nmuts = giniThresholdScaling_nmuts,
                  threshold_percent = threshold_percent,
                  threshold_nmuts = threshold_nmuts,
                  method = optimisation_method,
                  useBootstrap = useBootstrap,
                  nboot = nboot,
                  threshold_p.value = threshold_p.value,
                  nparallel = nparallel,
                  randomSeed = randomSeed,
                  verbose = verbose)
    fitRes$commonSignatureTier <- commonSignatureTier
    signaturesUsed <- fitRes$signatures

  }else if(fit_method=="FitMS"){
    message("[info signatureFit_pipeline] all set, running FitMS.")
    fitRes <- FitMS(catalogues = catalogues,
                    organ = organ,
                    commonSignatureTier = commonSignatureTier,
                    rareSignatureTier = rareSignatureTier,
                    commonSignatures = signatures,
                    rareSignatures = rare_signatures,
                    method = optimisation_method,
                    exposureFilterType = exposureFilterType,
                    threshold_percent = threshold_percent,
                    threshold_nmuts = threshold_nmuts,
                    giniThresholdScaling = giniThresholdScaling,
                    giniThresholdScaling_nmuts = giniThresholdScaling_nmuts,
                    multiStepMode = multiStepMode,
                    residualNegativeProp = residualNegativeProp,
                    minResidualMutations = minResidualMutations,
                    minCosSimRareSig = minCosSimRareSig,
                    minErrorReductionPerc = minErrorReductionPerc,minCosSimIncrease = minCosSimIncrease,
                    useBootstrap = useBootstrap,
                    nboot = nboot,
                    threshold_p.value = threshold_p.value,
                    maxRareSigsPerSample = maxRareSigsPerSample,
                    rareCandidateSelectionCriteria = rareCandidateSelectionCriteria,
                    nparallel = nparallel,
                    randomSeed = randomSeed,
                    verbose = verbose)
      signaturesUsed <- fitRes$commonSignatures
      if(!is.null(fitRes$rareSignatures)) signaturesUsed <- cbind(signaturesUsed,fitRes$rareSignatures)

  }else{
    message("[error signatureFit_pipeline] unknown fit_method ",fit_method,".")
    return(returnObj)
  }
  
  # check if we can annotate the mutations with signature probability
  if(!is.null(returnObj$annotated_mutations)){
    annotated_mutations <- NULL
    for(sampleName in colnames(catalogues)){
      annotated_mutations <- rbind(annotated_mutations,
                                   assignSignatureProbabilityToMutations(sampleMutations = returnObj$annotated_mutations[returnObj$annotated_mutations$sampleName==sampleName,,drop=F],
                                                                         sampleSigsExposures = fitRes$exposures[sampleName,,drop=F],
                                                                         signatures = signaturesUsed))
    }
    returnObj$annotated_mutations <- annotated_mutations
  }

  message("[info signatureFit_pipeline] signatureFit pipeline completed!")

  # now just return the results
  returnObj$fitResults <- fitRes
  return(returnObj)

}
Nik-Zainal-Group/signature.tools.lib documentation built on April 13, 2025, 5:50 p.m.