signatureFit_pipeline: Signature fit pipeline

View source: R/signatureFit_pipeline.R

signatureFit_pipeline    R Documentation

Signature fit pipeline

Description

This function is the main interface for computing signature fit using the signature.tools.lib R package.

Usage

signatureFit_pipeline(
  catalogues = NULL,
  genome.v = "hg19",
  organ = NULL,
  SNV_vcf_files = NULL,
  SNV_tab_files = NULL,
  DNV_vcf_files = NULL,
  DNV_tab_files = NULL,
  SV_bedpe_files = NULL,
  signatures = NULL,
  rare_signatures = NULL,
  signature_version = "RefSigv2",
  signature_names = NULL,
  fit_method = "FitMS",
  optimisation_method = "KLD",
  useBootstrap = FALSE,
  nboot = 200,
  exposureFilterType = "fixedThreshold",
  threshold_percent = 5,
  threshold_nmuts = -1,
  giniThresholdScaling = 10,
  giniThresholdScaling_nmuts = -1,
  multiStepMode = "errorReduction",
  threshold_p.value = 0.05,
  commonSignatureTier = "T1",
  rareSignatureTier = "T2",
  residualNegativeProp = 0.003,
  minResidualMutations = NULL,
  minCosSimRareSig = 0.8,
  minErrorReductionPerc = 15,
  minCosSimIncrease = 0.02,
  maxRareSigsPerSample = 1,
  rareCandidateSelectionCriteria = "MinError",
  noFit = FALSE,
  nparallel = 1,
  randomSeed = NULL,
  verbose = FALSE
)

Arguments

catalogues

catalogues matrix, samples as columns, channels as rows. The mutation type of the catalogue will be inferred automatically by checking the rownames.
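As a rough illustration of the expected layout, the sketch below builds a small 96-channel substitution catalogue in base R. The sample names and counts are made up, and the row-name format "A[C>A]A" is an assumption about how channels are labelled; check it against a catalogue produced by the package before relying on it.

```r
# Hypothetical example data: two samples, 96 substitution channels.
bases <- c("A", "C", "G", "T")
subs  <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")

# Assumed channel label format: 5' base, [substitution], 3' base.
grid <- expand.grid(three = bases, five = bases, sub = subs,
                    stringsAsFactors = FALSE)
channels <- paste0(grid$five, "[", grid$sub, "]", grid$three)

# Made-up counts, one column per sample.
set.seed(1)
catalogues <- matrix(rpois(96 * 2, lambda = 20), nrow = 96,
                     dimnames = list(channels, c("sampleA", "sampleB")))

dim(catalogues)  # channels as rows, samples as columns
```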

genome.v

either "hg38" (will load BSgenome.Hsapiens.UCSC.hg38), "hg19" (will load BSgenome.Hsapiens.1000genomes.hs37d5), "mm10" (will load BSgenome.Mmusculus.UCSC.mm10) or "canFam3" (will load BSgenome.Cfamiliaris.UCSC.canFam3).

organ

If signatures is not specified, then use this parameter to provide an organ name to automatically select appropriate signatures. Organ names and signature selection depend on the signature_version provided. When using RefSigv1 or RefSigv2 as signature_version, organ-specific signatures will be used. Use one of the following organs: "Biliary", "Bladder", "Bone_SoftTissue", "Breast", "Cervix" (v1 only), "CNS", "Colorectal", "Esophagus", "Head_neck", "Kidney", "Liver", "Lung", "Lymphoid", "NET" (v2 only), "Oral_Oropharyngeal" (v2 only), "Ovary", "Pancreas", "Prostate", "Skin", "Stomach", "Uterus". Alternatively, set this to "Other" to use a curated set of common and rare signatures. If COSMICv2 or COSMICv3.2 are used, signatures are selected if they were found in the given organ/dataset. The mutation type is automatically inferred from the catalogue.

SNV_vcf_files

list of file names corresponding to SNV VCF files to be used to construct 96-channel substitution catalogues. This should be a named vector, where the names indicate the sample name.

SNV_tab_files

list of file names corresponding to SNV TAB files to be used to construct 96-channel substitution catalogues. This should be a named vector, where the names indicate the sample name. The files should contain a header in the first line with the following columns: chr, position, REF, ALT.
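A minimal sketch of preparing such an input in base R follows. The variants, sample name and file location are invented for illustration, and the use of a tab separator is an assumption based on the "TAB" file type.

```r
# Hypothetical variants for one sample; column names follow the header described above.
snvs <- data.frame(chr = c("1", "2"),
                   position = c(12345L, 67890L),
                   REF = c("C", "T"),
                   ALT = c("T", "G"),
                   stringsAsFactors = FALSE)

# Write a tab-separated file with the header on the first line (assumed format).
tab_path <- file.path(tempdir(), "sampleA_snvs.tab")
write.table(snvs, tab_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Named character vector: names are sample names, values are file paths.
SNV_tab_files <- c(sampleA = tab_path)
```

The same named-vector shape applies to the VCF and BEDPE file parameters.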

DNV_vcf_files

list of file names corresponding to SNV/DNV VCF files to be used to construct double nucleotide variant (DNV) catalogues. Adjacent SNVs will be combined into DNVs. This should be a named vector, where the names indicate the sample name.

DNV_tab_files

list of file names corresponding to SNV/DNV TAB files to be used to construct double nucleotide variant (DNV) catalogues. Adjacent SNVs will be combined into DNVs. This should be a named vector, where the names indicate the sample name. The files should contain a header in the first line with the following columns: chr, position, REF, ALT.

SV_bedpe_files

list of file names corresponding to SV (rearrangement) BEDPE files to be used to construct 32-channel rearrangement catalogues. This should be a named vector, where the names indicate the sample name. Each row should describe one rearrangement (the two breakpoint positions on a single row, as determined by a pair of mates of paired-end sequencing). The files should already be filtered according to user preference, because all rearrangements in a file are used and no filter is applied. The first line should be a header with the following columns: "chrom1", "start1", "end1", "chrom2", "start2", "end2" and "sample" (sample name). In addition, the file needs either two columns indicating the strands of the mates, "strand1" (+ or -) and "strand2" (+ or -), or one column indicating the structural variant class, "svclass" (translocation, inversion, deletion or tandem-duplication). The column "svclass" should follow the Sanger BRASS convention: inversion (strands +/- or -/+ and mates on the same chromosome), deletion (strands +/+ and mates on the same chromosome), tandem-duplication (strands -/- and mates on the same chromosome), translocation (mates on different chromosomes).
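The strand convention above can be sketched as a small base R helper. The function name svclass_from_strands and the example rows are hypothetical and not part of the package; the sketch only restates the mapping listed in this argument's description.

```r
# Hypothetical helper: derive svclass from mate chromosomes and strands,
# following the convention described above.
svclass_from_strands <- function(chrom1, chrom2, strand1, strand2) {
  ifelse(chrom1 != chrom2, "translocation",
  ifelse(strand1 == "+" & strand2 == "+", "deletion",
  ifelse(strand1 == "-" & strand2 == "-", "tandem-duplication",
         "inversion")))
}

# Made-up rearrangements with the required BEDPE columns.
bedpe <- data.frame(
  chrom1 = c("1", "1", "3", "5"),
  start1 = c(100, 500, 900, 50),
  end1   = c(101, 501, 901, 51),
  chrom2 = c("1", "1", "3", "7"),
  start2 = c(900, 800, 950, 70),
  end2   = c(901, 801, 951, 71),
  sample = "sampleA",
  strand1 = c("+", "-", "+", "+"),
  strand2 = c("+", "-", "-", "+"),
  stringsAsFactors = FALSE
)

bedpe$svclass <- svclass_from_strands(bedpe$chrom1, bedpe$chrom2,
                                      bedpe$strand1, bedpe$strand2)
```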

signatures

signatures should be a matrix or dataframe, signatures as columns, channels as rows. The mutation type of the signatures will be inferred automatically by checking the rownames. Use this parameter only if you want to use your own signatures. Leave NULL if you want to use the signatures provided by the package, for example by specifying a specific organ or signature_version.

rare_signatures

used only when fit_method=FitMS and the signatures parameter is also given. The parameter rare_signatures should be a matrix or dataframe, signatures as columns, channels as rows. The mutation type of the signatures will be inferred automatically by checking the rownames.

signature_version

either "COSMICv2", "COSMICv3.2", "RefSigv1" or "RefSigv2". If not specified, "RefSigv2" will be used. The mutation type is automatically inferred from the catalogue.

signature_names

if no signatures have been provided using the signatures and rare_signatures parameters, and if no organ is specified, then signature_names can be used to specify a list of signature names, which should match the corresponding mutation type (inferred automatically) and reference signatures requested using the signature_version parameter.

fit_method

either Fit or FitMS. Note that automatic selection of signatures in FitMS is currently available only for SNV mutations or catalogues, with signature_version=RefSigv2 and an organ specified. Alternatively, FitMS can be used by specifying both the signatures parameter (which will be treated as the common signatures) and the rare_signatures parameter.

optimisation_method

either KLD (Kullback-Leibler divergence) or NNLS (non-negative least squares)

useBootstrap

set to TRUE to use bootstrap

nboot

number of bootstraps to use; more bootstraps give more accurate results
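To give a feel for what nboot controls, the sketch below draws bootstrap replicates of a catalogue in base R. Resampling the mutations multinomially from the observed channel proportions is an assumption made for illustration; the package's own resampling scheme may differ. The four-channel catalogue is made up.

```r
set.seed(42)

# Made-up catalogue with four channels, for illustration only.
catalogue <- c(120, 40, 300, 60)
nboot <- 200

# Assumed scheme: redraw the same total number of mutations
# from the observed channel proportions, once per bootstrap.
boots <- rmultinom(nboot, size = sum(catalogue),
                   prob = catalogue / sum(catalogue))

dim(boots)  # one row per channel, one column per bootstrap replicate
```

Each column would then be fitted separately, and the spread of the resulting exposures gives an empirical picture of their uncertainty.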

exposureFilterType

either fixedThreshold or giniScaledThreshold. When using fixedThreshold, exposures will be removed based on a fixed percentage with respect to the total number of mutations (threshold_percent will be used). When using giniScaledThreshold, each signature will use a different threshold calculated as (1-Gini(signature))*giniThresholdScaling.

threshold_percent

threshold in percentage of total mutations in a sample, only exposures larger than threshold are considered. Set it to -1 to deactivate.

threshold_nmuts

threshold in number of mutations in a sample, only exposures larger than threshold are considered. Set it to -1 to deactivate.
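The two fixed thresholds can be pictured with a few lines of base R. This is only a sketch with made-up exposures: treating the sum of exposures as the sample's total mutations, treating any negative threshold as deactivated, and setting filtered exposures to zero are all assumptions, not a description of the package internals.

```r
# Made-up exposures (number of mutations per signature) for one sample.
exposures <- c(SBS1 = 400, SBS2 = 30, SBS5 = 550, SBS13 = 20)
threshold_percent <- 5
threshold_nmuts <- -1   # deactivated

total <- sum(exposures)  # assumed stand-in for total mutations in the sample
keep <- rep(TRUE, length(exposures))

# Apply each threshold only if it is active (assumed: negative means off).
if (threshold_percent >= 0) keep <- keep & (exposures / total * 100 > threshold_percent)
if (threshold_nmuts >= 0)   keep <- keep & (exposures > threshold_nmuts)

# Assumed handling: exposures that fail a threshold are set to zero.
filtered <- exposures
filtered[!keep] <- 0
filtered
```

Here SBS2 (3%) and SBS13 (2%) fall below the 5% threshold, so only SBS1 and SBS5 remain.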

giniThresholdScaling

scaling factor for the threshold type giniScaledThreshold, which is based on the Gini score of a signature. The threshold is computed as (1-Gini(signature))*giniThresholdScaling and is interpreted as the percentage of mutations in a sample that the exposure of "signature" needs to exceed. Set it to -1 to deactivate.

giniThresholdScaling_nmuts

scaling factor for the threshold type giniScaledThreshold, which is based on the Gini score of a signature. The threshold is computed as (1-Gini(signature))*giniThresholdScaling_nmuts and is interpreted as the number of mutations in a sample that the exposure of "signature" needs to exceed. Set it to -1 to deactivate.
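The Gini-scaled threshold formula can be worked through in base R. The gini() helper below implements the standard Gini coefficient and is an assumption; the package may compute its Gini score differently. Both signatures are made up.

```r
# Assumed definition: standard Gini coefficient of the channel values.
gini <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Two made-up 96-channel signatures: perfectly flat and a single peak.
flat_sig  <- rep(1 / 96, 96)
spiky_sig <- c(1, rep(0, 95))

giniThresholdScaling <- 10

# Threshold formula as given in the argument description.
thr_flat  <- (1 - gini(flat_sig))  * giniThresholdScaling
thr_spiky <- (1 - gini(spiky_sig)) * giniThresholdScaling
```

Under this definition a flat signature gets the full threshold (10% of mutations), while a signature concentrated in one channel gets a much lower one (10/96, about 0.1%), so distinctive signatures are retained at smaller exposures.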

multiStepMode

this is a FitMS parameter. Use one of the following: "constrainedFit", "partialNMF", "errorReduction", or "cossimIncrease".

threshold_p.value

p-value to determine whether an exposure is above the threshold_percent. In other words, this is the empirical probability that the exposure is lower than the threshold
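As a sketch of the empirical probability described here, the following base R snippet uses made-up bootstrap exposures for a single signature. Whether the comparisons are strict or non-strict in the package is not stated, so the operators used are assumptions.

```r
# Made-up exposures (percent of mutations) of one signature across 10 bootstraps.
boot_percent <- c(2.1, 6.3, 7.8, 4.9, 8.2, 6.6, 5.5, 9.0, 3.7, 7.1)
threshold_percent <- 5
threshold_p.value <- 0.05

# Empirical probability that the exposure is at or below the threshold.
p_below <- mean(boot_percent <= threshold_percent)

# Assumed rule: keep the exposure only if that probability is small enough.
passes <- p_below <= threshold_p.value
```

In this example three of the ten bootstraps fall at or below 5%, so p_below is 0.3 and the exposure would not pass at threshold_p.value = 0.05.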

commonSignatureTier

either T1, T2 or T3. The default is T1. For each organ, T1 uses the common organ-specific signatures, while T2 uses the corresponding reference signatures. In general, T1 should be more appropriate for organs where there are no mixed organ-specific signatures, e.g. GEL-Ovary_common_SBS1+18, while T2 might be more suitable when such mixed signatures are present, so that each signature can be fitted, e.g. fitting the two signatures SBS1 and SBS18 instead of a single GEL-Ovary_common_SBS1+18. T3 is an intermediate option between T1 and T2, where only the mixed organ signatures are replaced with the corresponding reference signatures. This parameter affects both the organ signatures used in Fit and the common signatures used in FitMS.

rareSignatureTier

either T0, T1, T2, T3 or T4. The default is T2. For each organ, T0 are rare signatures that were observed in the requested organ, including low quality signatures (QC amber and red signatures). T1 are high quality (QC green) rare signatures that were observed in the requested organ. T2-T4 extend the rare signature set to what has also been observed in other organs. T2 includes all QC green signatures found in other organs, with the additional restriction in the case of SBS that the additional signatures were classified as rare at least twice in Degasperi et al. 2022 Science. T3 includes all QC green signatures (if not SBS, T3=T2). T4 includes all signatures, including QC amber and red. In general we advise using the rare T2 tier.

residualNegativeProp

maximum proportion of mutations (w.r.t. total mutations in a sample) that can be in the negative part of a residual in the constrained least squares fit, used when multiStepMode=constrainedFit

minResidualMutations

minimum number of mutations in a residual when using constrainedFit or partialNMF. Deactivated by default.

minCosSimRareSig

minimum cosine similarity between a residual and a rare signature for considering the rare signature as a candidate for a sample when using constrainedFit or partialNMF
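The cosine similarity check can be illustrated in base R. The helper name cos_sim and the four-channel vectors are hypothetical; only the 0.8 default comes from the usage above.

```r
# Cosine similarity between two non-negative vectors.
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Made-up residual (mutations left unexplained by the common signatures)
# and a made-up rare signature, both over four channels.
residual <- c(10, 0, 5, 0)
rare_sig <- c(0.6, 0.05, 0.3, 0.05)

minCosSimRareSig <- 0.8
similarity <- cos_sim(residual, rare_sig)

# The rare signature is a candidate if the residual resembles it closely enough.
is_candidate <- similarity >= minCosSimRareSig
```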

minErrorReductionPerc

minimum percentage of error reduction for a signature to be considered as candidate when using the errorReduction method. The error is computed as mean absolute deviation
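A worked example of the error reduction criterion in base R follows. The catalogue and the two reconstructions are made up, and comparing the fit with common signatures only against the fit with one rare signature added is an assumption about which two errors are compared.

```r
# Mean absolute deviation between observed and reconstructed channel counts.
mad_error <- function(observed, reconstructed) mean(abs(observed - reconstructed))

# Made-up four-channel catalogue and two reconstructions.
catalogue     <- c(120, 40, 300, 60)
fit_common    <- c(100, 50, 260, 70)  # common signatures only
fit_with_rare <- c(118, 42, 295, 61)  # common signatures plus one rare signature

err_common <- mad_error(catalogue, fit_common)
err_rare   <- mad_error(catalogue, fit_with_rare)

# Percentage of error removed by adding the rare signature.
reduction_perc <- 100 * (err_common - err_rare) / err_common

minErrorReductionPerc <- 15
is_candidate <- reduction_perc >= minErrorReductionPerc
```

Here the error drops from 20 to 2.5, a reduction of 87.5%, so the rare signature would count as a candidate under the default of 15.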

minCosSimIncrease

minimum cosine similarity increase for a signature to be considered as candidate when using the cossimIncrease method

maxRareSigsPerSample

maximum number of rare signatures that should be searched in each sample. In most situations, leaving this at 1 should be enough.

rareCandidateSelectionCriteria

MaxCosSim or MinError. FitMS parameter. Whenever there is more than one rare signature that passes the multiStepMode criteria, then the best candidate rare signature is automatically selected using the rareCandidateSelectionCriteria. Candidate rare signatures can be manually selected using the function fitMerge. The parameter rareCandidateSelectionCriteria is set to MinError by default. Error is computed as the mean absolute deviation of channels.
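The two selection criteria can be sketched in base R as follows. The candidate table, its column names and the select_candidate helper are hypothetical and only illustrate how the two criteria can pick different signatures.

```r
# Made-up candidates that all passed the multiStepMode criteria for one sample.
candidates <- data.frame(
  signature = c("rareA", "rareB", "rareC"),
  error     = c(2.5, 1.9, 3.2),     # mean absolute deviation of channels
  cossim    = c(0.97, 0.95, 0.98),  # cosine similarity of the fit
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the two documented criteria.
select_candidate <- function(cands, criteria = c("MinError", "MaxCosSim")) {
  criteria <- match.arg(criteria)
  idx <- if (criteria == "MinError") which.min(cands$error) else which.max(cands$cossim)
  cands$signature[idx]
}

best_by_error  <- select_candidate(candidates, "MinError")
best_by_cossim <- select_candidate(candidates, "MaxCosSim")
```

With these numbers MinError selects rareB while MaxCosSim selects rareC, which is the kind of disagreement where a manual choice may be worth considering.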

noFit

if TRUE, terminate the pipeline early without running signature Fit. This is useful if one only wants to generate catalogues from mutation lists.

nparallel

set to a value greater than 1 to run in parallel

randomSeed

set an integer random seed

verbose

use FALSE to suppress messages

Details

The pipeline will produce some feedback in the form of info, warning, and error messages. Please check the output to see whether everything worked as planned.

Value

returns the fit object with activities/exposures of the signatures in the given sample and other information

Examples

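library(signature.tools.lib)
# name the organ argument: the second positional argument is genome.v, not organ
res <- signatureFit_pipeline(catalogues, organ = "Breast")
plotFitResults(res$fitResults, "results/")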
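When starting from mutation files rather than a ready-made catalogue, a call might be assembled along the following lines. This is a sketch that uses only argument names from the Usage section; the file path and sample name are hypothetical, and the call is skipped unless the package is installed and the file exists.

```r
# Arguments taken from the Usage section; values are illustrative.
pipeline_args <- list(
  SNV_vcf_files = c(sampleA = "path/to/sampleA.vcf"),  # hypothetical path
  genome.v      = "hg19",
  organ         = "Breast",
  fit_method    = "FitMS",
  useBootstrap  = TRUE,
  nboot         = 200,
  randomSeed    = 1
)

# Run only when the package and the input file are actually available.
if (requireNamespace("signature.tools.lib", quietly = TRUE) &&
    all(file.exists(pipeline_args$SNV_vcf_files))) {
  res <- do.call(signature.tools.lib::signatureFit_pipeline, pipeline_args)
}
```

Naming every argument avoids the positional mix-up between genome.v and organ.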

Nik-Zainal-Group/signature.tools.lib documentation built on April 13, 2025, 5:50 p.m.