HRDetect_pipeline: HRDetect Pipeline

HRDetect_pipeline R Documentation

HRDetect Pipeline

Description

Run the HRDetect pipeline. This function allows flexible input specification for the HRDetect pipeline, which computes the HRDetect BRCAness probability score as published in Davies et al. 2017. The main input is a data frame, data_matrix, which contains a sample in each row and one of six necessary features in each column. The six features can be computed by the pipeline if the necessary input files are provided. The six features are:

1) proportion of deletions at microhomology (del.mh.prop)
2) number of mutations of substitution signature 3 (SNV3)
3) number of mutations of rearrangement signature 3 (SV3)
4) number of mutations of rearrangement signature 5 (SV5)
5) HRD LOH index (hrd)
6) number of mutations of substitution signature 8 (SNV8)

Features that have already been calculated can be combined with features that still need to be computed. For example, if the HRD LOH index has already been calculated, it can be added to the input data_matrix, or if the SNV catalogues have already been calculated, they can be supplied using the SNV_catalogues parameter while setting the SNV3 and SNV8 columns to NA. It is also possible to provide different data for different samples. For example, one can provide the SNV3 and SNV8 number of mutations for some samples in data_matrix, while setting SNV3 and SNV8 to NA for other samples and providing SNV catalogues and/or SNV VCF files for these samples.

The function will return the HRDetect BRCAness probability score for all the samples for which enough data are available to calculate all six necessary features. Along with the score, the contribution of each feature to the score will be provided. In addition, an updated data_matrix and other data that have been calculated during the execution of the pipeline will be returned as well. The input data_matrix can also be omitted, in which case the required HRDetect features will be computed from the other input files. Signature fit with bootstrap and bootstrap HRDetect scores can be requested using the bootstrapSignatureFit and bootstrapHRDetectScores parameters.

Usage

HRDetect_pipeline(
  data_matrix = NULL,
  genome.v = "hg19",
  SNV_catalogues = NULL,
  SV_catalogues = NULL,
  SNV_vcf_files = NULL,
  SNV_tab_files = NULL,
  Indels_vcf_files = NULL,
  Indels_tab_files = NULL,
  CNV_tab_files = NULL,
  SV_bedpe_files = NULL,
  organ = NULL,
  SNV_signature_version = "RefSigv2",
  SV_signature_version = "RefSigv1",
  SNV_signature_names = NULL,
  SV_signature_names = NULL,
  rareCandidateSelectionCriteria = "MinError",
  subs_fit_obj = NULL,
  rearr_fit_obj = NULL,
  customNameSNV3 = NULL,
  customNameSNV8 = NULL,
  customNameSV3 = NULL,
  customNameSV5 = NULL,
  SNV_commonSignatureTier = "T1",
  SV_commonSignatureTier = "T1",
  SNV_rareSignatureTier = "T2",
  optimisation_method = "KLD",
  exposureFilterTypeFit = "fixedThreshold",
  giniThresholdScalingFit = 10,
  giniThresholdScaling_nmutsFit = -1,
  threshold_percentFit = 5,
  threshold_nmutsFit = -1,
  bootstrapSignatureFit = TRUE,
  nbootFit = 100,
  threshold_p.valueFit = 0.05,
  bootstrapHRDetectScores = FALSE,
  SNV_maxRareSigs = 1,
  nparallel = 1,
  randomSeed = NULL
)

Arguments

data_matrix

data frame containing a sample for each row and the six necessary features as columns. Columns should be labelled with the following names: del.mh.prop, SNV3, SV3, SV5, hrd, SNV8. Row names of the data frame should correspond to the sample names. If the values of the features need to be computed, set them to NA and provide additional data (e.g. catalogues, VCF/BEDPE/TAB files as specified in this documentation page).
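
As an illustration of the layout described above, the following sketch builds a data_matrix in base R with some features provided and the rest set to NA. The sample names and feature values are made up, and the helper variable missing_features is only for illustration; it is not part of the package.

```r
# Hypothetical sample names; replace with your own.
sample_names <- c("sample1", "sample2")
feature_names <- c("del.mh.prop", "SNV3", "SV3", "SV5", "hrd", "SNV8")

# Start with every feature set to NA, one row per sample.
data_matrix <- as.data.frame(
  matrix(NA_real_,
         nrow = length(sample_names), ncol = length(feature_names),
         dimnames = list(sample_names, feature_names))
)

# sample1 already has an HRD-LOH index and SNV exposures (made-up values).
data_matrix["sample1", c("hrd", "SNV3", "SNV8")] <- c(15, 1200, 300)

# List, for each sample, the features that still have to be computed
# from catalogues or VCF/BEDPE/TAB files.
missing_features <- lapply(sample_names, function(s) {
  feature_names[is.na(unlist(data_matrix[s, ]))]
})
names(missing_features) <- sample_names
print(missing_features)
```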

genome.v

genome version to use when constructing the SNV catalogue and classifying indels. Set it to either "hg19" or "hg38".

SNV_catalogues

data frame containing 96-channel substitution catalogues. A sample for each column and the 96-channels as rows. Row names should have the correct channel names (see for example tests/testthat/test.snv.tab) and the column names should be the sample names so that each catalogue can be matched with the corresponding row in the data_matrix input.
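
A minimal sketch of an empty 96-channel catalogue with one column per sample follows. The channel naming used here (for example "A[C>A]A") and the row order are assumptions; check them against tests/testthat/test.snv.tab before relying on them.

```r
# Assumed channel naming: 5' base, [REF>ALT], 3' base, e.g. "A[C>A]A".
bases <- c("A", "C", "G", "T")
subs <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")

# expand.grid varies its first argument fastest, so the 3' base changes first,
# then the 5' base, then the substitution type.
grid <- expand.grid(three = bases, five = bases, sub = subs,
                    stringsAsFactors = FALSE)
channels <- paste0(grid$five, "[", grid$sub, "]", grid$three)

# One column per sample (hypothetical sample names), counts initialised to zero.
SNV_catalogues <- data.frame(
  sample1 = rep(0L, length(channels)),
  sample2 = rep(0L, length(channels)),
  row.names = channels
)
```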

SV_catalogues

data frame containing 32-channel rearrangement catalogues. A sample for each column and the 32-channels as rows. Row names should have the correct channel names (see for example tests/testthat/test.cat) and the column names should be the sample names so that each catalogue can be matched with the corresponding row in the data_matrix input.

SNV_vcf_files

list of file names corresponding to SNV VCF files to be used to construct 96-channel substitution catalogues. This should be a named vector, where the names indicate the sample name, so that each file can be matched to the corresponding row in the data_matrix input. The files should only contain SNV and should already be filtered according to the user preference, as all SNV in the file will be used and no filter will be applied.

SNV_tab_files

list of file names corresponding to SNV TAB files to be used to construct 96-channel substitution catalogues. This should be a named vector, where the names indicate the sample name, so that each file can be matched to the corresponding row in the data_matrix input. The files should only contain SNV and should already be filtered according to the user preference, as all SNV in the file will be used and no filter will be applied. The files should contain a header in the first line with the following columns: chr, position, REF, ALT.
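
The sketch below writes a small SNV TAB file with the required header and builds the named vector of file names. It assumes the files are tab-separated; the sample name, file name, and variants are made up.

```r
# Made-up variants with the four required columns.
snv_table <- data.frame(
  chr = c("1", "2"),
  position = c(12345L, 67890L),
  REF = c("C", "T"),
  ALT = c("T", "G")
)

# Write a tab-separated file with the header in the first line.
snv_file <- file.path(tempdir(), "sample1_snv.tab")
write.table(snv_table, snv_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Named vector: the name is the sample name, the value is the file path.
SNV_tab_files <- c(sample1 = snv_file)
```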

Indels_vcf_files

list of file names corresponding to Indels VCF files to be used to classify Indels and compute the proportion of indels at micro-homology. This should be a named vector, where the names indicate the sample name, so that each file can be matched to the corresponding row in the data_matrix input. The files should only contain indels (no SNV) and should already be filtered according to the user preference, as all indels in the file will be used and no filter will be applied.

Indels_tab_files

list of file names corresponding to Indels TAB files to be used to classify Indels and compute the proportion of indels at micro-homology. This should be a named vector, where the names indicate the sample name, so that each file can be matched to the corresponding row in the data_matrix input. The files should only contain indels (no SNV) and should already be filtered according to the user preference, as all indels in the file will be used and no filter will be applied. Each file should contain indels from a single sample and at least the following columns: chr, position, REF, ALT.

CNV_tab_files

list of file names corresponding to CNV TAB files (similar to ASCAT format) to be used to compute the HRD-LOH index. This should be a named vector, where the names indicate the sample name, so that each file can be matched to the corresponding row in the data_matrix input. The files should contain a header in the first line with the following columns: 'seg_no', 'Chromosome', 'chromStart', 'chromEnd', 'total.copy.number.inNormal', 'minor.copy.number.inNormal', 'total.copy.number.inTumour', 'minor.copy.number.inTumour'
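
Before running the pipeline it can help to check that a CNV file has all the columns listed above. The helper missing_cnv_columns below is hypothetical (it is not part of the package) and assumes tab-separated files; the segment values are made up.

```r
required_cnv_columns <- c(
  "seg_no", "Chromosome", "chromStart", "chromEnd",
  "total.copy.number.inNormal", "minor.copy.number.inNormal",
  "total.copy.number.inTumour", "minor.copy.number.inTumour"
)

# Hypothetical helper: returns the required columns absent from the file header.
missing_cnv_columns <- function(cnv_file) {
  header <- names(read.table(cnv_file, header = TRUE, sep = "\t",
                             nrows = 1, check.names = FALSE))
  setdiff(required_cnv_columns, header)
}

# One made-up segment, written once with all columns and once without the last.
cnv <- data.frame(
  seg_no = 1L, Chromosome = "1", chromStart = 1L, chromEnd = 1000000L,
  total.copy.number.inNormal = 2L, minor.copy.number.inNormal = 1L,
  total.copy.number.inTumour = 3L, minor.copy.number.inTumour = 0L
)
good_file <- file.path(tempdir(), "sample1_cnv.tab")
bad_file <- file.path(tempdir(), "sample1_cnv_incomplete.tab")
write.table(cnv, good_file, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(cnv[, -8], bad_file, sep = "\t", quote = FALSE, row.names = FALSE)

print(missing_cnv_columns(good_file))
print(missing_cnv_columns(bad_file))
```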

SV_bedpe_files

list of file names corresponding to SV (Rearrangements) BEDPE files to be used to construct 32-channel rearrangement catalogues. This should be a named vector, where the names indicate the sample name, so that each file can be matched to the corresponding row in the data_matrix input. The files should contain a rearrangement for each row (two breakpoint positions should be on one row as determined by a pair of mates of paired-end sequencing) and should already be filtered according to the user preference, as all rearrangements in the file will be used and no filter will be applied. The files should contain a header in the first line with the following columns: "chrom1", "start1", "end1", "chrom2", "start2", "end2" and "sample" (sample name). In addition, the files should contain either two columns indicating the strands of the mates, "strand1" (+ or -) and "strand2" (+ or -), or one column indicating the structural variant class, "svclass": translocation, inversion, deletion, tandem-duplication. The column "svclass" should follow the Sanger BRASS convention: inversion (strands +/- or -/+ and mates on the same chromosome), deletion (strands +/+ and mates on the same chromosome), tandem-duplication (strands -/- and mates on the same chromosome), translocation (mates are on different chromosomes).
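
The svclass convention described above can be written as a small lookup. The function classify_sv is a hypothetical helper for illustration only, not the package's own implementation.

```r
# Hypothetical helper: derive svclass from chromosomes and strands,
# following the convention stated in the argument description.
classify_sv <- function(chrom1, chrom2, strand1, strand2) {
  ifelse(
    chrom1 != chrom2, "translocation",            # mates on different chromosomes
    ifelse(
      strand1 != strand2, "inversion",            # +/- or -/+ on the same chromosome
      ifelse(strand1 == "+", "deletion",          # +/+ on the same chromosome
             "tandem-duplication")                # -/- on the same chromosome
    )
  )
}

# Four made-up rearrangements, one of each class.
sv <- data.frame(
  chrom1 = c("1", "1", "1", "1"),
  chrom2 = c("1", "1", "1", "5"),
  strand1 = c("+", "-", "+", "+"),
  strand2 = c("+", "-", "-", "-"),
  stringsAsFactors = FALSE
)
sv$svclass <- classify_sv(sv$chrom1, sv$chrom2, sv$strand1, sv$strand2)
print(sv)
```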

organ

when using RefSigv1 or RefSigv2 as SNV_signature_version, organ-specific signatures will be used. Use one of the following organs: "Biliary", "Bladder", "Bone_SoftTissue", "Breast", "Cervix" (v1 only), "CNS", "Colorectal", "Esophagus", "Head_neck", "Kidney", "Liver", "Lung", "Lymphoid", "NET" (v2 only), "Oral_Oropharyngeal" (v2 only), "Ovary", "Pancreas", "Prostate", "Skin", "Stomach", "Uterus". If a certain organ is not available for either SNV or SV signatures, or if the organ parameter is unspecified, then the pipeline will attempt to use the corresponding reference signatures instead, and SNV_signature_names and/or SV_signature_names can be used to specify a subset of signature names. Alternatively, set this to "Other" to use a curated set of signatures.

SNV_signature_version

version of single base substitution signatures to use, either "COSMICv2", "COSMICv3.2", "RefSigv1" (Degasperi et al. 2020, Nature Cancer) or "RefSigv2" (Degasperi et al. 2022, Science)

SV_signature_version

version of rearrangement signatures to use, only "RefSigv1" (Degasperi et al. 2020, Nature Cancer) currently available

SNV_signature_names

when organ is not specified, you can use this to specify a list of SNV signature names to select from the set of signatures determined by the SNV_signature_version option

SV_signature_names

when organ is not specified, you can use this to specify a list of SV signature names to select from the set of signatures determined by the SV_signature_version option

rareCandidateSelectionCriteria

MaxCosSim or MinError. FitMS parameter. Whenever there is more than one rare signature that passes the multiStepMode criteria, then the best candidate rare signature is automatically selected using the rareCandidateSelectionCriteria. The parameter rareCandidateSelectionCriteria is set to MinError by default. Error is computed as the mean absolute deviation of channels.

subs_fit_obj

Fit or FitMS result object. This parameter should be used when the user wants to customise the subs fit outside the HRDetect pipeline. If custom signatures were used, parameters customNameSNV3 and customNameSNV8 can be used to specify which custom signatures correspond to the HRDetect parameters SNV3 and SNV8.

rearr_fit_obj

Fit or FitMS result object. This parameter should be used when the user wants to customise the rearrangements fit outside the HRDetect pipeline. If custom signatures were used, parameters customNameSV3 and customNameSV5 can be used to specify which custom signatures correspond to the HRDetect parameters SV3 and SV5.

customNameSNV3

custom signature name that will be considered as SNV3 input for HRDetect. Useful for when subs_fit_obj is provided and custom signatures are used.

customNameSNV8

custom signature name that will be considered as SNV8 input for HRDetect. Useful for when subs_fit_obj is provided and custom signatures are used.

customNameSV3

custom signature name that will be considered as SV3 input for HRDetect. Useful for when rearr_fit_obj is provided and custom signatures are used.

customNameSV5

custom signature name that will be considered as SV5 input for HRDetect. Useful for when rearr_fit_obj is provided and custom signatures are used.

SNV_commonSignatureTier

either T1 or T2. Used when fitting organ-specific substitution signatures (organ is specified). For each organ, T1 indicates to use the common organ-specific signatures, while T2 indicates to use the corresponding reference signatures. In general, T1 should be more appropriate for organs where there are no mixed organ-specific signatures, e.g. SBS1+18 or SBS2+13, while T2 might be more suitable when such mixed signatures are present, so that each signature can be fitted separately, e.g. fitting the two signatures SBS1 and SBS18 instead of a single SBS1+18. This parameter affects both the organ signatures used in Fit and the common signatures used in FitMS.

SV_commonSignatureTier

either T1 or T2. Used when fitting organ specific rearrangement signatures (organ is specified). See SNV_commonSignatureTier description.

SNV_rareSignatureTier

either T1 or T2. For each organ we provide two lists of rare signatures that can be used. Tier 1 (T1) are rare signatures that were observed in the requested organ. The problem with T1 is that a signature may be missing simply because there were not enough samples for a certain organ in the particular dataset that was used to extract the signatures. So in general we advise using Tier 2 (T2) signatures, which extend the list to a wider set of rare signatures.

optimisation_method

can be KLD (KL divergence), NNLS (non-negative least squares) or SA (simulated annealing)

exposureFilterTypeFit

use either fixedThreshold or giniScaledThreshold as exposure filter in signature fit. When using fixedThreshold, exposures will be removed based on a fixed percentage with respect to the total number of mutations (threshold_percentFit will be used). When using giniScaledThreshold, each signature will use a different threshold calculated as (1-Gini(signature))*giniThresholdScalingFit

giniThresholdScalingFit

scaling factor for when exposureFilterTypeFit="giniScaledThreshold", which is based on the Gini score of a signature. The threshold is computed as (1-Gini(signature))*giniThresholdScalingFit, and is the percentage of mutations in a sample that the exposure of "signature" needs to exceed. Set it to -1 to deactivate.
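
To illustrate how the threshold (1-Gini(signature))*giniThresholdScalingFit behaves, the sketch below uses the standard Gini coefficient (mean absolute difference divided by twice the mean). Whether the package computes the Gini score in exactly this way is an assumption; gini and gini_threshold are hypothetical helpers.

```r
# Standard Gini coefficient of a non-negative vector (assumed definition).
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n * sum(x))
}

# Threshold as a percentage of the mutations in a sample.
gini_threshold <- function(signature, scaling = 10) {
  (1 - gini(signature)) * scaling
}

# A flat signature spreads mutations evenly over 96 channels (Gini = 0),
# while a peaked signature puts everything in one channel (Gini close to 1).
flat_signature <- rep(1 / 96, 96)
peaked_signature <- c(1, rep(0, 95))

print(gini_threshold(flat_signature))
print(gini_threshold(peaked_signature))
```

Under this definition, flat signatures get a threshold close to the scaling factor itself, while signatures concentrated in a few channels get a much lower threshold.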

giniThresholdScaling_nmutsFit

scaling factor for when exposureFilterTypeFit="giniScaledThreshold", which is based on the Gini score of a signature. The threshold is computed as (1-Gini(signature))*giniThresholdScaling_nmutsFit, and is the number of mutations in a sample that the exposure of "signature" needs to exceed. Set it to -1 to deactivate.

threshold_percentFit

threshold in percentage of total mutations in a sample for when exposureFilterTypeFit="fixedThreshold". Only exposures larger than or equal to the threshold are considered, the others are set to zero. Set it to -1 to deactivate.

threshold_nmutsFit

threshold in number of mutations in a sample for when exposureFilterTypeFit="fixedThreshold". Only exposures larger than or equal to the threshold are considered, the others are set to zero. Set it to -1 to deactivate.
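
The fixed-threshold filter can be sketched as follows. The helper apply_fixed_threshold is hypothetical, the sum of the exposures stands in for the total number of mutations in the sample, and requiring an exposure to pass both active thresholds is an assumption about how the two parameters combine.

```r
# Hypothetical helper: zero out exposures below the active thresholds.
apply_fixed_threshold <- function(exposures,
                                  threshold_percent = 5,
                                  threshold_nmuts = -1) {
  keep <- rep(TRUE, length(exposures))
  if (threshold_percent >= 0) {
    # Percentage threshold, relative to the total (here: sum of exposures).
    keep <- keep & exposures >= sum(exposures) * threshold_percent / 100
  }
  if (threshold_nmuts >= 0) {
    # Absolute threshold in number of mutations.
    keep <- keep & exposures >= threshold_nmuts
  }
  exposures[!keep] <- 0
  exposures
}

# Made-up exposures for one sample (1000 mutations in total).
exposures <- c(SBS1 = 500, SBS3 = 450, SBS8 = 40, SBS18 = 10)
print(apply_fixed_threshold(exposures))                          # 5% filter only
print(apply_fixed_threshold(exposures, threshold_percent = -1))  # both deactivated
```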

bootstrapSignatureFit

set to TRUE to compute bootstrap signature fits, otherwise FALSE will compute a single fit. If a sample has a low number of mutations, then the bootstrap procedure can alter the catalogue a lot, in which case a single fit is advised
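
The effect of a low mutation count on bootstrapped catalogues can be seen in a small simulation. The sketch assumes that a bootstrap catalogue is drawn by multinomial resampling of the observed catalogue, which may differ from what the package does internally; bootstrap_catalogue and relative_change are hypothetical helpers.

```r
set.seed(1)

# Assumed bootstrap: resample the same number of mutations from the
# observed channel proportions.
bootstrap_catalogue <- function(catalogue, nboot) {
  rmultinom(nboot, size = sum(catalogue), prob = catalogue / sum(catalogue))
}

# Mean relative change per channel between bootstraps and the original.
relative_change <- function(catalogue, boots) {
  mean(abs(boots - catalogue) / catalogue)
}

# Two made-up catalogues with the same flat profile but different totals.
low_catalogue <- rep(1, 96)      # 96 mutations
high_catalogue <- rep(1000, 96)  # 96000 mutations

low_boots <- bootstrap_catalogue(low_catalogue, nboot = 100)
high_boots <- bootstrap_catalogue(high_catalogue, nboot = 100)

print(relative_change(low_catalogue, low_boots))
print(relative_change(high_catalogue, high_boots))
```

The catalogue with few mutations changes far more between bootstraps, which is the situation where a single fit is advised.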

nbootFit

number of bootstraps to use; more bootstraps give more accurate results

threshold_p.valueFit

p-value to determine whether an exposure is above the threshold_percent. In other words, this is the empirical probability that the exposure is lower than the threshold
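
As a rough sketch of the empirical probability described above, the fraction of bootstrap exposures that fall below the threshold can be compared with threshold_p.valueFit. The decision rule (keep the exposure when that fraction is at most the p-value threshold) is an assumption, and the bootstrap values are made up.

```r
# Hypothetical helper: empirical probability that the exposure is below
# the threshold, estimated from bootstrap exposures (in percent).
empirical_p <- function(boot_exposures_percent, threshold_percent) {
  mean(boot_exposures_percent < threshold_percent)
}

threshold_percent <- 5
threshold_p_value <- 0.05

# Made-up bootstrap exposures (percent of mutations) for two signatures.
stable_sig <- c(rep(8, 97), rep(3, 3))     # below threshold in 3 of 100 bootstraps
unstable_sig <- c(rep(8, 80), rep(3, 20))  # below threshold in 20 of 100 bootstraps

p_stable <- empirical_p(stable_sig, threshold_percent)
p_unstable <- empirical_p(unstable_sig, threshold_percent)

# Assumed rule: the exposure is kept only if p is at most the p-value threshold.
keep_stable <- p_stable <= threshold_p_value
keep_unstable <- p_unstable <= threshold_p_value
```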

bootstrapHRDetectScores

compute HRDetect scores with bootstrap. This requires mutations or catalogues for subs/rearr to compute the bootstrap fit, and indel mutations to bootstrap the indel classification. HRD-LOH can still be provided using the input data_matrix.

SNV_maxRareSigs

the maximum number of rare substitution signatures allowed in a sample (default SNV_maxRareSigs=1) when using FitMS to fit SNV signatures. FitMS is used if the organ parameter is specified and SNV_signature_version=RefSigv2 (which is the default).

nparallel

how many parallel threads to use.

randomSeed

set an integer random seed

Details

Single Nucleotide Variations. Columns in data_matrix relative to SNV are SNV3 and SNV8. Values corresponding to number of SNV3 and SNV8 mutations in each sample can be provided in the data frame data_matrix. Alternatively, SNV catalogues can be provided for the samples (96-channels as rows and samples as columns) or can be constructed providing a list of either SNV VCF files or SNV TAB files. The SNV catalogues will then be used to estimate signature exposures, using either RefSig or COSMIC signatures. The signature version can be specified using the SNV_signature_version parameter and a specific subset of signatures can be requested via the SNV_signature_names parameter. If an organ is specified, the pipeline will attempt to use organ specific signatures. If the signature version is RefSigv2 and an organ is specified, signature fit will be performed using the FitMS algorithm (Degasperi et al. 2022, Science).

Structural Variants (Rearrangements). Columns in data_matrix relative to SV are SV3 and SV5. Values corresponding to number of SV3 and SV5 rearrangements in each sample can be provided in the data frame data_matrix. Alternatively, SV catalogues can be provided for the samples (32-channels as rows and samples as columns) or can be constructed providing a list of SV BEDPE files. The SV catalogues will then be used to estimate signature exposures, using RefSig signatures. The signature version can be specified using the SV_signature_version parameter and a specific subset of signatures can be requested via the SV_signature_names parameter. If an organ is specified, the pipeline will attempt to use organ specific signatures.

If signature fit for SNV or SV has already been performed using the Fit or FitMS functions, the resulting objects can be passed directly using the subs_fit_obj and rearr_fit_obj parameters. If the objects contain bootstrap fits, these will be used when the bootstrap HRDetect score is requested. The HRDetect_pipeline function will attempt to extract values for SNV3, SNV8, SV3, SV5, using the following signature names: SNV3 = "SBS3", "Signature3", "RefSig3"; SNV8 = "SBS8", "Signature8", "RefSig8"; SV3 = "RS3","RefSigR3"; SV5 = "RS5", "RefSigR5", "RefSigR9". If custom signature names have been used in the subs_fit_obj and rearr_fit_obj fits, then the custom names can be provided using the parameters customNameSNV3, customNameSNV8, customNameSV3 and customNameSV5.
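
The name lookup described above can be sketched as a search through the candidate names listed in the paragraph. The helper find_signature is hypothetical, and giving a custom name priority over the default candidates is an assumption about the pipeline's behaviour.

```r
# Default candidate names, copied from the paragraph above.
default_names <- list(
  SNV3 = c("SBS3", "Signature3", "RefSig3"),
  SNV8 = c("SBS8", "Signature8", "RefSig8"),
  SV3 = c("RS3", "RefSigR3"),
  SV5 = c("RS5", "RefSigR5", "RefSigR9")
)

# Hypothetical helper: return the first candidate present among the
# signature names of a fit, or NA if none is found.
find_signature <- function(fitted_names, candidates, custom_name = NULL) {
  if (!is.null(custom_name)) candidates <- custom_name
  hit <- intersect(candidates, fitted_names)
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Made-up signature names from a substitution fit.
fitted_names <- c("SBS1", "SBS3", "SBS5", "mySig8")

snv3_name <- find_signature(fitted_names, default_names$SNV3)
snv8_default <- find_signature(fitted_names, default_names$SNV8)
snv8_custom <- find_signature(fitted_names, default_names$SNV8,
                              custom_name = "mySig8")
```

Here the SNV8 lookup fails with the default names and succeeds only when the custom name is supplied, which is the case the customName parameters are meant for.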

Deletions at Micro-homology (Indels). The column in data_matrix corresponding to the proportion of deletions at micro-homology is del.mh.prop. The proportion of deletions at micro-homology for the samples can be calculated by the pipeline if the user provides Indels VCF or TAB files (Indels_vcf_files or Indels_tab_files).

HRD-LOH index (CNV). The column in data_matrix corresponding to the HRD-LOH index is hrd. The HRD-LOH index for the samples can be calculated by the pipeline if the user provides copy numbers TAB files.

The pipeline will produce some feedback in the form of info, warning, and error messages. Please check the output to see whether everything worked as planned.

Value

Returns a list that contains $data_matrix (updated input data_matrix with additional computed features), $hrdetect_output (data frame with HRDetect BRCAness Probability and contribution of the features), $SNV_catalogues (input SNV_catalogues updated with additional computed substitution catalogues if any), $SV_catalogues (input SV_catalogues updated with additional computed rearrangement catalogues if any).
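
A minimal sketch of a call and of reading the returned list follows, assuming all six features are already available so that no input files are needed. The feature values are made up, and the call is only attempted if the signature.tools.lib package is installed.

```r
feature_names <- c("del.mh.prop", "SNV3", "SV3", "SV5", "hrd", "SNV8")

# Made-up feature values for two hypothetical samples.
data_matrix <- data.frame(
  del.mh.prop = c(0.60, 0.05),
  SNV3 = c(2500, 0),
  SV3 = c(40, 0),
  SV5 = c(60, 5),
  hrd = c(30, 2),
  SNV8 = c(400, 0),
  row.names = c("sample1", "sample2")
)

if (requireNamespace("signature.tools.lib", quietly = TRUE)) {
  res <- signature.tools.lib::HRDetect_pipeline(
    data_matrix = data_matrix,
    genome.v = "hg19"
  )
  # HRDetect BRCAness probability and per-feature contributions.
  print(res$hrdetect_output)
  # Input data_matrix, updated with any features the pipeline computed.
  print(res$data_matrix)
}
```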

References

A. Degasperi, T. D. Amarante, J. Czarnecki, S. Shooter, X. Zou, D. Glodzik, ... H. Davies, S. Nik-Zainal. A practical framework and online tool for mutational signature analyses show intertissue variation and driver dependencies, Nature Cancer, https://doi.org/10.1038/s43018-020-0027-5, 2020.

A. Degasperi, X. Zou, T. D. Amarante, ..., H. Davies, Genomics England Research Consortium, S. Nik-Zainal. Substitution mutational signatures in whole-genome-sequenced cancers in the UK population. Science, 2022.

Davies, H., Glodzik, D., Morganella, S., Yates, L. R., Staaf, J., Zou, X., ... Nik-Zainal, S. (2017). HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures. Nature Medicine, 23(4), 517–525. https://doi.org/10.1038/nm.4292

Nik-Zainal, S., Davies, H., Staaf, J., Ramakrishna, M., Glodzik, D., Zou, X., ... Stratton, M. R. (2016). Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature, 534(7605), 1–20. https://doi.org/10.1038/nature17676

Huang, X., Wojtowicz, D., & Przytycka, T. M. (2017). Detecting Presence Of Mutational Signatures In Cancer With Confidence. bioRxiv, (October). https://doi.org/10.1101/132597


Nik-Zainal-Group/signature.tools.lib documentation built on April 13, 2025, 5:50 p.m.