superFreq: Wrapper to run default superFreq analysis

View source: R/analyse.R

superFreqR Documentation

Wrapper to run default superFreq analysis

Description

Wrapper to run default superFreq analysis

Usage

superFreq(
  metaDataFile,
  captureRegions = "",
  normalDirectory,
  Rdirectory,
  plotDirectory,
  reference,
  genome = "hg19",
  BQoffset = 33,
  cpus = 3,
  outputToTerminalAsWell = T,
  forceRedo = forceRedoNothing(),
  normalCoverageDirectory = "",
  systematicVariance = 0.03,
  maxCov = 150,
  cloneDistanceCut = -qnorm(0.01),
  dbSNPdirectory = "",
  cosmicDirectory = "",
  resourceDirectory = "superFreqResources",
  mode = "exome",
  splitRun = T,
  participants = "all",
  manualStoryMerge = F,
  vepCall = "vep",
  correctReferenceBias = T,
  filterOffTarget = T,
  rareGermline = T,
  exacPopulation = "all",
  cosmicSalvageRate = 0.001,
  annotationMethod = "VariantAnnotation",
  ploidyPriors = NULL,
  isPairedEnd = TRUE,
  countReadPairs = TRUE
)

Arguments

metaDataFile

Character. A path to a tab separated files with headers: BAM: path (absolute or relative from metaData file) to bam file of sample. An index file is required as well.
VCF: path (absolute or relative from metaData file) to vcf file of sample. This should include both somatic and germline variants. The variants will undergo further filtering, so it is not a problem if the .vcf includes false positives, although large number of entries in the .vcf increases run time.
NAME: unique identifier of the sample. The names will be converted to standard R names, which involves replacing separators (such as space, dash or underscore) with dots.
INDIVIDUAL: unique identifier of the indiviudal the sample comes from. This information is used pair up matched normal samples for somatic mutation filtering, and for identifying germline heterozygous SNPs. This column is also used to group samples that should be compared to each other: all samples from the same individual are analysed together, and mutations are tracked between the samples.
NORMAL: Should be YES if the sample is normal, NO otherwise. A normal sample is assumed to have no somatic mutations. In practice, a tumor burden below a few percent works effectively as a normal, but tumor burdens above 5 severely reduced sensitivity in cancer samples from the same indiviudal.
TIMEPOINT: Mostly used as label in plots together with the sample. Can be Diagnosis, Relapse, resistant or similar labels. Does not influence the analyss otherwise.

captureRegions

Character. A path to a bed file of the unpadded capture regions.

normalDirectory

Character. A path to a directory containing reference normals. The directory should contain a subdirectory called "bam" with (links to) .bam and index files for the reference normal samples. These samples can, but dont have to, be matched normals of the analysed samples. They need to be from the same capture as the analysed samples, and preferably sequenced in the same lab. At least two are required, but more are better (but slower). ~10 is a good number. Another subdirectory called R will be created next to the bam directory.

Rdirectory

Character. A path to the directory where data will be stored. The directory will be created if needed, but the parent directory must exist.

plotDirectory

Character. A path to the directory where plots and output is placed. The directory will be created if needed, but the parent directory must exist.

reference

Character. Path to the indexed fasta file the bams were aligned to.

genome

Character. The reference genome the samples are aligned to, such as "hg19", "hg38" or "mm10". Defaults to "hg19".

BQoffset

Integer. The offset of the base quality encoding. Defaults to 33.

cpus

Integer. Maximum number of parallel threads. Memory use increase with cpus, and speed up is limited at high cpus, due to parts of the analysis not being parallelised, or I/O limited. Suggested is 4 for RNA, 6 for exomes and 10 for genomes.

outputToTerminalAsWell

Boolean. If the rather verbose log output should be sent to terminal as well as to the logfile (in the Rdirectory). Default TRUE.

forceRedo

named list of Booleans. Controls which step are retrieved from previously saved data, and which are redone even if previous results are saved. If a step is forced to be redone, depending downstream steps are also redone. The named entries in the list should be:

forceRedoCount: The read counting over capture regions for the samples.
forceRedoNormalCount: The read counting over capture regions for the set of reference normals.
forceRedoFit: The differential coverage analysis.
forceRedoVolcanoes: Volcano plots of the differential coverage.
forceRedoDifferentRegions: Top table output of gained/lost genes/exons.
forceRedoSNPs: Deprecated, usually not called.
forceRedoVariants: Import and QC of the variants pointed to in the VCF.
forceRedoNormalSNPs: Deprecated, usually not called.
forceRedoNormalVariants: Import and QC of the variants in the set of reference normals.
forceRedoMatchFlag: QC of variants using the set of reference normals. Somatic SNV are also separated from germline.
forceRedoScatters: Scatter plots of the SNV frequencies between samples.
forceRedoOutputSomatic: Top tables of somatic SNVs by sample.
forceRedoNewVariants: Top table of new somatic SNVs compared to other samples.
forceRedoSNPprogression: Heatmap of SNV frequencies over multiple samples.
forceRedoCNV: Copy number calling.
forceRedoCNVplots: Copy number plots.
forceRedoSummary: Summary plot of SNVs and CNVs over all samples.
forceRedoStories: Clonal tracking of somatic mutations.
forceRedoRiver: Plots of the clonal evolution.
forceRedoVEP: VEP call (Variant Effect Predictor) and COSMIC linking. forceRedoSignature: The mutation signature analysis.

The default value is forceRedoNothing(), setting all entries to FALSE. To redo everything, set to forceRedoEverything().

systematicVariance

Numeric. A measure of the expected systematic variance in coverage log fold change. Larger value decreases false CNV calls, smaller values increase sensitivity. Defaults to 0.03.

maxCov

Integer. The coverage at which systematic variance of SNV frequencies are equal to the Poissonian variance. Larger values increase sensitivity in CNV calls, smaller values decrease false positives. Default 150.

dbSNPdirectory

Character. Deprecated. Use resourceDirectory instead.

cosmicDirectory

Character. Deprecated. Use resourceDirectory instead.

resourceDirectory

Character. The location of the directory where superFreq resources are located. Such as dbSNP, COSMIC, annotation. If the directory doesn't exist, it will be created and the data will be downloaded. Defaults to superFreqResources (in the directory where R is run).

mode

Character. The mode to run in. The default 'Exome' is almost always used. If running on RNA, switch to "RNA", which seems to work decently, but beware of limitations of the data.

splitRun

logical. If the superFreq run should be split by individual. This should always be left as default, especially when running multiple individuals. Default TRUE.

participants

character. When run in split mode, this allows you to run on a subset of individuals. Default is "all" which runs on all individuals.

manualStoryMerge

logical. A flag which gives you some manual control over the merging of mutations into clones. Requires babysitting and not recommended. Do not press this button. Default FALSE.

vepCall

character. The call for the variant effect predictor on the command line. Default "vep", otherwise try the path to variant_effect_predictor.pl.

correctReferenceBias

logical. Corrects for (mainly alignment-caused) bias towards the reference allele in VAF calculations. Default TRUE.

filterOffTarget

logical. Ignores any point variants more than 300bp outside the capture regions. Default TRUE.

rareGermline

logical. Display rare germline variants. This is a by-product of the clonal tracking and can yield important information about the sample, but due to ethics this is not always desired. This options allows a user to run superFreq while avoiding germline information. The user should keep in mind that superFreq, like most open source software, comes with no warranty and there is always a risk of bugs producing unintended behaviour. Also note that for individuals without matched normal it can be difficult to separate cancer variants present in most cells in all samples from germline variants. So rare germline variants may be misidentified as somatic variants without a matched normal.

exacPopulation

character. Which population to use for the exac data. This influences annotation, selection of heterozygous SNPs for variant calling and, in case of no matched normal, identification of somatic variants. Default 'all' uses the total allele frequency across all of (non-TCGA) ExAC. Other available options are 'AFR' (Africa), 'AMR' (America), 'EAS' (East Asia), 'FIN' (Finland), 'NFE' (Non-Finland Europe), 'OTH' (Other) and 'SAS' (South Asia).

annotationMethod

character. Which variant annotator to use. 'VEP' is the legacy method and requires a system installation of VEP. VariantAnnotation is contained in R and is significantly faster. Default 'VariantAnnotation'.

ploidyPriors

named numeric vector. Prior belief or knowledge that strongly biases the analysis to pick a ploidy close to the supplied prior. Names of the vector must match the samples you want to affect, but doesnt have to include all analysed samples. Entries not matching sample names are ignored. Note that this is sample-wide ploidy, including any normal contamination, so for example a 0.5 purity samples where the cancer has a AABB genotype genomewide will have a sample-wide ploidy of 3. If this is for a fix-the-analysis second run, it might be informative to look at the fitness curves in plotDirectory, diagnostics, ploidy of the first run to find the secondary local minimum that might be the ploidy you want to be used. Default NULL is a flat prior on ploidy, although the prior preference for less exotic copy numbers implicetely biases the analysis towards ploidies close to 2.

isPairedEnd

logical. Whether the data is actually paired end or not. featureCounts will throw an error if not accurate. Default TRUE.

countReadPairs

logical. Whether to count read pairs. Doing so requires featureCounts to re-sort the file which is slow for large files. This parameter is overriden for mode='genome' and is read pairs are never counted, btu can be used to further speed up exome and RNA runs. Default TRUE.

Details

This function runs a full SNV, CNV and clonality analysis of the input exome data. Output it sent to the plotDirectory, where diagnostics as well as analysis results are placed. At 12 cpus, the pipeline typically runs overnight on an individual with not too many samples. Large numbers of samples or somatic mutations can increase runtime further. Read more on the manual on https://github.com/ChristofferFlensburg/superFreq/blob/master/manual.pdf about how to set up the run and interpret the output.

Examples

## Not run: 
#minimal example to get it running, assuming data fits the defaults.
#you really want to skim through the other settings before running though.
metaDataFile = 'metaData.tsv'
captureRegions='captureRegions.bed'
normalDirectory = '../referenceNormals'
reference = '../reference/hg19.fa'
Rdirectory = 'R'
plotDirectory='plots'

data = superFreq(metaDataFile, captureRegions, normalDirectory,
                 Rdirectory, plotDirectory, reference)

## End(Not run)

ChristofferFlensburg/superFreq documentation built on Nov. 15, 2023, 6:15 a.m.