superFreq | R Documentation |
Wrapper to run default superFreq analysis
superFreq(
metaDataFile,
captureRegions = "",
normalDirectory,
Rdirectory,
plotDirectory,
reference,
genome = "hg19",
BQoffset = 33,
cpus = 3,
outputToTerminalAsWell = T,
forceRedo = forceRedoNothing(),
normalCoverageDirectory = "",
systematicVariance = 0.03,
maxCov = 150,
cloneDistanceCut = -qnorm(0.01),
dbSNPdirectory = "",
cosmicDirectory = "",
resourceDirectory = "superFreqResources",
mode = "exome",
splitRun = T,
participants = "all",
manualStoryMerge = F,
vepCall = "vep",
correctReferenceBias = T,
filterOffTarget = T,
rareGermline = T,
exacPopulation = "all",
cosmicSalvageRate = 0.001,
annotationMethod = "VariantAnnotation",
ploidyPriors = NULL,
isPairedEnd = TRUE,
countReadPairs = TRUE
)
metaDataFile |
Character. A path to a tab separated files with headers:
BAM: path (absolute or relative from metaData file) to bam file of sample.
An index file is required as well. |
captureRegions |
Character. A path to a bed file of the unpadded capture regions. |
normalDirectory |
Character. A path to a directory containing reference normals. The directory should contain a subdirectory called "bam" with (links to) .bam and index files for the reference normal samples. These samples can, but dont have to, be matched normals of the analysed samples. They need to be from the same capture as the analysed samples, and preferably sequenced in the same lab. At least two are required, but more are better (but slower). ~10 is a good number. Another subdirectory called R will be created next to the bam directory. |
Rdirectory |
Character. A path to the directory where data will be stored. The directory will be created if needed, but the parent directory must exist. |
plotDirectory |
Character. A path to the directory where plots and output is placed. The directory will be created if needed, but the parent directory must exist. |
reference |
Character. Path to the indexed fasta file the bams were aligned to. |
genome |
Character. The reference genome the samples are aligned to, such as "hg19", "hg38" or "mm10". Defaults to "hg19". |
BQoffset |
Integer. The offset of the base quality encoding. Defaults to 33. |
cpus |
Integer. Maximum number of parallel threads. Memory use increase with cpus, and speed up is limited at high cpus, due to parts of the analysis not being parallelised, or I/O limited. Suggested is 4 for RNA, 6 for exomes and 10 for genomes. |
outputToTerminalAsWell |
Boolean. If the rather verbose log output should be sent to terminal as well as to the logfile (in the Rdirectory). Default TRUE. |
forceRedo |
named list of Booleans. Controls which step are retrieved from previously
saved data, and which are redone even if previous results are saved.
If a step is forced to be redone, depending downstream steps are also
redone. The named entries in the list should be: forceRedoCount: The read counting over capture regions for the samples. The default value is forceRedoNothing(), setting all entries to FALSE. To redo everything, set to forceRedoEverything(). |
systematicVariance |
Numeric. A measure of the expected systematic variance in coverage log fold change. Larger value decreases false CNV calls, smaller values increase sensitivity. Defaults to 0.03. |
maxCov |
Integer. The coverage at which systematic variance of SNV frequencies are equal to the Poissonian variance. Larger values increase sensitivity in CNV calls, smaller values decrease false positives. Default 150. |
dbSNPdirectory |
Character. Deprecated. Use resourceDirectory instead. |
cosmicDirectory |
Character. Deprecated. Use resourceDirectory instead. |
resourceDirectory |
Character. The location of the directory where superFreq resources are located. Such as dbSNP, COSMIC, annotation. If the directory doesn't exist, it will be created and the data will be downloaded. Defaults to superFreqResources (in the directory where R is run). |
mode |
Character. The mode to run in. The default 'Exome' is almost always used. If running on RNA, switch to "RNA", which seems to work decently, but beware of limitations of the data. |
splitRun |
logical. If the superFreq run should be split by individual. This should always be left as default, especially when running multiple individuals. Default TRUE. |
participants |
character. When run in split mode, this allows you to run on a subset of individuals. Default is "all" which runs on all individuals. |
manualStoryMerge |
logical. A flag which gives you some manual control over the merging of mutations into clones. Requires babysitting and not recommended. Do not press this button. Default FALSE. |
vepCall |
character. The call for the variant effect predictor on the command line. Default "vep", otherwise try the path to variant_effect_predictor.pl. |
correctReferenceBias |
logical. Corrects for (mainly alignment-caused) bias towards the reference allele in VAF calculations. Default TRUE. |
filterOffTarget |
logical. Ignores any point variants more than 300bp outside the capture regions. Default TRUE. |
rareGermline |
logical. Display rare germline variants. This is a by-product of the clonal tracking and can yield important information about the sample, but due to ethics this is not always desired. This options allows a user to run superFreq while avoiding germline information. The user should keep in mind that superFreq, like most open source software, comes with no warranty and there is always a risk of bugs producing unintended behaviour. Also note that for individuals without matched normal it can be difficult to separate cancer variants present in most cells in all samples from germline variants. So rare germline variants may be misidentified as somatic variants without a matched normal. |
exacPopulation |
character. Which population to use for the exac data. This influences annotation, selection of heterozygous SNPs for variant calling and, in case of no matched normal, identification of somatic variants. Default 'all' uses the total allele frequency across all of (non-TCGA) ExAC. Other available options are 'AFR' (Africa), 'AMR' (America), 'EAS' (East Asia), 'FIN' (Finland), 'NFE' (Non-Finland Europe), 'OTH' (Other) and 'SAS' (South Asia). |
annotationMethod |
character. Which variant annotator to use. 'VEP' is the legacy method and requires a system installation of VEP. VariantAnnotation is contained in R and is significantly faster. Default 'VariantAnnotation'. |
ploidyPriors |
named numeric vector. Prior belief or knowledge that strongly biases the analysis to pick a ploidy close to the supplied prior. Names of the vector must match the samples you want to affect, but doesnt have to include all analysed samples. Entries not matching sample names are ignored. Note that this is sample-wide ploidy, including any normal contamination, so for example a 0.5 purity samples where the cancer has a AABB genotype genomewide will have a sample-wide ploidy of 3. If this is for a fix-the-analysis second run, it might be informative to look at the fitness curves in plotDirectory, diagnostics, ploidy of the first run to find the secondary local minimum that might be the ploidy you want to be used. Default NULL is a flat prior on ploidy, although the prior preference for less exotic copy numbers implicetely biases the analysis towards ploidies close to 2. |
isPairedEnd |
logical. Whether the data is actually paired end or not. featureCounts will throw an error if not accurate. Default TRUE. |
countReadPairs |
logical. Whether to count read pairs. Doing so requires featureCounts to re-sort the file which is slow for large files. This parameter is overriden for mode='genome' and is read pairs are never counted, btu can be used to further speed up exome and RNA runs. Default TRUE. |
This function runs a full SNV, CNV and clonality analysis of the input exome data. Output it sent to the plotDirectory, where diagnostics as well as analysis results are placed. At 12 cpus, the pipeline typically runs overnight on an individual with not too many samples. Large numbers of samples or somatic mutations can increase runtime further. Read more on the manual on https://github.com/ChristofferFlensburg/superFreq/blob/master/manual.pdf about how to set up the run and interpret the output.
## Not run:
#minimal example to get it running, assuming data fits the defaults.
#you really want to skim through the other settings before running though.
metaDataFile = 'metaData.tsv'
captureRegions='captureRegions.bed'
normalDirectory = '../referenceNormals'
reference = '../reference/hg19.fa'
Rdirectory = 'R'
plotDirectory='plots'
data = superFreq(metaDataFile, captureRegions, normalDirectory,
Rdirectory, plotDirectory, reference)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.