pipe.VariantCalls: Find Variant Alleles (aka SNPs) in Alignment Data

pipe.VariantCalls    R Documentation

Find Variant Alleles (aka SNPs) in Alignment Data

Description

Pipeline step to detect and quantify SNPs from alignment data, also available as a lower level function. Calls the SAMTOOLS utility to extract MPILEUP details from BAM files and passes them to the BCFTOOLS variant caller.

Usage

pipe.VariantCalls(sampleIDset, annotationFile = "Annotation.txt", optionsFile = "Options.txt", 
		speciesID = getCurrentSpecies(), results.path = NULL, seqIDset = NULL, 
		start = NULL, stop = NULL, prob.variant = 0.5, 
		snpCallMode = c("consensus", "all", "multiallelic"), min.depth = 1, 
		mpileupArgs = "", vcfArgs = "", comboSamplesName = "Combined", verbose = TRUE)

BAM.variantCalls(files, seqID, fastaFile, start = NULL, stop = NULL, prob.variant = 0.5, 
		min.depth = 1, max.depth = 10000, min.gap.fraction = 0.25, 
		mpileupArgs = "", vcfArgs = "", ploidy = 1, geneMap = getCurrentGeneMap(),
		snpCallMode = c("consensus", "all", "multiallelic"), verbose = TRUE)

Arguments

sampleIDset

Vector of SampleIDs to call SNPs for. Note that the underlying BCFTOOLS variant calling methods behave quite differently when given a single sample versus multiple BAM files at one time. The most consistent and reliable results come from calling the function on a single sample at a time and then merging all SNP calls after the fact.

files

Character vector of full pathname BAM files.

annotationFile

File of sample annotation details, which specifies all needed sample-specific information about the samples under study. See DuffyNGS_Annotation.

optionsFile

File of processing options, which specifies all processing parameters that are not sample specific. See DuffyNGS_Options.

speciesID

The SpeciesID of the target species to call SNPs for. By default, use the current species.

results.path

The top level folder path for writing result files to. By default, read from the Options file entry 'results.path'.

seqIDset

Optional character vector of SeqIDs. Default is to call SNPs for all chromosomes, in parallel if possible.

seqID

Character string of a single SeqID, that must exist as a named contig in the FASTA file.

fastaFile

Character string of the full pathname to one genomic FASTA file.

start
stop

Optional numeric limits for the chromosomal region to be inspected.

prob.variant

Numeric probability for deciding if a potential SNP site should be returned as real. Passed down as the BCFTOOLS CALL "-p" option.

snpCallMode

Controls the behavior of the BCFTOOLS CALL command. As SAMTOOLS evolves its SNP calling algorithms, we need to maintain some flexibility. In practice, the SNP calling algorithms do a terrible job on haploid, highly variant genomes like plasmodia, so we tend to use the most generic, straightforward algorithm. The "all" mode is shorthand for running both algorithms and then merging their results. It is extra slow and not very useful.

min.depth
max.depth
min.gap.fraction

Numeric arguments passed down as the SAMTOOLS MPILEUP "-m", "-d", and "-F" options, respectively.

mpileupArgs

Other optional arguments passed down to SAMTOOLS MPILEUP.

vcfArgs

Other optional arguments passed down to BCFTOOLS CALL.

ploidy

Designates the organism being SNP called as being either haploid (1) or diploid (2).

comboSamplesName

Only used when calling multiple samples at one time. Used as folder and file name prefix.
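
The argument descriptions above name the command line options that several arguments are passed down as. The following is only a sketch of that mapping: buildCallerOptions is a hypothetical helper, not part of DuffyNGS, and the package may assemble its command lines differently. It also assumes the vector default of snpCallMode is resolved with the usual R match.arg idiom, which the source does not confirm.

```r
# Hypothetical helper (not a DuffyNGS function): collects the documented
# arguments into the option strings they are described as mapping to.
buildCallerOptions <- function(min.depth = 1, max.depth = 10000,
                               min.gap.fraction = 0.25, prob.variant = 0.5,
                               snpCallMode = c("consensus", "all", "multiallelic"),
                               mpileupArgs = "", vcfArgs = "") {

  # Assumption: the vector default is resolved with match.arg(), which
  # returns the first choice when the argument is left untouched.
  snpCallMode <- match.arg(snpCallMode)

  # "-m", "-d" and "-F" are the MPILEUP options named in the argument text.
  mpileup <- paste("-m", min.depth,
                   "-d", format(max.depth, scientific = FALSE),
                   "-F", min.gap.fraction,
                   mpileupArgs)

  # "-p" is the BCFTOOLS CALL option named for prob.variant.
  call <- paste("-p", prob.variant, vcfArgs)

  list(mode = snpCallMode, mpileup = trimws(mpileup), call = trimws(call))
}

# Extra MPILEUP arguments are appended unchanged.
opts <- buildCallerOptions(min.depth = 3, mpileupArgs = "-q 10")
print(opts)
```

Supplying a value outside the three listed modes would make match.arg stop with an error, which is the main reason the idiom is commonly used for arguments of this shape.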

Details

As a general rule, clinical samples, which are often mixed infections, cause the SNP calling tools to perform very poorly. To counter that, we often use very permissive settings at this step so that the SNP caller returns as many potential SNP sites as possible, and then use more rigorous post-SNP-calling analysis to whittle that down to true SNPs.
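
One possible shape for that later, stricter pass is sketched below. It is an illustration only, not the package's own filtering: filterCandidateSNPs is a hypothetical helper, the thresholds are arbitrary, the data frame is made up, and the column names (QUAL, INFO with a DP tag) are assumed to follow the standard VCF layout.

```r
# Hypothetical post-calling filter (not a DuffyNGS function).
# Keeps candidate sites whose QUAL and read depth both clear a threshold.
filterCandidateSNPs <- function(vcf, min.qual = 30, min.dp = 5) {

  # Extract the DP=<n> tag from the INFO column; NA where the tag is absent.
  hasDP <- grepl("(^|;)DP=[0-9]+", vcf$INFO)
  dp <- rep(NA_real_, nrow(vcf))
  dp[hasDP] <- as.numeric(sub("^(.*;)?DP=([0-9]+).*$", "\\2", vcf$INFO[hasDP]))

  keep <- !is.na(dp) & vcf$QUAL >= min.qual & dp >= min.dp
  vcf[keep, , drop = FALSE]
}

# Made-up candidate sites, standing in for permissive caller output.
candidates <- data.frame(
  CHROM = "chr1",
  POS   = c(100, 250, 400, 900),
  REF   = c("A", "G", "T", "C"),
  ALT   = c("T", "A", "C", "G"),
  QUAL  = c(12, 85, 60, 40),
  INFO  = c("DP=3;MQ=20", "DP=40;MQ=42", "MQ=40;DP=4", "DP=22;MQ=41"),
  stringsAsFactors = FALSE
)

confident <- filterCandidateSNPs(candidates, min.qual = 30, min.dp = 5)
print(confident$POS)
```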

This functionality can be called either through the high level pipe, which writes a folder of result files, or through the low level wrapper, which operates directly on BAM and FASTA files and returns a single data frame.
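
Because single-sample calls are recommended, the low level wrapper can be run once per BAM file and the returned data frames merged afterwards. The sketch below shows one way to do that. callPerBam is a hypothetical helper, the file paths and SeqID are placeholders, and a stub stands in for BAM.variantCalls so the example runs without SAMTOOLS or real data.

```r
# Hypothetical wrapper (not a DuffyNGS function): calls a variant caller once
# per BAM file and stacks the results, tagging each row with its source file.
callPerBam <- function(files, seqID, fastaFile, caller, ...) {
  perFile <- lapply(files, function(f) {
    df <- caller(f, seqID = seqID, fastaFile = fastaFile, ...)
    if (is.null(df) || nrow(df) == 0) return(NULL)
    data.frame(BamFile = basename(f), df,
               stringsAsFactors = FALSE, check.names = FALSE)
  })
  # rbind() skips NULL entries, so files with no calls simply drop out.
  do.call(rbind, perFile)
}

# Stub with the same leading arguments as BAM.variantCalls; returns made-up rows.
stubCaller <- function(files, seqID, fastaFile, ...) {
  data.frame(SEQ_ID = seqID, POS = c(101, 2050),
             REF = c("A", "G"), ALT = c("T", "C"),
             stringsAsFactors = FALSE)
}

merged <- callPerBam(c("/data/S1.bam", "/data/S2.bam"),
                     seqID = "chr1", fastaFile = "genome.fasta",
                     caller = stubCaller)
print(merged)

# With DuffyNGS loaded and real files, the same wrapper could be used as:
# merged <- callPerBam(bamFiles, seqID, fastaFile,
#                      caller = BAM.variantCalls, ploidy = 1)
```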

Value

For pipe.VariantCalls, a subfolder of files is written under the VariantCalls subfolder:

VCF.txt

A file of potential SNP variant allele sites for each chromosome, containing all the columns of details generated by BCFTOOLS CALL. These include all the cryptic scoring and quality metrics and the comma separated list of alternate alleles.

Summary.VCF.txt

One final file of SNP sites, after merging all chromosomes and cleaning up much of the BCFTOOLS details. Includes a column "ALT_AA" that tries to suggest if the SNP changes the amino acid sequence of the protein.

For BAM.variantCalls, a data frame, as in the chromosomal .VCF.txt files above.
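
The files written by the pipe can be loaded back into R for downstream work. The sketch below makes several assumptions the text above does not confirm: that the files are tab-delimited with a header row, that the summary file names end in "Summary.VCF.txt", and that they sit somewhere beneath a "VariantCalls" folder under results.path. readVariantSummaries is a hypothetical helper, and the demonstration uses a mock file in a temporary folder.

```r
# Hypothetical reader (not a DuffyNGS function): finds every file ending in
# "Summary.VCF.txt" beneath <results.path>/VariantCalls and reads each one.
readVariantSummaries <- function(results.path) {
  vcDir <- file.path(results.path, "VariantCalls")
  f <- list.files(vcDir, pattern = "Summary\\.VCF\\.txt$",
                  recursive = TRUE, full.names = TRUE)
  # Assumption: tab-delimited text with a header line.
  out <- lapply(f, read.delim, stringsAsFactors = FALSE)
  names(out) <- basename(f)
  out
}

# Mock results folder with one made-up summary file, for demonstration only.
mockPath <- file.path(tempdir(), "mockResults")
mockDir <- file.path(mockPath, "VariantCalls", "Sample1")
dir.create(mockDir, recursive = TRUE, showWarnings = FALSE)
write.table(data.frame(POS = c(101, 2050), REF = c("A", "G"), ALT = c("T", "C")),
            file = file.path(mockDir, "Sample1.Summary.VCF.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

summaries <- readVariantSummaries(mockPath)
print(names(summaries))
```

The mock file omits the "ALT_AA" column because its value format is not described here.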

Note

It is imperative that the genome FASTA file specified in the genomicFastaFile field of the option table exactly match the genome used to construct the Bowtie2 index that was used in the genomic alignment pipeline step. There is no easy way to verify that on the fly during SNP calling.

Author(s)

Bob Morrison

See Also

pipe.VariantSummary for joining all chromosome SNP results into a single file for one sample, with optional cleaning/filtering.

pipe.VariantComparison for finding SNPs that are differentially detected between groups.


robertdouglasmorrison/DuffyNGS documentation built on March 24, 2024, 4:16 p.m.