callVariantsSingle: Single sample variant calling

Description Usage Arguments Details Value Author(s) Examples

View source: R/call.variants.single.R

Description

A simple single sample variant calling function (calling SNVs and deletions)

Usage

1
callVariantsSingle( data, sampledata, samples = sampledata$Sample, errorRate = 0.001, minSupport = 2, minAF = 0.05, minStrandSupport = 1, mergeDels = TRUE, aggregator = mean)

Arguments

data

A list with elements Counts (a 4d integer array of size [1:12, 1:2, 1:k, 1:n]), Coverage (a 3d integer array of size [1:2, 1:k, 1:n]), Deletions (a 3d integer array of size [1:2, 1:k, 1:n]), Reference (a 1d integer vector of size [1:n]) – see Details.

sampledata

A data.frame with k rows (one for each sample) and columns Column and (Sample. The tally file should contain this information as a group attribute, see getSampleData for an example.

samples

The samples on which variants should be called, by default all samples specified in sampledata are used

errorRate

The expected error rate of the sequencing technology that was used, for illumina this should be 1/1000

minSupport

minimal support required for a position to be considered variant

minAF

minimal allelic frequency for an allele at a position to be cosidered a variant

minStrandSupport

minimal per-strand support for a position to be considered variant

mergeDels

Boolean flag to specify that adjacent deletion calls should be merged

aggregator

Aggregator function for merging statistics of adjacent deletion calls, defaults to mean, which means that a deletion larger than 1bp will be annotated with the means of the counts and coverages etc.

Details

data is a list of datasets which has to at least contain the Counts and Coverages for variant calling respectively Deletions for deletion calling (if Deletions is not present no deletion calls will be made). This list will usually be generated by a call to the h5dapply function in which the tally file, chromosome, datasets and regions within the datasets would be specified. See h5dapply for specifics.

callVariantsSingle implements a simple single sample variant callign approach for SNVs and deletions (if Deletions is a dataset present in the data parameter. The function applies three essential filters to the provided data, requiring:

- minSupport total support for the variant at the position - minStrandSupport support for the variant on each strand - an allele freqeuncy of at least minAF (for pure diploid samples this can be set relatively high, e.g. 0.3, for calling potentially homozygous variants a value of 0.8 or higher might be used)

Calls are annotated with the p-Value of a binom.test of the present support and coverage given the error rate provided in the errorRate parameter, no filtering is done on this annotation.

Adjacent deletion calls are merged based in the value of the mergeDels parameter and their statistics are aggregated with the function supplied in the aggregator parameter.

Value

This function returns a data.frame containing annotated calls with the following slots:

Chrom

The chromosome the potential variant / deletion is on

Start

The starting position of the variant / deletion

End

The end position of the variant / deletions (equal to Start for SNVs and single basepair deletions)

Sample

The sample in which the variant was called

altAllele

The alternate allele for SNVs (deletions will have a "-" in that slot)

refAllele

The reference allele for SNVs (deletions will have the deleted sequence here as extracted from the Reference dataset, if the tally file contains a sparse representation of the reference, i.e. only positions with mismatches show a reference value the missing values are substituted with "N"'s. It is strongly suggested to write the whole reference into the tally file prior to deletion calling - see writeReference for details)

SupFwd

Support for the variant in the sample on the forward strand

SupRev

Support for the variant in the sample on the reverse strand

CovFwd

Coverage of the variant position in the sample on the forward strand

CovRev

Coverage of the variant position in the sample on the reverse strand

AF_Fwd

Allele frequency of the variant in the sample on the forward strand

AF_Rev

Allele frequency of the variant in the sample on the reverse strand

Support

Total Support of the variant - i.e. SupFwd + SupRev

Coverage

Total Coverage of the variant position - i.e. CovFwd + CovRev

AF

Total allele frequency of the variant, i.e. Support / Coverage

fBackground

Background frequency of the variant in all samples but the one the variant is called in

pErrorFwd

Probablity of the observed support and coverage given the error rate on the forward strand

pErrorRev

Probablity of the observed support and coverage given the error rate on the reverse strand

pError

Probablity of the observed support and coverage given the error rate on both strands combined

pError

Coverage of the variant position in the Control sample on the forward strand

pStrand

p-Value of a fisher.test on the contingency matrix matrix(c(CovFwd,CovRev,SupFwd,SupRev), nrow = 2) at this position - low values could indicate strand bias

Author(s)

Paul Pyl

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
  library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsSingle,
    sampledata = sampleData,
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant

h5vc documentation built on Nov. 8, 2020, 4:56 p.m.