callVariantsSingle: Single sample variant calling
In h5vc: Managing alignment tallies using a hdf5 backend

Description Usage Arguments Details Value Author(s) Examples

A simple single sample variant calling function (calling SNVs and deletions)

1	callVariantsSingle( data, sampledata, samples = sampledata$Sample, errorRate = 0.001, minSupport = 2, minAF = 0.05, minStrandSupport = 1, mergeDels = TRUE, aggregator = mean)

`data`	A `list` with elements `Counts` (a 4d `integer` array of size [1:12, 1:2, 1:k, 1:n]), `Coverage` (a 3d `integer` array of size [1:2, 1:k, 1:n]), `Deletions` (a 3d `integer` array of size [1:2, 1:k, 1:n]), `Reference` (a 1d `integer` vector of size [1:n]) – see Details.
`sampledata`	A `data.frame` with `k` rows (one for each sample) and columns `Column` and (`Sample`. The tally file should contain this information as a group attribute, see `getSampleData` for an example.
`samples`	The samples on which variants should be called, by default all samples specified in sampledata are used
`errorRate`	The expected error rate of the sequencing technology that was used, for illumina this should be `1/1000`
`minSupport`	minimal support required for a position to be considered variant
`minAF`	minimal allelic frequency for an allele at a position to be cosidered a variant
`minStrandSupport`	minimal per-strand support for a position to be considered variant
`mergeDels`	Boolean flag to specify that adjacent deletion calls should be merged
`aggregator`	Aggregator function for merging statistics of adjacent deletion calls, defaults to `mean`, which means that a deletion larger than 1bp will be annotated with the means of the counts and coverages etc.

data is a list of datasets which has to at least contain the Counts and Coverages for variant calling respectively Deletions for deletion calling (if Deletions is not present no deletion calls will be made). This list will usually be generated by a call to the h5dapply function in which the tally file, chromosome, datasets and regions within the datasets would be specified. See h5dapply for specifics.

callVariantsSingle implements a simple single sample variant callign approach for SNVs and deletions (if Deletions is a dataset present in the data parameter. The function applies three essential filters to the provided data, requiring:

- minSupport total support for the variant at the position - minStrandSupport support for the variant on each strand - an allele freqeuncy of at least minAF (for pure diploid samples this can be set relatively high, e.g. 0.3, for calling potentially homozygous variants a value of 0.8 or higher might be used)

Calls are annotated with the p-Value of a binom.test of the present support and coverage given the error rate provided in the errorRate parameter, no filtering is done on this annotation.

Adjacent deletion calls are merged based in the value of the mergeDels parameter and their statistics are aggregated with the function supplied in the aggregator parameter.

This function returns a data.frame containing annotated calls with the following slots:

`Chrom`	The chromosome the potential variant / deletion is on
`Start`	The starting position of the variant / deletion
`End`	The end position of the variant / deletions (equal to Start for SNVs and single basepair deletions)
`Sample`	The sample in which the variant was called
`altAllele`	The alternate allele for SNVs (deletions will have a `"-"` in that slot)
`refAllele`	The reference allele for SNVs (deletions will have the deleted sequence here as extracted from the `Reference` dataset, if the tally file contains a sparse representation of the reference, i.e. only positions with mismatches show a reference value the missing values are substituted with `"N"`'s. It is strongly suggested to write the whole reference into the tally file prior to deletion calling - see `writeReference` for details)
`SupFwd`	Support for the variant in the sample on the forward strand
`SupRev`	Support for the variant in the sample on the reverse strand
`CovFwd`	Coverage of the variant position in the sample on the forward strand
`CovRev`	Coverage of the variant position in the sample on the reverse strand
`AF_Fwd`	Allele frequency of the variant in the sample on the forward strand
`AF_Rev`	Allele frequency of the variant in the sample on the reverse strand
`Support`	Total Support of the variant - i.e. `SupFwd + SupRev`
`Coverage`	Total Coverage of the variant position - i.e. `CovFwd + CovRev`
`AF`	Total allele frequency of the variant, i.e. `Support / Coverage`
`fBackground`	Background frequency of the variant in all samples but the one the variant is called in
`pErrorFwd`	Probablity of the observed support and coverage given the error rate on the forward strand
`pErrorRev`	Probablity of the observed support and coverage given the error rate on the reverse strand
`pError`	Probablity of the observed support and coverage given the error rate on both strands combined
`pError`	Coverage of the variant position in the `Control` sample on the forward strand
`pStrand`	p-Value of a `fisher.test` on the contingency matrix `matrix(c(CovFwd,CovRev,SupFwd,SupRev), nrow = 2)` at this position - low values could indicate strand bias

Paul Pyl

  library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsSingle,
    sampledata = sampleData,
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant