snpSummary: Counts and distribution statistics for SNPs in a VCF object

snpSummary R Documentation

Counts and distribution statistics for SNPs in a VCF object

Description

Computes genotype counts, allele frequencies and Hardy-Weinberg equilibrium statistics for SNPs in a CollapsedVCF object.

Usage

## S4 method for signature 'CollapsedVCF'
snpSummary(x, ...)

Arguments

x

A CollapsedVCF object.

...

Additional arguments to methods.

Details

Genotype counts, allele frequencies and Hardy-Weinberg equilibrium (HWE) statistics are calculated for single nucleotide variants in a CollapsedVCF object. HWE is an established quality filter for genotype data: the equilibrium should be attained in a single generation of random mating, so departures from it, indicated by small p-values, almost invariably point to a problem with the genotype calls.
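To make the HWE statistics concrete, the following base R sketch tests one variant from its three genotype counts using a Pearson chi-square test with one degree of freedom. It illustrates a standard textbook approach and is not necessarily the exact computation snpSummary performs. The helper name hweSketch, its return columns and the sign convention of the z-score are assumptions made for this illustration.

```r
## Hypothetical helper, not part of VariantAnnotation.
## Takes the genotype counts of a single biallelic variant.
hweSketch <- function(g00, g01, g11) {
    n <- g00 + g01 + g11                  # number of diploid calls
    p <- (2 * g00 + g01) / (2 * n)        # reference allele frequency
    q <- 1 - p                            # alternate allele frequency

    ## Genotype counts expected under Hardy-Weinberg equilibrium.
    expected <- n * c(p^2, 2 * p * q, q^2)
    observed <- c(g00, g01, g11)

    ## Pearson chi-square statistic (NaN for a monomorphic variant,
    ## where an expected count is zero).
    chisq <- sum((observed - expected)^2 / expected)

    ## Assumed convention: the z-score is the signed square root of the
    ## statistic, negative when heterozygotes are fewer than expected.
    zscore <- sign(g01 - expected[2]) * sqrt(chisq)
    pvalue <- pchisq(chisq, df = 1, lower.tail = FALSE)

    data.frame(chisq = chisq, zscore = zscore, pvalue = pvalue)
}

## Counts in exact equilibrium, then counts with no heterozygotes.
hweSketch(g00 = 25, g01 = 50, g11 = 25)
hweSketch(g00 = 50, g01 = 0, g11 = 50)
```

The second call shows the kind of heterozygote deficit that a small p-value would flag as a likely genotyping problem.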

The following caveats apply:

  • No distinction is made between phased and unphased genotypes.

  • Only diploid calls are included.

  • Only ‘valid’ SNPs are included. A ‘valid’ SNP is defined as having a reference allele of length 1 and a single alternate allele of length 1.

Statistics for variants that do not meet these criteria are set to NA.
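The 'valid' SNP rule can be sketched in base R on plain character data. The helper isValidSnp and the toy alleles below are hypothetical; they mirror the cases discussed in the Examples (two alternate alleles, no alternate allele, and a non-SNP) and do not reproduce how the method inspects a CollapsedVCF internally.

```r
## Hypothetical helper, not part of VariantAnnotation.
## ref: character vector of reference alleles, one per variant.
## alt: list of character vectors of alternate alleles, one per variant.
isValidSnp <- function(ref, alt) {
    ## Exactly one alternate allele, and that allele has length 1.
    singleBaseAlt <- vapply(
        alt,
        function(a) length(a) == 1L && nchar(a) == 1L,
        logical(1)
    )
    ## The reference allele must also have length 1.
    nchar(ref) == 1L & singleBaseAlt
}

## Toy alleles (assumed for illustration).
ref <- c("G", "T", "A", "T", "GTC")
alt <- list("A", "A", c("G", "T"), character(0), c("G", "GTCT"))

isValidSnp(ref, alt)
```

Only the first two toy variants pass the check; the remaining three would have their statistics set to NA.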

Value

The object returned is a data.frame with seven columns.

g00

Counts for genotype 00 (homozygous reference).

g01

Counts for genotype 01 or 10 (heterozygous).

g11

Counts for genotype 11 (homozygous alternate).

a0Freq

Frequency of the reference allele.

a1Freq

Frequency of the alternate allele.

HWEzscore

Z-score for departure from a null hypothesis of Hardy-Weinberg equilibrium.

HWEpvalue

p-value for departure from a null hypothesis of Hardy-Weinberg equilibrium.
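As a rough illustration of how the count and frequency columns relate, the sketch below tallies diploid genotype strings for a single variant and derives allele frequencies from the tallies. The helper countGenotypes and the input strings are assumptions made for this example; the method itself works on the genotype data stored in the VCF object, and its internals may differ.

```r
## Hypothetical helper, not part of VariantAnnotation.
## gt: character vector of genotype calls for one variant,
##     e.g. "0/1" (unphased) or "0|1" (phased).
countGenotypes <- function(gt) {
    ## Phased and unphased calls are treated alike.
    gt <- sub("|", "/", gt, fixed = TRUE)

    g00 <- sum(gt == "0/0")                  # homozygous reference
    g01 <- sum(gt %in% c("0/1", "1/0"))      # heterozygous
    g11 <- sum(gt == "1/1")                  # homozygous alternate

    ## Any other call (missing, haploid, other alleles) is ignored.
    n <- g00 + g01 + g11

    ## Each diploid call contributes two alleles.
    a0Freq <- (2 * g00 + g01) / (2 * n)
    data.frame(g00, g01, g11, a0Freq = a0Freq, a1Freq = 1 - a0Freq)
}

countGenotypes(c("0|0", "1|0", "1/1", "0/1", "./."))
```

Because only two alleles are counted for a valid SNP, the two frequencies in this sketch always sum to 1.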

Author(s)

Chris Wallace <cew54@cam.ac.uk>

See Also

genotypeToSnpMatrix, probabilityToSnpMatrix

Examples

  library(VariantAnnotation)

  fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
  vcf <- readVcf(fl, "hg19")

  ## The return value is a data.frame with genotype counts
  ## and allele frequencies.
  df <- snpSummary(vcf)
  df

  ## Compare to ranges in the VCF object:
  rowRanges(vcf)

  ## No statistics were computed for the variants in rows 3, 4
  ## and 5; their entries are NA because row 3 has two alternate
  ## alleles, row 4 has none and row 5 is not a SNP.

Bioconductor/VariantAnnotation documentation built on Nov. 2, 2024, 7:22 a.m.