readGenotypeMatrix-methods: Read from VCF File

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

A fast lightweight function that reads from a VCF file and returns the result as a GenotypeMatrix object

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
## S4 method for signature 'TabixFile,GRanges'
readGenotypeMatrix(file, regions, subset,
                   noIndels=TRUE, onlyPass=TRUE,
                   na.limit=1, MAF.limit=1,
                   na.action=c("impute.major", "omit", "fail"),
                   MAF.action=c("invert", "omit", "ignore", "fail"),
                   sex=NULL)
## S4 method for signature 'TabixFile,missing'
readGenotypeMatrix(file, regions, ...)
## S4 method for signature 'character,GRanges'
readGenotypeMatrix(file, regions, ...)
## S4 method for signature 'character,missing'
readGenotypeMatrix(file, regions, ...)

Arguments

file

a TabixFile object or a character string with a file name of the VCF file to read from; if file is a file name, the method internally creates a TabixFile object for this file name.

regions

a GRanges object that specifies which genomic regions to read from the VCF file; if missing, the entire VCF file is read.

subset

a numeric vector with indices or a character vector with names of samples to restrict to; if specified, only these samples' genotypes are read from the VCF file and all other samples are ignored and omitted from the GenotypeMatrix object that is returned. Moreover, minor allele frequencies (MAFs) are only computed from the genotypes of the samples specified by subset.

noIndels

if TRUE (default), only single-nucleotide variants (SNVs) are considered and indels are skipped.

onlyPass

if TRUE (default), only variants are considered whose value in the FILTER column is “PASS”.

na.limit

all variants with a missing value ratio above this threshold will be omitted from the output object.

MAF.limit

all variants with an MAF above this threshold will be omitted from the output object.

na.action

if “impute.major”, all missing values will be imputed by major alleles in the output object. If “omit”, all variants containing missing values will be omitted in the output object. If “fail”, the function stops with an error if a variant contains any missing values.

MAF.action

if “invert”, all variants with an MAF exceeding 0.5 will be inverted in the sense that all minor alleles will be replaced by major alleles and vice versa. If “omit”, all variants with an MAF greater than 0.5 are omitted in the output object. If “ignore”, no action is taken and MAFs greater than 0.5 are kept as they are. If “fail”, the function stops with an error if any variant has an MAF greater than 0.5.

sex

if NULL, all samples are treated the same without any modifications; if sex is a factor with levels F (female) and M (male) that is as long as subset or as the VCF file has samples, this argument is interpreted as the sex of the samples. In this case, the genotypes corresponding to male samples are doubled before further processing. This is designed for mixed-sex analyses of the X chromosome outside of the pseudoautosomal regions.

...

for the three latter methods above, all other parameters are passed on to the method with signature TabixFile,GRanges.

Details

This method uses the tabix API provided by the Rsamtools package to read from a VCF file, parses the result into a sparse matrix along with positional information, and returns the result as a GenotypeMatrix object. Reading can be restricted to certain regions by specifying the regions object. Note that it might not be possible to read a very large VCF file as a whole.

For all variants, filters in terms of missing values and MAFs can be applied. Moreover, variants with MAFs greater than 0.5 can filtered out or inverted. For details, see descriptions of parameters na.limit, MAF.limit, na.action, and MAF.action above.

Value

returns an object of class GenotypeMatrix

Author(s)

Ulrich Bodenhofer bodenhofer@bioinf.jku.at

References

http://www.bioinf.jku.at/software/podkat

http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42

Li, H., Handsaker, B., Wysoker, A., Fenell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.

See Also

GenotypeMatrix

Examples

1
2
3
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
readGenotypeMatrix(vcfFile)
readGenotypeMatrix(vcfFile, onlyPass=FALSE, MAF.action="ignore")

podkat documentation built on Nov. 8, 2020, 6:55 p.m.