genotypeMatrix-methods: Constructors for Creating 'GenotypeMatrix' Objects

genotypeMatrix-methodsR Documentation

Constructors for Creating GenotypeMatrix Objects

Description

Create GenotypeMatrix object from (sparse) matrix object and positions of variants

Usage

## S4 method for signature 'ANY,GRanges,missing'
genotypeMatrix(Z, pos, seqnames,
       ploidy=2, na.string=NULL, na.limit=1, MAF.limit=1,
       na.action=c("impute.major", "omit", "fail"),
       MAF.action=c("invert", "omit", "ignore", "fail"),
       sex=NULL)
## S4 method for signature 'ANY,numeric,character'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'ANY,character,missing'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'ANY,missing,missing'
genotypeMatrix(Z, pos, seqnames, subset,
       noIndels=TRUE, onlyPass=TRUE, sex=NULL, ...)
## S4 method for signature 'eSet,numeric,character'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'eSet,character,missing'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'eSet,character,character'
genotypeMatrix(Z, pos, seqnames, ...)

Arguments

Z

an object of class dgCMatrix, a numeric matrix, a character matrix, an object of class VCF, or an object of class eSet (see details below)

pos

an object of class GRanges, a numeric vector, or a character vector (see details below)

seqnames

a character vector (see details below)

ploidy

determines the ploidy of the genome for the computation of minor allele frequencies (MAFs) and the possible inversion of columns with an MAF exceeding 0.5; the elements of Z may not exceed this value.

subset

a numeric vector with indices or a character vector with names of samples to restrict to

na.limit

all columns with a missing value ratio above this threshold will be omitted from the output object.

MAF.limit

all columns with an MAF above this threshold will be omitted from the output object.

na.action

if “impute.major”, all missing values will be imputed by major alleles in the output object. If “omit”, all columns containing missing values will be omitted in the output object. If “fail”, the function stops with an error if Z contains any missing values.

MAF.action

if “invert”, all columns with an MAF exceeding 0.5 will be inverted in the sense that all minor alleles will be replaced by major alleles and vice versa. For numerical Z, this is accomplished by subtracting the column from the ploidy value. If “omit”, all columns with an MAF greater than 0.5 are omitted in the output object. If “ignore”, no action is taken and MAFs greater than 0.5 are kept as they are. If “fail”, the function stops with an error if Z contains any column with an MAF greater than 0.5.

noIndels

if TRUE (default), only single nucleotide variants (SNVs) are considered and indels are skipped; only works if the ALT column is present in the VCF object Z, otherwise a warning is shown and the noIndels argument is ignored.

onlyPass

if TRUE (default), only variants are considered whose value in the FILTER column is “PASS”; only works if the FILTER column is present in the VCF object Z, otherwise a warning is shown and the onlyPass argument is ignored.

na.string

if not NULL, all “.” entries in the character matrix or VCF genotype are replaced with this string before parsing the matrix.

sex

if NULL, all rows of Z are treated the same without any modifications; if sex is a factor with levels F (female) and M (male) that is as long as Z has rows, this argument is interpreted as the sex of the samples. In this case, the rows corresponding to male samples are doubled before further processing. This is designed for mixed-sex analyses of the X chromosome outside of the pseudoautosomal regions.

...

all additional arguments are passed on internally to the genotypeMatrix method with signature ANY,GRanges,missing.

Details

This method provides different ways of constructing an object of class GenotypeMatrix from other types of objects. The typical case is when a matrix object is combined with positional information. The first three variants listed above work with Z being a dgCMatrix object, a numeric matrix, or a character matrix.

If Z is a dgCMatrix object or a matrix, rows are interpreted as samples and columns are interpreted as variants. For dgCMatrix objects and numeric matrices, matrix entries are interpreted as the numbers of minor alleles (with 0 meaning only major alleles). In this case, minor allele frequencies (MAFs) are computed as column sums divided by the number of alleles, i.e. the number of samples/rows multiplied by the ploidy parameter. If Z is a character matrix, the matrix entries need to comply to the format of the “GT” field in VCF files. MAFs are computed as the actual relative frequency of minor alleles among all alleles in a column. For a diploid genome, therefore, this results in the same MAF estimate as mentioned above. However, some VCF readers, most importantly readVcf from the VariantAnnotation package, replace missing genotypes by a single “.” even for non-haploid genomes, which would result in a wrong MAF estimate. To correct for this, the na.string parameter is available. If not NULL, all “.” entries in the matrix are replaced by na.string before parsing the matrix. The correct setting for a diploid genome would be “./.”.

Positional information can be passed to the function in three different ways:

  • by supplying a GRanges object as pos argument and omitting the seqnames argument,

  • by supplying a numeric vector of positions as pos argument and sequence/chromosome names as seqnames argument, or

  • by supplying a character vector with entries of the format “seqname:pos” as pos argument and omitting the seqnames argument.

In all three cases, the lengths of the arguments pos and seqnames (if not omitted) must match the number of columns of Z.

If the arguments pos and seqnames are not specified, argument Z can (and must) be an object of class VCF (cf. package VariantAnnotation). In this case, the genotypeMatrix method extracts both the genotype matrix and positional information directly from the VCF object. Consequently, the VCF object Z must contain genotype information. If so, the genotype matrix is parsed and converted as described above for character matrices. Moreover, indels and variants that did not pass all quality filters can be skipped (see description of arguments noIndels and onlyPass above).

For all variants, filters in terms of missing values and MAFs can be applied. Moreover, variants with MAFs greater than 0.5 can filtered out or inverted. For details, see descriptions of parameters na.limit, MAF.limit, na.action, and MAF.action above.

For convenience, genotypeMatrix also allows for converting SNP genotype matrices stored in eSet objects, e.g. SnpSet objects or SnpSetIllumina objects (cf. package beadarraySNP). If genotypeMatrix is called with an eSet object as first argument Z, the method first checks whether there is a slot call in assayData(Z) and whether it is a matrix. If so, this matrix is interpreted as follows: 1 corresponds to genotype “AA”, 2 corresponds to the genotype “Aa”, and 3 corresponds to the genotype “aa”, where “A” is the major allele and “a” is the minor allele. If pos is a numeric vector and seqnames is a character vector or if pos is a character vector and seqnames is missing, then these two arguments are interpreted as described above. However, if pos and seqnames are both single strings (character vectors of length 1), then pos is interpreted as the name of the feature data column that contains positional information and seqnames is interpreted as the feature data column that contains the chromosome on which each variant is located. Correspondingly, featureData(Z)[[pos]] must be available and must be a numeric vector. Correspondingly, featureData(Z)[[seqnames]] must be available and must be a character vector (or a data type that can be cast to a character vector).

Value

returns an object of class GenotypeMatrix

Author(s)

Ulrich Bodenhofer

References

https://github.com/UBod/podkat

https://github.com/samtools/hts-specs

Obenchain, V., Lawrence, M., Carey, V., Gogarten, S., Shannon, P., and Morgan, M. (2014) VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics 30, 2076-2078. DOI: \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1093/bioinformatics/btu168")}.

See Also

GenotypeMatrix, dgCMatrix, GRanges

Examples

## create a toy example
A <- matrix(rbinom(50, 2, prob=0.2), 5, 10)
sA <- as(A, "dgCMatrix")
pos <- sort(sample(1:10000, ncol(A)))
seqname <- "chr1"

## variant with 'GRanges' object
gr <- GRanges(seqnames=seqname, ranges=IRanges(start=pos, width=1))
gtm <- genotypeMatrix(A, gr)
gtm
as.matrix(gtm)
variantInfo(gtm)
MAF(gtm)

## variant with 'pos' and 'seqnames' object
genotypeMatrix(sA, pos, seqname)

## variant with 'seqname:pos' strings passed through 'pos' argument
spos <- paste(seqname, pos, sep=":")
spos
genotypeMatrix(sA, spos)

## read data from VCF file using 'readVcf()' from the 'VariantAnnotation'
## package
if (require(VariantAnnotation))
{
    vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
    sp <- ScanVcfParam(info=NA, genome="GT", fixed=c("ALT", "FILTER"))
    vcf <- readVcf(vcfFile, genome="hgA", param=sp)
    rowRanges(vcf)

    ## call constructor for 'VCF' object
    gtm <- genotypeMatrix(vcf)
    gtm
    variantInfo(gtm)

    ## alternatively, extract information from 'VCF' object and use
    ## variant with character matrix and 'GRanges' positions
    ## note that, in 'VCF' objects, rows correspond to variants and
    ## columns correspond to samples, therefore, we have to transpose the
    ## genotype
    gt <- t(geno(vcf)$GT)
    gt[1:5, 1:5]
    gr <- rowRanges(vcf)
    gtm <- genotypeMatrix(gt, gr)
    as.matrix(gtm[1:20, 1:5, recomputeMAF=TRUE])
}

UBod/podkat documentation built on May 5, 2024, 6:37 a.m.