genotypeMatrix-methods | R Documentation |
GenotypeMatrix
ObjectsCreate GenotypeMatrix
object from
(sparse) matrix object and positions of variants
## S4 method for signature 'ANY,GRanges,missing'
genotypeMatrix(Z, pos, seqnames,
ploidy=2, na.string=NULL, na.limit=1, MAF.limit=1,
na.action=c("impute.major", "omit", "fail"),
MAF.action=c("invert", "omit", "ignore", "fail"),
sex=NULL)
## S4 method for signature 'ANY,numeric,character'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'ANY,character,missing'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'ANY,missing,missing'
genotypeMatrix(Z, pos, seqnames, subset,
noIndels=TRUE, onlyPass=TRUE, sex=NULL, ...)
## S4 method for signature 'eSet,numeric,character'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'eSet,character,missing'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'eSet,character,character'
genotypeMatrix(Z, pos, seqnames, ...)
Z |
an object of class |
pos |
an object of class |
seqnames |
a character vector (see details below) |
ploidy |
determines the ploidy of the genome for the computation of
minor allele frequencies (MAFs) and the possible inversion of
columns with an MAF exceeding
0.5; the elements of |
subset |
a numeric vector with indices or a character vector with names of samples to restrict to |
na.limit |
all columns with a missing value ratio above this threshold will be omitted from the output object. |
MAF.limit |
all columns with an MAF above this threshold will be omitted from the output object. |
na.action |
if “impute.major”, all missing values will
be imputed by major alleles in the output object. If
“omit”, all columns containing missing values will be
omitted in the output object. If “fail”, the function stops
with an error if |
MAF.action |
if “invert”, all columns with an MAF
exceeding 0.5 will be inverted in the sense that all minor alleles
will be replaced by major alleles and vice versa. For numerical
|
noIndels |
if |
onlyPass |
if |
na.string |
if not |
sex |
if |
... |
all additional arguments are passed on internally to the
|
This method provides different ways of constructing an object of
class GenotypeMatrix
from other types of objects.
The typical case is when a matrix object is combined with positional
information. The first three variants listed above work with Z
being a dgCMatrix
object, a numeric matrix, or
a character matrix.
If Z
is a dgCMatrix
object or a matrix,
rows are interpreted as samples and columns are interpreted as variants.
For dgCMatrix
objects and numeric
matrices, matrix entries are interpreted as the numbers of minor alleles
(with 0 meaning only major alleles). In this case, minor allele
frequencies (MAFs) are computed as column sums divided by the number
of alleles, i.e. the number of samples/rows multiplied by the
ploidy
parameter. If Z
is a character matrix, the matrix
entries need to comply to the format of the “GT” field in
VCF files. MAFs are computed as the actual relative frequency of
minor alleles among all alleles in a column. For a diploid genome,
therefore, this results in the same MAF estimate as mentioned above.
However, some VCF readers, most importantly readVcf
from the
VariantAnnotation package, replace missing genotypes by a
single “.” even for non-haploid genomes, which would result in a
wrong MAF estimate. To correct for this, the na.string
parameter is available. If not NULL
, all “.” entries
in the matrix are replaced by na.string
before parsing the
matrix. The correct setting for a diploid genome would be
“./.”.
Positional information can be passed to the function in three different ways:
by supplying a GRanges
object as
pos
argument and omitting the seqnames
argument,
by supplying a numeric vector of positions as pos
argument and sequence/chromosome names as seqnames
argument, or
by supplying a character vector with entries of the format
“seqname:pos” as pos
argument and omitting the
seqnames
argument.
In all three cases, the lengths of the arguments pos
and
seqnames
(if not omitted) must match the number of columns of
Z
.
If the arguments pos
and seqnames
are not specified,
argument Z
can (and must) be an object of class
VCF
(cf. package VariantAnnotation). In this case,
the genotypeMatrix
method extracts both the genotype matrix and
positional information directly from the VCF
object. Consequently, the VCF
object Z
must contain genotype
information. If so, the genotype matrix is parsed and converted as
described above for character matrices. Moreover, indels and variants
that did not pass all quality filters can be
skipped (see description of arguments noIndels
and
onlyPass
above).
For all variants, filters in terms of missing values and MAFs can be
applied. Moreover, variants with MAFs greater than 0.5 can filtered
out or inverted. For details, see descriptions of parameters
na.limit
, MAF.limit
, na.action
, and
MAF.action
above.
For convenience, genotypeMatrix
also allows for converting
SNP genotype matrices stored in eSet
objects,
e.g. SnpSet
objects or SnpSetIllumina
objects (cf. package beadarraySNP). If genotypeMatrix
is
called with an eSet
object as first argument
Z
, the method first checks whether there is a slot call
in assayData(Z)
and whether it is a matrix. If so, this matrix
is interpreted as follows: 1 corresponds to genotype “AA”,
2 corresponds to the genotype “Aa”, and 3 corresponds to the
genotype “aa”, where “A” is the major allele and
“a” is the minor allele. If pos
is a numeric vector and
seqnames
is a character vector or if pos
is a character
vector and seqnames
is missing, then these two arguments
are interpreted as described above. However, if pos
and seqnames
are both single strings (character vectors of
length 1), then pos
is interpreted as the name of the
feature data column that contains positional information and
seqnames
is interpreted as the feature data column that
contains the chromosome on which each variant is located.
Correspondingly, featureData(Z)[[pos]]
must be available
and must be a numeric vector. Correspondingly,
featureData(Z)[[seqnames]]
must be available and must be
a character vector (or a data type that can be cast to a character
vector).
returns an object of class GenotypeMatrix
Ulrich Bodenhofer
https://github.com/UBod/podkat
https://github.com/samtools/hts-specs
Obenchain, V., Lawrence, M., Carey, V., Gogarten, S., Shannon, P., and Morgan, M. (2014) VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics 30, 2076-2078. DOI: \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1093/bioinformatics/btu168")}.
GenotypeMatrix
,
dgCMatrix
, GRanges
## create a toy example
A <- matrix(rbinom(50, 2, prob=0.2), 5, 10)
sA <- as(A, "dgCMatrix")
pos <- sort(sample(1:10000, ncol(A)))
seqname <- "chr1"
## variant with 'GRanges' object
gr <- GRanges(seqnames=seqname, ranges=IRanges(start=pos, width=1))
gtm <- genotypeMatrix(A, gr)
gtm
as.matrix(gtm)
variantInfo(gtm)
MAF(gtm)
## variant with 'pos' and 'seqnames' object
genotypeMatrix(sA, pos, seqname)
## variant with 'seqname:pos' strings passed through 'pos' argument
spos <- paste(seqname, pos, sep=":")
spos
genotypeMatrix(sA, spos)
## read data from VCF file using 'readVcf()' from the 'VariantAnnotation'
## package
if (require(VariantAnnotation))
{
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
sp <- ScanVcfParam(info=NA, genome="GT", fixed=c("ALT", "FILTER"))
vcf <- readVcf(vcfFile, genome="hgA", param=sp)
rowRanges(vcf)
## call constructor for 'VCF' object
gtm <- genotypeMatrix(vcf)
gtm
variantInfo(gtm)
## alternatively, extract information from 'VCF' object and use
## variant with character matrix and 'GRanges' positions
## note that, in 'VCF' objects, rows correspond to variants and
## columns correspond to samples, therefore, we have to transpose the
## genotype
gt <- t(geno(vcf)$GT)
gt[1:5, 1:5]
gr <- rowRanges(vcf)
gtm <- genotypeMatrix(gt, gr)
as.matrix(gtm[1:20, 1:5, recomputeMAF=TRUE])
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.