locateVariants | R Documentation |
Variant location with respect to gene function
locateVariants(query, subject, region, ...)
## S4 method for signature 'VCF,TxDb,VariantType'
locateVariants(query, subject, region, ...,
cache=new.env(parent=emptyenv()), ignore.strand=FALSE, asHits=FALSE)
## S4 method for signature 'GRanges,TxDb,VariantType'
locateVariants(query, subject, region, ...,
cache=new.env(parent=emptyenv()), ignore.strand=FALSE, asHits=FALSE)
query |
A IntegerRanges, GRanges or VCF object containing the variants. Metadata columns are allowed but ignored. NOTE: Zero-width ranges are treated as width-1 ranges; start values are decremented to equal the end value. |
subject |
A TxDb or |
region |
An instance of one of the 8 VariantType classes:
When using |
... |
Additional arguments passed to methods |
cache |
An |
ignore.strand |
A |
asHits |
A |
The ranges in query
should reflect the position(s) of the
reference allele. For snps the range will be of width 1. For range
insertions or deletions the reference allele could be a sequence
such as GGTG in which case the width of the range should be 4.
Possible locations are ‘coding’, ‘intron’, ‘threeUTR’, ‘fiveUTR’, ‘intergenic’, ‘spliceSite’, or ‘promoter’.
Overlap operations for ‘coding’, ‘intron’, ‘threeUTR’, and ‘fiveUTR’ require variants to fall completely within the defined region to be classified as such.
To be classified as a ‘spliceSite’ the variant must overlap with any portion of the first 2 or last 2 nucleotides in an intron.
‘intergenic’ variants are ranges that do not fall within a
defined gene region. ‘transcripts by gene’ are extracted from
the annotation and overlapped with the variant positions. Variants with
no overlaps are classified as intergenic
. When available, gene
IDs for the flanking genes are provided as PRECEDEID
and
FOLLOWID
. upstream
and downstream
arguments define
the acceptable distance from the query for the flanking genes.
PRECEDEID
and FOLLOWID
results are lists and contain all
genes that fall within the defined distance. See the examples for how
to compute the distance from ranges to PRECEDEID and FOLLOWID.
‘promoter’ variants fall within a specified range upstream and
downstream of the transcription start site. Ranges values can be set
with the upstream
and downstream
arguments when creating
the PromoterVariants()
or AllVariants()
classes.
The subject
can be a TxDb
or GRangesList
object. When using a GRangesList
the type of data required
is driven by the VariantType
class. Below is a description of
the appropriate GRangesList
for each VariantType
.
coding (CDS) by transcript
introns by transcript
five prime UTR by transcript
three prime UTR by transcript
transcripts by gene
introns by transcript
list of transcripts
no GRangeList method available
When processing multiple VCF files performance is enhanced by specifying
an environment as the cache
argument. This cache is used to store
and reuse extracted components of the subject (TxDb) required by the
function. The first call to the function (i.e., processing the first
VCF file in a list of many) populates the cache; repeated calls
to locateVariants
will access these objects from the cache vs
re-extracting the same information.
A GRanges
object with a row for each variant-transcript match.
Strand of the output is from the subject
hit
except in the case of IntergenicVariants. For intergenic, multiple precede
and follow gene ids are returned for each variant. When
ignore.strand=TRUE
the return strand is *
because
genes on both strands are considered and it is possible to have a mixture.
When ignore.strand=FALSE
the strand will match the query
because only genes on the same strand are considered.
Metadata columns are LOCATION
, QUERYID
,
TXID
, GENEID
, PRECEDEID
, FOLLOWID
and
CDSID
. Results are ordered by QUERYID
, TXID
and
GENEID
. Columns are described in detail below.
LOCATION
Possible locations are ‘coding’, ‘intron’, ‘threeUTR’, ‘fiveUTR’, ‘intergenic’, ‘spliceSite’ and ‘promoter’.
To be classified as ‘coding’, ‘intron’, ‘threeUTR’ or ‘fiveUTR’ the variant must fall completely within the region.
‘intergenic’ variants do not fall within a transcript. The
‘GENEID’ for these positions are NA
. Lists of flanking
genes that fall within the distance defined by upstream
and
downstream
are given as ‘PRECEDEID’ and ‘FOLLOWID’.
By default, the gene ID is returned in the ‘PRECEDEID’ and
‘FOLLOWID’ columns. To return the transcript ids instead set
idType = "tx"
in the IntergenicVariants()
constructor.
A ‘spliceSite’ variant overlaps any portion of the first 2 or last 2 nucleotides of an intron.
LOCSTART, LOCEND
Genomic position in LOCATION-centric coordinates.
If LOCATION is 'intron', these are intron-centric coordinates,
if LOCATION is 'coding' then cds-centric. All coordinates are
relative to the start of the transcript. SpliceSiteVariants,
IntergenicVariants and PromoterVariants have no formal
extraction 'by transcript' so for these variants LOCSTART and
LOCEND are NA. Coordinates are computed with mapToTranscripts
;
see ?mapToTranscripts
in the GenomicFeatures package for details.
QUERYID
The QUERYID
column provides a map back to the row in the
original query
. If the query
was a VCF
object this
index corresponds to the row in the GRanges
object returned by
the rowRanges
accessor.
TXID
The transcript id taken from the TxDb
object.
CDSID
The coding sequence id(s) taken from the TxDb
object.
GENEID
The gene id taken from the TxDb
object.
PRECEDEID
IDs for all genes the query precedes within the defined
upstream
and downstream
distance. Only applicable
for ‘intergenic’ variants. By default this column contains gene ids;
to return transcript ids set idType = "tx"
in
the IntergenicVariants
constructor.
FOLLOWID
IDs for all genes the query follows within the defined
upstream
and downstream
distance. Only applicable
for ‘intergenic’ variants. By default this column contains gene ids;
to return transcript ids set idType = "tx"
in
the IntergenicVariants
constructor.
All ID values will be ‘NA’ for variants with a location of
transcript_region
or NA
.
Valerie Obenchain
The readVcf function.
The predictCoding function.
The promoters function on the intra-range-methods man page in the GenomicRanges package.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## ---------------------------------------------------------------------
## Variants in all gene regions
## ---------------------------------------------------------------------
## Read variants from a VCF file.
fl <- system.file("extdata", "gl_chr1.vcf",
package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## Often the seqlevels in the VCF file do not match those in the TxDb.
head(seqlevels(vcf))
head(seqlevels(txdb))
intersect(seqlevels(vcf), seqlevels(txdb))
## Rename seqlevels with renameSeqlevesl().
vcf <- renameSeqlevels(vcf, paste0("chr", seqlevels(vcf)))
## Confirm.
intersect(seqlevels(vcf), seqlevels(txdb))
## Overlaps for all possible variant locations.
loc_all <- locateVariants(vcf, txdb, AllVariants())
table(loc_all$LOCATION)
## ---------------------------------------------------------------------
## Variants in intergenic regions
## ---------------------------------------------------------------------
## Intergenic variants do not overlap a gene range in the
## annotation and therefore 'GENEID' is always NA. Flanking genes
## that fall within the 'upstream' and 'downstream' distances are
## reported as PRECEDEID and FOLLOWID.
region <- IntergenicVariants(upstream=70000, downstream=70000)
loc_int <- locateVariants(vcf, txdb, region)
mcols(loc_int)[c("LOCATION", "PRECEDEID", "FOLLOWID")]
## Distance to the flanking genes can be computed for variants that
## have PRECEDEID(s) or FOLLOWID(s). Each variant can have multiple
## flanking id's so we first expand PRECEDEID and the corresponding
## variant ranges.
p_ids <- unlist(loc_int$PRECEDEID, use.names=FALSE)
exp_ranges <- rep(loc_int, elementNROWS(loc_int$PRECEDEID))
## Compute distances with the distance method defined in GenomicFeatures.
## Help page can be found at ?`distance,GenomicRanges,TxDb-method`.
## The method returns NA for ids that cannot be collapsed into a single
## range (e.g., genes with ranges on multiple chromosomes).
distance(exp_ranges, txdb, id=p_ids, type="gene")
## To search for distance by transcript id set idType='tx' in the
## IntergenicVariants() constructor, e.g.,
## locateVariants(vcf, txdb, region=IntergenicVariants(idType="tx"))
## Unlist ids and expand ranges as before to get p_ids and exp_ranges.
## Then call distance() with type = "tx":
## distance(exp_ranges, txdb, id=p_ids, type="tx")
## ---------------------------------------------------------------------
## GRangesList as subject
## ---------------------------------------------------------------------
## When 'subject' is a GRangesList the GENEID is unavailable and
## will always be reported as NA. This is because the GRangesList
## objects are extractions of region-by-transcript, not region-by-gene.
## Not run:
cdsbytx <- cdsBy(txdb)
locateVariants(vcf, cdsbytx, CodingVariants())
intbytx <- intronsByTranscript(txdb)
locateVariants(vcf, intbytx, IntronVariants())
## End(Not run)
## ---------------------------------------------------------------------
## Using the cache
## ---------------------------------------------------------------------
## When processing multiple VCF files, the 'cache' can be used
## to store the extracted components of the TxDb
## (i.e., cds by tx, introns by tx etc.). This avoids having to
## re-extract these GRangesLists during each loop.
## Not run:
myenv <- new.env()
files <- list(vcf1, vcf2, vcf3)
lapply(files,
function(fl) {
vcf <- readVcf(fl, "hg19")
## modify seqlevels to match TxDb
seqlevels(vcf_mod) <- paste0("chr", seqlevels(vcf))
locateVariants(vcf_mod, txdb, AllVariants(), cache=myenv)
})
## End(Not run)
## ---------------------------------------------------------------------
## Parallel implmentation
## ---------------------------------------------------------------------
## Not run:
library(BiocParallel)
## A connection to a TxDb object is established when
## the package is loaded. Because each process reading from an
## sqlite db must have a unique connection the TxDb
## object cannot be passed as an argument when running in
## parallel. Instead the package must be loaded on each worker.
## The overhead of the multiple loading may defeat the
## purpose of running the job in parallel. An alternative is
## to instead pass the appropriate GRangesList as an argument.
## The details section on this man page under the heading
## 'Subject as GRangesList' explains what GRangesList is
## appropriate for each variant type.
## A. Passing a GRangesList:
fun <- function(x, subject, ...)
locateVariants(x, subject, IntronVariants())
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
grl <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene)
mclapply(c(vcf, vcf), fun, subject=grl)
## B. Passing a TxDb:
## Forking:
## In the case of forking, the TxDb cannot be loaded
## in the current workspace.
## To detach the NAMESPACE:
## unloadNamespace("TxDb.Hsapiens.UCSC.hg19.knownGene")
fun <- function(x) {
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
locateVariants(x, TxDb.Hsapiens.UCSC.hg19.knownGene,
IntronVariants())
}
mclapply(c(vcf, vcf), fun)
## Clusters:
cl <- makeCluster(2, type = "SOCK")
fun <- function(query, subject, region) {
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
locateVariants(query, TxDb.Hsapiens.UCSC.hg19.knownGene, region)
}
parLapply(cl, c(vcf, vcf), fun, region=IntronVariants())
stopCluster(cl)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.