Description Usage Arguments Details Value Author(s) See Also Examples
Variant location with respect to gene function
1 2 3 4 5 6 7 | locateVariants(query, subject, region, ...)
## S4 method for signature 'VCF,TxDb,VariantType'
locateVariants(query, subject, region, ...,
cache=new.env(parent=emptyenv()), ignore.strand=FALSE, asHits=FALSE)
## S4 method for signature 'GRanges,TxDb,VariantType'
locateVariants(query, subject, region, ...,
cache=new.env(parent=emptyenv()), ignore.strand=FALSE, asHits=FALSE)
|
query |
A IntegerRanges, GRanges or VCF object containing the variants. Metadata columns are allowed but ignored. NOTE: Zero-width ranges are treated as width-1 ranges; start values are decremented to equal the end value. |
subject |
A TxDb or |
region |
An instance of one of the 8 VariantType classes:
When using |
... |
Additional arguments passed to methods |
cache |
An |
ignore.strand |
A |
asHits |
A |
The ranges in query should reflect the position(s) of the
reference allele. For snps the range will be of width 1. For range
insertions or deletions the reference allele could be a sequence
such as GGTG in which case the width of the range should be 4.
Possible locations are ‘coding’, ‘intron’, ‘threeUTR’, ‘fiveUTR’, ‘intergenic’, ‘spliceSite’, or ‘promoter’.
Overlap operations for ‘coding’, ‘intron’, ‘threeUTR’, and ‘fiveUTR’ require variants to fall completely within the defined region to be classified as such.
To be classified as a ‘spliceSite’ the variant must overlap with any portion of the first 2 or last 2 nucleotides in an intron.
‘intergenic’ variants are ranges that do not fall within a
defined gene region. ‘transcripts by gene’ are extracted from
the annotation and overlapped with the variant positions. Variants with
no overlaps are classified as intergenic. When available, gene
IDs for the flanking genes are provided as PRECEDEID and
FOLLOWID. upstream and downstream arguments define
the acceptable distance from the query for the flanking genes.
PRECEDEID and FOLLOWID results are lists and contain all
genes that fall within the defined distance. See the examples for how
to compute the distance from ranges to PRECEDEID and FOLLOWID.
‘promoter’ variants fall within a specified range upstream and
downstream of the transcription start site. Ranges values can be set
with the upstream and downstream arguments when creating
the PromoterVariants() or AllVariants() classes.
The subject can be a TxDb or GRangesList
object. When using a GRangesList the type of data required
is driven by the VariantType class. Below is a description of
the appropriate GRangesList for each VariantType.
coding (CDS) by transcript
introns by transcript
five prime UTR by transcript
three prime UTR by transcript
transcripts by gene
introns by transcript
list of transcripts
no GRangeList method available
When processing multiple VCF files performance is enhanced by specifying
an environment as the cache argument. This cache is used to store
and reuse extracted components of the subject (TxDb) required by the
function. The first call to the function (i.e., processing the first
VCF file in a list of many) populates the cache; repeated calls
to locateVariants will access these objects from the cache vs
re-extracting the same information.
A GRanges object with a row for each variant-transcript match.
Strand of the output is from the subject hit
except in the case of IntergenicVariants. For intergenic, multiple precede
and follow gene ids are returned for each variant. When
ignore.strand=TRUE the return strand is * because
genes on both strands are considered and it is possible to have a mixture.
When ignore.strand=FALSE the strand will match the query
because only genes on the same strand are considered.
Metadata columns are LOCATION, QUERYID,
TXID, GENEID, PRECEDEID, FOLLOWID and
CDSID. Results are ordered by QUERYID, TXID and
GENEID. Columns are described in detail below.
LOCATIONPossible locations are ‘coding’, ‘intron’, ‘threeUTR’, ‘fiveUTR’, ‘intergenic’, ‘spliceSite’ and ‘promoter’.
To be classified as ‘coding’, ‘intron’, ‘threeUTR’ or ‘fiveUTR’ the variant must fall completely within the region.
‘intergenic’ variants do not fall within a transcript. The
‘GENEID’ for these positions are NA. Lists of flanking
genes that fall within the distance defined by upstream and
downstream are given as ‘PRECEDEID’ and ‘FOLLOWID’.
By default, the gene ID is returned in the ‘PRECEDEID’ and
‘FOLLOWID’ columns. To return the transcript ids instead set
idType = "tx" in the IntergenicVariants()
constructor.
A ‘spliceSite’ variant overlaps any portion of the first 2 or last 2 nucleotides of an intron.
LOCSTART, LOCENDGenomic position in LOCATION-centric coordinates.
If LOCATION is 'intron', these are intron-centric coordinates,
if LOCATION is 'coding' then cds-centric. All coordinates are
relative to the start of the transcript. SpliceSiteVariants,
IntergenicVariants and PromoterVariants have no formal
extraction 'by transcript' so for these variants LOCSTART and
LOCEND are NA. Coordinates are computed with mapToTranscripts;
see ?mapToTranscripts in the GenomicFeatures package for details.
QUERYIDThe QUERYID column provides a map back to the row in the
original query. If the query was a VCF object this
index corresponds to the row in the GRanges object returned by
the rowRanges accessor.
TXIDThe transcript id taken from the TxDb object.
CDSIDThe coding sequence id(s) taken from the TxDb object.
GENEIDThe gene id taken from the TxDb object.
PRECEDEIDIDs for all genes the query precedes within the defined
upstream and downstream distance. Only applicable
for ‘intergenic’ variants. By default this column contains gene ids;
to return transcript ids set idType = "tx" in
the IntergenicVariants constructor.
FOLLOWIDIDs for all genes the query follows within the defined
upstream and downstream distance. Only applicable
for ‘intergenic’ variants. By default this column contains gene ids;
to return transcript ids set idType = "tx" in
the IntergenicVariants constructor.
All ID values will be ‘NA’ for variants with a location of
transcript_region or NA.
Valerie Obenchain
The readVcf function.
The predictCoding function.
The promoters function on the intra-range-methods man page in the GenomicRanges package.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 | library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## ---------------------------------------------------------------------
## Variants in all gene regions
## ---------------------------------------------------------------------
## Read variants from a VCF file.
fl <- system.file("extdata", "gl_chr1.vcf",
package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## Often the seqlevels in the VCF file do not match those in the TxDb.
head(seqlevels(vcf))
head(seqlevels(txdb))
intersect(seqlevels(vcf), seqlevels(txdb))
## Rename seqlevels with renameSeqlevesl().
vcf <- renameSeqlevels(vcf, paste0("chr", seqlevels(vcf)))
## Confirm.
intersect(seqlevels(vcf), seqlevels(txdb))
## Overlaps for all possible variant locations.
loc_all <- locateVariants(vcf, txdb, AllVariants())
table(loc_all$LOCATION)
## ---------------------------------------------------------------------
## Variants in intergenic regions
## ---------------------------------------------------------------------
## Intergenic variants do not overlap a gene range in the
## annotation and therefore 'GENEID' is always NA. Flanking genes
## that fall within the 'upstream' and 'downstream' distances are
## reported as PRECEDEID and FOLLOWID.
region <- IntergenicVariants(upstream=70000, downstream=70000)
loc_int <- locateVariants(vcf, txdb, region)
mcols(loc_int)[c("LOCATION", "PRECEDEID", "FOLLOWID")]
## Distance to the flanking genes can be computed for variants that
## have PRECEDEID(s) or FOLLOWID(s). Each variant can have multiple
## flanking id's so we first expand PRECEDEID and the corresponding
## variant ranges.
p_ids <- unlist(loc_int$PRECEDEID, use.names=FALSE)
exp_ranges <- rep(loc_int, elementNROWS(loc_int$PRECEDEID))
## Compute distances with the distance method defined in GenomicFeatures.
## Help page can be found at ?`distance,GenomicRanges,TxDb-method`.
## The method returns NA for ids that cannot be collapsed into a single
## range (e.g., genes with ranges on multiple chromosomes).
distance(exp_ranges, txdb, id=p_ids, type="gene")
## To search for distance by transcript id set idType='tx' in the
## IntergenicVariants() constructor, e.g.,
## locateVariants(vcf, txdb, region=IntergenicVariants(idType="tx"))
## Unlist ids and expand ranges as before to get p_ids and exp_ranges.
## Then call distance() with type = "tx":
## distance(exp_ranges, txdb, id=p_ids, type="tx")
## ---------------------------------------------------------------------
## GRangesList as subject
## ---------------------------------------------------------------------
## When 'subject' is a GRangesList the GENEID is unavailable and
## will always be reported as NA. This is because the GRangesList
## objects are extractions of region-by-transcript, not region-by-gene.
## Not run:
cdsbytx <- cdsBy(txdb)
locateVariants(vcf, cdsbytx, CodingVariants())
intbytx <- intronsByTranscript(txdb)
locateVariants(vcf, intbytx, IntronVariants())
## End(Not run)
## ---------------------------------------------------------------------
## Using the cache
## ---------------------------------------------------------------------
## When processing multiple VCF files, the 'cache' can be used
## to store the extracted components of the TxDb
## (i.e., cds by tx, introns by tx etc.). This avoids having to
## re-extract these GRangesLists during each loop.
## Not run:
myenv <- new.env()
files <- list(vcf1, vcf2, vcf3)
lapply(files,
function(fl) {
vcf <- readVcf(fl, "hg19")
## modify seqlevels to match TxDb
seqlevels(vcf_mod) <- paste0("chr", seqlevels(vcf))
locateVariants(vcf_mod, txdb, AllVariants(), cache=myenv)
})
## End(Not run)
## ---------------------------------------------------------------------
## Parallel implmentation
## ---------------------------------------------------------------------
## Not run:
library(BiocParallel)
## A connection to a TxDb object is established when
## the package is loaded. Because each process reading from an
## sqlite db must have a unique connection the TxDb
## object cannot be passed as an argument when running in
## parallel. Instead the package must be loaded on each worker.
## The overhead of the multiple loading may defeat the
## purpose of running the job in parallel. An alternative is
## to instead pass the appropriate GRangesList as an argument.
## The details section on this man page under the heading
## 'Subject as GRangesList' explains what GRangesList is
## appropriate for each variant type.
## A. Passing a GRangesList:
fun <- function(x, subject, ...)
locateVariants(x, subject, IntronVariants())
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
grl <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene)
mclapply(c(vcf, vcf), fun, subject=grl)
## B. Passing a TxDb:
## Forking:
## In the case of forking, the TxDb cannot be loaded
## in the current workspace.
## To detach the NAMESPACE:
## unloadNamespace("TxDb.Hsapiens.UCSC.hg19.knownGene")
fun <- function(x) {
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
locateVariants(x, TxDb.Hsapiens.UCSC.hg19.knownGene,
IntronVariants())
}
mclapply(c(vcf, vcf), fun)
## Clusters:
cl <- makeCluster(2, type = "SOCK")
fun <- function(query, subject, region) {
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
locateVariants(query, TxDb.Hsapiens.UCSC.hg19.knownGene, region)
}
parLapply(cl, c(vcf, vcf), fun, region=IntronVariants())
stopCluster(cl)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.