predictCoding-methods: Predict amino acid coding changes for variants
In VariantAnnotation: Annotation of Genetic Variants

Description Usage Arguments Details Value Author(s) References See Also Examples

Predict amino acid coding changes for variants a coding regions

## S4 method for signature 'CollapsedVCF,TxDb,ANY,missing'
predictCoding(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
## S4 method for signature 'ExpandedVCF,TxDb,ANY,missing'
predictCoding(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
## S4 method for signature 'IntegerRanges,TxDb,ANY,DNAStringSet'
predictCoding(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
## S4 method for signature 'GRanges,TxDb,ANY,DNAStringSet'
predictCoding(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
## S4 method for signature 'VRanges,TxDb,ANY,missing'
predictCoding(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)

`query`	A VCF, IntegerRanges, GRanges or `VRanges` instance containing the variants to be annotated. If `query` is a IntegerRanges or `VRanges` it is coerced to a GRanges. If a VCF is provided the `GRanges` returned by the `rowRanges()` accessor will be used. All metadata columns are ignored. When `query` is not a `VCF` object a `varAllele` must be provided. The `varAllele` must be a `DNAStringSet` the same length as the `query`. If there are multiple alternate alleles per variant the `query` must be expanded to one row per each alternate allele. See examples. NOTE: Variants are expected to conform to the VCF specs as described on the 1000 Genomes page (see references). Indels must include the reference allele; zero-width ranges are not valid and return no matches.
`subject`	A TxDb object that serves as the annotation. GFF files can be converted to TxDb objects with `makeTxDbFromGFF()` in the `GenomicFeatures` package.
`seqSource`	A `BSgenome` instance or an FaFile to be used for sequence extraction.
`varAllele`	A DNAStringSet containing the variant (alternate) alleles. The length of `varAllele` must equal the length of `query`. Missing values are represented by a zero width empty element. Ranges with missing `varAllele` values are ignored; those with an ‘N’ character are not translated. When the `query` is a `VCF` object the `varAllele` argument will be missing. This value is taken internally from the `VCF` with `alt(<VCF>)`.
`...`	Additional arguments passed to methods. Arguments `genetic.code` and `if.fuzzy.codon` are supported for the `translate` function.
`ignore.strand`	A `logical` indicating if strand should be ignored when performing overlaps. When `ignore.strand == TRUE` the `query` strand is set to '*' and can overlap with any strand of the `subject`. The return `GRanges` reflects the strand of the `subject` hit, however, the positions and alleles reported are computed as if both were from the '+' strand. `ignore.strand == FALSE` requires the `query` and `subject` to have compatible strand. Again the return `GRanges` reports the strand of the `subject` hit but in this case positions and alleles are computed according to the strand of the `subject`.

This function returns the amino acid coding for variants that fall completely ‘within’ a coding region. The reference sequences are taken from a fasta file or BSgenome. The width of the reference is determined from the start position and width of the range in the query. For guidance on how to represent an insertion, deletion or substitution see the 1000 Genomes VCF format (references).

Variant alleles are taken from the varAllele when supplied. When the query is a VCF object the varAllele will be missing. This value is taken internally from the VCF with alt(<VCF>). The variant allele is substituted into the reference sequences and transcribed. Transcription only occurs if the substitution, insertion or deletion results in a new sequence length divisible by 3.

When the query is an unstranded (*) GRanges predictCoding will attempt to find overlaps on both the positive and negative strands of the subject. When the subject hit is on the negative strand the return varAllele is reverse complemented. The strand of the returned GRanges represents the strand of the subject hit.

A GRanges with a row for each variant-transcript match. The result includes only variants that fell within coding regions. The strand of the output GRanges represents the strand of the subject hit.

At a minimum, the metadata columns (accessible with mcols) include,

varAllele: Variant allele. This value is reverse complemented for an unstranded query that overlaps a subject on the negative strand.
QUERYID: Map back to the row in the original query
TXID: Internal transcript id from the annotation
CDSID: Internal coding region id from the annotation
GENEID: Internal gene id from the annotation
CDSLOC: Variant location in coding region-based coordinates. This position is relative to the start of the coding (cds) region defined in the subject annotation.
PROTEINLOC: Variant codon triplet location in coding region-based coordinates. This position is relative to the start of the coding (cds) region defined in the subject annotation.
CONSEQUENCE: Possible values are ‘synonymous’, ‘nonsynonymous’, ‘frameshift’, ‘nonsense’ and ‘not translated’. Variant sequences are translated only when the codon sequence is a multiple of 3. The value will be ‘frameshift’ when a sequence is of incompatible length. ‘not translated’ is used when varAllele is missing or there is an ‘N’ in the sequence. ‘nonsense’ is used for premature stop codons.
REFCODON: The reference codon sequence. This range is typically greater than the width of the range in the GRanges because it includes all codons involved in the sequence modification. If the reference sequence is of width 2 but the alternate allele is of width 4 then at least two codons must be included in the REFSEQ.
VARCODON: This sequence is the result of inserting, deleting or replacing the position(s) in the reference sequence alternate allele. If the result of this substitution is not a multiple of 3 it will not be translated.
REFAA: The reference amino acid column contains the translated REFSEQ. When translation is not possible this value is missing.
VARAA: The variant amino acid column contains the translated VARSEQ. When translation is not possible this value is missing.

Michael Lawrence and Valerie Obenchain

http://www.1000genomes.org/wiki/analysis/variant-call-format/ http://vcftools.sourceforge.net/specs.html

readVcf, locateVariants, refLocsToLocalLocs getTranscriptSeqs

  library(BSgenome.Hsapiens.UCSC.hg19)
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene 

  ## ----------------------------
  ## VCF object as query 
  ## ----------------------------
  ## Read variants from a VCF file 
  fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
  vcf <- readVcf(fl, "hg19")

  ## Rename seqlevels in the VCF object to match those in the TxDb.
  vcf <- renameSeqlevels(vcf, "chr22")
  ## Confirm common seqlevels
  intersect(seqlevels(vcf), seqlevels(txdb))
 
  ## When 'query' is a VCF object the varAllele argument is missing.
  coding1 <- predictCoding(vcf, txdb, Hsapiens)
  head(coding1, 3)

  ## Exon-centric or cDNA locations:
  exonsbytx <- exonsBy(txdb, "tx")

  cDNA <- mapToTranscripts(coding1, exonsbytx)
  mcols(cDNA)$TXID <- names(exonsbytx)[mcols(cDNA)$transcriptsHits]
  cDNA <- cDNA[mcols(cDNA)$TXID == mcols(coding1)$TXID[mcols(cDNA)$xHits]]

  ## Make sure cDNA is parallel to coding1
  stopifnot(identical(mcols(cDNA)$xHits, seq_along(coding1)))

  coding1$cDNA <- ranges(cDNA)

  ## ----------------------------
  ## GRanges object as query 
  ## ----------------------------
  ## A GRanges can also be used as the 'query'. The seqlevels in the VCF
  ## were adjusted in previous example so the GRanges extracted with
  ## has the correct seqlevels.
  rd <- rowRanges(vcf)
 
  ## The GRanges must be expanded to have one row per alternate allele. 
  ## Variants 1, 2 and 10 have two alternate alleles.
  altallele <- alt(vcf)
  eltROWS <- elementNROWS(altallele)
  rd_exp <- rep(rd, eltROWS)
 
  ## Call predictCoding() with the expanded GRanges and the unlisted
  ## alternate allele as the 'varAllele'.
  coding2 <- predictCoding(rd_exp, txdb, Hsapiens, unlist(altallele))