Annotating Variants

BiocStyle::markdown()
# Ensure that any errors cause the Vignette build to fail.
library(knitr)
opts_chunk$set(error=FALSE)
apiKey <- Sys.getenv("GOOGLE_API_KEY")
if (nchar(apiKey) == 0) {
  warning(paste("To build this vignette, please setup the environment variable",
                "GOOGLE_API_KEY with the public API key from your Google",
                "Developer Console before loading the GoogleGenomics package,",
                "or run GoogleGenomics::authenticate."))
  knitr::knit_exit()
}

Working with Variants

Google Genomics implements the GA4GH variants API and this R package can retrieve data from that implementation. For more detail, see https://cloud.google.com/genomics/v1beta2/reference/variants

library(GoogleGenomics)
# This vignette is authenticated on package load from the env variable GOOGLE_API_KEY.
# When running interactively, call the authenticate method.
# ?authenticate

By default, this function retrieves variants for a small genomic region from the 1,000 Genomes phase 1 variants.

variants <- getVariants()
length(variants)

We can see that r length(variants) individual variants were returned and that the JSON response was deserialized into an R list object:

class(variants)
mode(variants)

The top level field names are:

names(variants[[1]])

The variant contains nested calls:

length(variants[[1]]$calls)

With top level call field names:

names(variants[[1]]$calls[[1]])

And examining a call for a particular variant:

variants[[1]]$referenceName
variants[[1]]$start
variants[[1]]$alternateBases
variants[[1]]$calls[[1]]

This is good, but this data becomes much more useful when it is converted to Bioconductor data types. For example, we can convert variants in this list representation to VRanges from r Biocpkg("VariantAnnotation"):

variantsToVRanges(variants)

Annotating Variants

Next let's use package r Biocpkg("VariantAnnotation") to annotate some specific 1,000 Genomes Phase 1 variants.

library(VariantAnnotation)

Note that the parameters start and end are expressed in 0-based coordinates per the GA4GH specification but the Bioconductor data type converters in r Biocpkg("GoogleGenomics"), by default, transform the returned data to 1-based coordinates.

granges <- getVariants(variantSetId="10473108253681171589",
                       chromosome="22",
                       start=50300077,
                       end=50303000,
                       converter=variantsToGRanges)

Now locate the protein coding variants in each:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
granges_locations <- locateVariants(granges, txdb, CodingVariants())
granges_locations

And predict the effect of the protein coding variants:

library(BSgenome.Hsapiens.UCSC.hg19)
granges_coding <- predictCoding(rep(granges, elementNROWS(granges$ALT)),
                                txdb,
                                seqSource=Hsapiens,
                                varAllele=unlist(granges$ALT, use.names=FALSE))

granges_coding

Finally, add gene information:

library(org.Hs.eg.db)
annots <- select(org.Hs.eg.db,
                 keys=granges_coding$GENEID,
                 keytype="ENTREZID",
                 columns=c("SYMBOL", "GENENAME","ENSEMBL"))
cbind(elementMetadata(granges_coding), annots)

Provenance

sessionInfo()


Try the GoogleGenomics package in your browser

Any scripts or data that you put into this service are public.

GoogleGenomics documentation built on May 2, 2019, 12:54 a.m.