In vjcitn/terravar: Explore human genetic variation data in Terra

suppressPackageStartupMessages({
library(terravar)
library(ldblock)
library(GenomicFiles)
library(VariantAnnotation)
library(snpStats)
library(BiocStyle)
library(ggplot2)
})

Overview

The Terra system is a cloud-based environment for genomic data analysis. A number of examples of analysis of reference data are collected in a notebook library. The terravar package can be used in terra. terravar was conceived as a vehicle for comparing diverse approaches to analyzing genetic variants in cohorts.

Population stratification in 1000 genomes

In this section we will acquire a modest collection of SNP calls from chr17, and project the genotype configurations via principal components to expose population substructure.

Create references to the AWS VCF repository

Bioconductor's r Biocpkg("ldblock") package includes utilities for working with collections of VCF, typically decomposed by chromosome.

library(ldblock)
library(GenomicFiles)
library(VariantAnnotation)

Read a slice of VCF corresponding to a region of chr17

EBI has updated 1000 genomes calls. We will interrogate a tabix-indexed VCF for chr17 via HTTP. An object that organizes multiple chromosomes is created using stack1kg from the r Biocpkg("ldblock") package.

st = stack1kg("17")

We use a ScanVcfParam to define a slice of genome.

sp = ScanVcfParam(geno="GT", 
    which=GRanges("17", IRanges(32e6,33.75e6)))
myread = readVcfStack(st, param=sp) 
myread

Transform the VCF content to rare allele counts

We use David Clayton's `r Biocpkg("snpStats") package tools for compact representation of (possibly uncertain) genotype calls. This enables us to filter to SNP with MAF exceeding a given threshold.

library(snpStats)
mymat = genotypeToSnpMatrix(myread)
cs = col.summary(mymat[[1]])
sum(cs[,"MAF"]>.1, na.rm=TRUE)

Build a matrix of allele counts

kpsnp = which(cs[,"MAF"]>.1)
cmat = matrix(0, nr=nrow(mymat[[1]]), nc=length(kpsnp))
for (i in seq_len(length(kpsnp))) {
  cmat[,i] = as(mymat[[1]][,kpsnp[i]], "numeric")
  }
rownames(cmat) = rownames(mymat[[1]])

Compute PCA and plot

pp = prcomp(cmat)
library(ggplot2)
data("geog_1kg")
rownames(geog_1kg) = geog_1kg[,1]
newdf = data.frame(pp$x[,1:4], pop=geog_1kg[rownames(pp$x), "Population"],
    superpop = geog_1kg[rownames(pp$x), "superpop"])

ggplot(newdf, aes(x=PC1, y=PC2, colour=superpop)) + geom_point()

Use BiocSklearn for faster approximate PCA

library(BiocSklearn)
pp2 = getTransformed(skIncrPCA(cmat))
rownames(pp2) = rownames(pp$x)
colnames(pp2) = paste0("PC", 1:ncol(pp2))
rownames(geog_1kg) = geog_1kg[,1]
newdf2 = data.frame(pp2[,1:4], pop=geog_1kg[rownames(pp2), "Population"],
    superpop = geog_1kg[rownames(pp2), "superpop"])
ggplot(newdf2, aes(x=PC1, y=PC2, colour=superpop)) + geom_point()