suppressPackageStartupMessages({ library(terravar) library(ldblock) library(GenomicFiles) library(VariantAnnotation) library(snpStats) library(BiocStyle) library(ggplot2) })
The Terra system is a cloud-based environment for genomic data analysis. A number of examples of analysis of reference data are collected in a notebook library. The terravar package can be used in terra. terravar was conceived as a vehicle for comparing diverse approaches to analyzing genetic variants in cohorts.
In this section we will acquire a modest collection of SNP calls from chr17, and project the genotype configurations via principal components to expose population substructure.
Bioconductor's r Biocpkg("ldblock")
package includes
utilities for working with collections of VCF, typically
decomposed by chromosome.
library(ldblock) library(GenomicFiles) library(VariantAnnotation)
EBI has updated 1000 genomes calls.
We will interrogate a tabix-indexed VCF for
chr17 via HTTP. An object that organizes
multiple chromosomes is created using stack1kg
from the r Biocpkg("ldblock")
package.
st = stack1kg("17")
We use a ScanVcfParam
to define a slice of genome.
sp = ScanVcfParam(geno="GT", which=GRanges("17", IRanges(32e6,33.75e6))) myread = readVcfStack(st, param=sp) myread
We use David Clayton's `r Biocpkg("snpStats") package tools for compact representation of (possibly uncertain) genotype calls. This enables us to filter to SNP with MAF exceeding a given threshold.
library(snpStats) mymat = genotypeToSnpMatrix(myread) cs = col.summary(mymat[[1]]) sum(cs[,"MAF"]>.1, na.rm=TRUE)
kpsnp = which(cs[,"MAF"]>.1) cmat = matrix(0, nr=nrow(mymat[[1]]), nc=length(kpsnp)) for (i in seq_len(length(kpsnp))) { cmat[,i] = as(mymat[[1]][,kpsnp[i]], "numeric") } rownames(cmat) = rownames(mymat[[1]])
pp = prcomp(cmat) library(ggplot2) data("geog_1kg") rownames(geog_1kg) = geog_1kg[,1] newdf = data.frame(pp$x[,1:4], pop=geog_1kg[rownames(pp$x), "Population"], superpop = geog_1kg[rownames(pp$x), "superpop"])
ggplot(newdf, aes(x=PC1, y=PC2, colour=superpop)) + geom_point()
library(BiocSklearn) pp2 = getTransformed(skIncrPCA(cmat)) rownames(pp2) = rownames(pp$x) colnames(pp2) = paste0("PC", 1:ncol(pp2)) rownames(geog_1kg) = geog_1kg[,1] newdf2 = data.frame(pp2[,1:4], pop=geog_1kg[rownames(pp2), "Population"], superpop = geog_1kg[rownames(pp2), "superpop"]) ggplot(newdf2, aes(x=PC1, y=PC2, colour=superpop)) + geom_point()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.