In vjcitn/survey1kg: tools for surveying and querying 1000 genomes

Introduction

The 1000 genomes data, archived in Amazon S3, is a good first proving ground for large scale data exploration and analysis in genomics. The data have evolved and grown, and various representations are available at various levels of resolution.

This package includes software to help find and interrogate 1000 genomes resources. Here are a couple of illustrative examples.

Example 1: acquiring README documents

There are a number of short descriptive documents at the top level of the bucket.

library(survey1kg)
get_readme(justList=TRUE)

The default action is to get the description of populations, and 'cat' it to standard output.

get_readme()

Example 2: Enumerating bucket contents

The aws.s3 package will traverse the entire bucket. By default it acquires information on 1000 elements.

b1list = get_bucket(bucket = '1000genomes')
length(b1list)
sapply(b1list[1:10], "[[", "Key")

We can get details on known components that are deeper in the bucket hierarchy using the 'prefix' parameter. We don't know how much information is managed on cell line NA12878, so we pass a high value of max.

b2list = get_bucket(bucket = '1000genomes', prefix='phase3/data/NA12878', max=10000)
length(b2list)
ks = sapply(b2list, "[[", "Key")
table(sapply(strsplit(ks, "/"), "[", 4))

Let's press on to see what is available for exome alignments for this sample:

b2list = get_bucket(bucket = '1000genomes', 
  prefix='phase3/data/NA12878/exome_alignment')
ks = sapply(b2list, "[[", "Key")
as.character(gsub("phase3/data/NA12878/exome_alignment/", "", ks))

Example 3: Verifying data availability for all samples

We used s3cmd to list the sample names in the phase3/data subfolder.

data(phase3_ids)
head(phase3_ids)
ptmp = function(id) sub("%%ID%%", id, "phase3/data/%%ID%%/high_coverage_alignment/")
prefs = sapply(phase3_ids[1:10], function(x) ptmp(x))
chks = lapply(prefs, function(x) get_bucket(bucket = '1000genomes', prefix=x))
sapply(chks, length)

This shows that not all phase3 samples have high coverage alignments. How many do?

Some tables of interest

igsr_samples

The data portal offers a listing of 3900 sample records.

data(igsr_samples)
head(igsr_samples)
attr(igsr_samples, "src")

It is not clear how to map some of them to the phase3_ids, that are derived from querying the bucket directly.

noo = which(!(igsr_samples[,1] %in% phase3_ids)) 
table(igsr_samples[noo,"Population.code"])

relateds

Some samples were found to be related to others.

data(relateds)
relateds

Getting URLs for sequencing data

The overall repository structure and diversity of file names, which often include metadata about date of analysis, creates difficulty for on-the-fly programmatic interrogation. A start at generating some useful URLs for low coverage DNA sequencing data is illustrated here. You have to know relevant sample names and dates for this to work. We may want to precompute the dates from a one-time search.

demlow = lowcov_bam_urls(c("NA12878", "NA12812"), c("20121211", "20130415"))

We can rapidly grab the header of the first BAM file referenced here. It includes metadata about the process generating the alignment data.

suppressPackageStartupMessages({ 
 library(Rsamtools)
})
h78 = scanBamHeader(demlow[1])
h78[[1]][[2]][[91]]
h78[[1]][[2]][[102]]
h78[[1]][[2]][[103]]

Obtaining URLs for exome sequences

In this section we will show how to drill into aws.s3 survey of phase3 files, which is very cumbersome, and we would like a better approach.

b1list = get_bucket(bucket = '1000genomes', prefix="phase3/data/", max=10000)
st = sapply(b1list, "[[", "Key") 
pa = grep("\\.mapped.*exome.*bam$", st, value=TRUE)

With this code chunk we find 181 exome sequence files. They are saved in exomeBam_181.rda in the data part of this package.

Obtaining sequence content and quality

data(exomeBam_181)
library(erma)
orm = range(genemodel("ORMDL3"))
seqlevelsStyle(orm) = "NCBI"  # match the BAM
library(Rsamtools)
mypar = ScanBamParam(which=orm, what=c("seq", "qual"))
al1 = readGAlignments(exomeBam181[1], param=mypar)
al1