The 1000 genomes data, archived in Amazon S3, is a good first proving ground for large scale data exploration and analysis in genomics. The data have evolved and grown, and various representations are available at various levels of resolution.
This package includes software to help find and interrogate 1000 genomes resources. Here are a couple of illustrative examples.
There are a number of short descriptive documents at the top level of the bucket.
library(survey1kg) get_readme(justList=TRUE)
The default action is to get the description of populations, and 'cat' it to standard output.
get_readme()
The aws.s3
package will traverse the entire bucket. By default it acquires information
on 1000 elements.
b1list = get_bucket(bucket = '1000genomes') length(b1list) sapply(b1list[1:10], "[[", "Key")
We can get details on known components that are deeper in the bucket hierarchy
using the 'prefix' parameter. We don't know how much information is
managed on cell line NA12878, so we pass a high value of max
.
b2list = get_bucket(bucket = '1000genomes', prefix='phase3/data/NA12878', max=10000) length(b2list) ks = sapply(b2list, "[[", "Key") table(sapply(strsplit(ks, "/"), "[", 4))
Let's press on to see what is available for exome alignments for this sample:
b2list = get_bucket(bucket = '1000genomes', prefix='phase3/data/NA12878/exome_alignment') ks = sapply(b2list, "[[", "Key") as.character(gsub("phase3/data/NA12878/exome_alignment/", "", ks))
We used s3cmd to list the sample names in the phase3/data subfolder.
data(phase3_ids) head(phase3_ids) ptmp = function(id) sub("%%ID%%", id, "phase3/data/%%ID%%/high_coverage_alignment/") prefs = sapply(phase3_ids[1:10], function(x) ptmp(x)) chks = lapply(prefs, function(x) get_bucket(bucket = '1000genomes', prefix=x)) sapply(chks, length)
This shows that not all phase3 samples have high coverage alignments. How many do?
The data portal offers a listing of 3900 sample records.
data(igsr_samples) head(igsr_samples) attr(igsr_samples, "src")
It is not clear
how to map some of them to the phase3_ids
, that are derived from
querying the bucket directly.
noo = which(!(igsr_samples[,1] %in% phase3_ids)) table(igsr_samples[noo,"Population.code"])
Some samples were found to be related to others.
data(relateds)
relateds
The overall repository structure and diversity of file names, which often include metadata about date of analysis, creates difficulty for on-the-fly programmatic interrogation. A start at generating some useful URLs for low coverage DNA sequencing data is illustrated here. You have to know relevant sample names and dates for this to work. We may want to precompute the dates from a one-time search.
demlow = lowcov_bam_urls(c("NA12878", "NA12812"), c("20121211", "20130415"))
We can rapidly grab the header of the first BAM file referenced here. It includes metadata about the process generating the alignment data.
suppressPackageStartupMessages({ library(Rsamtools) }) h78 = scanBamHeader(demlow[1]) h78[[1]][[2]][[91]] h78[[1]][[2]][[102]] h78[[1]][[2]][[103]]
In this section we will show how to drill into aws.s3 survey of phase3 files, which is very cumbersome, and we would like a better approach.
b1list = get_bucket(bucket = '1000genomes', prefix="phase3/data/", max=10000) st = sapply(b1list, "[[", "Key") pa = grep("\\.mapped.*exome.*bam$", st, value=TRUE)
With this code chunk we find 181 exome sequence files. They are saved in exomeBam_181.rda in the data part of this package.
data(exomeBam_181) library(erma) orm = range(genemodel("ORMDL3")) seqlevelsStyle(orm) = "NCBI" # match the BAM library(Rsamtools) mypar = ScanBamParam(which=orm, what=c("seq", "qual")) al1 = readGAlignments(exomeBam181[1], param=mypar) al1
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.