Introduction

This package explores how to use metadata, large-scale curation of RNA-seq quantifications, and a scalable data service to explore processed transcriptomic data with AnVIL.

SRA metadata service: omicidx

NCBI Sequence Read Archive (SRA) collects outcomes of NIH-funded sequencing experiments. Metadata about SRA assets are available "in the cloud", but interrogation of these assets involves commercial database operations and costs are not readily predicted. See, for example, these comments on AWS Athena pricing.

Dr. Sean Davis of the University of Colorado has established a process for harvesting SRA metadata. The Omicidx API can be used freely to obtain detailed metadata at various levels: studies, experiments, runs.

recount3

From the recount3 project:

Recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project and part of recount.bio.

The raw sequencing data were processed with the Monorail system as described in the recount3 paper which created the coverage bigWig files and the recount-unified text files. While these raw output files are available through IDIES SciServer, for ease of statistical analysis, we provide through the recount3 R/Bioconductor package an interface that builds RangedSummarizedExperiment R objects for gene, exon, and exon-exon junction counts for each study. Furthermore, snapcount enables query-based access of the recount3 and recount2 data. The coverage bigWig files can be used for annotation-agnostic expression analyses using for example megadepth, derfinder and other tools.

HDF Highly Scalable Data Service

From the HDF Group description of HSDS:

HSDS is a REST-based solution for reading and writing complex binary data formats within object-based storage environments such as the Cloud. Developed to make large datasets accessible in a manner that’s both fast and cost-effective, HSDS stores HDF5 file using a sharded data schema, but provides the functionality traditionally offered by the HDF5 library as accessible by any HTTP client. HSDS is open source software, licensed under the Apache License 2.0. Managed HSDS products, support, and consulting is offered through HDF Group’s Kita Data Products & Solutions.

Using the omicidx metadata service

The omicidx metadata service has a REST API that can process queries in lucene syntax. An R client for the API was produced by Samuel Gamboa-Tuz, using the OpenAPI generator.

suppressPackageStartupMessages({
library(SraInAnVIL)
library(DT)
})

We set up a query, retrieve the 'hits' component, and produce an interactive table. Some columns have substructure and are dropped from the searchable display.

tst1 = query_metadata(
   'cancer AND library_strategy:"RNA-Seq" AND sample.organism:"Homo sapiens"',
    component="ExperimentsSraExperimentsGet", size=100)
tab1 = do.call(rbind, tst1$hits)
datatable(dplyr::select(tab1, !c(library_construction_protocol,
   identifiers,attributes,study,xrefs,sample)))

Here are illustrations of substructure in the first hit:

t(tab1[1, "sample"])
t(tab1[1,"sample"])["attributes",][[1]][1:3,]

Using recount3

Recount3 resources can be used in a number of ways. The fundamental data resource is the collection of BigWig files with coverage vectors. This can be regarded as "annotation-free" data on RNA abundance. Later developments of this package will explore this.

Read counts are summarized at the gene level after alignment using STAR and summarization using megadepth. These counts are made available at the SRA project level using create_rse.

In the following we select a study and produce a SummarizedExperiment.

tab1[100,]$study[,c("title", "accession")]
suppressPackageStartupMessages({
library(recount3)
})
suppressMessages({
aproj = available_projects()
})
sum(aproj$n_samples)
table(aproj$organism)
erpind = grep("ERP013700", aproj[,1])
suppressMessages({
ovse = create_rse(aproj[erpind,])
ovse
})

We extract the count data for YY1 gene and examine values over samples.

yy1ind_recount = which(rowData(ovse)$gene_name == "YY1")
plot(yyvec_reco <- as.numeric(assay(ovse[yy1ind_recount,])))
names(yyvec_reco) = ovse$sra.experiment_acc
head(yyvec_reco)

Using HumanTranscriptomeCompendium

Dr Sean Davis of U Colorado Anschuetz School of Medicine devised a cloud-based workflow for producing salmon-based quantifications of all RNA-seq experiments in SRA up to 2018. A subset of gene-level quantifications were transformed to HDF5 and imported to the HDF Scalable Data Service with the assistance of John Readey of the HDF Group.

This system gives us a view of 181134 RNA-seq experiments in a unified RangedSummarizedExperiment structure. The SummarizedExperiment is produced with htx_load(), which can be run in any R session with an internet connection; the "back end" is in the AWS cloud.

suppressPackageStartupMessages({
library(HumanTranscriptomeCompendium)
library(rhdf5client)
library(Matrix)
library(DelayedArray)
})
hco = addRD(htx_load())
options(digits=3)
assay(hco)

NCBI Study Accession numbers can be used to pull out samples of interest.

hcoov = hco[ , which(hco$study_accession == "ERP013700")]
yy1ind_htx = which(rowData(hcoov)$gene_name=="YY1")
yyvec_hco = as.numeric(assay(hcoov[yy1ind_htx,]))
names(yyvec_hco) = colnames(hcoov)
head(yyvec_hco)

We compare the salmon-based quantifications in HSDS to the STAR/megadepth quantifications from recount3:

okdat = intersect(names(yyvec_hco), names(yyvec_reco))
plot(yyvec_hco[okdat], yyvec_reco[okdat])


vjcitn/SraInAnVIL documentation built on Dec. 23, 2021, 4:05 p.m.