knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(BiocStyle)
This vignette reviews basic aspects of interacting with omicidx to obtain metadata about metagenomics studies available in NCBI SRA.
In Nov 2020 we added some material at the end about the API for surveying nextflow outputs.
To get things rolling, let's assume we know the identifier for an SRA study of interest: SRP114847. How we discover identifiers for studies or families of studies will be discussed later.
We can obtain a rectangular table of information about SRP114847 by querying a specific endpoint of the omicidx API. We'll use the "runs for study" endpoint and then discuss why and how afterwards.
library(tibble) metadata=as_tibble(jsonlite::fromJSON(txt= url('https://api.omicidx.cancerdatasci.org/sra/studies/SRP114847/runs?size=500'), flatten=TRUE)$hits) dim(metadata)
So there are 219 rows returned. We can visit the study level page at SRA to find:
The mention of 219 runs suggests that the API did its job correctly.
On the basis of this query, the metadata
table could have
no more than 500 rows. What are the fields returned by our query?
colnames(metadata)
There is some structure in the field names. There are four families of fields, base-level, study-level, experiment-level, and sample-level. The study/experiment/sample hierarchy reflects the fundamental data model of NCBI SRA.
It is somewhat curious that the field names do not include the token run
.
It seems reasonable to say that 'run' is a synonym for 'experiment', but the
relationship is not always one-to-one. In this case,
we see that each row returned has a unique experiment accession
number:
length(unique(metadata$experiment.accession))
A challenging aspect of SRA metadata is the diversity of sample attributes
recorded, which varies from study to study. The metadata
sample.attributes
field is nested within the metadata table, and consists of a list of
tag-value pairs. As examples:
metadata$sample.attributes[1:2]
This is somewhat cumbersome but it allows for diversity in metadata capture from sample to sample.
To transform the sample.attributes
to a table with one
row per run, tidyr pivoting is very efficient:
sampatts = metadata %>% dplyr::select(accession, sample.attributes) %>% tidyr::unnest(sample.attributes) %>% tidyr::pivot_wider(names_from='tag', values_from='value') DT::datatable(sampatts)
With this interactive table we can quickly see that there is diversity
in geographic location (sort on lat_lon
), and diet (sort on host_diet
),
but not on env_material
(sort on it).
So, with very little code, we've learned a lot about SRP114847. Now let's learn more about the API.
The OmicIDX system is developed using OpenAPI 3.0. It is possible to learn about the features of OmicIDX by parsing a JSON document.
ref = jsonlite::fromJSON("https://api.omicidx.cancerdatasci.org/openapi.json") names(ref)
The paths
element is really useful for programming with the API.
names(ref$paths)
It has its own substructure, which you can explore using
an expandable tree -- try listviewer::jsonedit(ref)
.
For now we'll focus on the endpoint we used.
ourendpt = "/sra/studies/{accession}/experiments" as.data.frame(cbind(qualname=ref$paths[[ourendpt]]$get$parameters$name, ref$paths[[ourendpt]]$get$parameters$schema)) # wanted datatable but threw warning
We found that refs$paths
has lots of information that our software
can use to figure out what it can do with the OmicIDX API. Here we'll add
a lucene query for the sra/studies endpoint. We'll modify the default
number of hits to retrieve, which is 10, specifying that we want up to 15 records.
We'll use q=microbiome%20AND%20cancer
as a Lucene query.
mcq = jsonlite::fromJSON( "https://api.omicidx.cancerdatasci.org/sra/studies?q=microbiome%20AND%20cancer&size=15", flatten=TRUE) as_tibble(mcq$hits)
sra/studies
endpointLet's use the refs
entity computed above to see what
we can request using the sra/studies
endpoint.
ourendpt = "/sra/studies" as.data.frame(cbind(qualname=ref$paths[[ourendpt]]$get$parameters$name, ref$paths[[ourendpt]]$get$parameters$schema)) # wanted datatable but threw warning
The q
parameter is used to specify queries in the Lucene query format. We'll
return to that later.
Consider the content of the
inst/curated folder
of the curatedMetagenomicDataCuration repository. This is a collection of
tsv files with sample attributes and NCBI_accession
numbers for the samples.
Each tsv file has a name in format [tag]_[year]
where tag
is often an author name.
The first folder is called AsnicarF_2017
. There are two files: the AsnicarF_2017_metadata.tsv
and a python program. Other folders may have only the tsv file.
Let's get information on a specific sample in this study. We read from the tsv file in github the SRR number of the first row: SRR4052021. This is a token that can be used with the /sra/runs endpoint of OmicIDX.
pref = "https://api.omicidx.cancerdatasci.org/" do_one = "sra/runs/SRR4052021" lk1 = jsonlite::fromJSON(paste0(pref, do_one)) names(lk1)
So there is a straightforward way to get detailed metadata for an SRR. Massaging the output here, to get that which is useful for metagenomic analysis, is an exercise.
We encapsulated code we used above to get
metadata on the AsnicarF_2017
study in the
function samp_atts_by_study
, that we'll show
and demonstrate here. We determined the
SRP number from the call above.
suppressPackageStartupMessages({ library(md4mg) }) samp_atts_by_study sa_asnicar = samp_atts_by_study() dim(sa_asnicar) DT::datatable(sa_asnicar)
One could say that this is the uncurated sample metadata for this study. Of some concern is the fact that there are 28 records here but only 25 records in the tsv file.
This is very basic exploration of results of the updated
workflow. Note that the first 10 records are different
from the remainder: the first 10 have metadata_tsv entries and
no trace$script
entries.
library(httr) base_url = function() "https://cmgd-telemetry-whnnxetv4q-uc.a.run.app" nf_metadata_names = function(url=base_url(), limit=5, offset=0) { x = GET(sprintf(paste(url, "/nextflow/events?limit=%s&offset=%s", sep=""), limit, offset)) sapply(content(x)$hits, function(z) z$metadata$parameters$metadata_tsv) } nf_metadata_wf_cmdline = function(url=base_url(), limit=5, offset=0) { x = GET(sprintf(paste(url, "/nextflow/events?limit=%s&offset=%s", sep=""), limit, offset)) sapply(content(x)$hits, function(z) z$metadata$workflow$commandLine) } nf_trscript = function(url=base_url(), limit=5, offset=0) { x = GET(sprintf(paste(url, "/nextflow/events?limit=%s&offset=%s", sep=""), limit, offset)) sapply(content(x)$hits, function(z) z$trace$script) } nf_accessions = function(url=base_url(), limit=5, offset=0) { tmp = nf_trscript(url=url, limit=limit, offset=offset) gsub(".*accessions: (.*\\]).*", "\\1", tmp) } nf_rowhash = function(url=base_url(), limit=5, offset=0) { tmp = nf_trscript(url=url, limit=limit, offset=offset) gsub(".*rowhash: (.*)....sampleinfo.txt.*", "\\1", tmp) } nf_metadata_names(limit=5) nf_accessions(limit=5, offset=100) nf_rowhash(limit=5, offset=100)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.