library(knitr) opts_chunk$set(cache=TRUE)
SRAdb allows for efficient searching through metadata associated with files deposited to the Sequence Read Archives (SRA). Some dbGaP studies contain genomic data that are found in SRA. To access the metadata of dpGaP files in SRA, download SRAdb and search the 'sra' table for any rows that contain accessions starting with 'phs' found in the 'study_alias' column.
library(SRAdb) library(tidyverse) library(stringr) library(dbGaPdb)
(Make sure you have >30GB of free drive space!)
if(!file.exists('~/SRAmetadb.sqlite')) sqlfile <<- getSRAdbFile(destdir='~/',destfile='SRAmetadb.sqlite.gz') sqlfile = '~/SRAmetadb.sqlite' sra_con <- dbConnect(SQLite(), sqlfile)
dbGaPdb_path <- pull_dbGaPdb_sqlite(local_path = '~/') dbGaPdb <- src_sqlite(dbGaPdb_path)
sra_in_dbGaP <- dbGetQuery(sra_con, "select * from study where study_alias like 'phs%'") dim(dbGaP_SRA)
sra_in_dbGaP
and dbGaP metadata to make join-ablesra_in_dbGaP <- sra_in_dbGaP %>% rowwise() %>% mutate(study_alias2 = str_split(study_alias, '_')[[1]][1]) study_info <- tbl(dbGaPdb, 'study_info') dfstudy_info <- study_info %>% data.frame() dfstudy_info <- dfstudy_info %>% rowwise() %>% mutate(study_alias = str_split(this_study_accession, '\\.')[[1]][1])
Study accession codes found in the dpGaPdp metadata include study version information (e.g., phs000514.v1
). However, SRAdb study accession codes do not follow the same nomenclature (e.g., phs000307_49
). The above lines of code trim the inconsistency and generate a new column that holds consistent values between the two databases.
SRAdb_dbGaP <- left_join(dfstudy_info, sra_in_dbGaP, by = c('study_alias' = 'study_alias2'))
After the SRAdb and the dbGaP metadata are linked, we are able to construct queries on variables across the two databases.
SRAdb_dbGaP %>% unique() %>% group_by(disease, study_alias, ) %>% summarise(count=n())
rna_studies <- sra_in_dbGaP %>% filter(grepl('RNA', library_strategy, ignore.case=T)) %>% pull(study_alias2) %>% unique() emphysema_studies <- tables$study_variable_info %>% filter(grepl('emphysema', description)) %>% pull(study_alias) %>% unique() emphysema_studies %in% rna_studies %>% table()
The example code first subsets the sra_in_dbGap
object by library strategy. The dbGap metadata are then filtered by the keyword emphysema
in the description column. The output is a table of all emphysema related RNA-seq files found in SRAdb that were generated by a dbGaP documented study.
#print the output (not possible right now since the DBs are local files on my laptop) emphysema_studies %in% rna_studies %>% table()
interesting_studies <- tables$study_variable_info %>% select(study_accession, variable_accession, male_count, female_count, description) %>% filter(grepl('cancer', description, ignore.case=T)) %>% filter(as.numeric(male_count) > 50, as.numeric(female_count) > 1000) %>% rowwise() %>% mutate(study_accession2 = str_split(study_accession, '\\.')[[1]][1]) %>% group_by(study_accession2) %>% summarise(Count = n()) %>% pull(study_accession2) print('') interesting_studies
Here, the dbGaP metadata are filtered by a keyword in the description column ('cancer'). The studies are further filtered by a numeric value in the male_count
and female_count
columns.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.