library(knitr)
opts_chunk$set(cache=TRUE)

Linking dbGaPdb with SRAdb

SRAdb allows for efficient searching through metadata associated with files deposited to the Sequence Read Archives (SRA). Some dbGaP studies contain genomic data that are found in SRA. To access the metadata of dpGaP files in SRA, download SRAdb and search the 'sra' table for any rows that contain accessions starting with 'phs' found in the 'study_alias' column.

Load libraries

library(SRAdb)
library(tidyverse)
library(stringr)
library(dbGaPdb)

Download and load SRAdb

(Make sure you have >30GB of free drive space!)

if(!file.exists('~/SRAmetadb.sqlite')) sqlfile <<- getSRAdbFile(destdir='~/',destfile='SRAmetadb.sqlite.gz')
sqlfile = '~/SRAmetadb.sqlite'
sra_con <- dbConnect(SQLite(), sqlfile)

Load in dbGaPdb metadata as described in the vignette introduction

dbGaPdb_path <- pull_dbGaPdb_sqlite(local_path = '~/')
dbGaPdb <- src_sqlite(dbGaPdb_path)

Query SRAdb for any files containing references to dbGaP

sra_in_dbGaP <- dbGetQuery(sra_con, "select * from study where study_alias like 'phs%'")
dim(dbGaP_SRA)

Process sra_in_dbGaP and dbGaP metadata to make join-able

sra_in_dbGaP <- sra_in_dbGaP %>% 
              rowwise() %>% 
              mutate(study_alias2 = str_split(study_alias, '_')[[1]][1])

study_info <- tbl(dbGaPdb, 'study_info')
dfstudy_info <- study_info %>% data.frame()

dfstudy_info <- dfstudy_info %>% 
    rowwise() %>% 
    mutate(study_alias = str_split(this_study_accession, '\\.')[[1]][1])

Study accession codes found in the dpGaPdp metadata include study version information (e.g., phs000514.v1). However, SRAdb study accession codes do not follow the same nomenclature (e.g., phs000307_49). The above lines of code trim the inconsistency and generate a new column that holds consistent values between the two databases.

SRAdb_dbGaP <- left_join(dfstudy_info, sra_in_dbGaP, by = c('study_alias' = 'study_alias2'))

After the SRAdb and the dbGaP metadata are linked, we are able to construct queries on variables across the two databases.

SRAdb_dbGaP %>% unique() %>% group_by(disease, study_alias, ) %>% summarise(count=n())

Example 1

How many studies with emphysema information have RNA-seq data?

rna_studies <- sra_in_dbGaP %>% 
  filter(grepl('RNA', library_strategy, ignore.case=T)) %>% 
  pull(study_alias2) %>% 
  unique()

emphysema_studies <- tables$study_variable_info %>% 
  filter(grepl('emphysema', description)) %>% 
  pull(study_alias) %>% 
  unique()

emphysema_studies %in% rna_studies %>% table()

The example code first subsets the sra_in_dbGap object by library strategy. The dbGap metadata are then filtered by the keyword emphysema in the description column. The output is a table of all emphysema related RNA-seq files found in SRAdb that were generated by a dbGaP documented study.

#print the output (not possible right now since the DBs are local files on my laptop)
emphysema_studies %in% rna_studies %>% table()

Example 2

How many cancer studies conatain more than 50 males and 1000 females in the cohort?

interesting_studies <- tables$study_variable_info %>% 
  select(study_accession, variable_accession, male_count, female_count, description) %>%
  filter(grepl('cancer', description, ignore.case=T)) %>% 
  filter(as.numeric(male_count) > 50, as.numeric(female_count) > 1000) %>% 
  rowwise() %>% 
  mutate(study_accession2 = str_split(study_accession, '\\.')[[1]][1]) %>% 
  group_by(study_accession2) %>% 
  summarise(Count = n()) %>% 
  pull(study_accession2)
print('')
interesting_studies

Here, the dbGaP metadata are filtered by a keyword in the description column ('cancer'). The studies are further filtered by a numeric value in the male_count and female_count columns.



dbGaPdb/dbGaPdb documentation built on May 24, 2019, 2:04 a.m.