library(knitr)
opts_chunk$set(cache=TRUE)

Load libraries

library(dbGaPdb)
library(RSQLite)
library(dplyr)
library(dbplyr)
library(stringr)

Downloading the dbGaPdb sqlite file and connecting to it via RSQLite

We strongly suggest you provide a path to the pull_dbGaPdb_sqlite() function so you know exactly where the sqlite file is. The code below is taken from the quick start in the README.

dbGaPdb_file = 'dbGaPdb.sqlite'
if(!file.exists(dbGaPdb_file))
  dbGaPdb_file <- pull_dbGaPdb_sqlite(local_path='~/')
dbGaPdb <- src_sqlite(dbGaPdb_file)
# If you've already downloaded the sqlite file
## dbGaPdb <- src_sqlite('~/PATH/your_dbGaP.sqlite')

Bits

The dbGaPdb metadata is stored as a sqlite file. It is updated each week, so if you want the latest metadata, re-run pull_dbGaPdb_sqlite(). Otherwise you can use your downloaded sqlite file indefinitely. As of 2017-08-16, the sqlite file is ~108 MB compressed and ~547 MB uncompressed. It must be uncompressed before use.
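
A minimal sketch of refreshing the metadata, assuming pull_dbGaPdb_sqlite() will re-download the weekly snapshot to local_path (check its help page for how it handles an existing file):

# re-download the latest weekly snapshot and reconnect
dbGaPdb_file <- pull_dbGaPdb_sqlite(local_path = '~/')
dbGaPdb <- src_sqlite(dbGaPdb_file)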

What tables are available?

dbGaPdb
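
If you prefer a plain character vector of table names, src_tbls() from dplyr should also work on this connection (a sketch, not part of the original quick start):

# table names as a character vector
src_tbls(dbGaPdb)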

dbplyr

We will be using dplyr via dbplyr to query the SQL database. This allows for, in our opinion, a gentler introduction to querying sqlite for people who don't have previous experience writing SQL queries. A good tutorial on dplyr and SQL is located here.
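
As a quick illustration of this translation, show_query() from dbplyr prints the SQL that a dplyr pipeline will generate before any data is pulled into R; a small sketch using the study_info table:

tbl(dbGaPdb, "study_info") %>% 
  select(root_study_accession, disease) %>% 
  show_query()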

SQL experts

Use this convention to run SQL queries directly on the sqlite database

tbl(dbGaPdb, sql('SELECT study_accession, dataset_accession, short_description FROM study_dataset_info LIMIT 5' ))
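
Any valid SQLite statement should work the same way; for example, a simple row count of the same table (a sketch, assuming you only need the count):

tbl(dbGaPdb, sql('SELECT COUNT(*) AS n_rows FROM study_dataset_info'))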

Link to two of the tables

study_info_SQL <- tbl(dbGaPdb, "study_info")
study_variable_info_SQL <- tbl(dbGaPdb, "study_variable_info")

Peek

study_info_SQL %>% head()
study_variable_info_SQL %>% head()

Filter to find the number of variables with more than 10,000 females

study_variable_info_SQL %>% 
  filter(as.numeric(female_count) > 10000)

Filter and group by study to find the number of studies that have any variables with > 10,000 females

study_variable_info_SQL %>% 
  filter(as.numeric(female_count) > 10000) %>% 
  group_by(study_accession) %>% 
  summarise(`Number of variables with >10,000 females`=n())

Aggregate by version-less studies

You may have noticed that there are many versions of the same study (e.g. phs000200.v4 and phs000200.v5). If you want to aggregate at the study level, we'll need to remove the .vNUM part with a mutate(). To do this operation we have to convert the data to a data frame so we can run rowwise() and mutate(). Appending nrow() to the result then gives the number of unique studies, as shown after the pipeline below.

study_variable_info_SQL %>% 
  filter(as.numeric(female_count) > 10000) %>% 
  data.frame() %>% 
  rowwise() %>% 
  mutate(study_accession2 = str_split(study_accession, '\\.')[[1]][1]) %>% 
  group_by(study_accession2) %>% 
  summarise(`Number of variables with >10,000 females`=n())
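
To get just the count of unique studies, append nrow() to the same pipeline:

study_variable_info_SQL %>% 
  filter(as.numeric(female_count) > 10000) %>% 
  data.frame() %>% 
  rowwise() %>% 
  mutate(study_accession2 = str_split(study_accession, '\\.')[[1]][1]) %>% 
  group_by(study_accession2) %>% 
  summarise(`Number of variables with >10,000 females`=n()) %>% 
  nrow()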

What diseases are studied across the studies?

tbl(dbGaPdb, "study_info") %>% 
  select(root_study_accession, disease) %>% 
  data.frame() %>% 
  rowwise() %>% 
  mutate(study_accession2 = str_split(root_study_accession, '\\.')[[1]][1]) %>% 
  select(study_accession2, disease) %>% 
  unique()

Loading whole tables into memory as an R data frame

You cannot run the full gamut of dplyr or base R functions on a tbl() backed by SQLite, as several (e.g. tail(), sample_n()) require the data to be in a data frame (in memory). As the database isn't too huge, most computers can handle having the tables loaded as data frames. After that you have a data frame and can run any base R or dplyr operations.

# study_variable_info <- study_variable_info_SQL %>% data.frame()
study_variable_info_SQL %>% 
  select(study_accession, description) %>% 
  collect() %>% 
  sample_n(10)
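
A minimal sketch of pulling a whole table into memory and then using in-memory-only functions such as tail() (the full study_variable_info table is fairly large, so this may take a moment and a noticeable amount of RAM):

# pull the entire table into memory as a data frame
study_variable_info <- study_variable_info_SQL %>% collect()
tail(study_variable_info)
dim(study_variable_info)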

