knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
last edit: 10/7/2019
SQL database are very commonly used in the storage of very large
genomic data resources. Many useful tools, such as DBI, dbplyr
have provided convenient interfaces for R users to check and
manipulate the data. These tools represent the SQL tables in tidy
formats and support lazy and quick aggregation operations (e.g,
union, etc.) for tables from same resources. Cross
database aggregation is also supported when opted (using
but become very expensive due to the internal copying process of a
whole table into the other connection. Use of advanced functions often
involves specialized SQL knowledge which brings challenges for common
R users. The interoperability of existing bioinformatics tools are
suboptimal, e.g., the SummarizedExperiment container for
representation of sequencing or genotyping experiments that many
modern bioinformatics pipelines are based.
The SQLDataFrame package was developed using familiar DataFrame-like
paradigm and lazily represents the very large dataset from different
SQL databases, such as SQLite and MySQL. The DataFrame-like interface
provides familiarity for common R users in easy data manipulations
such as square bracket subsetting, rbinding, etc. For modern R
users, it also recognizes the
tidy data analysis and dplyr
grammar by supporting
mutate, etc. More
importantly, database type-specific strategies were implemented in
SQLDataFrame to efficiently handle the cross-database operations
without incurring any internally expensive processes (especially for
database with write permission). Some previously difficult data
operations are made quick and easy in R, such as cross-database ID
matching and conversion, variant annotation extraction, etc. The
scalability and interoperability of SQLDataFrame are expected to
significantly promote the handling of very large genomic data
resources and facilitating the overall bioinformatics analysis.
Currently SQLDataFrame supports the DBI backend of SQLite, MySQL and
Google BigQuery, which are most commonly used SQL-based databases. In
the future or upon feature request, we would implement this package so
that users could choose to use different database backend for
Here is a list of commonly used backends (bolded are already supported!):
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SQLDataFrame")
The development version is also available to download from Github.
There are two ways to construct a
1) Provide an argument of
conn is a
valid DBIConnection from SQLite or MySQL;
dbtable specifies the
database table name that is going to be represented as
object. If only one table is available in the specified database name,
this argument could be left blank. The
dbkey argument is used to
specify the column name in the table which could uniquely identify all
the data observations (rows).
dbkey as specified above, and credentials to
build valid DBIConnections. for SQLite, the credential argument
dbname. For MySQL, the credential arguments are
password. Additional to the credentials, users must provide
type argument to specify the SQL database type. Supported types
are "SQLite" and "MySQL". If not specified, "SQLite" is used by
default. Supported database tables could be on-disk or remote on the
web or cloud.
dbfile <- system.file("extdata/test.db", package = "SQLDataFrame") conn <- DBI::dbConnect(DBI::dbDriver("SQLite"), dbname = dbfile) obj <- SQLDataFrame(conn = conn, dbtable = "state", dbkey = "state")
construction from database credentials:
obj1 <- SQLDataFrame(dbname = dbfile, type = "SQLite", dbtable = "state", dbkey = "state") all.equal(obj, obj1)
Note that after reading the database table into
key columns will be kept as fixed columns showing on the left hand
| separating key column(s) with the other columns. The
colnames, and corresponding column subsetting will only
apply to the non-key-columns.
obj dim(obj) colnames(obj)
To make the
SQLDataFrame object as light and compact as possible,
there are only 5 slots contained in the object:
indexes. Metadata information could be
returned through these 5 slots using slot accessors or other utility
slotNames(obj) dbtable(obj) dbkey(obj) connSQLDataFrame(obj)
Besides, many useful common methods are defined on
object to make it a more DataFrame-like data structure. e.g., we can
dimnames() to return the row/colnames of the data. It returns an
unnamed list, with the first element being rownames which is always
NULL, and 2nd element being colnames (could also use
dim() method is defined to return the dimension of the
database table, which enables the
nrow()/ncol() to extract a
length() method is also defined which works same
Note that the
rownames(SQLDataFrame) would always be
rownames are not supported in
was implemented for the
[ subsetting with characters.
dim(obj) dimnames(obj) length(obj) ROWNAMES(obj)
NOTE that the
dbtable() accessor only works for a
object that the lazy tbl carried in
tblData slot corresponds to a
single database. If the
SQLDataFrame was generated from
saveSQLDataFrame() to save the lazy tbl to
disk so that
dbtable() will be activated.
dbtable(obj) aa <- rbind(obj[1:5, ], obj[6:10, ]) aa dbtable(aa) ## message bb <- saveSQLDataFrame(aa, dbname = tempfile(fileext=".db"), dbtable = "aa") connSQLDataFrame(bb) dbtable(bb)
We could also construct a
SQLDataFrame object directly from a file
makeSQLDataFrame function takes input of character value
of file name for common text files (.csv, .txt, etc.), write into
database tables, and open as
SQLDataFrame object. Users could
provide values for the
dbtable argument. If NULL,
default value for
dbname would be a temporary database file, and
dbtable would be the
basename(filename) without extension.
NOTE that the input file must have one or multiple columns that
could uniquely identify each observation (row) to be used the
SQLDataFrame. Also the file must be rectangular, i.e.,
rownames are not accepted. But users could save rownames as a separate
mtc <- tibble::rownames_to_column(mtcars)[,1:6] filename <- file.path(tempdir(), "mtc.csv") write.csv(mtc, file= filename, row.names = FALSE) aa <- makeSQLDataFrame(filename, dbkey = "rowname", sep = ",", overwrite = TRUE) aa connSQLDataFrame(aa) dbtable(aa)
With all the methods (
provided in the next section, the
SQLDataFrame always work like a
lazy representation until users explicitly call the
function for realization.
saveSQLDataFrame write the lazy tbl
tblData slot into an on-disk database table, and re-open
SQLDataFrame object from the new path.
It's also recommended that users call
saveSQLDataFrame frequently to
avoid too many lazy layers which slows down the data processing.
connSQLDataFrame(obj) dbtable(obj) obj1 <- saveSQLDataFrame(obj, dbname = tempfile(fileext = ".db"), dbtable = "obj_copy") connSQLDataFrame(obj1) dbtable(obj1)
[[,SQLDataFrame Behaves similarly to
[[,DataFrame and returns a
realized vector of values from a single column.
also defined to conveniently extract column values.
head(obj[]) head(obj[["region"]]) head(obj$size)
We can also get the key column values using character extraction.
SQLDataFrame instances can be subsetted in a similar way of
DataFrame following the usual R conventions, with numeric,
character or logical vectors; logical vectors are recycled to the
drop=FALSE explicitly for single column subsetting if
you want to return a
SQLDataFrame object, otherwise, the default
drop=TRUE would always return a realized value for that column.
obj[1:3, 1:2] obj[c(TRUE, FALSE), c(TRUE, FALSE), drop=FALSE] obj[1:3, "population", drop=FALSE] obj[, "population"] ## realized column value
Subsetting with character vector works for the
objects. With composite keys, users need to concatenate the key values
: for row subsetting (See the vignette for internal
implementation for more details).
rnms <- ROWNAMES(obj) obj[c("Alabama", "Colorado"), ]
obj1 <- SQLDataFrame(conn = conn, dbtable = "state", dbkey = c("region", "population")) rnms <- ROWNAMES(obj1) obj1[c("South:3615.0", "West:365.0"), ]
List style subsetting is also allowed to extract certain columns from
SQLDataFrame object which returns
SQLDataFrame by default.
We have also enabled the S3 methods of
dplyr package, so that users could have the convenience in filtering
data observations and adding new columns.
obj1 %>% filter(division == "South Atlantic" & size == "medium")
obj1 %>% mutate(p1 = population/10, s1 = size)
To be consistent with
rbind methods were
union returns the
SQLDataFrame sorted by the
rbind keeps the
original orders of input objects.
dbfile1 <- system.file("extdata/test.db", package = "SQLDataFrame") con1 <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile1) dbfile2 <- system.file("extdata/test1.db", package = "SQLDataFrame") con2 <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile2) ss1 <- SQLDataFrame(conn = con1, dbtable = "state", dbkey = c("state")) ss2 <- SQLDataFrame(conn = con2, dbtable = "state1", dbkey = c("state")) ss11 <- ss1[sample(5), ] ss21 <- ss2[sample(10, 5), ]
obj1 <- union(ss11, ss21) obj1 ## reordered by the "dbkey()"
obj2 <- rbind(ss11, ss21) obj2 ## keeping the original order by updating the row index
*_join family methods was implemented for
objects, including the
anti_join, which provides the capability of merging database files
from different sources.
ss12 <- ss1[1:10, 1:2] ss22 <- ss2[6:15, 3:4] left_join(ss12, ss22) inner_join(ss12, ss22) semi_join(ss12, ss22) anti_join(ss12, ss22)
SQLDataFrame now supports the MySQL database tables through RMySQL,
for local MySQL servers, or remote ones on the web or cloud. The
saving are all supported. Aggregation operations are supported for
same or cross MySQL databases. Details please see the function
Here I'll show a simple use case for MySQL tables from ensembl.
library(RMySQL) ensbConn <- dbConnect(dbDriver("MySQL"), host="genome-mysql.soe.ucsc.edu", user = "genome", dbname = "xenTro9") enssdf <- SQLDataFrame(conn = ensbConn, dbtable = "xenoRefGene", dbkey = c("name", "txStart")) enssdf1 <- enssdf[1:20, 1:2] enssdf2 <- enssdf[11:30,3:4] res <- left_join(enssdf1, enssdf2)
SQLDataFrame has just added support for Google BigQuery
tables. Construction and queries using
filter are supported!
Also note that, the support of BigQuery tables has implemented
specialized strategy for efficient data representation. The
is assigned by default as
dbkey argument will be
ignored during construction.
library(bigrquery) bigrquery::bq_auth() ## use this to authorize bigrquery in the ## browser. bqConn <- DBI::dbConnect(dbDriver("bigquery"), project = "bigquery-public-data", dataset = "human_variant_annotation", billing = "") ## if not previous provided ## authorization, must specify a ## project name that was already ## linked with Google Cloud with ## billing info. sdf <- SQLDataFrame(conn = bqConn, dbtable = "ncbi_clinvar_hg38_20180701") sdf[1:5, 1:5] sdf %>% select(GENEINFO) sdf %>% filter(GENEINFO == "PYGL:5836") sdf %>% filter(reference_name == "21")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.