The scidb.dplyr R package defines a SciDB interface to the dplyr way to work with data backed by SciDB arrays (http://paradigm4.com). The package is still experimental, please contribute!
The package style follows the dplyr data frame implementation more closely than those of SQL-based external database back ends. We made this choice because SciDB does not present a SQL API.
install.packages("scidb")).Best performance is obtained with the upcoming 17.x SciDB release.
devtools::install_github("paradigm4/scidb.dplyr")
library(scidb.dplyr)
db = scidbconnect()
x  = as.scidb(db, iris)    # a 'scidb' object
d  = tbl(x)                # a 'tbl_scidb' object (dplyr)
select(d, Petal_Length, Species) %>% filter(Petal_Length < 1.4) %>% as.data.frame %>% head
##    i Petal_Length Species
## 1  3          1.3  setosa
## 2 14          1.1  setosa
## 3 15          1.2  setosa
## 4 17          1.3  setosa
## 5 23          1.0  setosa
## 6 36          1.2  setosa
Note the default includsion of SciDB array coordinate values above (the 'i'
variable in the output).  Use the optional only_attributes=TRUE option to the
collect() and as.data.frame() functions to only include data frame
variables (SciDB attributes) in the output.
d %>% group_by(Species) %>% summarise(mean=mean(Petal_Length)) %>% collect
## # A tibble: 3 × 4
##   instance_id value_no    Species  mean
##         <dbl>    <dbl>      <chr> <dbl>
## 1           0        0     setosa 1.462
## 2           0        1  virginica 5.552
## 3           0        2 versicolor 4.260
Now without the SciDB array coordinate values...
d %>% group_by(Species) %>% summarise(mean=mean(Petal_Length)) %>% collect(only_attributes=TRUE)
## # A tibble: 3 × 2
##      Species  mean
##        <chr> <dbl>
## 1     setosa 1.462
## 2  virginica 5.552
## 3 versicolor 4.260
d %>% group_by(Species) %>% summarise(mean=mean(Petal_Length)) %>% as.data.frame(only_attributes=TRUE)
##      Species  mean
## 1     setosa 1.462
## 2  virginica 5.552
## 3 versicolor 4.260
Which includes at least limited support for renaming SciDB array attributes as shown in the example below.
library(scidb.dplyr)
db <- scidbconnect()
x <- tbl(as.scidb(db, iris))
select(x, petal=starts_with("Pet"), sepal=starts_with("Sep")) %>% collect() %>% head()
## # A tibble: 6 × 5
##       i petal1 petal2 sepal1 sepal2
##   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1    1.4    0.2    5.1    3.5
## 2     2    1.4    0.2    4.9    3.0
## 3     3    1.3    0.2    4.7    3.2
## 4     4    1.5    0.2    4.6    3.1
## 5     5    1.4    0.2    5.0    3.6
## 6     6    1.7    0.4    5.4    3.9
as.data.frame collect compute dim
filter select summarise transmute group_by mutate slice
full_join inner_join left_join right_join
Plus experimental support for the select helper functions (?select_helpers).
See the https://github.com/Paradigm4/scidb.dplyr/blob/master/TODO file for a list of verbs not yet implemented (help/assistance is welcome)...
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.