```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The `connector` package provides a set of functions to connect to different data sources (such as databases, APIs, and file systems) and to read and write data from them. It is designed to be generic and extensible, so that new data sources can be added easily.

This vignette demonstrates how to use the `connector` package to connect to either a file system or a database and access different types of data.
First let's load all the packages we're going to use in this vignette:
```r
library(connector)
library(connector.databricks)
library(connector.sharepoint)
```
You may notice we're not only loading the `connector` package, but `connector.databricks` and `connector.sharepoint` as well. These packages extend the functionality of the `connector` package to connect to Databricks and SharePoint, respectively.
The main function in this package is `connect()`. Based on a configuration file or a list, it creates a `connectors()` object with a connector for each of the specified datasources. The configuration file can be in any format that can be read through `read_file()`, and it must contain a list. If a YAML file is provided, expressions are evaluated when parsing it using `yaml::read_yaml()` with `eval.expr = TRUE`.
The input list (or configuration file) has to have the following structure: an optional `metadata` block with named values that can be reused elsewhere in the file as `{metadata.<name>}` placeholders, and a `datasources` block with one entry per connection, each consisting of a `name` and a `backend` that specifies the connector `type` together with its backend-specific arguments.
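As a minimal sketch (distilled from the examples in this vignette, with illustrative values), that structure looks like this:

```yaml
metadata:                                 # optional, reusable values
  root_path: "/home/my_root_path"
datasources:                              # one entry per connection
  - name: "adam"                          # name used to access the connector
    backend:
      type: "connector::connector_fs"     # connector class to instantiate
      path: "{metadata.root_path}/adam"   # backend-specific arguments
```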
Here is an example of a configuration file with different types of connections:
```yaml
metadata:
  trial: "my_trial"
  project: "my_project"
  instance: "current"
  http_path: !!expr Sys.getenv("DATABRICKS_HTTP_PATH")
  catalog: !!expr Sys.getenv("DATABRICKS_CATALOG_NAME")
  sharepoint: "https://my_organisation.sharepoint.com"
  root_path: "/home/my_root_path"

datasources:
  - name: "adam"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.root_path}/{metadata.project}/{metadata.trial}/{metadata.instance}/adam"

  - name: "tfl"
    backend:
      type: "connector.databricks::connector_databricks_volume"
      catalog: "{metadata.catalog}"
      schema: "{metadata.project}_{metadata.trial}_tfl"
      path: "output"

  - name: "metadata"
    backend:
      type: "connector.databricks::connector_databricks_dbi"
      http_path: "{metadata.http_path}"
      catalog: "{metadata.catalog}"
      schema: "{metadata.project}_{metadata.trial}_metadata"

  - name: "output_sh"
    backend:
      type: "connector.sharepoint::connector_sharepoint"
      site_url: "{metadata.sharepoint}/sites/{metadata.project}_{metadata.trial}_output"
```
In this example we have connections to multiple data sources: ADaM data is stored on the file system, metadata resides in Databricks tables, and TFLs are also on Databricks, but in Volumes, because that is where unstructured data is stored. Finally, the output is stored on SharePoint.
When we want to connect to the datasources, we use the `connect()` function.
```r
# Load data connections
db <- connect()
```
This creates a list of connectors, which can be accessed by their names:
```r
# ADaM data
db$adam

# TFL data
db$tfl

# Metadata
db$metadata

# Output on SharePoint
db$output_sh
```
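The call above relies on the default configuration file. As a sketch (assuming, as described above, that `connect()` also accepts the path to a configuration file or an equivalent R list; the exact argument usage here is illustrative), the same connections could be set up explicitly:

```r
# Illustrative only: connect() is described as accepting a configuration
# file or a list, so the calls below are a sketch of that usage.

# From an explicit configuration file
db <- connect("_connector.yml")

# From an in-memory list with the same structure as the YAML file
config <- list(
  metadata = list(root_path = "/home/my_root_path"),
  datasources = list(
    list(
      name = "adam",
      backend = list(
        type = "connector::connector_fs",
        path = "{metadata.root_path}/adam"
      )
    )
  )
)
db <- connect(config)
```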
After the setup is done, we can use these connections to read and write data from the datasources. The `connector` packages provide a set of functions for reading and writing data, and they all have a similar interface, so it's easy to switch between backends.
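As a quick sketch of that shared interface (using only functions that appear later in this vignette), the same verbs are used no matter which backend sits behind a connector:

```r
# The same generic verbs work for every connector, regardless of backend.
db$adam |> write_cnt(iris, "iris.rds")   # write to the file-system connector
db$adam |> list_content_cnt()            # list what is stored there
db$adam |> read_cnt("iris")              # read it back
db$metadata |> write_cnt(iris, "iris")   # same verb against the Databricks table connector
```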
Now we will show how to read and write different types of data from and to different datasources. In these examples we will be using the `iris` and `mtcars` datasets.
Here is an example of writing data to the ADaM datasource:
```r
library(dplyr)

# Manipulate data

## Iris data
setosa <- iris |>
  filter(Species == "setosa")

mean_for_all_iris <- iris |>
  group_by(Species) |>
  summarise_all(list(mean, median, sd, min, max))

## Mtcars data
cars <- mtcars |>
  filter(mpg > 22)

mean_for_all_mtcars <- mtcars |>
  group_by(gear) |>
  summarise(
    across(
      everything(),
      list("mean" = mean, "median" = median, "sd" = sd, "min" = min, "max" = max),
      .names = "{.col}_{.fn}"
    )
  ) |>
  tidyr::pivot_longer(
    cols = -gear,
    names_to = c(".value", "stat"),
    names_sep = "_"
  )

## Store data
db$adam |> write_cnt(setosa, "setosa", overwrite = TRUE)
db$adam |> write_cnt(mean_for_all_iris, "mean_iris", overwrite = TRUE)
db$adam |> write_cnt(cars, "cars_mpg", overwrite = TRUE)
db$adam |> write_cnt(mean_for_all_mtcars, "mean_mtcars", overwrite = TRUE)
```
Now let's read the data back, manipulate it a bit, and write it to SharePoint and Databricks. This way we can save different types of data in different formats.
```r
library(gt)
library(tidyr)
library(ggplot2)

# List and load data
db$adam |> list_content_cnt()

table <- db$adam |> read_cnt("mean_mtcars")

gttable <- table |>
  gt(groupname_col = "gear")

## Save table using Databricks Volumes
db$tfl$write_cnt(gttable$`_data`, "tmeanallmtcars.csv")
db$tfl$write_cnt(gttable, "tmeanallmtcars.rds")

## Using Sharepoint
tmp_file <- tempfile(fileext = ".docx")
gtsave(gttable, tmp_file)
db$output_sh$upload_cnt(tmp_file, "tmeanallmtcars.docx")

# Manipulate data
setosa_fsetosa <- db$adam |>
  read_cnt("setosa") |>
  filter(Sepal.Length > 5)

fsetosa <- ggplot(setosa_fsetosa) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point()

## Store TFLs using Databricks Volumes
db$tfl$write_cnt(fsetosa$data, "fsetosa.csv")
db$tfl$write_cnt(fsetosa, "fsetosa.rds")

## Using Sharepoint
tmp_file <- tempfile(fileext = ".png")
ggsave(tmp_file, fsetosa)
db$output_sh$upload_cnt(tmp_file, "fsetosa.png")

## Using Databricks Volumes
db$tfl$upload_cnt(contents = tmp_file, file_path = "fsetosa.png")
```
```r
# Use a temporary directory as working directory for the example below
tmp <- withr::local_tempdir()
knitr::opts_knit$set(root.dir = tmp)
```
Here is an example anyone can run to see how the `connector` package works. We will use the configuration file provided below, which only uses the file system as the connection type for the ADaM and TFL data.
```r
'metadata:
  adam_path: !expr file.path(getwd(), "adam")
  tfl_path: !expr file.path(getwd(), "tfl")

datasources:
  - name: "adam"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.adam_path}"
  - name: "tfl"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.tfl_path}"
' |>
  writeLines("_connector.yml")
```
_connector.yml:
`r readLines("_connector.yml")`
Now, let's run the example:
```r
library(connector)
library(dplyr)
library(ggplot2)

# Let's create ADaM and TFL directories
dir.create("adam")
dir.create("tfl")

# Load data connections
db <- connect()

## Iris data
setosa <- iris |>
  filter(Species == "setosa")

mean_for_all_iris <- iris |>
  group_by(Species) |>
  summarise_all(list(mean, median, sd, min, max))

## Store data
db$adam |> write_cnt(setosa, "setosa.rds")
db$adam |> write_cnt(mean_for_all_iris, "mean_iris.rds")

## List and load data
db$adam |> list_content_cnt()

# Manipulate data
setosa_fsetosa <- db$adam |>
  read_cnt("setosa") |>
  filter(Sepal.Length > 5)

fsetosa <- ggplot(setosa_fsetosa) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point()

## Store TFLs
db$tfl$write_cnt(fsetosa$data, "fsetosa.csv")
db$tfl$write_cnt(fsetosa, "fsetosa.rds")

## Store images
tmp_file <- tempfile(fileext = ".png")
ggsave(tmp_file, fsetosa)
db$tfl$upload_cnt(tmp_file, "fsetosa.png")

# Check if everything is written into temporary TFL directory
db$tfl$list_content_cnt()
```
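To double-check the round trip, the stored objects can be read back with the same functions used above (a quick sketch mirroring the `read_cnt("setosa")` call):

```r
# Read the iris summary back from the ADaM connector
db$adam |> read_cnt("mean_iris")
```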
In this vignette we showed how to connect to different datasources and how to read and write data from them. We used the `connector.databricks` and `connector.sharepoint` packages to extend the functionality of the `connector` package to connect to Databricks and SharePoint, respectively.
This package is designed to be extensible, so that new data sources can be added easily.