```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The `connector` package provides a set of functions to connect to different data sources (such as databases, APIs, and file systems) and to read and write data from them. It is designed to be generic and extensible, so that new data sources can be added easily.

This vignette demonstrates how to use the `connector` package to connect to either a file system or a database and access different types of data.
First let's load all the packages we're going to use in this vignette:
```r
library(connector)
library(connector.databricks)
library(connector.sharepoint)
```
You may notice we're not only loading the `connector` package, but `connector.databricks` and `connector.sharepoint` as well. These packages extend the functionality of the `connector` package to connect to Databricks and SharePoint, respectively.
The main function in this package is `connect()`. Based on a configuration file or a list, it creates a `connectors()` object with a connector for each of the specified datasources. The configuration file can be in any format that can be read through `read_file()`, and it must contain a list. If a YAML file is provided, expressions are evaluated when parsing it using `yaml::read_yaml()` with `eval.expr = TRUE`.
The input list (or configuration file) has to have the following structure: an optional `metadata` block with named values that can be reused elsewhere in the file as `{metadata.<name>}` placeholders, and a `datasources` block with one entry per connection, each consisting of a `name` and a `backend` that specifies the connector `type` together with its backend-specific arguments.
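As a minimal sketch (distilled from the examples in this vignette, with illustrative values), that structure looks like this:

```yaml
metadata:                                 # optional, reusable values
  root_path: "/home/my_root_path"
datasources:                              # one entry per connection
  - name: "adam"                          # name used to access the connector
    backend:
      type: "connector::connector_fs"     # connector class to instantiate
      path: "{metadata.root_path}/adam"   # backend-specific arguments
```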
Here is an example of a configuration file with different types of connections:
```yaml
metadata:
  trial: "my_trial"
  project: "my_project"
  instance: "current"
  http_path: !!expr Sys.getenv("DATABRICKS_HTTP_PATH")
  catalog: !!expr Sys.getenv("DATABRICKS_CATALOG_NAME")
  sharepoint: "https://my_organisation.sharepoint.com"
  root_path: "/home/my_root_path"

datasources:
  - name: "adam"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.root_path}/{metadata.project}/{metadata.trial}/{metadata.instance}/adam"

  - name: "tfl"
    backend:
      type: "connector.databricks::connector_databricks_volume"
      catalog: "{metadata.catalog}"
      schema: "{metadata.project}_{metadata.trial}_tfl"
      path: "output"

  - name: "metadata"
    backend:
      type: "connector.databricks::connector_databricks_dbi"
      http_path: "{metadata.http_path}"
      catalog: "{metadata.catalog}"
      schema: "{metadata.project}_{metadata.trial}_metadata"

  - name: "output_sh"
    backend:
      type: "connector.sharepoint::connector_sharepoint"
      site_url: "{metadata.sharepoint}/sites/{metadata.project}_{metadata.trial}_output"
```
In this example we have connections to multiple data sources: ADaM data is stored on the file system, metadata resides in Databricks tables, and TFLs are also on Databricks, but in Volumes, because that is where unstructured data is stored. Finally, the output is stored on SharePoint.
When we want to connect to the datasources, we use the `connect()` function.
```r
# Load data connections
db <- connect()
```
This creates a list of connectors, which can be accessed by their names:
```r
# ADaM data
db$adam

# TFL data
db$tfl

# Metadata
db$metadata

# Output on SharePoint
db$output_sh
```
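The call above relies on the default configuration file. As a sketch (assuming, as described above, that `connect()` also accepts the path to a configuration file or an equivalent R list; the exact argument usage here is illustrative), the same connections could be set up explicitly:

```r
# Illustrative only: connect() is described as accepting a configuration
# file or a list, so the calls below are a sketch of that usage.

# From an explicit configuration file
db <- connect("_connector.yml")

# From an in-memory list with the same structure as the YAML file
config <- list(
  metadata = list(root_path = "/home/my_root_path"),
  datasources = list(
    list(
      name = "adam",
      backend = list(
        type = "connector::connector_fs",
        path = "{metadata.root_path}/adam"
      )
    )
  )
)
db <- connect(config)
```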
After the setup is done, we can use these connections to read and write data from the datasources. The `connector` packages provide a set of functions for reading and writing data, and they all have a similar interface, so it's easy to switch between backends.
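As a quick sketch of that shared interface (using only functions that appear later in this vignette), the same verbs are used no matter which backend sits behind a connector:

```r
# The same generic verbs work for every connector, regardless of backend.
db$adam |> write_cnt(iris, "iris.rds")   # write to the file-system connector
db$adam |> list_content_cnt()            # list what is stored there
db$adam |> read_cnt("iris")              # read it back
db$metadata |> write_cnt(iris, "iris")   # same verb against the Databricks table connector
```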
Now we will show how to read and write different types of data from and to different datasources. In these examples we will be using the `iris` and `mtcars` datasets.
Here is an example of writing data to the ADaM datasource:
```r
library(dplyr)

# Manipulate data

## Iris data
setosa <- iris |>
  filter(Species == "setosa")

mean_for_all_iris <- iris |>
  group_by(Species) |>
  summarise_all(list(mean, median, sd, min, max))

## Mtcars data
cars <- mtcars |>
  filter(mpg > 22)

mean_for_all_mtcars <- mtcars |>
  group_by(gear) |>
  summarise(
    across(
      everything(),
      list("mean" = mean, "median" = median, "sd" = sd, "min" = min, "max" = max),
      .names = "{.col}_{.fn}"
    )
  ) |>
  tidyr::pivot_longer(
    cols = -gear,
    names_to = c(".value", "stat"),
    names_sep = "_"
  )

## Store data
db$adam |> write_cnt(setosa, "setosa", overwrite = TRUE)
db$adam |> write_cnt(mean_for_all_iris, "mean_iris", overwrite = TRUE)
db$adam |> write_cnt(cars, "cars_mpg", overwrite = TRUE)
db$adam |> write_cnt(mean_for_all_mtcars, "mean_mtcars", overwrite = TRUE)
```
Now let's read the data back, manipulate it a bit, and write it to SharePoint and Databricks. This way we can save different types of data in different formats.
```r
library(gt)
library(tidyr)
library(ggplot2)

# List and load data
db$adam |> list_content_cnt()

table <- db$adam |> read_cnt("mean_mtcars")

gttable <- table |>
  gt(groupname_col = "gear")

## Save table using Databricks Volumes
db$tfl$write_cnt(gttable$`_data`, "tmeanallmtcars.csv")
db$tfl$write_cnt(gttable, "tmeanallmtcars.rds")

## Using Sharepoint
tmp_file <- tempfile(fileext = ".docx")
gtsave(gttable, tmp_file)
db$output_sh$upload_cnt(tmp_file, "tmeanallmtcars.docx")

# Manipulate data
setosa_fsetosa <- db$adam |>
  read_cnt("setosa") |>
  filter(Sepal.Length > 5)

fsetosa <- ggplot(setosa_fsetosa) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point()

## Store TFLs using Databricks Volumes
db$tfl$write_cnt(fsetosa$data, "fsetosa.csv")
db$tfl$write_cnt(fsetosa, "fsetosa.rds")

## Using Sharepoint
tmp_file <- tempfile(fileext = ".png")
ggsave(tmp_file, fsetosa)
db$output_sh$upload_cnt(tmp_file, "fsetosa.png")

## Using Databricks Volumes
db$tfl$upload_cnt(contents = tmp_file, file_path = "fsetosa.png")
```
```r
# Use a temporary directory as working directory for the example below
tmp <- withr::local_tempdir()
knitr::opts_knit$set(root.dir = tmp)
```
Here is an example anyone can run to see how the `connector` package works. We will use the configuration file provided below, which only uses the file system as the connection type for the ADaM and TFL data.
```r
'metadata:
  adam_path: !expr file.path(getwd(), "adam")
  tfl_path: !expr file.path(getwd(), "tfl")

datasources:
  - name: "adam"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.adam_path}"
  - name: "tfl"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.tfl_path}"
' |>
  writeLines("_connector.yml")
```
_connector.yml:
`r readLines("_connector.yml")`
Now, let's run the example:
```r
library(connector)
library(dplyr)
library(ggplot2)

# Let's create ADaM and TFL directories
dir.create("adam")
dir.create("tfl")

# Load data connections
db <- connect()

## Iris data
setosa <- iris |>
  filter(Species == "setosa")

mean_for_all_iris <- iris |>
  group_by(Species) |>
  summarise_all(list(mean, median, sd, min, max))

## Store data
db$adam |> write_cnt(setosa, "setosa.rds")
db$adam |> write_cnt(mean_for_all_iris, "mean_iris.rds")

## List and load data
db$adam |> list_content_cnt()

# Manipulate data
setosa_fsetosa <- db$adam |>
  read_cnt("setosa") |>
  filter(Sepal.Length > 5)

fsetosa <- ggplot(setosa_fsetosa) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point()

## Store TFLs
db$tfl$write_cnt(fsetosa$data, "fsetosa.csv")
db$tfl$write_cnt(fsetosa, "fsetosa.rds")

## Store images
tmp_file <- tempfile(fileext = ".png")
ggsave(tmp_file, fsetosa)
db$tfl$upload_cnt(tmp_file, "fsetosa.png")

# Check if everything is written into temporary TFL directory
db$tfl$list_content_cnt()
```
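To double-check the round trip, the stored objects can be read back with the same functions used above (a quick sketch mirroring the `read_cnt("setosa")` call):

```r
# Read the iris summary back from the ADaM connector
db$adam |> read_cnt("mean_iris")
```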
In this vignette we showed how to connect to different datasources and how to read and write data from them. We used the `connector.databricks` and `connector.sharepoint` packages to extend the functionality of the `connector` package to connect to Databricks and SharePoint, respectively.
This package is designed to be extensible, so that new data sources can be added easily.