knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
So far we've been using a local Spark connection to introduce the OmopOnSpark package. In practice, however, when working with patient-level health data our data will most likely be in the cloud-based Databricks platform, which is built around Apache Spark. Once we have created our cdm reference, the same code we have seen when working with a local Spark dataset will also work with Databricks; it is only the way we connect that differs.
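For comparison, the local connection used so far is created along these lines (a minimal sketch of a local sparklyr connection):

library(sparklyr)
# connect to a local Spark session
sc <- spark_connect(master = "local")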
To create your connection, follow the instructions at https://spark.posit.co/deployment/databricks-connect.html. Briefly, you would first save environment variables.
usethis::edit_r_environ()

# add the following to your .Renviron file
DATABRICKS_HOST = "Enter here your Workspace URL"
DATABRICKS_TOKEN = "Enter here your personal token"
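Note that .Renviron is only read when R starts, so restart your session after saving. You can then confirm the variables are visible to R (this simply echoes the value you stored):

# check the environment variable is set
Sys.getenv("DATABRICKS_HOST")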
With these saved, you should now be able to connect with sparklyr, specifying your cluster ID.
library(sparklyr)
con <- spark_connect(
  cluster_id = "Enter here your cluster ID",
  method = "databricks_connect"
)
con
With this, we can check that everything is working and that we have an open connection.
connection_is_open(con)
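For completeness, once we are finished working we can close the connection with sparklyr's spark_disconnect() (not run here, since we keep using con below):

# close the Spark connection when done
spark_disconnect(con)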
With this, we should be able to create a reference to a table. Let's say our OMOP CDM data is in a catalog called "my_catalog" and a schema called "my_omop_schema". We should then be able to create a reference to our person table.
library(dplyr)
tbl(con, I("my_catalog.my_omop_schema.person"))
We should be able to collect the first five rows of this table into R.
tbl(con, I("my_catalog.my_omop_schema.person")) |>
  head(5) |>
  collect()
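Because dplyr verbs on this reference are translated to Spark SQL, we can also filter on the database side before collecting, so only the rows we need are brought into R. A sketch using year_of_birth, a standard column of the OMOP CDM person table:

# filter in Spark, then bring the result into R
tbl(con, I("my_catalog.my_omop_schema.person")) |>
  filter(year_of_birth >= 1980) |>
  collect()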
As well as this, we should be able to go in the other direction and copy data from R to a Spark DataFrame.
spark_cars_df <- sdf_copy_to(con, cars, overwrite = TRUE)
spark_cars_df
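Since spark_cars_df is a reference to a Spark DataFrame, the same dplyr verbs apply to it before results are collected. A minimal sketch using the built-in cars data we just copied:

# summarise in Spark, then collect the result
spark_cars_df |>
  summarise(
    mean_speed = mean(speed, na.rm = TRUE),
    mean_dist = mean(dist, na.rm = TRUE)
  ) |>
  collect()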
If these basics are working, we should be well set up to start working with OmopOnSpark. Here we specify our cdm schema as we've seen above. Now let's say we have another schema called "my_results_schema" where we want to save any study-specific tables; we'll use this when specifying the write schema. In addition, we can give a write prefix, and all the tables we create while working with this cdm reference will start with this prefix.
library(OmopOnSpark)
cdm <- cdmFromSpark(
  con,
  cdmSchema = "my_catalog.my_omop_schema",
  writeSchema = "my_catalog.my_results_schema",
  writePrefix = "study_1_"
)
cdm
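Assuming the cdm object follows the usual list-of-tables interface used elsewhere in these articles (cdm$person and so on; an assumption here rather than something shown above), we could then query the person table through the cdm reference in the same way:

# query the person table via the cdm reference
cdm$person |>
  head(5) |>
  collect()

Any study-specific tables we later materialise in "my_results_schema" through this cdm reference would then get names starting with "study_1_", keeping them separate from other work in the shared schema.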