In RafiKurlansik/bricksteR: R Package for Making Databricks Easy to Use.

workspace <- "https://field-eng.cloud.databricks.com"

Degrees of Environment Control

"It's the wild west out here!"

To tame the rich open source ecosystem we must ask ourselves - how should we best organize R packages on Databricks? Any organizational strategy needs to account for the following scenarios at a minimum:

A "sandbox" environment for development
Which packages are system (or shared) packages vs. user packages in the sandbox
Package versioning for production jobs

The openness of the sandbox environment will vary across organizations, but in every case where more than one user will be on the system you'll need to think about sharing - which packages are users going to share across the system, and which will be isolated to each user? If you aren't careful here, one user can easily install a package that breaks another users' code. A recommended practice is to agree upon core, shared packages and then allow each user to install their own packages in a personal folder.

Lastly, prior to promoting code to run as part of a production job, you'll need to ensure that package updates don't break anything. In this case both code and dependencies need to be tightly version controlled. This vignette only focuses on setting up the development environment, so we'll save that for another time.

With that said, here's how the bricksteR package can help manage packages in the development environment.

R Packages and Libraries

Let's start with an analogy. The books at your local library are shared by the larger community - anyone in town can walk into the library and access its books. At home you likely have a personal library, which may include books that are not available to everyone in town.

The R packages you know and love are similar to books - they live in a library! This may be a shared, common library, or your own personal library. Where are these libraries located? R libraries live in directories on the filesystem, and when you use library() R will look for packages there. Furthermore, when you use install.packages() R will install the package in your library.

There is a special function to show the libraries that R is currently aware of - .libPaths(). From the official documentation:

.libPaths gets/sets the library trees within which packages are looked for.

This means that by modifying .libPaths() we can effectively tell R when to add a package to our common, shared library or to our own personal library.

Setting the Library

bricksteR contains the set_library() function to update the path that R looks to install and load packages from. It takes a single argument, path, that points to the library location.

# Install and load bricksteR
devtools::install_github("RafiKurlansik/bricksteR")
library(bricksteR)

# Set the path to a personal folder
set_library(path = "/dbfs/home/rafi.kurlansik@databricks.com/my_r_packages")

# See it added to the paths that R searched for packages
.libPaths()

For admins, you can use this function to install into a shared, system-wide library.

For individual users, this will let you curate your personal library and avoid collisions with other packages on the system. If someone installs a version of a package that conflicts with your work - no problem! If using a personal library, R will have no idea about what that other user is doing. Remember, problems generally only crop up when we install (and overwrite) packages into the shared folder.

Curate a Library

On Databricks, we recommend installing packages into a personal folder located on DBFS. This is because DBFS points to cloud storage and will persist packages through a cluster restart, from cluster to cluster, and will generally be faster than installing many packages every time a cluster comes online. To facilitate a smooth installation onto cloud storage bricksteR offers the curate() function!

curate() serves two key purposes:

Provide a single, simple interface to the common functions used to install packages in R. Today it supports install.packages, install_github, install_gitlab, and install_version.
Ensure that packages are safely installed and copied to DBFS to persist.

Here a few quick examples of how it is used.

# Don't forget to set_library()!
set_library(path = "/dbfs/home/rafi.kurlansik@databricks.com/my_r_packages")

# Install latest version of broom into personal library
curate("broom")

# Install older version of ggplot2
curate("ggplot2", version = "3.0.0", dependencies = F)

# Install a package from GitHub
curate(pkg = "RafiKurlansik/bricksteR", git_provider = "github", force = T)

# Check results
dir("/dbfs/home/rafi.kurlansik@databricks.com/my_r_packages")

You can learn more by checking out the documentation (?curate in RStudio), or inspecting the source code on GitHub.

Removing a Library

Let's say you set the wrong library and want to tell R to ignore it. You can use remove_library() to do so.

remove_library(path = "/dbfs/home/rafi.kurlansik@databricks.com/my_r_packages")

Installing Packages with the Databricks REST API

Note: This section is best used by smaller teams or single user clusters. If you are managing a large shared environment, stick to the first method described above.

This section explores the use of the Databricks REST API to programmatically manage R packages and their versions. For additional discussion please see the Package Management section of the R User Guide to Databricks.

Installing Packages on Interactive Clusters

In the simplest case one can always use install.packages() in their code but there are two tradeoffs - the package will not be installed across the cluster (only where the code was run), and it will not be added to the configuration going forward. When the cluster is restarted you'll have to run install.packages() again.

Adding a Single Package

When you use libraries_install() from bricksteR, the package is installed across the cluster and added to its configuration going forward.

library(bricksteR)

# Cluster to install on
cluster_id <- "0208-155424-fish21"
lib_response <- libraries_install(cluster_id, package = "broom", workspace = workspace)

libraries_install will use RStudio Public Package Manager (RPPM) as the default repo. This is because RStudio has graciously decided to share linux binaries with the community, and Databricks Runtime uses Ubuntu 16.04.

Multiple Packages and Their Versions

If you want to install a list of packages at one time, you can do so with a data.frame containing the package names and versions. In this example we use a snapshot from RPPM to choose the version of the second two packages from those dates.

pkg_df <- data.frame(package = c("torch", "merTools", "plyr"),
                     repo = c("https://packagemanager.rstudio.com/all/__linux__/xenial/latest",
                              "https://packagemanager.rstudio.com/all/__linux__/xenial/318",
                              "https://packagemanager.rstudio.com/all/__linux__/xenial/236"))

# Apply libraries_install() to each row of our pkg_df
results <- apply(pkg_df, 1, function(x) {
  libraries_install(package = x['package'],
                    repo = x['repo'],
                    cluster_id,
                    workspace)
  })

As you can see, we can check on the status of their installation by using get_library_statuses().

lib_statuses <- get_library_statuses(cluster_id, workspace)
lib_statuses$response_df

Custom Packages

If you aren't pulling your R package from a public repo, you'll need to upload it to DBFS first. You can do this with bricksteR by using the DBFS wrappers.

# Copy local source file to DBFS
dbfs_mv(from = "/Users/rafi.kurlansik/mypkg.tar.gz", to = "/dbfs/rafi.kurlansik/pkgs/)

# Install on cluster
install_response <- libraries_install(cluster_id,
                                      package = "/dbfs/rafi.kurlansik/pkgs/mypkg.tar.gz", 
                                      repo = NULL, 
                                      workspace = workspace)

Uninstalling Packages

libraries_uninstall() will remove the package from cluster configuration, pending a restart.

uninstall_response <- libraries_uninstall(cluster_id, package = "merTools", workspace)

Conclusion

By using libraries_install() and libraries_uninstall() for interactive clusters and (coming soon) configure_job() for jobs, you can easily tune the dependencies for your clusters and build the environments you want - be they stable or chaotic.

If you have any questions, feel free to reach out to me (rafi.kurlansik@databricks.com) or file a GitHub issue.

RafiKurlansik/bricksteR documentation built on Oct. 13, 2022, 6:58 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

RafiKurlansik/bricksteR
R Package for Making Databricks Easy to Use.

In RafiKurlansik/bricksteR: R Package for Making Databricks Easy to Use.

Degrees of Environment Control

R Packages and Libraries

Setting the Library

Curate a Library

Removing a Library

Installing Packages with the Databricks REST API

Installing Packages on Interactive Clusters

Adding a Single Package

Multiple Packages and Their Versions

Custom Packages

Uninstalling Packages

Conclusion

R Package Documentation

Browse R Packages

We want your feedback!

RafiKurlansik/bricksteR R Package for Making Databricks Easy to Use.

In RafiKurlansik/bricksteR: R Package for Making Databricks Easy to Use.

Degrees of Environment Control

R Packages and Libraries

Setting the Library

Curate a Library

Removing a Library

Installing Packages with the Databricks REST API

Installing Packages on Interactive Clusters

Adding a Single Package

Multiple Packages and Their Versions

Custom Packages

Uninstalling Packages

Conclusion

R Package Documentation

Browse R Packages

We want your feedback!

RafiKurlansik/bricksteR
R Package for Making Databricks Easy to Use.