workspace <- "https://field-eng.cloud.databricks.com"
"It's the wild west out here!"
To tame the rich open source ecosystem we must ask ourselves - how should we best organize R packages on Databricks? Any organizational strategy needs to account for the following scenarios at a minimum:
The openness of the sandbox environment will vary across organizations, but in every case where more than one user will be on the system you'll need to think about sharing - which packages are users going to share across the system, and which will be isolated to each user? If you aren't careful here, one user can easily install a package that breaks another users' code. A recommended practice is to agree upon core, shared packages and then allow each user to install their own packages in a personal folder.
Lastly, prior to promoting code to run as part of a production job, you'll need to ensure that package updates don't break anything. In this case both code and dependencies need to be tightly version controlled. This vignette only focuses on setting up the development environment, so we'll save that for another time.
With that said, here's how the bricksteR
package can help manage packages in the development environment.
Let's start with an analogy. The books at your local library are shared by the larger community - anyone in town can walk into the library and access its books. At home you likely have a personal library, which may include books that are not available to everyone in town.
The R packages you know and love are similar to books - they live in a library! This may be a shared, common library, or your own personal library. Where are these libraries located? R libraries live in directories on the filesystem, and when you use library()
R will look for packages there. Furthermore, when you use install.packages()
R will install the package in your library.
There is a special function to show the libraries that R is currently aware of - .libPaths()
. From the official documentation:
.libPaths gets/sets the library trees within which packages are looked for.
This means that by modifying .libPaths()
we can effectively tell R when to add a package to our common, shared library or to our own personal library.
bricksteR
contains the set_library()
function to update the path that R looks to install and load packages from. It takes a single argument, path
, that points to the library location.
# Install and load bricksteR devtools::install_github("RafiKurlansik/bricksteR") library(bricksteR) # Set the path to a personal folder set_library(path = "/dbfs/home/rafi.kurlansik@databricks.com/my_r_packages") # See it added to the paths that R searched for packages .libPaths()
For admins, you can use this function to install into a shared, system-wide library.
For individual users, this will let you curate your personal library and avoid collisions with other packages on the system. If someone installs a version of a package that conflicts with your work - no problem! If using a personal library, R will have no idea about what that other user is doing. Remember, problems generally only crop up when we install (and overwrite) packages into the shared folder.
On Databricks, we recommend installing packages into a personal folder located on DBFS. This is because DBFS points to cloud storage and will persist packages through a cluster restart, from cluster to cluster, and will generally be faster than installing many packages every time a cluster comes online. To facilitate a smooth installation onto cloud storage bricksteR
offers the curate()
function!
curate()
serves two key purposes:
Provide a single, simple interface to the common functions used to install packages in R. Today it supports install.packages
, install_github
, install_gitlab
, and install_version
.
Ensure that packages are safely installed and copied to DBFS to persist.
Here a few quick examples of how it is used.
# Don't forget to set_library()! set_library(path = "/dbfs/home/rafi.kurlansik@databricks.com/my_r_packages") # Install latest version of broom into personal library curate("broom") # Install older version of ggplot2 curate("ggplot2", version = "3.0.0", dependencies = F) # Install a package from GitHub curate(pkg = "RafiKurlansik/bricksteR", git_provider = "github", force = T) # Check results dir("/dbfs/home/rafi.kurlansik@databricks.com/my_r_packages")
You can learn more by checking out the documentation (?curate
in RStudio), or inspecting the source code on GitHub.
Let's say you set the wrong library and want to tell R to ignore it. You can use remove_library()
to do so.
remove_library(path = "/dbfs/home/rafi.kurlansik@databricks.com/my_r_packages")
Note: This section is best used by smaller teams or single user clusters. If you are managing a large shared environment, stick to the first method described above.
This section explores the use of the Databricks REST API to programmatically manage R packages and their versions. For additional discussion please see the Package Management section of the R User Guide to Databricks.
In the simplest case one can always use install.packages()
in their code but there are two tradeoffs - the package will not be installed across the cluster (only where the code was run), and it will not be added to the configuration going forward. When the cluster is restarted you'll have to run install.packages()
again.
When you use libraries_install()
from bricksteR
, the package is installed across the cluster and added to its configuration going forward.
library(bricksteR) # Cluster to install on cluster_id <- "0208-155424-fish21" lib_response <- libraries_install(cluster_id, package = "broom", workspace = workspace)
libraries_install
will use RStudio Public Package Manager (RPPM) as the default repo. This is because RStudio has graciously decided to share linux binaries with the community, and Databricks Runtime uses Ubuntu 16.04.
If you want to install a list of packages at one time, you can do so with a data.frame
containing the package names and versions. In this example we use a snapshot from RPPM to choose the version of the second two packages from those dates.
pkg_df <- data.frame(package = c("torch", "merTools", "plyr"), repo = c("https://packagemanager.rstudio.com/all/__linux__/xenial/latest", "https://packagemanager.rstudio.com/all/__linux__/xenial/318", "https://packagemanager.rstudio.com/all/__linux__/xenial/236")) # Apply libraries_install() to each row of our pkg_df results <- apply(pkg_df, 1, function(x) { libraries_install(package = x['package'], repo = x['repo'], cluster_id, workspace) })
As you can see, we can check on the status of their installation by using get_library_statuses()
.
lib_statuses <- get_library_statuses(cluster_id, workspace) lib_statuses$response_df
If you aren't pulling your R package from a public repo, you'll need to upload it to DBFS first. You can do this with bricksteR
by using the DBFS wrappers.
# Copy local source file to DBFS dbfs_mv(from = "/Users/rafi.kurlansik/mypkg.tar.gz", to = "/dbfs/rafi.kurlansik/pkgs/) # Install on cluster install_response <- libraries_install(cluster_id, package = "/dbfs/rafi.kurlansik/pkgs/mypkg.tar.gz", repo = NULL, workspace = workspace)
libraries_uninstall()
will remove the package from cluster configuration, pending a restart.
uninstall_response <- libraries_uninstall(cluster_id, package = "merTools", workspace)
By using libraries_install()
and libraries_uninstall()
for interactive clusters and (coming soon) configure_job()
for jobs, you can easily tune the dependencies for your clusters and build the environments you want - be they stable or chaotic.
If you have any questions, feel free to reach out to me (rafi.kurlansik@databricks.com) or file a GitHub issue.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.