```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```
{brickster} has 1:1 mappings with the clusters REST API, enabling full control of Databricks clusters from your R session.

Clusters have a number of parameters and can be configured to match the needs of a given workload. `db_cluster_create()` facilitates creation of a cluster in a Databricks workspace on any of the supported cloud platforms (AWS, Azure, GCP). Depending on the cloud, you will need to change the node types and set `cloud_attrs` to one of `aws_attributes()`, `azure_attributes()`, or `gcp_attributes()`.
Below we will create a cluster on AWS and then step through using the other supporting functions.
```r
library(brickster)

# create a small cluster on AWS with DBR 9.1 LTS
new_cluster <- db_cluster_create(
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  num_workers = 2,
  node_type_id = "m5a.xlarge",
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  )
)
```
```r
temp <- get_and_start_cluster(cluster_id = new_cluster$cluster_id)
```
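The same workflow applies on the other clouds; only the node type and `cloud_attrs` change. As a sketch, an equivalent cluster on Azure might look like this (the node type and `azure_attributes()` values are illustrative assumptions, not recommendations):

```r
library(brickster)

# create a similar cluster on Azure (illustrative node type)
azure_cluster <- db_cluster_create(
  name = "brickster-cluster-azure",
  spark_version = "9.1.x-scala2.12",
  num_workers = 2,
  node_type_id = "Standard_DS3_v2",
  cloud_attrs = azure_attributes(first_on_demand = 1)
)
```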
Refer to the documentation for details on how to use other parameters not mentioned here (e.g. `spark_conf`).
Before creating a cluster you may want to check the supported values for a number of the parameters. There are functions to assist with this:
| Function | Purpose |
|----------------:|-------------------------------------------------------|
| `db_cluster_runtime_versions()` | Lists runtime versions available in the workspace, useful for finding a relevant `spark_version` |
| `db_cluster_list_node_types()` | Lists supported node types available in the workspace/region, useful for finding a relevant `node_type_id`/`driver_node_type_id` |
| `db_cluster_list_zones()` | AWS only, lists availability zones (AZs) that clusters can occupy |
`db_cluster_get()` will provide details of the cluster we just created, including information such as its state. This can be useful as you may wish to wait for the cluster to be `RUNNING`, which is exactly what `get_and_start_cluster()` does internally before completing.
```r
cluster_info <- db_cluster_get(cluster_id = new_cluster$cluster_id)
cluster_info$state
```
You can edit Databricks clusters to change various parameters using `db_cluster_edit()`. For example, we may decide we want our cluster to autoscale between 2-8 workers and add some tags.
```r
# we are required to input all parameters
db_cluster_edit(
  cluster_id = new_cluster$cluster_id,
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  node_type_id = "m5a.xlarge",
  autoscale = cluster_autoscale(min_workers = 2, max_workers = 8),
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  ),
  custom_tags = list(
    purpose = "brickster_cluster_demo"
  )
)
```
However, if the intention is only to change the size of a given cluster, the `db_cluster_resize()` function is a simpler alternative. You can either adjust the number of workers or change the autoscale range. If the range of workers is adjusted via `autoscale`, the number of workers active on the cluster will be increased/decreased if they fall outside the new bounds.
```r
# adjust the autoscale range to be between 4-6 workers
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  autoscale = cluster_autoscale(min_workers = 4, max_workers = 6)
)
```
It's important to note that if you specify `num_workers` instead of `autoscale` on a cluster that has an existing autoscale range, the cluster will have a fixed number of workers from that point onward.
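As a sketch, fixing the cluster at a set size would look like this (after which the autoscale range no longer applies; the worker count is illustrative):

```r
# fix the cluster at 4 workers; it will stop autoscaling
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  num_workers = 4
)
```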
Databricks clusters can be "pinned", which stops them from being removed 30 days after termination. `db_cluster_pin()` and `db_cluster_unpin()` are the functions used to change whether a cluster is "pinned" or not.
```r
# pin the cluster
db_cluster_pin(cluster_id = new_cluster$cluster_id)

# unpin the cluster
# db_cluster_unpin(cluster_id = new_cluster$cluster_id)
```
There are a few functions that can be used to manage the state of an existing cluster:
| Function | Purpose |
|-----------------:|------------------------------------------------------|
| `db_cluster_start()` | Start a cluster that is inactive |
| `db_cluster_restart()` | Restart a cluster, which must already be running |
| `db_cluster_delete()`/`db_cluster_terminate()` | Terminate an active cluster; does not remove the cluster configuration from Databricks |
| `db_cluster_perm_delete()` | Stops (if active) and permanently deletes a cluster; it will no longer appear in Databricks |
Databricks clusters can have libraries installed from a number of sources using `db_libs_install()` and the associated `lib_*()` functions:
| Function | Library Source |
|--------------:|---------------------|
| `lib_cran()` | CRAN |
| `lib_pypi()` | PyPI |
| `lib_egg()` | Python egg (file) |
| `lib_whl()` | Python wheel (file) |
| `lib_maven()` | Maven |
| `lib_jar()` | JAR (file) |
```r
# installing packages from CRAN on the cluster
db_libs_install(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins"),
    lib_cran(package = "dplyr")
  )
)
```
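Different sources can be combined in a single `db_libs_install()` call; a sketch mixing CRAN and PyPI packages (the package names are illustrative):

```r
# install one CRAN package and one PyPI package together
db_libs_install(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "dplyr"),
    lib_pypi(package = "requests")
  )
)
```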
For convenience, the `wait_for_lib_installs()` function will block until all the libraries for the specified cluster have finished installing.
```r
wait_for_lib_installs(cluster_id = new_cluster$cluster_id)
```
Installation of libraries is asynchronous and will complete in the background. `db_libs_cluster_status()` is used to check the installation status of libraries for a given cluster; `db_libs_all_cluster_statuses()` is used to get the status of all libraries across all clusters in the workspace.
```r
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
```
Libraries can be uninstalled using `db_libs_uninstall()`.
```r
db_libs_uninstall(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins")
  )
)
```
Using `db_libs_cluster_status()` shows that the library will be uninstalled upon restart (e.g. via `db_cluster_restart()`).
```r
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
```
A list of events regarding the cluster's activity can be fetched via `db_cluster_events()`. There are many event types that can occur, and by default the 50 most recent events are returned.
```r
events <- db_cluster_events(cluster_id = new_cluster$cluster_id)
head(events, 1)
```
```r
# clean up: unpin and permanently delete the cluster
db_cluster_unpin(cluster_id = new_cluster$cluster_id)
db_cluster_perm_delete(cluster_id = new_cluster$cluster_id)
```