```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```
{brickster} has 1:1 mappings with the clusters REST API, enabling full control of Databricks clusters from your R session.

Clusters have a number of parameters and can be configured to match the needs of a given workload. `db_cluster_create()` facilitates creation of a cluster in a Databricks workspace on any of the supported cloud platforms (AWS, Azure, GCP). Depending on the cloud, you will need to change the node types and set `cloud_attrs` to one of `aws_attributes()`, `azure_attributes()`, or `gcp_attributes()`.
Below we will create a cluster on AWS and then step through using the other supporting functions.
```r
library(brickster)

# create a small cluster on AWS with DBR 9.1 LTS
new_cluster <- db_cluster_create(
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  num_workers = 2,
  node_type_id = "m5a.xlarge",
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  )
)
```
```r
temp <- get_and_start_cluster(cluster_id = new_cluster$cluster_id)
```
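The same workflow applies on the other clouds; only the node type and `cloud_attrs` change. As a sketch, an equivalent cluster on Azure might look like this (the node type and `azure_attributes()` values are illustrative assumptions, not recommendations):

```r
library(brickster)

# create a similar cluster on Azure (illustrative node type)
azure_cluster <- db_cluster_create(
  name = "brickster-cluster-azure",
  spark_version = "9.1.x-scala2.12",
  num_workers = 2,
  node_type_id = "Standard_DS3_v2",
  cloud_attrs = azure_attributes(first_on_demand = 1)
)
```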
Refer to the documentation for details on how to use other parameters not mentioned here (e.g. `spark_conf`).
Before creating a cluster you may want to check the supported values for a number of the parameters. There are functions to assist with this:
| Function | Purpose |
|----------------:|-------------------------------------------------------|
| `db_cluster_runtime_versions()` | Lists runtime versions available in the workspace, useful for finding a relevant `spark_version` |
| `db_cluster_list_node_types()` | Lists supported node types available in the workspace/region, useful for finding a relevant `node_type_id`/`driver_node_type_id` |
| `db_cluster_list_zones()` | AWS only, lists availability zones (AZs) that clusters can occupy |
`db_cluster_get()` will provide details of the cluster we just created, including information such as its state. This can be useful as you may wish to wait for the cluster to be `RUNNING`, which is exactly what `get_and_start_cluster()` does internally before completing.
```r
cluster_info <- db_cluster_get(cluster_id = new_cluster$cluster_id)
cluster_info$state
```
You can edit Databricks clusters to change various parameters using `db_cluster_edit()`. For example, we may decide we want our cluster to autoscale between 2-8 workers and add some tags.
```r
# we are required to input all parameters
db_cluster_edit(
  cluster_id = new_cluster$cluster_id,
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  node_type_id = "m5a.xlarge",
  autoscale = cluster_autoscale(min_workers = 2, max_workers = 8),
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  ),
  custom_tags = list(
    purpose = "brickster_cluster_demo"
  )
)
```
However, if the intention is only to change the size of a given cluster, the `db_cluster_resize()` function is a simpler alternative. You can either adjust the number of workers or change the autoscale range. If the range of workers is adjusted via `autoscale`, the number of workers active on the cluster will be increased/decreased if they fall outside the new bounds.
```r
# adjust the autoscale range to be between 4-6 workers
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  autoscale = cluster_autoscale(min_workers = 4, max_workers = 6)
)
```
It's important to note that if you specify `num_workers` instead of `autoscale` on a cluster that has an existing autoscale range, the cluster will have a fixed number of workers from that point onward.
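As a sketch, fixing the cluster at a set size would look like this (after which the autoscale range no longer applies; the worker count is illustrative):

```r
# fix the cluster at 4 workers; it will stop autoscaling
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  num_workers = 4
)
```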
Databricks clusters can be "pinned", which stops them from being removed 30 days after termination. `db_cluster_pin()` and `db_cluster_unpin()` are the functions used to change whether a cluster is "pinned" or not.
```r
# pin the cluster
db_cluster_pin(cluster_id = new_cluster$cluster_id)

# unpin the cluster
# db_cluster_unpin(cluster_id = new_cluster$cluster_id)
```
There are a few functions that can be used to manage the state of an existing cluster:
| Function | Purpose |
|-----------------:|------------------------------------------------------|
| `db_cluster_start()` | Start a cluster that is inactive |
| `db_cluster_restart()` | Restart a cluster, which must already be running |
| `db_cluster_delete()`/`db_cluster_terminate()` | Terminate an active cluster; does not remove the cluster configuration from Databricks |
| `db_cluster_perm_delete()` | Stops (if active) and permanently deletes a cluster; it will no longer appear in Databricks |
Databricks clusters can have libraries installed from a number of sources using `db_libs_install()` and the associated `lib_*()` functions:
| Function | Library Source |
|--------------:|---------------------|
| `lib_cran()` | CRAN |
| `lib_pypi()` | PyPI |
| `lib_egg()` | Python egg (file) |
| `lib_whl()` | Python wheel (file) |
| `lib_maven()` | Maven |
| `lib_jar()` | JAR (file) |
```r
# installing packages from CRAN on the cluster
db_libs_install(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins"),
    lib_cran(package = "dplyr")
  )
)
```
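Different sources can be combined in a single `db_libs_install()` call; a sketch mixing CRAN and PyPI packages (the package names are illustrative):

```r
# install one CRAN package and one PyPI package together
db_libs_install(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "dplyr"),
    lib_pypi(package = "requests")
  )
)
```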
For convenience, the `wait_for_lib_installs()` function will block until all the libraries for the specified cluster have finished installing.
```r
wait_for_lib_installs(cluster_id = new_cluster$cluster_id)
```
Installation of libraries is asynchronous and will complete in the background. `db_libs_cluster_status()` is used to check the installation status of libraries for a given cluster; `db_libs_all_cluster_statuses()` is used to get the status of all libraries across all clusters in the workspace.
```r
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
```
Libraries can be uninstalled using `db_libs_uninstall()`.
```r
db_libs_uninstall(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins")
  )
)
```
Using `db_libs_cluster_status()` shows that the library will be uninstalled upon restart (e.g. via `db_cluster_restart()`).
```r
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
```
A list of events regarding the cluster's activity can be fetched via `db_cluster_events()`. There are many event types that can occur, and by default the 50 most recent events are returned.
```r
events <- db_cluster_events(cluster_id = new_cluster$cluster_id)
head(events, 1)
```
```r
# clean up: unpin and permanently delete the cluster
db_cluster_unpin(cluster_id = new_cluster$cluster_id)
db_cluster_perm_delete(cluster_id = new_cluster$cluster_id)
```