
While Databricks supports R users through interactive notebooks and a hosted instance of RStudio Server, it can be cumbersome to convert R files into production jobs. bricksteR makes it easy to quickly turn .R and .Rmd files into automated jobs that run on Databricks by using the Databricks REST API.

With bricksteR you can import R scripts and R Markdown files into your workspace, create, run, reset, and delete jobs, view jobs and their runs, and export notebooks back out of Databricks, all from R.

Installation

To get started, install the package from GitHub:

devtools::install_github("RafiKurlansik/bricksteR")
library(bricksteR)

Authentication

You will need an Access Token in order to authenticate with the Databricks REST API.

If you are working locally or using DB Connect, a good way to authenticate is to use a .netrc file. By default, bricksteR will look for the .netrc file before checking for a token in the function call. If you don't already have one, create a .netrc file in your home directory that looks like this:

machine <databricks-instance>
login token
password <personal-access-token-value>

You can also authenticate by passing a personal access token directly to function calls, where it is used as a Bearer token. To be clear, storing credentials directly in your code is not recommended. If you are working in a notebook on Databricks, you can use dbutils.secrets.get() to retrieve your token without printing it.
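
A minimal sketch of token-based authentication, assuming the token is stored in an environment variable (the variable name DATABRICKS_TOKEN is only an illustration, not something bricksteR looks for):

# Read a personal access token from an environment variable (hypothetical
# variable name) so it never appears in your code
token <- Sys.getenv("DATABRICKS_TOKEN")

# Pass it explicitly wherever a function accepts a `token` argument, e.g.:
# run_job(job_id = job_id, workspace = workspace, token = token)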

Finally, you will need the workspace URL of your Databricks instance. On AWS this typically has the form https://dbc-a64b1209f-d811.cloud.databricks.com, or, if you have a vanity URL, something like https://mycompany.cloud.databricks.com. On Azure it will have the form https://eastus2.azuredatabricks.net, with the subdomain matching the region where your instance is deployed.

workspace <- "https://field-eng.cloud.databricks.com"

# If running in a Databricks Notebook
# token <- dbutils.secrets.get(scope = "bricksteR", key = "rest_api")

Importing R Scripts to Databricks

A common use case is moving R scripts or R Markdown files to Databricks to run as a notebook. We can automate this process using import_to_workspace().

import_to_workspace(file = "/Users/rafi.kurlansik/RProjects/AnRMarkdownDoc.Rmd",
                    notebook_path = "/Users/rafi.kurlansik@databricks.com/AnRMarkdownDoc",
                    workspace = workspace,
                    overwrite = 'true')

Jobs

Create and Run a Job

Now that the file is in our workspace, we can create a job that will run it as a Notebook Task.

new_job <- create_job(notebook_path = "/Users/rafi.kurlansik@databricks.com/AnRMarkdownDoc", 
                      name = "Beer Sales Forecasting Job",
                      job_config = "default",
                      workspace = workspace)

job_id <- new_job$job_id

Each job requires a job_config, which is a JSON file or JSON formatted string specifying (at a minimum) the task and type of infrastructure required. If no config is supplied, a 'default' cluster is created with 1 driver and 2 workers. Here's what that default configuration looks like:

# Default JSON config
job_config <- paste0('{
  "name": "', name, '",
  "new_cluster": {
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 2
  },
  "email_notifications": {
    "on_start": [],
    "on_success": [],
    "on_failure": []
  },
  "notebook_task": {
    "notebook_path": "', notebook_path, '"
  }
}')

Here, name is the name of your job and notebook_path is the path to the notebook in your workspace that will be run.
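
If the default cluster doesn't fit your workload, you can pass your own JSON string (or a path to a JSON file) as job_config. A sketch following the same structure as the default config above; the cluster size and notification address are illustrative values, not requirements:

custom_config <- '{
  "name": "Beer Sales Forecasting Job",
  "new_cluster": {
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 8
  },
  "email_notifications": {
    "on_failure": ["rafi.kurlansik@databricks.com"]
  },
  "notebook_task": {
    "notebook_path": "/Users/rafi.kurlansik@databricks.com/AnRMarkdownDoc"
  }
}'

new_job <- create_job(notebook_path = "/Users/rafi.kurlansik@databricks.com/AnRMarkdownDoc",
                      name = "Beer Sales Forecasting Job",
                      job_config = custom_config,
                      workspace = workspace)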

Creating a job by itself will not execute any code. If we want our beer forecast - and we do - then we need to run the job. We can run it by ID or by name.

# Specifying the job_id
run_job(job_id = job_id, workspace = workspace, token = NULL)

# Specifying the job name
run_job(name = "Beer Sales Forecasting Job", workspace = workspace, token = NULL)

Running a job by name is convenient, but nothing stops you from giving two jobs the same name on Databricks. When names collide, bricksteR currently requires you to either use the unique job ID or rename the job you want to run.

If you want to run a job immediately, one option is to use the create_and_run_job() function:

# By specifying a local file we implicitly import it
running_new_job <- create_and_run_job(file = "/Users/rafi.kurlansik/agg_and_widen.R",
                                      name = "Lager Sales Forecasting Job",
                                      notebook_path = "/Users/rafi.kurlansik@databricks.com/agg_and_widen", 
                                      job_config = "default",
                                      workspace = workspace,
                                      overwrite = 'true')

View Jobs and Runs

Two handy functions return the jobs in a workspace and the runs for a job: jobs_list() and runs_list(). Among the objects each returns is a concise R data frame, response_tidy, that allows for quick inspection.

jobs <- jobs_list(workspace = workspace)
beer_runs <- runs_list(name = "Lager Sales Forecasting Job", workspace = workspace)

head(jobs$response_tidy)
head(beer_runs$response_tidy)
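
A sketch of using the tidy data frame to look up a job ID by name. The column names below (settings.name, job_id) are assumptions about the flattened API response; check names(jobs$response_tidy) in your session before relying on them:

# Column names here are assumptions; inspect the data frame first
names(jobs$response_tidy)

# Keep rows whose job name mentions "Forecasting", then pull the job IDs
forecasting_jobs <- subset(jobs$response_tidy, grepl("Forecasting", settings.name))
forecasting_jobs$job_id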

Reset a Job

If you need to update the config for a job, use reset_job().

# JSON string or file path can be passed
new_config <- "/Users/rafi.kurlansik/Desktop/nightly_model_training.json"

reset_job(job_id = job_id, new_config = new_config, workspace = workspace)
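
As the comment notes, new_config can also be a JSON-formatted string. A short sketch that reuses the structure of the default config, here only to change the worker count (values are illustrative):

new_config <- '{
  "name": "Beer Sales Forecasting Job",
  "new_cluster": {
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "r3.xlarge",
    "num_workers": 4
  },
  "notebook_task": {
    "notebook_path": "/Users/rafi.kurlansik@databricks.com/AnRMarkdownDoc"
  }
}'

reset_job(job_id = job_id, new_config = new_config, workspace = workspace)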

Delete a Job

If you need to clean up a job from the workspace, use delete_job().

delete_job(job_id = job_id, workspace = workspace)

Export from Databricks

To export notebooks from Databricks, use export_from_workspace().

export_from_workspace(workspace_path = "/Users/rafi.kurlansik@databricks.com/AnRMarkdownDoc",
                      format = "SOURCE",
                      filename = "/Users/rafi.kurlansik/RProjects/anotherdoc.Rmd",
                      workspace = workspace)

Passing a filename will save the exported data to your local file system. Otherwise it will simply be returned as part of the API response.

Future Plans

That's about all there is to getting started with automating R on Databricks! Whether you are using blastula to email a report or spinning up a massive cluster for ETL or model training, these APIs make it easy to define, schedule, and monitor those jobs from R.

In 2020 I'd like to add support for DBFS operations, and provide additional tools for managing libraries.


Questions or feedback? Feel free to contact me at rafi.kurlansik@databricks.com.


