For a new project, start by initializing the project using `dpbuild::dp_init()`, which does the following:

- sets up the project on the development branch given by `branch_name`
- initializes `renv` to capture package dependencies
- creates the `daap_config.yaml` configuration file
First, create a new repository with your project name on GitHub and provide the
repository URL to `dp_init()`. An example would be as follows:
```r
library(daapr)

board_params_set_dried <- fn_dry(board_params_set_s3(
  bucket_name = "daap_bucket",
  region = "us-west-1"
))

# Dry function call to setting credentials
creds_set_dried <- fn_dry(creds_set_aws(
  key = Sys.getenv("AWS_KEY"),
  secret = Sys.getenv("AWS_SECRET")
))

# Initialize dp repo
dp_repo <- dp_init(
  project_path = "dp_test1",
  project_description = "Test data product",
  branch_name = "us001",
  branch_description = "User story 1",
  readme_general_note = "This data object is generated for testing purposes",
  board_params_set_dried = board_params_set_dried,
  creds_set_dried = creds_set_dried,
  github_repo_url = "<GIT PATH/dp_test1.git>"
)
```
NOTE: `dp_init()` builds the YAML config file, `daap_config.yaml`, with all of the
configurations specified. Configuration includes key:value pairs as well as
instructions for function calls. In the example above, instructions for two
function calls are provided. These function call instructions can be thought of
as "dried" functions that can be "hydrated" later when executed:
```r
board_params_set_s3(bucket_name = "daap_bucket", region = "us-west-1")
creds_set_aws(key = Sys.getenv("AWS_KEY"), secret = Sys.getenv("AWS_SECRET"))
```
Note that the second function call relies on `AWS_KEY` and `AWS_SECRET` being
available in the environment when the function is hydrated. Do not pass keys
or secrets directly to the `creds_set_*()` functions. Instead, use environment
variables as above or a password manager package such as keyring.
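For example, one way to keep secrets out of your code and environment files is the keyring package (a sketch; the service names `AWS_KEY` and `AWS_SECRET` are just illustrative):

```r
# Store the secrets once in the OS credential store (prompts interactively)
keyring::key_set("AWS_KEY")
keyring::key_set("AWS_SECRET")

# Reference them in the dried call; they are only read when the call is hydrated
creds_set_dried <- fn_dry(creds_set_aws(
  key    = keyring::key_get("AWS_KEY"),
  secret = keyring::key_get("AWS_SECRET")
))
```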
After initializing the project, set your working directory to the project directory:
```r
setwd(dp_repo)
```
You can double-check that everything is set up correctly with `is_valid_dp_repository()`.
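As a quick sanity check (assuming the function reports success for a correctly initialized repository):

```r
# Run from inside the project directory after dp_init()
is_valid_dp_repository()
```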
Note: to make sure everything is set up correctly, open the `dp_repo` project to restart your R session and load its `renv` library. In RStudio, you can do this via File > Open Project and selecting the relevant `.Rproj` file.
This step is optional, but highly recommended. The starter code includes:

- `dp_journal.RMD`: a dev journal that both guides you through and documents the steps in building the data product
- `dp_make.R`: the main workflow management script; sourcing this script builds the data product

To add the starter code, run:

```r
dpbuild::dpcode_add(project_path = dp_repo)
```
After adding code, the steps in dp_journal.RMD
will walk you through how to
add and sync input data, build the data product, and deploy it to a remote location.
Goal: this involves following the steps in the dev journal up until the
`source("dp_make.R")` step. The goal of this step is to sync the right subset
(or all) of the input data to the remote data repository and capture the relevant metadata.
Below is an example of adding and syncing data with the `cars` dataset, but you can
upload any data file(s) of interest into the `input_files` folder as long as your
data is in a tabular format.
```r
# Upload data into input_files folder
readr::write_csv(x = cars, file = "./input_files/cars.csv")

# Map all input_files content and clean file labels in the map
input_map <- dpinput_map(project_path = ".")
input_map <- inputmap_clean(input_map = input_map)

# Sync each input file to remote data repo
config <- dpconf_get(project_path = ".")
synced_map <- dpinput_sync(conf = config, input_map = input_map, verbose = TRUE)

# For each sync'd dataset, record info that will help you retrieve it as needed
dpinput_write(project_path = ".", input_d = synced_map)
```
This is where the main logic of building a data product for a user story is implemented:
functions are defined within the `/R` sub-directory of the project and then integrated
into the `dp_make.R` workflow.
Here is where the main logic of the data product is implemented and the data
product is built. As an example, let's make a new function `derive_dist()` that
derives the distance in meters. We can create a new file `derive_dist.R`
in the project's `R/` directory.
```r
derive_dist <- function(data_files_read, config) {
  output <- data_files_read$cars(config = config) %>%
    dplyr::mutate(dist_m = 0.3048 * dist)
  return(output)
}
```
Then, we can modify dp_make.R
to include our derive function so that it gets
built into the data product:
```r
# Derive distance
dist_m = derive_dist(data_files_read = data_files_read, config = config)
```
We also need to make sure our derived data gets added to the `dp_structure()` call within
`dp_make.R`:
```r
# Structure data obj
data_object = dp_structure(
  data_files_read = data_files_read,
  config = config,
  output = list(dist_m = dist_m),
  metadata = list()
)
```
The output can contain many datasets, structured as desired in the form of a named list.
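For instance, if you had derived a second dataset (say `speed_kmh`, a hypothetical example), the `output` list could simply be extended:

```r
# Sketch: pass several derived datasets to dp_structure() as a named list
data_object = dp_structure(
  data_files_read = data_files_read,
  config = config,
  output = list(dist_m = dist_m, speed_kmh = speed_kmh),
  metadata = list()
)
```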
Once satisfied with the changes to the derived features, execute the workflow plan in
`dp_make.R` (this step is also included in `dp_journal.RMD`):
source("dp_make.R")
You can check your built data product by inspecting the `.rds` object in the `output_files`
folder before continuing with the next steps.
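For example, a quick way to inspect the built object (a sketch; it assumes at least one `.rds` file has been written to `output_files`):

```r
# List the built output files and read the first rds object
out_files <- list.files("output_files", pattern = "\\.rds$", full.names = TRUE)
dp_built <- readRDS(out_files[1])

# Inspect the top-level structure of the data product
str(dp_built, max.level = 2)
```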
If data testing has been implemented, the test results can be evaluated here and modifications to the code made as needed.
Once the data product meets expectations, you can commit and push your code,
providing a commit message to `dp_commit()`.
NOTE: for your push to work, `Sys.getenv("GITHUB_PAT")` must return a valid GitHub personal access token.

```r
dpbuild::dp_commit(project_path = ".", commit_description = "First dp build: only input data")
dpbuild::dp_push(project_path = ".")
```
This completes one development cycle, making the data product and code ready for deployment. NOTE: committing and pushing can be decoupled, so just as in a standard git workflow, you could add several commits before pushing.
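For example (a sketch; the commit messages are illustrative):

```r
# Several commits, one push -- just as in a standard git workflow
dpbuild::dp_commit(project_path = ".", commit_description = "Add and sync input data")
dpbuild::dp_commit(project_path = ".", commit_description = "Add derive_dist() and rebuild the data product")
dpbuild::dp_push(project_path = ".")
```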
Now your data product is ready to be deployed to the remote location with one call to `dp_deploy()`:
```r
dpdeploy::dp_deploy()
```
The typical access pattern starts with setting up the environment variables, but for brevity we can just use the existing config to connect to the board, get the data, and list what else is on the board.
```r
board_object <- dp_connect(board_params = config$board_params, creds = config$creds)
dp <- dp_get(board_object = board_object, data_name = "dp-test1-us001")
dp_list(board_object = board_object)
```
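For completeness, a sketch of that typical access pattern from a fresh R session (the bucket, region, and data name mirror the earlier examples, and `AWS_KEY`/`AWS_SECRET` are assumed to be set in the environment):

```r
library(daapr)

# Rebuild the board parameters and credentials without a local config
board_params <- board_params_set_s3(bucket_name = "daap_bucket", region = "us-west-1")
creds <- creds_set_aws(key = Sys.getenv("AWS_KEY"), secret = Sys.getenv("AWS_SECRET"))

# Connect to the board and retrieve the deployed data product
board_object <- dp_connect(board_params = board_params, creds = creds)
dp <- dp_get(board_object = board_object, data_name = "dp-test1-us001")
```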