Build, deploy, and access an (overly) simple data product to get familiar with the concepts and functions.
We are interested in a data product that simply provides the distances in the cars dataset in metric units. See ?cars for additional detail about the dataset.
As this is a new project, we initialize it using dpbuild::dp_init. See the getting started vignette for details of what the initialization does.
library(daapr)

# Dry function call to setting board parameters
board_params_set_dried <- fn_dry(board_params_set_s3(
  bucket_name = "<BUCKET>",
  region = "<REGION>"
))

# Dry function call to setting credentials
creds_set_dried <- fn_dry(creds_set_aws(
  key = Sys.getenv("AWS_KEY"),
  secret = Sys.getenv("AWS_SECRET")
))

# Initialize dp repo
dp_repo <- dp_init(
  project_path = "dp_cars",
  project_description = "Cars data product",
  branch_name = "us001",
  branch_description = "User story 1",
  readme_general_note = "Data product to explore cars stopping distance",
  board_params_set_dried = board_params_set_dried,
  creds_set_dried = creds_set_dried,
  github_repo_url = "<GIT PATH/dp_cars.git>"
)
At this point your project has all the basic components to provide you with a sandbox where you can do your development. It is not necessary, but it may be instructional to clean and restart your R session before this next step. Then, activate and set up the sandbox for this project.
# Switch to project directory
setwd(dp_repo)

# Only necessary if you restarted your R session
if (!"daapr" %in% (.packages())) {
  library("daapr")
}

# Set up "promised" env variables for remote data repository
Sys.setenv("AWS_KEY" = "<BUCKETS AWS KEY>")
Sys.setenv("AWS_SECRET" = "<BUCKETS AWS SECRET>")

# Set up env variables for remote code repository
Sys.setenv("GITHUB_PAT" = "<YOUR GITHUB PAT>")

# Retrieve configuration
config <- dpconf_get(project_path = ".")
In this step you go from whatever content you have in the input_files folder to a metadata representation of the datasets that were read. Here we only have one dataset: cars.csv
# Upload data into input_files folder
readr::write_csv(x = cars, file = "./input_files/cars.csv")

# Map all input_files content and clean file labels in the map
input_map <- dpinput_map(project_path = ".")
input_map <- inputmap_clean(input_map = input_map)

# Sync each read file to the remote data repo
synced_map <- dpinput_sync(conf = config, input_map = input_map, verbose = TRUE)

# For each sync'd dataset, record info that will help you retrieve it as needed
dpinput_write(project_path = ".", input_d = synced_map)
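Before mapping and syncing, it can be worth confirming that the file written to input_files reads back as expected. The following is a minimal sketch of such a round-trip check, not part of daapr. It assumes base R's write.csv and read.csv in place of readr so that it runs without extra packages, and it uses a temporary directory as a stand-in for the project's input_files folder.

```r
# Round-trip check on the built-in cars dataset, using base R only.
# Assumption: a temporary directory stands in for the project's input_files folder.
input_dir <- file.path(tempdir(), "input_files")
dir.create(input_dir, showWarnings = FALSE, recursive = TRUE)
csv_path <- file.path(input_dir, "cars.csv")

# Write the dataset, then read it back
utils::write.csv(cars, file = csv_path, row.names = FALSE)
cars_read <- utils::read.csv(csv_path)

# Compare column names, dimensions and values with the original
same_names <- identical(names(cars_read), names(cars))
same_dims <- identical(dim(cars_read), dim(cars))
same_values <- isTRUE(all.equal(as.numeric(cars_read$speed), cars$speed)) &&
  isTRUE(all.equal(as.numeric(cars_read$dist), cars$dist))
```

If any of the three flags is FALSE, the file on disk does not match the data you intended to map.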
Here is where the main logic of the data product is implemented and the data product is built.
# Read in the input data from what is recorded by dpinput_write
data_files_read <- dpinput_read()

# Build your output data
output <- data_files_read$cars(config = config) %>%
  dplyr::mutate(dist_m = 0.3048 * dist)

# Structure the input, output, metadata ... you wish to have in your data product
data_object <- dp_structure(
  data_files_read = data_files_read,
  output = output,
  config = config
)

# Save and log the data product built
dp_write(data_object = data_object, project_path = ".")
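The conversion itself can be tried in isolation. The sketch below applies the same factor to the built-in cars dataset with base R only, where dist is recorded in feet (see ?cars) and one foot is 0.3048 m. The names feet_to_m and cars_metric are illustrative and do not come from daapr.

```r
# Base R version of the conversion step, independent of the data product.
# feet_to_m and cars_metric are illustrative names.
feet_to_m <- 0.3048

# Add a dist_m column holding the stopping distance in meters
cars_metric <- transform(cars, dist_m = feet_to_m * dist)

head(cars_metric, 3)
```

The result keeps the original speed and dist columns and adds dist_m alongside them, which mirrors what the dplyr::mutate call produces for the data product's output.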
Why so many steps if the above chunk is the main logic? The payoff for all you have done is this: you have built a portable recipe where your metadata, package dependencies, data and logic are all code!
Now, by simply saving the above chunk as an R script, say dp_make.R, you can have the data product reproduced from all the recorded configurations without having to provide the input data. Everything is now code and can be tracked by git.
So, to make your project reproducible, save the above chunk as dp_make.R in the main directory. Sourcing this file in a fresh R session should be all that is needed to reproduce the data product.
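As a rough illustration of the sourcing step, the sketch below wraps it in a small helper. run_dp_make is a hypothetical name and is not part of daapr. It assumes dp_make.R sits in the project's main directory and that the environment variables the project expects have already been set in the session.

```r
# Hypothetical helper (not part of daapr): source dp_make.R from a project
# directory. Returns TRUE if the script was found and sourced, FALSE otherwise.
run_dp_make <- function(project_path = ".") {
  make_file <- file.path(project_path, "dp_make.R")
  if (!file.exists(make_file)) {
    message("No dp_make.R found in: ", project_path)
    return(invisible(FALSE))
  }
  # Run with the project as working directory, since the script uses "." paths
  old_wd <- setwd(project_path)
  on.exit(setwd(old_wd), add = TRUE)
  source("dp_make.R", echo = TRUE)
  invisible(TRUE)
}
```

Calling run_dp_make("dp_cars") from the parent directory would then rebuild the data product, and the working directory is restored afterwards even if the script fails.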
At this point, you can commit and push your code. NOTE: for your push to work, Sys.getenv("GITHUB_PAT") must return the corresponding GitHub PAT.
dp_commit(project_path = ".", commit_description = "First dp build")
dp_push(project_path = ".")
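One way to catch a missing token before pushing is to check the environment up front. The sketch below is a base R helper for that purpose. missing_env_vars is a hypothetical name and is not part of daapr, and the list of variables is taken from the ones set earlier in this walkthrough.

```r
# Hypothetical helper (not part of daapr): return the names of the given
# environment variables that are unset or empty.
missing_env_vars <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

# Variables this walkthrough relies on for the remote data and code repositories
not_set <- missing_env_vars(c("GITHUB_PAT", "AWS_KEY", "AWS_SECRET"))
if (length(not_set) > 0) {
  message("Set these before pushing or deploying: ",
          paste(not_set, collapse = ", "))
}
```

An empty result means all three variables are visible to the current R session.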
Deploy the data product:
dp_deploy()
A typical access pattern starts with setting up the env vars, but for brevity here we can just use the existing config to connect to the board, get the data, and list what else is on the board.
board_object <- dp_connect(board_params = config$board_params, creds = config$creds)
dp <- dp_get(board_object = board_object, data_name = "dp-cars-us001")
dp_list(board_object = board_object)