aqueduct: Run an aqueduct Workflow
In harveybarnhard/aqueduct


View source: R/aqueduct.R

Description

This function runs the workflow given by its arguments, skipping any component of the workflow (node) whose parent nodes have not changed.
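One common way to decide whether a node can be skipped is to compare file modification times. The sketch below only illustrates that idea and is not aqueduct's actual implementation; the helper name is_stale() and the file names are hypothetical.

```r
# Hypothetical helper (not part of aqueduct): a node is treated as stale when
# its output is missing or older than at least one of its input files.
is_stale <- function(output, inputs) {
  if (!file.exists(output)) return(TRUE)
  any(file.mtime(inputs) > file.mtime(output))
}

# Temporary files stand in for a parent node's output and a child node's output
tmp <- tempfile("aqueduct_demo")
dir.create(tmp)
in_file  <- file.path(tmp, "chars.csv")
out_file <- file.path(tmp, "clean_chars.csv")
writeLines("id", in_file)

# The child output does not exist yet, so the node must run
missing_output <- is_stale(out_file, in_file)

writeLines("id", out_file)
t0 <- Sys.time()
Sys.setFileTime(out_file, t0)

# Parent is older than the child output: the node can be skipped
Sys.setFileTime(in_file, t0 - 60)
unchanged_parent <- is_stale(out_file, in_file)

# Parent was modified after the child output was written: the node must re-run
Sys.setFileTime(in_file, t0 + 60)
changed_parent <- is_stale(out_file, in_file)
```

Under this assumption, a node is re-run only when one of its parents has a newer modification time than the node's own output.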

Usage

aqueduct(..., verbose = FALSE)

Arguments

...

Formulas, separated by commas, that dictate the workflow of the project. Each formula is written as

outdir(out) ~ node(indir1(in1) + indir2(in2) + ..., options)

where the components of the formula are described below.

verbose

FALSE by default. If TRUE, all output of the code is shown while the workflow runs.

Formula Arguments

An individual formula is made up of the following components

node	The name of the .R file as it exists in the code folder defined in aqueduct_setup(). Do not include the .R file extension.
indir1, indir2	The names of the directories containing the first and second input files, as defined in aqueduct_setup().
in1, in2	The names of the first and second input files, without extensions, located in indir1 and indir2, respectively. Files are usually in .csv format, but aqueduct() also reads .xlsx and .dta files without the extension being specified. If an input file is not found in its input directory, the subdirectories of that directory are searched.
outdir	The name of the directory for the output file.
out	The name of the output file created by node. aqueduct() determines the type of the output and saves it in the appropriate file format. In most cases the output is a single .csv file. When the main function in node returns a list of dataframe-style objects, one .csv file per list element is saved in outdir under the file name [name_in_list].csv, where name_in_list is the name given to that object in the list; in this case, specifying outdir() alone is sufficient. If the output is not a dataframe, matrix, or vector, it is saved as an .RData file.
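The output rules described for out can be sketched as a small dispatch on the type of the returned object. This is an illustration under assumptions rather than aqueduct's own code: the helper save_output() and its arguments are hypothetical, and only base R functions are used.

```r
# Hypothetical helper (not part of aqueduct): choose a file format from the
# type of object returned by a node.
save_output <- function(result, outdir, out = NULL) {
  if (is.data.frame(result) || is.matrix(result) || is.atomic(result)) {
    # Dataframe, matrix, or vector: a single .csv named after `out`
    paths <- file.path(outdir, paste0(out, ".csv"))
    write.csv(result, paths, row.names = FALSE)
  } else if (is.list(result) && length(result) > 0 &&
             !is.null(names(result)) &&
             all(vapply(result, is.data.frame, logical(1)))) {
    # Named list of data frames: one .csv per element, named after the element
    paths <- file.path(outdir, paste0(names(result), ".csv"))
    for (i in seq_along(result)) {
      write.csv(result[[i]], paths[i], row.names = FALSE)
    }
  } else {
    # Anything else: store the R object itself
    paths <- file.path(outdir, paste0(out, ".RData"))
    save(result, file = paths)
  }
  invisible(paths)
}

# Demonstration in a temporary directory standing in for an output directory
outdir <- tempfile("derived")
dir.create(outdir)
df <- data.frame(id = 1:3, age = c(34, 51, 27))

single_path <- save_output(df, outdir, "chars")
list_paths  <- save_output(list(young = df[df$age < 40, ],
                                old   = df[df$age >= 40, ]), outdir)
other_path  <- save_output(list(cutoff = 40, label = "age"), outdir, "settings")
```

The data frame check comes first because a data frame is itself a list; testing for a list of data frames earlier would misclassify it.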

Formula Options

The options section of the formula contains additional arguments to pass on to the node file.
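To make the formula layout concrete, the sketch below takes one formula apart using base R language objects. It shows how the node, inputs, output, and options can be told apart, assuming exactly one output and at least one input. It is not a description of aqueduct's internals. The helper split_terms() is hypothetical and seed is only an illustrative option.

```r
# A formula in the style described above; it is never evaluated, so the
# directory and node names do not need to exist as R functions.
f <- derived(chars) ~ load_chars(raw(location_file) + raw(race_file), seed = 1996)

# Hypothetical helper: split an expression such as a(x) + b(y) + c(z)
# into its individual terms.
split_terms <- function(expr) {
  if (is.call(expr) && identical(expr[[1]], as.name("+"))) {
    c(split_terms(expr[[2]]), split_terms(expr[[3]]))
  } else {
    list(expr)
  }
}

lhs <- f[[2]]   # outdir(out)
rhs <- f[[3]]   # node(inputs, options)

# Arguments of the node call: the unnamed one holds the inputs,
# the named ones are options passed on to the node.
args <- as.list(rhs)[-1]
arg_names <- names(args)
if (is.null(arg_names)) arg_names <- rep("", length(args))
is_option <- nzchar(arg_names)

parsed <- list(
  node    = as.character(rhs[[1]]),
  outdir  = as.character(lhs[[1]]),
  out     = as.character(lhs[[2]]),
  inputs  = vapply(split_terms(args[!is_option][[1]]), deparse, character(1)),
  options = args[is_option]
)
```

Here parsed$options holds the named arguments that would be handed to the node file, while parsed$inputs lists each indir(in) term separately.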

Value

`workflow`	A dataframe of all the workflow nodes, with each node's timestamp before the run and its timestamp after the run
`plot`	A plot displaying a directed acyclic graph (DAG) of the workflow

Examples

# Let's say we are working for a government agency that provides affordable
# housing to low-income individuals. We want to determine if there are
# subgroups of the population that disproportionately drop out of the
# program despite being eligible. The workflow is:
#    1. Clean and merge characteristic files of individuals eligible
#       for the program using a file called load_chars.R
#    2. Merge characteristic files onto the main database file that contains
#       information on whether or not eligible individuals participated in
#       the program, and if so, how long they participated.
#    3. Run an SVM to classify groups that are disproportionately likely to
#       drop out of the program
#    4. Produce plots based on these results and the underlying
#       characteristics of the population
#    5. Produce a knitted document displaying these results
# The directory structure is as follows, starting from the basepath:
#    /finding_groups
#    ----/code
#    --------/clean
#    ------------/load_chars.R
#    ------------/clean_chars.R
#    --------/build
#    ------------/add_chars.R
#    --------/analyze
#    ------------/svm_classify.R
#    ------------/create_plots.R
#    ------------/produce_report.Rmd
#    ----/data
#    --------/raw
#    ------------/main_db.csv
#    ------------/chars
#    ----------------/location_file.csv
#    ----------------/race_file.csv
#    ----------------/age_file.csv
#    ----------------/education_file.csv
#    --------/derived
#    --------/current
#    ----/output
#    --------/plots

# First, set paths using aqueduct_setup()
aqueduct_setup(
  basepath = "C:/Users/Harvey/GitHub/aqueduct/examples/example1",
  raw      ~ basepath/data/raw,
  derived  ~ basepath/data/derived,
  current  ~ basepath/data/current,
  plots    ~ basepath/output/plots
)
# Then run the aqueduct workflow!
aqueduct(
  raw() ~ create_data(, seed = 1996),
  derived(chars) ~ load_chars(raw(location_file) +
                              raw(race_file) +
                              raw(age_file) +
                              raw(education_file)),
  derived(clean_chars) ~ clean_chars(derived(chars)),
  derived(db_w_chars) ~ add_chars(raw(main_db) + derived(clean_chars)),
  current(classified_groups) ~ svm_classify(derived(db_w_chars)),
  plots(classify_plots) ~ create_plots(current(classified_groups) +
                                       derived(clean_chars)),
  output(final_report)  ~ produce_report() 
)