The API design of mbtools makes it well suited for workflow managers. This is particularly useful when you have longer analysis chains. For instance, let's look at an example workflow. We start with the config:

library(mbtools)
ref <- system.file("extdata/genomes/zymo_mock.fna.gz", package = "mbtools")
reads <- system.file("extdata/nanopore", package = "mbtools")

conf_p <- config_preprocess(
    trimLeft = 10,
    maxEE = Inf,
    maxN = Inf,
    truncQ = 0,
    out_dir = "preprocessed"
)
conf_a <- config_align(
    reference = ref,
    alignment_dir = "alignments",
    threads = 4,
    use_existing = FALSE
)
conf_c <- config_count(
    reference = ref,
    weights = TRUE,
    maxit = 1000,
    threads = 4
)

Where our actual workflow looks like this:

counting <- find_read_files(reads) %>%
            quality_control(min_score = 20) %>%
            preprocess(conf_p) %>%
            align_long_reads(conf_a) %>%
            count_references(conf_c)

This looks pretty clear but it will run the entire workflow everytime we execute the code. This is particularly costly if we only want to change some parameters of the last step. For instance to check what the effect of weighting on counting is.

In this case we don't need to run preprocessing and aligning again. We could implement this with several if-statements and by saving the immediate outputs but this makes the code way more convoluted. Also this would not tell us if the input files changed.

Workflow management with drake

One popular workflow manager for R is drake which will automatically track what needs to be recomputed and what not. This requires only minimal changes to mbtools workflows, mostly wrapping all input and output location with file_in and file_out.

library(drake)

reads <- file_in("../inst/extdata/nanopore")

plan <- drake_plan(
    files = find_read_files(reads),
    qa = quality_control(files, min_score = 20),
    qual_plot = ggsave(file_out("quals.png"), qa$quality_plot),
    processed = preprocess(files, conf_p),
    aligned = align_long_reads(processed, conf_a),
    counting = count_references(aligned, conf_c)
)

We can take a look if our plan makes sense:

clean()
dconf <- drake_config(plan)
vis_drake_graph(dconf)

As we see drake knows we have to run everything. Let's execute the plan and see what happens. Note, the lock_envir argument which is currently necessary.

make(plan, lock_envir = FALSE)

We can access our final output.

readd(counting)$counts %>% head()

Now if we check our plan...

vis_drake_graph(dconf)

We see that everything is up-to-date. So running the plan again would do nothing:

make(plan)

Now lets change the weighting for counting.

conf_c$weights = FALSE

drake detects that we only have to rerun the counting.

outdated(dconf)

Which we can do with running counting again.

make(plan, lock_envir = FALSE)


Gibbons-Lab/mbtools documentation built on Jan. 28, 2024, 11:08 a.m.