suppressPackageStartupMessages({
  library(kableExtra)
  library(rtrackr)
  library(networkD3)
})

Overview

rtrackr logs every record in a dataset throughout the processing chain. In most cases, when a record is altered or one record is divided into multiple records, rtrackr simply assigns a new trackr id and logs the change.

When data is summarised, however, multiple records become a single record, and rtrackr needs to record the trackr_ids of all parent records. trackr_summarise() provides a convenient way to summarise data without losing information in the trackr_id column.
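To see why a dedicated function is useful, the sketch below runs a plain dplyr::summarise() on a toy data frame. The trackr_id values here are hypothetical placeholders rather than real ids generated by rtrackr. Because summarise() keeps only the grouping columns and the summaries you ask for, the trackr_id column is no longer present in the result.

```r
library(dplyr)

# Toy data; the trackr_id values are hypothetical placeholders,
# not ids produced by rtrackr.
toy <- data.frame(
  trackr_id = c("id_1", "id_2", "id_3"),
  a = c("x", "x", "y"),
  stringsAsFactors = FALSE
)

# A plain summarise keeps the grouping column and the requested
# summary only, so the parent ids are dropped.
plain <- toy %>%
  group_by(a) %>%
  summarise(n = n())

names(plain)
```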

trackr_summarise() works by combining all parent ids into one row, separated by a ", ". The same operation would work for combining records manually outside of R.
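As a rough illustration of the combining step, the base R sketch below joins several parent ids into one ", "-separated string and then splits it apart again. The id values are hypothetical placeholders, and this is not rtrackr's own code, only the same string operation done by hand.

```r
# Hypothetical parent ids (placeholders, not real trackr ids).
parent_ids <- c("id_1", "id_2", "id_3")

# Combine all parent ids into a single value, separated by ", ".
combined <- paste(parent_ids, collapse = ", ")

# Recover the individual parent ids from the combined value.
recovered <- strsplit(combined, ", ", fixed = TRUE)[[1]]
```

Any tool that can join strings with the same separator could produce an equivalent combined value.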

Example workflow

We will use a simple workflow to demonstrate the use of trackr_summarise() in a data processing chain. Continuing from the Getting Started article, we will create a new dataset and log a new processing timepoint with trackr_new().

trackr_dir <- '~/Documents/trackr_dir'
df <- data.frame(a = c('a', 'b', 'c'), b = c(1, 2, 3))
df <- trackr_new(df, trackr_dir = trackr_dir, suppress_success = TRUE)

kable(df) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Now, we will bind the dataset to a copy of itself in which the values of b have been increased by 1, and log the result as a new timepoint.

df <- rbind(df, df %>% dplyr::mutate(b = b + 1))
df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Merged dataframes', suppress_success = TRUE)

kable(df) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

trackr_summarise() is a simple wrapper around dplyr::summarise() and accepts the same arguments.

df <- df %>% 
  dplyr::group_by(a) %>% 
  trackr_summarise(n = dplyr::n())
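For intuition only, a roughly equivalent operation can be written by hand with dplyr, collapsing the ids within each group alongside the requested summary. This is a minimal sketch under assumptions: the toy ids are hypothetical placeholders, and the actual implementation of trackr_summarise() may differ.

```r
library(dplyr)

# Toy data; the trackr_id values are hypothetical placeholders.
toy <- data.frame(
  trackr_id = c("id_1", "id_2", "id_3", "id_4"),
  a = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Summarise each group while keeping every parent id in one
# ", "-separated string per summarised row.
summarised <- toy %>%
  group_by(a) %>%
  summarise(
    n = n(),
    trackr_id = paste(trackr_id, collapse = ", ")
  )
```

Each summarised row then carries the ids of all records that contributed to it.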

kable(df) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Now, we can log a new timepoint with trackr_timepoint().

df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Summarised dataframes', suppress_success = TRUE)

kable(df) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

We will make and log one more change, to better visualize the effect of the summarise operation.

df <- df %>% dplyr::mutate(n = n + 100)
df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Added 100', suppress_success = TRUE)

kable(df) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

To visualize this operation on one record, we write its lineage to a file with trackr_lineage() and plot that file with trackr_network(). See the Getting Started article for more information.

target_id <- df$trackr_id[1]
trackr_lineage(target_id, trackr_dir)

lineage_fn <- paste0(trackr_dir, '/', target_id, '_lineage.json')

trackr_network(lineage_fn)

Clean up

clean_trackr_dir(trackr_dir)

Article by Hamish Gibbs. To report a problem with this package, please create an issue on GitHub.



hamishgibbs/rtrackr documentation built on June 25, 2020, 8:16 p.m.