Managing Cohort Object

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options("tibble.print_min" = 5, "tibble.print_max" = 5)
library(magrittr)
library(cohortBuilder)

When working with an already defined cohort, you may want to change its configuration (e.g. a filter value) without recreating the cohort from scratch.

cohortBuilder offers various methods that perform common Cohort management operations.

To demonstrate the functionality, we'll work with the librarian_cohort object defined below:

librarian_source <- set_source(
  as.tblist(librarian)
)

librarian_cohort <- librarian_source %>% 
  cohort(
    step(
      filter(
        "discrete", id = "author", dataset = "books", 
        variable = "author", value = "Dan Brown"
      ),
      filter(
        "discrete", id = "program", dataset = "borrowers", 
        variable = "program", value = "premium", keep_na = FALSE
      )
    ),
    step(
      filter(
        "range", id = "copies", dataset = "books", 
        variable = "copies", range = c(-Inf, 5)
      )
    ),
    run_flow = TRUE
  )

Managing filters

To manage filter configuration, you can call the following methods:

Updating a filter:

librarian_cohort %>% 
  update_filter(
    step_id = 1, filter_id = "author", value = c("Dan Brown", "Khaled Hosseini")
  )

sum_up(librarian_cohort)

Adding a new filter:

librarian_cohort %>% 
  add_filter(
    filter(
      "date_range", id = "issue_date", dataset = "issues", 
      variable = "date", range = c(as.Date("2010-01-01"), Inf)
    ),
    step_id = 2
  )

sum_up(librarian_cohort)

Removing a filter:

librarian_cohort %>% 
  rm_filter(step_id = 2, filter_id = "copies")

sum_up(librarian_cohort)

By default, the configuration changes above don't trigger data recalculation, so we need to call the run method.

Calling run triggers the computation of all steps. When the changes affect only later steps, you can skip recalculating the earlier ones by specifying the min_step_id parameter. Here we recompute starting from the second step:

run(librarian_cohort, min_step_id = 2)

get_data(librarian_cohort)

Note. To recompute the data immediately after calling one of the above methods, set run_flow = TRUE in the method call.
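As a minimal sketch of the run_flow option, the snippet below updates a filter and recomputes the data in a single call. The names demo_cohort and updated_books are introduced here for illustration only; the sketch assumes the bundled librarian data and that update_filter accepts run_flow as the note describes.

```r
library(magrittr)
library(cohortBuilder)

# Build a small one-step cohort on the bundled `librarian` data.
demo_cohort <- set_source(as.tblist(librarian)) %>%
  cohort(
    step(
      filter(
        "discrete", id = "author", dataset = "books",
        variable = "author", value = "Dan Brown"
      )
    ),
    run_flow = TRUE
  )

# Update the filter and recompute in the same call,
# so no separate run() is needed afterwards.
demo_cohort %>%
  update_filter(
    step_id = 1, filter_id = "author",
    value = c("Dan Brown", "Khaled Hosseini"),
    run_flow = TRUE
  )

# The returned data already reflects the updated filter.
updated_books <- get_data(demo_cohort)$books
unique(updated_books$author)
```

This is convenient for one-off changes. When applying several changes in a row, leaving run_flow at its default and calling run once at the end avoids repeated recalculation.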

Managing steps

As with filters, you can manage steps directly on the Cohort. cohortBuilder offers the add_step and rm_step methods to add a new step or remove an existing one, respectively.

librarian_cohort %>% 
  rm_step(step_id = 1)

sum_up(librarian_cohort)

Note. Removing a step other than the last one renumbers the remaining step ids, so that step numbering always starts at 1.

librarian_cohort %>% 
  add_step(
    step(
      filter(
        "discrete", id = "author", dataset = "books", 
        variable = "author", value = "Dan Brown"
      ),
      filter(
        "discrete", id = "program", dataset = "borrowers", 
        variable = "program", value = "premium", keep_na = FALSE
      )
    )
  )

sum_up(librarian_cohort)

Note. All the methods used for managing steps and filters can also be called on the Source object itself. See vignette("cohort-configuration").
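A short sketch of that pattern follows: the configuration is assembled on the Source first, and the Cohort is created from it afterwards. The names configured_source and source_cohort are illustrative. The sketch assumes that add_filter on a Source takes the same filter and step_id arguments as on a Cohort and returns the Source so that calls can be piped; check vignette("cohort-configuration") for the exact behaviour.

```r
library(magrittr)
library(cohortBuilder)

# Attach a step to the Source, then extend that step with another filter,
# all before any Cohort exists.
configured_source <- set_source(as.tblist(librarian)) %>%
  add_step(
    step(
      filter(
        "discrete", id = "author", dataset = "books",
        variable = "author", value = "Dan Brown"
      )
    )
  ) %>%
  add_filter(
    filter(
      "range", id = "copies", dataset = "books",
      variable = "copies", range = c(-Inf, 5)
    ),
    step_id = 1  # assumed to work as in the Cohort method
  )

# The Cohort picks up the configuration stored in the Source.
source_cohort <- cohort(configured_source)
sum_up(source_cohort)
```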

Managing source

The last Cohort configuration component, the source, can also be managed within the Cohort itself. With the update_source method you can change the source defined in an existing Cohort.

Below we update the cohort with a Source that has the source_code parameter defined. This argument supplies the source object definition printed in the reproducible code (use it when the default method doesn't produce reasonable output).

code(librarian_cohort, include_methods = NULL)

new_source <- set_source(
  as.tblist(librarian),
  source_code = quote({
    source <- list()
    source$dtconn <- as.tblist(librarian)
  })
)

update_source(librarian_cohort, new_source)
sum_up(librarian_cohort)
code(librarian_cohort, include_methods = NULL)

Note that updating the source doesn't remove the Cohort configuration (steps and filters). To clear the configuration, set keep_steps = FALSE:

update_source(librarian_cohort, new_source, keep_steps = FALSE)
sum_up(librarian_cohort)

You can also use update_source to add a Source to an empty Cohort:

new_source <- set_source(
  as.tblist(librarian)
)
empty_cohort <- cohort()
update_source(empty_cohort, new_source)
code(empty_cohort, include_methods = NULL)

The update_source method can also be useful if you want to update the source along with the steps and filters configuration.

In this case, it is good practice to keep the configuration directly in the Source:

source_one <- set_source(
  as.tblist(librarian)
) %>% 
  add_step(
    step(
      filter(
        "discrete", id = "author", dataset = "books", 
        variable = "author", value = "Dan Brown"
      ),
      filter(
        "discrete", id = "program", dataset = "borrowers", 
        variable = "program", value = "premium", keep_na = FALSE
      )
    )
  )

source_two <- set_source(
  as.tblist(librarian)
) %>% 
  add_step(
    step(
      filter(
        "range", id = "copies", dataset = "books", 
        variable = "copies", range = c(-Inf, 5)
      )
    )
  )

my_cohort <- cohort(source_one)
sum_up(my_cohort)

update_source(my_cohort, source_two)
sum_up(my_cohort)


cohortBuilder documentation built on Sept. 25, 2024, 5:06 p.m.