R/get_data_extracts.R

#
#                _         _    _      _                _    
#               (_)       | |  | |    | |              | |   
#  _ __    ___   _  _ __  | |_ | |__  | |  __ _  _ __  | | __
# | '_ \  / _ \ | || '_ \ | __|| '_ \ | | / _` || '_ \ | |/ /
# | |_) || (_) || || | | || |_ | |_) || || (_| || | | ||   < 
# | .__/  \___/ |_||_| |_| \__||_.__/ |_| \__,_||_| |_||_|\_\
# | |                                                        
# |_|                                                        
# 
# This file is part of the 'rich-iannone/pointblank' package.
# 
# (c) Richard Iannone <riannone@me.com>
# 
# For full copyright and license information, please look at
# https://rich-iannone.github.io/pointblank/LICENSE.html
#


#' Collect data extracts from a validation step
#'
#' @description
#' In an agent-based workflow (i.e., initiating with [create_agent()]), after
#' interrogation with [interrogate()], we can extract the row data that didn't
#' pass row-based validation steps with the `get_data_extracts()` function.
#' There is one discrete extract per row-based validation step and the amount of
#' data available in a particular extract depends on both the fraction of test
#' units that didn't pass the validation step and the level of sampling or
#' explicit collection from that set of units. These extracts can be collected
#' programmatically through `get_data_extracts()` but they may also be
#' downloaded as CSV files from the HTML report generated by the agent's print
#' method or through the use of [get_agent_report()].
#'
#' The availability of data extracts for each row-based validation step depends
#' on whether `extract_failed` is set to `TRUE` within the [interrogate()] call
#' (it is by default). The number of *fail* rows extracted depends on the
#' collection parameters in [interrogate()], and the default behavior is to
#' collect up to the first 5000 *fail* rows.
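#'
#' As a minimal sketch of limiting the collection, the following assumes the
#' `get_first_n` argument of [interrogate()] and the `small_table` dataset
#' included in **pointblank**; the cap of three rows is arbitrary.
#'
#' ```r
#' library(pointblank)
#'
#' # Assumption: `get_first_n` caps the number of *fail* rows kept per step
#' agent_capped <-
#'   create_agent(tbl = small_table) %>%
#'   col_vals_gt(vars(d), value = 1000) %>%
#'   interrogate(extract_failed = TRUE, get_first_n = 3)
#'
#' # The extract for step 1 should hold no more than three rows
#' nrow(get_data_extracts(agent_capped, i = 1))
#' ```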
#'
#' Row-based validation steps are those created by validation functions of the
#' form `col_vals_*()`, along with [conjointly()] and [rows_distinct()]. Only
#' functions from that combined set of validation functions can yield data
#' extracts.
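#'
#' To see which steps produced extracts, we can inspect the names of the
#' returned list. This sketch assumes the `small_table` dataset and pairs a
#' [col_exists()] step (not row-based) with a [col_vals_gt()] step (row-based);
#' the object names are illustrative only.
#'
#' ```r
#' library(pointblank)
#'
#' agent_mixed <-
#'   create_agent(tbl = small_table) %>%
#'   col_exists(vars(a)) %>%                # step 1: not row-based
#'   col_vals_gt(vars(d), value = 1000) %>% # step 2: row-based
#'   interrogate()
#'
#' # Names correspond to the step numbers that have an extract available
#' steps_with_extracts <- names(get_data_extracts(agent_mixed))
#' steps_with_extracts
#' ```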
#'
#' @param agent An agent object of class `ptblank_agent`. It should have had
#'   [interrogate()] called on it, such that the validation steps were carried
#'   out and any sample rows from non-passing validations could potentially be
#'   available in the object.
#' @param i The validation step number, which is assigned to each validation
#'   step by **pointblank** in the order of definition. If `NULL` (the default),
#'   all data extract tables will be provided in a list object.
#' 
#' @return A list of tables if `i` is not provided, or a standalone table if
#'   `i` is given.
#' 
#' @section Examples:
#' 
#' Create a series of two validation steps focused on testing row values for
#' part of the `small_table` object. Use [interrogate()] right after that.
#' 
#' ```r
#' agent <-
#'   create_agent(
#'     tbl = small_table %>%
#'       dplyr::select(a:f),
#'     label = "`get_data_extracts()`"
#'   ) %>%
#'   col_vals_gt(vars(d), value = 1000) %>%
#'   col_vals_between(
#'     columns = vars(c),
#'     left = vars(a), right = vars(d),
#'     na_pass = TRUE
#'   ) %>%
#'   interrogate()
#' ```
#' 
#' Using `get_data_extracts()` with its defaults returns a list of tables,
#' where each table is named after the validation step that has an extract
#' available.
#' 
#' ```r
#' agent %>% get_data_extracts()
#' ```
#' 
#' \preformatted{## $`1`
#' ## # A tibble: 6 × 6
#' ##       a b             c     d e     f    
#' ##   <int> <chr>     <dbl> <dbl> <lgl> <chr>
#' ## 1     8 3-ldm-038     7  284. TRUE  low  
#' ## 2     7 1-knw-093     3  843. TRUE  high 
#' ## 3     3 5-bce-642     9  838. FALSE high 
#' ## 4     3 5-bce-642     9  838. FALSE high 
#' ## 5     4 2-dmx-010     7  834. TRUE  low  
#' ## 6     2 7-dmx-010     8  108. FALSE low  
#' ## 
#' ## $`2`
#' ## # A tibble: 4 × 6
#' ##       a b             c     d e     f    
#' ##   <int> <chr>     <dbl> <dbl> <lgl> <chr>
#' ## 1     6 8-kdg-938     3 2343. TRUE  high 
#' ## 2     8 3-ldm-038     7  284. TRUE  low  
#' ## 3     7 1-knw-093     3  843. TRUE  high 
#' ## 4     4 5-boe-639     2 1036. FALSE low}
#' 
#' 
#' 
#' We can get an extract for a specific step by specifying it in the `i`
#' argument. Let's get the failing rows from the first validation step (the
#' [col_vals_gt()] one).
#' 
#' ```r
#' agent %>% get_data_extracts(i = 1)
#' ```
#' 
#' \preformatted{## # A tibble: 6 × 6
#' ##       a b             c     d e     f    
#' ##   <int> <chr>     <dbl> <dbl> <lgl> <chr>
#' ## 1     8 3-ldm-038     7  284. TRUE  low  
#' ## 2     7 1-knw-093     3  843. TRUE  high 
#' ## 3     3 5-bce-642     9  838. FALSE high 
#' ## 4     3 5-bce-642     9  838. FALSE high 
#' ## 5     4 2-dmx-010     7  834. TRUE  low  
#' ## 6     2 7-dmx-010     8  108. FALSE low}
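#'
#' Requesting a step that doesn't exist, or one that has no extract, raises an
#' error. One way to handle that is a small wrapper; `extract_or_null()` below
#' is a hypothetical helper and not part of **pointblank**.
#'
#' ```r
#' library(pointblank)
#'
#' # Hypothetical helper: return NULL instead of stopping when step `i`
#' # doesn't exist or has no associated extract
#' extract_or_null <- function(agent, i) {
#'   tryCatch(
#'     get_data_extracts(agent, i = i),
#'     error = function(e) NULL
#'   )
#' }
#'
#' agent_small <-
#'   create_agent(tbl = small_table) %>%
#'   col_vals_gt(vars(d), value = 1000) %>%
#'   interrogate()
#'
#' # Step 1 has failing rows, so a table comes back
#' extract_or_null(agent_small, i = 1)
#'
#' # There is no step 99, so the error is caught and NULL comes back
#' is.null(extract_or_null(agent_small, i = 99))
#' ```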
#' 
#' 
#' 
#' @family Post-interrogation
#' @section Function ID:
#' 8-2
#' 
#' @export
get_data_extracts <- function(
    agent,
    i = NULL
) {

  # Stop function if the agent hasn't
  # yet performed an interrogation
  if (!inherits(agent, "has_intel")) {
    
    stop(
      "The `agent` has not yet performed an interrogation.",
      call. = FALSE
    )
  }
  
  # Get the unique validation step numbers
  validation_steps <- unique(agent$validation_set$i)
  
  if (is.null(i)) {
    return(agent$extracts)
  }
  
  # Stop function if the `i`th step does not exist in `agent`
  if (!(i %in% validation_steps)) {
    
    stop("The provided step number does not exist.", call. = FALSE)
  }
  
  # Get the names of the extracts
  extract_names <- names(agent$extracts)
  
  # Stop function if the `i`th step does not have an extract available
  if (!(as.character(i) %in% extract_names)) {
    
    stop(
      "The provided step number does not have an associated extract.",
      call. = FALSE
    )
  }
  
  # Get the data extract
  agent$extracts[[as.character(i)]]
}


pointblank documentation built on April 25, 2023, 5:06 p.m.