R/datastep.R

Defines functions fill_missing fix_names perform_merge perform_set has_output assign_attributes copy_df_attributes copy_attributes_sp copy_attributes collect_attributes assign_attribute_list output delete datastep

Documented in datastep delete output

# Set up environment for shared variables
e <- new.env(parent = emptyenv())
e$output <- list()

# Datastep Definition -----------------------------------------------------


#' @title Step through data row-by-row
#' @description The \code{datastep} function allows you to perform
#' row-wise conditional processing on a data frame, data table, or tibble. 
#' The function
#' contains parameters to drop, keep, or rename variables, perform
#' by-group processing, and perform row-wise or column-wise calculations.  
#' @details 
#' Two parameters on the \code{datastep} function are required: 
#' \strong{data} and \strong{steps}.  The \strong{data} parameter is
#' the input data to the data step.  The \strong{steps} parameter contains
#' the code statements you want to apply to the data. The \strong{steps}
#' should be wrapped in curly braces.  When running, the data step
#' will loop through the input data row-by-row, and execute the steps for 
#' each row.  Variables inside the data step can be accessed using 
#' non-standard evaluation (meaning they  do not have to be quoted).
#' 
#' Note that the data step is pipe-friendly.  It can be used within 
#' a \strong{dplyr} pipeline.  The data step allows you to perform
#' deeply nested and complex conditionals within the pipeline.  The data
#' step is also very readable compared to other pipeline conditionals.
#' 
#' @section Automatic Variables:
#' The \code{datastep} function provides five automatic variables. These 
#' variables are generated for every data step, and can 
#' be accessed at any point within the data step: 
#' \itemize{
#'   \item{\strong{data}: Represents the entire input data frame.}
#'   \item{\strong{rw}: Represents the current row.}
#'   \item{\strong{n.}: Contains the row number.}
#'   \item{\strong{first.}: Indicates the beginning of a by-group.}
#'   \item{\strong{last.}: Indicates the end of a by-group.}
#' }
#' Automatic variables will be dropped from the data frame at the end
#' of the data step.  If you wish to keep the automatic variable values,
#' assign the automatic variable to a new variable and keep that variable.
#' 
#' If there are multiple by group variables, the \code{first.} and \code{last.} 
#' automatic variables indicates a either/or combination of all by variables.  
#' In addition,
#' \code{first.X} and \code{last.X} automatic variables will be created for 
#' each variable, where "X" represents the name of the specified variable.
#' As always, these names are case-sensitive.
#' 
#' @section Column Attributes:
#' To set attributes for a column on your data, use the \code{attrib}
#' parameter.  Example attributes include 'label', 'description', 
#' and 'format'.  These types of attributes are set using a named list and a 
#' \code{\link{dsattr}} object. The name of the list item
#' is the column name you want to set attributes on. 
#' The value of the list item is the \code{dsattr} object.
#' For a complete list of available attributes, 
#' see the \code{\link{dsattr}} documentation.
#' 
#' It should be mentioned  that the \code{dsattr} object is not required.  
#' You can also set attributes with a name and a default value.  
#' The default value can be any valid data value, such as a number or string.
#' 
#' The label and format attributes may also be set with the 'label' and 
#' 'format' parameters. These parameters accept a named list with the 
#' labels or formats, and will be assigned to the output data frame.  
#' 
#' @section Optional Parameters:
#' Optional parameters on the \code{datastep} allow you to shape 
#' the output dataset or enhance the operation of the \code{datastep}.  Some
#' parameters are classified as input parameters, and others as output 
#' parameters.  Input parameters modify the data before the data step
#' operations takes place.  Output parameters operate on the data
#' after the data step.
#' 
#' The \code{keep}, \code{drop}, and \code{rename} parameters
#' are output parameters.  These parameters will be applied after the
#' data step statements are executed.  Therefore, within the data step, 
#' refer to variables using the input variable name. New variables may 
#' be created on the fly, just by assigning a value to the new
#' variable name.
#' 
#' The \code{keep}, \code{drop}, and \code{rename} parameters require 
#' quoted variable names, as the variables may not yet exist at the 
#' time they are passed into the function.  Within a data step or 
#' calculate block, however, 
#' variable names do not need to be quoted. 
#' 
#' The \code{calculate} parameter is used to perform vectorized functions
#' on the data prior to executing the data step.  For example, you 
#' may want to determine a mean for a variable in the \code{calculate}
#' block, and then make decisions on that mean in the data step block. 
#' 
#' The \code{retain} parameter allows you to access the prior row value.
#' At the start of the data step, the retained variable is seeded with the 
#' initial value. For each subsequent step, the variable is seeded with the
#' value of the prior step/row.  This functionality allows you to increment 
#' values or perform cumulative operations.
#' 
#' \code{calculate} and \code{retain} are both input parameters.
#' 
#' @section Set and Merge Operations:
#' The \code{datastep} function allows you to join one or more input datasets 
#' into a single output dataset.  There are two operations in this regard:
#' "set" and "merge".  A set operation stacks the datasets vertically, one
#' on top of another. The merge operation joins the datasets horizontally,
#' left and right.  
#' 
#' The \code{datastep} set and merge operations are unusually flexible compared
#' to other join functions in R. The set operation does not require the same 
#' number of columns in each dataset.  Likewise, the merge operation does
#' not require the same number of rows.  In both cases, where there is no
#' corresponding column or row, the function will fill in with NA values.
#' 
#' The merge operation can perform both inner and outer joins.  By default, 
#' the merge performs a full outer join.  If you wish to limit the operation
#' to an inner join, use the "merge_in" parameter to set up variables with 
#' which you can filter the desired rows.  The "merge_in" variables will
#' be populated with 1 or 0 values, which indicate whether or not the 
#' dataset contained that row.  Once these variables are populated, you
#' can easily limit the results using a \code{where} expression, or the 
#' \code{delete} or \code{output} functions from inside the datastep.
#' 
#' @section Data Step Arrays:
#' There are times you may want to iterate over columns in your data step.  Such 
#' iteration is particularly useful when you have a wide dataset,
#' and wish to perform the same operation on several columns.
#' For instance, you may want to calculate the mean for 10 different
#' variables on your dataset.
#' 
#' The \code{arrays} parameter allows you to iterate across columns.  
#' This parameter accepts a named list of 
#' \code{\link{dsarray}} objects.  The \code{dsarray} is essentially
#' a list of columns.  You can use a \code{for} loop to iterate over the
#' \code{dsarray}, and also send it into a vectorized function.  Data 
#' step arrays allow to you to perform row-wise calculations. 
#' For instance, you can calculate 
#' a sum or mean by row for the variables in your array.
#' 
#' @section Output Column Order:
#' By default, the data step will retain the column order of any variables that
#' already exist on the input data set. New variables created 
#' in a data step will be appended to the right of existing variables.  
#' Yet these new variables can sometimes appear in an order that is 
#' unexpected or undesirable.  
#' 
#' There are two ways to control the order of output columns:
#' the \code{keep} parameter and the \code{attrib} parameter.
#' 
#' Columns names included on the 
#' \code{keep} parameter will appear in the order indicated on the keep
#' vector.  This ordering mechanism is appropriate when you have a small
#' number of columns and can easily pass the entire keep list.
#' 
#' To control the order of new variables only, use the \code{attrib} parameter.
#' New variables for which attributes are defined will appear in the 
#' order indicated on the \code{attrib} list.  The \code{attrib} list
#' is useful when you are adding a relatively small number of columns to 
#' an existing data set, and don't want to pass all the column names. 
#' 
#' Remember
#' that you can supply an attribute list with default values only,
#' such as \code{attrib = list(column1 = 0, column2 = "")}.  This style of 
#' attribute definition is convenient if you are only trying to control
#' the order of columns.
#' 
#' If the above two mechanisms to control column order are not sufficient,
#' use the data frame subset operators or column ordering functions 
#' provided by other packages.
#' 
#' @section Datastep Performance:
#' The \code{datastep} is intended to be used on small and medium-sized 
#' datasets.  It is not recommended for large datasets.
#' If your dataset is greater than one million rows, you should consider
#' other techniques for processing your data.  While there is no 
#' built-in restriction on the number of rows, performance of the
#' \code{datastep} can become unacceptable with a large number of rows.
#' @param data The data to step through.
#' @param steps The operations to perform on the data.  This parameter is 
#' specified as a set of R statements contained within 
#' curly braces. If no steps are desired, pass empty curly braces.
#' @param keep A vector of quoted variable names to keep in the output
#' data set. By default, all variables are kept.
#' @param drop A vector of quoted variable names to drop from the output
#' data set. By default, no variables are dropped.
#' @param rename A named vector of quoted variables to rename.  The current
#' variable name should be on the left hand side of the name/value pair,
#' and the new variable name should be on the right.  The rename operation
#' is performed after the data step, the keep, and the drop.  Therefore, 
#' the data steps should use the input variable name.  By default, all variables
#' retain their original names.
#' @param by A vector of quoted variable names to use for by-group processing.
#' This parameter will activate the \code{first.} and \code{last.} automatic
#' variables, that indicate the first or last rows in a group.  These 
#' automatic variables are useful for conditional processing on groups.
#' The function will also create first and last automatic variables for each
#' variable specified in the by group.
#' @param calculate Steps to set up calculated variables.  
#' Calculated variables are commonly generated with summary functions such as
#' \code{mean}, \code{median}, \code{min}, \code{max}, etc.  It is more 
#' efficient to set up calculated variables with the calculate parameter and then 
#' use those variables in the data step, rather than perform the summary
#' function inside the data step.  The calculate block will be executed 
#' immediately before the data step.
#' @param retain A list of variable names and initial values 
#' to retain.  Retained variables will begin the data step with the initial
#' value.  Then for each iteration of the data step, the variable will
#' be populated with the ending value from the previous step.  The retain
#' functionality allows you to perform cumulative operations or decisions
#' based on the value of the previous iteration of the data step.  Initial 
#' values should be of the expected data type for the column.  For example, 
#' for a numeric column set the initial value to a zero, and for a
#' character column, set the initial value to an empty string, i.e.
#' \code{retain = list(col1 = 0, col2 = "")}.  There is no default initial 
#' value for a variable.  You must supply an initial value for each retained
#' variable.
#' @param attrib A named list of attributes.  The list can be either
#' \code{\link{dsattr}} objects or single default values.  The \code{dsattr}
#' object allows you to set more attributes on each column.  The 
#' single default value is convenient if you simply want to create a variable.
#' By default, variables will be created on the fly with no attributes.
#' @param arrays A named list of \code{\link{dsarray}} objects. The 
#' \code{dsarray} is a list of columns which you can 
#' iterate over inside the data step.  You can iterate over a \code{dsarray}
#' either with a \code{for} loop, or with a vectorized function. 
#' The default value of the \code{arrays} parameter is NULL, meaning
#' no arrays are defined.
#' @param sort_check Checks to see if the input data is sorted according to
#' the \code{by} variable parameter.  The sort check will give an error
#' if the input data is not sorted according to the \code{by} variable.
#' The check is turned on if the value of 
#' \code{sort_check} is TRUE, and turned off if FALSE.  The default value
#' is TRUE.  Turn the sort check off if you want to perform by-group 
#' processing on unsorted data, or data that is not sorted according
#' to the by-group.
#' @param format A named list of formats to assign to the data
#' frame.  Formats will be assigned both before and after the datastep.
#' @param label A named list of labels to assign to the output data frame.
#' @param where An expression to filter the output dataset.  The where
#' clause will be applied prior to any drop, keep, or rename statement.
#' Use the \code{expression} function to assign the where clause.
#' @param set A dataset or list of datasets to append to the input 
#' data frame.  The set operation will occur at the beginning of the datastep,
#' prior to the execution of any steps.  The columns in the set datasets
#' do not have to match.  Where there are no matching columns, the missing
#' values will be filled with NA.
#' @param merge A dataset or list of datasets to merge with the input
#' data.  The merge operation will occur at the beginning of the datastep,
#' prior to the execution of any steps.  When the \code{merge} operation is 
#' requested, the \code{merge_by} parameter will be used to indicate which variable(s)
#' to merge by. If no \code{merge_by} is specified, the merge dataset columns will 
#' simply be appended to the right of the input dataset.
#' @param merge_by If the \code{merge} parameter is set, the \code{merge_by} 
#' parameter will be used to identify the variable(s) to merge by. If merge 
#' variables are the same on both datasets, the names may be passed as a simple 
#' quoted vector. If the variable names are different, pass the variables 
#' to merge on as a named vector.  For example, \code{c("ITEMID" = "ITEMCODE")}
#' would specify that the join should occur on the "ITEMID" from the 
#' dataset specified in the \code{data} parameter, and the "ITEMCODE"
#' variable from the dataset specified on the \code{merge} parameter.
#' @param merge_in A vector of column names to be used to hold the merge flags.  
#' The number of names should correspond to the number of 
#' datasets being merged. The
#' merge flags will be populated with 0 or 1 values to indicate whether the record
#' came from the corresponding table. Use the \code{where} parameter, 
#' \code{delete} function, or \code{output} function to filter desired results.
#' @param log Whether or not to log the datastep.  Default is TRUE.  This 
#' parameter is used internally.
#' @return The processed data frame, tibble, or data table.  
#' @family datastep
#' @seealso \code{\link{libname}} function to create a data library, and
#' the \code{\link{dictionary}} function to create a data dictionary.
#' @examples 
#' # Example #1: Simple Data Step
#' df <- datastep(mtcars[1:10,], 
#'                keep = c("mpg", "cyl", "disp", "mpgcat", "recdt", "is8cyl"), {
#'                  
#'   if (mpg >= 20) 
#'     mpgcat <- "High"
#'   else 
#'     mpgcat <- "Low"
#'                  
#'   recdt <- as.Date("1974-06-10")
#'                  
#'   if (cyl == 8)
#'     is8cyl <- TRUE
#'   else 
#'     is8cyl <- FALSE
#'                  
#' })
#' 
#' df
#' #                    mpg cyl  disp mpgcat      recdt
#' # Mazda RX4         21.0   6 160.0   High 1974-06-10
#' # Mazda RX4 Wag     21.0   6 160.0   High 1974-06-10
#' # Datsun 710        22.8   4 108.0   High 1974-06-10
#' # Hornet 4 Drive    21.4   6 258.0   High 1974-06-10
#' # Hornet Sportabout 18.7   8 360.0    Low 1974-06-10
#' # Valiant           18.1   6 225.0    Low 1974-06-10
#' # Duster 360        14.3   8 360.0    Low 1974-06-10
#' # Merc 240D         24.4   4 146.7   High 1974-06-10
#' # Merc 230          22.8   4 140.8   High 1974-06-10
#' # Merc 280          19.2   6 167.6    Low 1974-06-10
#'
#' # Example #2: By-group Processing
#' df <- datastep(mtcars[1:10,], 
#'                keep = c("mpg", "cyl", "gear", "grp"), 
#'                by = c("gear"), sort_check = FALSE, {
#'                  
#'   if (first.)
#'     grp <- "Start"
#'   else if (last.)
#'     grp <- "End"
#'   else 
#'     grp <- "-"
#'                  
#' })
#' 
#' df
#' #                    mpg cyl gear   grp
#' # Mazda RX4         21.0   6    4 Start
#' # Mazda RX4 Wag     21.0   6    4     -
#' # Datsun 710        22.8   4    4   End
#' # Hornet 4 Drive    21.4   6    3 Start
#' # Hornet Sportabout 18.7   8    3     -
#' # Valiant           18.1   6    3     -
#' # Duster 360        14.3   8    3   End
#' # Merc 240D         24.4   4    4 Start
#' # Merc 230          22.8   4    4     -
#' # Merc 280          19.2   6    4   End 
#' 
#' # Example #3: Calculate Block
#' df <- datastep(mtcars, 
#'                keep = c("mpg", "cyl", "mean_mpg", "mpgcat"), 
#'                calculate = { mean_mpg = mean(mpg) }, {
#'                  
#'   if (mpg >= mean_mpg)
#'     mpgcat <- "High"
#'   else 
#'     mpgcat <- "Low"
#'                  
#' })
#' 
#' df[1:10,]
#' #                    mpg cyl mean_mpg mpgcat
#' # Mazda RX4         21.0   6 20.09062   High
#' # Mazda RX4 Wag     21.0   6 20.09062   High
#' # Datsun 710        22.8   4 20.09062   High
#' # Hornet 4 Drive    21.4   6 20.09062   High
#' # Hornet Sportabout 18.7   8 20.09062    Low
#' # Valiant           18.1   6 20.09062    Low
#' # Duster 360        14.3   8 20.09062    Low
#' # Merc 240D         24.4   4 20.09062   High
#' # Merc 230          22.8   4 20.09062   High
#' # Merc 280          19.2   6 20.09062    Low
#'
#' # Example #4: Data pipeline
#' library(dplyr)
#' library(magrittr)
#' 
#' # Add datastep to dplyr pipeline
#' df <- mtcars %>% 
#'   select(mpg, cyl, gear) %>% 
#'   mutate(mean_mpg = mean(mpg)) %>% 
#'   datastep({
#'     
#'     if (mpg >= mean_mpg)
#'       mpgcat <- "High"
#'     else 
#'       mpgcat <- "Low"
#'     
#'   }) %>% 
#'   filter(row_number() <= 10)
#' 
#' df
#' #     mpg cyl gear mean_mpg mpgcat
#' # 1  21.0   6    4 20.09062   High
#' # 2  21.0   6    4 20.09062   High
#' # 3  22.8   4    4 20.09062   High
#' # 4  21.4   6    3 20.09062   High
#' # 5  18.7   8    3 20.09062    Low
#' # 6  18.1   6    3 20.09062    Low
#' # 7  14.3   8    3 20.09062    Low
#' # 8  24.4   4    4 20.09062   High
#' # 9  22.8   4    4 20.09062   High
#' # 10 19.2   6    4 20.09062    Low
#' 
#' # Example #5: Drop, Retain and Rename
#' df <- datastep(mtcars[1:10, ], 
#'                drop = c("disp", "hp", "drat", "qsec", 
#'                         "vs", "am", "gear", "carb"), 
#'                retain = list(cumwt = 0 ),
#'                rename = c(mpg = "MPG", cyl = "Cylinders", wt = "Wgt", 
#'                           cumwt = "Cumulative Wgt"), {
#'                  
#'   cumwt <- cumwt + wt
#'                  
#' })
#' 
#' df
#' #                    MPG Cylinders   Wgt Cumulative Wgt
#' # Mazda RX4         21.0         6 2.620          2.620
#' # Mazda RX4 Wag     21.0         6 2.875          5.495
#' # Datsun 710        22.8         4 2.320          7.815
#' # Hornet 4 Drive    21.4         6 3.215         11.030
#' # Hornet Sportabout 18.7         8 3.440         14.470
#' # Valiant           18.1         6 3.460         17.930
#' # Duster 360        14.3         8 3.570         21.500
#' # Merc 240D         24.4         4 3.190         24.690
#' # Merc 230          22.8         4 3.150         27.840
#' # Merc 280          19.2         6 3.440         31.280
#' 
#' # Example #6: Attributes and Arrays
#' 
#' # Create sample data
#' dat <- read.table(header = TRUE, text = '
#'    Year  Q1   Q2  Q3  Q4
#'    2000 125  137 152 140
#'    2001 132  145 138  87
#'    2002 101  104 115 121')
#'  
#' # Use attrib list to control column order and add labels
#' # Use array to calculate row sums and means, and get best quarter
#' df <- datastep(dat,
#'                attrib = list(Tot = dsattr(0, label = "Year Total"),
#'                              Avg = dsattr(0, label = "Year Average"),
#'                              Best = dsattr(0, label = "Best Quarter")),
#'                arrays = list(qtrs = dsarray("Q1", "Q2", "Q3", "Q4")),
#'                drop = "q",
#'                steps = {
#'                
#'                  # Empty brackets return all array values
#'                  Tot <- sum(qtrs[])
#'                  Avg <- mean(qtrs[])
#'                  
#'                  # Iterate to find best quarter
#'                  for (q in qtrs) {
#'                    if (qtrs[q] == max(qtrs[]))
#'                      Best <- q
#'                  }
#'                })
#'                
#' df
#' #   Year  Q1  Q2  Q3  Q4 Tot    Avg Best
#' # 1 2000 125 137 152 140 554 138.50   Q3
#' # 2 2001 132 145 138  87 502 125.50   Q2
#' # 3 2002 101 104 115 121 441 110.25   Q4
#' 
#' dictionary(df)
#' #   A tibble: 8 x 10
#' #   Name  Column Class     Label        Description Format Width Justify  Rows   NAs
#' #   <chr> <chr>  <chr>     <chr>        <chr>       <lgl>  <int> <chr>   <int> <int>
#' # 1 df    Year   integer   NA           NA          NA        NA NA          3     0
#' # 2 df    Q1     integer   NA           NA          NA        NA NA          3     0
#' # 3 df    Q2     integer   NA           NA          NA        NA NA          3     0
#' # 4 df    Q3     integer   NA           NA          NA        NA NA          3     0
#' # 5 df    Q4     integer   NA           NA          NA        NA NA          3     0
#' # 6 df    Tot    integer   Year Total   NA          NA        NA NA          3     0
#' # 7 df    Avg    numeric   Year Average NA          NA        NA NA          3     0
#' # 8 df    Best   character Best Quarter NA          NA         2 NA          3     0
#' 
#' # Example #7: Set and Merge Operations
#' 
#' # Create sample data
#' grp1 <- read.table(header = TRUE, text = '
#'   GROUP  NAME
#'   G01  Group1
#'   G02  Group2
#' ', stringsAsFactors = FALSE)
#' 
#' grp2 <- read.table(header = TRUE, text = '
#'   GROUP  NAME
#'   G03  Group3
#'   G04  Group4
#' ', stringsAsFactors = FALSE)
#'   
#' dat <- read.table(header = TRUE, text = '
#'   ID AGE SEX GROUP
#'   A01 58 F    G01
#'   A02 20 M    G02
#'   A03 47 F    G05
#'   A04 11 M    G03
#'   A05 23 F    G01
#' ', stringsAsFactors = FALSE)
#' 
#' # Set operation
#' grps <- datastep(grp1, set = grp2, {})
#' grps
#' #   GROUP   NAME
#' # 1   G01 Group1
#' # 2   G02 Group2
#' # 3   G03 Group3
#' # 4   G04 Group4
#' 
#' # Merge operation - Outer Join
#' res <- datastep(dat, merge = grps, 
#'                 merge_by = "GROUP", 
#'                 merge_in = c("inA", "inB"), {})
#'                 
#' # View results
#' res
#' #     ID AGE  SEX GROUP   NAME inA inB
#' # 1  A01  58    F   G01 Group1   1   1
#' # 2  A05  23    F   G01 Group1   1   1
#' # 3  A02  20    M   G02 Group2   1   1
#' # 4  A04  11    M   G03 Group3   1   1
#' # 5  A03  47    F   G05   <NA>   1   0
#' # 6 <NA>  NA <NA>   G04 Group4   0   1
#' 
#' # Merge operation - Inner Join
#' res <- datastep(dat, merge = grps, 
#'                 merge_by = "GROUP", 
#'                 merge_in = c("inA", "inB"), 
#'                 where = expression(inA & inB), {})
#'                 
#' # View results
#' res
#' #     ID AGE  SEX GROUP   NAME inA inB
#' # 1  A01  58    F   G01 Group1   1   1
#' # 2  A05  23    F   G01 Group1   1   1
#' # 3  A02  20    M   G02 Group2   1   1
#' # 4  A04  11    M   G03 Group3   1   1
#' @import dplyr
#' @export
datastep <- function(data, steps, keep = NULL,
                     drop = NULL, rename = NULL,
                     by = NULL, calculate = NULL,
                     retain = NULL, attrib = NULL,
                     arrays = NULL,
                     sort_check = TRUE,
                     format = NULL,
                     label = NULL,
                     where = NULL, 
                     set = NULL,
                     merge = NULL,
                     merge_by = NULL,
                     merge_in = NULL, 
                     log = TRUE) {
  
  if (!"data.frame" %in% class(data))
    stop("input data must be inherited from data.frame")
  
  
  if (!is.null(retain)) {
    if (!"list" %in% class(retain))
      stop("retain parameter value must be of class 'list'")
    
  }
  
  if (!is.null(attrib)) {
    if (!"list" %in% class(attrib))
      stop("attrib parameter value must be of class 'list'")
    
  }
  
  if (!is.null(arrays)) {
    if (!"list" %in% class(arrays))
      stop("arrays parameter value must be of class 'list'")
    
  }
  
  # Deal with single value unquoted parameter values
  oby <- deparse(substitute(by, env = environment()))
  by <- tryCatch({if (typeof(by) %in% c("character", "NULL")) by else oby},
                 error = function(cond) {oby})
  
  odrop <- deparse(substitute(drop, env = environment()))
  drop <- tryCatch({if (typeof(drop) %in% c("character", "NULL")) drop else odrop},
                 error = function(cond) {odrop})
  
  okeep <- deparse(substitute(keep, env = environment()))
  keep <- tryCatch({if (typeof(keep) %in% c("character", "NULL")) keep else okeep},
                 error = function(cond) {okeep})
  
  omby <- deparse(substitute(merge_by, env = environment()))
  merge_by <- tryCatch({if (typeof(merge_by) %in% c("character", "NULL")) merge_by else omby},
                   error = function(cond) {omby})
  
  
  # Capture number of starting columns
  startcols <- ncol(data)
  
  # Put code in a variable for safe-keeping
  code <- substitute(steps, env = environment())
  
  # Determine if there is an output function
  hout <- has_output(deparse(code))
  
  # Give warning if there are no rows and no output()
  if (hout == FALSE & nrow(data) == 0) {
    warning("Input dataset has no rows.") 
  }
  
  # Save off incoming dataset class
  dataclass <- class(data)
  
  # Deal with set parameter
  if (!is.null(set)) {
    data <- perform_set(data, set)
  }
  
  # Deal with merge parameter
  if (!is.null(merge)) {
    data <- perform_merge(data, merge, merge_by, merge_in) 
  }
  
  # Clear output list
  e$output <- list()
  
  # Put aggregate functions in a variable 
  agg <- substitute(calculate, env = environment())
  
  # Execute aggregate functions
  if (paste0(deparse(agg), collapse = "") != "NULL") {
    data <- within(data, eval(agg), keepAttrs = TRUE)
  }
  
  # Apply variable attributes
  if (!is.null(attrib)) {
    for (nm in names(attrib)) { 
      if ("dsattr" %in% class(attrib[[nm]])) {
        
        # If the attrib is a dsattr
        if (!nm %in% names(data)) {
          data[[nm]] <- attrib[[nm]][["default"]]
        }
        for (at in names(attrib[[nm]])) {
          
          if (at != "class")
            attr(data[[nm]], at) <-  attrib[[nm]][[at]]
          
        }
      } else {
        
        # If the attrib is not a dsattr, use as default value
        if (!nm %in% names(data)) {
          data[[nm]] <- attrib[[nm]]
        }
        
        
      }
    }
  }
  
  # Assign arrays to variables in this environment
  # Otherwise they won't be accessible from datastep code
  if (!is.null(arrays)) {
    for (nm in names(arrays)) {
      
      assign(nm, arrays[[nm]]) 
      
    }
  }
  
  ret <- list()
  firstval <- NULL
  firstvals <- list()
  rowcount <- nrow(data)
  orig_class <- class(data)
  
  # Set by if data is a grouped tibble
  if (is.null(by) && "grouped_df" %in% class(data)) {
    if (!is.null(attr(data, "groups"))) {
      grpdf <- attr(data, "groups")
      nms <- names(grpdf)
      if (!is.null(nms)) {
        nms <- nms[nms != ".rows"]
        if (length(nms) > 0) {
          by <- nms
          
        }
      }
    }
  }
  
  # Save off any attributes
  data_attributes <- data[1, , drop = FALSE]
  
  # Tibble subset will keep attributes, but data.frame will not
  if (!"tbl_df" %in% class(data)) {
    data_attributes <- copy_attributes(data, data_attributes)
    
  }
  if (!is.null(format)) {
    data_attributes <- assign_attributes(data_attributes, format, "format")
  }
  
  # For some reason the grouped tibble kills performance.
  # Temporarily convert to a data frame.  
  # Seriously like 20X performance increase.
  if (any("grouped_df" == class(data)))
    data <- as.data.frame(data, stringsAsFactors = FALSE)
  
  # data.table is not that bad, but data.frame is better.
  if (any("data.table" == class(data)))
    data <- as.data.frame(data, stringsAsFactors = FALSE)
  
  # Strip any crazy classes, as they can mess up datastep functions
  data_classes <- class(data)
  if (any(!class(data) %in% c("data.frame", "list"))) {
    data <- as.data.frame(unclass(data), stringsAsFactors = FALSE, 
                          check.names = FALSE) 
  }
  
  # Add automatic variables
  data <- add_autos(data, by, sort_check)
  
  # Increase rowcount if needed
  if (nrow(data) > rowcount) {
    rowcount <- nrow(data)
  }
  
  # If there is no code to step through
  if (length(as.character(code)) == 1) {
    
    # Just set original dataset
    ret <- data
    
  } else {
    
    # Sometimes the local environment cannot access the parent frame.
    # If this happens, transfer any variables from the parent frame 
    # to the local frame.
    lf <- sys.frame(sys.nframe())
    pf <- parent.frame()
    pnms <- ls(pf)
    lnms <- ls(lf)
    for (pnm in pnms) {
      if (!pnm %in% lnms) {
        #assign(pnm, pf[[pnm]], envir = lf)
        lf[[pnm]] <- pf[[pnm]]
      }
    }

    # Step through row by row
    for (n. in seq_len(rowcount)) {
      
      # Subset by row
      rw <- data[n., , drop = FALSE]
      
      # Put back any attributes dropped during row subset
      rw <- copy_attributes(data_attributes, rw)
      
      
      
      # Deal with retained variables
      if (!is.null(retain)) {
        if (length(ret) == 0) {
          for (nm in names(retain)) {
            
            # Populate with initial value
            rw[[nm]] <- retain[[nm]]
            
          }
          
        } else {
          for (nm in names(retain)) {
            
            # Populate with value from previous row   
            #data[n., nm] <- ret[n. - 1, nm]  way backup
            
            rw[[nm]] <- ret[[n. - 1]][[nm]] # current
            
            
          }
        }
      }
      
      
      # Evaluate the code for the row
      # ret[[n.]]  <-  within(rw, eval(code), 
      #                       keepAttrs = TRUE)

      # Evaluate the code for the row
      ret[[n.]] <- within(rw, eval(code, enclos = lf))

    }
    
    # Bind all rows
    if (hout) {
      ret <- bind_rows(e$output, .id = "column_label")
      
    } else {
      ret <- bind_rows(ret, .id = "column_label")
    }
    ret["column_label"] <- NULL
    

  }
  
  
  
  # Delete
  if ("..delete" %in% names(ret)) {
    ret <- tryCatch({subset(ret, ret[["..delete"]] == FALSE)},
                    error = function(cond){ret})
  }
  
  # Where Before
  if (!is.null(where)) {
    ret <- tryCatch({subset(ret, eval(where))},
                    error = function(cond){ret})
  }
  
  # Improve column order
  rtnms <- rev(names(ret))
  orgnms <- names(data)
  ret <- ret[ ,c(orgnms, rtnms[!rtnms %in% orgnms])]
  
  # Remove automatic variables
  ret <- remove_autos(ret, by)
  
  # Perform drop operation
  if (!is.null(drop)) {
    
    if (!all(drop %in% names(ret))) {
      
      message("Drop parameter '" %p%  drop[!drop %in% names(ret)] %p% 
                "' not found on output dataset.")
      
      drop <-  drop[drop %in% names(ret)] 
    }
    
    ret <- ret[ , !names(ret) %in% drop, drop = FALSE]
    
  }
  
  # Perform keep operation
  if (!is.null(keep)) {
    if (!all(keep %in% names(ret))) {
      
      message("Keep parameter '" %p%  keep[!keep %in% names(ret)] %p% 
              "' not found on output dataset.")
      
      keep <-  keep[keep %in% names(ret)] 
    }
    
    ret <- ret[ , keep, drop = FALSE]
  }
  
  
  # Convert back to tibble if original was a tibble
  if ("tbl_df" %in% orig_class & !"tbl_df" %in% class(ret)) {
    ret <- as_tibble(ret, .name_repair = "minimal")
  }
  
  # Put back grouping attributes if original data was grouped
  if (!is.null(by) & "grouped_df" %in% orig_class) {
    
    if (all(by %in% names(ret)))
      ret <- group_by(ret, across({{by}})) 
    
  }
  
  # Convert back to data.table if original was a data.table
  if ("data.table" %in% orig_class & !"data.table" %in% class(ret)) {
    ret <- data.table::as.data.table(ret)
  }
  
  # Restore any stripped classes
  if (any(!data_classes %in% class(ret))) {
    
    class(ret) <- data_classes
  }
  
  # Restore attributes from original data 
  ret <- copy_attributes(data_attributes, ret)
  
  
  # Perform rename operation
  if (!is.null(rename)) {
    nms <- names(ret)
    names(ret) <- ifelse(nms %in% names(rename), rename[nms], nms)
  }
  
  # Where After
  if (!is.null(where)) {
    ret <- tryCatch({subset(ret, eval(where))},
                    error = function(cond){ret})
    
    # Restore attributes from original data 
    ret <- copy_attributes(data_attributes, ret)
  }
  
  # Labels
  if (!is.null(label)) {
    ret <- assign_attributes(ret, label, "label")
  }
  
  # Formatting
  if (!is.null(format)) {
    ret <- assign_attributes(ret, format, "format")
  }
  
  # Clear out rownames
  rownames(ret) <- NULL
  
  endcols <- ncol(ret)
  if (startcols > endcols)
    log_logr(paste0("datastep: columns decreased from ", startcols, " to ", 
                    endcols))
  else if (startcols < endcols)
    log_logr(paste0("datastep: columns increased from ", startcols, " to ", 
                    endcols))
  else 
    log_logr(paste0("datastep: columns started with ", startcols, 
                    " and ended with ", endcols))
  
  if (log_output() & log) {
    log_logr(ret)
    print(ret)
  }
  
  return(ret)
}



#' @title Removes an observation from a datastep
#' @description The \code{delete} function will remove an observation
#' from the output of a datastep.  The function takes no parameters.  To use 
#' the function, simply call it on the rows you want to delete.  Typically
#' it is called within a conditional.
#' @return Observation is marked with a delete flag.  No return value.
#' @export
#'
#' @family datastep
#' @examples
#' #' # Remove all cars that are not 4 cylinder
#' df <- datastep(mtcars, 
#'                keep = c("mpg", "cyl", "disp"), {
#'                  
#'   if (cyl != 4)
#'     delete()
#'                  
#' })
#' 
#' df
#' #     mpg cyl  disp
#' # 1  22.8   4 108.0
#' # 2  24.4   4 146.7
#' # 3  22.8   4 140.8
#' # 4  32.4   4  78.7
#' # 5  30.4   4  75.7
#' # 6  33.9   4  71.1
#' # 7  21.5   4 120.1
#' # 8  27.3   4  79.0
#' # 9  26.0   4 120.3
#' # 10 30.4   4  95.1
#' # 11 21.4   4 121.0
delete <- function() {
  
  # Parent frame hold environment with ds row
  pf <- parent.frame()
  
  # Set by reference
  pf$..delete <- TRUE
  
  
}


#' @title Outputs an observation from a datastep
#' @description The \code{output} function will output an observation
#' from a datastep.  The function takes no parameters.  To use 
#' the function, simply call it on the rows you want to output.  Typically
#' it is called within a conditional.  The output function is interesting
#' in that you can output multiple rows for the same input observation.
#' @return Observation is marked with a output flag.  No return value.
#' @export
#'
#' @family datastep
#' @examples
#' #' # Example 1: Output all cars that are 4 cylinder 
#' df <- datastep(mtcars, 
#'                keep = c("mpg", "cyl", "disp"), {
#'                  
#'   if (cyl == 4)
#'     output()
#'                  
#' })
#' 
#' df
#' #     mpg cyl  disp
#' # 1  22.8   4 108.0
#' # 2  24.4   4 146.7
#' # 3  22.8   4 140.8
#' # 4  32.4   4  78.7
#' # 5  30.4   4  75.7
#' # 6  33.9   4  71.1
#' # 7  21.5   4 120.1
#' # 8  27.3   4  79.0
#' # 9  26.0   4 120.3
#' # 10 30.4   4  95.1
#' # 11 21.4   4 121.0
#' 
#' # Example 2: Output two rows for each 6 cylinder car
#' 
#' # Prepare sample data
#' dat <- data.frame(name = rownames(mtcars), mtcars, stringsAsFactors = FALSE)
#' 
#' # Perform datastep
#' df <- datastep(dat, 
#'                keep = c("name", "mpg", "cyl", "disp", "seq"), {
#'                  
#'   if (cyl == 6) {
#'     seq <- 1
#'     output()
#'     seq <- 2
#'     output()
#'   }
#'                  
#' })
#' 
#' df
#' #              name  mpg cyl  disp seq
#' # 1       Mazda RX4 21.0   6 160.0   1
#' # 2       Mazda RX4 21.0   6 160.0   2
#' # 3   Mazda RX4 Wag 21.0   6 160.0   1
#' # 4   Mazda RX4 Wag 21.0   6 160.0   2
#' # 5  Hornet 4 Drive 21.4   6 258.0   1
#' # 6  Hornet 4 Drive 21.4   6 258.0   2
#' # 7         Valiant 18.1   6 225.0   1
#' # 8         Valiant 18.1   6 225.0   2
#' # 9        Merc 280 19.2   6 167.6   1
#' # 10       Merc 280 19.2   6 167.6   2
#' # 11      Merc 280C 17.8   6 167.6   1
#' # 12      Merc 280C 17.8   6 167.6   2
#' # 13   Ferrari Dino 19.7   6 145.0   1
#' # 14   Ferrari Dino 19.7   6 145.0   2
#' 
#' # Example 3: Create data frame using output() functions
#' df <- datastep(data.frame(), {
#' 
#'   # Row 1
#'   COL1 <- 1
#'   COL2 <- "One"
#'   output()
#'   
#'   # Row 2
#'   COL1 <- 2
#'   COL2 <- "Two"
#'   output()
#' 
#' })
#' 
#' df
#' #   COL1 COL2
#' # 1    1  One
#' # 2    2  Two
output <- function() {
  
  #browser()
  # Parent frame hold row
  pf <- parent.frame()
  
  # Convert to list so it can be converted to a data frame
  nlst <- as.list(pf)
  nlst[["..delete"]] <- pf$..delete
  
  # Convert to data frame and append to output list
  e$output[[length(e$output) + 1]] <- as.data.frame(nlst, 
                                                    stringsAsFactors = FALSE, 
                                                    make.names = FALSE,
                                                    optional = FALSE,
                                                    check.names = FALSE)

  
}

# Utilities ---------------------------------------------------------------


assign_attribute_list <- function(df, lst) {
  
  ret <- df
  
  anms <- names(lst)
  
  for (nm in names(ret)) {
     
    if (nm %in% anms) {
      
      att <- lst[[nm]] 
      for (at in names(att)) {
        
        # Don't break factors
        if ("factor" %in% class(ret[[nm]]) & at == "levels") {
          
          if (length(att[[at]]) ==  length(attr(ret[[nm]], at)))
            attr(ret[[nm]], at) <- att[[at]]
          
        } else {
          
          attr(ret[[nm]], at) <- att[[at]]
        }
      
      }
    }
  }
   
  return(ret)
}

# Collects column attributes into a list,
# preserving any attributes already in the list.
collect_attributes <- function(alst, df, idcols, sfx) {
 
  if (is.null(alst))
    ret <- list()
  else 
    ret <- alst
  
  anms <- names(ret)
  
  for (nm in names(df)) {
    
    if (!nm %in% anms) {
      
     ret[[nm]] <- attributes(df[[nm]]) 
    } else if (!nm %in% idcols) {
      
      # If name already exists, add suffixes
      nnm <- paste0(nm, sfx[1])
      tnm <- paste0(nm, sfx[2])
      ret[[nnm]] <- ret[[nm]]
      ret[[tnm]] <- attributes(df[[nm]]) 
      ret[[nm]] <- NULL
    }
    
  }
  
  return(ret)
  
}


# Copy column attributes from one df to another.
# Used during datastep operations to restore attributes
# lost when using Base R functions. 
#' @noRd
copy_attributes <- function(df1, df2) {
  
  ret <- df2
  
  for (nm in names(df2)) {

    att <- attributes(df1[[nm]])
    if (!is.null(att))
      attributes(ret[[nm]]) <- att
    
    # col <- df1[[nm]]
    # for (at in names(attributes(col))) {
    #   
    #   attr(ret[[nm]], at) <- attr(col, at)
    #   
    # }
    
  }
  
  return(ret)
}

copy_attributes_sp <- function(df1, df2) {
  
  ret <- df2
  
  for (nm in names(df2)) {
    
    col <- df1[[nm]]
    if (!is.null(col)) {
      for (at in names(attributes(col))) {
        
        if (!at %in% c("levels")) { 
  
          attr(ret[[nm]], at) <- attr(col, at)
        }
  
      }
    }
    
  }
  
  return(ret)
}


# Copies attributes on data frame from one df to another
# Skips rownames and names, which can cause trouble.
copy_df_attributes <- function(src, trgt) {
  
  atts <- attributes(src)
  
  ret <- trgt
  
  for (anm in names(atts)) {
    
    if (!anm %in% c("names", "row.names")) { 
      attr(ret, anm) <- atts[[anm]] 
    }
  }
  
  return(ret)
}

# General function to assign column attributes to 
# a data frame.  Can assign basically any attributes
# like labels or formats or whatever.  Used to apply
# attributes assigned on the function call.
assign_attributes <- function(df, alst, attr) {
  
  nmsdf <- names(df)
  nmslst <- names(alst)
  
  ret <- df
  
  for (nm in nmslst) {
    
    if (nm %in% nmsdf) { 
    
      attr(ret[[nm]], attr) <- alst[[nm]] 
    }
  }
  
  return(ret)
  
}

# Test to see whether the code has an output statement.
# Will change how the datastep is conducted.
has_output <- function(codestr) {
 
  
  ret <- FALSE
  
  opos <- grepl("output()", codestr, fixed = TRUE)
  
  if (any(opos == TRUE))
    ret <- TRUE
  
  return(ret)

}

# Perform the set operation.  Works on main
# dataset plus one or more datasets.
perform_set <- function(dta, stdta) {
  
  # Save off class
  dtacls <- class(dta)
  
  # Work with pure data frames.
  # Tibbles will mess with names.
  dta <- as.data.frame(dta)
  
  # Put in list
  if ("data.frame" %in% class(stdta))
    dtalst <- list(stdta)
  else
    dtalst <- stdta
  
  # Collect Names
  fnms <- names(dta)
  
  # Assign counter to ensure stacking
  dta[["..ds"]] <- 0
  
  ret <- dta
  
  # Stack datasets
  for (i in seq_len(length(dtalst))){
    
    tmp <- as.data.frame(dtalst[[i]])
    nnms <- names(tmp)
    fnms <- c(fnms, nnms[!nnms %in% fnms])
    tmp[["..ds"]] <- i
    ret <- merge(ret, tmp, all = TRUE, sort = FALSE) 
    
  }
  
  # Clean up counter
  ret[["..ds"]] <- NULL
  dta[["..ds"]] <- NULL
  
  # Rename so first dataset drives naming
  # Can easily break if name has been changed.
  ret <- tryCatch({ret[ , fnms]}, error = function(cond){ret})
  
  # Restore attributes
  ret <- copy_attributes_sp(dta, ret)
  ret <- copy_df_attributes(dta, ret)
  
  # Restore original class
  class(ret) <- dtacls
  
  return(ret)
  
}

# Perform merge operation.  Works on one or more datasets.
perform_merge <- function(dta, mrgdta, mrgby, mrgin) {
  
  # Save off class
  dtacls <- class(dta)
  
  # Work with pure data frames.
  # Tibbles will mess with names.
  dta <- as.data.frame(dta)
  
  # Put in list
  if ("data.frame" %in% class(mrgdta))
    dtalst <- list(mrgdta)
  else
    dtalst <- mrgdta
  
  ret <- dta
  
  # Capture names for sorting
  fnms <- names(dta)
  
  xnms <- names(mrgby)
  ynms <- mrgby 
  names(ynms) <- NULL
  if (is.null(xnms)) {
    xnms <- ynms
    ynms <- NULL 
  } else {
    
    if (length(ynms) != length(xnms)) {
      xnms <- ynms
      ynms <- NULL
      
    }
  }
  
  if (!is.null(mrgin)) {
    
    ret[[mrgin[1]]] <- 1

  }
  
  # Initialize attribute list with left df
  alst <- collect_attributes(NULL, ret, mrgby, c())
  
  # Merge datasets
  for (i in seq_len(length(dtalst))){
    
    tmp <- as.data.frame(dtalst[[i]])
    
    # Create suffixes (if needed)
    sfx <- c("." %p% i, "." %p% (i + 1))
    
    # Construct name list from original dfs
    fnms <- fix_names(fnms, names(tmp), mrgby, sfx)
    
    # Collect attributes from right df
    alst <- collect_attributes(alst, tmp, mrgby, sfx)
    
    # Add in variables
    if (!is.null(mrgin)) {
      if (!is.na(mrgin[i + 1])) {
       
        tmp[[mrgin[i + 1]]] <- 1 
      }
    }
    
    # Bail if merge columns are not in source df
    if (!all(xnms %in% names(ret))) {
      stop("Merge column name '", xnms[!xnms %in% names(ret)],
           "' not found in left dataset.")

    }
    
    # If there is no merge_by, just append to the end
    # Otherwise, perform the join.
    if (is.null(mrgby)) {
      
      # Deal with mismatched number of rows
      if (nrow(ret) > nrow(tmp))
        ret <- cbind(ret, fill_missing(tmp, nrow(ret)))
      else if (nrow(tmp) > nrow(ret))
        ret <- cbind(fill_missing(ret, nrow(tmp)), tmp)
      else 
        ret <- cbind(ret, tmp)
      
      # Assign corrected names
      names(ret) <- fnms
      
    } else {
    
      
      if (is.null(ynms)) {
        # When merge by column names are the same
        ret <- merge(ret, tmp, by = xnms, suffix = sfx,
                     all = TRUE,
                     sort = FALSE) 
      } else {
        
        if (!all(ynms %in% names(tmp))) {
          stop("Merge column name '", ynms[!ynms %in% names(tmp)], 
               "' not found in right dataset.")
        }
  
        # When merge column names are different
        ret <- merge(ret, tmp, by.x = xnms, by.y = ynms, suffix = sfx,
                     all = TRUE,
                     sort = FALSE) 
        
      }
      
    }
    
  }
  
  # Fill zero for non-matches
  if (!is.null(mrgin)) {
    for (nm in mrgin) { 
    
      ret[[nm]] <- ifelse(is.na(ret[[nm]]), 0, ret[[nm]]) 
    }
  }
  
  # Reorder columns in a sensible way, if possible.
  # This breaks easily is names are missing/changed, 
  # so wrap in tryCatch.
  ret <- tryCatch({ 
    if (!is.null(mrgin))
      ret <- ret[ , c(fnms, mrgin)]
    else
      ret <- ret[ , fnms]
    
    ret
  }, error = function(cond) {
    
    ret 
  })
  
  # Restore attributes
  ret <- assign_attribute_list(ret, alst)
  ret <- copy_df_attributes(dta, ret)
  
  # Restore original class
  class(ret) <- dtacls 
  
  return(ret)
  
}


# A function to perform naming for merged datasets.
# This will keep the preferred column order and append
# indexes for repeated column names.
fix_names <- function(nms1, nms2, keys, sfxs) {
  
  if (is.null(keys))
    keys <- ""
  
  ret <- c()
  for (i in seq_along(nms1)) {
    if (nms1[i] %in% keys) 
      ret[i] <- nms1[i]
    else if (nms1[i] %in% nms2)
      ret[i] <- paste0(nms1[i], sfxs[1])
    else 
      ret[i] <- nms1[i]
  }
  
  for (i in seq_along(nms2)) {
    
    if (!nms2[i] %in% keys) {
        
      if (nms2[i] %in% nms1)
        ret[length(ret) + 1] <- paste0(nms2[i], sfxs[2])
      else 
        ret[length(ret) + 1] <- nms2[i]
    }
  }
  
  return(ret)
}

# Fill in missing rows on a dataset. Takes
# a dataset and a number of rows for the desired row count.
# This is used when cbinding to make sure the datasets
# are the same number of rows.
fill_missing <- function(ds, num) {
  
  if (num > nrow(ds)) {
    nas <- rep(NA, num - nrow(ds))
    nw <- list()
    
    for (nm in names(ds)) {
      
      nw[[nm]] <- nas
    }
    
    dfn <- as.data.frame(nw, stringsAsFactors = FALSE)
    
    ret <- rbind(ds, nw)
  
  } else {
    ret <- ds 
  }
  
  
  return(ret)
  
}
dbosak01/libr documentation built on March 19, 2024, 5:50 p.m.