cv_abm: Estimate and Test an ABM
In JohnNay/eat: Empirical Agent Training Software for Data-Driven Modeling


cv_abm uses cross-validation to test an ABM's predictive power.

cv_abm(data, features, Formula, agg_patterns, abm_simulate, abm_vars, iters,
  tseries_len, tp = rep(tseries_len, nrow(agg_patterns)),
  package = c("caretglm", "caretglmnet", "glm", "caretnnet", "caretdnn"),
  sampling = FALSE, sampling_size = 1000, STAT = c("mean", "median"),
  saving = FALSE, filename = NULL, abm_optim = c("GA", "DE"),
  validate = c("lgocv", "cv"), folds = ifelse(validate == "lgocv",
  max(data$group), 10), drop_nzv = FALSE, verbose = TRUE,
  predict_test_par = FALSE, optimize_abm_par = FALSE,
  parallel_training = FALSE)

`data`	`data.frame` with each row (observational unit) being an individual decision. It must have a column named "group" specifying which group of `agg_patterns` each observation is in, and a column named "period" specifying the time period in which each behavior was taken.
`features`	`list` of the variables (columns in `data`) to be used in the prediction `Formula`. As many elements in the `list` as we want discrete models for different times. Each element of the `list` is a `character vector`, with each element of the `character vector` being a feature to use for training an individual-level model.
`Formula`	`list` where each element is a length one character vector that specifies a formula, e.g. `"y ~ x"`. The character vector makes sense in the context of the `features` and `data`. There are as many elements in the list as there are discrete models for different times.
`agg_patterns`	`data.frame` with rows (observational unit) being the group and columns: (a.) those aggregate level variables needed for the prediction with the specified `Formula` (with same names as the variables in the formula); (b.) a column named "action" with the proportion of the relevant outcome action taken in that group; (c.) columns named `paste(seq(tseries_len))` with the mean/median levels (`STAT`) of the action for each time period.
`abm_simulate`	function with these arguments: `model, features, parameters, tuning_parameters, iterations, time_len, STAT = c("mean", "median")`, where `model` is the output of `training`. The function returns a list with three named elements: `dynamics, action_avg, simdata`, where `dynamics` is a numeric vector of length `tseries_len`, `action_avg` is a numeric vector of length one, and `simdata` is a `data.frame` with the numeric results of the simulation.
`abm_vars`	a list with either (1.) a numeric vector named "lower" AND a numeric vector named "upper" each the length of the number of tuning_params of ABM (the names of the elements of these vecs should be the names of the variables and they should be in the same order that the `abm_simulate` function uses them); or (2.) a numeric vector named "value" the length of the number of tuning_params of the ABM (variables should be in the same order that the `abm_simulate` function uses them). Either provide lower and upper elements of the list or provide a value element of the list.
`iters`	numeric vector length one specifying number of iterations to simulate ABM for.
`tseries_len`	numeric vector length one specifying maximum number of time periods to use for model training and testing. If some groups have fewer than the maximum, you need to provide a vector to the `tp` argument.
`tp`	optional numeric vector length number of rows of `agg_patterns` specifying how long the time series for each group should be. Default is `rep(tseries_len, nrow(agg_patterns))`.
`package`	optional character vector length one, default is `c("caretglm", "caretglmnet", "glm", "caretnnet", "caretdnn")`.
`sampling`	optional logical vector length one, default is `FALSE`. If `sampling == TRUE`, we sample equal numbers of observations from each 'group' to reduce potential problems with the final estimated model being too affected by groups with more observations.
`sampling_size`	optional numeric vector length one specifying how many observations from each group that `training` should sample to train the model, default is 1000. Only applicable when `sampling` argument is set to `TRUE`.
`STAT`	optional character vector length one, default is `c("mean", "median")`.
`saving`	optional logical vector length one, default is `FALSE`.
`filename`	optional character vector length one, default is `NULL`.
`abm_optim`	optional character vector length one, default is `c("GA", "DE")`.
`validate`	optional character vector length one, default is `c("lgocv", "cv")`.
`folds`	optional numeric vector length one, default is `ifelse(validate == "lgocv", max(data$group), 10)`.
`drop_nzv`	optional logical vector length one, default is `FALSE`.
`verbose`	optional logical vector length one, default is `TRUE`.
`predict_test_par`	optional logical vector length one, default is `FALSE`. If you are getting any errors with this function, make sure arguments like this one are set to `FALSE`, because debugging in parallel is much harder.
`optimize_abm_par`	optional logical vector length one, default is `FALSE`. This is passed to the optimization algorithm.
`parallel_training`	optional logical vector length one, default is `FALSE`. This is passed to `training`.
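
The shapes described for `agg_patterns` and `tp` can be illustrated with a minimal base R sketch. It assumes the per-period columns are plain per-group means of the action (the `STAT = "mean"` case) and that `period` is an integer index starting at 1. The toy data frame `d` and the helper object `per_period` are hypothetical names, and this is not the package's own `period_vec_create` helper. Whether `cv_abm` accepts `NA` for periods a group never reaches is not confirmed here.

```r
# Toy individual-level data: group 1 has 3 periods, group 2 has only 2.
d <- data.frame(
  group  = c(rep(1, 6), rep(2, 4)),
  period = c(rep(1:3, each = 2), rep(1:2, each = 2)),
  action = c(1, 0, 1, 1, 0, 0, 0, 1, 1, 1)
)
tseries_len <- max(d$period)

# Per-group, per-period mean action; periods a group never reaches stay NA.
per_period <- tapply(
  d$action,
  list(d$group, factor(d$period, levels = seq(tseries_len))),
  mean
)

# One row per group: group id, overall proportion of the action, and one
# column per period named "1", "2", ... as paste(seq(tseries_len)) would give.
agg_patterns <- data.frame(
  group  = as.numeric(rownames(per_period)),
  action = as.vector(tapply(d$action, d$group, mean)),
  per_period,
  check.names = FALSE
)

# Length of each group's time series, one entry per row of agg_patterns.
tp <- as.vector(tapply(d$period, d$group, max))

print(agg_patterns)
print(tp)
```

Keeping `check.names = FALSE` matters in this sketch because `data.frame` would otherwise rewrite the period column names `"1"`, `"2"`, `"3"` into syntactic names.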

The function returns an S4 object. See cv_abm for the details of the slots (objects) that this type of object will have.
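
The `validate = "lgocv"` option, whose default `folds` is `max(data$group)`, presumably means leave-group-out cross-validation: each group is held out once, a model is trained on the remaining groups, and its prediction is compared with the held-out group's aggregate outcome. The following is a minimal sketch of that idea under stated assumptions. It uses a plain `glm` and skips the ABM simulation step entirely, so it is not the package's implementation; the toy data `d` and the name `held_out_error` are hypothetical.

```r
# Toy individual-level data with three groups of six decisions each.
d <- data.frame(
  group  = rep(1:3, each = 6),
  x      = rep(c(0, 0, 0, 1, 1, 1), times = 3),
  action = c(0, 0, 1, 1, 1, 0,
             0, 1, 0, 1, 1, 1,
             0, 0, 0, 1, 0, 1)
)

groups <- sort(unique(d$group))

# For each group: train on all other groups, predict the held-out group,
# and compare the predicted action rate with the observed one.
held_out_error <- sapply(groups, function(g) {
  train <- d[d$group != g, ]
  test  <- d[d$group == g, ]
  fit   <- glm(action ~ x, data = train, family = binomial())
  predicted_rate <- mean(predict(fit, newdata = test, type = "response"))
  abs(predicted_rate - mean(test$action))
})

names(held_out_error) <- paste0("group_", groups)
print(held_out_error)
```

With this scheme the number of folds equals the number of groups, which matches the default for `folds` when `validate = "lgocv"`.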

Returns an S4 object of class `cv_abm`, with slots for call = "language", predicted_patterns = "list", timing = "numeric", and diagnostics = "character".
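
S4 slots are read with the `@` operator or `slot()`. As a hedged illustration that runs without the package, the sketch below defines a stand-in class, `cv_abm_demo` (a hypothetical name), with the same slot names and types as listed above. All slot values are placeholders, not real output of `cv_abm`.

```r
library(methods)

# Stand-in class mirroring the documented slot names and types.
setClass("cv_abm_demo",
         representation(call = "language",
                        predicted_patterns = "list",
                        timing = "numeric",
                        diagnostics = "character"))

# Placeholder values only; a real object would come from cv_abm().
demo <- new("cv_abm_demo",
            call = quote(cv_abm(data, features, Formula, agg_patterns)),
            predicted_patterns = list(c(0.5, 0.6)),
            timing = 12.3,
            diagnostics = "illustrative only")

slotNames(demo)            # names of the available slots
demo@timing                # access a slot with @
slot(demo, "diagnostics")  # equivalent access by slot name
```

On a real result, the same pattern would apply, for example `res@predicted_patterns`.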

# Create data:
cdata <- data.frame(period = rep(seq(10), 1000),
                    action = rep(0:1, 5000),
                    my.decision1 = sample(1:0, 10000, TRUE),
                    other.decision1 = sample(1:0, 10000, TRUE),
                    group = c(rep(1, 5000), rep(2, 5000)))
time_len <- 2
agg_patterns <- data.frame(group = c(1, 2),
                           action = c(mean(as.numeric(cdata[cdata$group == 1, "action"])),
                                      mean(as.numeric(cdata[cdata$group == 2, "action"]))),
                           c(eat::period_vec_create(cdata[cdata$group == 1, ], time_len)[1],
                             eat::period_vec_create(cdata[cdata$group == 2, ], time_len)[1]),
                           c(eat::period_vec_create(cdata[cdata$group == 1, ], time_len)[2],
                             eat::period_vec_create(cdata[cdata$group == 2, ], time_len)[2]))
names(agg_patterns)[3:4] <- c("1", "2")

# Create ABM:
simulate_abm <- function(model, features, parameters, time_len,
                         tuning_parameters,
                         iterations = 1250, STAT = "mean"){
  # Placeholder simulation: generates random data instead of running a real ABM
  matrixOut <- data.frame(period = rep(1:10, 1000),
                          action = rep(0:1, 5000),
                          my.decision1 = sample(1:0, 10000, TRUE),
                          other.decision1 = sample(1:0, 10000, TRUE))
  action_avg <- mean(matrixOut$action, na.rm = TRUE)
  dynamics <- eat::period_vec_create(matrixOut, time_len)
  list(dynamics = dynamics, action_avg = action_avg, simdata = matrixOut)
}
# Create features and formula lists:
k <- 1
features <- as.list(rep(NA, k)) # create list to fill
features[[1]] <- c("my.decision1", "other.decision1")
Formula <- as.list(rep(NA, k)) # create list to fill
Formula[[1]] <- "action ~ my.decision1 + other.decision1"
# Call cv_abm():
res <- cv_abm(cdata, features, Formula, agg_patterns,
             abm_simulate = simulate_abm,
             abm_vars = list(values = c(0.3, 0.5)),
             iters = 1000,
             tseries_len = time_len,
             tp = c(1, 2),
             package = "caretglm",
             STAT = "mean",
             saving = FALSE, filename = NULL,
             validate = "lgocv", 
             drop_nzv = FALSE, 
             predict_test_par = FALSE)
             
summary(res)
#plot(res)
#performance(res, "cor_pval")