cv_abm: Estimate and Test an ABM

View source: R/cv_abm.R

Description

cv_abm uses cross-validation to test an ABM's predictive power.

Usage

cv_abm(data, features, Formula, agg_patterns, abm_simulate, abm_vars, iters,
  tseries_len, tp = rep(tseries_len, nrow(agg_patterns)),
  package = c("caretglm", "caretglmnet", "glm", "caretnnet", "caretdnn"),
  sampling = FALSE, sampling_size = 1000, STAT = c("mean", "median"),
  saving = FALSE, filename = NULL, abm_optim = c("GA", "DE"),
  validate = c("lgocv", "cv"), folds = ifelse(validate == "lgocv",
  max(data$group), 10), drop_nzv = FALSE, verbose = TRUE,
  predict_test_par = FALSE, optimize_abm_par = FALSE,
  parallel_training = FALSE)

Arguments

data

data.frame in which each row (observational unit) is an individual decision. It must have a column named "group" specifying which group of agg_patterns each observation is in, and a column named "period" specifying the time period in which each behavior occurred.

features

list of the variables (columns in data) to be used in the prediction Formula. The list has one element for each discrete model fitted for a different time. Each element of the list is a character vector, with each element of the character vector being a feature to use for training an individual-level model.

Formula

list where each element is a length one character vector that specifies a formula, e.g. "y ~ x". The variables in each formula must be consistent with the corresponding element of features and with the columns of data. There are as many elements in the list as there are discrete models for different times.
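How features and Formula line up can be sketched as follows for two discrete models. The column names are borrowed from the Examples section; the split into two models and the reduced feature set of the second model are illustrative assumptions, not something the package prescribes.

```r
# Two discrete models (for example, one for early periods and one for
# later periods). The second model's smaller feature set is hypothetical.
features <- list(
  c("my.decision1", "other.decision1"),  # features for the first model
  c("my.decision1")                      # features for the second model
)
Formula <- list(
  "action ~ my.decision1 + other.decision1",
  "action ~ my.decision1"
)

# Sanity checks: one formula per feature set, and every listed feature
# appears in the matching formula.
stopifnot(length(features) == length(Formula))
for (i in seq_along(features)) {
  stopifnot(all(features[[i]] %in% all.vars(as.formula(Formula[[i]]))))
}
```

Checking the two lists against each other before calling cv_abm is cheap and catches mismatched names early.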

agg_patterns

data.frame with rows (observational unit) being the group and columns: (a.) those aggregate level variables needed for the prediction with the specified formula (with same names as the variables in the formula); (b.) a column named "action" with the proportion of the relevant outcome action taken in that group; (c.) columns named paste(seq(tseries_len)) with the mean/median levels (STAT) of the action for each time period.
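A minimal base-R sketch of the agg_patterns layout, assuming STAT = "mean". It substitutes tapply for the package's own period_vec_create helper (used in the Examples section), so the numbers may not match that helper exactly. The toy data are made up, and the aggregate-level predictor columns from (a.) are omitted for brevity.

```r
set.seed(1)
# Toy individual-level data in the layout described for `data`.
cdata <- data.frame(period = rep(1:2, 100),
                    action = rbinom(200, 1, 0.5),
                    group  = rep(1:2, each = 100))
tseries_len <- 2

# Mean action per group and period: rows are groups, columns are periods.
by_period <- tapply(cdata$action, list(cdata$group, cdata$period), mean)

# One row per group, with the overall proportion of the action taken.
agg_patterns <- data.frame(group = sort(unique(cdata$group)))
agg_patterns$action <- as.numeric(tapply(cdata$action, cdata$group, mean))

# One column per time period, named "1", "2", ..., as paste(seq(tseries_len)).
for (p in paste(seq(tseries_len))) {
  agg_patterns[[p]] <- as.numeric(by_period[, p])
}
names(agg_patterns)
```

The period columns are added with `[[<-` so that their names stay the bare strings "1" and "2" rather than being converted to syntactic names.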

abm_simulate

function with these arguments: model, features, parameters, tuning_parameters, iterations, time_len, STAT = c("mean", "median"), where model is the output of training. The function returns a list with three named elements: dynamics, a numeric vector of length tseries_len; action_avg, a numeric vector of length one; and simdata, a data.frame with the numeric results of the simulation.

abm_vars

a list with either (1.) a numeric vector named "lower" AND a numeric vector named "upper", each with one element per tuning parameter of the ABM (the names of the elements of these vectors should be the names of the variables, and they should be in the same order that the abm_simulate function uses them); or (2.) a numeric vector named "value" with one element per tuning parameter of the ABM (variables should be in the same order that the abm_simulate function uses them). Provide either the lower and upper elements or the value element, not both.
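The two accepted shapes of abm_vars might look like the sketch below. The parameter names "alpha" and "beta" are hypothetical; in practice they would be whatever tuning parameters your abm_simulate function expects, in the order it uses them.

```r
# Form 1: lower and upper bounds, one named element per tuning parameter.
# "alpha" and "beta" are hypothetical parameter names.
abm_vars_bounds <- list(lower = c(alpha = 0, beta = 0),
                        upper = c(alpha = 1, beta = 1))

# Form 2: a single value per tuning parameter, in the same order.
abm_vars_fixed <- list(value = c(alpha = 0.3, beta = 0.5))

# Sanity checks: bounds share names and order, and lower <= upper.
stopifnot(identical(names(abm_vars_bounds$lower),
                    names(abm_vars_bounds$upper)),
          all(abm_vars_bounds$lower <= abm_vars_bounds$upper))
```

Note that the Examples section spells the element of the second form `values`; this sketch follows the argument description.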

iters

numeric vector length one specifying number of iterations to simulate ABM for.

tseries_len

numeric vector length one specifying the maximum number of time periods to use for model training and testing. If some groups have fewer than the maximum, you need to provide a vector to the tp argument.

tp

optional numeric vector length number of rows of agg_patterns specifying how long the time series for each group should be. Default is rep(tseries_len, nrow(agg_patterns)).

package

optional character vector length one, one of "caretglm", "caretglmnet", "glm", "caretnnet", or "caretdnn".

sampling

optional logical vector length one, default is FALSE. If sampling == TRUE, we sample equal numbers of observations from each 'group' to reduce potential problems with the final estimated model being too affected by groups with more observations.

sampling_size

optional numeric vector length one specifying how many observations from each group that training should sample to train the model, default is 1000. Only applicable when sampling argument is set to TRUE.

STAT

optional character vector length one, default is c("mean", "median").

saving

optional logical vector length one, default is FALSE.

filename

optional character vector length one, default is NULL.

abm_optim

optional character vector length one, default is c("GA", "DE").

validate

optional character vector length one, default is c("lgocv", "cv").

folds

optional numeric vector length one, default is ifelse(validate == "lgocv", max(data$group), 10).
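The default expression for folds can be evaluated by hand to see what it yields. The sketch below uses a made-up data.frame and assumes groups are coded as consecutive integers starting at 1, so that max(data$group) equals the number of groups.

```r
# Toy data with three groups coded 1, 2, 3 (an assumption of this sketch).
data <- data.frame(group = c(1, 1, 2, 2, 3))

# With leave-group-out validation, the default is one fold per group.
validate <- "lgocv"
folds_lgocv <- ifelse(validate == "lgocv", max(data$group), 10)
folds_lgocv  # 3

# With ordinary cross-validation, the default is 10 folds.
validate <- "cv"
folds_cv <- ifelse(validate == "lgocv", max(data$group), 10)
folds_cv  # 10
```

If groups are not coded 1 through the number of groups, the default under "lgocv" would not equal the group count, so passing folds explicitly may be safer in that case.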

drop_nzv

optional logical vector length one, default is FALSE.

verbose

optional logical vector length one, default is TRUE.

predict_test_par

optional logical vector length one, default is FALSE. If you encounter errors with this function, set this and the other parallel-processing arguments to FALSE, because debugging in parallel is much harder.

optimize_abm_par

optional logical vector length one, default is FALSE. This is passed to the optimization algorithm.

parallel_training

optional logical vector length one, default is FALSE. This is passed to training.

Details

The function returns an S4 object of class cv_abm. See the documentation of the cv_abm class for details of the slots (objects) that this type of object has.

Value

Returns an S4 object of class cv_abm, with slots call = "language", predicted_patterns = "list", timing = "numeric", and diagnostics = "character".
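Slots of an S4 object are read with the @ operator or slot(). The sketch below defines a stand-in class, cv_abm_sketch, with the same four slots so that it runs without the package. The class name and all slot contents are made up for illustration and are not the package's own definitions.

```r
library(methods)

# Hypothetical stand-in class mirroring the slot layout described above.
setClass("cv_abm_sketch",
         representation(call = "language",
                        predicted_patterns = "list",
                        timing = "numeric",
                        diagnostics = "character"))

# Made-up contents, for illustration only.
obj <- new("cv_abm_sketch",
           call = quote(cv_abm(data, features, Formula)),
           predicted_patterns = list(c(0.4, 0.6)),
           timing = 1.5,
           diagnostics = "none")

slotNames(obj)                   # names of the four slots
obj@timing                       # read a slot with @
slot(obj, "predicted_patterns")  # or with slot()
```

On a real result, the same accessors would apply, e.g. res@timing for the object returned in the Examples section.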

Examples

# Create data:
cdata <- data.frame(period = rep(seq(10), 1000),
                    action = rep(0:1, 5000),
                    my.decision1 = sample(1:0, 10000, TRUE),
                    other.decision1 = sample(1:0, 10000, TRUE),
                    group = c(rep(1, 5000), rep(2, 5000)))
time_len <- 2
agg_patterns <- data.frame(group = c(1, 2),
                           action = c(mean(as.numeric(cdata[cdata$group == 1, "action"])),
                                      mean(as.numeric(cdata[cdata$group == 2, "action"]))),
                           c(eat::period_vec_create(cdata[cdata$group == 1, ], time_len)[1],
                             eat::period_vec_create(cdata[cdata$group == 2, ], time_len)[1]),
                           c(eat::period_vec_create(cdata[cdata$group == 1, ], time_len)[2],
                             eat::period_vec_create(cdata[cdata$group == 2, ], time_len)[2]))
names(agg_patterns)[3:4] <- c("1", "2")

# Create ABM:
# Placeholder ABM: it ignores the model and parameters and returns random data.
simulate_abm <- function(model, features, parameters, time_len,
                         tuning_parameters,
                         iterations = 1250, STAT = "mean") {
  matrixOut <- data.frame(period = rep(1:10, 1000),
                          action = rep(0:1, 5000),
                          my.decision1 = sample(1:0, 10000, TRUE),
                          other.decision1 = sample(1:0, 10000, TRUE))
  action_avg <- mean(matrixOut$action, na.rm = TRUE)
  dynamics <- eat::period_vec_create(matrixOut, time_len)
  list(dynamics = dynamics, action_avg = action_avg, simdata = matrixOut)
}
# Create features and formula lists:
k <- 1
features <- as.list(rep(NA, k)) # create list to fill
features[[1]] <- c("my.decision1", "other.decision1")
Formula <- as.list(rep(NA, k)) # create list to fill
Formula[[1]] <- "action ~ my.decision1 + other.decision1"
# Call cv_abm():
res <- cv_abm(cdata, features, Formula, agg_patterns,
             abm_simulate = simulate_abm,
             abm_vars = list(values = c(0.3, 0.5)),
             iters = 1000,
             tseries_len = time_len,
             tp = c(1, 2),
             package = "caretglm",
             STAT = "mean",
             saving = FALSE, filename = NULL,
             validate = "lgocv", 
             drop_nzv = FALSE, 
             predict_test_par = FALSE)
             
summary(res)
#plot(res)
#performance(res, "cor_pval")

JohnNay/eat documentation built on May 7, 2019, noon