internal_workflow: A learning and prediction workflow with internal validation


View source: R/new_workflows.R

Description

A learning and prediction workflow that may deal with NAs and use internal validation to parametrize a re-sampling technique to balance an imbalanced regression problem.

Usage

internal_workflow(
  train,
  test,
  form,
  model,
  time,
  site_id,
  resample.grid,
  resample.pars = NULL,
  internal.est = NULL,
  internal.est.pars = NULL,
  internal.evaluator = "int_util_evaluate",
  internal.eval.pars = NULL,
  metrics = c("F1.u", "rmse_phi"),
  metrics.max = c(TRUE, FALSE),
  stat = "MED",
  handleNAs = "centralImputNAs",
  min_train = 2,
  nORp = 0.2,
  .int_parallel = FALSE,
  .intRes = TRUE,
  .full_intRes = FALSE,
  ...
)

Arguments

train

a data frame for training

test

a data frame for testing

form

a formula describing the model to learn

model

the name of the algorithm to use

time

the name of the column in train and test containing time-stamps

site_id

the name of the column in train and test containing location IDs

resample.grid

a data.frame whose columns are the resample.pars to be tested using internal.est, one candidate parametrization per row. For any NA value in resample.grid, the corresponding argument is set to NULL.
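As an illustration of the grid format described above, here is a minimal sketch in base R. The column names C.perc and alpha, and the helper row_to_pars, are hypothetical and chosen only for the example; the actual parameter names depend on the re-sampling function in use.

```r
# Hypothetical re-sampling parameters; the column names are illustrative only.
resample.grid <- expand.grid(
  C.perc = c(0.5, 0.8, NA),   # NA means: leave this argument as NULL
  alpha  = c(0, 0.5, 1),
  KEEP.OUT.ATTRS = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one grid row into an argument list,
# mapping NA entries to NULL as the documentation describes.
row_to_pars <- function(grid, i) {
  pars <- as.list(grid[i, , drop = FALSE])
  lapply(pars, function(v) if (is.na(v)) NULL else v)
}

pars_1 <- row_to_pars(resample.grid, 1)  # C.perc = 0.5, alpha = 0
pars_3 <- row_to_pars(resample.grid, 3)  # C.perc is NULL, alpha = 0
```

Each row of the grid is one candidate parametrization, so this grid would lead to nine internal evaluations.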

resample.pars

parameters to be passed to the re-sampling function. Default is NULL.

internal.est

character string identifying the internal estimator function to use

internal.est.pars

named list of internal estimator parameters (e.g., tr.perc or nfolds)

internal.evaluator

character string indicating internal evaluation function

internal.eval.pars

named list of parameters to feed to internal evaluation function

metrics

vector of names of two metrics to be used to determine the best parametrization (the second metric is only used in case of ties)

metrics.max

vector of Booleans indicating whether each metric in parameter metrics should be maximized (TRUE) or minimized (FALSE) for best results

stat

parameter indicating summary statistic that should be used to determine the best internal evaluation metric: "MED" (for median) or "MEAN" (for mean)
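The way metrics, metrics.max and stat interact can be sketched as follows. This is not the package's implementation, only one plausible reading of the descriptions above; the scores data frame and its grid_row column are made up for the example.

```r
# Hypothetical internal-validation scores: one row per (grid row, fold).
scores <- data.frame(
  grid_row = rep(1:3, each = 3),
  F1.u     = c(0.60, 0.70, 0.65,  0.70, 0.64, 0.65,  0.50, 0.55, 0.52),
  rmse_phi = c(1.20, 1.10, 1.30,  0.90, 1.00, 0.95,  0.80, 0.85, 0.90)
)

metrics     <- c("F1.u", "rmse_phi")
metrics.max <- c(TRUE, FALSE)
stat        <- "MED"

# Summarise each metric per grid row with the chosen statistic.
summarise_fun <- if (stat == "MED") median else mean
summ <- aggregate(scores[metrics],
                  by = list(grid_row = scores$grid_row),
                  FUN = summarise_fun)

# Negate metrics that should be maximised, so that ascending order
# always puts the best candidate first; the second key breaks ties.
keys <- Map(function(m, mx) if (mx) -summ[[m]] else summ[[m]],
            metrics, metrics.max)
best <- summ$grid_row[do.call(order, unname(keys))[1]]
```

In this toy data, grid rows 1 and 2 tie on the median F1.u, so the lower median rmse_phi decides in favour of row 2.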

handleNAs

string indicating how to deal with NAs. If "centralImputNAs", training observations with at least 80% of non-NA columns will have their NAs substituted by the mean value, and testing observations will have their NAs filled in with the mean value regardless. Default is "centralImputNAs".

min_train

a minimum number of observations that must be left to train a model. If there are not enough observations, predictions will be NA. Default is 2.

nORp

a maximum number or fraction of columns/rows with missing values above which a row/column will be removed from train before learning the model. Only applies when handleNAs is set to "centralImputNAs". Default is 0.2.
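A rough sketch of how handleNAs, nORp and min_train could work together is given below, in base R only. It assumes nORp is a fraction applied to rows, and that the fill-in values are column means computed on the retained training rows; the package may compute them differently. The helper impute_mean and the toy data are hypothetical.

```r
# Toy training data: row 4 has 3 of 6 columns missing (50%).
train <- data.frame(
  x1 = c(1, NA, 3, NA),
  x2 = c(10, 20, NA, NA),
  x3 = c(5, 6, 7, NA),
  x4 = c(1, 1, 1, 1),
  x5 = c(2, 4, 6, 8),
  y  = c(0.1, 0.2, 0.3, 0.4)
)
test <- data.frame(x1 = NA_real_, x2 = 15, x3 = NA_real_,
                   x4 = 1, x5 = 3, y = 0.25)

nORp      <- 0.2
min_train <- 2

# Drop training rows whose fraction of missing columns exceeds nORp.
na_frac <- rowMeans(is.na(train))
kept    <- train[na_frac <= nORp, , drop = FALSE]

# Assumption: fill-in values are column means of the retained rows.
col_means <- vapply(kept, function(col) mean(col, na.rm = TRUE), numeric(1))

# Hypothetical helper: replace NAs in each column by the given mean.
impute_mean <- function(df, means) {
  for (nm in names(means)) {
    df[[nm]][is.na(df[[nm]])] <- means[[nm]]
  }
  df
}

kept        <- impute_mean(kept, col_means)
test_filled <- impute_mean(test, col_means)  # test rows are always filled in

# With fewer than min_train rows left, predictions would be NA instead.
can_train <- nrow(kept) >= min_train
```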

.int_parallel

a Boolean indicating whether rows in the grid search should be tested in parallel

.intRes

a Boolean indicating whether the evalRes object output by internal validation should be returned. Defaults to TRUE.

.full_intRes

a Boolean indicating whether the full results object for internal validation should be returned as well. Defaults to FALSE

...

other parameters to feed to model

Value

a data frame containing time-stamps, location IDs, true values and predicted values


mrfoliveira/STResampling-JDSA2020 documentation built on June 28, 2021, 7:01 p.m.