| designTreatmentsC | R Documentation | 
Function to design variable treatments for binary prediction of a
categorical outcome.  Data frame is assumed to have only atomic columns
except for dates (which are converted to numeric). Note: re-encoding high cardinality
categorical variables can introduce undesirable nested model bias; for such data consider
using mkCrossFrameCExperiment.
designTreatmentsC(
  dframe,
  varlist,
  outcomename,
  outcometarget = TRUE,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = NULL,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  catScaling = TRUE,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)
dframe | 
 Data frame to learn treatments from (training data), must have at least 1 row.  | 
varlist | 
 Names of columns to treat (effective variables).  | 
outcomename | 
 Name of column holding outcome variable. dframe[[outcomename]] must contain only finite, non-missing values.  | 
outcometarget | 
 Value/level of outcome to be considered "success", and there must be a cut such that dframe[[outcomename]]==outcometarget at least twice and dframe[[outcomename]]!=outcometarget at least twice.  | 
... | 
 no additional arguments, declared to force named binding of later arguments  | 
weights | 
 optional training weights for each row  | 
minFraction | 
 optional minimum frequency a categorical level must have to be converted to an indicator column.  | 
smFactor | 
 optional smoothing factor for impact coding models.  | 
rareCount | 
 optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off.  | 
rareSig | 
 optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL or off.  | 
collarProb | 
 what fraction of the data (pseudo-probability) to collar data at if doCollar is set during prepare.treatmentplan.  | 
codeRestriction | 
 what types of variables to produce (character array of level codes, NULL means no restriction).  | 
customCoders | 
 map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md).  | 
splitFunction | 
 (optional) see vtreat::buildEvalSets .  | 
ncross | 
 optional scalar >=2, number of cross validation splits used in rescoring complex variables.  | 
forceSplit | 
 logical, if TRUE force cross-validated significance calculations on all variables.  | 
catScaling | 
 optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling.  | 
verbose | 
 if TRUE print progress.  | 
parallelCluster | 
 (optional) a cluster object created by package parallel or package snow.  | 
use_parallel | 
 logical, if TRUE use parallel methods (when parallel cluster is set).  | 
missingness_imputation | 
 function of signature f(values: numeric, weights: numeric), simple missing value imputer.  | 
imputation_map | 
 map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers.  | 
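As a minimal sketch of how two of the optional arguments above might be combined, the example below restricts the plan to a few variable types with codeRestriction and supplies a median-based missingness_imputation function. The data frame, the helper name median_imputer, and the choice of codes ('clean', 'isBAD', 'lev') are illustrative assumptions, not part of the original documentation.

```r
library(vtreat)

# Hypothetical training data with missing values in the numeric column z
dTrain <- data.frame(
  x = c('a', 'a', 'b', 'b', 'c', 'c', 'a', 'b'),
  z = c(1, NA, 3, 4, NA, 6, 7, 100),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Illustrative imputer matching the documented signature f(values, weights);
# it ignores the weights and returns the median of the non-missing values
median_imputer <- function(values, weights) {
  stats::median(values, na.rm = TRUE)
}

plan <- designTreatmentsC(
  dTrain, c('x', 'z'), 'y', TRUE,
  codeRestriction = c('clean', 'isBAD', 'lev'),  # only these variable types
  missingness_imputation = median_imputer,
  verbose = FALSE
)

# Apply the plan to new data that contains a missing z value
dNew <- data.frame(x = c('a', 'c'), z = c(NA, 2), stringsAsFactors = FALSE)
treated <- prepare(plan, dNew)
print(treated)
```

A median is less sensitive than a mean to outliers such as the value 100 in this toy data, which is one reason to supply a custom imputer.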
The main fields are mostly vectors with names (all with the same names in the same order):
- vars : (character array without names) names of variables (in same order as names on the other diagnostic vectors)
- varMoves : logical TRUE if the variable varied during hold out scoring, only variables that move will be in the treated frame
- sig : an estimate of the significance of effect
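The sketch below shows one way to inspect these diagnostics. It assumes a vtreat release in which the per-variable diagnostics are collected in the scoreFrame element of the returned treatment plan (with columns such as varName, varMoves and sig); if your installed version differs, check names(treatmentsC) first.

```r
library(vtreat)

# Same toy training data as in the examples for this function
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', 'b'),
  z = c(1, 2, 3, 4, 5, 6),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

treatmentsC <- designTreatmentsC(dTrainC, c('x', 'z'), 'y', TRUE,
                                 verbose = FALSE)

# Assumption: diagnostics are available as a data frame named scoreFrame
sf <- treatmentsC$scoreFrame
print(sf[, c('varName', 'varMoves', 'sig')])

# Names of derived variables that varied and so can appear in a treated frame
movers <- sf$varName[sf$varMoves]
print(movers)
```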
See the vtreat vignette for a bit more detail and a worked example.
Columns that do not vary are not passed through.
Note: re-encoding high cardinality categorical variables on training data can introduce nested model bias; consider using mkCrossFrameCExperiment instead.
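A minimal sketch of the cross-frame alternative follows. The synthetic data (a 40-level categorical variable x and a numeric variable z) is made up for illustration. The returned list is assumed to carry the treatment plan in its treatments element and the cross-validated training frame in its crossFrame element, as described in the mkCrossFrameCExperiment documentation.

```r
library(vtreat)

set.seed(2024)
n <- 200

# Synthetic data: x has many levels, so impact-style codes could overfit
d <- data.frame(
  x = sample(paste0('lev', 1:40), n, replace = TRUE),
  z = rnorm(n),
  stringsAsFactors = FALSE
)
d$y <- (d$z + rnorm(n)) > 0

# Design the treatments and build the cross-validated training frame together
cfe <- mkCrossFrameCExperiment(d, c('x', 'z'), 'y', TRUE, verbose = FALSE)

treatments <- cfe$treatments      # treatment plan, for use with prepare() on new data
dTrainTreated <- cfe$crossFrame   # use this frame, not prepare(treatments, d), to fit a model

print(dim(dTrainTreated))
```

Fitting the downstream model on the cross frame, and reserving prepare() for test or application data, keeps the rows used to estimate a level's encoding separate from the rows that encoding is applied to.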
treatment plan (for use with prepare)
prepare.treatmentplan, designTreatmentsN, designTreatmentsZ, mkCrossFrameCExperiment
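library(vtreat)

# Training data: a categorical variable x, a numeric variable z, and a logical outcome y
dTrainC <- data.frame(x=c('a','a','a','b','b','b'),
   z=c(1,2,3,4,5,6),
   y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
# Test data with a novel level ('c') and missing values
dTestC <- data.frame(x=c('a','b','c',NA),
   z=c(10,20,30,NA))
# Design the treatment plan, treating y == TRUE as the target
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
# Apply the plan to the test data
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=0.99)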