add_bestfeature: Add the best feature from a list to improve an existing set
In jeanmarcgp/mlStocks: Machine Learning Predictive Analysis for Stocks


Given a list of candidate features and a separate base set of features, this function searches the list for the single most predictive feature to add to the base set. This is useful as the search step in feature selection algorithms.

add_bestfeature(base_set, feature_list, train_set, val_set, Nrepeat = 1,
  mlalgo = "h2o_rf", mlpar = list(mtry = 1, ntree = 1000, min_rows = 5),
  meritFUN = "trading_returns", meritFUNpar = list(long_thres = 0,
  short_thres = 0))

`base_set`	The base set of features that is given and used in all model train/validate runs. At every run, one feature from the feature_list is selected and added to this base_set, and used to train and validate the model.
`feature_list`	A list or a vector of features used during the iterative loop. At each iteration, one feature from this list is extracted and combined with the base_set, and the resulting set is used to train the machine learning model.
`train_set`	The training set used to build the model. Column 1 should contain the target variable (y). The feature set under test (base_set plus one candidate from feature_list) is used to subset the train_set and extract the features for training.
`val_set`	The validation set used to validate the model's performance. Column 1 should contain the target variable (y). The feature set under test (base_set plus one candidate from feature_list) is used to subset the val_set and extract the features for predicting.
`Nrepeat`	Number of times to repeat the train-validate process. This is useful to build and validate multiple identical models and compile statistics on the figures of merit across all runs. Doing this helps to determine the hyper-parameter values empirically, by checking that all such models make similar predictions.
`mlalgo`	The machine learning algorithm used to build the model.
`mlpar`	A named list containing the machine learning model parameters. If a parameter is missing, then the model's defaults are used.
`meritFUN`	The name of a function used to calculate a numeric figure of merit (FOM) to include in the return list for evaluation by an upper layer function.
`meritFUNpar`	A named list of parameters passed to the function named in meritFUN, such as the long_thres and short_thres thresholds shown in the usage.

For each feature in feature_list, the function trains a machine learning model on the candidate feature combined with the base_set and validates it against a validation data set. Performance on the validation set is scored with an externally provided function, which yields a figure of merit for the feature set being evaluated. Because every candidate is scored the same way, the figures of merit can be compared to assess the relative predictive power of each feature set. The best performing set is identified, and the associated feature set is returned as a dataframe ($bestrun) as part of a list.
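
The search described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `best_feature_sketch` is a hypothetical name, `lm()` stands in for the `h2o_rf` model, negative validation RMSE stands in for the trading-based figure of merit, and the data are simulated.

```r
# Hypothetical stand-in for the search step: lm() replaces the h2o model and
# negative validation RMSE replaces the trading-based figure of merit.
best_feature_sketch <- function(base_set, feature_list, train_set, val_set) {
  feature_list <- unlist(feature_list)     # accept a list or a vector
  target <- colnames(train_set)[1]         # column 1 holds the target (y)
  merit <- sapply(feature_list, function(candidate) {
    features <- c(base_set, candidate)     # base set plus one candidate
    fit  <- lm(reformulate(features, response = target), data = train_set)
    pred <- predict(fit, newdata = val_set)
    -sqrt(mean((val_set[[target]] - pred)^2))   # higher is better
  })
  best <- feature_list[which.max(merit)]
  list(summary = data.frame(added_feature = feature_list, merit = merit,
                            row.names = NULL),
       bestset = c(base_set, best))
}

# Simulated data: y depends on x1 and x3 only.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat <- cbind(y = dat$x1 + 2 * dat$x3 + rnorm(n, sd = 0.1), dat)
train_set <- dat[1:150, ]
val_set   <- dat[151:200, ]

res <- best_feature_sketch("x1", c("x2", "x3", "x4"), train_set, val_set)
print(res$summary)
print(res$bestset)
```

With `x1` already in the base set, the candidate `x3` should score best here because it is the only remaining feature that drives `y`.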

This is a low-level function, normally called from a higher-level loop that performs feature selection through iterative training and validation.
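
One way such a higher-level loop might look is greedy forward selection, sketched below in base R. Everything here is an assumption made for illustration: `step_sketch` is a hypothetical stand-in for the single-step search (using `lm()` and negative validation RMSE rather than the package's model and figure of merit), and `forward_select` with its `min_gain` stopping rule is one possible design, not the package's.

```r
# Hypothetical single-step search: returns the best candidate and its merit.
step_sketch <- function(base_set, candidates, train_set, val_set) {
  target <- colnames(train_set)[1]         # column 1 holds the target (y)
  merit <- sapply(candidates, function(f) {
    fit <- lm(reformulate(c(base_set, f), response = target), data = train_set)
    -sqrt(mean((val_set[[target]] - predict(fit, newdata = val_set))^2))
  })
  list(best = candidates[which.max(merit)], merit = max(merit))
}

# Greedy loop: keep adding the best candidate while the merit improves by at
# least min_gain (an illustrative stopping rule).
forward_select <- function(candidates, train_set, val_set, min_gain = 0.01) {
  selected   <- character(0)
  best_merit <- -Inf
  while (length(candidates) > 0) {
    step <- step_sketch(selected, candidates, train_set, val_set)
    if (step$merit - best_merit < min_gain) break
    selected   <- c(selected, step$best)
    candidates <- setdiff(candidates, step$best)
    best_merit <- step$merit
  }
  selected
}

# Simulated data: y depends on x1 and x3 only.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat <- cbind(y = dat$x1 + 2 * dat$x3 + rnorm(n, sd = 0.1), dat)

selected <- forward_select(c("x1", "x2", "x3", "x4"), dat[1:150, ], dat[151:200, ])
print(selected)
```

Each pass promotes the winning candidate into the base set and removes it from the candidate pool, which is the role the base_set and feature_list arguments play in a single call.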

Returns a list with the following elements:

$summary

A dataframe containing the summary results for each feature set tested. Each feature set is a dataframe row. The columns include:

added_feature The feature added to the feature set at the given run.
alpha_mean The average of the alphas generated by all identical models. The function generates Nrepeat identical models for this purpose, and alpha is the model's mean return minus the market's mean return.
alpha_sd The standard deviation of the alphas obtained from all Nrepeat identical models.
Normalized_sd The normalized alpha standard deviation, taken as alpha_sd / alpha_mean.
PctTraded The percentage of vectors in the validation set that are selected to be traded according to the meritFUN argument.
PctLongs The percentage of vectors in the validation set that are traded as longs (buy trades).
PctShorts The percentage of vectors in the validation set that are traded as shorts (sell trades).
Nmarket The number of vectors available in the validation set. This defines the market against which the models attempt to extract positive alpha.
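
The alpha statistics in this table can be reproduced from per-run alphas as in the following base R sketch. The numbers are made up, and the `added_feature` grouping column in the per-run table is an assumption made for the example.

```r
# Made-up per-run alphas: 2 candidate features, Nrepeat = 4 runs each.
runs <- data.frame(
  added_feature = rep(c("f1", "f2"), each = 4),
  Trade_Alpha   = c(0.010, 0.012, 0.008, 0.010,
                    0.020, 0.030, 0.010, 0.020)
)

# Aggregate the Nrepeat runs of each feature set.
alpha_mean <- tapply(runs$Trade_Alpha, runs$added_feature, mean)
alpha_sd   <- tapply(runs$Trade_Alpha, runs$added_feature, sd)

summary_df <- data.frame(
  added_feature = names(alpha_mean),
  alpha_mean    = as.numeric(alpha_mean),
  alpha_sd      = as.numeric(alpha_sd),
  Normalized_sd = as.numeric(alpha_sd / alpha_mean)   # alpha_sd / alpha_mean
)
print(summary_df)
```

A small Normalized_sd indicates that the repeated models agree with each other; note that the ratio becomes unstable when alpha_mean is close to zero.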

$bestrun

A dataframe row in the same format as the summary above, containing the data for the best run as measured by the highest alpha_mean. It also has one additional column, 'feature_set', which is a character string of the best feature set with the features separated by commas.

$bestset

A character vector of the best feature set. This is the set used in the best run, organized as a character vector instead of the single comma-separated string in the feature_set column of bestrun above.
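
The relationship between the two representations can be shown in a few lines of base R. The feature names are hypothetical, and the exact separator (with or without a trailing space) is an assumption, so the sketch trims whitespace after splitting.

```r
base_set     <- c("f1", "f2")   # hypothetical feature names
best_feature <- "f7"

# Single comma-separated string, as described for the feature_set column.
feature_set <- paste(c(base_set, best_feature), collapse = ",")

# Character vector form, as described for bestset.
bestset <- trimws(strsplit(feature_set, ",", fixed = TRUE)[[1]])

print(feature_set)
print(bestset)
```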

$rundetails

A dataframe containing the details of each underlying run. The number of rows equals Nrepeat runs * the number of features tested. For example, if 3 features are tested against the base set and for each feature set we rebuild the model 8 times, then we have 3 * 8 = 24 runs and rows in this dataframe. The columns are as follows:

Trade_Alpha The alpha generated by the given run. Alpha is the average return of all trades identified by the run, minus the average return of all trades in the market.
Trade_Rets The average return of all trades identified by the run.
Market_Rets The average return of all trades in the market (validation set).
NLongs The number of vectors in the validation set that are traded as longs (buy trades).
NShorts The number of vectors in the validation set that are traded as shorts (sell trades).
Nmarket The number of vectors available in the validation set. This defines the market against which the models attempt to extract a positive alpha.
PctLongs The percentage of vectors in the validation set that are traded as longs (buy trades).
PctShorts The percentage of vectors in the validation set that are traded as shorts (sell trades).
PctTraded The percentage of vectors in the validation set that are traded in total (Longs + Shorts).
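
As an illustration of how the columns of one such row relate to each other, the base R sketch below computes them from a vector of predictions and realized returns. It is not the package's `trading_returns` function. It assumes that a vector is traded long when its prediction exceeds `long_thres`, short when its prediction falls below `short_thres`, that a short earns the negative of the realized return, and that percentages are on a 0 to 100 scale. `run_metrics_sketch` is a hypothetical name.

```r
# Hypothetical per-run metrics; the trading rule here is an assumption.
run_metrics_sketch <- function(pred, actual, long_thres = 0, short_thres = 0) {
  longs  <- pred > long_thres
  shorts <- pred < short_thres
  trade_rets <- c(actual[longs], -actual[shorts])   # shorts gain when price falls
  n_market   <- length(actual)
  avg_trade  <- if (length(trade_rets) > 0) mean(trade_rets) else NA_real_
  avg_market <- mean(actual)
  data.frame(
    Trade_Alpha = avg_trade - avg_market,
    Trade_Rets  = avg_trade,
    Market_Rets = avg_market,
    NLongs      = sum(longs),
    NShorts     = sum(shorts),
    Nmarket     = n_market,
    PctLongs    = 100 * sum(longs) / n_market,
    PctShorts   = 100 * sum(shorts) / n_market,
    PctTraded   = 100 * (sum(longs) + sum(shorts)) / n_market
  )
}

pred   <- c(0.02, -0.01, 0.00, 0.03, -0.02)    # made-up predictions
actual <- c(0.01, -0.02, 0.005, -0.01, 0.03)   # made-up realized returns
m <- run_metrics_sketch(pred, actual)
print(m)
```

In this toy run four of the five vectors are traded, and the trades underperform the market average, so Trade_Alpha is negative.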