nes_var_select: Parallel natural evolutionary variable selection assuming...


View source: R/LEGIT.R

Description

[Slow, highly recommended when the number of variables is large] Uses natural evolution strategy (NES) gradient descent, run in parallel, to find the best subset of variables. It is often as good as genetic algorithms but much faster, so it is the recommended default variable selection function. Note that this approach assumes that the inclusion of a variable does not depend on whether other variables are included (i.e. it assumes independent Bernoulli distributions). This is generally not true, but the approach still converges well, and running it in parallel increases the probability of reaching the global optimum.
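The idea of a natural evolution strategy over independent Bernoulli inclusion probabilities can be illustrated with a toy sketch. This is not LEGIT's implementation: the fitness function, the `truth` vector, the standardization of the fitness values and the clamping bounds are all assumptions made for illustration, and LEGIT's actual update rule and stopping rule may differ.

```r
# Toy NES over independent Bernoulli inclusion probabilities (illustrative only)
set.seed(1)
n_var   <- 6
truth   <- c(1, 1, 0, 0, 1, 0)           # hypothetical "best" subset
fitness <- function(z) -sum(z != truth)  # toy criterion, higher is better

p       <- rep(0.5, n_var)  # current inclusion probability of each variable
popsize <- 25
lr      <- 0.2

for (iter in 1:100) {
  # Sample a population of subsets: one row per candidate, one column per variable
  Z <- matrix(rbinom(popsize * n_var, 1, rep(p, each = popsize)), nrow = popsize)
  f <- apply(Z, 1, fitness)
  # Standardize fitness so the step size does not depend on its scale
  u <- if (sd(f) > 0) (f - mean(f)) / sd(f) else rep(0, popsize)
  # For Bernoulli(p), the natural gradient of the log-probability is (z - p)
  grad <- colMeans(u * sweep(Z, 2, p))
  # Gradient step, keeping probabilities away from exactly 0 and 1
  p <- pmin(pmax(p + lr * grad, 0.01), 0.99)
}

best <- as.integer(p > 0.5)  # variables whose inclusion probability ended high
print(round(p, 2))
```

Each variable's probability is updated on its own, which is exactly the independence assumption described above.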

Usage

nes_var_select(
  data,
  formula,
  parallel_iter = 3,
  alpha = c(1, 5, 10),
  entropy_threshold = 0.05,
  popsize = 25,
  lr = 0.2,
  prop_ignored = 0.5,
  latent_var = NULL,
  search_criterion = "AICc",
  n_cluster = 3,
  eps = 0.01,
  maxiter = 100,
  family = gaussian,
  ylim = NULL,
  seed = NULL,
  progress = TRUE,
  cv_iter = 5,
  cv_folds = 5,
  folds = NULL,
  Huber_p = 1.345,
  classification = FALSE,
  print = FALSE,
  test_only = FALSE
)

Arguments

data

data.frame of the dataset to be used.

formula

Model formula. The names of latent_var can be used in the formula to represent the latent variables. If names(latent_var) is NULL, then L1, L2, ... can be used in the formula to represent the latent variables. Do not manually code interactions, write them in the formula instead (ex: G*E1*E2 or G:E1:E2).

parallel_iter

number of parallel tries (Default = 3). For speed, I recommend using the number of CPU cores.

alpha

Vector of parameters for the Dirichlet distribution of the starting points (assuming a symmetric Dirichlet distribution with only one parameter). If the vector has size N and parallel_iter=K, we use alpha[1], ..., alpha[N], alpha[1], ..., alpha[N], ... for parallel_iter 1 to K respectively. We assume a Dirichlet distribution for the starting points to get a bit more variability and to make sure we are not missing out on a great subset of variables that doesn't converge to the global optimum with the default starting points. Use bigger values for less variability and lower values for more variability (Default = c(1,5,10)).
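The recycling of alpha across parallel tries can be pictured with base R's `rep_len()`. This is only a sketch of the mapping described above; the variable name `alpha_per_run` is hypothetical and the package may implement the recycling differently.

```r
alpha         <- c(1, 5, 10)
parallel_iter <- 5

# Recycle alpha so that each parallel try gets one concentration value
alpha_per_run <- rep_len(alpha, parallel_iter)
print(alpha_per_run)
```

With five tries and three alpha values, the fourth and fifth tries reuse alpha[1] and alpha[2].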

entropy_threshold

Entropy threshold for convergence of the population (Default = .05). The smaller the entropy, the less diversity there is in the population, which indicates convergence.
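As a rough illustration of why low entropy signals convergence, the sketch below computes the mean binary entropy of a vector of inclusion probabilities. The function name `bernoulli_entropy`, the use of base-2 logarithms and the averaging over variables are assumptions; the exact entropy measure used by the package is not stated here.

```r
# Mean binary entropy (in bits) of independent Bernoulli inclusion probabilities
bernoulli_entropy <- function(p) {
  h <- -(p * log2(p) + (1 - p) * log2(1 - p))
  h[p == 0 | p == 1] <- 0  # treat 0 * log(0) as 0
  mean(h)
}

bernoulli_entropy(rep(0.5, 4))           # maximal uncertainty about every variable
bernoulli_entropy(c(0.99, 0.01, 0.99))   # nearly converged population
```

Probabilities close to 0 or 1 mean almost every sampled subset is the same, so the entropy approaches zero.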

popsize

Size of the population, the number of subsets of variables sampled at each iteration (Default = 25). Between 25 and 100 is generally adequate.

lr

Learning rate of the gradient descent. Higher values converge faster but are more likely to get stuck in a local optimum (Default = .2).

prop_ignored

The proportion of the population that is given a fixed fitness value, so that its importance is greatly reduced. The higher it is, the longer it takes to converge. Higher values make the algorithm focus more on favoring the good subsets of variables than on penalizing the bad subsets (Default = .50).
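One way to picture this is rank-based fitness shaping, sketched below. The function `shape_fitness`, the linear utilities and the fixed value of 0 for ignored candidates are all hypothetical choices for illustration; the package's actual shaping scheme is not described here and may differ.

```r
# Rank-based fitness shaping: the worst `prop_ignored` share gets a fixed value
shape_fitness <- function(crit, prop_ignored = 0.5) {
  # crit: criterion values where lower is better (e.g. an information criterion)
  n         <- length(crit)
  n_ignored <- floor(prop_ignored * n)
  n_kept    <- n - n_ignored
  r         <- rank(crit, ties.method = "first")  # rank 1 = best candidate
  u         <- numeric(n)                         # ignored candidates stay at 0
  kept      <- r <= n_kept
  # Kept candidates get utilities decreasing linearly from 1 (best)
  u[kept] <- (n_kept - r[kept] + 1) / n_kept
  u
}

shape_fitness(c(10, 30, 20, 40), prop_ignored = 0.5)
```

Because the ignored candidates all share the same value, how bad they are no longer matters; only the ordering among the good candidates drives the update.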

latent_var

list of data.frame. The elements of the list are the datasets used to construct each latent variable. For interpretability and proper convergence, not using the same variable in more than one latent variable is highly recommended. It is recommended to set names to the list elements to prevent confusion because otherwise, the latent variables will be named L1, L2, ...

search_criterion

Criterion used to determine which variable subset is the best. If search_criterion="AIC", uses the AIC; if search_criterion="AICc", uses the AICc; if search_criterion="BIC", uses the BIC; if search_criterion="cv", uses the cross-validation error; if search_criterion="cv_AUC", uses the cross-validated AUC; if search_criterion="cv_Huber", uses the Huber cross-validation error; if search_criterion="cv_L1", uses the L1-norm cross-validation error (Default = "AICc"). The Huber and L1-norm cross-validation errors are alternatives to the usual L2-norm cross-validation error (which the R^2 is based on) that are more resistant to outliers; the lower the values, the better.
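For reference, the standard small-sample correction that turns AIC into AICc can be sketched on an ordinary linear model. The helper `aicc` is hypothetical and uses the built-in `mtcars` data; how the package counts parameters for its own models may differ.

```r
# AICc = AIC + 2k(k + 1) / (n - k - 1), with k estimated parameters and n observations
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")  # number of estimated parameters
  n <- nobs(fit)
  AIC(fit) + (2 * k * (k + 1)) / (n - k - 1)
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
c(AIC = AIC(fit), AICc = aicc(fit))
```

The correction term grows as k approaches n, so AICc penalizes large subsets more heavily than AIC when the sample is small.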

n_cluster

Number of parallel clusters. I recommend using the number of CPU cores (Default = 3).

eps

Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results). Note that using .001 rather than .01 (default) can more than double or triple the computing time of genetic_var_select.

maxiter

Maximum number of iterations.

family

Outcome distribution and link function (Default = gaussian).

ylim

Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution).
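The effect of restricting predictions to a known range can be illustrated with base R. The helper `clamp_to_ylim` is hypothetical and only shows the intended outcome; it is not the package's internal code.

```r
# Force predictions into the known outcome range [ylim[1], ylim[2]]
clamp_to_ylim <- function(pred, ylim) pmin(pmax(pred, ylim[1]), ylim[2])

clamp_to_ylim(c(-0.3, 0.4, 1.2), ylim = c(0, 1))
```

Values already inside the range are left unchanged; only out-of-range predictions are moved to the nearest bound.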

seed

Optional seed.

progress

If TRUE, shows the progress done (Default=TRUE).

cv_iter

Number of cross-validation iterations (Default = 5).

cv_folds

Number of cross-validation folds (Default = 5). Using cv_folds=NROW(data) will lead to leave-one-out cross-validation.

folds

Optional list of vectors containing the fold number for each observation. Bypasses cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data.
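For time-ordered data, folds made of contiguous blocks are a common choice. The sketch below builds such a fold vector and wraps it in a list. It assumes one vector per cross-validation iteration, which is an interpretation of the description above rather than something stated explicitly; the names `blocked` and `my_folds` are hypothetical.

```r
n <- 20  # number of observations, assumed ordered in time
k <- 5   # number of folds

# Contiguous blocks: observations 1-4 in fold 1, 5-8 in fold 2, and so on
blocked  <- rep(seq_len(k), each = ceiling(n / k))[seq_len(n)]
my_folds <- list(blocked)

table(my_folds[[1]])
```

The resulting list could then be passed as the folds argument, so that neighbouring time points are never split between training and validation within a block.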

Huber_p

Parameter controlling the Huber cross-validation error (Default = 1.345).
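The standard Huber loss shows what this parameter controls: residuals smaller than the threshold are penalized quadratically, larger ones only linearly. The function `huber_loss` below is a textbook sketch; whether the package rescales residuals before applying it is not stated here.

```r
# Huber loss: quadratic for |r| <= p, linear beyond p
huber_loss <- function(r, p = 1.345) {
  ifelse(abs(r) <= p, 0.5 * r^2, p * (abs(r) - 0.5 * p))
}

huber_loss(c(0, 1, 3, -10))
```

A large outlier such as -10 contributes far less than it would under squared error (50), which is why this criterion is more resistant to outliers.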

classification

Set to TRUE if you are doing classification and cross-validation (binary outcome).

print

If TRUE, print the parameters of the search distribution and the entropy at each iteration. Note: Only works using Rterm.exe in Windows due to parallel clusters. (Default = FALSE).

test_only

If TRUE, only uses the first fold for training and predicts the other folds; does not train on the other folds. So instead of cross-validation, this gives you a train/test split and you get the test R-squared as output.

Value

Returns a list containing the best subset's fit, cross-validation output, latent variables and starting points.

Examples

## Not run: 
## Example
train <- example_3way_3latent(250, 2, seed = 777)
nes <- nes_var_select(train$data, latent_var = train$latent_var,
                      formula = y ~ E * G * Z)

## End(Not run)

LEGIT documentation built on Jan. 12, 2022, 1:08 a.m.