run_cross_validation: Run k fold cross validation

Description Usage Arguments Value Examples

View source: R/run_cross_validation.R

Description

This function runs k fold cross validation by splitting input data into k partitions and holding out each partition as the test set in k different learning iterations.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
run_cross_validation(
  y,
  X,
  Verbose.pass = FALSE,
  restrict = FALSE,
  rank = FALSE,
  Verbose = TRUE,
  standardize_features = TRUE,
  cluster_corr_prop = 1,
  ct = 1,
  sibling_prune = 0.1,
  k = 5,
  condensed_output = TRUE
)

Arguments

y

outcome variable

X

inut feature matrix

Verbose.pass

logical as to whether the kTSCR procedure should be verbose (i.e. should run_cross_validation pass 'verbose=TRUE' to get_top_clusters()) )

restrict

a list of colnames of X by which to restrict the analysis

rank

logical as to whether to use rank of outcome

Verbose

logical as to whether to be verbose

standardize_features

logical as to whether to standardize all features of X

cluster_corr_prop

what proportion of the maximum (weighted) cluster correlation with y should be reflected by the chosen siblings. A hyperparameter. Default is 1 (meaning include all elder-sibling pairs in cluster)

ct

correlation threshold determined how much a new cluster must improve the current correlation with y in order to be added as a top cluster. A hyperparameter. Default is 1 (meaning any improvement is sufficient to add the next cluster within the greedy framework)

sibling_prune

numeric between 0-1 that sets the threshold for how close apparent correlation and test correlation must be for a k-cv iteration to contribute its siblings to the final chosen siblings. In other words, a lower number is more stringent, since it means the overfitting had to be really low in a k-cv iteration for it to contribute to the final sibling output.

k

the k parameter in k fold cross validation (i.e. train/test partitions). Default is 5

condensed_output

return output that is condensed and summarized across k_cv iterations, specifically with regard to feature importance

Value

returns the list given by get_top_clusters for each n fold k cv run and includes the test correlation and train/test splits from each iteration

Examples

1
2
3
4
5
C <- 100  # represents samples
R <- 200 # represents features
y <- rnorm(C) # represents outcome variable
X <- matrix(rbeta(R*C, 2, 3), nrow = R)  # simulate data matrix
cv_res <- run_cross_validation

mdkessler/kTSCR documentation built on Feb. 25, 2021, 10:31 p.m.