GeneSelectR: Gene Selection and Evaluation with GeneSelectR

View source: R/GeneSelectR.R

GeneSelectR R Documentation

Gene Selection and Evaluation with GeneSelectR

Description

This function performs gene selection using different methods on a given training set and evaluates their performance using cross-validation. Optionally, it also calculates permutation feature importances.

Usage

GeneSelectR(
  X,
  y,
  pipelines = NULL,
  custom_fs_methods = NULL,
  selected_methods = NULL,
  custom_fs_grids = NULL,
  classifier = NULL,
  classifier_grid = NULL,
  preprocessing_steps = NULL,
  testsize = 0.2,
  validsize = 0.2,
  scoring = "accuracy",
  njobs = -1,
  n_splits = 5,
  search_type = "random",
  n_iter = 10,
  max_features = 50,
  calculate_permutation_importance = FALSE,
  perform_test_split = FALSE,
  random_state = NULL
)

Arguments

X

A matrix or data frame with features as columns and observations as rows.

y

A vector of labels corresponding to the rows of X.

pipelines

An optional list of pre-defined pipelines to use for fitting and evaluation. If this argument is provided, the feature selection methods and preprocessing steps will be ignored.

custom_fs_methods

An optional list of feature selection methods to use for fitting and evaluation. If this argument is not provided, a default set of feature selection methods will be used.

selected_methods

An optional vector of names of feature selection methods to use from the default set. If this argument is provided, only the specified methods will be used.

custom_fs_grids

An optional list of hyperparameter grids for the feature selection methods. Each element of the list should be a named list of parameters for a specific feature selection method. The names of the elements should match the names of the feature selection methods. If this argument is provided, the function will perform hyperparameter tuning for the specified feature selection methods in addition to the final estimator.

classifier

An optional sklearn classifier. If left as NULL, sklearn's RandomForestClassifier is used.

classifier_grid

An optional named list of classifier parameters. If none are provided, a default grid is used (see the vignette for the exact parameters).

preprocessing_steps

An optional named list of sklearn preprocessing procedures. If none are provided, the defaults are used (see the vignette for the exact parameters).

testsize

The size of the test set used in the evaluation.

validsize

The size of the validation set used in the evaluation.

scoring

A string naming the scoring metric to use for hyperparameter tuning. Default is 'accuracy'.

njobs

Number of jobs to run in parallel.

n_splits

Number of train/test splits.

search_type

A string indicating the type of search to use. 'grid' for GridSearchCV and 'random' for RandomizedSearchCV. Default is 'random'.

n_iter

An integer indicating the number of parameter settings that are sampled in RandomizedSearchCV. Only applies when search_type is 'random'.

max_features

Maximum number of features to be selected by default feature selection methods. Max features cannot exceed the total number of features in a dataset.

calculate_permutation_importance

A boolean indicating whether to calculate permutation feature importance. Default is FALSE.

perform_test_split

A boolean indicating whether to split the data into training and test sets so that the pipelines are also evaluated on an unseen test set. Default is FALSE.

random_state

An integer seed for the feature selection algorithms and the cross-validation procedure. The default, NULL, uses a different random seed each time an algorithm is run. Set it to a fixed integer for reproducibility; leave it as NULL for an unbiased estimate.
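The constraints described in the arguments above can be checked before calling the function. The following is a minimal base R sketch, not part of GeneSelectR itself: the toy data, the method name "Lasso", and the parameter values are made up for illustration. It also shows how the size of a grid relates to n_iter, assuming that a 'grid' search covers every combination while a 'random' search evaluates at most n_iter of them.

```r
# Toy inputs; dimensions, names, and values are made up for illustration
set.seed(42)
X <- as.data.frame(matrix(rnorm(20 * 8), nrow = 20, ncol = 8))
y <- factor(rep(c("Class1", "Class2"), each = 10))

max_features <- 5
n_iter <- 4

# Hypothetical method name and hyperparameter values
fs_method_names <- c("Lasso")
custom_fs_grids <- list(
  Lasso = list(C = c(0.1, 1, 10), max_iter = c(100L, 500L))
)

# y needs one label per row of X
stopifnot(length(y) == nrow(X))
# max_features cannot exceed the number of feature columns
stopifnot(max_features <= ncol(X))
# every grid must be named after a feature selection method
stopifnot(all(names(custom_fs_grids) %in% fs_method_names))

# Enumerate every parameter combination in the Lasso grid
all_settings <- expand.grid(custom_fs_grids$Lasso)
n_grid <- nrow(all_settings)  # settings an exhaustive search would cover

# A randomized search evaluates at most n_iter of those settings
n_sampled <- min(n_iter, n_grid)
sampled <- all_settings[sample(n_grid, n_sampled), , drop = FALSE]
```

Comparing n_grid with n_iter in this way gives a rough idea of whether search_type = 'grid' is affordable or whether 'random' with a smaller n_iter is the more practical choice.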

Value

Returns an object of class PipelineResults with the following elements:

  • best_pipeline: A list of the best-fitted pipelines for each feature selection method and data split.

  • cv_results: A list containing cross-validation results for each pipeline, including scores and other metrics.

  • inbuilt_feature_importance: A list of the inbuilt feature importance scores for each pipeline, aggregated across all data splits.

  • test_metrics: A data frame summarizing test metrics (precision, recall, F1 score, accuracy) for each pipeline, if a test split was performed.

  • cv_mean_score: A data frame summarizing the mean cross-validation scores for each pipeline across all data splits.

  • permutation_importance: A list of permutation importance scores for each pipeline, if permutation importance calculation was enabled.

This return structure allows for in-depth analysis of the feature selection methods and model performance.
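The elements listed above can be inspected once a run has finished. The sketch below uses a stand-in S4 class, MockPipelineResults, rather than the package's own class, and assumes that the documented elements are S4 slots read with `@` (if they turn out to be list elements, `$` would be used instead). The column names and scores are invented purely to make the example runnable.

```r
library(methods)

# Stand-in class with two of the documented elements;
# this is NOT GeneSelectR's own class definition
setClass(
  "MockPipelineResults",
  representation(cv_mean_score = "data.frame", test_metrics = "data.frame")
)

# Invented scores and hypothetical column names, for illustration only
mock <- new(
  "MockPipelineResults",
  cv_mean_score = data.frame(
    method = c("Lasso", "boruta"),
    mean_score = c(0.81, 0.78),
    stringsAsFactors = FALSE
  ),
  # left empty here to mimic a run without a test split (an assumption)
  test_metrics = data.frame()
)

# Read the mean cross-validation scores and pick the top-scoring method
scores <- mock@cv_mean_score
best <- scores[which.max(scores$mean_score), ]

# Check whether any test metrics were recorded before using them
has_test_metrics <- nrow(mock@test_metrics) > 0
```

With a real result object, the same pattern would apply to the other elements, such as cv_results or permutation_importance.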

Examples


if (GeneSelectR:::check_python_modules_available(c("numpy", "pandas", "sklearn", 'boruta'))) {
  # Create a mock dataset with 10 observations, 100 feature columns and a binary label
  set.seed(123) # for reproducibility
  n_rows <- 10
  n_features <- 100

  # Randomly generate feature data
  X <- as.data.frame(matrix(rnorm(n_rows * n_features), nrow = n_rows, ncol = n_features))
  # Ensure each feature has a variance greater than 0.85
  for(i in 1:ncol(X)) {
    while(var(X[[i]]) <= 0.85) {
      X[[i]] <- X[[i]] * 1.1
    }
  }
  colnames(X) <- paste0("Feature", 1:n_features)

  # Create a mock binary label column
  y <- factor(sample(c("Class1", "Class2"), n_rows, replace = TRUE))

  # Set up the environment
  GeneSelectR::configure_environment()
  GeneSelectR::set_reticulate_python()

  # Run GeneSelectR
  results <- GeneSelectR(X, y)

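  # Perform gene selection and evaluation using user-defined methods.
  # select_model and lasso are not defined elsewhere in this example; the
  # definitions below ASSUME they refer to scikit-learn's SelectFromModel and
  # LogisticRegression, imported through reticulate.
  sklearn <- reticulate::import("sklearn")
  select_model <- sklearn$feature_selection$SelectFromModel
  lasso <- sklearn$linear_model$LogisticRegression
  fs_methods <- list("Lasso" = select_model(lasso(penalty = 'l1',
                                                  C = 0.1,
                                                  solver = 'saga'),
                                            threshold = 'median'))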
  custom_fs_grids <- list("Lasso" = list('C' = c(0.1, 1, 10)))
  results <- GeneSelectR(X,
                         y,
                         max_features = 15,
                         custom_fs_methods = fs_methods,
                         custom_fs_grids = custom_fs_grids)
} else {
  message("Skipping example as not all required Python modules are available.")
}


GeneSelectR documentation built on May 29, 2024, 4:01 a.m.