auto_simon_ml: Automated Machine Learning Model Building

View source: R/functions.R

auto_simon_ml    R Documentation

Description

This function automates the building of machine learning models with the caret package. It supports both binary and multi-class classification and lets users specify which machine learning algorithms to train on the dataset. The function splits the dataset into training and testing sets, applies preprocessing steps, and trains models using cross-validation. It then computes performance metrics: a confusion matrix, AUROC (macro-averaged for multi-class problems), and prAUC (binary classification only).

Usage

auto_simon_ml(dataset_ml, settings)

Arguments

dataset_ml

A data frame containing the dataset for training. All columns other than the outcome column (and any columns listed in settings$excludedColumns) are treated as features.

settings

A list containing the following parameters:

  • outcome: A string specifying the name of the outcome column in dataset_ml. Defaults to "immunaut" if not provided.

  • excludedColumns: A vector of column names to be excluded from the training data. Defaults to NULL.

  • preProcessDataset: A vector of preprocessing steps to be applied (e.g., c("center", "scale", "medianImpute")). Defaults to NULL.

  • selectedPartitionSplit: A numeric value specifying the proportion of data to be used for training. Must be between 0 and 1. Defaults to 0.7.

  • selectedPackages: A character vector specifying the machine learning algorithms to be used for training (e.g., "nb", "rpart"). Defaults to c("nb", "rpart").
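As a rough illustration of how the defaults listed above could be filled in, the sketch below merges a partial settings list with those defaults using modifyList. The helper name fill_ml_settings is hypothetical and is not part of the package; only the parameter names and default values come from the list above. Treating the 0 and 1 bounds as exclusive is also an assumption.

```r
# Hypothetical helper (not part of immunaut): fill in the documented defaults
# for any setting the caller leaves out.
fill_ml_settings <- function(settings = list()) {
  defaults <- list(
    outcome = "immunaut",
    selectedPartitionSplit = 0.7,
    selectedPackages = c("nb", "rpart")
  )
  # excludedColumns and preProcessDataset default to NULL, so they are simply
  # absent from the merged list unless the caller supplies them.
  merged <- modifyList(defaults, settings)

  # Assumption: the training proportion must lie strictly between 0 and 1.
  p <- merged$selectedPartitionSplit
  stopifnot(is.numeric(p), length(p) == 1, p > 0, p < 1)
  merged
}

settings_ml <- fill_ml_settings(list(
  preProcessDataset = c("center", "scale", "medianImpute"),
  selectedPackages = "rpart"
))
```

Here settings_ml keeps the caller's preProcessDataset and selectedPackages and picks up the default outcome and partition split.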

Details

The function performs preprocessing (e.g., centering, scaling, and imputation of missing values) on the dataset based on the provided settings. It splits the data into training and testing sets using the specified partition, trains models using cross-validation, and computes performance metrics.
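The steps described above can be approximated directly with caret. The following is a minimal sketch, not the package's actual implementation: it uses the built-in iris data, a single algorithm ("rpart"), and 5-fold cross-validation. Fitting the preprocessing on the training set only and the choice of five folds are assumptions made for illustration.

```r
library(caret)

set.seed(1337)
data(iris)
outcome <- "Species"
features <- setdiff(names(iris), outcome)

# 1. Stratified train/test split (70% of rows for training).
train_idx <- createDataPartition(iris[[outcome]], p = 0.7, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# 2. Estimate preprocessing on the training features, then apply it to both sets.
pp <- preProcess(train_set[, features],
                 method = c("center", "scale", "medianImpute"))
train_x <- predict(pp, train_set[, features])
test_x  <- predict(pp, test_set[, features])

# 3. Train with cross-validation (the fold count is an assumption).
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
fit <- train(x = train_x, y = train_set[[outcome]],
             method = "rpart", trControl = ctrl)

# 4. Evaluate on the held-out test set.
pred       <- predict(fit, newdata = test_x)
cm         <- confusionMatrix(pred, test_set[[outcome]])
metrics    <- postResample(pred, test_set[[outcome]])  # Accuracy and Kappa
importance <- varImp(fit)
```

Repeating steps 3 and 4 for each entry of settings$selectedPackages would give one fitted model and one set of metrics per algorithm.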

For binary classification problems, the function calculates AUROC and prAUC. For multi-class problems, it calculates a macro-averaged AUROC and does not compute prAUC.
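To make the two AUROC variants concrete, here is a small base R sketch. It computes a binary AUROC from the rank-sum (Mann-Whitney) identity and a macro average as the unweighted mean of one-vs-rest AUROCs. The one-vs-rest definition is an assumption: the package may use a different multi-class formulation, and the function names auroc and macro_auroc are illustrative only.

```r
# Binary AUROC via the rank-sum identity; tied scores receive average ranks.
auroc <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  r <- rank(scores)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Macro-averaged AUROC: one one-vs-rest AUROC per class, then the plain mean.
macro_auroc <- function(prob_matrix, truth) {
  classes <- colnames(prob_matrix)
  per_class <- vapply(
    classes,
    function(cl) auroc(prob_matrix[, cl], truth == cl),
    numeric(1)
  )
  mean(per_class)
}

# Toy binary example: positives score strictly higher than negatives.
binary_auc <- auroc(c(0.9, 0.8, 0.3, 0.2), c(TRUE, TRUE, FALSE, FALSE))

# Toy three-class example: rows are samples, columns are class probabilities.
truth <- c("a", "a", "b", "b", "c", "c")
probs <- matrix(
  c(0.8, 0.1, 0.1,
    0.6, 0.3, 0.1,
    0.2, 0.7, 0.1,
    0.3, 0.4, 0.3,
    0.1, 0.2, 0.7,
    0.2, 0.5, 0.3),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c"))
)
macro_auc <- macro_auroc(probs, truth)
```

Macro averaging weights every class equally regardless of its size, so rare classes influence the score as much as common ones.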

The function returns a list of trained models along with their performance metrics, including a confusion matrix, variable importance, and post-resample metrics.

Value

A list where each element corresponds to a trained model for one of the algorithms specified in settings$selectedPackages. Each element contains:

  • info: General information about the model, including resampling indices, problem type, and outcome mapping.

  • training: The trained model object and variable importance.

  • predictions: Predictions on the test set, including class probabilities, the confusion matrix, post-resample statistics, AUROC, and prAUC (binary classification only).
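A returned object of this shape can be inspected with ordinary list tools. The sketch below uses a stand-in list rather than real output: only the three top-level component names (info, training, predictions) come from the description above, while the empty inner contents and the assumption that elements are named after their algorithm are placeholders for illustration.

```r
# Stand-in for the value returned by auto_simon_ml(). The inner lists are
# empty placeholders; naming elements by algorithm is an assumption.
ml_results <- list(
  nb    = list(info = list(), training = list(), predictions = list()),
  rpart = list(info = list(), training = list(), predictions = list())
)

expected <- c("info", "training", "predictions")

# For each trained algorithm, record which documented components are present.
component_check <- vapply(
  ml_results,
  function(model) expected %in% names(model),
  logical(length(expected))
)
rownames(component_check) <- expected

# One-level overview of a single model's entry.
str(ml_results[["rpart"]], max.level = 1)
```

The same loop pattern (vapply or lapply over the result list) can be used to collect a metric from each model's predictions component once its field names are known.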

Examples

## Not run: 
dataset <- read.csv("fc_wo_noise.csv", header = TRUE, row.names = 1)

# Generate a file header for the dataset to use in downstream analysis
file_header <- generate_file_header(dataset)

settings <- list(
    fileHeader = file_header,
    # Columns selected for analysis
    selectedColumns = c("ExampleColumn1", "ExampleColumn2"), 
    clusterType = "Louvain",
    removeNA = TRUE,
    preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
    target_clusters_range = c(3,4),
    resolution_increments = c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
    min_modularities = c(0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9),
    pickBestClusterMethod = "Modularity",
    seed = 1337
)

result <- immunaut(dataset, settings)
dataset_ml <- result$dataset$original
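# NOTE: `tsne_clust` and `i` are not defined in this snippet; they are assumed
# to come from surrounding code that holds the clustering results.
dataset_ml$pandora_cluster <- tsne_clust[[i]]$info.norm$pandora_cluster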
dataset_ml <- dplyr::rename(dataset_ml, immunaut = pandora_cluster)
dataset_ml <- dataset_ml[, c("immunaut", setdiff(names(dataset_ml), "immunaut"))]
settings_ml <- list(
    excludedColumns = c("ExampleColumn0"),
    preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
    selectedPartitionSplit = 0.7,  # Proportion of rows used for training (documented default)
    # Method names follow caret's case-sensitive spelling
    selectedPackages = c("rf", "RRF", "RRFglobal", "rpart2", "C5.0", "sparseLDA", 
    "gcvEarth", "cforest", "gaussprPoly", "monmlp", "slda", "spls"),
    trainingTimeout = 180  # Timeout 3 minutes
)
ml_results <- auto_simon_ml(dataset_ml, settings_ml)

## End(Not run)


immunaut documentation built on April 12, 2025, 1:22 a.m.