immunaut: Main function to carry out Immunaut Analysis
In immunaut: Machine Learning Immunogenicity and Vaccine Response Analysis

Description

This function performs dimensionality reduction and clustering on a dataset. It applies the requested preprocessing steps, reduces dimensionality with t-SNE, clusters the result with one of several methods, and generates the associated plots. Behavior is controlled by user-defined settings, with defaults for anything not specified.

Usage

immunaut(dataset, settings = list())

Arguments

dataset

A data frame representing the dataset on which the analysis will be performed. The dataset must contain numeric columns for dimensionality reduction and clustering.

settings

A named list of settings for the analysis. The argument defaults to an empty list, and any setting that is not supplied falls back to the default listed below. The settings list may contain:

fileHeader

A data frame mapping the original column names to remapped column names. Used for t-SNE input preparation.

selectedColumns

Character vector of columns to be used for the analysis. Defaults to NULL.

cutOffColumnSize

Numeric. The maximum number of columns allowed in the dataset. Defaults to 50000.

excludedColumns

Character vector of columns to exclude from the analysis. Defaults to NULL.

groupingVariables

Character vector of columns to use for grouping the data during analysis. Defaults to NULL.

colorVariables

Character vector of columns to use for coloring in the plots. Defaults to NULL.

preProcessDataset

Character vector of preprocessing methods to apply (e.g., scaling, normalization). Defaults to NULL.

fontSize

Numeric. Font size for plots. Defaults to 12.

pointSize

Numeric. Size of points in plots. Defaults to 1.5.

theme

Character. The ggplot2 theme to use (e.g., "theme_gray"). Defaults to "theme_gray".

colorPalette

Character. Color palette for plots (e.g., "RdPu"). Defaults to "RdPu".

aspect_ratio

Numeric. The aspect ratio of plots. Defaults to 1.

clusterType

Character. The clustering method to use. Options are "Louvain", "Hierarchical", "Mclust", or "Density". Defaults to "Louvain".

removeNA

Logical. Whether to remove rows with NA values. Defaults to FALSE.

datasetAnalysisGrouped

Logical. Whether to perform grouped dataset analysis. Defaults to FALSE.

plot_size

Numeric. The size of the plot. Defaults to 12.

knn_clusters

Numeric. The number of clusters for KNN-based clustering. Defaults to 250.

perplexity

Numeric. The perplexity parameter for t-SNE. Defaults to NULL (automatically determined).

exaggeration_factor

Numeric. The exaggeration factor for t-SNE. Defaults to NULL.

max_iter

Numeric. The maximum number of iterations for t-SNE. Defaults to NULL.

theta

Numeric. The Barnes-Hut approximation parameter for t-SNE. Defaults to NULL.

eta

Numeric. The learning rate for t-SNE. Defaults to NULL.

clustLinkage

Character. Linkage method for hierarchical clustering. Defaults to "ward.D2".

clustGroups

Numeric. The number of groups for hierarchical clustering. Defaults to 9.

distMethod

Character. Distance metric for clustering. Defaults to "euclidean".

minPtsAdjustmentFactor

Numeric. Adjustment factor for the minimum points in DBSCAN clustering. Defaults to 1.

epsQuantile

Numeric. Quantile to compute the epsilon parameter for DBSCAN clustering. Defaults to 0.9.

assignOutliers

Logical. Whether to assign outliers in the clustering step. Defaults to TRUE.

excludeOutliers

Logical. Whether to exclude outliers from clustering. Defaults to TRUE.

legendPosition

Character. Position of the legend in plots (e.g., "right", "bottom"). Defaults to "right".

datasetAnalysisClustLinkage

Character. Linkage method for dataset-level analysis. Defaults to "ward.D2".

datasetAnalysisType

Character. Type of dataset analysis (e.g., "heatmap"). Defaults to "heatmap".

datasetAnalysisRemoveOutliersDownstream

Logical. Whether to remove outliers during downstream dataset analysis (e.g., machine learning). Defaults to FALSE.

datasetAnalysisSortColumn

Character. The column used to sort dataset analysis results. Defaults to "cluster".

datasetAnalysisClustOrdering

Numeric. The order of clusters for analysis. Defaults to 1.

anyNAValues

Logical. Whether the dataset contains NA values. Defaults to FALSE.

categoricalVariables

Logical. Whether the dataset contains categorical variables. Defaults to FALSE.

resolution_increments

Numeric vector. The resolution increments to be used for Louvain clustering. Defaults to c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5).

min_modularities

Numeric vector. The minimum modularities to test for clustering. Defaults to c(0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9).

target_clusters_range

Numeric vector. The acceptable range for the number of clusters, given as c(min, max). Defaults to c(3, 6).

pickBestClusterMethod

Character. The method to use for picking the best clustering result ("Modularity", "Silhouette", or "SIMON"). Defaults to "Modularity".

weights

List. Weights for evaluating clusters based on AUROC, modularity, and silhouette. Defaults to list(AUROC = 0.5, modularity = 0.3, silhouette = 0.2). These weights are applied to help choose the most relevant clusters based on user goals:

AUROC: Weight for predictive performance (area under the receiver operating characteristic curve). Prioritize this when predictive accuracy is the main goal. For predictive analysis, a recommended configuration could be list(AUROC = 0.8, modularity = 0.1, silhouette = 0.1).
modularity: Weight for modularity score, which indicates the strength of clustering. Higher modularity suggests that clusters are well-separated. To prioritize well-separated clusters, use a configuration like list(AUROC = 0.4, modularity = 0.4, silhouette = 0.2).
silhouette: Weight for silhouette score, which measures how well each point fits its own cluster compared with the nearest other cluster. Useful when cluster cohesion and interpretability are desired. For a balance between the three criteria, a suggested configuration is list(AUROC = 0.4, modularity = 0.3, silhouette = 0.3).
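
To make the effect of the weights concrete, here is a minimal sketch in base R. It assumes the three metrics are combined as a plain weighted sum, which this page does not confirm; the package may normalize or combine them differently. The candidate scores and the helper weighted_score() are hypothetical and exist only for illustration.

```r
# Hypothetical scores for three candidate clusterings (illustrative values only).
candidates <- data.frame(
  candidate  = c("A", "B", "C"),
  AUROC      = c(0.70, 0.85, 0.55),
  modularity = c(0.80, 0.50, 0.90),
  silhouette = c(0.60, 0.40, 0.70),
  stringsAsFactors = FALSE
)

# Two weight configurations taken from the documentation above.
default_weights    <- list(AUROC = 0.5, modularity = 0.3, silhouette = 0.2)
predictive_weights <- list(AUROC = 0.8, modularity = 0.1, silhouette = 0.1)

# Both documented configurations sum to 1; verify before using them.
stopifnot(isTRUE(all.equal(sum(unlist(default_weights)), 1)))
stopifnot(isTRUE(all.equal(sum(unlist(predictive_weights)), 1)))

# ASSUMPTION: a plain weighted sum of the three metrics (not confirmed
# to be what immunaut does internally).
weighted_score <- function(scores, weights) {
  metric_names <- names(weights)
  as.vector(as.matrix(scores[, metric_names]) %*% unlist(weights))
}

candidates$score_default    <- weighted_score(candidates, default_weights)
candidates$score_predictive <- weighted_score(candidates, predictive_weights)

best_default    <- candidates$candidate[which.max(candidates$score_default)]
best_predictive <- candidates$candidate[which.max(candidates$score_predictive)]

print(candidates)
print(c(default = best_default, predictive = best_predictive))
```

With these made-up scores, the default weights favor the candidate with the best overall mix, while the AUROC-heavy weights favor the candidate with the highest predictive performance. This is the kind of trade-off the weights setting is meant to express.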

Value

A list containing the following:

tsne_calc: The t-SNE results object.
tsne_clust: The clustering results.
dataset: A list containing the original dataset, the preprocessed dataset, and a machine-learning-ready dataset.
clusters: The final cluster assignments.
settings: The list of settings used for the analysis.
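
Because downstream code usually relies on these components by name, a small check can catch an incomplete result early. The sketch below is not part of immunaut: missing_components() is a hypothetical helper, and mock_result is a stand-in list with placeholder contents that only mirrors the component names listed above.

```r
# Component names taken from the Value section above.
expected_components <- c("tsne_calc", "tsne_clust", "dataset", "clusters", "settings")

# Hypothetical helper (not part of immunaut): names of missing components.
missing_components <- function(result) {
  setdiff(expected_components, names(result))
}

# Stand-in object with placeholder contents, used only to exercise the helper.
mock_result <- list(
  tsne_calc  = NA,
  tsne_clust = NA,
  dataset    = list(),
  clusters   = NA,
  settings   = list()
)

complete_check <- missing_components(mock_result)
partial_check  <- missing_components(mock_result[c("clusters", "settings")])

print(complete_check)
print(partial_check)
```

In real use, the object returned by immunaut() would be passed in place of mock_result.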

Examples


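  # Simulated data: 100 samples x 20 numeric features
  set.seed(42)
  data <- matrix(runif(2000), ncol = 20)

  # Louvain clustering with a reduced search grid
  settings <- list(
    clusterType = "Louvain",
    resolution_increments = c(0.05, 0.1),
    min_modularities = c(0.3, 0.5)
  )
  result <- immunaut(data.frame(data), settings)
  print(result$clusters)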
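
Real datasets often mix numeric measurements with identifiers and labels, while the dataset argument expects numeric columns for dimensionality reduction and clustering. The following sketch shows one way to prepare such data and to build a settings list for hierarchical clustering using only the settings documented above. The data frame raw and its column names are hypothetical. The immunaut() call is left commented out because it needs the package installed, and whether non-numeric columns may be passed through is not stated on this page, so the sketch subsets to numeric columns first.

```r
# Hypothetical mixed-type data: three numeric markers plus an ID and a label.
set.seed(1)
raw <- data.frame(
  subject_id = sprintf("S%03d", 1:100),
  marker_a   = rnorm(100),
  marker_b   = rnorm(100),
  marker_c   = rnorm(100),
  outcome    = sample(c("high", "low"), 100, replace = TRUE),
  stringsAsFactors = FALSE
)

# Identify the numeric columns and keep only those for the analysis.
numeric_cols <- names(raw)[vapply(raw, is.numeric, logical(1))]
numeric_only <- raw[, numeric_cols, drop = FALSE]

# Settings for hierarchical clustering; all names come from the Arguments section.
settings <- list(
  selectedColumns = numeric_cols,
  clusterType     = "Hierarchical",
  clustLinkage    = "ward.D2",
  clustGroups     = 3,
  distMethod      = "euclidean"
)

str(settings)

# With the immunaut package installed, the analysis would then be run as:
# result <- immunaut(numeric_only, settings)
# print(result$clusters)
```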

immunaut documentation built on April 12, 2025, 1:22 a.m.