calculate_auc: Prioritize cell types involved in a biological process

View source: R/calculate_auc.R

calculate_auc    R Documentation

Prioritize cell types involved in a biological process

Description

Prioritize cell types involved in a complex biological process by training a machine-learning model to predict sample labels (e.g., disease vs. control, treated vs. untreated, or time post-stimulus) and evaluating the performance of the model in cross-validation.

Usage

calculate_auc(
  input,
  meta = NULL,
  label_col = "label",
  cell_type_col = "cell_type",
  n_subsamples = 50,
  subsample_size = 20,
  folds = 3,
  min_cells = NULL,
  var_quantile = 0.5,
  feature_perc = 0.5,
  n_threads = 4,
  show_progress = T,
  augur_mode = c("default", "velocity", "permute"),
  classifier = c("rf", "lr"),
  rf_params = list(trees = 100, mtry = 2, min_n = NULL, importance = "accuracy"),
  lr_params = list(mixture = 1, penalty = "auto")
)

Arguments

input

a matrix, a data frame, or a Seurat, monocle, or SingleCellExperiment object containing gene expression values (genes in rows, cells in columns) and, optionally, metadata about each cell

meta

a data frame containing metadata about the input gene-by-cell matrix, at minimum containing the cell type for each cell and the labels (e.g., group, disease, timepoint); can be left as NULL if input is a Seurat or monocle object

label_col

the column of the meta data frame, or of the metadata container in the Seurat or monocle object, that contains condition labels (e.g., disease, timepoint) for each cell in the gene-by-cell expression matrix; defaults to "label"

cell_type_col

the column of the meta data frame, or of the metadata container in the Seurat or monocle object, that contains cell type labels for each cell in the gene-by-cell expression matrix; defaults to "cell_type"

n_subsamples

the number of random subsamples of fixed size to draw from the complete dataset, for each cell type; defaults to 50. Set to 0 to omit subsampling altogether, calculating performance on the entire dataset, but note that this may introduce bias due to cell type or label class imbalance. Note that when setting augur_mode = "permute", values less than 100 will be replaced with a default of 500.

subsample_size

the number of cells per type to subsample randomly from each experimental condition, if n_subsamples is greater than 1; defaults to 20
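As a rough illustration of what n_subsamples and subsample_size control, the base R sketch below draws repeated, fixed-size subsamples from each condition for a single cell type. The toy labels and the helper name draw_subsample are hypothetical, and this is not Augur's internal code.

```r
set.seed(1)

# Toy metadata for a single cell type: 60 control and 35 treated cells
labels <- c(rep("control", 60), rep("treated", 35))

n_subsamples <- 50
subsample_size <- 20

# Hypothetical helper: pick `size` cell indices from each condition
draw_subsample <- function(labels, size) {
  idx_by_label <- split(seq_along(labels), labels)
  unlist(lapply(idx_by_label, function(idx) idx[sample.int(length(idx), size)]),
         use.names = FALSE)
}

subsamples <- replicate(n_subsamples,
                        draw_subsample(labels, subsample_size),
                        simplify = FALSE)

# Each subsample is balanced across conditions
table(labels[subsamples[[1]]])
```

Because every subsample holds the same number of cells per condition, the unequal group sizes in the full data (60 vs. 35 here) do not carry over into any single subsample.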

folds

the number of folds of cross-validation to run; defaults to 3. Because the cells in each subsample are split across the folds, increasing this parameter without also increasing subsample_size leaves fewer cells in each fold
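The interaction between folds and subsample_size can be checked with a little arithmetic. The sketch below assigns the cells of one subsample to folds separately within each condition; treating the folds as stratified by condition is an assumption made for illustration, not something this page states.

```r
set.seed(1)

subsample_size <- 20
folds <- 3

# One subsample: 20 cells from each of two conditions
labels <- rep(c("control", "treated"), each = subsample_size)

# Assign fold ids within each condition so the folds stay balanced
fold_id <- integer(length(labels))
for (lab in unique(labels)) {
  idx <- which(labels == lab)
  fold_id[idx] <- sample(rep_len(seq_len(folds), length(idx)))
}

# Cells per condition (rows) in each fold (columns)
cells_per_fold <- table(labels, fold_id)
cells_per_fold
```

With the defaults, each held-out fold contains only 6 or 7 cells per condition. Raising folds to 10 while keeping subsample_size at 20 would leave 2 cells per condition in each fold.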

min_cells

the minimum number of cells for a particular cell type in each condition in order to retain that type for analysis; defaults to subsample_size
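The min_cells rule can be previewed on the metadata before running the analysis. The following base R sketch uses made-up cell counts and keeps a cell type only if every condition reaches the threshold.

```r
# Toy metadata: three cell types across two conditions
meta <- data.frame(
  cell_type = c(rep("neuron", 70), rep("astrocyte", 45), rep("microglia", 30)),
  label = c(rep(c("control", "treated"), times = c(40, 30)),
            rep(c("control", "treated"), times = c(25, 20)),
            rep(c("control", "treated"), times = c(22, 8)))
)

min_cells <- 20

# Cells per cell type (rows) and condition (columns)
counts <- table(meta$cell_type, meta$label)

# Keep a cell type only if every condition reaches min_cells
keep <- rownames(counts)[apply(counts >= min_cells, 1, all)]
keep
```

Here microglia would be dropped, because only 8 treated cells are available.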

var_quantile

the quantile of highly variable genes to retain for each cell type using the variable gene filter (select_variance); defaults to 0.5

feature_perc

the proportion of genes that are randomly selected as features for input to the classifier in each subsample using the random gene filter (select_random); defaults to 0.5
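The two filters governed by var_quantile and feature_perc can be pictured as successive steps. The sketch below is a deliberate simplification: it ranks genes by raw variance, and it reads var_quantile as the variance quantile below which genes are dropped. Both choices are assumptions, and select_variance and select_random may work differently internally.

```r
set.seed(1)

# Toy expression matrix: 100 genes (rows) x 40 cells (columns),
# with gene i drawn from a Poisson distribution with mean i / 10
expr <- matrix(rpois(100 * 40, lambda = rep(1:100, times = 40) / 10),
               nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("cell", 1:40)))

var_quantile <- 0.5
feature_perc <- 0.5

# Step 1 (simplified): keep genes at or above the var_quantile of variance
gene_var <- apply(expr, 1, var)
variable_genes <- names(gene_var)[gene_var >= quantile(gene_var, var_quantile)]

# Step 2: randomly keep a proportion feature_perc of the remaining genes
n_keep <- round(length(variable_genes) * feature_perc)
features <- sample(variable_genes, n_keep)

length(variable_genes)
length(features)
```

Under these assumptions the defaults pass roughly a quarter of the genes to the classifier in each subsample: half survive the variance filter, and half of those are drawn at random.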

n_threads

the number of threads to use for parallelization; defaults to 4.

show_progress

if TRUE, display a progress bar for the analysis with estimated time remaining; defaults to TRUE

augur_mode

one of "default", "velocity", or "permute". Setting augur_mode = "velocity" disables feature selection, assuming feature selection has been performed by the RNA velocity procedure to produce the input matrix, while setting augur_mode = "permute" will generate a null distribution of AUCs for each cell type by permuting the labels
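To make the idea of a permutation null distribution concrete, the sketch below computes a rank-based AUC on toy classifier scores and then recomputes it after shuffling the labels. It skips model training entirely and shuffles labels against fixed scores, which is a simplification for illustration rather than a description of how Augur runs its permutations. The function name auc and the toy scores are hypothetical.

```r
set.seed(1)

# Rank-based AUC (equivalent to the scaled Mann-Whitney U statistic)
auc <- function(score, is_positive) {
  r <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy classifier scores for 40 cells, 20 per condition
is_treated <- rep(c(FALSE, TRUE), each = 20)
score <- rnorm(40, mean = ifelse(is_treated, 1, 0))

observed <- auc(score, is_treated)

# Null distribution: recompute the AUC after shuffling the labels
null_aucs <- replicate(500, auc(score, sample(is_treated)))

# Fraction of permuted AUCs at least as large as the observed one
mean(null_aucs >= observed)
```

The permuted AUCs cluster around 0.5, so an observed AUC far into the upper tail of that distribution suggests the cell type separates the conditions better than chance.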

classifier

the classifier to use in calculating area under the curve, one of "rf" (random forest) or "lr" (logistic regression); defaults to "rf", which is the recommended setting

rf_params

for classifier == "rf", a list of parameters for the random forest models, containing the following items (see rand_forest from the parsnip package):

"mtry"

the number of features randomly sampled at each split in the random forest classifier; defaults to 2

"trees"

the number of trees in the random forest classifier; defaults to 100

"min_n"

the minimum number of observations to split a node in the random forest classifier; defaults to NULL

"importance"

the method of calculating feature importances to use; defaults to "accuracy"; can also specify "gini"

lr_params

for classifier == "lr", a list of parameters for the logistic regression models, containing the following items (see logistic_reg from the parsnip package):

"mixture"

the proportion of L1 regularization in the model; defaults to 1

"penalty"

the total amount of regularization in the model; defaults to "auto", which uses cv.glmnet to set the penalty

Details

If a Seurat object is provided as input, Augur will use the default assay (i.e., whatever GetAssayData returns) as input. To use a different assay, provide the expression matrix and metadata as input separately, using the input and meta arguments.
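Supplying the expression matrix and metadata separately might look like the sketch below. The toy data are random and stand in for a matrix extracted from whichever assay is of interest. The call uses only arguments listed under Usage, is guarded so that it runs only when Augur is installed, and has not been checked against a specific Augur version.

```r
set.seed(1)

n_genes <- 200
n_cells <- 120

# Toy counts: genes in rows, cells in columns, with varying mean expression
expr <- matrix(rpois(n_genes * n_cells,
                     lambda = rep(seq(0.5, 10, length.out = n_genes),
                                  times = n_cells)),
               nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("cell", seq_len(n_cells))))

# One metadata row per cell, using the default column names
meta <- data.frame(
  cell_type = rep(c("typeA", "typeB"), each = 60),
  label = rep(rep(c("control", "treated"), each = 30), times = 2),
  row.names = colnames(expr)
)

# Only run the analysis if Augur is installed
if (requireNamespace("Augur", quietly = TRUE)) {
  augur <- Augur::calculate_auc(expr, meta = meta,
                                label_col = "label",
                                cell_type_col = "cell_type",
                                n_threads = 1)
  print(augur$AUC)
}
```

Each cell type has 30 cells per condition here, which clears the default min_cells threshold of 20. Since the toy labels carry no signal, AUCs near 0.5 would be expected.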

Value

a list of class "Augur", containing the following items:

  1. X: the numeric matrix (or data frame or sparse matrix, depending on the input) containing gene expression values for each cell in the dataset

  2. y: the vector of experimental condition labels being predicted

  3. cell_types: the vector of cell type labels

  4. parameters: the parameters provided to this function as input

  5. results: the area under the curve for each cell type, in each fold, in each subsample, in the comparison of interest, as well as a series of other classification metrics

  6. feature_importance: the importance of each feature for calculating the AUC, above. For random forest classifiers, this is the mean decrease in accuracy or Gini index. For logistic regression classifiers, these are the standardized regression coefficients, computed using the Agresti method

  7. AUC: a summary of the mean AUC for each cell type (for continuous experimental conditions, this is replaced by a CCC item that records the mean concordance correlation coefficient for each cell type)
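The relationship between the per-fold results item and the AUC summary can be sketched with a mock table. The column names below (cell_type, subsample, fold, auc) are hypothetical stand-ins, not the actual layout of the results item, and the values are simulated.

```r
set.seed(1)

# Hypothetical per-fold results: 2 cell types x 50 subsamples x 3 folds
results <- expand.grid(cell_type = c("typeA", "typeB"),
                       subsample = 1:50,
                       fold = 1:3,
                       stringsAsFactors = FALSE)
results$auc <- ifelse(results$cell_type == "typeA",
                      rnorm(nrow(results), 0.85, 0.05),
                      rnorm(nrow(results), 0.55, 0.05))

# Mean AUC per cell type, highest first
summary_auc <- aggregate(auc ~ cell_type, data = results, FUN = mean)
summary_auc <- summary_auc[order(-summary_auc$auc), ]
summary_auc
```

Sorting by mean AUC gives the prioritization: cell types near the top are those whose expression best separates the experimental conditions.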


neurorestore/Augur documentation built on Oct. 28, 2024, 9:41 a.m.