exir: Experimental data-based Integrated Ranking
In influential: Identification and Classification of the Most Influential Nodes

exir	R Documentation

Experimental data-based Integrated Ranking

Description

This function runs the Experimental data-based Integrated Ranking (ExIR) model for the classification and ranking of top candidate features. The input data can come from any type of experiment, such as transcriptomics or proteomics. A shiny app has also been developed for running the ExIR model, visualizing its results, and computationally simulating the knockout and/or up-regulation of its top candidate outputs. The app is accessible using the 'influential::runShinyApp("ExIR")' command. You can also access the shiny app online at https://influential.erc.monash.edu/.

Usage

exir(
  Desired_list = NULL,
  Diff_data,
  Diff_value,
  Regr_value = NULL,
  Sig_value,
  Exptl_data,
  Exptl_data_type = c("bulk", "sc"),
  condition = "condition",
  Exptl_data_orientation = c("features_rows", "samples_rows"),
  assay = "RNA",
  layer = "counts",
  normalize = FALSE,
  pseudo_sample = FALSE,
  pseudo_samples_per_group = 100,
  Exptl_data_size_check = TRUE,
  feature_filter = TRUE,
  min_feature_prevalence = NULL,
  min_feature_total = NULL,
  min_feature_variance = 1e-12,
  always_keep_diff_features = TRUE,
  cor_thresh_method = "mr",
  r = 0.5,
  mr = 20,
  max.connections = 50000,
  alpha = 0.05,
  num_trees = 500,
  mtry = NULL,
  num_permutations = 50,
  inf_const = 10^10,
  ncores = "default",
  seed = 1234,
  verbose = TRUE
)

Arguments

`Desired_list`	(Optional) A character vector of your desired features. This vector could be, for instance, a list of features obtained from cluster analysis, time-course analysis, or a list of dysregulated features with a specific sign.
`Diff_data`	A dataframe of all significant differential/regression data and their statistical significance values (p-value/adjusted p-value). Note that the differential data should be in the log fold-change (log2FC) format. You may have selected a proportion of the differential data as the significant ones according to your desired thresholds. A function, named `diff_data.assembly`, has also been provided for the convenient assembling of the Diff_data dataframe.
`Diff_value`	An integer vector containing the column number(s) of the differential data in the Diff_data dataframe. The differential data could result from any type of differential data analysis. One example could be the fold changes (FCs) obtained from differential expression analyses. Users may provide as many differential data columns as they wish.
`Regr_value`	(Optional) An integer vector containing the column number(s) of the regression data in the Diff_data dataframe. The regression data could result from any type of regression data analysis or other analyses such as time-course data analyses that are based on regression models.
`Sig_value`	An integer vector containing the column number(s) of the significance values (p-value/adjusted p-value) of both differential and regression data (if provided). Providing significance values for the regression data is optional.
`Exptl_data`	Experimental data used by the ExIR model. This can be a data frame, tibble, matrix, sparse matrix such as a `dgCMatrix`, or a Seurat object. For non-Seurat inputs, the expected orientation is controlled by `Exptl_data_orientation`. By default, features/genes are expected to be in rows and samples/cells in columns, which is the usual omics layout. Internally, ExIR converts the data to its required analysis format, with samples/cells in rows and features in columns.
`Exptl_data_type`	Character string specifying the experimental data type. One of `"bulk"` or `"sc"`. This is used for data-type checks, optional normalization, and pseudo-sampling. For `"bulk"`, the input expression data may be either already normalized/log-transformed or raw count-like data when `normalize = TRUE`. For `"sc"`, raw counts are recommended and raw counts are required when `pseudo_sample = TRUE`.
`condition`	A character string or character/factor vector specifying the sample/cell conditions. If a single character string is supplied, it is interpreted as the name of the condition column/row in `Exptl_data`, or as the name of a metadata column when `Exptl_data` is a Seurat object. If a vector is supplied, it must have the same length and order as the samples/cells in `Exptl_data`. Default is `"condition"`.
`Exptl_data_orientation`	Character string specifying the orientation of non-Seurat `Exptl_data`. One of `"features_rows"` or `"samples_rows"`. If `"features_rows"`, features are rows and samples/cells are columns. If `"samples_rows"`, samples/cells are rows and features are columns. Default is `"features_rows"`.
`assay`	Character string specifying the assay to use when `Exptl_data` is a Seurat object. Default is `"RNA"`.
`layer`	Character string specifying the assay layer to use when `Exptl_data` is a Seurat object. For pseudo-sampling of single-cell data, this should usually be a raw-count layer such as `"counts"`. Default is `"counts"`.
`normalize`	Logical; whether to normalize count-like input data using TMM normalization followed by logCPM transformation with edgeR. Default is `FALSE`. For `Exptl_data_type = "bulk"`, this can be used when raw bulk RNA-seq count-like data are supplied. This normalization strategy is appropriate for many bulk omics count datasets, especially bulk RNA-seq, but users should confirm that TMM/logCPM normalization is suitable for their specific data modality. If the data modality requires a different normalization strategy, users should pre-normalize their data and set `normalize = FALSE`. For `Exptl_data_type = "sc"`, normalization is automatically applied after pseudo-bulking when `pseudo_sample = TRUE`. If `Exptl_data_type = "sc"` and `pseudo_sample = FALSE`, users should provide pre-normalized single-cell data and keep `normalize = FALSE`.
`pseudo_sample`	Logical; whether to perform pseudo-sampling before running ExIR. Pseudo-sampling is recommended when the number of cells/samples is greater than 500 or when computational resources are limited. For bulk data, pseudo-sampling averages normalized log-expression values within non-overlapping condition-specific groups. For single-cell data, pseudo-sampling sums raw counts within non-overlapping condition-specific groups, followed by TMM normalization and logCPM transformation using edgeR. Default is `FALSE`.
`pseudo_samples_per_group`	Integer specifying the target number of pseudo-samples to generate per condition group when `pseudo_sample = TRUE`. For example, if one condition contains 500 cells/samples and `pseudo_samples_per_group = 100`, each pseudo-sample will contain 5 cells/samples. If another condition contains 536 cells/samples, 64 pseudo-samples will contain 5 cells/samples and 36 pseudo-samples will contain 6 cells/samples. Default is `100`.
`Exptl_data_size_check`	Logical; whether to check the number of input samples/cells and, in interactive sessions, prompt the user to consider pseudo-sampling when more than 500 samples/cells are provided and `pseudo_sample = FALSE`. In non-interactive sessions, a message is shown and the function continues. Default is `TRUE`.
`feature_filter`	Logical; whether to apply conservative feature filtering before running RF, PCA and correlation analysis. This filter is not a highly variable gene filter. It removes only features with insufficient expression/prevalence or essentially zero variance, which are unlikely to produce reliable correlations. Default is `TRUE`.
`min_feature_prevalence`	Integer or `NULL`; minimum number of samples/cells/pseudo-samples in which a feature must be non-zero to be retained. If `NULL`, an adaptive conservative threshold is used based on `Exptl_data_type`, pseudo-sampling, and sample size.
`min_feature_total`	Numeric or `NULL`; minimum total abundance/count/expression support required for a feature to be retained. If `NULL`, this criterion is not used. For raw count-like data, values such as 10 or 20 may be useful. For normalized/log-scale data, users should usually leave this as `NULL`.
`min_feature_variance`	Numeric; minimum variance required for a feature to be retained. This is intended only to remove zero-variance or near-zero-variance features, not to perform HVG selection. Default is `1e-12`.
`always_keep_diff_features`	Logical; whether to always retain features present in `Diff_data` and `Desired_list`, even if they fail the conservative expression/prevalence filters. This helps preserve candidate differential features while still reducing uninformative non-DE background features. Default is `TRUE`.
`cor_thresh_method`	A character string indicating the method for filtering the correlation results, either "mr" (default; Mutual Rank) or "cor.coefficient".
`r`	The threshold of Spearman correlation coefficient for the selection of correlated features (default is 0.5).
`mr`	An integer determining the threshold of mutual rank for the selection of correlated features (default is 20). Note that higher mr values considerably increase the computation time.
`max.connections`	The maximum number of connections to be included in the association network. Higher max.connections might increase the computation time, cost, and accuracy of the results (default is 50,000).
`alpha`	The threshold of statistical significance (p-value) used throughout the entire model (default is 0.05).
`num_trees`	Number of trees to be used for the random forests classification (supervised machine learning). Default is set to 500.
`mtry`	Number of features to possibly split at in each node. Default is the (rounded down) square root of the number of variables. Alternatively, a single argument function returning an integer, given the number of independent variables.
`num_permutations`	Number of permutations used to compute the statistical significance (p-values) of the importance scores resulting from random forests classification (default is 50).
`inf_const`	The constant multiplied by the maximum absolute value of the differential (logFC) values; the product replaces infinite differential values. This results in noticeably higher biomarker values for features with infinite differential values than for other features. Even so, the biomarker rank can still be used to compare all of the features. This parameter is ignored if no infinite value is present within Diff_data. It is mainly relevant to sc-seq experiments, where some genes are uniquely expressed in a specific cell type and consequently get infinite differential values. Note that the sign of the differential value is preserved (default is 10^10).
`ncores`	Integer or `"default"`; the number of cores to be used for parallel processing. If `ncores = "default"` (the default), one fewer than the number of available cores is used. We recommend leaving the ncores argument as is (ncores = "default").
`seed`	The seed to be used for all of the random processes throughout the model (default is 1234).
`verbose`	Logical; whether to display formatted progress messages and a progress bar using cli. If `TRUE`, ExIR reports the major analysis stages, selected warnings, and a final output summary. Default is `TRUE`.
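
The group sizes quoted for `pseudo_samples_per_group` follow from integer division. The sketch below reproduces that arithmetic in base R. `pseudo_group_sizes()` is a hypothetical helper written for illustration, not a function of the influential package. It assumes each condition has at least as many cells/samples as requested pseudo-samples, and it does not attempt to mirror how ExIR assigns individual cells to groups.

```r
# Hypothetical helper (not part of the influential package): split n
# cells/samples from one condition into k non-overlapping pseudo-samples
# of near-equal size. Assumes n >= k.
pseudo_group_sizes <- function(n, k) {
  base  <- n %/% k   # minimum number of members per pseudo-sample
  extra <- n %% k    # number of pseudo-samples that get one extra member
  c(rep(base, k - extra), rep(base + 1L, extra))
}

sizes_500 <- pseudo_group_sizes(500L, 100L)
sizes_536 <- pseudo_group_sizes(536L, 100L)

table(sizes_500)  # 100 pseudo-samples of 5 members each
table(sizes_536)  # 64 pseudo-samples of 5 members and 36 of 6 members
```

Both cases use every cell/sample exactly once: 100 x 5 = 500, and 64 x 5 + 36 x 6 = 536.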

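The substitution described for `inf_const` can be sketched in base R as follows. This is one reading of the argument description, not the package's internal code. `replace_infinite_lfc()` is a hypothetical name, and the sketch assumes the maximum is taken over the finite values only and that at least one finite value is present.

```r
# Illustrative sketch only; the internal implementation in influential may differ.
# Replaces +/-Inf log fold-changes with sign * inf_const * (largest finite |logFC|).
replace_infinite_lfc <- function(lfc, inf_const = 10^10) {
  finite_max <- max(abs(lfc[is.finite(lfc)]))  # assumption: finite values only
  infinite   <- is.infinite(lfc)
  lfc[infinite] <- sign(lfc[infinite]) * inf_const * finite_max
  lfc
}

lfc <- c(1.5, -2, Inf, -Inf, 0.3)     # toy log2FC values
lfc_fixed <- replace_infinite_lfc(lfc)
lfc_fixed
```

With these toy values the largest finite magnitude is 2, so `Inf` becomes 2e10 and `-Inf` becomes -2e10, while the finite entries are unchanged.
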
Value

A list of one graph and one to four tables including:

- Driver table: Top candidate drivers

- DE-mediator table: Top candidate differentially expressed/abundant mediators

- nonDE-mediator table: Top candidate non-differentially expressed/abundant mediators

- Biomarker table: Top candidate biomarkers

The number of returned tables depends on the input data and specified arguments.

Examples

## Not run: 
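# Desiredlist, Diffdata, and Exptldata are assumed to be objects you have already created
MyDesired_list <- Desiredlist
MyDiff_data <- Diffdata
Diff_value <- c(1, 3, 5)     # columns of Diff_data holding differential (log2FC) values
Regr_value <- 7              # column of Diff_data holding regression values
Sig_value <- c(2, 4, 6, 8)   # columns of Diff_data holding significance values
MyExptl_data <- Exptldata
condition <- "condition"     # name of the condition column/row in Exptl_data
My.exir <- exir(Desired_list = MyDesired_list,
                Diff_data = MyDiff_data, Diff_value = Diff_value,
                Regr_value = Regr_value, Sig_value = Sig_value,
                Exptl_data = MyExptl_data, condition = condition)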

## End(Not run)
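
The column indices used in the example above imply a `Diff_data` layout in which each value column is followed by its significance column. A minimal base R sketch of such a layout is shown below. The toy values, the column names, and the use of feature names as row names are all assumptions made for illustration; in practice `diff_data.assembly` can be used to build this dataframe.

```r
# Toy data for illustration only; values and column names are hypothetical.
toy_diff_data <- data.frame(
  logFC_A = c(1.2, -0.8, 2.1), padj_A    = c(0.010, 0.030, 0.001),
  logFC_B = c(0.9, -1.5, 1.7), padj_B    = c(0.020, 0.004, 0.010),
  logFC_C = c(1.1, -0.6, 2.4), padj_C    = c(0.040, 0.020, 0.003),
  regr    = c(0.5, -0.3, 0.7), padj_regr = c(0.010, 0.045, 0.002),
  row.names = c("gene1", "gene2", "gene3")  # assumption: features as row names
)

Diff_value <- c(1, 3, 5)     # differential (log2FC) columns
Regr_value <- 7              # regression column
Sig_value  <- c(2, 4, 6, 8)  # significance columns for all of the above

# Check that each index vector points at the intended columns
names(toy_diff_data)[Diff_value]
names(toy_diff_data)[Regr_value]
names(toy_diff_data)[Sig_value]
```

Printing the selected column names before calling `exir()` is a cheap way to catch an off-by-one error in the index vectors.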

influential documentation built on May 28, 2026, 5:07 p.m.