Benchmark: 'Benchmark' object constructor

Description Usage Arguments Details Input data and tools Manual labelling of data points Subpipelines and n-parameters Stability analysis of clustering Train-and-map Hierarchical labelling k-NN matrix construction Distance matrix construction HDF5 integration Creating new wrappers See Also

View source: R/01_BenchmarkSetUp_.R

Description

Creates an object of type Benchmark, used to configure a benchmarking pipeline and its input data. This object lets you run projection (dimension-reduction/denoising) and clustering pipelines on high-dimensional single-cell data (such as from flow cytometry, mass cytometry, CITE-seq or scRNA-seq).

Usage

Benchmark(
  input,
  input_labels = NULL,
  hierarchy = FALSE,
  hierarchy.params = list(g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0),
  input_features = NULL,
  remove_labels = NULL,
  unassigned_labels = "unassigned",
  compensation_matrix = NULL,
  transform = NULL,
  transform_cofactor = NULL,
  transform_features = NULL,
  input_marker_types = NULL,
  batch_id = "sample_id",
  projection.training_set = NULL,
  clustering.training_set = NULL,
  precomputed_knn = NULL,
  knn.k = 100,
  knn.algorithm = "annoy",
  knn.distance = "euclidean",
  knn.features = NULL,
  precomputed_dist = NULL,
  subpipelines = NULL,
  n_params = NULL,
  stability_repeat = NULL,
  stability_bootstrap = NULL,
  benchmark_name = "SingleBench_Benchmark",
  h5_path = paste0(benchmark_name, ".h5"),
  ask_overwrite = TRUE,
  verbose = TRUE
)

Arguments

input

string vector, flowSet or SummarizedExperiment: input dataset given as a vector of FCS file paths, a flowSet object or a SummarizedExperiment object

input_labels

optional string or factor vector (or a list of these vectors): if manual labels of data points are not included in the input FCS file(s), flowSet or SummarizedExperiment, specify them here. For a single FCS file or a flowSet or SummarizedExperiment, use a single vector (of length equal to number of expression matrix rows). If multiple FCS files are used as input, use a list of vectors, one for each of the files (in the same order as the FCS file paths).

hierarchy

logical: whether to use hierarchical penalties for mismatches in scoring clustering results. Default value is FALSE

hierarchy.params

optional list: parameter values for hierarchical penalties model

input_features

optional numeric or string vector: which columns of the input data should be used in analysis. Default value is NULL, which causes all columns to be used

remove_labels

optional string vector: names of population labels to exclude from analysis. Default value is NULL

unassigned_labels

optional string or string vector: names of population labels to handle as 'unassigned' events (not belonging to a manually annotated population) for purposes of computing evaluation metrics. Default value is 'unassigned'

compensation_matrix

optional numeric matrix: compensation matrix to use if input is uncompensated flow cytometry data. See flowCore::compensate. Default value is NULL, whereby compensation is not applied

transform

optional string: name of transformation function to apply to input expression data. If specified, it must be one of asinh, estimateLogicle. Default value is NULL, whereby transformation is not applied

transform_cofactor

optional numeric value: cofactor for a transformation function specified in transform. For example, asinh with cofactor 5 is usually appropriate for mass cytometry data. Default value is NULL

transform_features

optional numeric or string vector: channels or indices of columns to transform (if transform is specified). The intersection of the indices given here and the indices derived from input_features or input_marker_types is taken (if either of those two parameters is specified). Default value is NULL, which translates to 'all features that are used in the analysis'

input_marker_types

optional string: if using SummarizedExperiment as input, which type(s) of markers (specified in vector marker_type of colData) should be used in analysis. This overrides input_features. Default value is NULL

batch_id

optional string: if using SummarizedExperiment as input, which rowData column should be used to separate input data into batches if available. Default value is sample_id

projection.training_set

optional vector of integers: if input dataset comprises multiple samples, which ones should be used as the training set for building a dimension-reduction model. Default value is NULL (DR is applied to the entire dataset)

clustering.training_set

optional vector of integers: if input dataset comprises multiple samples, which ones should be used as the training set for building a clustering model. Default value is NULL (clustering is applied to the entire dataset)

precomputed_knn

optional list: a previously computed k-NN matrix for input data as a list containing slots Indices and Distances. Overrides knn.algorithm. Default value is NULL

knn.k

integer: if data denoising or the evaluation of DR or clustering tools requires generation of a k-NN matrix, what value of k should be used. Default value is 100

knn.algorithm

string: if a k-NN matrix needs to be constructed, which algorithm should be used. The options are 'annoy', 'cover_tree', 'kd_tree' and 'brute'. Default value is 'annoy', which is recommended if Python and reticulate are available

knn.distance

string: if k-NN matrix needs to be constructed, which distance metric should be used. The only option currently is 'euclidean'. Default value is 'euclidean'

knn.features

optional numeric or string vector: channels or indices of columns to use in construction of the k-NN matrix. Default value is NULL, which causes the k-NN algorithm to use transform_features

precomputed_dist

optional numeric matrix: a previously computed distance matrix. Default value is NULL

subpipelines

list of objects of class BenchmarkSubpipeline, each generated using the Subpipeline constructor

n_params

list of named n-parameter lists, one for each subpipeline (e.g. list(list(projection = c(20, 10), clustering = c(30, 30))))

stability_repeat

optional integer value between 2 and 1000: number of repeated runs of each clustering step on full input. Default value is NULL

stability_bootstrap

optional integer value between 2 and 1000: number of repeated runs of each clustering step on different bootstraps of the input, in addition to one final run on the full data. Default value is NULL. Overrides stability_repeat

benchmark_name

string: benchmark name. Default value is 'SingleBench_Benchmark'

h5_path

string: name of auxiliary HDF5 file to generate. Default value is benchmark_name (with an '.h5' extension)

ask_overwrite

logical: if h5_path is specified and an HDF5 file with the specified name exists already, should the user be asked before the file is overwritten? Default value is TRUE

verbose

logical value: specifies if informative messages should be produced throughout construction of the Benchmark object. Default value is TRUE

Details

Once set up, use Evaluate to run the benchmark.
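As a sketch, a minimal set-up and run might look like the following. The file names are placeholders, the 'smooth' and 'FlowSOM' wrapper names are taken from the examples further down, and the assumption that Evaluate returns the evaluated object is noted in the comments:

```r
library(SingleBench)

## Define a single subpipeline: smoothing followed by FlowSOM clustering
## (wrapper names as in the Subpipelines examples below)
subpipelines <- list()
subpipelines[[1]] <- Subpipeline(
  projection = list(Module(Fix('smooth', k = 50), n_param = 'n_iter')),
  clustering = Module(Fix('FlowSOM', grid_width = 25, grid_height = 25), n_param = 'n_clusters')
)

## One n-parameter sweep per module
n_params <- list()
n_params[[1]] <- list(projection = c(0, 1), clustering = c(30, 40))

## Configure the benchmark ('Sample1.fcs' and 'Sample2.fcs' are placeholders)
b <- Benchmark(
  input              = c('Sample1.fcs', 'Sample2.fcs'),
  transform          = 'asinh',
  transform_cofactor = 5,
  subpipelines       = subpipelines,
  n_params           = n_params,
  benchmark_name     = 'MyBenchmark'
)

## Run the configured pipeline (assuming Evaluate returns the evaluated object)
b <- Evaluate(b)
```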

Input data and tools

Input to the pipeline (expression data matrix/matrices) is passed as a path to an FCS file (or multiple FCS files) or a flowSet or SummarizedExperiment object, using the input argument. Projection and clustering tools (that should be applied to the data upon evaluation) are passed to the constructor using modules (see the Module constructor).

Manual labelling of data points

Each input data point needs to have a label, assigning it to a population. Labels can either be included in the input data object or supplied separately in a vector (input_labels).

Including labels directly is only possible with a flowSet or SummarizedExperiment (see the convention in package HDCytoData). With a flowSet, include a 'population_id' channel of numeric flags, together with a mapping of the sorted numeric flags to population names, given as columns of a data frame accessible for each flowFrame via @description$POPULATION_INFO. With a SummarizedExperiment, include a vector of population labels directly as the first column of rowData. In addition, colData should contain a channel_name, marker_name and (optionally) a marker_class slot.
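As an illustrative sketch (not taken from the package itself), a SummarizedExperiment following this convention could be built as below, using a toy expression matrix in which rows are events and columns are markers; the label and marker names are made up:

```r
library(SummarizedExperiment)

set.seed(1)
## Toy expression matrix: 1000 events x 5 markers
exprs <- matrix(rnorm(1000 * 5), nrow = 1000, ncol = 5)

se <- SummarizedExperiment(
  assays  = list(exprs = exprs),
  rowData = DataFrame(     # first column: per-event population labels
    population = sample(c('CD4+ T cell', 'CD8+ T cell', 'unassigned'),
                        size = 1000, replace = TRUE)
  ),
  colData = DataFrame(
    channel_name = paste0('Ch', 1:5),
    marker_name  = paste0('Marker', 1:5),
    marker_class = rep('type', 5)  # optional
  )
)

## The labelled object can then be passed straight to the constructor
b <- Benchmark(input = se)
```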

Subpipelines and n-parameters

A pipeline is made up of subpipelines, which are combinations of tools and their parameters. Each subpipeline can have one or both of two modules: projection and clustering. For instance, a single subpipeline for data smoothing and subsequent FlowSOM clustering is created using:

 subpipelines <- list()
 subpipelines[[1]] <- 
     Subpipeline(
         projection = list(Module(Fix('smooth', k = 50), n_param = 'n_iter')),
         clustering = Module(Fix('FlowSOM', grid_width = 25, grid_height = 25), n_param = 'n_clusters')
     )

Up to two n-parameters (one per projection step and one per clustering step) can be specified. This means that, within the subpipeline, we can do parameter sweeps over multiple combinations of values of these parameters. These values need to be specified in the Benchmark constructor. For instance, to try out multiple latent-space dimensions and target cluster counts, we can specify:

  n_params <- list()
  n_params[[1]] <- list(
      projection = rep(c(0, 1, 2, 4), each = 2),
      clustering = c(30, 40)
 )

If you want to set up a second subpipeline that re-uses the projection step, you can simply clone the same set-up:

 subpipelines[[2]] <-
     Subpipeline(
         projection = CloneFrom(1),
         clustering = Module(Fix('Depeche'), n_param = 'fixed_penalty')
     )
 
 n_params[[2]] <- list(
     projection = rep(c(0, 1, 2, 4, 5), each = 3),
     clustering = c(2, 3, 4)
 )

Since we used CloneFrom, any results that have been produced during the first subpipeline and are also used in the second one will be recycled (they are not produced again nor are they actually copied).

Of course, n-parameters do not need to be specified, in which case they are simply not used (only one result is produced per subpipeline). Similarly, a pipeline can consist of a single subpipeline (leading to a single result of the pipeline).

Stability analysis of clustering

If you are interested in analysing the stability of a clustering algorithm, run the tool multiple times (with a different random seed each time) or apply it multiple times to bootstraps of the input data. That way, for each subpipeline and n-parameter iteration, your clustering evaluation metrics (scores) will cover a range of values (instead of a single value). For repeated runs, use the stability_repeat parameter. For bootstraps, use the stability_bootstrap parameter.
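For instance, assuming subpipelines and n_params set up as in the examples above (the input file name is a placeholder):

```r
## Repeat each clustering step 20 times with different random seeds...
b_repeat <- Benchmark(
  input            = 'Sample1.fcs',
  subpipelines     = subpipelines,
  n_params         = n_params,
  stability_repeat = 20
)

## ...or run each clustering step on 20 bootstraps of the input
## (plus one final run on the full data)
b_boot <- Benchmark(
  input               = 'Sample1.fcs',
  subpipelines        = subpipelines,
  n_params            = n_params,
  stability_bootstrap = 20
)
```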

Train-and-map

Some projection and clustering tools allow you to train a model on a subset of your input data and map the results onto the rest. If your input dataset consists of multiple FCS files, flowFrames or SummarizedExperiment batches, you can specify the training set indices in projection.training_set and clustering.training_set.
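As a sketch (placeholder file names; subpipelines and n_params as in the earlier examples), training-set indices refer to the input samples in order:

```r
## Three input samples: train the DR model on sample 1 and the clustering
## model on samples 1 and 2, then map results onto the remaining data
b <- Benchmark(
  input                   = c('Sample1.fcs', 'Sample2.fcs', 'Sample3.fcs'),
  subpipelines            = subpipelines,
  n_params                = n_params,
  projection.training_set = 1,
  clustering.training_set = c(1, 2)
)
```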

Hierarchical labelling

If the manual labels of your input data are derived from a hierarchy of populations (i.e. a gating hierarchy for cytometry data), you can make use of the entire hierarchy for evaluation purposes. For instance, instead of using a 'CD4+ T cell' label, you can use 'Lymphocyte/T cell/CD4+ T cell' (a path-like label with '/' as separator). Then, if you apply a clustering tool and match each cluster to a population present in the data, SingleBench can evaluate the quality of clustering more carefully. Specifically, instead of distinguishing between match versus mismatch, a scoring matrix is produced which penalises mismatches with different severity. For instance, to misclassify 'Lymphocyte/T cell/CD4+ T cell' as 'Lymphocyte/T cell' can be better than misclassifying it as 'Lymphocyte/T cell/CD8+ T cell', which is still better than misclassifying it as 'Lymphocyte/B cell/Alpha-Beta Mature B Cell'.

The scoring of each potential mismatch is based on the route from the true label to the predicted label through the label hierarchy tree. To parametrise the hierarchical penalty model, you can set 4 custom values. Firstly, the 'constant generalisation penalty' g_c penalises the first step taken in the direction of the tree root and the 'additive generalisation penalty' g_a penalises each step in that direction. Secondly, the 'constant misidentification penalty' m_c penalises the first step taken in the direction of the tree leaves and the 'additive misidentification penalty' m_a penalises each step in that direction. The penalties are positive values, and the sum of penalties for a misclassification is subtracted from 1, which is the score for a correct match. By default, g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0.
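As a worked illustration of the model described above (a sketch of how I read it, not the package's internal implementation), the score for a single true/predicted label pair could be computed as follows:

```r
## Score one (true, predicted) label pair under the hierarchical penalty
## model, using the default penalty values; labels are '/'-separated paths
hierarchy_score <- function(true, predicted,
                            g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0) {
  t <- strsplit(true, '/')[[1]]
  p <- strsplit(predicted, '/')[[1]]

  ## Length of the shared prefix of the two paths
  common <- 0
  while (common < min(length(t), length(p)) &&
         t[common + 1] == p[common + 1]) {
    common <- common + 1
  }

  up   <- length(t) - common  # steps towards the root (generalisation)
  down <- length(p) - common  # steps towards the leaves (misidentification)

  penalty <- (up   > 0) * g_c + up   * g_a +
             (down > 0) * m_c + down * m_a
  1 - penalty
}

## Generalising by one level: penalty 0.2, score 0.8
hierarchy_score('Lymphocyte/T cell/CD4+ T cell', 'Lymphocyte/T cell')

## One step up plus one step down: penalty 0.2 + 0.4, score 0.4
hierarchy_score('Lymphocyte/T cell/CD4+ T cell', 'Lymphocyte/T cell/CD8+ T cell')
```

This reproduces the ordering given above: generalising to a parent population scores better than confusing sibling leaves, which in turn scores better than crossing into a different branch of the tree.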

If you want to use hierarchical penalties, make sure you use the full labels (paths using '/' as separator) and set the argument hierarchy to TRUE. Optionally, set hierarchy.params to a named list with slots g_c, g_a, m_c and m_a. Alternatively, instead of having a scoring matrix for the penalties computed automatically, you can include a slot scoring_matrix, containing the names of all labels as column and row names and a match/mismatch score for each combination of true and predicted label. This scoring matrix will then be used directly and will override any other parameters of the model.
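For instance (placeholder input file; subpipelines and n_params as in the earlier examples):

```r
## Enable hierarchical penalties with explicit penalty values
b <- Benchmark(
  input            = 'Sample1.fcs',
  subpipelines     = subpipelines,
  n_params         = n_params,
  hierarchy        = TRUE,
  hierarchy.params = list(g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0)
)
```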

k-NN matrix construction

Various dimension-reduction or denoising algorithms may require the construction of a k-nearest-neighbour graph. If any of them are included in the pipeline, a single k-NN matrix is constructed at the start of evaluation. You can specify the k-NN building algorithm, the value of k and the distance metric to use via knn.algorithm, knn.k and knn.distance.
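For example (placeholder input file; other arguments as in the earlier examples):

```r
## Build the k-NN matrix with k = 100 nearest neighbours, using the 'annoy'
## algorithm and Euclidean distances
b <- Benchmark(
  input         = 'Sample1.fcs',
  subpipelines  = subpipelines,
  n_params      = n_params,
  knn.k         = 100,
  knn.algorithm = 'annoy',
  knn.distance  = 'euclidean'
)
```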

Distance matrix construction

You can decide to score the results of projection steps using full distance matrices (distances of each point to each other point) of the original input matrix and each produced projection. This can be set up in the Evaluate function. (Beware that construction of a full distance matrix has quadratic complexity with regard to the number of points.) To speed things up, you can load a pre-computed distance matrix of your original input data by specifying the value of parameter precomputed_dist. To compute such a distance matrix, you can use as.matrix(stats::dist(.)) or coRanking:::euclidean_C(.). (However, even if you do provide a pre-computed distance matrix of the original data, distance matrices for each projection produced in the benchmark pipeline still need to be computed, and this can take a long time for larger datasets.)
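A short sketch of the precomputed-distance route (a toy matrix stands in for your actual transformed input data; input file name is a placeholder and subpipelines and n_params are as in the earlier examples):

```r
## Toy stand-in for the original input expression matrix; in practice, use
## your actual (transformed) input expression data
exprs <- matrix(rnorm(500 * 10), nrow = 500)

## Full Euclidean distance matrix (quadratic in the number of points)
d <- as.matrix(stats::dist(exprs))

b <- Benchmark(
  input            = 'Sample1.fcs',
  subpipelines     = subpipelines,
  n_params         = n_params,
  precomputed_dist = d
)
```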

HDF5 integration

All large vectors and matrices generated during benchmark set-up and evaluation are stored in an auxiliary HDF5 file. Erasing it amounts to discarding results of the benchmark.

Creating new wrappers

To create new wrappers for projection and clustering tools, use WrapTool.

See Also


davnovak/SingleBench documentation built on Dec. 19, 2021, 9:10 p.m.