Benchmark: 'Benchmark' object constructor

Description Usage Arguments Details Input data and tools Manual labelling of data points Subpipelines and n-parameters Stability analysis of clustering Train-and-map Hierarchical labelling k-NN matrix construction Distance matrix construction HDF5 integration Creating new wrappers See Also

View source: R/01_BenchmarkSetUp_.R

Description

Creates an object of type Benchmark, used to configure a benchmarking pipeline and its input data. This object lets you run projection (dimension-reduction/denoising) and clustering pipelines on high-dimensional single-cell data (such as from flow cytometry, mass cytometry, CITE-seq or scRNA-seq).

Usage

Benchmark(
  input,
  input_labels = NULL,
  hierarchy = FALSE,
  hierarchy.params = list(g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0),
  input_features = NULL,
  remove_labels = NULL,
  unassigned_labels = "unassigned",
  compensation_matrix = NULL,
  transform = NULL,
  transform_cofactor = NULL,
  transform_features = NULL,
  input_marker_types = NULL,
  batch_id = "sample_id",
  projection.training_set = NULL,
  clustering.training_set = NULL,
  precomputed_knn = NULL,
  knn.k = 100,
  knn.algorithm = "annoy",
  knn.distance = "euclidean",
  knn.features = NULL,
  precomputed_dist = NULL,
  subpipelines = NULL,
  n_params = NULL,
  stability_repeat = NULL,
  stability_bootstrap = NULL,
  benchmark_name = "SingleBench_Benchmark",
  h5_path = paste0(benchmark_name, ".h5"),
  ask_overwrite = TRUE,
  verbose = TRUE
)

Arguments

input

string vector, flowSet or SummarizedExperiment: input dataset given as a vector of FCS file paths, a flowSet object or a SummarizedExperiment object

input_labels

optional string or factor vector (or a list of these vectors): if manual labels of data points are not included in the input FCS file(s), flowSet or SummarizedExperiment, specify them here. For a single FCS file or a flowSet or SummarizedExperiment, use a single vector (of length equal to number of expression matrix rows). If multiple FCS files are used as input, use a list of vectors, one for each of the files (in the same order as the FCS file paths).

hierarchy

logical: whether to use hierarchical penalties for mismatches in scoring clustering results. Default value is FALSE

hierarchy.params

optional list: parameter values for hierarchical penalties model

input_features

optional numeric or string vector: which columns of the input data should be used in analysis. Default value is NULL, which causes all columns to be used

remove_labels

optional string vector: names of population labels to exclude from analysis. Default value is NULL

unassigned_labels

optional string or string vector: names of population labels to handle as 'unassigned' events (not belonging to a manually annotated population) for purposes of computing evaluation metrics. Default value is 'unassigned'

compensation_matrix

optional numeric matrix: compensation matrix to use if input is uncompensated flow cytometry data. See flowCore::compensate. Default value is NULL, whereby compensation is not applied

transform

optional string: name of transformation function to apply to input expression data. If specified, it must be one of asinh, estimateLogicle. Default value is NULL, whereby transformation is not applied

transform_cofactor

optional numeric value: cofactor for a transformation function specified in transform. For example, asinh with cofactor 5 is usually appropriate for mass cytometry data. Default value is NULL

transform_features

optional numeric or string vector: channels or indices of columns to transform (if transform is specified). The intersection of the indices given here and the indices derived from input_features or input_marker_types is taken (if either of those two parameters is specified). Default value is NULL, which translates to 'all features that are used in the analysis'

input_marker_types

optional string: if using SummarizedExperiment as input, which type(s) of markers (specified in vector marker_type of colData) should be used in analysis. This overrides input_features. Default value is NULL

batch_id

optional string: if using SummarizedExperiment as input, which rowData column should be used to separate input data into batches if available. Default value is sample_id

projection.training_set

optional vector of integers: if input dataset comprises multiple samples, which ones should be used as the training set for building a dimension-reduction model. Default value is NULL (DR is applied to the entire dataset)

clustering.training_set

optional vector of integers: if input dataset comprises multiple samples, which ones should be used as the training set for building a clustering model. Default value is NULL (clustering is applied to the entire dataset)

precomputed_knn

optional list: a previously computed k-NN matrix for input data as a list containing slots Indices and Distances. Overrides knn.algorithm. Default value is NULL

knn.k

integer: if data denoising or the evaluation of DR or clustering tools requires generation of a k-NN matrix, what value of k should be used. Default value is 100

knn.algorithm

string: if a k-NN matrix needs to be constructed, which algorithm should be used. The options are 'annoy', 'cover_tree', 'kd_tree' and 'brute'. Default value is 'annoy', which is recommended if Python and reticulate are available

knn.distance

string: if k-NN matrix needs to be constructed, which distance metric should be used. The only option currently is 'euclidean'. Default value is 'euclidean'

knn.features

optional numeric or string vector: channels or indices of columns to use in construction of the k-NN matrix. Default value is NULL, which causes the k-NN algorithm to use transform_features

precomputed_dist

optional numeric matrix: a previously computed distance matrix. Default value is NULL

subpipelines

list of objects of class BenchmarkSubpipeline, each generated using the Subpipeline constructor

n_params

list of named n-parameter lists, one for each subpipeline (e.g. list(list(projection = c(20, 10), clustering = c(30, 30))))

stability_repeat

optional integer value between 2 and 1000: number of repeated runs of each clustering step on full input. Default value is NULL

stability_bootstrap

optional integer value between 2 and 1000: number of repeated runs of each clustering step on different bootstraps of the input, in addition to one final run on the full data. Default value is NULL. Overrides stability_repeat

benchmark_name

string: benchmark name. Default value is 'SingleBench_Benchmark'

h5_path

string: name of auxiliary HDF5 file to generate. Default value is benchmark_name (with an '.h5' extension)

ask_overwrite

logical: if h5_path is specified and an HDF5 file with the specified name exists already, should the user be asked before the file is overwritten? Default value is TRUE

verbose

logical value: specifies if informative messages should be produced throughout construction of the Benchmark object. Default value is TRUE

Details

Once set up, use Evaluate to run the benchmark.
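As a sketch, a minimal set-up and run might look like the following. The file names are placeholders, the 'smooth' and 'FlowSOM' wrapper names are taken from the examples further down, and the assumption that Evaluate returns the evaluated object is noted in the comments:

```r
library(SingleBench)

## Define a single subpipeline: smoothing followed by FlowSOM clustering
## (wrapper names as in the Subpipelines examples below)
subpipelines <- list()
subpipelines[[1]] <- Subpipeline(
  projection = list(Module(Fix('smooth', k = 50), n_param = 'n_iter')),
  clustering = Module(Fix('FlowSOM', grid_width = 25, grid_height = 25), n_param = 'n_clusters')
)

## One n-parameter sweep per module
n_params <- list()
n_params[[1]] <- list(projection = c(0, 1), clustering = c(30, 40))

## Configure the benchmark ('Sample1.fcs' and 'Sample2.fcs' are placeholders)
b <- Benchmark(
  input              = c('Sample1.fcs', 'Sample2.fcs'),
  transform          = 'asinh',
  transform_cofactor = 5,
  subpipelines       = subpipelines,
  n_params           = n_params,
  benchmark_name     = 'MyBenchmark'
)

## Run the configured pipeline (assuming Evaluate returns the evaluated object)
b <- Evaluate(b)
```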

Input data and tools

Input to the pipeline (expression data matrix/matrices) is passed as a path to an FCS file (or multiple FCS files) or a flowSet or SummarizedExperiment object, using the input argument. Projection and clustering tools (that should be applied to the data upon evaluation) are passed to the constructor using modules (see the Module constructor).

Manual labelling of data points

Each input data point needs to have a label, assigning it to a population. Labels can either be included in the input data object or supplied separately in a vector (input_labels).

Including labels directly is only possible with a flowSet or SummarizedExperiment (see the convention in package HDCytoData). With a flowSet, include a 'population_id' channel of numeric flags, together with a mapping of the sorted numeric flags to population names, given as columns of a data frame accessible for each flowFrame via @description$POPULATION_INFO. With a SummarizedExperiment, include a vector of population labels directly as the first column of rowData. In addition, colData should contain a channel_name, marker_name and (optionally) a marker_class slot.
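As an illustrative sketch (not taken from the package itself), a SummarizedExperiment following this convention could be built as below, using a toy expression matrix in which rows are events and columns are markers; the label and marker names are made up:

```r
library(SummarizedExperiment)

set.seed(1)
## Toy expression matrix: 1000 events x 5 markers
exprs <- matrix(rnorm(1000 * 5), nrow = 1000, ncol = 5)

se <- SummarizedExperiment(
  assays  = list(exprs = exprs),
  rowData = DataFrame(     # first column: per-event population labels
    population = sample(c('CD4+ T cell', 'CD8+ T cell', 'unassigned'),
                        size = 1000, replace = TRUE)
  ),
  colData = DataFrame(
    channel_name = paste0('Ch', 1:5),
    marker_name  = paste0('Marker', 1:5),
    marker_class = rep('type', 5)  # optional
  )
)

## The labelled object can then be passed straight to the constructor
b <- Benchmark(input = se)
```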

Subpipelines and n-parameters

A pipeline is made up of subpipelines, which are combinations of tools and their parameters. Each subpipeline can have one or both of two modules: projection and clustering. For instance, a single subpipeline for data smoothing and subsequent FlowSOM clustering is created using:

 subpipelines <- list()
 subpipelines[[1]] <- 
     Subpipeline(
         projection = list(Module(Fix('smooth', k = 50), n_param = 'n_iter')),
         clustering = Module(Fix('FlowSOM', grid_width = 25, grid_height = 25), n_param = 'n_clusters')
     )

Up to two n-parameters (one per projection step and one per clustering step) can be specified. This means that, within the subpipeline, we can do parameter sweeps over multiple combinations of values of these parameters. These values need to be specified in the Benchmark constructor. For instance, to try out multiple latent-space dimensions and target cluster counts, we can specify:

  n_params <- list()
  n_params[[1]] <- list(
      projection = rep(c(0, 1, 2, 4), each = 2),
      clustering = c(30, 40)
 )

If you want to set up a second subpipeline that re-uses the projection step, you can simply clone the same set-up:

 subpipelines[[2]] <-
     Subpipeline(
         projection = CloneFrom(1),
         clustering = Module(Fix('Depeche'), n_param = 'fixed_penalty')
     )
 
 n_params[[2]] <- list(
     projection = rep(c(0, 1, 2, 4, 5), each = 3),
     clustering = c(2, 3, 4)
 )

Since we used CloneFrom, any results that have been produced during the first subpipeline and are also used in the second one will be recycled (they are not produced again nor are they actually copied).

Of course, n-parameters do not need to be specified, in which case they are simply not used (only one result is produced per subpipeline). Similarly, a pipeline can consist of a single subpipeline (leading to a single result of the pipeline).

Stability analysis of clustering

If you are interested in analysing the stability of a clustering algorithm, run the tool multiple times (with a different random seed each time) or apply it multiple times to bootstraps of the input data. That way, for each subpipeline and n-parameter iteration, your clustering evaluation metrics (scores) will cover a range of values (instead of a single value). For repeated runs, use the stability_repeat parameter. For bootstraps, use the stability_bootstrap parameter.
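For instance, assuming subpipelines and n_params set up as in the examples above (the input file name is a placeholder):

```r
## Repeat each clustering step 20 times with different random seeds...
b_repeat <- Benchmark(
  input            = 'Sample1.fcs',
  subpipelines     = subpipelines,
  n_params         = n_params,
  stability_repeat = 20
)

## ...or run each clustering step on 20 bootstraps of the input
## (plus one final run on the full data)
b_boot <- Benchmark(
  input               = 'Sample1.fcs',
  subpipelines        = subpipelines,
  n_params            = n_params,
  stability_bootstrap = 20
)
```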

Train-and-map

Some projection and clustering tools allow you to train a model on a subset of your input data and map the results onto the rest. If your input dataset consists of multiple FCS files, flowFrames or SummarizedExperiment batches, you can specify the training set indices in projection.training_set and clustering.training_set.
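As a sketch (placeholder file names; subpipelines and n_params as in the earlier examples), training-set indices refer to the input samples in order:

```r
## Three input samples: train the DR model on sample 1 and the clustering
## model on samples 1 and 2, then map results onto the remaining data
b <- Benchmark(
  input                   = c('Sample1.fcs', 'Sample2.fcs', 'Sample3.fcs'),
  subpipelines            = subpipelines,
  n_params                = n_params,
  projection.training_set = 1,
  clustering.training_set = c(1, 2)
)
```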

Hierarchical labelling

If the manual labels of your input data are derived from a hierarchy of populations (i.e. a gating hierarchy for cytometry data), you can make use of the entire hierarchy for evaluation purposes. For instance, instead of using a 'CD4+ T cell' label, you can use 'Lymphocyte/T cell/CD4+ T cell' (a path-like label with '/' as separator). Then, if you apply a clustering tool and match each cluster to a population present in the data, SingleBench can evaluate the quality of clustering more carefully. Specifically, instead of distinguishing between match versus mismatch, a scoring matrix is produced which penalises mismatches with different severity. For instance, to misclassify 'Lymphocyte/T cell/CD4+ T cell' as 'Lymphocyte/T cell' can be better than misclassifying it as 'Lymphocyte/T cell/CD8+ T cell', which is still better than misclassifying it as 'Lymphocyte/B cell/Alpha-Beta Mature B Cell'.

The scoring of each potential mismatch is based on the route from the true label to the predicted label through the label hierarchy tree. To parametrise the hierarchical penalty model, you can set 4 custom values. Firstly, the 'constant generalisation penalty' g_c penalises the first step taken in the direction of the tree root and the 'additive generalisation penalty' g_a penalises each step in that direction. Secondly, the 'constant misidentification penalty' m_c penalises the first step taken in the direction of the tree leaves and the 'additive misidentification penalty' m_a penalises each step in that direction. The penalties are positive values, and the sum of penalties for a misclassification is subtracted from 1, which is the score for a correct match. By default, g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0.
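As a worked illustration of the model described above (a sketch of how I read it, not the package's internal implementation), the score for a single true/predicted label pair could be computed as follows:

```r
## Score one (true, predicted) label pair under the hierarchical penalty
## model, using the default penalty values; labels are '/'-separated paths
hierarchy_score <- function(true, predicted,
                            g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0) {
  t <- strsplit(true, '/')[[1]]
  p <- strsplit(predicted, '/')[[1]]

  ## Length of the shared prefix of the two paths
  common <- 0
  while (common < min(length(t), length(p)) &&
         t[common + 1] == p[common + 1]) {
    common <- common + 1
  }

  up   <- length(t) - common  # steps towards the root (generalisation)
  down <- length(p) - common  # steps towards the leaves (misidentification)

  penalty <- (up   > 0) * g_c + up   * g_a +
             (down > 0) * m_c + down * m_a
  1 - penalty
}

## Generalising by one level: penalty 0.2, score 0.8
hierarchy_score('Lymphocyte/T cell/CD4+ T cell', 'Lymphocyte/T cell')

## One step up plus one step down: penalty 0.2 + 0.4, score 0.4
hierarchy_score('Lymphocyte/T cell/CD4+ T cell', 'Lymphocyte/T cell/CD8+ T cell')
```

This reproduces the ordering given above: generalising to a parent population scores better than confusing sibling leaves, which in turn scores better than crossing into a different branch of the tree.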

If you want to use hierarchical penalties, make sure you use the full labels (paths using '/' as separator) and set the argument hierarchy to TRUE. Optionally, set hierarchy.params to a named list with slots g_c, g_a, m_c and m_a. Alternatively, instead of having a scoring matrix for the penalties computed automatically, you can include a slot scoring_matrix, containing the names of all labels as column and row names and a match/mismatch score for each combination of true and predicted label. This scoring matrix will then be used directly and will override any other parameters of the model.
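For instance (placeholder input file; subpipelines and n_params as in the earlier examples):

```r
## Enable hierarchical penalties with explicit penalty values
b <- Benchmark(
  input            = 'Sample1.fcs',
  subpipelines     = subpipelines,
  n_params         = n_params,
  hierarchy        = TRUE,
  hierarchy.params = list(g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0)
)
```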

k-NN matrix construction

Various dimension-reduction or denoising algorithms may require the construction of a k-nearest-neighbour graph. If any of them are included in the pipeline, a single k-NN matrix is constructed at the start of evaluation. You can specify the k-NN building algorithm, the value of k and the distance metric to use via knn.algorithm, knn.k and knn.distance.
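For example (placeholder input file; other arguments as in the earlier examples):

```r
## Build the k-NN matrix with k = 100 nearest neighbours, using the 'annoy'
## algorithm and Euclidean distances
b <- Benchmark(
  input         = 'Sample1.fcs',
  subpipelines  = subpipelines,
  n_params      = n_params,
  knn.k         = 100,
  knn.algorithm = 'annoy',
  knn.distance  = 'euclidean'
)
```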

Distance matrix construction

You can decide to score the results of projection steps using full distance matrices (distances of each point to each other point) of the original input matrix and each produced projection. This can be set up in the Evaluate function. (Beware that construction of a full distance matrix has quadratic complexity with regard to the number of points.) To speed things up, you can load a pre-computed distance matrix of your original input data by specifying the value of parameter precomputed_dist. To compute such a distance matrix, you can use as.matrix(stats::dist(.)) or coRanking:::euclidean_C(.). (However, even if you do provide a pre-computed distance matrix of the original data, distance matrices for each projection produced in the benchmark pipeline still need to be computed, and this can take a long time for larger datasets.)
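A short sketch of the precomputed-distance route (a toy matrix stands in for your actual transformed input data; input file name is a placeholder and subpipelines and n_params are as in the earlier examples):

```r
## Toy stand-in for the original input expression matrix; in practice, use
## your actual (transformed) input expression data
exprs <- matrix(rnorm(500 * 10), nrow = 500)

## Full Euclidean distance matrix (quadratic in the number of points)
d <- as.matrix(stats::dist(exprs))

b <- Benchmark(
  input            = 'Sample1.fcs',
  subpipelines     = subpipelines,
  n_params         = n_params,
  precomputed_dist = d
)
```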

HDF5 integration

All large vectors and matrices generated during benchmark set-up and evaluation are stored in an auxiliary HDF5 file. Erasing it amounts to discarding results of the benchmark.

Creating new wrappers

To create new wrappers for projection and clustering tools, use WrapTool.

See Also


davnovak/SingleBench documentation built on Dec. 19, 2021, 9:10 p.m.