Description Usage Arguments Details Input data and tools Manual labelling of data points Subpipelines and n-parameters Stability analysis of clustering Train-and-map Hierarchical labelling k-NN matrix construction Distance matrix construction HDF5 integration Creating new wrappers See Also
View source: R/01_BenchmarkSetUp_.R
Creates an object of type Benchmark
, used to configure a benchmarking pipeline and input data.
This object allows you to run projection (dimension-reduction/denoising) and clustering pipelines on high-dimensional single-cell data (such as flow cytometry, mass cytometry, CITE-seq or scRNA-seq data).
Benchmark(
input,
input_labels = NULL,
hierarchy = FALSE,
hierarchy.params = list(g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0),
input_features = NULL,
remove_labels = NULL,
unassigned_labels = "unassigned",
compensation_matrix = NULL,
transform = NULL,
transform_cofactor = NULL,
transform_features = NULL,
input_marker_types = NULL,
batch_id = "sample_id",
projection.training_set = NULL,
clustering.training_set = NULL,
precomputed_knn = NULL,
knn.k = 100,
knn.algorithm = "annoy",
knn.distance = "euclidean",
knn.features = NULL,
precomputed_dist = NULL,
subpipelines = NULL,
n_params = NULL,
stability_repeat = NULL,
stability_bootstrap = NULL,
benchmark_name = "SingleBench_Benchmark",
h5_path = paste0(benchmark_name, ".h5"),
ask_overwrite = TRUE,
verbose = TRUE
)
input
string vector, flowSet or SummarizedExperiment: paths to the input FCS file(s), or a flowSet or SummarizedExperiment object containing the input expression data (see Details).
input_labels
optional string or factor vector (or a list of these vectors): if manual labels of data points are not included in the input FCS file(s), they can be supplied here (one vector per input sample). Default value is NULL.
hierarchy
logical: whether to use hierarchical penalties for mismatches in scoring clustering results. Default value is FALSE.
hierarchy.params
optional list: parameter values for the hierarchical penalties model (see Details). Default value is list(g_c = 0, g_a = 0.2, m_c = 0.4, m_a = 0).
input_features
optional numeric or string vector: which columns of the input data should be used in analysis. Default value is NULL.
remove_labels
optional string vector: names of population labels to exclude from analysis. Default value is NULL.
unassigned_labels
optional string or string vector: names of population labels to handle as 'unassigned' events (not belonging to a manually annotated population) for purposes of computing evaluation metrics. Default value is 'unassigned'.
compensation_matrix
optional numeric matrix: compensation matrix to use if the input is uncompensated flow cytometry data. Default value is NULL.
transform
optional string: name of the transformation function to apply to input expression data; if specified, it must be one of the supported transformation names. Default value is NULL.
transform_cofactor
optional numeric value: cofactor for the transformation function specified in transform. Default value is NULL.
transform_features
optional numeric or string vector: channels or indices of columns to transform (if transform is specified). Default value is NULL.
input_marker_types
optional string: marker-type information for the input channels, if not included in the input object itself. Default value is NULL.
batch_id
optional string: name of the identifier that distinguishes samples (batches) in a multi-sample input. Default value is 'sample_id'.
projection.training_set
optional vector of integers: if the input dataset comprises multiple samples, which ones should be used as the training set for building a dimension-reduction model. Default value is NULL.
clustering.training_set
optional vector of integers: if the input dataset comprises multiple samples, which ones should be used as the training set for building a clustering model. Default value is NULL.
precomputed_knn
optional list: a previously computed k-NN matrix, supplied to avoid re-computing it. Default value is NULL.
knn.k
integer: if data denoising or the evaluation of DR or clustering tools requires generation of a k-nearest-neighbour (k-NN) matrix, the number of nearest neighbours to find per data point. Default value is 100.
knn.algorithm
string: if a k-NN matrix is to be generated, the nearest-neighbour search algorithm to use. Default value is 'annoy'.
knn.distance
string: if a k-NN matrix is to be generated, the distance metric to use. Default value is 'euclidean'.
knn.features
optional numeric or string vector: channels or indices of columns to use in construction of the k-NN matrix. Default value is NULL.
precomputed_dist
optional numeric matrix: a previously computed distance matrix of the input data. Default value is NULL.
subpipelines
list of subpipeline objects, specifying the combinations of projection and clustering tools to evaluate (see Details). Default value is NULL.
n_params
list of named n-parameter lists, one per subpipeline (see Details). Default value is NULL.
stability_repeat
optional integer: number of times each clustering step should be repeated (with different random seeds) for stability analysis. Default value is NULL.
stability_bootstrap
optional integer: number of times each clustering step should be applied to a bootstrap of the input data for stability analysis. Default value is NULL.
benchmark_name
string: benchmark name. Default value is 'SingleBench_Benchmark'.
h5_path
string: name of the auxiliary HDF5 file to generate. Default value is paste0(benchmark_name, ".h5").
ask_overwrite
logical: if TRUE and the auxiliary HDF5 file exists already, ask before overwriting it. Default value is TRUE.
verbose
logical: whether informative messages should be produced throughout construction of the Benchmark object. Default value is TRUE.
Once set up, use Evaluate
to run the benchmark.
Input to the pipeline (expression data matrix/matrices) is passed as a path to an FCS file (or multiple FCS files) or a flowSet
or SummarizedExperiment
object, using the input
argument.
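For instance, a minimal construction might look like this (the file names and label vectors are hypothetical):

```r
## Minimal sketch: two FCS files with separately supplied labels
b <- Benchmark(
  input        = c('Sample1.fcs', 'Sample2.fcs'),
  input_labels = list(labels1, labels2)  # one label vector per input file
)
```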
Projection and clustering tools (that should be applied to the data upon evaluation) are passed to the constructor using modules (see the Module
constructor).
Each input data point needs to have a label, assigning it to a population.
Labels can either be included in the input data object or supplied separately in a vector (input_labels
).
Including labels directly is only possible with a flowSet
or SummarizedExperiment
(see convention in package HDCytoData
).
With a flowSet, include a 'population_id' channel containing numeric flags, together with a mapping of the sorted numeric flags to population names, stored as columns of a data frame accessible for each flowFrame via @description$POPULATION_INFO.
With a SummarizedExperiment
, include a vector of population labels directly as the first column of rowData
.
In addition, colData
should contain a channel_name
, marker_name
and (optionally) a marker_class
slot.
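A sketch of this layout, following the HDCytoData convention (the marker names and values below are illustrative):

```r
## Expected SummarizedExperiment layout:
## labels in the first column of rowData, channel metadata in colData
library(SummarizedExperiment)

exprs <- matrix(rnorm(1000 * 3), ncol = 3)  # 1000 cells, 3 markers
se <- SummarizedExperiment(
  assays  = list(exprs = exprs),
  rowData = DataFrame(
    population_id = sample(c('T cell', 'B cell'), 1000, replace = TRUE)
  ),
  colData = DataFrame(
    channel_name = c('FL1-A', 'FL2-A', 'FL3-A'),
    marker_name  = c('CD3', 'CD4', 'CD19'),
    marker_class = c('type', 'type', 'type')  # optional slot
  )
)
```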
A pipeline is made up of subpipelines, which are combinations of tools and their parameters.
Each subpipeline can have one or both of two modules: projection and clustering.
For instance, a single subpipeline for data smoothing and subsequent FlowSOM
clustering is created using:
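A sketch of such a set-up (the Subpipeline constructor and the 'smooth' and 'FlowSOM' wrapper names are assumptions here; only the Module constructor is referenced in this documentation):

```r
## Hypothetical sketch: one subpipeline with a denoising step and a clustering step
subpipelines <- list(
  Subpipeline(
    projection = Module('smooth'),   # k-NN-based data smoothing
    clustering = Module('FlowSOM')   # FlowSOM clustering
  )
)
```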
Up to two n-parameters (one per projection step and one per clustering step) can be specified.
This means that, within the subpipeline, we can do parameter sweeps over multiple combinations of values of these parameters.
These values need to be specified in the Benchmark
constructor.
For instance, to try out multiple latent-space dimensions and target cluster counts, we can specify:
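A sketch of such a specification (the slot names projection and clustering are assumptions):

```r
## Hypothetical sketch: sweep latent-space dimension (projection n-parameter)
## and target cluster count (clustering n-parameter) for subpipeline 1
n_params <- list(
  list(projection = c(2, 5, 10),
       clustering = c(10, 20, 40))
)
```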
If you want to set up a second subpipeline that re-uses the projection step, you can simply clone the same set-up:
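A sketch of this (constructor and tool names other than Module and CloneFrom are assumptions):

```r
## Hypothetical sketch: subpipeline 2 recycles the projection step of subpipeline 1
subpipelines <- list(
  Subpipeline(
    projection = Module('smooth'),
    clustering = Module('FlowSOM')
  ),
  Subpipeline(
    projection = CloneFrom(1),      # re-use projection results of subpipeline 1
    clustering = Module('kmeans')   # a different (hypothetical) clustering tool
  )
)
```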
Since we used CloneFrom
, any results that have been produced during the first subpipeline and are also used in the second one will be recycled (they are not produced again nor are they actually copied).
Of course, n-parameters do not need to be specified, in which case they are simply not used (only one result is produced for the subpipeline). Similarly, a pipeline can consist of just a single subpipeline (leading to a single result of the pipeline).
If you are interested in analysing stability of a clustering algorithm, run the tool multiple times (with a different random seed) or apply it multiple times to a bootstrap of the input data.
That way, for your clustering evaluation metrics (scores), you will get a range of values (instead of a single value) for each subpipeline and n-parameter iteration.
For repeated runs, use the stability_repeat
parameter.
For bootstraps, use the stability_bootstrap
parameter.
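For instance (the input file name is hypothetical and subpipelines are omitted for brevity):

```r
## Sketch: repeat each clustering step 10 times with different random seeds
b <- Benchmark(
  input            = 'Sample.fcs',
  stability_repeat = 10    # alternatively: stability_bootstrap = 10
)
```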
Some projection and clustering tools allow you to train a model on a subset of your input data and map the results onto the rest.
If your input dataset consists of multiple FCS files, flowFrame
s or SummarizedExperiment
batches, you can specify the training set indices in projection.training_set
and clustering.training_set
.
If the manual labels of your input data are derived from a hierarchy of populations (i.e. a gating hierarchy for cytometry data), you can make use of the entire hierarchy for evaluation purposes.
For instance, instead of using a 'CD4+ T cell' label, you can use 'Lymphocyte/T cell/CD4+ T cell' (using a path-like label with '/
' as separator).
Then, if you apply a clustering tool and match each cluster to a population present in the data, SingleBench
can evaluate the quality of clustering more carefully.
Specifically, instead of distinguishing between match versus mismatch, a scoring matrix is produced which penalises mismatches with different severity.
For instance, to misclassify 'Lymphocyte/T cell/CD4+ T cell' as 'Lymphocyte/T cell' can be better than misclassifying it as 'Lymphocyte/T cell/CD8+ T cell', which is still better than misclassifying it as 'Lymphocyte/B cell/Alpha-Beta Mature B Cell'.
The scoring of each potential mismatch is based on the route from the true label to the predicted label through the label hierarchy tree.
To parametrise the hierarchical penalty model, you can set 4 custom values.
Firstly, the 'constant generalisation penalty' g_c
penalises the first step taken in the direction of the tree root and the 'additive generalisation penalty' g_a
penalises each step in that direction.
Secondly, the 'constant misidentification penalty' m_c
penalises the first step taken in the direction of the tree leaves and the 'additive misidentification penalty' m_a
penalises each step in that direction.
The penalties are positive values; their sum for a given misclassification is subtracted from 1, which is the score for a correct match.
By default, g_c = 0
, g_a = 0.2
, m_c = 0.4
, m_a = 0
.
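Interpreting the description above as a total penalty of g_c + u*g_a for u steps toward the root and m_c + d*m_a for d steps toward the leaves (this formula is an interpretation of the text, not quoted from the package), the default values give:

```r
## Worked example with the default hierarchical penalties
g_c <- 0; g_a <- 0.2; m_c <- 0.4; m_a <- 0

## Score for a prediction that takes u up-steps and d down-steps
## through the label hierarchy tree (u = d = 0 is a correct match)
score <- function(u, d) {
  gen <- if (u > 0) g_c + u * g_a else 0  # generalisation penalty
  mis <- if (d > 0) m_c + d * m_a else 0  # misidentification penalty
  1 - gen - mis
}

## True label: 'Lymphocyte/T cell/CD4+ T cell'
score(u = 1, d = 0)  # predicted 'Lymphocyte/T cell': 0.8
score(u = 1, d = 1)  # predicted 'Lymphocyte/T cell/CD8+ T cell': 0.4
score(u = 2, d = 2)  # predicted 'Lymphocyte/B cell/...': 0.2
```

This reproduces the ordering of severities in the CD4+ T cell example above.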
If you want to use hierarchical penalties, make sure you use the full labels (paths using '/' as separator) and set the argument hierarchy to TRUE.
Optionally, set hierarchy.params
to a named list with slots g_c
, g_a
, m_c
and m_a
.
Alternatively, instead of computing a scoring matrix for the penalties automatically, you can include a slot scoring_matrix
, which contains the names of all labels as column and row names and a match/mismatch score for each combination of true and predicted label.
This scoring matrix will then be used directly and will override any other parameters of the model.
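A sketch of supplying such a matrix (the values shown are consistent with the default penalties; the orientation rows = true label, columns = predicted label is an assumption):

```r
## Hypothetical sketch: custom scoring matrix for three labels
labs <- c('Lymphocyte/T cell/CD4+ T cell',
          'Lymphocyte/T cell/CD8+ T cell',
          'Lymphocyte/T cell')
sm <- matrix(c(1.0, 0.4, 0.8,
               0.4, 1.0, 0.8,
               0.6, 0.6, 1.0),
             nrow = 3, byrow = TRUE, dimnames = list(labs, labs))
hierarchy.params <- list(scoring_matrix = sm)  # overrides the penalty parameters
```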
Various dimension-reduction or denoising algorithms may require the construction of a k-nearest-neighbour graph.
If any of them are included in the pipeline, a single k
-NN matrix will be constructed at the start of evaluation.
You can specify the k
-NN building algorithm, the value of k
and the distance metric to use by specifying knn.algorithm
, knn.k
knn.distance, respectively.
You can decide to score the results of projection steps using full distance matrices (distances from each point to every other point) of the original input matrix and of each produced projection.
This can be set up in the Evaluate
function.
(Beware that construction of a full distance matrix has quadratic complexity with regard to number of points.)
To speed things up, you can load a pre-computed distance matrix of your original input data by specifying the value of parameter precomputed_dist
.
To compute such a distance matrix you can use as.matrix(stats::dist(.))
or coRanking:::euclidean_C(.)
.
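For instance (assuming expr holds the input expression matrix and the input file name is hypothetical):

```r
## Pre-compute the full distance matrix of the input data
## (quadratic in the number of points; feasible only for smaller datasets)
d <- as.matrix(stats::dist(expr))
b <- Benchmark(
  input            = 'Sample.fcs',
  precomputed_dist = d
)
```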
(However, know that even if you do provide a pre-computed distance matrix of the original data, distance matrices for each projection produced in the benchmark pipeline will still need to be computed, and this can take a long time for larger datasets.)
All large vectors and matrices generated during benchmark set-up and evaluation are stored in an auxiliary HDF5 file. Erasing it amounts to discarding results of the benchmark.
To create new wrappers for projection and clustering tools, use WrapTool
.
Evaluate
: runs all benchmark subpipelines and scores the performance of each tool
AddLayout
: allows you to add a separate 2-dimensional layout of the input dataset or to use an existing projection (produced in the evaluation) as a visualisation layout.