WrapTool: Generate a projection or clustering tool wrapper


View source: R/00_Wrappers.R

Description

This function lets you create wrappers of projection or clustering tools. Then, you can include them in benchmark pipelines.

Usage

WrapTool(
  name,
  type,
  r_packages = NULL,
  python_modules = NULL,
  fun.build_model.single_input = NULL,
  fun.build_model.batch_input = NULL,
  fun.build_model = NULL,
  fun.extract = function(model) model,
  fun.apply_model.single_input = NULL,
  fun.apply_model.batch_input = NULL,
  fun.apply_model = NULL,
  prevent_parallel_execution = TRUE,
  use_python = !is.null(python_modules) && length(python_modules) > 0,
  use_original_expression_matrix = FALSE,
  use_knn_graph = FALSE
)

Arguments

name

string: name of tool

type

string: type of tool (either 'projection' or 'clustering')

r_packages

string vector: names of all R packages needed by the modelling functions

python_modules

optional string vector: names of Python modules needed by the modelling functions (accessed via reticulate). Default value is NULL

fun.build_model.single_input

optional function: modelling function which accepts a single coordinate matrix as input data. Minimal signature function(input, latent_dim) or function(input, n_clusters)

fun.build_model.batch_input

optional function: modelling function which accepts a list of coordinate matrices as input data. Minimal signature function(input, latent_dim) or function(input, n_clusters). Default value is NULL

fun.build_model

optional function: alternative argument name for fun.build_model.single_input, usable when fun.build_model.batch_input is left as NULL

fun.extract

optional function: modelling function which accepts a model generated by fun.build_model.single_input or fun.build_model.batch_input as input. Signature function(model). If unspecified, the model object itself is taken as result

fun.apply_model.single_input

optional function: modelling function which accepts a model generated by fun.build_model.single_input or fun.build_model.batch_input and a new coordinate matrix as input. Signature function(model, input). Default value is NULL

fun.apply_model.batch_input

optional function: modelling function which accepts a model generated by fun.build_model.single_input or fun.build_model.batch_input and a new list of coordinate matrices as input. Signature function(model, input). Default value is NULL

fun.apply_model

optional function: alternative argument name for fun.apply_model.single_input, usable when fun.apply_model.batch_input is left as NULL

prevent_parallel_execution

logical: whether running the tool in parallel on multiple CPU cores should be prevented. Default value is TRUE

use_python

logical: whether the tool uses Python via reticulate. This is automatically set to TRUE if any Python modules are required. Otherwise, default value is FALSE

use_original_expression_matrix

logical: whether the tool uses the original expression matrix in addition to the output of the preceding dimension-reduction tool. Default value is FALSE

use_knn_graph

logical: whether the tool uses a k-nearest-neighbour graph of the input data. Default value is FALSE

Value

This function returns a wrapper function that can be used in constructing a benchmark pipeline using Fix, Module and Subpipeline.

Basic components of a tool wrapper

To create a wrapper, you need to specify a handful of components (as arguments to WrapTool). name is a unique string identifier; it is also included in the name of the wrapper (for example, FlowSOM will have wrapper.clustering.FlowSOM). type specifies whether it is a projection tool (for dimension reduction or denoising) or a clustering tool. The string vector r_packages specifies the names of all required R packages, and python_modules specifies the names of required Python modules (accessed via reticulate, the R/Python interface).
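The components above might be assembled as follows. This is a hypothetical sketch (not part of the package's documentation) that wraps base-R k-means as a clustering tool, following the argument names in the Usage section:

```r
# Hypothetical sketch: wrapping base-R k-means as a clustering tool.
# Assumes the SingleBench package (providing WrapTool) is loaded.
wrapper.clustering.kmeans <- WrapTool(
  name       = 'kmeans',
  type       = 'clustering',
  r_packages = 'stats',
  fun.build_model = function(input, n_clusters = 5)
    kmeans(input, centers = n_clusters),
  fun.extract = function(model) model$cluster  # per-point cluster indices
)
```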

Modelling functions

Modelling functions do the actual work of transforming input data. At least one of them (fun.build_model or one of its .single_input/.batch_input variants) needs to be specified.

fun.build_model.single_input takes a single coordinate matrix of data and returns a model. The model is an object from which the desired result (projection coordinate matrix or vector of cluster indices per data point) can be extracted. fun.build_model.batch_input, instead, takes a list of multiple coordinate matrices (one per sample) as input and returns a model.

If the tool does not distinguish between a single input matrix and multiple input matrices (it would just concatenate the inputs and apply fun.build_model.single_input), fun.build_model.batch_input can be left unspecified and it will be auto-generated. In that case, you can specify the function summarily as fun.build_model.
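As a concrete illustration, a single-input build-model function for a PCA-based projection tool might look like this (a hypothetical sketch using base R, not code from the package):

```r
# Hypothetical sketch: a fun.build_model.single_input candidate for a
# PCA-based projection tool. 'input' is a coordinate matrix and
# 'latent_dim' is the target dimensionality.
build_pca_model <- function(input, latent_dim = 2) {
  # rank. limits the number of principal components computed
  stats::prcomp(input, rank. = latent_dim)
}
```

The returned prcomp object is the "model" from which results are later extracted.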

fun.extract is a function that takes a model object (generated by fun.build_model...) as input and extracts results of the model applied to the original input data. fun.apply_model.single_input takes a model object and a new coordinate matrix as input. It returns the result of applying the previously trained model on new data. fun.apply_model.batch_input takes a list of coordinate matrices as input and applies the model to new data.

Results of the ...batch_input functions should not be split into lists according to the sizes of the original inputs: these functions always return a single coordinate matrix or cluster vector (the splitting per sample is implemented automatically).
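The division of labour between building, extracting, and applying can be sketched with a self-contained PCA example (hypothetical, not the package's own code):

```r
# Hypothetical sketch of the build/extract/apply triple for a PCA-based
# projection tool.
fun_build <- function(input, latent_dim = 2) {
  stats::prcomp(input, rank. = latent_dim)  # the trained "model"
}
fun_extract <- function(model) {
  model$x  # projected coordinates of the original training data
}
fun_apply <- function(model, input) {
  stats::predict(model, newdata = input)  # project previously unseen data
}
```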

Minimal function signatures

The minimal signature of a fun.build_model... function is function(input). Other arguments, with their default values, can (and should) be included: that way, changes in other parameters can be tested.

For example, a simple signature of a fun.build_model... function for the dimension-reduction tool t-SNE might be function(input, latent_dim = 2, perplexity = 2), allowing the user to alter target dimensionality or the perplexity parameter.

Signatures of the other modelling functions are fixed. For fun.extract it is function(model) and for fun.apply_model... it is function(model, input).

Additional inputs to model-building functions

If a clustering tool uses the original high-dimensional expression data as well as a projection (generated in the previous step by some projection method), include the parameter expression in your function signature and set use_original_expression_matrix to TRUE. expression is either a single matrix or a list of matrices, much like input. input, then, is the output of the preceding projection tool in the given sub-pipeline.
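For instance, a model-building function that combines the projection output with the raw expression values might look like this (a hypothetical sketch, assuming use_original_expression_matrix = TRUE and a single-matrix input):

```r
# Hypothetical sketch: a clustering function that uses both the output of
# the preceding projection step ('input') and the original expression
# matrix ('expression').
cluster_with_expression <- function(input, expression, n_clusters = 5) {
  combined <- cbind(input, expression)  # augment projection with raw features
  stats::kmeans(combined, centers = n_clusters)$cluster
}
```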

If your tool uses a k-nearest-neighbour graph (k-NNG), you are encouraged to always use one that was computed at the beginning of your pipeline evaluation. (The k-NNG will be created if SingleBench knows it will run one or more tools that need it.) To do this, set use_knn_graph to TRUE and add the argument knn to the signature of your model-building functions. knn will then be a list of two named matrices: Indices for indices of nearest neighbours (row-wise) and Distances for distances to those neighbours.

Warning: the entries in Indices are 1-indexed and the matrices do not contain a column for the 'zero-th' neighbour (for each point, the zero-th neighbour is the point itself). To modify the knn object (switch to 0-indexing or include the zero-th neighbour), use the converter kNNGTweak inside your model-building function. For instance, to reduce knn to a matrix of indices only that includes zero-th neighbours, stays 1-indexed, and has k lowered from its original value to 30, use: knn <- kNNGTweak(knn, only_indices = TRUE, zero_index = TRUE, zeroth_neighbours = TRUE, new_k = 30).

n-parameters

Most tools can accept custom numeric parameters. Any of the arguments to a model-building function can be chosen as the n-parameter by the user; SingleBench can then run parameter sweeps over different values of that parameter. Dimension-reduction tools, if possible, should have a parameter latent_dim for iterating over latent-space dimensionality. Clustering tools, if possible, should have a parameter n_clusters for iterating over target cluster count. If there is an option to determine the number of clusters automatically, it may be a good idea to map n_clusters = 0 to that behaviour.
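The n_clusters = 0 convention might be implemented like this (a hypothetical sketch; the rule of thumb for the automatic cluster count is purely illustrative):

```r
# Hypothetical sketch: n_clusters = 0 triggers an automatic choice of
# cluster count (here a simple sqrt(n/2) rule of thumb for illustration).
cluster_auto <- function(input, n_clusters = 0) {
  k <- if (n_clusters == 0) {
    max(2, floor(sqrt(nrow(input) / 2)))  # automatic cluster count
  } else {
    n_clusters  # user-specified cluster count
  }
  stats::kmeans(input, centers = k)$cluster
}
```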

For methods that already run on multiple CPU cores themselves, keep prevent_parallel_execution set to TRUE (otherwise, SingleBench may attempt to run them in parallel when the user requests repeated runs for stability analysis).


davnovak/SingleBench documentation built on Dec. 19, 2021, 9:10 p.m.