WrapTool: Generate a projection or clustering tool wrapper


View source: R/00_Wrappers.R

Description

This function lets you create wrappers of projection or clustering tools. Then, you can include them in benchmark pipelines.

Usage

WrapTool(
  name,
  type,
  r_packages = NULL,
  python_modules = NULL,
  fun.build_model.single_input = NULL,
  fun.build_model.batch_input = NULL,
  fun.build_model = NULL,
  fun.extract = function(model) model,
  fun.apply_model.single_input = NULL,
  fun.apply_model.batch_input = NULL,
  fun.apply_model = NULL,
  prevent_parallel_execution = TRUE,
  use_python = !is.null(python_modules) && length(python_modules) > 0,
  use_original_expression_matrix = FALSE,
  use_knn_graph = FALSE
)

Arguments

name

string: name of tool

type

string: type of tool (either 'projection' or 'clustering')

r_packages

string vector: names of all R packages needed by the modelling functions

python_modules

optional string vector: names of Python modules needed by the modelling functions (accessed via reticulate). Default value is NULL

fun.build_model.single_input

optional function: modelling function which accepts a single coordinate matrix as input data. Minimal signature function(input, latent_dim) or function(input, n_clusters)

fun.build_model.batch_input

optional function: modelling function which accepts a list of coordinate matrices as input data. Minimal signature function(input, latent_dim) or function(input, n_clusters). Default value is NULL

fun.build_model

optional function: alternative argument name for fun.build_model.single_input, usable when fun.build_model.batch_input is left as NULL

fun.extract

optional function: modelling function which accepts a model generated by fun.build_model.single_input or fun.build_model.batch_input as input. Signature function(model). If unspecified, the model object itself is taken as result

fun.apply_model.single_input

optional function: modelling function which accepts a model generated by fun.build_model.single_input or fun.build_model.batch_input and a new coordinate matrix as input. Signature function(model, input). Default value is NULL

fun.apply_model.batch_input

optional function: modelling function which accepts a model generated by fun.build_model.single_input or fun.build_model.batch_input and a new list of coordinate matrices as input. Signature function(model, input). Default value is NULL

fun.apply_model

optional function: alternative argument name for fun.apply_model.single_input, usable when fun.apply_model.batch_input is left as NULL

prevent_parallel_execution

logical: whether running the tool in parallel on multiple CPU cores should be prevented. Default value is TRUE

use_python

logical: whether the tool uses Python via reticulate. This is automatically set to TRUE if any Python modules are required. Otherwise, default value is FALSE

use_original_expression_matrix

logical: whether the tool uses the original expression matrix in addition to the output of the preceding dimension-reduction tool. Default value is FALSE

use_knn_graph

logical: whether the tool uses a k-nearest-neighbour graph of the input data. Default value is FALSE

Value

This function returns a wrapper function that can be used in constructing a benchmark pipeline using Fix, Module and Subpipeline.

Basic components of a tool wrapper

To create a wrapper, you need to specify a handful of components (as arguments to WrapTool). name is a unique string identifier; it is also included in the name of the wrapper (for example, FlowSOM will have wrapper.clustering.FlowSOM). type specifies whether it is a projection tool (for dimension reduction or denoising) or a clustering tool. The string vector r_packages specifies the names of all required R packages, and python_modules specifies the names of required Python modules (accessed via reticulate, the R/Python interface).
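The components above might be assembled as follows. This is a hypothetical sketch (not part of the package's documentation) that wraps base-R k-means as a clustering tool, following the argument names in the Usage section:

```r
# Hypothetical sketch: wrapping base-R k-means as a clustering tool.
# Assumes the SingleBench package (providing WrapTool) is loaded.
wrapper.clustering.kmeans <- WrapTool(
  name       = 'kmeans',
  type       = 'clustering',
  r_packages = 'stats',
  fun.build_model = function(input, n_clusters = 5)
    kmeans(input, centers = n_clusters),
  fun.extract = function(model) model$cluster  # per-point cluster indices
)
```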

Modelling functions

Modelling functions do the actual work of transforming input data. At least one of them (fun.build_model or one of its .single_input/.batch_input variants) needs to be specified.

fun.build_model.single_input takes a single coordinate matrix of data and returns a model. The model is an object from which the desired result (projection coordinate matrix or vector of cluster indices per data point) can be extracted. fun.build_model.batch_input, instead, takes a list of multiple coordinate matrices (one per sample) as input and returns a model.

If the tool does not distinguish between a single input matrix and multiple input matrices (it would just concatenate the inputs and apply fun.build_model.single_input), fun.build_model.batch_input can be left unspecified and it will be auto-generated. In that case, you can specify the function summarily as fun.build_model.
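As a concrete illustration, a single-input build-model function for a PCA-based projection tool might look like this (a hypothetical sketch using base R, not code from the package):

```r
# Hypothetical sketch: a fun.build_model.single_input candidate for a
# PCA-based projection tool. 'input' is a coordinate matrix and
# 'latent_dim' is the target dimensionality.
build_pca_model <- function(input, latent_dim = 2) {
  # rank. limits the number of principal components computed
  stats::prcomp(input, rank. = latent_dim)
}
```

The returned prcomp object is the "model" from which results are later extracted.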

fun.extract is a function that takes a model object (generated by fun.build_model...) as input and extracts results of the model applied to the original input data. fun.apply_model.single_input takes a model object and a new coordinate matrix as input. It returns the result of applying the previously trained model on new data. fun.apply_model.batch_input takes a list of coordinate matrices as input and applies the model to new data.

Results of the ...batch_input functions should not be split into lists according to the sizes of the original inputs: these functions always return a single coordinate matrix or cluster vector (the splitting per sample is implemented automatically).
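The division of labour between building, extracting, and applying can be sketched with a self-contained PCA example (hypothetical, not the package's own code):

```r
# Hypothetical sketch of the build/extract/apply triple for a PCA-based
# projection tool.
fun_build <- function(input, latent_dim = 2) {
  stats::prcomp(input, rank. = latent_dim)  # the trained "model"
}
fun_extract <- function(model) {
  model$x  # projected coordinates of the original training data
}
fun_apply <- function(model, input) {
  stats::predict(model, newdata = input)  # project previously unseen data
}
```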

Minimal function signatures

The minimal signature of a fun.build_model... function is function(input). Other arguments, with their default values, can (and should) be included: that way, changes in other parameters can be tested.

For example, a simple signature of a fun.build_model... function for the dimension-reduction tool t-SNE might be function(input, latent_dim = 2, perplexity = 2), allowing the user to alter target dimensionality or the perplexity parameter.

Signatures of the other modelling functions are fixed. For fun.extract it is function(model) and for fun.apply_model... it is function(model, input).

Additional inputs to model-building functions

If a clustering tool uses the original high-dimensional expression data as well as a projection (generated in the previous step by some projection method), include the parameter expression in your function signature and set use_original_expression_matrix to TRUE. expression is either a single matrix or a list of matrices, much like input. input, then, is the output of the preceding projection tool in the given sub-pipeline.
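For instance, a model-building function that combines the projection output with the raw expression values might look like this (a hypothetical sketch, assuming use_original_expression_matrix = TRUE and a single-matrix input):

```r
# Hypothetical sketch: a clustering function that uses both the output of
# the preceding projection step ('input') and the original expression
# matrix ('expression').
cluster_with_expression <- function(input, expression, n_clusters = 5) {
  combined <- cbind(input, expression)  # augment projection with raw features
  stats::kmeans(combined, centers = n_clusters)$cluster
}
```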

If your tool uses a k-nearest-neighbour graph (k-NNG), you are encouraged to always use one that was computed at the beginning of your pipeline evaluation. (The k-NNG will be created if SingleBench knows it will run one or more tools that need it.) To do this, set use_knn_graph to TRUE and add the argument knn to the signature of your model-building functions. knn will then be a list of two named matrices: Indices for indices of nearest neighbours (row-wise) and Distances for distances to those neighbours.

Warning: the entries in Indices are 1-indexed and the matrices do not contain a column for the 'zero-th' neighbour (for each point, the zero-th neighbour is the point itself). To modify the knn object (switch to 0-indexing or include the zero-th neighbour), use the converter kNNGTweak inside your model-building function. For instance, to reduce knn to a matrix of indices only that includes zero-th neighbours, stays 1-indexed, and has k lowered from its original value to 30, use: knn <- kNNGTweak(knn, only_indices = TRUE, zero_index = TRUE, zeroth_neighbours = TRUE, new_k = 30).

n-parameters

Most tools can accept custom numeric parameters. Any of the arguments to a model-building function can be chosen as the n-parameter by the user; SingleBench can then run parameter sweeps over different values of that parameter. Dimension-reduction tools, if possible, should have a parameter latent_dim for iterating over latent-space dimensionality. Clustering tools, if possible, should have a parameter n_clusters for iterating over target cluster count. If there is an option to determine the number of clusters automatically, it may be a good idea to map n_clusters = 0 to that behaviour.
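The n_clusters = 0 convention might be implemented like this (a hypothetical sketch; the rule of thumb for the automatic cluster count is purely illustrative):

```r
# Hypothetical sketch: n_clusters = 0 triggers an automatic choice of
# cluster count (here a simple sqrt(n/2) rule of thumb for illustration).
cluster_auto <- function(input, n_clusters = 0) {
  k <- if (n_clusters == 0) {
    max(2, floor(sqrt(nrow(input) / 2)))  # automatic cluster count
  } else {
    n_clusters  # user-specified cluster count
  }
  stats::kmeans(input, centers = k)$cluster
}
```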

For methods that already run on multiple CPU cores themselves, keep prevent_parallel_execution set to TRUE (otherwise, SingleBench may attempt to run them in parallel when the user requests repeated runs for stability analysis).


davnovak/SingleBench documentation built on Dec. 19, 2021, 9:10 p.m.