extract_sample_similarity: Internal function to extract the sample distance table.
In familiar: End-to-End Automated Machine Learning and Model Evaluation

extract_sample_similarity

R Documentation

Description

Computes and extracts the sample distance table for samples analysed using a familiarEnsemble object to form a familiarData object. This table can be used to cluster samples, and is exported directly by extract_feature_expression.

Usage

extract_sample_similarity(
  object,
  data,
  cl = NULL,
  is_pre_processed = FALSE,
  sample_limit = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_similarity_metric = waiver(),
  verbose = FALSE,
  message_indent = 0L,
  ...
)

Arguments

`object`	A `familiarEnsemble` object, which is an ensemble of one or more `familiarModel` objects.
`data`	A `dataObject` object, `data.table` or `data.frame` containing the data to be assessed.
`cl`	Cluster created using the `parallel` package. This cluster is then used to speed up computation through parallelisation.
`is_pre_processed`	Flag that indicates whether the data was already pre-processed externally, e.g. normalised and clustered. Only used if the `data` argument is a `data.table` or `data.frame`.
`sample_limit`	(optional) Set the upper limit of the number of samples that are used during evaluation steps. Cannot be less than 20. This setting can be specified per data element by providing a parameter value in a named list with data elements, e.g. `list("sample_similarity"=100, "permutation_vimp"=1000)`. This parameter can be set for the following data elements: `sample_similarity` and `ice_data`.
`sample_cluster_method`	The method used to perform clustering based on distance between samples. These are the same methods as for the `cluster_method` configuration parameter: `hclust`, `agnes`, `diana` and `pam`. `none` cannot be used when extracting data for feature expressions. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`sample_linkage_method`	The method used for agglomerative clustering in `hclust` and `agnes`. These are the same methods as for the `cluster_linkage_method` configuration parameter: `average`, `single`, `complete`, `weighted`, and `ward`. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`sample_similarity_metric`	Metric to determine pairwise similarity between samples. Similarity is computed in the same manner as for clustering, but `sample_similarity_metric` has different options that are better suited to computing distance between samples instead of between features: `gower`, `euclidean`. The underlying feature data is scaled to the `[0, 1]` range (for numerical features) using the feature values across the samples. The normalisation parameters required can optionally be computed from feature data with the outer 5% (on both sides) of feature values trimmed or winsorised. To do so append `_trim` (trimming) or `_winsor` (winsorising) to the metric name. This reduces the effect of outliers somewhat. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`verbose`	Flag to indicate whether feedback should be provided on the computation and extraction of various data elements.
`message_indent`	Number of indentation steps for messages shown during computation and extraction of various data elements.
`...`	Unused arguments.
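
The scaling described for `sample_similarity_metric` can be illustrated with a minimal base R sketch. This is not the code used inside familiar: the helper names `normalisation_range` and `scale_feature`, the use of `stats::quantile` to find the outer 5%, and the choice to apply the resulting range to the untrimmed values are all assumptions made for illustration. Only the Euclidean case is shown.

```r
# Hypothetical helper (not part of familiar): compute the range used for
# normalisation, optionally after trimming or winsorising the outer 5%.
normalisation_range <- function(x, method = c("none", "trim", "winsor"),
                                fraction = 0.05) {
  method <- match.arg(method)
  limits <- stats::quantile(x, probs = c(fraction, 1 - fraction), names = FALSE)
  if (method == "trim") {
    # Trimming: discard values outside the quantile limits.
    x <- x[x >= limits[1] & x <= limits[2]]
  } else if (method == "winsor") {
    # Winsorising: clamp values outside the limits to the limits.
    x <- pmin(pmax(x, limits[1]), limits[2])
  }
  range(x)
}

# Hypothetical helper: scale a numeric feature using the range above.
# Assumption: the range is applied to the original values, so outliers
# may fall outside [0, 1] when trimming or winsorising is used.
scale_feature <- function(x, ...) {
  value_range <- normalisation_range(x, ...)
  if (value_range[2] == value_range[1]) return(rep(0, length(x)))
  (x - value_range[1]) / (value_range[2] - value_range[1])
}

# Made-up feature data: five samples, two numerical features.
features <- data.frame(
  age = c(30, 40, 50, 60, 70),
  marker = c(0.0, 0.5, 1.0, 1.5, 2.0)
)

# Scale every feature, then compute pairwise Euclidean distances.
scaled <- as.data.frame(lapply(features, scale_feature))
distance_matrix <- as.matrix(stats::dist(scaled, method = "euclidean"))
```

Passing `method = "winsor"` or `method = "trim"` to `scale_feature` mimics appending `_winsor` or `_trim` to the metric name, under the assumptions stated above.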

Value

A data.table containing the pairwise distance between samples. The table holds only the upper triangle of the complete distance matrix (i.e. the sparse unitriangular representation). The diagonal is always 0.0 and is not stored, and the lower triangle mirrors the upper triangle.
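
Because only the upper triangle is returned, the full symmetric matrix has to be rebuilt before it can be passed to a clustering routine. The sketch below shows one way to do this in base R. The column names `sample_1`, `sample_2` and `value` and the distance values are assumptions for illustration; the columns of the actual table may differ.

```r
# Hypothetical long-format table holding only the upper triangle.
# Column names and values are made up for illustration.
upper <- data.frame(
  sample_1 = c("a", "a", "b"),
  sample_2 = c("b", "c", "c"),
  value = c(0.2, 0.9, 0.8),
  stringsAsFactors = FALSE
)

# Start from an all-zero matrix, so that the diagonal is 0.0.
samples <- union(upper$sample_1, upper$sample_2)
full <- matrix(
  0,
  nrow = length(samples),
  ncol = length(samples),
  dimnames = list(samples, samples)
)

# Fill the upper triangle, then mirror it into the lower triangle.
full[cbind(upper$sample_1, upper$sample_2)] <- upper$value
full[cbind(upper$sample_2, upper$sample_1)] <- upper$value

# Cluster the samples with average linkage on the rebuilt distances.
clustering <- stats::hclust(stats::as.dist(full), method = "average")
sample_order <- clustering$labels[clustering$order]
```

Here `sample_order` gives the sample names in dendrogram order, which is the kind of ordering a clustered heatmap would use.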

familiar documentation built on Sept. 30, 2024, 9:18 a.m.