get_top_clusters: Implements kTSCR algorithm


View source: R/get_top_clusters.R

Description

This algorithm finds the top clusters for predicting a quantitative outcome variable (y) from input features (X). Each cluster is defined by a single feature, called an elder, together with all other features (siblings) that optimize prediction when compared with the elder, either through an indicator function or through a difference of ranks. Each elder-sibling pair therefore represents a transformed feature (termed a pairwise feature), and the set of all elder-sibling pairs makes up the cluster.
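To make the two pairwise transformations concrete, the following base R sketch builds them by hand for one elder on a toy matrix. It is an illustration only, not the package's implementation: the toy data, the choice of `f1` as elder, and the sign convention of the rank difference (elder minus sibling) are all assumptions.

```r
# Toy feature matrix: 4 features (rows) x 5 samples (columns),
# matching the orientation that get_top_clusters() expects for X.
X <- matrix(
  c(0.9, 0.1, 0.5, 0.7, 0.3,
    0.2, 0.8, 0.4, 0.6, 0.1,
    0.5, 0.5, 0.9, 0.2, 0.4,
    0.1, 0.3, 0.2, 0.9, 0.8),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("f", 1:4), paste0("s", 1:5))
)

elder <- "f1"                              # hypothetical elder
siblings <- setdiff(rownames(X), elder)    # every other feature

# Indicator form: 1 where the elder exceeds the sibling in that sample.
indicator <- t(sapply(siblings, function(s) as.integer(X[elder, ] > X[s, ])))

# Rank-difference form: rank the features within each sample (column),
# then subtract the sibling's rank from the elder's rank (assumed sign).
R <- apply(X, 2, rank)
dimnames(R) <- dimnames(X)
rank_diff <- t(sapply(siblings, function(s) R[elder, ] - R[s, ]))

print(indicator)
print(rank_diff)
```

Each row of `indicator` and `rank_diff` is one pairwise feature (one elder-sibling pair) evaluated across the samples.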

Usage

get_top_clusters(
  y,
  X,
  Verbose = FALSE,
  restrict = FALSE,
  rank = FALSE,
  standardize_features = TRUE,
  cluster_corr_prop = 1,
  ct = 1,
  use_diff_rank = TRUE
)

Arguments

y

vector with quantitative outcome variable

X

a feature matrix with samples as columns and features as rows (NOTE: this is the transpose of the usual samples-by-features layout)
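Because data are often stored with samples as rows, a transpose is usually needed before calling the function. A minimal base R sketch, where the matrix `samples_by_features` and the `gene`/`sample` names are hypothetical:

```r
set.seed(1)
n_samples <- 6
n_features <- 4

# Data in the common layout: one row per sample, one column per feature.
samples_by_features <- matrix(
  rnorm(n_samples * n_features),
  nrow = n_samples,
  dimnames = list(paste0("sample", 1:n_samples), paste0("gene", 1:n_features))
)

# Transpose so that features are rows and samples are columns.
X <- t(samples_by_features)
y <- rnorm(n_samples)

# Sanity check: one outcome value per sample (column of X).
stopifnot(ncol(X) == length(y))
dim(X)
```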

Verbose

a logical. Whether to be verbose (TRUE) or not. Default is FALSE.

restrict

Defaults to FALSE, in which case all features are considered. Otherwise, must be a vector of feature names to restrict the analysis to. If provided, the output matrices contain pairwise feature scores only for comparisons that include at least one feature from the restrict vector. In other words, the rows of the output matrices are restricted to the features in restrict, while the columns still include all features. This way, every restrict feature is compared with every other feature, so non-restrict features can still appear in pairs, which may reflect important feature relationships.
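The row restriction described above can be pictured with a small base R sketch. The `scores` matrix here is a hypothetical stand-in for a pairwise score matrix, not an object the package is known to expose, and it is assumed that the names in `restrict` match the row names of X.

```r
features <- paste0("f", 1:5)

# Hypothetical pairwise score matrix: rows and columns are both features.
scores <- matrix(
  seq_len(25), nrow = 5,
  dimnames = list(features, features)
)

# Features to restrict the analysis to (assumed to match rownames of X).
restrict <- c("f2", "f4")
stopifnot(all(restrict %in% features))

# Rows are limited to the restrict features; columns keep all features,
# so each restrict feature is still paired with every other feature.
restricted <- scores[restrict, , drop = FALSE]
dim(restricted)

# The vector would then be passed as:
# get_top_clusters(y, X, restrict = restrict)
```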

rank

a logical. Whether the rank of the outcome variable should be used for prediction. Default is FALSE. Setting this to TRUE is not recommended (it is currently used for testing).

standardize_features

a logical. Whether features in X should be standardized. Default is TRUE.

cluster_corr_prop

the proportion of the maximum (weighted) cluster correlation with y that should be reflected by the chosen siblings. A hyperparameter. Default is 1 (meaning all elder-sibling pairs in the cluster are included).

ct

correlation threshold that determines how much a new cluster must improve the current correlation with y in order to be added as a top cluster. A hyperparameter. Default is 1 (meaning any improvement is sufficient to add the next cluster within the greedy framework).
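One way to read this threshold is as a multiplicative factor on the current correlation. The sketch below encodes that reading in a hypothetical helper, `accept_cluster()`; the exact rule used inside the package is not stated here, so treat the comparison as an assumption.

```r
# Hypothetical acceptance rule (an assumption, not the package's code):
# a candidate cluster is added only if the correlation it achieves
# exceeds ct times the current correlation.
accept_cluster <- function(current_cor, candidate_cor, ct = 1) {
  candidate_cor > ct * current_cor
}

accept_cluster(0.50, 0.51)            # ct = 1: any improvement is enough
accept_cluster(0.50, 0.51, ct = 1.1)  # ct = 1.1: requires more than 0.55
```

Under this reading, raising ct above 1 makes the greedy search stop earlier and return fewer clusters.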

use_diff_rank

a logical. If TRUE, feature pairs are scored by the difference in their per-sample ranks. Otherwise, pairwise scores are the indicator (I) of whether Xi > Xj. Default is TRUE.

Value

a list with entries y (outcome variable), K (calculated score used for prediction), Correlation (cor(K, y)), ElderIndices (indices of elders in X), SiblingIndices (indices of siblings in X), Elders (elder variable names), siblings (sibling variable names)
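The entries can be accessed with `$` like any R list. The sketch below uses a stand-in list that carries a few of the documented entry names, since a real result requires running the package; the values and the elder names are made up for illustration.

```r
set.seed(42)
y <- rnorm(20)
K <- y + rnorm(20, sd = 0.5)  # stand-in score, not produced by the package

# Stand-in for the returned list, using documented entry names only.
res <- list(
  y = y,
  K = K,
  Correlation = cor(K, y),
  Elders = c("f3", "f7")      # hypothetical elder names
)

# Inspect the top-level structure and pull out individual entries.
str(res, max.level = 1)
res$Correlation
res$Elders

# Correlation is documented as cor(K, y), which can be checked directly.
stopifnot(isTRUE(all.equal(res$Correlation, cor(res$K, res$y))))
```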

Examples

C <- 100  # represents samples
R <- 200 # represents features
y <- rnorm(C) # represents outcome variable
X <- matrix(rbeta(R*C, 2, 3), nrow = R)  # simulate data matrix
res <- get_top_clusters(y, X)

mdkessler/kTSCR documentation built on Feb. 25, 2021, 10:31 p.m.