
View source: R/find_central_clone.R

find_central_elements_by_cluster    R Documentation

Encapsulation of steps to create clusters and determine most central elements of each cluster

Description

Generates clusters using the kmeans method and determines the most representative element of each cluster using PCA (the most central feature in PCA space), the mhorn similarity index (the most similar feature), or Pearson/Spearman correlation (the most correlated feature).

Usage

find_central_elements_by_cluster(
  feature_df,
  anno_mark_font_size = 8,
  annotate_central_elements = T,
  annotate_central_elements_n_clusters = 40,
  central_element_circle_radius = 1/10,
  centrality_methods = "by-rank",
  cluster_id_width = NA,
  cluster_plot_sizes = NA,
  dist_method = "euclidean",
  file_prefix = "central_elements",
  grid_size = 100,
  grid_units = "mm",
  hclust_method = "complete",
  max_clusters = 40,
  max_depth = NA,
  min_clusters = 1L,
  my_threads = 1,
  my_seed = NA,
  output_central_elements = T,
  output_cumulative_variance = F,
  output_dir = ".",
  output_gmt = T,
  output_heatmap = F,
  output_pc1_vs_pc2 = F,
  output_ranked_central_elements = T,
  rank_clm = "Rank",
  rank_df = NULL,
  row_annotation_lwd = 0.25,
  row_annotation_width = 15,
  row_annotation_width_units = "mm",
  row_dend_lwd = 0.25,
  row_dend_width = 15,
  row_dend_width_units = "mm",
  hm_raster_quality = 5,
  show_hm_row_names = T
)

Arguments

feature_df

data.frame on which to perform the PCA, mhorn, or Pearson/Spearman analysis and the kmeans clustering. Important: rows must be named after features.

centrality_methods

A character vector with strings specifying the method for selecting the most central feature of a cluster:

  • two-in-a-row - using PCA, calculates the sum of squares while adding more and more PCs and selects the feature that shows up two times in a row

  • max-depth - using PCA, selects the feature with the maximum sum of squares calculated across the number of PCs requested as max_depth

  • first-most-frequent - using PCA, determines the max sum of squares for 2 PCs, 3 PCs, 4 PCs ... up to N PCs, then picks the feature that showed up the most times across all those calculations

  • mhorn - the feature most similar to the others (i.e., with the largest summed similarity to all other elements) wins

  • spearman - the feature most similar to the others (i.e., with the largest summed correlation to all other elements) wins

  • pearson - the feature most similar to the others (i.e., with the largest summed correlation to all other elements) wins

  • by-rank - uses the most significant feature according to rank_df
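The "largest sum to all other elements" rule used by the correlation-based methods can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper name most_correlated_feature and the toy data are hypothetical, and the self-correlation on the diagonal is assumed to be excluded from the sum.

```r
# Hypothetical helper (not part of binfotron): return the row name of the
# feature with the largest summed correlation to every other feature.
most_correlated_feature <- function(feature_mat, method = "spearman") {
  # stats::cor() correlates columns, so transpose to compare features (rows).
  sim <- stats::cor(t(feature_mat), method = method)
  diag(sim) <- 0  # assumption: a feature's similarity to itself is ignored
  names(which.max(rowSums(sim)))
}

# Toy cluster of three features measured in five samples (made-up values).
toy_cluster <- rbind(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 3, 4, 5),
  c = c(1, 2, 3, 5, 4)
)

central <- most_correlated_feature(toy_cluster, method = "spearman")
```

In this toy cluster, a has a Spearman correlation of 0.9 with both b and c, while b and c correlate at 0.8 with each other, so a has the largest row sum and is picked as the central feature.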

cluster_id_width

An integer indicating how many characters to use for cluster group and cluster number IDs. Defaults to one more than the number of characters in max_clusters.

cluster_plot_sizes

Integer vector indicating which cluster groups to save as plots with clusters circled and central elements labeled. Only used if centrality_methods is one of the PCA options.

dist_method

String indicating the method to pass to stats::dist for clustering

file_prefix

The text to be prepended to the file names for tables and plots

grid_size

Number to specify the size of the heatmap

grid_units

String specifying the units of grid_size for the heatmap

hclust_method

String indicating the method to pass to stats::hclust for clustering

max_clusters

Integer indicating the maximum number of clusters to split data into

max_depth

Integer indicating the maximum depth across principal components to use for determining the most central element

min_clusters

Integer indicating the minimum number of clusters to split data into
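As a rough illustration of how dist_method, hclust_method, min_clusters, and max_clusters could fit together, the sketch below builds a tree with stats::dist and stats::hclust and cuts it once for every cluster count in the requested range. How the function itself combines these arguments is not shown in this page, so treat the flow and the toy data as assumptions.

```r
# Toy feature matrix: rows are features, columns are samples (made-up values).
set.seed(1)
feature_mat <- matrix(
  rnorm(10 * 4),
  nrow = 10,
  dimnames = list(paste0("feature_", 1:10), paste0("sample_", 1:4))
)

# Values mirroring the documented argument names; max_clusters is kept small
# here because the toy data only has ten features.
dist_method <- "euclidean"
hclust_method <- "complete"
min_clusters <- 1L
max_clusters <- 4L

# Distance between features, then hierarchical clustering on those distances.
feature_dist <- stats::dist(feature_mat, method = dist_method)
feature_tree <- stats::hclust(feature_dist, method = hclust_method)

# One column of cluster memberships per requested number of clusters;
# cutree() names the columns after the values of k.
memberships <- stats::cutree(feature_tree, k = min_clusters:max_clusters)
```

Each column of memberships is one "cluster group": column "1" puts every feature in a single cluster, and column "4" splits the same features into four.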

my_threads

Integer specifying the number of parallel processes to use when calculating mhorn indices. Defaults to 1.

my_seed

The seed to use so that clustering can be reproduced
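The reason a seed matters is that kmeans starts from randomly chosen centers. A minimal sketch of the idea, assuming the seed is applied with set.seed before clustering (the helper cluster_with_seed and the toy data are hypothetical, not taken from the package):

```r
# Hypothetical helper: fix the random seed, then run kmeans.
cluster_with_seed <- function(feature_mat, centers, seed) {
  set.seed(seed)
  stats::kmeans(feature_mat, centers = centers)$cluster
}

# Toy feature matrix: 20 features measured in 3 samples (made-up values).
set.seed(7)
feature_mat <- matrix(
  rnorm(20 * 3),
  nrow = 20,
  dimnames = list(paste0("feature_", 1:20), NULL)
)

# Two runs with the same seed start from the same initial centers.
first_run <- cluster_with_seed(feature_mat, centers = 3, seed = 123)
second_run <- cluster_with_seed(feature_mat, centers = 3, seed = 123)
same_assignment <- identical(first_run, second_run)
```

With the same seed, both runs assign every feature to the same cluster, which is what makes a saved seed useful for reproducing earlier results.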

output_central_elements

Boolean whether or not to save the table of central elements by cluster group

output_cumulative_variance

Boolean whether to save a plot of the cumulative variance explained by the PCA axes. Only used if centrality_methods is one of the PCA options.
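For readers unfamiliar with the quantity being plotted, cumulative variance explained can be computed from a stats::prcomp fit as shown below. This is a generic sketch on made-up data; the plot the function actually produces may be computed or styled differently.

```r
# Toy feature matrix: 12 features measured in 5 samples (made-up values).
set.seed(3)
feature_mat <- matrix(rnorm(12 * 5), nrow = 12)

# PCA; prcomp() centers each column by default.
pca <- stats::prcomp(feature_mat)

# Each component's variance is its squared standard deviation.
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Running total across PC1, PC2, ...; the last value is 1 (all variance).
cumulative_variance <- cumsum(variance_explained)
```

A curve that flattens early suggests that a small max_depth already captures most of the structure in the data.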

output_dir

The base directory to which files and plots will be saved

output_gmt

Boolean whether or not to save the gmt data to file

output_heatmap

Boolean whether to save correlation heatmap to file. Ignored if centrality_methods is one of the PCA options.

output_ranked_central_elements

Boolean whether to save the table of unique central elements sorted by rank within cluster group

rank_clm

Length-one character vector with the name of the column holding the initial rankings, if any. The column is looked up in rank_df if one was supplied, and in feature_df otherwise.

rank_df

data.frame with the feature_df features by row in column one and a rank_clm column holding the numeric default ranking used for tie-breaking. If not supplied (the default is NULL), rank_clm is looked up in feature_df.

output_pc1_vs_pc2

Boolean whether to save a plot of the principal component 1 and 2 axes. Only used if centrality_methods is one of the PCA options.

Value

Returns a three-element list with cluster_members, seed, and results. results is a named list with one entry per value of centrality_methods, each holding central_elements and either pca or correlations (depending on the centrality method).


Benjamin-Vincent-Lab/binfotron documentation built on Oct. 1, 2024, 8:33 p.m.