plot_feature_similarity-methods: Plot heatmaps for pairwise similarity between features.
In familiar: End-to-End Automated Machine Learning and Model Evaluation

plot_feature_similarity

R Documentation

Plot heatmaps for pairwise similarity between features.

Description

This method creates a heatmap based on data stored in a familiarCollection object. Features in the heatmap are ordered so that more similar features appear together.
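
The ordering described above can be illustrated with a minimal base R sketch. This is not familiar's internal code: the simulated features, the use of absolute Spearman correlation as the similarity measure, and the conversion `1 - similarity` to a distance are all assumptions made for illustration.

```r
# Illustrative only: base R, not familiar's internal implementation.
set.seed(1)
n <- 100
a <- rnorm(n)
b <- rnorm(n)
features <- data.frame(
  f1 = a,
  f2 = b,
  f3 = a + rnorm(n, sd = 0.1),  # nearly a copy of f1
  f4 = b + rnorm(n, sd = 0.1)   # nearly a copy of f2
)

# Pairwise similarity (assumed here: absolute Spearman correlation).
similarity <- abs(cor(features, method = "spearman"))

# Convert similarity to a distance and cluster with average linkage.
tree <- hclust(as.dist(1 - similarity), method = "average")

# Reorder rows and columns so that similar features sit next to each other.
ordered_features <- colnames(features)[tree$order]
similarity_ordered <- similarity[ordered_features, ordered_features]
print(round(similarity_ordered, 2))
```

In the reordered matrix, the two near-duplicate pairs end up in adjacent rows and columns, which is the visual effect the heatmap relies on.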

Usage

plot_feature_similarity(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_dendrogram = c("top", "right"),
  dendrogram_height = grid::unit(1.5, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)

## S4 method for signature 'ANY'
plot_feature_similarity(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_dendrogram = c("top", "right"),
  dendrogram_height = grid::unit(1.5, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)

## S4 method for signature 'familiarCollection'
plot_feature_similarity(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_dendrogram = c("top", "right"),
  dendrogram_height = grid::unit(1.5, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)

Arguments

`object`	A `familiarCollection` object, or other objects from which a `familiarCollection` can be extracted. See details for more information.
`feature_cluster_method`	The method used to perform clustering. These are the same methods as for the `cluster_method` configuration parameter: `none`, `hclust`, `agnes`, `diana` and `pam`. `none` cannot be used when extracting data regarding mutual correlation or feature expressions. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`feature_linkage_method`	The method used for agglomerative clustering in `hclust` and `agnes`. These are the same methods as for the `cluster_linkage_method` configuration parameter: `average`, `single`, `complete`, `weighted`, and `ward`. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`feature_cluster_cut_method`	The method used to divide features into separate clusters. The available methods are the same as for the `cluster_cut_method` configuration parameter: `silhouette`, `fixed_cut` and `dynamic_cut`. `silhouette` is available for all cluster methods, but `fixed_cut` only applies to methods that create hierarchical trees (`hclust`, `agnes` and `diana`). `dynamic_cut` requires the `dynamicTreeCut` package and can only be used with `agnes` and `hclust`. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`feature_similarity_threshold`	The threshold level for pair-wise similarity that is required to form feature clusters with the `fixed_cut` method. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`draw`	(optional) Draws the plot if TRUE.
`dir_path`	(optional) Path to the directory where the created feature similarity plots are saved. Output is saved in the `feature_similarity` subdirectory. If `NULL`, no figures are saved; they are returned instead.
`split_by`	(optional) Splitting variables. This refers to column names on which datasets are split. A separate figure is created for each split. See details for available variables.
`facet_by`	(optional) Variables used to determine how and if facets of each figure appear. In case the `facet_wrap_cols` argument is `NULL`, the first variable is used to define columns, and the remaining variables are used to define rows of facets. The variables cannot overlap with those provided to the `split_by` argument, but may overlap with other arguments. See details for available variables.
`facet_wrap_cols`	(optional) Number of columns to generate when facet wrapping. If NULL, a facet grid is produced instead.
`ggtheme`	(optional) `ggplot` theme to use for plotting.
`gradient_palette`	(optional) Sequential or divergent palette used to colour the similarity or distance between features in a heatmap.
`gradient_palette_range`	(optional) Numerical range used to span the gradient. This should be a range of two values, e.g. `c(0, 1)`. Lower or upper boundary can be unset by using `NA`. If not set, the full metric-specific range is used.
`x_label`	(optional) Label to provide to the x-axis. If NULL, no label is shown.
`x_label_shared`	(optional) Sharing of x-axis labels between facets. One of three values: `overall`: A single label is placed at the bottom of the figure. Tick text (but not the ticks themselves) is removed for all but the bottom facet plot(s). `column`: A label is placed at the bottom of each column. Tick text (but not the ticks themselves) is removed for all but the bottom facet plot(s). `individual`: A label is placed below each facet plot. Tick text is kept.
`y_label`	(optional) Label to provide to the y-axis. If NULL, no label is shown.
`y_label_shared`	(optional) Sharing of y-axis labels between facets. One of three values: `overall`: A single label is placed to the left of the figure. Tick text (but not the ticks themselves) is removed for all but the left-most facet plot(s). `row`: A label is placed to the left of each row. Tick text (but not the ticks themselves) is removed for all but the left-most facet plot(s). `individual`: A label is placed to the left of each facet plot. Tick text is kept.
`legend_label`	(optional) Label to provide to the legend. If NULL, the legend will not have a name.
`plot_title`	(optional) Label to provide as figure title. If NULL, no title is shown.
`plot_sub_title`	(optional) Label to provide as figure subtitle. If NULL, no subtitle is shown.
`caption`	(optional) Label to provide as figure caption. If NULL, no caption is shown.
`y_range`	(optional) Value range for the y-axis.
`y_n_breaks`	(optional) Number of breaks to show on the y-axis of the plot. `y_n_breaks` is used to determine the `y_breaks` argument in case it is unset.
`y_breaks`	(optional) Break points on the y-axis of the plot.
`rotate_x_tick_labels`	(optional) Rotate tick labels on the x-axis by 90 degrees. Defaults to `TRUE`. Rotation of x-axis tick labels may also be controlled through the `ggtheme`. In this case, `FALSE` should be provided explicitly.
`show_dendrogram`	(optional) Show dendrogram around the main panel. Can be `TRUE`, `FALSE`, `NULL`, or a position, i.e. `top`, `bottom`, `left` and `right`. Up to two positions may be provided, but only as long as the dendrograms are not on opposite sides of the heatmap: `top` and `bottom`, and `left` and `right` cannot be used together. A dendrogram can only be drawn from cluster methods that produce dendrograms, such as `hclust`. For example, a dendrogram cannot be constructed using the partitioning around medoids method (`pam`). By default, a dendrogram is drawn to the top and right of the panel.
`dendrogram_height`	(optional) Height of the dendrogram. The height is 1.5 cm by default. Height is expected to be a grid unit (see `grid::unit`), which also allows for specifying relative heights.
`width`	(optional) Width of the plot. A default value is derived from the number of facets.
`height`	(optional) Height of the plot. A default value is derived from the number of features and the number of facets.
`units`	(optional) Plot size unit. Either `cm` (default), `mm` or `in`.
`export_collection`	(optional) Exports the collection if TRUE.
`...`	Arguments passed on to `as_familiar_collection`, `ggplot2::ggsave` and `extract_feature_similarity`:
`familiar_data_names`	Names of the dataset(s). Only used if the `object` parameter is one or more `familiarData` objects.
`collection_name`	Name of the collection.
`device`	Device to use. Can either be a device function (e.g. png), or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If `NULL` (default), the device is guessed based on the `filename` extension.
`scale`	Multiplicative scaling factor.
`dpi`	Plot resolution. Also accepts a string input: "retina" (320), "print" (300), or "screen" (72). Applies only to raster output types.
`limitsize`	When `TRUE` (the default), `ggsave()` will not save images larger than 50x50 inches, to prevent the common error of specifying dimensions in pixels.
`bg`	Background colour. If `NULL`, uses the `plot.background` fill value from the plot theme.
`create.dir`	Whether to create new directories if a non-existing directory is specified in the `filename` or `path` (`TRUE`) or return an error (`FALSE`, default). If `FALSE` and run in an interactive session, a prompt will appear asking to create a new directory when necessary.
`data`	A `dataObject` object, `data.table` or `data.frame` that constitutes the data that are assessed.
`is_pre_processed`	Flag that indicates whether the data was already pre-processed externally, e.g. normalised and clustered. Only used if the `data` argument is a `data.table` or `data.frame`.
`cl`	Cluster created using the `parallel` package. This cluster is then used to speed up computation through parallelisation.
`feature_similarity_metric`	Metric to determine pairwise similarity between features. Similarity is computed in the same manner as for clustering, and `feature_similarity_metric` therefore has the same options as `cluster_similarity_metric`: `mcfadden_r2`, `cox_snell_r2`, `nagelkerke_r2`, `spearman`, `kendall` and `pearson`. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`verbose`	Flag to indicate whether feedback should be provided on the computation and extraction of various data elements.
`message_indent`	Number of indentation steps for messages shown during computation and extraction of various data elements.
`estimation_type`	(optional) Sets the type of estimation that should be possible. This has the following options: `point`: Point estimates. `bias_correction` or `bc`: Bias-corrected estimates. A bias-corrected estimate is computed from (at least) 20 point estimates, and `familiar` may bootstrap the data to create them. `bootstrap_confidence_interval` or `bci` (default): Bias-corrected estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The number of point estimates required depends on the `confidence_level` parameter, and `familiar` may bootstrap the data to create them. As with `detail_level`, a non-default `estimation_type` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"="bci", "model_performance"="point")`. This parameter can be set for the following data elements: `auc_data`, `decision_curve_analyis`, `model_performance`, `permutation_vimp`, `ice_data`, and `prediction_data`.
`aggregate_results`	(optional) Flag that signifies whether results should be aggregated during evaluation. If `estimation_type` is `bias_correction` or `bc`, aggregation leads to a single bias-corrected estimate. If `estimation_type` is `bootstrap_confidence_interval` or `bci`, aggregation leads to a single bias-corrected estimate with lower and upper boundaries of the confidence interval. This has no effect if `estimation_type` is `point`. The default value is `TRUE`, except when assessing model performance metrics, as the default violin plot requires the underlying data. As with `detail_level` and `estimation_type`, a non-default `aggregate_results` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"=TRUE, "model_performance"=FALSE)`. This parameter exists for the same elements as `estimation_type`.
`confidence_level`	(optional) Numeric value for the level at which confidence intervals are determined. When bootstraps are used to determine the confidence intervals, `familiar` uses the rule of thumb `n = 20 / ci.level` to determine the number of required bootstraps. The default value is `0.95`.
`bootstrap_ci_method`	(optional) Method used to determine bootstrap confidence intervals (Efron and Hastie, 2016). The following methods are implemented: `percentile` (default): Confidence intervals obtained using the percentile method. `bc`: Bias-corrected confidence intervals. Note that the standard method is not implemented because this method is often not suitable due to non-normal distributions. The bias-corrected and accelerated (BCa) method is not implemented yet.
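
How the clustering arguments (`feature_cluster_method`, `feature_linkage_method`, `feature_cluster_cut_method` and `feature_similarity_threshold`) interact can be sketched in base R. The example below is an approximation under stated assumptions, not the package's implementation: it uses `stats::hclust` with average linkage, absolute Spearman correlation as similarity, `1 - similarity` as distance, and a hypothetical threshold of 0.8 for the fixed cut.

```r
# Illustrative only: a fixed cut of a hierarchical tree at a similarity threshold.
set.seed(2)
n <- 200
base <- rnorm(n)
features <- data.frame(
  x1 = base,
  x2 = base + rnorm(n, sd = 0.2),  # strongly related to x1
  x3 = rnorm(n)                    # unrelated to x1 and x2
)

# Pairwise similarity (assumption: absolute Spearman correlation).
similarity <- abs(cor(features, method = "spearman"))

# Agglomerative clustering with average linkage on distance = 1 - similarity.
tree <- hclust(as.dist(1 - similarity), method = "average")

# Hypothetical threshold: features join a cluster only if their
# linkage distance does not exceed 1 - threshold.
similarity_threshold <- 0.8
clusters <- cutree(tree, h = 1 - similarity_threshold)
print(clusters)
```

Raising the threshold makes clusters harder to form, so fewer features are grouped together; lowering it has the opposite effect.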

Details

This function generates heatmaps of the pairwise similarity between features.

Available splitting variables are: fs_method, learner, and data_set. By default, the data is split by fs_method and learner, with faceting by data_set.

Note that similarity is determined based on the underlying data. Hence the ordering of features may differ between facets, and tick labels are maintained for each panel.

Available palettes for gradient_palette are those listed by grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals() (requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors, topo.colors and cm.colors, which correspond to the palettes of the same name in grDevices. If not specified, a default palette based on palettes in Tableau is used. You may also specify your own palette by using colour names listed by grDevices::colors() or through hexadecimal RGB strings.
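
The palette options above can be explored with grDevices before plotting. The following sketch only uses base R; the specific palette name (`"Viridis"`) and the custom colours are arbitrary choices for illustration, and passing such a colour vector as `gradient_palette` is an assumption based on the description above.

```r
# List a few of the named HCL palettes (requires R >= 3.6.0).
print(head(grDevices::hcl.pals()))

# Five colours from a named sequential HCL palette.
sequential <- grDevices::hcl.colors(n = 5, palette = "Viridis")

# A custom divergent palette built from a colour name and hexadecimal RGB strings.
custom_palette <- c("steelblue", "#FFFFFF", "#B2182B")

# Interpolate the custom palette to preview the resulting gradient.
gradient <- grDevices::colorRampPalette(custom_palette)(7)
print(sequential)
print(gradient)
```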

Labelling methods such as set_fs_method_names or set_data_set_names can be applied to the familiarCollection object to update labels and to change the order of the output in the figure.