validate_get_twcv: Check if color data are valid and get TWCV

validate_get_twcv    R Documentation

Check if color data are valid and get TWCV

Description

Checks whether the passed color data are valid, i.e. whether the points are numerous and varied enough according to the passed validation criteria. This function is normally only used indirectly through 'Participant$check_valid_get_twcv()' or 'ParticipantGroup$get_valid_twcv()'.

Usage

validate_get_twcv(
  color_matrix,
  dbscan_eps = 20,
  dbscan_min_pts = 4,
  max_var_tight_cluster = 150,
  max_prop_single_tight_cluster = 0.6,
  safe_num_clusters = 3,
  safe_twcv = 250
)

Arguments

color_matrix

An n-by-3 numerical matrix where each row corresponds to a single point in 3D color space.

dbscan_eps

One-element numerical vector: radius of ‘epsilon neighborhood’ when applying DBSCAN clustering.

dbscan_min_pts

One-element numerical vector: minimum number of points required in the epsilon neighborhood for core points (including the core point itself).

max_var_tight_cluster

One-element numerical vector: maximum variance for a cluster to be considered 'tight-knit'.

max_prop_single_tight_cluster

One-element numerical vector: maximum proportion of points allowed to be within a 'tight-knit' cluster (if this threshold is exceeded, the data are categorized as invalid).

safe_num_clusters

One-element numerical vector: minimum number of clusters that guarantees validity if points are 'non-tight-knit'.

safe_twcv

One-element numerical vector: minimum total within-cluster variance (TWCV) score that guarantees validity if points are 'non-tight-knit'.

Value

A list with components

valid

One-element logical vector

reason_invalid

One-element character vector, empty if valid is TRUE

twcv

One-element numeric vector indicating TWCV (NA if it can't be calculated)

num_clusters

One-element numeric vector indicating the number of identified clusters that count toward the tally compared with 'safe_num_clusters' (NA if it can't be calculated)
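
As an illustration of how the returned list might be consumed, the following base R sketch formats a result for reporting. The helper name 'summarise_result_sketch' and the example values are hypothetical and not part of synr; only the four component names come from this page.

```r
# Hypothetical helper (not part of synr): turns a result list with the
# documented components into a one-line summary.
summarise_result_sketch <- function(res) {
  if (isTRUE(res$valid)) {
    sprintf("valid (TWCV = %.1f, clusters = %d)",
            res$twcv, as.integer(res$num_clusters))
  } else {
    # 'reason_invalid' is only non-empty when 'valid' is FALSE.
    sprintf("invalid: %s", res$reason_invalid)
  }
}

# Made-up results for illustration only; real values come from
# validate_get_twcv().
res_invalid <- list(valid = FALSE, reason_invalid = "placeholder reason",
                    twcv = NA_real_, num_clusters = NA_real_)
res_valid <- list(valid = TRUE, reason_invalid = "",
                  twcv = 312.5, num_clusters = 2)

summarise_result_sketch(res_invalid)
summarise_result_sketch(res_valid)
```

Testing 'valid' with isTRUE() means a missing or NA value falls through to the 'invalid' branch instead of raising an error.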

Details

This function clusters color points with the DBSCAN algorithm, as implemented in the R package 'dbscan'. For further information about the 'dbscan_eps' and 'dbscan_min_pts' parameters, and about DBSCAN itself, please see the 'dbscan' documentation. Once clustering is done, the passed validation criteria are applied:

  • If too high a proportion of all color points (cut-off specified with 'max_prop_single_tight_cluster') fall within a single 'tight-knit' cluster (with a cluster variance less than or equal to 'max_var_tight_cluster'), then the data are always classified as invalid.

  • If the first criterion is cleared, and points form at least 'safe_num_clusters' clusters, data are always classified as valid.

  • If the first criterion is cleared, and the Total Within-Cluster Variance (TWCV) score is greater than or equal to 'safe_twcv', data are always classified as valid.

Note that this means data can be classified as valid either by having at least 'safe_num_clusters' clusters, or by forming fewer clusters whose points are spaced relatively far apart within those clusters.
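
The order in which the criteria are applied can be sketched in base R as below. This is a simplified illustration, not synr's actual implementation: the helper 'classify_validity_sketch' is hypothetical, it assumes per-cluster sizes and variances, the cluster tally and the TWCV score have already been computed, it assumes 'at least safe_num_clusters' as the cluster criterion, and it assumes data meeting neither 'safe' criterion are classified as invalid. The reason strings are placeholders.

```r
# Hypothetical helper (not part of synr): applies the three criteria, in
# order, to cluster summaries that are assumed to be precomputed.
classify_validity_sketch <- function(cluster_sizes,
                                     cluster_variances,
                                     num_clusters,
                                     twcv,
                                     max_var_tight_cluster = 150,
                                     max_prop_single_tight_cluster = 0.6,
                                     safe_num_clusters = 3,
                                     safe_twcv = 250) {
  proportions <- cluster_sizes / sum(cluster_sizes)
  is_tight <- cluster_variances <= max_var_tight_cluster

  # Criterion 1: one tight-knit cluster holds too large a share of points.
  if (any(is_tight & proportions > max_prop_single_tight_cluster)) {
    return(list(valid = FALSE,
                reason_invalid = "too many points in one tight-knit cluster"))
  }
  # Criterion 2: enough clusters.
  if (num_clusters >= safe_num_clusters) {
    return(list(valid = TRUE, reason_invalid = ""))
  }
  # Criterion 3: high enough total within-cluster variance.
  if (twcv >= safe_twcv) {
    return(list(valid = TRUE, reason_invalid = ""))
  }
  # Assumption: data meeting neither 'safe' criterion are invalid.
  list(valid = FALSE, reason_invalid = "too few clusters and too low TWCV")
}

# 90 % of points sit in a tight cluster (variance 50 <= 150): invalid.
classify_validity_sketch(c(9, 1), c(50, 400), num_clusters = 2, twcv = 85)
# Two loose clusters, but TWCV of 300 reaches 'safe_twcv': valid.
classify_validity_sketch(c(5, 5), c(300, 300), num_clusters = 2, twcv = 300)
```

Because the tight-knit check runs first, a large 'twcv' or cluster count cannot rescue data that fail it.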

The DBSCAN 'noise' cluster only counts towards the 'cluster tally' (compared with 'safe_num_clusters') if it includes at least 'dbscan_min_pts' points. Points in the noise cluster are, however, always included in other calculations, e.g. total within-cluster variance (TWCV).
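
The tally rule for the noise cluster might look like the following base R sketch. It assumes cluster assignments in the form returned by the 'dbscan' package, where noise points are labelled 0; the helper 'count_clusters_sketch' is hypothetical and not part of synr.

```r
# Hypothetical helper (not part of synr): counts clusters toward the tally,
# adding the noise cluster only if it holds at least 'dbscan_min_pts' points.
count_clusters_sketch <- function(cluster_ids, dbscan_min_pts = 4) {
  # Regular clusters have positive labels; 0 marks noise points.
  num_regular <- length(unique(cluster_ids[cluster_ids != 0]))
  noise_counts <- sum(cluster_ids == 0) >= dbscan_min_pts
  num_regular + as.integer(noise_counts)
}

# Two regular clusters plus two noise points: noise is too small to count.
count_clusters_sketch(c(1, 1, 1, 1, 2, 2, 2, 2, 0, 0))        # 2
# Same clusters plus four noise points: noise counts as a third cluster.
count_clusters_sketch(c(1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0))  # 3
```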

See Also

point_3d_variance for single-cluster variance, total_within_cluster_variance for TWCV.


synr documentation built on Aug. 23, 2022, 5:06 p.m.