View source: R/CLUSTERING-compare-clusterings.R
compare_clusterings (R Documentation)

Compare many different clustering algorithms with support for parallelization.
compare_clusterings(
    series = NULL,
    types = c("p", "h", "f", "t"),
    configs = compare_clusterings_configs(types),
    seed = NULL,
    trace = FALSE,
    ...,
    score.clus = function(...) stop("No scoring"),
    pick.clus = function(...) stop("No picking"),
    shuffle.configs = FALSE,
    return.objects = FALSE,
    packages = character(0L),
    .errorhandling = "stop"
)
series: A list of series, a numeric matrix or a data frame. Matrices and data frames are coerced to a list row-wise (see tslist()).

types: Clustering types. It must be any combination of (possibly abbreviated): "partitional", "hierarchical", "fuzzy", "tadpole".

configs: The list of data frames with the desired configurations to run. See pdc_configs() and compare_clusterings_configs().

seed: Seed for random reproducibility.

trace: Logical indicating that more output should be printed to screen.

...: Further arguments for tsclust(), score.clus or pick.clus.

score.clus: A function that gets the list of results (and the contents of ...) and scores each one. It may also be a named list of functions, one per clustering type. See the Scoring section.

pick.clus: A function to pick the best result. See the Picking section.

shuffle.configs: Randomly shuffle the order of configs, which can be useful to balance load when using parallel computation.

return.objects: Logical indicating whether the objects returned by tsclust() should be given in the result.

packages: A character vector with the names of any packages needed for any functions used (distance, centroid, preprocessing, etc.). The name "dtwclust" is added automatically. Relevant for parallel computation.

.errorhandling: This will be passed to foreach() (one of "stop", "remove" or "pass"). See the Parallel Computation section.
This function calls tsclust() with different configurations and evaluates the results with the provided functions. Parallel support is included. See the examples.

Parameters specified in configs whose values are NA will be ignored automatically.

The scoring and picking functions are included for convenience; if they are not specified, the scores and pick elements of the result will be NULL.

See repeat_clustering() for the case when return.objects = FALSE.
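To make the NA behavior concrete, here is a toy illustration in base R of how NA-valued parameters in a config row can be dropped before the corresponding tsclust() call. The column names (k, window.size, fuzziness) are hypothetical and only chosen for the sketch; this is not the package's internal code.

```r
# Illustrative config row: the NA-valued 'window.size' would be ignored
# (column names are hypothetical, not dtwclust internals)
cfg_row <- data.frame(k = 20L, window.size = NA, fuzziness = 2)

# Keep only the non-NA parameters
keep <- !vapply(cfg_row, is.na, logical(1L))
cfg_row[, keep, drop = FALSE]  # only 'k' and 'fuzziness' remain
```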
A list with:

results: A list of data frames with the flattened configs and the corresponding scores returned by score.clus.

scores: The scores given by score.clus.

pick: The object returned by pick.clus.

proc_time: The measured execution time, using base::proc.time().

seeds: A list of lists with the random seeds computed for each configuration.

The cluster objects are also returned if return.objects = TRUE.
The configurations for each clustering type can be evaluated in parallel (multi-processing) with the foreach package. A parallel backend can be registered, e.g., with doParallel.
If the .errorhandling parameter is changed to "pass" and a custom score.clus function is used, said function should be able to deal with possible error objects. If it is changed to "remove", it might not be possible to attach the scores to the results data frame, or the attachment may be inconsistent. Additionally, if return.objects is TRUE, the names given to the objects might also be inconsistent.

Parallelization can incur many deep copies of data when returning the cluster objects, since each one will contain a copy of datalist. If you want to avoid this, consider specifying score.clus and setting return.objects to FALSE, and then using repeat_clustering().
The clustering results are organized in a list of lists in the following way, where only the applicable types exist (first-level list names: partitional, hierarchical, fuzzy, tadpole):

partitional - list with
    Clustering results from first partitional config, etc.
hierarchical - list with
    Clustering results from first hierarchical config, etc.
fuzzy - list with
    Clustering results from first fuzzy config, etc.
tadpole - list with
    Clustering results from first tadpole config, etc.
If score.clus is a function, it will be applied to the available partitional, hierarchical, fuzzy and/or tadpole results via:

scores <- lapply(list_of_lists, score.clus, ...)

Otherwise, score.clus should be a list of functions with the same names as the list above, so that score.clus$partitional is used to score list_of_lists$partitional and so on (via base::Map()).

Therefore, the returned scores will always be a list of lists with first-level names as above.
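The two scoring modes can be illustrated with a base-R toy (in a real run, list_of_lists would hold TSClusters objects from tsclust(); the numeric values here are stand-ins):

```r
# Toy stand-in for the per-type clustering results
list_of_lists <- list(
    partitional = list(a = 1, b = 2),
    hierarchical = list(c = 3)
)

# Mode 1: a single function, applied to each first-level list via lapply()
score_one <- function(objs, ...) sapply(objs, function(x) x * 10)
scores1 <- lapply(list_of_lists, score_one)

# Mode 2: a named list of functions, matched per type via Map()
score_many <- list(
    partitional = function(objs, ...) sapply(objs, function(x) x + 1),
    hierarchical = function(objs, ...) sapply(objs, function(x) x - 1)
)
scores2 <- Map(function(f, objs) f(objs), score_many, list_of_lists)

str(scores2)  # list of lists with first-level names as above
```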
If return.objects is TRUE, the results' data frames and the list of TSClusters objects are given to pick.clus as first and second arguments, respectively, followed by .... Otherwise, pick.clus will receive only the data frames and the contents of ... (since the objects will not be returned by the preceding step).
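A minimal pick.clus sketch, assuming return.objects = TRUE and that score.clus attached a numeric column named "score" to each results data frame (the column name and the function body are illustrative, not part of the package):

```r
# Hypothetical picker: keeps the best-scoring row per clustering type.
# 'results' is the list of flattened results data frames; 'objects' is
# the list of TSClusters objects (present when return.objects = TRUE).
pick_best <- function(results, objects = NULL, ...) {
    lapply(results, function(df) df[which.max(df$score), , drop = FALSE])
}
```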
Note that the configurations returned by the helper functions assign special names to preprocessing/distance/centroid arguments, and these names are used internally to recognize them. If some of these arguments are more complex (e.g., matrices) and should not be expanded, consider passing them directly via the ellipsis (...) instead of using pdc_configs(). This assumes that said arguments can be passed to all functions without affecting their results.

The distance matrices (if calculated) are not re-used across configurations. Given the way the configurations are created, this shouldn't matter, because clusterings with arguments that can use the same distance matrix are already grouped together by compare_clusterings_configs() and pdc_configs().
Author(s): Alexis Sarda-Espinosa

See also: compare_clusterings_configs(), tsclust()
# Fuzzy preprocessing: calculate autocorrelation up to 50th lag
acf_fun <- function(series, ...) {
    lapply(series, function(x) {
        as.numeric(acf(x, lag.max = 50, plot = FALSE)$acf)
    })
}
# Define overall configuration
cfgs <- compare_clusterings_configs(
    types = c("p", "h", "f", "t"),
    k = 19L:20L,
    controls = list(
        partitional = partitional_control(
            iter.max = 30L,
            nrep = 1L
        ),
        hierarchical = hierarchical_control(
            method = "all"
        ),
        fuzzy = fuzzy_control(
            # notice the vector
            fuzziness = c(2, 2.5),
            iter.max = 30L
        ),
        tadpole = tadpole_control(
            # notice the vectors
            dc = c(1.5, 2),
            window.size = 19L:20L
        )
    ),
    preprocs = pdc_configs(
        type = "preproc",
        # shared
        none = list(),
        zscore = list(center = c(FALSE)),
        # only for fuzzy
        fuzzy = list(
            acf_fun = list()
        ),
        # only for tadpole
        tadpole = list(
            reinterpolate = list(new.length = 205L)
        ),
        # specify which should consider the shared ones
        share.config = c("p", "h")
    ),
    distances = pdc_configs(
        type = "distance",
        sbd = list(),
        fuzzy = list(
            L2 = list()
        ),
        share.config = c("p", "h")
    ),
    centroids = pdc_configs(
        type = "centroid",
        partitional = list(
            pam = list()
        ),
        # special name 'default'
        hierarchical = list(
            default = list()
        ),
        fuzzy = list(
            fcmdd = list()
        ),
        tadpole = list(
            default = list(),
            shape_extraction = list(znorm = TRUE)
        )
    )
)
# The number of configurations is returned as an attribute
num_configs <- sapply(cfgs, attr, which = "num.configs")
cat("\nTotal number of configurations without considering optimizations:",
    sum(num_configs),
    "\n\n")
# Define evaluation functions based on CVI: Variation of Information (only crisp partition)
vi_evaluators <- cvi_evaluators("VI", ground.truth = CharTrajLabels)
score_fun <- vi_evaluators$score
pick_fun <- vi_evaluators$pick
# ====================================================================================
# Short run with only fuzzy clustering
# ====================================================================================
comparison_short <- compare_clusterings(CharTraj, types = c("f"), configs = cfgs,
                                        seed = 293L, trace = TRUE,
                                        score.clus = score_fun, pick.clus = pick_fun,
                                        return.objects = TRUE)
## Not run:
# ====================================================================================
# Parallel run with all comparisons
# ====================================================================================
require(doParallel)
registerDoParallel(cl <- makeCluster(detectCores()))
comparison_long <- compare_clusterings(CharTraj, types = c("p", "h", "f", "t"),
                                       configs = cfgs,
                                       seed = 293L, trace = TRUE,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = TRUE)
# Using all external CVIs and majority vote
external_evaluators <- cvi_evaluators("external", ground.truth = CharTrajLabels)
score_external <- external_evaluators$score
pick_majority <- external_evaluators$pick
comparison_majority <- compare_clusterings(CharTraj, types = c("p", "h", "f", "t"),
                                           configs = cfgs,
                                           seed = 84L, trace = TRUE,
                                           score.clus = score_external,
                                           pick.clus = pick_majority,
                                           return.objects = TRUE)
# best results
plot(comparison_majority$pick$object)
print(comparison_majority$pick$config)
stopCluster(cl); registerDoSEQ()
# ====================================================================================
# A run with only partitional clusterings
# ====================================================================================
p_cfgs <- compare_clusterings_configs(
    types = "p", k = 19L:21L,
    controls = list(
        partitional = partitional_control(
            iter.max = 20L,
            nrep = 8L
        )
    ),
    preprocs = pdc_configs(
        "preproc",
        none = list(),
        zscore = list(center = c(FALSE, TRUE))
    ),
    distances = pdc_configs(
        "distance",
        sbd = list(),
        dtw_basic = list(window.size = 19L:20L,
                         norm = c("L1", "L2")),
        gak = list(window.size = 19L:20L,
                   sigma = 100)
    ),
    centroids = pdc_configs(
        "centroid",
        partitional = list(
            pam = list(),
            shape = list()
        )
    )
)
# Remove redundant configurations (the shape centroid always uses z-score preprocessing)
id_redundant <- p_cfgs$partitional$preproc == "none" &
    p_cfgs$partitional$centroid == "shape"
p_cfgs$partitional <- p_cfgs$partitional[!id_redundant, ]
# LONG! 30 minutes or so, sequentially
comparison_partitional <- compare_clusterings(CharTraj, types = "p",
                                              configs = p_cfgs,
                                              seed = 32903L, trace = TRUE,
                                              score.clus = score_fun,
                                              pick.clus = pick_fun,
                                              shuffle.configs = TRUE,
                                              return.objects = TRUE)
## End(Not run)