OLD_NEWS.md

# anticlust 0.4.0

2020-01-28

Major update. Many changes follow the goal of maximizing future maintainability, which mostly means simplifying the exported interface: fewer exported functions, each with fewer arguments. I took the liberty of introducing some major changes to the existing functions; this version will be submitted to CRAN, after which such breaking changes will no longer occur. Other changes open up new possibilities for stimulus selection in experimental psychology (in particular, the new function `matching()`).

Major

Minor

# anticlust 0.3.0

2019-10-30

This is a rather big update that includes several changes which may break code written for earlier versions.

```r
anticlustering(
  iris[, -5],
  K = 3,
  categories = balanced_clustering(iris[, -5], K = 5)
)

anticlustering(
  iris[, -5],
  K = 3,
  categories = generate_exchange_partners(
    categories = iris[, 5],
    p = 4
  )
)
```

[^addedargument]: The argument `p` may be added to the `anticlustering()` function in a future release (in that case, `generate_exchange_partners()` would be called internally). This is probably a useful extension.

```r
anticlustering(
  iris[, -5],
  K = 3,
  categories = generate_exchange_partners(
    features = iris[, -5],
    categories = iris[, 5],
    p = 4,
    similar = TRUE
  )
)
```

Moreover, several internal changes were made to the code base to enhance future maintainability. For example, to differentiate between distance input and feature input, custom S3 classes are attached to the data matrices. For anyone inspecting the source code and wondering why I use custom classes instead of simply relying on the class `dist` for distance input: I usually want a complete matrix (with both the upper and lower triangular part) rather than the condensed representation of class `dist`. Also, adding a custom S3 class requires hardly any extra effort.
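The idea can be sketched in a few lines of base R (the class name `distance_matrix` and the helper function are illustrative, not the package's actual internals):

```r
# Hypothetical sketch: tag a full symmetric distance matrix with a custom
# S3 class so downstream code can dispatch on the type of input.
as_distance_matrix <- function(x) {
  m <- as.matrix(x)  # expand the condensed `dist` object to a full matrix
  class(m) <- c("distance_matrix", class(m))
  m
}

d <- as_distance_matrix(dist(matrix(rnorm(10), ncol = 2)))
inherits(d, "distance_matrix")  # TRUE
isSymmetric(unclass(d))         # TRUE: both triangular parts are present
```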

# anticlust 0.2.9-5

2019-09-17

# anticlust 0.2.9-4

2019-07-23

Internal change: optimizing the exchange method with the default distance objective is now much faster. This is accomplished by only updating the sum of distances after each exchange instead of recomputing all distances in every iteration (see commit d51e59d).

This example illustrates the run time improvement:

```r
# For N = 20 to N = 300, compare the run time of the old and new
# optimization of the distance criterion:

n <- seq(20, 300, by = 20)
times <- matrix(nrow = length(n), ncol = 4)
times[, 1] <- n
colnames(times) <- c("n", "old_features_input", "old_distance_input", "new_distance")

for (i in seq_along(n)) {
  # Simulate 2 features as input data
  data <- matrix(rnorm(n[i] * 2), ncol = 2)

  ## Old version: feature table as input
  start <- Sys.time()
  ac1 <- anticlustering(
    data,
    K = rep_len(1:2, nrow(data)),
    objective = anticlust:::obj_value_distance
  )
  times[i, "old_features_input"] <- difftime(Sys.time(), start, units = "s")

  ## Old version: distance matrix as input
  start <- Sys.time()
  ac2 <- anticlustering(
    dist(data),
    K = rep_len(1:2, nrow(data)),
    objective = anticlust:::distance_objective_
  )
  times[i, "old_distance_input"] <- difftime(Sys.time(), start, units = "s")

  ## New version: the sum of distances is updated rather than recomputed
  start <- Sys.time()
  ac3 <- anticlustering(
    data,
    K = rep_len(1:2, nrow(data)),
    objective = "distance"
  )
  times[i, "new_distance"] <- difftime(Sys.time(), start, units = "s")

  ## Ensure that all methods produce the same output
  stopifnot(all(ac1 == ac2))
  stopifnot(all(ac1 == ac3))
}

round(times, 2)
```

```
#         n old_features_input old_distance_input new_distance
#  [1,]  20               0.08               0.12         0.01
#  [2,]  40               0.26               0.50         0.03
#  [3,]  60               0.72               1.36         0.10
#  [4,]  80               1.07               2.62         0.22
#  [5,] 100               1.81               4.98         0.46
#  [6,] 120               3.84              11.17         0.82
#  [7,] 140               3.72              13.17         1.33
#  [8,] 160               5.20              20.65         2.16
#  [9,] 180               7.30              31.48         2.44
# [10,] 200               8.63              37.96         3.38
# [11,] 220              10.97              53.26         4.80
# [12,] 240              13.78              74.17         6.66
# [13,] 260              17.49             106.81         8.43
# [14,] 280              20.40             149.38        12.23
# [15,] 300              27.21             178.46        15.20
```

As shown in the code and the output table, two different objective functions could be called when the exchange algorithm was employed, depending on the input: when a feature table was passed, the internal function `anticlust:::obj_value_distance` was called in each iteration of the exchange algorithm; when a distance matrix was passed, the internal function `anticlust:::distance_objective_` was called instead. The former computes all pairwise distances between elements within each set and returns their sum (using the R functions `by`, `dist`, `sapply`, and `sum`). The latter stores all distances and indexes the relevant ones to return their sum. Interestingly, this indexing approach was a lot slower than recomputing all distances in every iteration of the exchange algorithm.

In the new version, there is no longer a difference between feature and distance input; in both cases, the sum of distances is updated based only on the relevant columns/rows of a distance matrix (that is, in each iteration of the exchange method, only four rows/columns need to be inspected, independent of N). The new approach is a lot faster and especially beneficial when a distance matrix is passed as input.
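The update rule can be sketched as follows (a hypothetical illustration, not the package's actual implementation): when two elements i and j swap clusters, the change in the sum of within-cluster distances only involves distances in rows/columns i and j.

```r
# Hypothetical sketch of the delta update used by an exchange method.
# d: full symmetric distance matrix; clusters: current assignment.
swap_delta <- function(d, clusters, i, j) {
  stopifnot(clusters[i] != clusters[j])
  in_i <- setdiff(which(clusters == clusters[i]), i)  # i's cluster mates
  in_j <- setdiff(which(clusters == clusters[j]), j)  # j's cluster mates
  # After the swap, i is grouped with j's former mates and vice versa:
  (sum(d[i, in_j]) + sum(d[j, in_i])) - (sum(d[i, in_i]) + sum(d[j, in_j]))
}

# Full recomputation, for comparison:
total_within <- function(d, clusters) {
  sum(sapply(unique(clusters), function(k) {
    idx <- which(clusters == k)
    sum(d[idx, idx]) / 2  # each within-cluster distance is counted twice
  }))
}

set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
d <- as.matrix(dist(x))
cl <- rep_len(1:2, nrow(x))
cl_swapped <- cl
cl_swapped[c(1, 2)] <- cl[c(2, 1)]
all.equal(swap_delta(d, cl, 1, 2),
          total_within(d, cl_swapped) - total_within(d, cl))  # TRUE
```

The delta function touches only rows/columns i and j, so its cost per exchange is O(N) rather than the O(N^2) of a full recomputation.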

# anticlust 0.2.9-3

2019-07-22

New feature:

# anticlust 0.2.9-2

2019-07-18

New features:

Internal changes:

# anticlust 0.2.9

2019-07-09

# anticlust 0.2.8

2019-07-05

A new exported function is available: `fast_anticlustering()`. As the name suggests, it is optimized for speed and is particularly useful for large data sets (many thousands of elements). It uses the k-means variance objective because computing all pairwise distances for the cluster editing objective becomes computationally infeasible for large data sets. Additionally, it employs a speed-optimized exchange method. The number of exchange partners can be adjusted via the argument `k_neighbours`. Using fewer exchange partners makes it possible to apply `fast_anticlustering()` to very large data sets. The default value of `k_neighbours` is `Inf`, meaning that by default each element is swapped with all other elements.
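A minimal usage sketch (the data are simulated for illustration; `fast_anticlustering()` and `k_neighbours` are as described above):

```r
library(anticlust)

# Simulate a large two-feature data set
features <- matrix(rnorm(10000 * 2), ncol = 2)

# Restricting each element to 10 exchange partners keeps run time low
groups <- fast_anticlustering(features, K = 4, k_neighbours = 10)

# Inspect the resulting group sizes
table(groups)
```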

# anticlust 0.2.7

2019-07-01

A big update:

Minor changes:

# anticlust 0.2.6

2019-06-19

Minor update: `plot_clusters()` now has an additional argument `illustrate_variance`. If this argument is set to `TRUE`, a cluster solution is illustrated using the k-means variance criterion.

# anticlust 0.2.5

2019-05-27

The new version of anticlust now enables parallelization of the random sampling method, improving run time.

An example data set is now included with the package, courteously provided by Marie Luisa Schaper and Ute Bayen. For details, see `?schaper2019`.

# anticlust 0.2.4

2019-04-26

The new version of anticlust includes support for constraints induced by grouping variables.
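For illustration, a grouping variable can be passed via the `categories` argument (a sketch; the iris data stand in for real stimuli):

```r
library(anticlust)

# Balance a categorical grouping variable across the anticlusters:
groups <- anticlustering(
  iris[, -5],
  K = 3,
  categories = iris$Species
)

# Each species should be distributed evenly across the three groups
table(groups, iris$Species)
```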

In `anticlustering()`, the default value of `preclustering` is now `FALSE`.



m-Py/anticlust documentation built on April 13, 2025, 11:17 p.m.