clustering: Clustering algorithm.


View source: R/app.R

Description

Discovers how the variables of a dataset behave across a set of clustering packages, based on evaluation metrics.

Usage

clustering(
  path = NULL,
  df = NULL,
  packages = NULL,
  algorithm = NULL,
  min = 3,
  max = 4,
  metrics = NULL,
  variables = FALSE
)

Arguments

path

Path to the input file. Default: NULL. Supply either path or df, not both. Only files in .dat, .csv or .arff format are accepted.

df

A data matrix, data frame or dissimilarity matrix. Default: NULL. Leave it NULL to use the training and test basketball variables.

packages

Character vector naming the packages that run the algorithms. Default: NULL. The seven packages implemented are: cluster, ClusterR, advclust, amap, apcluster, gama, pvclust.
By default, all packages are run.

algorithm

Character vector naming the algorithms implemented within the packages. Default: NULL. The algorithms implemented are: fuzzy_cm, fuzzy_gg, fuzzy_gk, hclust, apclusterK, agnes, clara, daisy, diana, fanny, mona, pam, gmm, kmeans_arma, kmeans_rcpp, mini_kmeans, gama, pvclust.

min

An integer giving the minimum number of clusters to use when grouping the data. The default value is 3.

max

An integer giving the maximum number of clusters to use when grouping the data. The default value is 4.

metrics

Character vector naming the metrics used to evaluate the distribution of the data in clusters. Default: NULL. The nine metrics implemented are: entropy, variation_information, precision, recall, f_measure, fowlkes_mallows_index, connectivity, dunn, silhouette.

variables

A boolean indicating whether the result shows the variables of the datasets or the numerical values of the metrics. The default value is FALSE.
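The path and df rules described above can be sketched as a small input check. This is only an illustration: resolve_input is a hypothetical helper written for this example, not a function of the Clustering package, and the package's own validation may differ.

```r
# Hypothetical helper (not part of the Clustering package): illustrates the
# documented rule that 'path' and 'df' are mutually exclusive and that only
# .dat, .csv and .arff files are accepted.
resolve_input <- function(path = NULL, df = NULL) {
  if (!is.null(path) && !is.null(df)) {
    stop("Supply either 'path' or 'df', not both.")
  }
  if (!is.null(path)) {
    ext <- tolower(tools::file_ext(path))
    if (!ext %in% c("dat", "csv", "arff")) {
      stop("Unsupported file extension: '", ext, "'")
    }
    return(list(source = "path", value = path))
  }
  if (!is.null(df)) {
    return(list(source = "df", value = df))
  }
  # Neither argument supplied: both were left at their NULL default.
  list(source = "default", value = NULL)
}

resolve_input(df = iris)$source
resolve_input(path = "data.csv")$source
```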

Details

This function evaluates how the variables of a dataset, or of a set of datasets, behave under different clustering algorithms. To do so, you must specify the kind of evaluation to apply to the distribution of the data. Running it requires the minimum and maximum number of clusters (min and max), the algorithms (algorithm) or packages (packages) to run, the metrics (metrics), and whether the evaluation results should be the classified variables themselves or numerical values (variables).
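The general idea of running several algorithms over a range of cluster counts and scoring each run with one metric can be sketched with base R alone. This is a minimal illustration, not the package's implementation: dunn_index, k_min and k_max are names introduced for this example, the Dunn index is computed by hand, and stats::kmeans and stats::hclust stand in for the algorithms listed above.

```r
# Dunn index computed by hand: smallest distance between points in different
# clusters divided by the largest distance between points in the same cluster.
# 'dunn_index' is a helper written for this sketch, not a package function.
dunn_index <- function(d, labels) {
  m <- as.matrix(d)
  same <- outer(labels, labels, "==")   # TRUE where both points share a cluster
  off_diag <- row(m) != col(m)          # ignore each point's distance to itself
  min(m[!same]) / max(m[same & off_diag])
}

x <- scale(iris[, 1:4])   # numeric variables only, standardised
d <- dist(x)

k_min <- 3                # plays the role of 'min'
k_max <- 4                # plays the role of 'max'
ks <- k_min:k_max

set.seed(1)               # kmeans uses random starts
results <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)$cluster
  hc <- cutree(hclust(d, method = "average"), k = k)
  c(kmeans = dunn_index(d, km), hclust = dunn_index(d, hc))
})
colnames(results) <- paste0("k=", ks)
print(results)

# Locate the algorithm and cluster count with the highest score.
best <- which(results == max(results), arr.ind = TRUE)[1, ]
cat("best:", rownames(results)[best["row"]], colnames(results)[best["col"]], "\n")
```

Each row of results is an algorithm and each column a cluster count, so the scores can be compared directly across both.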

Value

A matrix with the results of running all the metrics for the algorithms contained in the indicated packages. It also includes information on the types of metrics, algorithms and packages executed.

How does this algorithm work?

This algorithm improves and complements existing implementations of clustering algorithms.

Many existing approaches run their algorithms side by side without allowing them to be compared with one another. In addition, the user had to indicate which variable of the dataset to run on. Finally, each package offers its own implementations for evaluating groupings of data, so comparing groupings across packages is sometimes complicated.

With this algorithm we can solve the problems mentioned above and determine which algorithm has the best behavior for the set of variables as well as the most efficient number of clusters.

Examples

clustering(
     df = cluster::agriculture,
     min = 4,
     max = 5,
     algorithm='gmm',
     metrics='precision',
     variables = TRUE
)

## Not run: 
clustering(
      df = Clustering::weather,
      min = 2,
      max = 3,
      algorithm = c("gmm", "kmeans_arma"),
      metrics= c("precision","dunn"),
      variables = FALSE
)

## End(Not run)

laperez/Clustering documentation built on Aug. 1, 2020, 12:54 p.m.