Learning Clusterization

library(LearnClust)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

The LearnClust package allows users to learn how clustering algorithms reach their solutions.

  1. The package implements distances between clusters.

  2. It includes the main functions that return the solution by applying each algorithm.

  3. It contains .details functions that explain the process used to reach the solution, helping users understand how each algorithm works.

Datasets:

We initialize some datasets to use in the algorithms:

cluster1 <- matrix(c(1,2),ncol=2)

cluster2 <- matrix(c(2,4),ncol=2)

weight <- c(0.2,0.8)

vectorData <- c(1,1,2,3,4,7,8,8,8,10)
# vectorData <- c(1:10)

matrixData <- matrix(vectorData,ncol=2,byrow=TRUE)
print(matrixData)

dfData <- data.frame(matrixData)
print(dfData)
plot(dfData)

cMatrix <- matrix(c(2,4,4,2,3,5,1,1,2,2,5,5,1,0,1,1,2,1,2,4,5,1,2,1), ncol=3, byrow=TRUE)

cDataFrame <- data.frame(cMatrix)

Distances

The package includes different types of distance:

edistance(cluster1,cluster2)         # Euclidean distance
mdistance(cluster1,cluster2)         # Manhattan distance
canberradistance(cluster1,cluster2)  # Canberra distance
chebyshevDistance(cluster1,cluster2) # Chebyshev distance
octileDistance(cluster1,cluster2)    # octile distance

Each function has a .details version that explains how the calculation is done.
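For instance, the Euclidean calculation can be traced step by step (this assumes the .details naming pattern used throughout this vignette):

# explains the Euclidean calculation step by step
edistance.details(cluster1,cluster2)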

There are also weighted versions, where a weight is applied to each element. These functions are used in the extra (correlative) algorithm. They are:

edistanceW(cluster1,cluster2,weight)          # weighted Euclidean distance
mdistanceW(cluster1,cluster2,weight)          # weighted Manhattan distance
canberradistanceW(cluster1,cluster2,weight)   # weighted Canberra distance
chebyshevDistanceW(cluster1,cluster2,weight)  # weighted Chebyshev distance
octileDistanceW(cluster1,cluster2,weight)     # weighted octile distance
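The weighted variants presumably follow the same pattern, e.g.:

# assumes the weighted functions also provide .details versions
edistanceW.details(cluster1,cluster2,weight)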

Agglomerative Hierarchical Clustering

This algorithm uses several functions that follow the theoretical process:

  1. We prepare the data for the algorithm, creating a cluster from each value. The input can be of different R types (vector, matrix, or data.frame).
list <- toList(vectorData)

# list <- toList(matrixData)

# list <- toList(dfData)

print(list)
  2. We compute the distance matrix over the clusters from the first step, using the distance and approach types we want.
matrixDistance <- mdAgglomerative(list,'MAN','AVG')
print(matrixDistance)
  3. We get the minimal value of the distance matrix, that is, the distance between the two closest clusters.
minDistance <- minDistance(matrixDistance)
print(minDistance)
  4. With the minimal distance, we look for the pair of clusters separated by exactly that distance. These are the clusters that will be joined.
groupedClusters <- getCluster(minDistance, matrixDistance)
print(groupedClusters)
  5. These two clusters are merged into a new one.
updatedClusters <- newCluster(list, groupedClusters)
print(updatedClusters)
  6. We add the new cluster to the solution and repeat steps 2 to 5 until only one cluster remains (a self-contained sketch of this loop follows).
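As an illustration only, independent of LearnClust's internal helpers, the loop can be sketched in base R with single linkage over Euclidean distances:

# Illustration only: a minimal agglomerative loop in base R
# (single linkage, Euclidean distance), not LearnClust's implementation
clusters <- lapply(seq_len(nrow(matrixData)),
                   function(i) matrixData[i, , drop = FALSE])
while (length(clusters) > 1) {
  n <- length(clusters)
  d <- matrix(Inf, n, n)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    # single linkage: minimal pairwise distance between the two clusters
    full <- as.matrix(dist(rbind(clusters[[i]], clusters[[j]])))
    d[i, j] <- min(full[seq_len(nrow(clusters[[i]])),
                        -seq_len(nrow(clusters[[i]])), drop = FALSE])
  }
  closest <- which(d == min(d), arr.ind = TRUE)[1, ]  # the two closest clusters
  merged <- rbind(clusters[[closest[1]]], clusters[[closest[2]]])
  clusters <- c(clusters[-closest], list(merged))     # replace them by their union
}
print(clusters[[1]])  # all points end up in a single cluster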

The complete function that implements the algorithm is:

agglomerativeExample <- agglomerativeHC(dfData,'EUC','MAX')

plot(agglomerativeExample$dendrogram)
print(agglomerativeExample$clusters)
print(agglomerativeExample$groupedClusters)

The package includes some auxiliary functions used to implement the algorithm. These functions are:

cleanClusters <- usefulClusters(updatedClusters)
print(cleanClusters)
distances <- c(2,4,6,8)

clusterDistanceByApproach <- clusterDistanceByApproach(distances,'AVG')
print(clusterDistanceByApproach)

"clusterDistanceByApproach" get the value using approach type. This type could be "MAX","MIN", and "AVG"

clusterDistance <- clusterDistance(cluster1,cluster2,'MAX','MAN')
print(clusterDistance)

"clusterDistance" get the distance value between each element from one cluster to the other ones using distance type. This type could be "EUC", "MAN", "CAN", "CHE", and "OCT"

Agglomerative Hierarchical Clustering .DETAILS

This version of the algorithm explains what every function does:

  1. How clusters are initialized. The initial data can be of different R types (vector, matrix, or data.frame).
list <- toList.details(vectorData)

# list <- toList.details(matrixData)

# list <- toList.details(dfData)

print(list)
  2. How the distance matrix is created.
matrixDistance <- mdAgglomerative.details(list,'MAN','AVG')
  3. Choosing the minimal distance while avoiding zero values.
minDistance <- minDistance.details(matrixDistance)
  4. Using the minimal distance, how it looks for the clusters separated by that distance.
groupedClusters <- getCluster.details(minDistance, matrixDistance)
  5. With those clusters, how it creates a new one and removes the previous ones from the initial list.
updatedClusters <- newCluster.details(list, groupedClusters)
  6. We add the new cluster to the solution and repeat steps 2 to 5 until only one cluster remains.

The complete function that explains the algorithm is:

agglomerativeExample <- agglomerativeHC.details(vectorData,'EUC','MAX')

Divisive Hierarchical Clustering

This algorithm uses several functions that follow the theoretical process:

  1. We prepare the data for the algorithm, creating a cluster from each value. The input can be of different R types (vector, matrix, or data.frame).
 # list <- toListDivisive(vectorData)

# list <- toListDivisive(matrixData)

 list <- toListDivisive(dfData[1:4,])

print(list)
  2. From every cluster, the algorithm creates all possible subclusters by joining initial clusters.
clustersList <- initClusters(list)
print(clustersList)
  3. We compute the distance matrix over the clusters from the second step, using the distance and approach types we prefer.
matrixDistance <- mdDivisive(clustersList,'MAN','AVG',list)
print(matrixDistance)
  4. We get the maximal value of the distance matrix, that is, the distance between the two farthest clusters.
maxDistance <- maxDistance(matrixDistance)
print(maxDistance)
  5. With the maximal distance, we look for the clusters separated by exactly that distance. These are the clusters that will be divided.
dividedClusters <- getClusterDivisive(maxDistance, matrixDistance)
print(dividedClusters)
  6. Two new subclusters are created from the initial one and added to the solution.

  7. We repeat steps 2 to 5 until no cluster can be divided again (a self-contained sketch of one split follows).
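As an illustration only (none of these names are LearnClust internals), a single split can be sketched in base R by enumerating every bipartition of a small cluster and keeping the two complementary subclusters whose separation is largest, here measured with average linkage over Euclidean distances:

# Illustration only: exhaustively split one small cluster into the two
# complementary subclusters with the largest average-linkage separation
splitCluster <- function(m) {
  n <- nrow(m)
  best <- NULL
  bestDist <- -Inf
  # enumerate bipartitions; row 1 is fixed in the first part to avoid duplicates
  for (code in 0:(2^(n - 1) - 2)) {
    inFirst <- c(TRUE, as.logical(bitwAnd(code, 2^(0:(n - 2)))))
    a <- m[inFirst, , drop = FALSE]
    b <- m[!inFirst, , drop = FALSE]
    cross <- as.matrix(dist(rbind(a, b)))[seq_len(nrow(a)),
                                          -seq_len(nrow(a)), drop = FALSE]
    if (mean(cross) > bestDist) {  # average linkage between the two parts
      bestDist <- mean(cross)
      best <- list(a, b)
    }
  }
  best
}
splitCluster(as.matrix(dfData[1:4, ]))  # the two subclusters of the first split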

The complete function that implements the algorithm is:

divisiveExample <- divisiveHC(dfData[1:4,],'MAN','AVG')
print(divisiveExample)

The package uses the same auxiliary functions as the previous algorithm, plus a check that two subclusters are complementary, i.e., that together they cover all the initial components:

data <- c(1,2,1,3,1,4,1,5)

components <- toListDivisive(data)

cluster1 <- matrix(c(1,2,1,3),ncol=2,byrow=TRUE)
cluster2 <- matrix(c(1,4,1,5),ncol=2,byrow=TRUE)
cluster3 <- matrix(c(1,6,1,7),ncol=2,byrow=TRUE)

complementaryClusters(components,cluster1,cluster2)

complementaryClusters(components,cluster1,cluster3)
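What "complementary" plausibly means here can be sketched in base R; isComplementary is an illustrative helper, not part of LearnClust. Two subclusters pass when their rows together are exactly the rows of the full component set:

# Illustrative helper (not part of LearnClust): two subclusters are taken as
# complementary when their rows together are exactly the full set of components
rowsOf <- function(m) apply(m, 1, paste, collapse = ",")
isComplementary <- function(full, a, b) {
  nrow(a) + nrow(b) == nrow(full) &&
    setequal(c(rowsOf(a), rowsOf(b)), rowsOf(full))
}
allComponents <- matrix(data, ncol = 2, byrow = TRUE)
isComplementary(allComponents, cluster1, cluster2)  # expected TRUE
isComplementary(allComponents, cluster1, cluster3)  # expected FALSE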

Its ".details" version, explains how the functions checks this condition:

complementaryClusters.details(components,cluster1,cluster2)

Divisive Hierarchical Clustering .DETAILS

This version of the algorithm explains what every function does:

  1. How clusters are initialized. The initial data can be of different R types (vector, matrix, or data.frame).
# list <- toListDivisive.details(vectorData)

# list <- toListDivisive(matrixData)

 list <- toListDivisive(dfData[1:4,])

print(list)
  2. How to create all possible clusters to be divided.
clustersList <- initClusters.details(list)
  3. How the distance matrix is created.
matrixDistance <- mdDivisive.details(clustersList,'MAN','AVG',list)
  4. Choosing the maximal distance, that is, the farthest clusters.
maxDistance <- maxDistance.details(matrixDistance)
  5. Using the maximal distance, how it looks for the clusters separated by that distance.
dividedClusters <- getClusterDivisive.details(maxDistance, matrixDistance)
  6. We add the new clusters to the solution and repeat steps 2 to 5 until no cluster can be divided again.

The complete function that explains the algorithm is:

divisiveExample <- divisiveHC.details(dfData[1:4,],'MAN','AVG')
print(divisiveExample)

Correlative Hierarchical Clustering

This example shows how the algorithm works step by step.

  1. Input data is initialized by creating a cluster from each data frame row.

initData <- initData(cDataFrame)
print(initData)
  2. The algorithm checks whether the input target is acceptable; if not, it initializes the target.
target <- c(1,2,3)

initTarget <- initTarget(target,cDataFrame)
print(initTarget)
  3. If the user requests it, the algorithm normalizes the weight values.
weight <- c(5,7,6)

weights <- normalizeWeight(TRUE,weight,cDataFrame)
print(weights)
  4. It calculates distances between clusters, applying the given weights and distance definition.
cluster1 <- matrix(c(1,2,3),ncol=3)
cluster2 <- matrix(c(2,5,8),ncol=3)

weight <- c(3,7,4)

distance <- distances(cluster1,cluster2,'CHE',weight)
print(distance)

Finally, the complete algorithm sorts the distances and the clusters as well. It presents the solution as a sorted cluster list, with the distances, or as a dendrogram.
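As an illustration only, the core idea can be sketched in base R by ranking each row by its weighted distance to the target; a weighted Manhattan distance is assumed here (LearnClust supports the distance types listed earlier):

# Illustration only: rank rows of cMatrix by an assumed weighted
# Manhattan distance to the target
target <- c(5, 5, 1)
weight <- c(3, 7, 5)
d <- apply(cMatrix, 1, function(row) sum(weight * abs(row - target)))
print(cMatrix[order(d), ])  # rows sorted from closest to farthest

The package's correlationHC function below runs the complete process: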

target <- c(5,5,1)

weight <- c(3,7,5)

correlation <- correlationHC(cDataFrame, target,  weight)

print(correlation$sortedValues)

print(correlation$distances)

plot(correlation$dendrogram)

Correlative Hierarchical Clustering .DETAILS

This example shows how the algorithm works step by step.

  1. How input data is initialized.
initData <- initData.details(cDataFrame)
  2. How the algorithm checks whether the input target is acceptable and, if not, how it initializes the target.
targetValid <- c(1,2,3)

targetInvalid <- c(1,2)

initTarget <- initTarget.details(targetValid,cDataFrame)

initTarget <- initTarget.details(targetInvalid,cDataFrame)
  3. How the normalization process is done.
weight <- c(5,7,6)

weights <- normalizeWeight.details(TRUE,weight,cDataFrame)

weights <- normalizeWeight.details(FALSE,weight,cDataFrame)

weights <- normalizeWeight.details(FALSE,NULL,cDataFrame)
  4. How it calculates distances between clusters, applying the given weights and distance definition.
cluster1 <- matrix(c(1,2,3),ncol=3)
cluster2 <- matrix(c(2,5,8),ncol=3)

weight <- c(3,7,4)

distance <- distances.details(cluster1,cluster2,'CHE',weight)

The complete function that explains the algorithm is:

target <- c(5,5,1)

weight <- c(3,7,5)

correlation <- correlationHC.details(cDataFrame, target,  weight)

print(correlation$sortedValues)

print(correlation$distances)

plot(correlation$dendrogram)

