To the best of our knowledge, four issues remain somewhat under-explored in the existing body of research on clustering algorithms, and therefore represent an opportunity for improvement.
Decision tree-based clustering approaches offer a way to address the above issues in a reasonable manner: (a) the idea of splitting a node based on the values of a feature is perhaps most naturally applied to categorical data; (b) the tree structure allows for easy interpretability (each cluster can be viewed as an if-then rule leading from the root of the tree, i.e., the entire dataset, to a leaf node, i.e., the cluster) and makes it possible to discover hierarchies among clusters; (c) the approach leads to stable results that do not vary across multiple runs; and (d) since the approach works on increasingly smaller subsets of the dataset, the complexity is sub-quadratic.
Existing tree-based methods in the literature [8] are built on the following assumption: clusters represent non-random agglomerations of points in the feature space and can therefore be discovered by a decision tree that attempts to distinguish between points in the actual dataset and randomly generated points. While this approach shows much promise, its effectiveness depends on the precise assumptions used to generate the random data.
We propose a different approach to tree construction: we hypothesize that a non-random agglomeration would manifest itself through interdependence between the features. In other words, if one categorical feature can be expressed well in terms of the others, then its values represent a separation of points in the feature space. We apply this principle inside a recursive partitioning algorithm to generate the tree, and therefore the clusters.
The principal intuition behind the design of our proposed algorithm is as follows: if we wish to split a node in the tree based on a feature, it has to be one that divides the dataset into distinct groups. A useful heuristic, therefore, is to choose the feature that can be best predicted from the others. Further splits are made on the same principle, subject to caveats such as the number of data points left to split.
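To make the heuristic concrete, here is a minimal sketch, not the package's implementation: each candidate feature is scored by how well it can be predicted from the remaining features, using `rpart` and in-sample accuracy as a stand-in for whatever performance measure the algorithm actually uses. The function name is illustrative.

```r
library(rpart)

# A minimal sketch (NOT the CURD implementation): score each categorical
# feature by how well an rpart model predicts it from the other features.
# The feature with the highest score is the split candidate under this
# simplified criterion.
ScoreSplitCandidates = function( dtData, vFeatures ) {
  # coerce to factors so rpart treats every feature as categorical
  dfData = as.data.frame( lapply( dtData[, vFeatures, with = FALSE], as.factor ) )
  sapply(
    vFeatures,
    function( cFeature ) {
      mdl = rpart(
        as.formula( paste( cFeature, '~ .' ) ),
        data = dfData,
        method = 'class'
      )
      # in-sample accuracy of predicting this feature from all the others
      mean( predict( mdl, dfData, type = 'class' ) == dfData[[cFeature]] )
    }
  )
}

# e.g. names(which.max(ScoreSplitCandidates(dtDataset, vPredictorFeatures)))
# would be the first split feature under this simplified criterion.
```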
```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 12,
  fig.height = 12
)
```
```r
# CLEAR THE WORKSPACE
rm(list = ls())
set.seed(1)
# gc()

library(data.table)
library(CURD)
library(randomForest)
library(rpart)
library(cluster)
library(clue)
library(ggplot2)
library(ggrepel)
library(gridExtra)
library(scales)
library(DT)

theme_set(theme_bw(12))

# After running the model, should there also be some summaries created?
bDoAnalysis = T

# Should the results be compared with some other algos
bCompareOtherAlgos = T
bDoKMeans = T
bDoHierarachical = T

# Comparing MBKM needs some other code which we can't publish
# at the moment.
bDoMBKM = F
```
We will use the Congressional Voting Records dataset from the UCI Machine Learning Repository, available here as `dtUCIVotingRecords`.
```r
# You can replace dtUCIVotingRecords with any other dataset you want to
# run this on. Review the LoadingData chunk and the ModelParameters chunk.

# Removing missing data rows for convenience
dtDataset = dtUCIVotingRecords[
  !apply(
    dtUCIVotingRecords,
    1,
    function(x) { any(x == '?') }
  )
]
# rm(dtUCIVotingRecords)

dtDataset = dtDataset[complete.cases(dtDataset)]

setnames(
  dtDataset,
  make.names(colnames(dtDataset))
)
```
Some pre-processing needs to happen on the data. This should eventually be handled within the algorithm's code, but for now it is listed separately here.
```r
cClassLabelFeat = 'Class'

# Choose either RF or Rpart right now. If only one independent feature remains,
# it defaults to Rpart
cClassifier = "RF"

# number of layers the tree is allowed to have
nMaxTreeDepth = 6

# Explain
bKeepFeatureInMixAfterSplit = FALSE
# Explain
bKeepFeatureInPredAfterSplit = TRUE
# Explain
bUseOptimalChunkPerformance = TRUE

# Features to use (for predicting, node splitting)
# todo, what's the difference between vNodeCandidates and vPredictorFeatures?
vNodeCandidates = setdiff( colnames(dtDataset), cClassLabelFeat )
vPredictorFeatures = setdiff( colnames(dtDataset), cClassLabelFeat )

# how many observations must be there for a split to be attempted
nMinRows = max( 5, round(0.1 * nrow(dtDataset)) )

# Some other columns that will be required
vKeyFeatures = 'ID'
dtDataset[, ID := .I]
dtDataset[, ClusteringRule := "Root"]
dtDataset[, ParentRule := "Root"]
dtDataset[, NodeLabel := "Root"]
dtDataset[, Depth := 1]
vRows = 1:nrow(dtDataset)

# Initial node
lRootNode = list(
  cNodeLabel = 'Root',
  cClusteringRule = "Root",
  nDepth = 1,
  vRows = vRows,
  cParentSplitFeat = "",
  vParentSplitVals = c(),
  cSplitFeat = ""
)

# Mark out ordinal features
# These can be handled better than categorical features
vcOrdinalFeatures = c()

# Listing out features and associated details
dtFeatureChunking = data.table( NodeCandidate = vNodeCandidates )
dtFeatureChunking[, FeatureType := 'Nominal' ]
dtFeatureChunking[ NodeCandidate %in% vcOrdinalFeatures, FeatureType := 'Ordinal' ]

# We can allow for chunking of ordinal features, i.e. grouping neighbouring
# ordinal values into one bucket
dtFeatureChunking[, Chunk := FALSE ]
dtFeatureChunking[ NodeCandidate %in% vcOrdinalFeatures, Chunk := T ]
dtFeatureChunking[ NodeCandidate %in% vcOrdinalFeatures, MinNumChunks := 2 ]
dtFeatureChunking[ NodeCandidate %in% vcOrdinalFeatures, MaxNumChunks := 3 ]
```
The clustering algorithm can be run on this dataset using the function `RecursivePartitioningTreeClustering`.
```r
lTree = RecursivePartitioningTreeClustering(
  dtDataset = dtDataset,
  vRows = vRows,
  vNodeCandidates = vNodeCandidates,
  vPredictorFeatures = vPredictorFeatures,
  vKeyFeatures = vKeyFeatures,
  dtFeatureChunking = dtFeatureChunking,
  lTree = lRootNode,
  nTreeDepth = 0,
  nMinRows = nMinRows,
  nMaxTreeDepth = nMaxTreeDepth,
  cClassifier = cClassifier,
  bKeepFeatureInMixAfterSplit = bKeepFeatureInMixAfterSplit,
  bKeepFeatureInPredAfterSplit = bKeepFeatureInPredAfterSplit,
  bUseOptimalChunkPerformance = bUseOptimalChunkPerformance
)
# table(dtDataset$ClusteringRule)
# View(dtDataset)
```
Using the `AssignCluster` function, we can obtain the cluster assignment for each observation in the dataset.
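As a quick illustration of its use, mirroring the call made in the analysis chunk below; the structure of the returned object is an assumption to verify against the package documentation:

```r
# Mirrors the AssignCluster( lTree, dtDataset ) call used in the analysis
# below; the returned object is assumed to carry one cluster assignment per
# observation and is inspected here only to illustrate the workflow.
dtAssignments = AssignCluster( lTree, dtDataset )
str( dtAssignments )
```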
There are various other functions to help us analyse the results further. These functions are needed only in demonstrations such as this one, since in real problems we will not have access to the true labels of the dataset.
```r
if ( bDoAnalysis ) {

  # ---------------------------------------------------------------------------
  iNbrClasses = length( unique( unlist( dtDataset[, cClassLabelFeat, with = F ] ) ) )

  dtClassClustering = QualityOfClustering(
    AssignCluster( lTree, dtDataset ),
    cClassLabelFeat
  )

  dtNewClusteringObject = TreeStretchedToBottom(
    lTree = lTree,
    iParentNodeIndex = 0,
    bStretch = T
  )

  dtNewClusteringObject = merge(
    dtNewClusteringObject,
    setnames(
      dtClassClustering[,
        list(
          NodePrediction = NodeLabel[which.max(LayerCumPct)],
          Proportion = max(LayerCumPct)
        ),
        list(Node)
      ],
      'Node', 'NodeLabel'
    ),
    'NodeLabel'
  )

  ggplot(dtNewClusteringObject[Stretch == F]) +
    geom_segment(
      aes( y = LayerNbr - 1, yend = LayerNbr, x = ParentNodeIndex, xend = NodeLabel )
    ) +
    geom_point(
      aes( y = LayerNbr, x = NodeLabel, color = NodePrediction, size = Members )
    ) +
    geom_text_repel(
      aes(
        y = LayerNbr,
        x = NodeLabel,
        label = paste(NodeColumnName, ':', NodeColumnValue),
        color = NodePrediction
      )
    ) +
    # geom_text_repel(aes(y = LayerNbr, x = NodeLabel, label = paste(NodeLabel,':',NodeColumnName,':',NodeColumnValue))) +
    # scale_colour_discrete(guide = 'none') +
    labs( list( title = 'The tree', y = NULL ) ) +
    theme(
      # axis.text.x = element_text(
      #   angle = 90,
      #   hjust = 1
      # ),
      axis.title.y = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.title.x = element_blank(),
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank(),
      legend.position = 'bottom',
      panel.grid = element_blank()
    )

}
```
```r
if ( bDoAnalysis ) {

  #----------------------------------------------------------------------------
  # Evaluating the clustering tree from a classification standpoint

  # Create a tree summary dataset
  # todo, default dtTreeSummary to NULL and handle initialisation inside
  # the function
  dtTreeSummary = TraverseTreeToMeasureClassificationQuality(
    dtTreeSummary = data.table(),
    dtDataset = dtDataset,
    lTree = lTree,
    cClassLabelFeat = cClassLabelFeat
  )

  # See classification accuracy as a function of depth
  dtClassificationAccuracyByClassDepth = GetClassificationAccuracyByClassDepth( dtTreeSummary )

  ggplot(
    data = dtClassificationAccuracyByClassDepth[Depth > 1],
    aes(
      x = Depth,
      y = NodeMajorityPc,
      group = NodeMajorityClass,
      colour = NodeMajorityClass
    )
  ) +
    geom_line() +
    geom_point() +
    scale_y_continuous(labels = percent) +
    labs(
      list(
        title = 'Change in clustering accuracy over layers',
        caption = 'At each layer of the tree, the proportion of rows correctly assigned to a particular label. This is calculated across the effective remaining tree if you were to cut all the layers off below the specified layer, i.e. all leaf nodes up to that layer, and all non-leaf nodes in that layer.',
        y = 'Proportion of data correctly clustered.'
      )
    ) +
    theme( legend.position = 'bottom' )

  # View(dtClassificationAccuracyByClassDepth)

}
```
```r
if ( bDoAnalysis ) {

  ggplot( dtClassClustering ) +
    geom_point(aes(x = factor(Node), y = PctOfNode, color = Class, size = Count)) +
    facet_wrap(~Layer, scale = 'free_x', nrow = 1) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    scale_y_continuous(labels = percent) +
    labs(
      list(
        title = 'Node accuracy',
        caption = 'Each point is a unique node - layer - class combination. Shows the proportion in the node (y) of each class (colour), and the number of members (size), across each layer. Each node of a layer lies on the x axis. Helps identify nodes which are not being clustered well.',
        y = 'Pct of node',
        x = 'Node ID'
      )
    ) +
    theme( legend.position = 'bottom' )

}
```
```r
if ( bDoAnalysis ) {

  ggplot(dtClassClustering) +
    geom_point(aes(x = NodeLabel, y = PctOfNode, color = Class, size = Count)) +
    facet_wrap(~Layer, scale = 'free_x') +
    labs(
      list(
        title = 'Label accuracy',
        caption = 'Each point is a unique node - layer - class combination. Shows the proportion in the node (y) of each class (colour), and the number of members (size), across each layer. The x axis is the label assigned to the node. Helps identify classes which are not being clustered well.',
        y = 'Pct of node',
        x = 'Label'
      )
    ) +
    scale_y_continuous(labels = percent) +
    theme( legend.position = 'bottom' )

}
```
```r
if ( bDoAnalysis & F ) {

  ggplot(dtClassClustering) +
    geom_point(aes(x = LayerCumPct, y = PctOfNode, color = Correct, size = Count)) +
    facet_wrap(~Layer) +
    labs(
      list(
        title = 'Spread of population clustering accuracy',
        caption = 'Each point is a unique node - layer - class combination. Shows the proportion in the node (y) of each class (colour), and the number of members (size), across each layer. The CDF of all the data lies along the x axis. The data is sorted by proportion in the node. The least accurately clustered node-class lies on the bottom left, and the most accurately clustered node-class on the top right. Helps see the spread of the accuracy of the clustering.'
      )
    ) +
    scale_y_continuous(labels = percent) +
    theme( legend.position = 'bottom' )

}
```
We will compare the algorithm with hierarchical clustering using Ward's method and with k-means.
The tree is cut at successive depths to obtain partitions with different numbers of clusters. Hierarchical clustering (`agnes` with Ward's method) is run on a simple matching dissimilarity between observations, i.e. the number of features on which two rows disagree, while k-means is run on a one-hot encoded version of the categorical features (ordinal features, if any, are converted to numeric). For each method and number of clusters, every cluster is assigned its majority class and the proportion of misattributed observations is reported.
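For reference, here is a minimal standalone sketch of that simple matching dissimilarity; the comparison chunk below builds the same quantity with data.table joins, which scales better. The function name is illustrative.

```r
# A minimal sketch of the simple matching dissimilarity used for the
# hierarchical comparison: entry (i, j) counts the features on which
# rows i and j take different values.
SimpleMatchingDistance = function( dtData, vFeatures ) {
  mData = as.matrix( dtData[, vFeatures, with = FALSE] )
  nRows = nrow( mData )
  mDist = matrix( 0, nRows, nRows )
  for ( i in seq_len( nRows ) ) {
    for ( j in seq_len( nRows ) ) {
      mDist[i, j] = sum( mData[i, ] != mData[j, ] )
    }
  }
  as.dist( mDist )
}

# e.g. agnes( SimpleMatchingDistance( dtDataset, vPredictorFeatures ),
#             diss = TRUE, method = 'ward' )
```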
```r
if ( bCompareOtherAlgos ) {

  # ---------------------------------------------------------------------------
  # Summarising results from this algo
  # ---------------------------------------------------------------------------
  dtResultsAlgo = data.table()
  for ( iDepth in c(2:dtTreeSummary[, max(Depth)]) ) {

    # if ( ( iNbrClasses ^ (iDepth - 1) ) > 16 ) {
    #   break
    # }

    dtDataset[,
      TreeCut := gsub(
        x = NodeLabel,
        pattern = paste0( '(^', paste0( rep( '.*?>', iDepth ), collapse = '' ), ').*' ),
        replacement = '\\1'
      )
    ]

    dtThisResult = data.table( table( dtDataset[, c( cClassLabelFeat, 'TreeCut' ), with = F ] ) )
    setnames( dtThisResult, c('Class','Cluster2','Members') )
    dtThisResult[, Cluster := .GRP, Cluster2]
    dtThisResult[, Cluster2 := NULL]
    dtThisResult[, NbrClusters := length(unique(Cluster)) ]

    dtResultsAlgo = rbind( dtResultsAlgo, dtThisResult, fill = T )
    setDT(dtResultsAlgo)

  }
  dtResultsAlgo[, Method := 'Algo']

  # ---------------------------------------------------------------------------
  # Summarising results from hierarchical
  # ---------------------------------------------------------------------------
  dtResultsHierarachical = data.table()
  if ( bDoHierarachical ) {

    mDistanceMatrix = melt(
      copy( dtDataset[, vPredictorFeatures, with = F ] )[, SNO := .I ],
      measure.vars = vPredictorFeatures
    )

    if ( T ) {

      # this is slower but has lesser memory footprint.
      # have commented the other logic out
      mDistanceMatrixAggregated = merge(
        mDistanceMatrix[variable == vPredictorFeatures[1]],
        mDistanceMatrix[variable == vPredictorFeatures[1]],
        c('variable'),
        allow.cartesian = T
      )[, list( SNO.x, SNO.y, N = 0 ) ]

      for ( cPredictorFeature in vPredictorFeatures ) {

        # print(cPredictorFeature)
        mDistanceMatrixAggregated = rbind(
          mDistanceMatrixAggregated,
          merge(
            mDistanceMatrix[variable == cPredictorFeature],
            mDistanceMatrix[variable == cPredictorFeature],
            c('variable'),
            allow.cartesian = T
          )[ value.x == value.y, list( SNO.x, SNO.y, N = 1 ) ]
        )
        setDT(mDistanceMatrixAggregated)
        mDistanceMatrixAggregated = mDistanceMatrixAggregated[,
          list( N = sum(N, na.rm = T) ),
          list(SNO.x, SNO.y)
        ]

      }

      mDistanceMatrix = mDistanceMatrixAggregated
      mDistanceMatrix[, N := length(vPredictorFeatures) - N ]
      rm(mDistanceMatrixAggregated)

    } else {

      mDistanceMatrix = rbindlist(
        lapply(
          vPredictorFeatures,
          function (x) {
            # print(x)
            merge(
              mDistanceMatrix[variable == x],
              mDistanceMatrix[variable == x],
              c('variable'),
              allow.cartesian = T
            )[ value.x == value.y, list( SNO.x, SNO.y ) ]
          }
        )
      )
      mDistanceMatrix = mDistanceMatrix[,
        list( N = as.integer(as.character(.N)) ),
        list(SNO.x, SNO.y)
      ][, N := length(vPredictorFeatures) - N ]

    }

    mDistanceMatrix = dcast( mDistanceMatrix, SNO.x~SNO.y, value.var = 'N', fun.aggregate = mean )
    mDistanceMatrix[is.na(mDistanceMatrix)] = 0

    fPrintTable = function( k, vcValues, Tree ) {
      # print(
      table(
        cbind(
          vcValues,
          cutree( Tree, k = k )
        )
      )
      # )
    }

    for ( cMethod in c('ward') ) {
    # for ( cMethod in c('single','complete','ward','weighted')) {

      # print(cMethod)
      HierarchicalModel = agnes( mDistanceMatrix[, -1], diss = T, method = cMethod )

      for ( i in dtResultsAlgo[, unique(NbrClusters)] ) {

        # if ( i > 16 ) {
        #   break
        # }

        dtResultsHierarachical = rbind(
          dtResultsHierarachical,
          data.table(
            fPrintTable(
              k = i,
              vcValues = dtDataset[, cClassLabelFeat, with = F][, cClassLabelFeat, with = F],
              Tree = HierarchicalModel
            )
          )[, NbrClusters := i][, Method := paste0('H:',cMethod)],
          fill = T
        )

      }

    }

    rm(mDistanceMatrix)
    setDT(dtResultsHierarachical)
    setnames(
      dtResultsHierarachical,
      c( 'Class', 'Cluster', 'Members', 'NbrClusters', 'Method' )
    )

  }

  # ---------------------------------------------------------------------------
  # Summarising results from k-modes
  # ---------------------------------------------------------------------------
  dtResultsMBKM = data.table()
  if ( bDoMBKM ) {

    dtDataset[, SNO2 := .I ]

    lIndicatorDataset = CreateIndicatorDatasetFromCategorical(
      dtCategoricalDataset = dtDataset[, c(vPredictorFeatures, 'SNO2'), with = F],
      vCategoricalFeatures = vPredictorFeatures,
      vKeyFeatures = c('SNO2')
    )

    for ( i in dtResultsAlgo[, unique(NbrClusters)] ) {

      # if ( i > 16 ) {
      #   break
      # }

      lClusteringResults = MiniBatchKModesVectorized(
        dtClusteringDataset = lIndicatorDataset$dtIndicatorDataset,
        vIndicatorFeatures = lIndicatorDataset$vIndicatorFeatures,
        vCategoricalFeatures = lIndicatorDataset$vCategoricalFeatures,
        vKeyFeatures = c('SNO2'),
        vCatNumPerFeature = lIndicatorDataset$vCatNumPerFeature,
        nNumClusters = i,
        nBatchSize = nrow(dtDataset),
        nMaxIter = 10,
        bBatchUpdate = FALSE,
        cDistanceMetric = 'hellinger',
        vFeatureWeights = array(1, length(lIndicatorDataset$vCategoricalFeatures)),
        lInitCenters = list(method = "multitry", centers = NULL)
      )

      qwe = MeasureDistancesFromClusterCentersVectorized(
        dtClusteringDataset = lIndicatorDataset$dtIndicatorDataset,
        dtClusterCenters = lClusteringResults$vCenters,
        vIndicatorFeatures = lIndicatorDataset$vIndicatorFeatures,
        vCategoricalFeatures = lIndicatorDataset$vCategoricalFeatures,
        vKeyFeatures = lIndicatorDataset$vKeyFeatures,
        vCatNumPerFeature = lIndicatorDataset$vCatNumPerFeature,
        cDistanceMetric = 'hellinger',
        vFeatureWeights = array(1, length(lIndicatorDataset$vCategoricalFeatures))
      )

      # for (c in 1:length(lClusteringResults$lCenters))
      # {
      #   print(lClusteringResults$lCenters[[c]]$center)
      # }

      dtResultsMBKM = rbind(
        dtResultsMBKM,
        data.table(
          table(
            merge(
              dtDataset[, c('SNO2', cClassLabelFeat), with = F],
              qwe$centnum,
              'SNO2'
            )[, c(cClassLabelFeat, 'clusternum'), with = F ]
          )
        )[, NbrClusters := i ][, Method := 'MBKM' ],
        fill = T
      )

    }

    setnames(
      dtResultsMBKM,
      c( 'Class', 'Cluster', 'Members', 'NbrClusters', 'Method' )
    )

  }

  # ---------------------------------------------------------------------------
  # Summarising results from k-means
  # ---------------------------------------------------------------------------
  dtResultsKM = data.table()
  if ( bDoKMeans ) {

    dtKMDataset = dtDataset[, c(vPredictorFeatures, cClassLabelFeat), with = F ]
    dtKMDataset[, SNO := .I]

    for ( i in seq(length(vPredictorFeatures)) ) {

      # print(vPredictorFeatures[i])
      if ( grepl(x = vPredictorFeatures[i], pattern = 'Ord') ) {

        dtKMDataset[,
          vPredictorFeatures[i] := as.numeric(gsub(
            x = get(vPredictorFeatures[i]),
            pattern = '\\(.*',
            replacement = ''
          ))
        ]

      } else {

        dtToMerge = dcast(
          dtKMDataset[, c('SNO', vPredictorFeatures[i]), with = F],
          as.formula(paste0('SNO~', vPredictorFeatures[i])),
          fun.aggregate = length,
          value.var = vPredictorFeatures[i]
        )
        setnames(
          dtToMerge,
          setdiff(colnames(dtToMerge), 'SNO'),
          paste0(vPredictorFeatures[i], '.', setdiff(colnames(dtToMerge), 'SNO'))
        )
        dtKMDataset = merge(
          dtKMDataset[, !vPredictorFeatures[i], with = F ],
          dtToMerge,
          'SNO'
        )

      }

    }
    dtKMDataset[, SNO := NULL]

    for ( i in dtResultsAlgo[, unique(NbrClusters)] ) {

      setDT(dtKMDataset)
      kmeansForComparison = kmeans(
        x = dtKMDataset[, !c(cClassLabelFeat, 'Cluster'), with = F],
        centers = i
      )
      dtKMDataset[,
        Cluster := cl_predict(
          kmeansForComparison,
          dtKMDataset[, !c(cClassLabelFeat, 'Cluster'), with = F]
        )
      ]

      dtKMDatasetResults = dtKMDataset[, list(Members = .N), c('Cluster', cClassLabelFeat) ]
      setnames( dtKMDatasetResults, cClassLabelFeat, 'Class' )

      dtResultsKM = rbind(
        dtResultsKM,
        dtKMDatasetResults[, NbrClusters := i ][, Method := 'KM' ],
        fill = T
      )

    }

    setnames(
      dtResultsKM,
      c( 'Cluster', 'Class', 'Members', 'NbrClusters', 'Method' )
    )

  }

}
```
```r
if ( bCompareOtherAlgos ) {

  # Comparing results from various algos
  # ---------------------------------------------------------------------------
  dtResults = data.table()
  dtResults = rbind(dtResults, dtResultsMBKM, fill = T)
  dtResults = rbind(dtResults, dtResultsKM, fill = T)
  dtResults = rbind(dtResults, dtResultsHierarachical, fill = T)
  dtResults = rbind(dtResults, dtResultsAlgo, fill = T)
  setDT(dtResults)

  dtResults[, Cluster := as.integer(Cluster)]

  dtResults = merge(
    dtResults,
    setnames(
      dtResults[,
        expand.grid( min(Cluster):max(Cluster), unique(Class) ),
        list( NbrClusters, Method )
      ],
      c('Var1','Var2'),
      c('Cluster','Class')
    ),
    by = c('Method','NbrClusters','Cluster','Class'),
    all = T
  )

  dtResults[is.na(Members), Members := 0]

  setkey( dtResults, Method, NbrClusters, Cluster, Class )
  dtResults[, Members_pct := Members / sum(Members), list(Method, NbrClusters, Cluster) ]
  dtResults[, ClusterMembers := max(cumsum(Members)), list(Method, NbrClusters, Cluster) ]
  dtResults[, MembersPrevious_pct := c(0, head(Members_pct, -1)), list(Method, NbrClusters, Cluster) ]

  setkey( dtResults, Method, NbrClusters, Cluster )
  dtResults[, ClusterMembersPrevious := c(0, head(ClusterMembers, -1)), list(Method, NbrClusters, Cluster, Class) ]

  dtResults[, AssignedClass := Class[which.max(Members)], list(Method, NbrClusters, Cluster) ]

}
```
```r
if ( bCompareOtherAlgos ) {

  # datatable(
  #   dcast(
  #     dtResults[,
  #       list(
  #         Propotion_pct = sum(Members_pct)
  #       ),
  #       list(
  #         Method,
  #         CorrectlyClustered = AssignedClass == Class,
  #         NbrClusters,
  #         Cluster
  #       )
  #     ][
  #       CorrectlyClustered == F
  #     ][,
  #       list(
  #         MeanPropotion_pct = mean(Propotion_pct)
  #       ),
  #       list(
  #         Method,
  #         CorrectlyClustered,
  #         NbrClusters
  #       )
  #     ][,
  #       # Dataset := cRunName
  #     ],
  #     # Dataset + NbrClusters ~ Method,
  #     NbrClusters ~ Method,
  #     value.var = 'MeanPropotion_pct',
  #     fun.aggregate = function( x ) { paste( round(x * 100), '%' )}
  #
  #   ),
  #   rownames = F,
  #   caption = 'Mean error across each cluster'
  # )

  ggplot(
    dtResults[,
      list( Propotion_pct = sum(Members_pct) ),
      list(
        Method,
        CorrectlyClustered = AssignedClass == Class,
        NbrClusters,
        Cluster
      )
    ][
      CorrectlyClustered == F
    ][,
      list( Misattribution_pct = mean(Propotion_pct) ),
      list( Method, CorrectlyClustered, NbrClusters )
    ]
  ) +
    geom_bar( aes( x = Method, y = Misattribution_pct ), stat = 'identity' ) +
    facet_grid(~NbrClusters) +
    scale_y_continuous(labels = percent) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs( list( title = paste0('Average % data misattributed across clusters') ) )

}
```
```r
if ( bCompareOtherAlgos ) {

  ggplot(
    dtResults[,
      list( Misattribution_pct = sum(Members[AssignedClass != Class]) / sum(Members) ),
      list( Method, NbrClusters )
    ]
  ) +
    geom_bar( aes( x = Method, y = Misattribution_pct ), stat = 'identity' ) +
    facet_grid(~NbrClusters) +
    scale_y_continuous(labels = percent) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs( list( title = paste0('Total % data misattributed') ) )

}
```
```r
if ( bCompareOtherAlgos ) {

  ggplot(dtResults) +
    geom_text(
      aes(
        x = Class,
        y = Cluster,
        label = paste0(Members, ' (', round(100 * Members_pct, 0), '%)'),
        color = Class
      )
    ) +
    facet_grid(Method~NbrClusters) +
    scale_colour_discrete(guide = 'none') +
    scale_y_continuous(
      breaks = seq(max(dtResults[, Cluster])),
      limits = c( min(dtResults[, Cluster]) - 1, max(dtResults[, Cluster]) + 1 )
    ) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs( list( title = paste0('Misattribution% by cluster') ) )

}
```
```r
if ( bCompareOtherAlgos ) {

  ggplot(dtResults) +
    geom_bar(aes(x = Cluster, y = Members, fill = Class), position = 'stack', stat = 'identity') +
    coord_flip() +
    facet_grid(Method~NbrClusters) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs( list( title = paste0('Count by cluster') ) )

}
```
```r
if ( bCompareOtherAlgos & F ) {

  # ---------------------------------------------------------------------------
  # Tabulating for paper
  # ---------------------------------------------------------------------------

  for ( iDepth in c(2:4) ) {

    dtDataset[,
      TreeCut := gsub(
        x = NodeLabel,
        pattern = paste0( '(^', paste0( rep( '.*?>', iDepth ), collapse = '' ), ').*' ),
        replacement = '\\1'
      )
    ]

    # print(
    #   table(
    #     dtDataset[,
    #       c(
    #         cClassLabelFeat,
    #         'TreeCut'
    #       ),
    #       with = F
    #     ]
    #   )
    # )

    qwe = setnames(
      data.table(
        fPrintTable(
          k = iDepth,
          vcValues = dtDataset[, cClassLabelFeat, with = F],
          Tree = HierarchicalModel
        )
      ),
      c('Class','Cluster','Members')
    )

    qwe[, Cluster := as.integer(Cluster)]
    qwe = merge(
      qwe,
      qwe[, list(Cluster = min(qwe[,Cluster]):max(qwe[,Cluster])), Class],
      by = c('Cluster','Class'),
      all = T
    )
    qwe[is.na(Members), Members := 0]

    setkey(qwe, Cluster, Class)
    qwe[, Members_pct := Members / sum(Members), Cluster]
    qwe[, ClusterMembers := cumsum(Members)][, ClusterMembers := max(ClusterMembers), Cluster]
    qwe[, MembersPrevious_pct := c(0, head(Members_pct, -1)), Cluster]

    setkey(qwe, Class, Cluster)
    qwe[, ClusterMembersPrevious := c(0, head(ClusterMembers, -1)), Class]

    grid.arrange(
      ggplot(qwe) +
        geom_text(aes(x = Class, y = Cluster, label = paste0(Members, ' (', round(100*Members_pct,0), '%)'), color = Class)),
      ggplot(qwe) +
        geom_bar(aes(x = Cluster, y = Members, fill = Class), color = 'black', stat = 'identity') +
        coord_flip(),
      ggplot(qwe) +
        geom_bar(aes(x = Cluster, y = Members_pct, fill = Class), color = 'black', stat = 'identity') +
        coord_flip(),
      ggplot(qwe) +
        geom_rect(aes(ymin = ClusterMembers, ymax = ClusterMembersPrevious, xmin = MembersPrevious_pct, xmax = Members_pct + MembersPrevious_pct, fill = Class), color = 'black') +
        ylab('Composition of total') +
        xlab('% Composition of cluster'),
      name = paste('qweqwe', iDepth)
    )

    dtDataset[, TreeCut := NULL]

  }
}
```