R/step3clusters.R

#' @title Cluster Trajectories According to the Subset of Measures Selected Previously
#' @description Classify trajectories based on the factors identified in step2factors. 
#' 
#' 
#' @param trajFactors Object generated by \code{\link{step2factors}}. Contains
#'data factors, eigenvalues, principal factors as well as
#'the original data.
#'@param nstart Integer designating the number of random
#'seedings that \code{\link[stats]{kmeans}} should try in order to cluster the
#'trajectories. Defaults to 50.
#' @param nclusters Integer number indicating the number
#'of clusters to use in order to classify the trajectories. If \code{NULL},
#'the function selects the number of clusters based on the automated criterion specified by \code{criteria}. Defaults
#'to \code{NULL}.
#' @param criteria String indicating the criterion used to
#'select the number of clusters. Defaults to
#'\code{"ccc"} (cubic clustering criterion).
#' @param forced.factors (Optional) Vector containing the names of the measures calculated in
#'\code{\link{step1measures}} to force as factors for the clustering. This vector will override the factors selected by \code{\link{step2factors}}. Available options: "m1", "m2", "m3", \dots ,"m23" and "m24".  Defaults to \code{NULL}. See details.
#'
#'
#' @return The function returns a \code{traj} object that contains objects carried through steps 1 and 2 which includes the original data, measures and factors.
#' 
#' Furthermore, it includes a data.frame containing the ID corresponding to each trajectory, and the cluster number in which the trajectory was classified. This is stored in the \code{clusters} field of the \code{traj} object. It also contains the cluster distribution of the observations.
#' 
#' Methods to plot the output of \code{step3clusters} include:\cr
#'\item{plot}{plots a 10-person sample from every cluster}
#'\item{\link{plotMedTraj}}{plots the median trajectory of the clusters}
#'\item{\link{plotMeanTraj}}{plots the mean trajectory of the clusters}
#'\item{\link{plotBoxplotTraj}}{produces a boxplot of the trajectories of every cluster}
#'
#' @details 
#'
#'  If \code{nclusters} is set to \code{NULL}, the function will use the
#'\code{\link[NbClust]{NbClust}} function to select the
#'optimal number of clusters. The \code{NbClust} function
#'uses \code{kmeans} as the cluster analysis method. The measures are standardized within
#'the \link{step3clusters} function prior to clustering. The criteria
#'to be computed can be chosen by the \code{criteria} argument.
#'The list of available methods and criteria can be found
#'in the \code{NbClust} help page. Criteria compatible with \code{step3clusters} are:
#'  "ch", "kl", "ccc", "hartigan", "scott", "trcovw", "tracew" and "friedman". It is important 
#'to note that some of these criteria will not always yield the same number of clusters when 
#'run multiple times. Increasing \code{nstart} will generally stabilize the results.
#'
#'The function then uses \code{\link[stats]{kmeans}} in order to cluster the trajectories
#'in the required number of clusters. If \code{nclusters} is
#'set to \code{NULL}, then the number of clusters is computed by
#'\code{NbClust}. If it is set to a positive integer,
#'then the data will be classified into that number of clusters.
#'\code{kmeans} uses the \code{nstart} argument in order to select how
#'many random sets should be run during its execution. If
#'the function does not converge, increasing \code{nstart} can
#'improve the result. Please consult the \code{\link[stats]{kmeans}} help page for more information.
#'
#'When \code{forced.factors} is set to \code{NULL}, the function will select the factors identified
#'by \code{step2factors} in order to cluster the trajectories. When the parameter is set to a vector,
#'it must contain at least one measure name such as: "m1", "m2", "m3", \dots ,"m23" and "m24". The function will then
#'cluster the trajectories using the stated measures. These measures are generated by \code{step1measures}. They range from "m1" to "m24". All of these measures are found in the \code{trajMeasures} object.
#'
#'When the plot function is run without changing the default values, only a \code{traj} object
#'is required. The function will generate a multiplot of all
#'the clusters. In each plot, 10 randomly selected
#'trajectories will be traced. The same number of trajectories for each cluster
#'will be plotted. If the function is rerun, the plots will
#'not look the same because the trajectories are randomly sampled.
#'Setting a seed (for example with \code{set.seed}) is required in order to replicate a plot.
#'
#'If \code{color.vect} is \code{NULL}, the function will randomly assign
#'a color to each trajectory. The same colors will be used
#'for all the trajectories in each plot. If specific colors
#'are chosen, there must be as many colors in the vector as
#'there are trajectories to be plotted or an error will
#'be thrown.
#'
#'If \code{clust.num} is set to an integer, the cluster associated
#'with that integer will be plotted. Only that one will be
#'displayed among the available clusters.
#'
#'The print function displays the number of observations used in the computation of \code{traj},
#'the number of clusters as well as the number of observations in each one and
#'the measures set as factors. These factors are used to cluster the data. 
#'The number of decimal places defaults to 2; it can be changed in the arguments 
#'of \code{\link[traj]{step3clusters}}.
#'
#'The summary function displays the number of observations analysed as well as the total number of 
#'clusters into which the data was classified.
#'It also prints the eigenvalues used to determine the number of 
#'factors to be selected in \code{\link[traj]{step2factors}},
#'along with summary statistics of each of the factors by cluster.
#'The number of decimal places defaults to 2; it can be changed in the parameters 
#'of \code{\link[traj]{step3clusters}}.
#'
#' 
#'@author Marie-Pierre Sylvestre, Dan Vatnik
#'
#'marie-pierre.sylvestre@umontreal.ca
#'
#' @examples
#' \dontrun{
#'# Setup data 
#'data = example.data$data
#'
#'# Run step1measures, step2factors and step3clusters
#'s1 = step1measures(data, ID=TRUE)
#'s2 = step2factors(s1)
#'s3 = step3clusters(s2)
#'
#'# Print and plot 'traj' object
#'s3
#'plot(s3)
#'
#'# Run step3clusters with predetermined number of clusters
#'s3.4clusters = step3clusters(s2, nclusters=4)
#'
#'# Display 'traj' object s3.4clusters
#'summary(s3.4clusters)
#'plot(s3.4clusters)
#'
#'# Show the cluster assignment of the first 10 trajectories
#'s3$clusters[1:10,]
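#'
#'# Hedged sketch: cluster on user-chosen measures instead of the factors
#'# selected by step2factors. "m5" and "m12" are arbitrary picks used only
#'# for illustration (an assumption), not a recommendation.
#'s3.forced = step3clusters(s2, nclusters = 3, forced.factors = c("m5", "m12"))
#'s3.forced
#'
#'# Set a seed before plotting so that the randomly sampled trajectories,
#'# and therefore the plot, can be reproduced. The seed value is arbitrary.
#'set.seed(1337)
#'plot(s3)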
#'
#' }
#'
#'
#' @rdname step3clusters
#' 
#' @seealso 
#' \code{\link[NbClust]{NbClust}}
#'\code{\link[stats]{kmeans}}
#'\code{\link[traj]{step1measures}}
#'\code{\link[traj]{step2factors}}
#'\code{\link[graphics]{plot}}
#'
#'
#'
#' @export 
step3clusters <- function(trajFactors, nclusters = NULL, nstart = 50, criteria = "ccc", forced.factors = NULL)
  {

  if (is.null(forced.factors)) {
    # drop = FALSE keeps a data.frame even when only one factor was selected
    data = data.frame(ID = trajFactors$factors[,1], apply(trajFactors$factors[, -1, drop = FALSE], 2, scale))
  }
  else {
    # drop = FALSE keeps a data.frame even when a single measure is forced
    data = data.frame(trajFactors$measurments$ID, apply(trajFactors$measurments[, forced.factors, drop = FALSE], 2, scale))
    colnames(data)[1] = "output"
  }

    # Error checking
    # if(class(data) != "data.frame")
    if(!is.data.frame(data))
      stop("data must be a data.frame")

    # Test for NULL first so the comparison is never evaluated on NULL
    if(!is.null(nclusters) && nclusters > nrow(data))
      stop("Requesting more clusters in 'nclusters' than available rows in data.")

    #Sizing data
    dim.of.data = dim(data)
    sample.size = dim.of.data[1]


    # Deal with IDs
    IDvector = data[,1]
    data = data[, -1, drop = FALSE]

    max.num.obs  = dim(data)[2]

    cluster.est = NULL

    # Calculate the number of clusters to use
    if(is.null(nclusters)){
      cluster.est = NbClust(data, method = "kmeans", index = criteria)

      all.criteria = as.data.frame(cluster.est$All.index)
      all.criteria.x = as.integer(rownames(all.criteria))

      par(mfrow = c(1,2))

      plot(all.criteria.x, as.matrix(all.criteria),
           main = paste(criteria, " criteria " , "versus Clusters"),
           xlab = "Clusters",
           ylab = "Criteria")

      num.clust = cluster.est$Best.nc[1]
      nclusters = num.clust
      wss <- (nrow(data)-1)*sum(apply(data,2,var))

      for (i in 2:15) wss[i] <- sum(kmeans(data,
                                           centers=i)$withinss)
      plot(1:15, wss, type="b", xlab="Number of Clusters",
           ylab="Within groups sum of squares", main = "Scree Plot for Number of Clusters")

    }
    else
    {
      if(nclusters < 1 ) stop("'nclusters' must be larger than 0")
      else  num.clust = round(nclusters)
    }

    # use k-means to split the data into the designated number of clusters
    cluster.data = kmeans(data, centers = num.clust, nstart=nstart)

    # Bind the cluster position to ID vector
    output = cbind.data.frame(ID = IDvector, cluster = cluster.data$cluster)
    #output = as.data.frame(output)
    #names(output) = c("ID", "cluster")

    table.output = rbind(table(output$cluster) , table(output$cluster) / sum(table(output$cluster)) * 100)
    rownames(table.output) = c("(#)", "(%)")

    data = cbind(IDvector, data)
    names(data)[1] = "output"

    # Create object "traj" to export
    structure(list(clusters = output, clust.distr = table(output$cluster),
                   clust.estim = as.data.frame(cluster.est$All.index),
                   factors  = data,
                   e.values = trajFactors$e.values, princ.fact = trajFactors$princ.fact,
                   measurments = trajFactors$measurments, data = trajFactors$data,
                   time = trajFactors$time), class='traj')

}
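
# ---------------------------------------------------------------------------
# Illustrative sketch only; not part of the traj API.
# The helper below mirrors, in isolation, the core of step3clusters() once
# the number of clusters is known: standardize each column, run kmeans()
# with several random starts, and tabulate cluster sizes as counts and
# percentages. The name `sketch_cluster_table` and its arguments are
# hypothetical (assumptions made for illustration); only base R and stats
# functions are used.
sketch_cluster_table <- function(x, k, nstart = 50) {
  # Standardize every column to mean 0 and unit standard deviation
  z <- scale(as.matrix(x))

  # Try `nstart` random initializations; kmeans() keeps the best solution
  fit <- stats::kmeans(z, centers = k, nstart = nstart)

  # Cluster sizes as counts and as percentages of all observations
  counts <- table(fit$cluster)
  rbind("(#)" = counts, "(%)" = counts / sum(counts) * 100)
}

# Example usage on a built-in data set. It is wrapped in if (FALSE) so that
# sourcing this file has no side effects; run the body by hand to try it.
if (FALSE) {
  set.seed(1)
  sketch_cluster_table(iris[, 1:4], k = 3)
}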