#' @title Cluster Trajectories According to the Subset of Measures Selected Previously
#' @description Classify trajectories based on the factors identified in step2factors.
#'
#'
#' @param trajFactors Object generated by \code{\link{step2factors}}. Contains
#'data factors, eigenvalues, principal factors as well as
#'the original data.
#'@param nstart Integer number designating the number of
#'seedings that \code{\link[stats]{kmeans}} should do in order to cluster the
#'trajectories. Defaults to 50.
#' @param nclusters Integer number indicating the number
#'of clusters to use in order to classify the trajectories. If \code{NULL},
#'the function selects the number of clusters using the automated criterion specified by \code{criteria}. Defaults
#'to \code{NULL}.
#' @param criteria String indicating the criteria to
#'select the number of clusters. Defaults to
#'\code{ccc} (Cubic clustering criterion).
#' @param forced.factors (Optional) Vector containing the names of the measures calculated in
#'\code{\link{step1measures}} to force as factors for the clustering. This vector will override the factors selected by \code{\link{step2factors}}. Available options: "m1", "m2", "m3", \dots ,"m23" and "m24". Defaults to \code{NULL}. See details.
#'
#'
#' @return The function returns a \code{traj} object that contains objects carried through steps 1 and 2 which includes the original data, measures and factors.
#'
#' Furthermore, it includes a data.frame containing the ID corresponding to each trajectory, and the cluster number in which the trajectory was classified. This is stored in the \code{clusters} field of the \code{traj} object. It also contains the cluster distribution of the observations.
#'
#' Methods to plot the output of \code{step3clusters} include:\cr
#'\item{plot}{ plots a 10 person sample from every cluster}
#'\item{\link{plotMedTraj}}{plots the median trajectory of the clusters}
#'\item{\link{plotMeanTraj}}{plots the mean trajectory of the clusters}
#'\item{\link{plotBoxplotTraj}}{produces a boxplot of the trajectories of every cluster}
#'
#' @details
#'
#' If \code{nclusters} is set to \code{NULL}, the function will use the
#'\code{\link[NbClust]{NbClust}} function to select the
#'optimal number of clusters. The \code{NbClust} function
#'uses \code{kmeans} as the cluster analysis method. The measures are standardized within
#'the \link{step3clusters} function prior to clustering. The criteria
#'to be computed can be chosen by the \code{criteria} argument.
#'The list of available methods and criteria can be found
#'in the \code{NbClust} help page. Criteria compatible with \code{step3clusters} are:
#' "ch", "kl", "ccc", "hartigan", "scott", "trcovw", "tracew" and "friedman". It is important
#'to note that some of these criteria will not always yield the same number of clusters when
#'run multiple times. Increasing \code{nstart} will generally stabilize the results.
#'
#'The function then uses \code{\link[stats]{kmeans}} in order to cluster the trajectories
#'in the required number of clusters. If \code{nclusters} is
#'set to \code{NULL}, then the number of clusters is computed by
#'\code{NbClust}; if it is set to a positive integer,
#'then the data will be classified into that number of clusters.
#'\code{kmeans} uses the \code{nstart} argument in order to select how
#'many random sets should be run during its execution. If
#'the function does not converge, increasing \code{nstart} can
#'improve the result. Please consult the \code{\link[stats]{kmeans}} help page for more information.
#'
#'When \code{forced.factors} is set to \code{NULL}, the function will select the factors identified
#'by \code{step2factors} in order to cluster the trajectories. When the parameter is set to a vector,
#'it must contain at least one measure name such as: "m1", "m2", "m3", \dots ,"m23" and "m24". The function will then
#'cluster the trajectories using the stated measures. These measures are generated by \code{step1measures}. They range from "m1" to "m24". All of these measures are found in the \code{trajMeasures} object.
#'
#'When the plot function is run without changing the default values, only a \code{traj} object
#'is required. The function will generate a multiplot of all
#'the clusters. In each plot, 10 randomly selected
#'trajectories will be traced. The same number of trajectories for each cluster
#'will be plotted. If the function is rerun, the plots will
#'not look the same because the trajectories are randomly sampled.
#'Seeding is required in order to replicate a plot.
#'
#'If \code{color.vect} is \code{NULL}, the function will randomly assign
#'a color to each trajectory. The same colors will be used
#'for all the trajectories in each plot. If specific colors
#'are chosen, there must be as many colors in the vector as
#'there are trajectories to be plotted or an error will
#'be thrown.
#'
#'If \code{clust.num} is set to an integer, the cluster associated
#'with that integer will be plotted. Only that one will be
#'displayed among the available clusters.
#'
#'The print function displays the number of observations used in the computation of \code{traj},
#'the number of clusters as well as the number of observations in each one and
#'the measures set as factors. These factors are used to cluster the data.
#'The number of decimal places defaults to 2; it can be changed in the arguments
#'of \code{\link[traj]{step3clusters}}.
#'
#'The summary function displays the number of observations analysed as well as the total number of
#'clusters into which the data were classified.
#'It also prints the eigenvalues used to determine the number of
#'factors to be selected in \code{\link[traj]{step2factors}},
#'followed by summary statistics of each of the factors by cluster.
#'The number of decimal places defaults to 2; it can be changed in the parameters
#'of \code{\link[traj]{step3clusters}}.
#'
#'
#'@author Marie-Pierre Sylvestre, Dan Vatnik
#'
#'marie-pierre.sylvestre@umontreal.ca
#'
#' @examples
#' \dontrun{
#'# Setup data
#'data = example.data$data
#'
#'# Run step1measures, step2factors and step3clusters
#'s1 = step1measures(data, ID=TRUE)
#'s2 = step2factors(s1)
#'s3 = step3clusters(s2)
#'
#'# Print and plot 'traj' object
#'s3
#'plot(s3)
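#'
#'# Sketch (not part of the original examples): plot() samples trajectories at
#'# random, so fixing the seed first should reproduce the same figure.
#'# The seed value 1234 is an arbitrary choice.
#'set.seed(1234)
#'plot(s3)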
#'
#'# Run step3clusters with predetermined number of clusters
#'s3.4clusters = step3clusters(s2, nclusters=4)
#'
#'# Display 'traj' object s3.4clusters
#'summary(s3.4clusters)
#'plot(s3.4clusters)
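#'
#'# Sketch (not part of the original examples): let NbClust choose the number
#'# of clusters with the Calinski-Harabasz index ("ch") instead of the default
#'# "ccc", and use more k-means seedings for the final clustering.
#'# The value nstart = 100 is an arbitrary illustration.
#'s3.ch = step3clusters(s2, criteria = "ch", nstart = 100)
#'s3.ch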
#'
#'# Show the cluster assignment of the first 10 trajectories
#'s3$clusters[1:10,]
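#'
#'# Sketch (not part of the original examples): cluster on hand-picked measures
#'# instead of the factors kept by step2factors. The three measure names are an
#'# arbitrary illustration; any of "m1" to "m24" may be used.
#'s3.forced = step3clusters(s2, nclusters = 3, forced.factors = c("m1", "m5", "m12"))
#'s3.forced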
#'
#' }
#'
#'
#' @rdname step3clusters
#'
#' @seealso
#' \code{\link[NbClust]{NbClust}}
#'\code{\link[stats]{kmeans}}
#'\code{\link[traj]{step1measures}}
#'\code{\link[traj]{step2factors}}
#'\code{\link[graphics]{plot}}
#'
#'
#'
#' @export
step3clusters <- function(trajFactors, nclusters = NULL, nstart = 50, criteria = "ccc", forced.factors = NULL)
{
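if (is.null(forced.factors)) {
# Standardize the factors selected by step2factors; drop = FALSE keeps a single factor as a column
data = data.frame(ID = trajFactors$factors[,1], apply(trajFactors$factors[, -1, drop = FALSE], 2, scale))
}
else {
# Standardize the measures forced by the user; drop = FALSE keeps a single measure as a column
data = data.frame(trajFactors$measurments$ID, apply(trajFactors$measurments[, forced.factors, drop = FALSE], 2, scale))
colnames(data)[1] = "output"
}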
# Error checking
# if(class(data) != "data.frame")
if(!is.data.frame(data)) #ICI
stop("data must be a data.frame")
# Test for NULL first so the comparison is never evaluated on a NULL 'nclusters'
if(!is.null(nclusters) && nclusters > nrow(data))
stop("Requesting more clusters in 'nclusters' than available rows in data.")
#Sizing data
dim.of.data = dim(data)
sample.size = dim.of.data[1]
# Deal with IDs
IDvector = data[,1]
data = data[, -1, drop = FALSE]
max.num.obs = dim(data)[2]
cluster.est = NULL
# Calculate the number of clusters to use
if(is.null(nclusters)){
cluster.est = NbClust(data, method = "kmeans", index = criteria)
all.criteria = as.data.frame(cluster.est$All.index)
all.criteria.x = as.integer(rownames(all.criteria))
par(mfrow = c(1,2))
plot(all.criteria.x, as.matrix(all.criteria),
main = paste(criteria, " criteria " , "versus Clusters"),
xlab = "Clusters",
ylab = "Criteria")
num.clust = cluster.est$Best.nc[1]
nclusters = num.clust
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(data,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = "Scree Plot for Number of Clusters")
}
else
{
if(nclusters < 1 ) stop("'nclusters' must be larger than 0")
else num.clust = round(nclusters)
}
# use k-means to split the data into the designated number of clusters
cluster.data = kmeans(data, centers = num.clust, nstart=nstart)
# Bind the cluster position to ID vector
output = cbind.data.frame(ID = IDvector, cluster = cluster.data$cluster)
#output = as.data.frame(output)
#names(output) = c("ID", "cluster")
table.output = rbind(table(output$cluster) , table(output$cluster) / sum(table(output$cluster)) * 100)
rownames(table.output) = c("(#)", "(%)")
data = cbind(IDvector, data)
names(data)[1] = "output"
# Create object "traj" to export
structure(list(clusters = output, clust.distr = table(output$cluster),
clust.estim = as.data.frame(cluster.est$All.index),
factors = data,
e.values = trajFactors$e.values, princ.fact = trajFactors$princ.fact,
measurments = trajFactors$measurments, data = trajFactors$data,
time = trajFactors$time), class='traj')
}