R/tclust.R

Defines functions tclustfsda

Documented in tclustfsda

######
##  VT::12.12.2018
##
##
##  roxygen2::roxygenise("C:/users/valen/onedrive/myrepo/R/fsdaR", load_code=roxygen2:::load_installed)
##
#'  Computes trimmed clustering with scatter restrictions
#'
#' @description Partitions the points in the n-by-v data matrix
#'  \code{Y} into \code{k} clusters. This partition minimizes the trimmed sum,
#'  over all clusters, of the within-cluster sums of point-to-cluster-centroid
#'  distances. Rows of Y correspond to points, columns correspond to variables.
#'  Returns in the output object of class \code{\link{tclustfsda.object}} an n-by-1 vector
#'  \code{idx} containing the cluster indices of each point. By default,
#'  \code{tclustfsda()} uses (squared), possibly constrained, Mahalanobis distances.
#'
#' @details This iterative algorithm initializes \code{k} clusters randomly and performs
#'  concentration steps in order to improve the current cluster assignment. The number of
#'  maximum concentration steps to be performed is given by input parameter \code{refsteps}.
#'  For approximately obtaining the global optimum, the system is initialized \code{nsamp}
#'  times and concentration steps are performed until convergence or \code{refsteps} is
#'  reached. When processing more complex data sets higher values of \code{nsamp} and
#'  \code{refsteps} have to be specified (obviously implying extra computation time).
#'  However, if more then 10 per cent of the iterations do not converge, a warning message
#'  is issued, indicating that \code{nsamp} has to be increased.

#' @param x An n x p data matrix (n observations and p variables).
#'  Rows of x represent observations, and columns represent variables.
#'
#'  Missing values (NA's) and infinite values (Inf's) are allowed,
#'  since observations (rows) with missing or infinite values will
#'  automatically be excluded from the computations.
#'
#' @param k Number of groups.
#'
#' @param alpha A scalar between 0 and 0.5 or an integer specifying the number of
#'  observations which have to be trimmed. If \code{alpha=0}, \code{tclust} reduces to
#'  traditional model based or mixture clustering (mclust): see for example the
#'  Matlab function \code{gmdistribution}.
#'
#'  More in detail, if \code{0 < alpha < 1} clustering is based on \code{h = floor(n * (1-alpha))}
#'  observations, else if alpha is an integer greater than 1 clustering is based on \code{h = n - floor(alpha)}.
#'  If \code{monitoring=TRUE}, \code{alpha} is a vector which specifies the values of
#'  trimming levels which have to be considered - contains decresing elements which
#'  lie in the interval 0 and 0.5.
#'  For example if \code{alpha=c(0.1, 0.05, 0)}, \code{tclust()} considers these 3 values of trimming level.
#'  The default for alpha is vector \code{alpha=c(0.1, 0.05, 0)}. The sequence is
#'  forced to be monotonically decreasing.

#' @param restrfactor Positive scalar which constrains the allowed differences among group scatters.
#'  Larger values imply larger differences of group scatters. On the other hand
#'  a value of 1 specifies the strongest restriction forcing all
#'  eigenvalues/determinants to be equal and so the method looks
#'  for similarly scattered (respectively spherical) clusters.
#'  The default is to apply \code{restrfactor} to eigenvalues. In order to
#'  apply \code{restrfactor} to determinants it is necessary to use the
#'  optional input argument \code{restrtype}.
#'
#' @param monitoring If \code{monitoring=TRUE} TCLUST is performed for a series of values
#'  of the trimming factor \code{alpha} given \code{k} (number of groups) and given
#'  \code{c} (restriction factor).
#
#' @param plot If \code{plot=FALSE} (default) or \code{plot=0}  no plot is produced.
#'  If \code{plot=TRUE} and \code{monitoring=FALSE} a plot with the classification
#'  is shown (using the spmplot function). The plot can be:
#'  \itemize{
#'  \item for \code{p = 1}, a histogram of the univariate data,
#'  \item for \code{p = 2}, a bivariate scatterplot,
#'  \item for \code{p > 2}, a scatterplot matrix generated by the MATLAB function \code{spmplot()}.
#'  }
#'  When \code{p >= 2} the following additional features are offered
#'  (for \code{p = 1} the behaviour is forced to be as for \code{plots=TRUE}):
#'
#'  \itemize{
#'  \item \code{plot = 'contourf'} adds in the background of the bivariate scatterplots a
#'    filled contour plot. The colormap of the filled contour is based on grey levels as default.
#'    This argument may also be inserted in a field named 'type' of a list. In the latter case
#'    it is possible to specify the additional field 'cmap', which changes the default colors
#'    of the color map used. The field 'cmap' may be a three-column matrix of values in the
#'    range [0,1] where each row is an RGB triplet that defines one color. Check the colormap
#'    function for additional informations.
#'  \item \code{plot = 'contour'} adds in the background of the bivariate scatterplots a contour plot.
#'      The colormap of the contour is based on grey levels as default. This argument may also be
#'      inserted in a field named \code{type} of a list. In the latter case it is possible to specify
#'      the additional field \code{cmap}, which changes the default colors of the color map used. The field
#'      \code{cmap} may be a three-column matrix of values in the range [0,1] where each row is an RGB
#'      triplet that defines one color. Check the \code{colormap()} (MATLAB) function for additional information.
#'  \item \code{plot = 'ellipse'} superimposes confidence ellipses to each group in the bivariate
#'      scatterplots. The size of the ellipse is \code{qchisq(0.95, 2)}, i.e. the confidence level used
#'      by default is 95 percent. This argument may also be inserted in a field named \code{type} of
#'      a list. In the latter case it is possible to specify the additional field \code{conflev},
#'      which specifies the confidence level to use and it is a value between 0 and 1.
#'  \item \code{plot = 'boxplotb'} superimposes on the bivariate scatterplots the bivariate boxplots
#'      for each group, using the boxplotb function. This argument may also be inserted in a field
#'      named \code{type} of a list.
#'  }
#'
#'  The parameter \code{plot} can be also a list and in this case its elements are:
#'  \itemize{
#'  \item \code{type} - specifies the type of plot as when plot option is a character.
#'      Therefore, plots$type can be one of 'contourf', 'contour', 'ellipse' or 'boxplotb'.
#'  \item \code{cmap} - used to set a colormap for the plot type (MATLAB style). For example, plot$cmap = 'autumn'.
#'      See the MATLAB help of colormap for a list of colormap possiblilites.
#'  \item \code{conflev} - this is the confidence level for the confidence ellipses.
#'      It must me a scalar between 0 and 1.
#'  }
#'
#' If \code{plot=TRUE} and \code{monitoring=TRUE} two plots are shown.
#'  The first plot (\emph{monitor plot}) shows three panels with the monitoring between two
#'  consecutive values of alpha: (i) the change in classification using ARI index (top panel),
#'  (ii) the change in centroids using squared euclidean distances (central panel) and
#'  (iii) the change in covariance matrices using squared euclidean distance (bottom panel).
#'
#'  The second plot (\emph{gscatter plot}) shows a series of subplots which monitor the classification
#'  for each value of \code{alpha}. In order to make sure that consistent labels are used for the
#'  groups, between two consecutive values of \code{alpha}, we assign label \code{r} to a group if
#'  this group shows the smallest distance with group \code{r} for the previous value of \code{alpha}.
#'  The type of plot which is used to monitor the stability of the classification depends on the
#'  data dimensionality \code{p}.
#'  \itemize{
#'  \item for \code{p = 1}, a histogram of the univariate data (the MATLAB function \code{histFS()} is called),
#'  \item for \code{p = 2}, a bivariate scatterplot (the MATLAB function \code{gscatter()} is called),
#'  \item for \code{p > 2}, a scatterplot of the first two principal components (function \code{gscatter()}
#'      is called and we show on the axes titles the percentage of variance explained by the first
#'      two principal components).
#'  }
#'
#'  Also in the case of \code{monitoring=TRUE} the parameter \code{plot} can be a list and its elements are:
#'  \itemize{
#'  \item name:  character vector which enables to specify which plot to display.
#'      \code{name = "gscatter"} produces a figure with a series of subplots which show the
#'      classification for each value of \code{alpha}. \code{name = "monitor"} shows a figure
#'      with three panels which monitor between two consecutive values of alpha the change in
#'      classification using ARI index (top panel), the change in centroids using squared
#'      euclidean distances (central panel), the change in covariance matrices using squared
#'      euclidean distance (bottom panel). If this field is not specified, by default
#'      \code{name=c("gscatter", "monitor")} and both figures will be shown.
#'  \item alphasel: a numeric vector which specifies for which values of alpha it
#'      is possible to see the classification. For example if \code{alphasel = c(0.05, 0.02)},
#'      the classification will be shown just for \code{alpha=0.05} and \code{alpha=0.02}.
#'      If this field is not specified \code{alphasel=alpha} and therefore the classification
#'      is shown for each value of alpha.
#'  }
#'
#' @param nsamp If a scalar, it contains the number of subsamples which will be extracted.
#'  If \code{nsamp = 0} all subsets will be extracted. Remark - if the number of all possible
#'  subset is greater than 300 the default is to extract all subsets, otherwise just 300.
#'  If \code{nsamp} is a matrix it contains in the rows the indexes of the subsets which
#'  have to be extracted. \code{nsamp} in this case can be conveniently generated by
#'  function \code{subsets()}. \code{nsamp} can have \code{k} columns or \code{k * (p + 1)}
#'  columns. If \code{nsamp} has \code{k} columns the \code{k} initial centroids each
#'  iteration i are given by \code{X[nsamp[i,] ,]} and the covariance matrices are equal
#'  to the identity.
#'
#'  If \code{nsamp} has \code{k * (p + 1)} columns, the initial centroids and covariance
#'  matrices in iteration \code{i} are computed as follows:
#'  \itemize{
#'  \item X1 <- X[nsamp[i ,] ,]
#'  \item mean(X1[1:p + 1, ]) contains the initial centroid for group 1
#'  \item cov(X1[1:p + 1, ]) contains the initial cov matrix for group 1
#'  \item mean(X1[(p + 2):(2*p + 2), ]) contains the initial centroid for group 2
#'  \item cov(X1[(p + 2):(2*p + 2), ]) contains the initial cov matrix for group 2
#'  \item ...
#'  \item mean(X1[(k-1)*p+1):(k*(p+1), ]) contains the initial centroids for group k
#'  \item cov(X1[(k-1)*p+1):(k*(p+1), ]) contains the initial cov matrix for group k.
#'  }
#'
#'  REMARK: If \code{nsamp} is not a scalar, the option \code{startv1} given below is ignored.
#'  More precisely, if \code{nsamp} has \code{k} columns \code{startv1 = 0} else if
#'  \code{nsamp} has \code{k*(p+1)} columns option \code{startv1=1}.
#'
#' @param refsteps Number of refining iterations in each subsample. Default is \code{refsteps=15}.
#'  \code{refsteps = 0} means "raw-subsampling" without iterations.
#'
#' @param reftol Tolerance of the refining steps. The default value is 1e-14
#'
#' @param equalweights A logical specifying wheather cluster weights in the concentration
#'  and assignment steps shall be considered. If \code{equalweights=TRUE} we are (ideally)
#'  assuming equally sized groups, else if \code{equalweights = false} (default) we allow for
#'  different group weights. Please, check in the given references which functions
#' are maximized in both cases.
#'
#' @param mixt Specifies whether mixture modelling or crisp assignment approach to model
#'  based clustering must be used. In the case of mixture modelling parameter mixt also
#'  controls which is the criterion to find the untrimmed units in each step of the maximization.
#'  If \code{mixt >=1} mixture modelling is assumed else crisp assignment.
#'  The default value is \code{mixt=0}, i.e. crisp assignment. Please see
#'  for details the provided references.
#'  The parameter \code{mixt} also controls the criterion to select the units to trim.
#'  If \code{mixt = 2} the \code{h} units are those which give the largest contribution
#'  to the likelihood, else if \code{mixt=1} the criterion to select the \code{h} units is
#'  exactly the same as the one which is used in crisp assignment.
#'
#' @param msg  Controls whether to display or not messages on the screen. If \code{msg==TRUE}
#'  messages are displayed on the screen. If \code{msg=2}, detailed messages are displayed,
#'  for example the information at iteration level.
#'
#' @param nocheck Check input arguments. If \code{nocheck=TRUE} no check is performed
#'  on matrix \code{X}. The default \code{nocheck=FALSE}.
#'
#' @param startv1 How to initialize centroids and covariance matrices. Scalar.
#'  If \code{startv1=1} then initial centroids and covariance matrices are based
#'  on \code{(p+1)} observations randomly chosen, else each centroid is initialized
#'  taking a random row of input data matrix and covariance matrices are initialized
#'  with identity matrices. The default value is\code{startv1=1}.
#'
#'  Remark 1: in order to start with a routine which is in the required parameter space,
#'  eigenvalue restrictions are immediately applied.
#'
#' Remark 2 - option \code{startv1} is used just if \code{nsamp} is a scalar
#'  (see for more details the help associated with \code{nsamp}).
#'
#'
#' @param RandNumbForNini pre-extracted random numbers to initialize proportions.
#'  Matrix of size k-by-nrow(nsamp) containing the random numbers which
#'  are used to initialize the proportions of the groups. This option is effective just if
#'  \code{nsamp} is a matrix which contains pre-extracted subsamples. The purpose of this
#'  option is to enable to user to replicate the results in case routine \code{tclustreg*()}
#'  is called using a parfor instruction (as it happens for example in routine IC, where
#'  \code{tclustreg()} is called through a parfor for different values of the restriction factor).
#'  The default is that \code{RandNumbForNini} is empty - then uniform random numbers are used.
#'
#' @param restrtype Type of restriction to be applied on the cluster scatter matrices.
#'  Valid values are \code{'eigen'} (default), or \code{'deter'}.
#'  \code{"eigen"} implies restriction on the eigenvalues while \code{"deter"}
#'  implies restriction on the determinants.
#'
#' @param UnitsSameGroup List of the units which must (whenever possible) have
#'  a particular label. For example \code{UnitsSameGroup=c(20, 26)}, means that
#'  group which contains unit 20 is always labelled with number 1. Similarly,
#'  the group which contains unit 26 is always labelled with number 2, (unless
#'  it is found that unit 26 already belongs to group 1).
#'  In general, group which contains unit \code{UnitsSameGroup(r)} where \code{r=2, ...length(kk)-1}
#'  is labelled with number \code{r} (unless it is found that unit \code{UnitsSameGroup(r)}
#'  has already been assigned to groups \code{1, 2, ..., r-1}.
#'
#' @param numpool The number of parallel sessions to open. If numpool is not defined,
#'  then it is set equal to the number of physical cores in the computer.
#'
#' @param cleanpool Logical, indicating if the open pool must be closed or not.
#'  It is useful to leave it open if there are subsequent parallel sessions to execute,
#'  so that to save the time required to open a new pool.
#'
#' @param trace Whether to print intermediate results. Default is \code{trace=FALSE}.
#'
#' @param ... potential further arguments passed to lower level functions.
#'
#' @return  Depending on the input parameter \code{monitoring}, one of
#'    the following objects will be returned:
#'
#'    \enumerate{
#'    \item \code{\link{tclustfsda.object}}
#'    \item \code{\link{tclusteda.object}}
#'    }
#' @references
#'      Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008).
#'      A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics,
#'      Vol. 36, 1324-1345. \doi{10.1214/07-AOS515}.
#'
#' @examples
#'  \dontrun{
#'
#'  data(hbk, package="robustbase")
#'  (out <- tclustfsda(hbk[, 1:3], k=2))
#'  class(out)
#'  summary(out)
#'
#'  ##  TCLUST of Gayser data with three groups (k=3), 10%% trimming (alpha=0.1)
#'  ##      and restriction factor (c=10000)
#'  data(geyser2)
#'  (out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000))
#'
#'  ## Use the plot options to produce more complex plots ----------
#'
#'  ##  Plot with all default options
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=TRUE)
#'
#'  ##  Default confidence ellipses.
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="ellipse")
#'
#'  ##  Confidence ellipses specified by the user: confidence ellipses set to 0.5
#'  plots <- list(type="ellipse", conflev=0.5)
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)
#'
#'  ##  Confidence ellipses set to 0.9
#'  plots <- list(type="ellipse", conflev=0.9)
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)
#'
#'  ##  Contour plots
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="contour")
#'
#'  ##  Filled contour plots with additional options: contourf plot with a named colormap.
#'  ##  Here we define four MATLAB-like colormaps, but the user can define anything else,
#'  ##  presented by a matrix with three columns which are the RGB triplets.
#'
#'  summer <- as.matrix(data.frame(x1=seq(from=0, to=1, length=65),
#'                                 x2=seq(from=0.5, to=1, length=65),
#'                                 x3=rep(0.4, 65)))
#'  spring <- as.matrix(data.frame(x1=rep(1, 65),
#'                                 x2=seq(from=0, to=1, length=65),
#'                                 x3=seq(from=1, to=0, length=65)))
#'  winter <- as.matrix(data.frame(x1=rep(0, 65),
#'                                 x2=seq(from=0, to=1, length=65),
#'                                 x3=seq(from=1, to=0, length=65)))
#'  autumn <- as.matrix(data.frame(x1=rep(1, 65),
#'                                 x2=seq(from=0, to=1, length=65),
#'                                 x3=rep(0, 65)))
#'
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
#'        plot=list(type="contourf", cmap=autumn))
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
#'        plot=list(type="contourf", cmap=winter))
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
#'        plot=list(type="contourf", cmap=spring))
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
#'        plot=list(type="contourf", cmap=summer))
#'
#'
#'  ##  We compare the output using three different values of restriction factor
#'  ##      nsamp is the number of subsamples which will be extracted
#'  data(geyser2)
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, nsamp=500, plot="ellipse")
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10, nsamp=500, refsteps=10, plot="ellipse")
#'  out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=1, nsamp=500, refsteps=10, plot="ellipse")
#'
#'  ##  TCLUST applied to M5 data: A bivariate data set obtained from three normal
#'  ##  bivariate distributions with different scales and proportions 1:2:2. One of the
#'  ##  components is very overlapped with another one. A 10 per cent background noise is
#'  ##  added uniformly distributed in a rectangle containing the three normal components
#'  ##  and not very overlapped with the three mixture components. A precise description
#'  ##  of the M5 data set can be found in Garcia-Escudero et al. (2008).
#'  ##
#'
#'  data(M5data)
#'  plot(M5data[, 1:2])
#'
#'  ##  Scatter plot matrix
#'  library(rrcov)
#'  plot(CovClassic(M5data[,1:2]), which="pairs")
#'
#'  out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=1000, nsamp=100, plot=TRUE)
#'  out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=10, nsamp=100, plot=TRUE)
#'  out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1, nsamp=1000,
#'          plot=TRUE, equalweights=TRUE)
#'  out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1000, nsamp=100, plot=TRUE)
#'
#'  ##  TCLUST with simulated data: 5 groups and 5 variables
#'  ##
#'  n1 <- 100
#'  n2 <- 80
#'  n3 <- 50
#'  n4 <- 80
#'  n5 <- 70
#'  p <- 5
#'  Y1 <- matrix(rnorm(n1 * p) + 5, ncol=p)
#'  Y2 <- matrix(rnorm(n2 * p) + 3, ncol=p)
#'  Y3 <- matrix(rnorm(n3 * p) - 2, ncol=p)
#'  Y4 <- matrix(rnorm(n4 * p) + 2, ncol=p)
#'  Y5 <- matrix(rnorm(n5 * p), ncol=p)
#'
#'  group <- c(rep(1, n1), rep(2, n2), rep(3, n3), rep(4, n4), rep(5, n5))
#'  Y <- Y1
#'  Y <- rbind(Y, Y2)
#'  Y <- rbind(Y, Y3)
#'  Y <- rbind(Y, Y4)
#'  Y <- rbind(Y, Y5)
#'  dim(Y)
#'  table(group)
#'  out <- tclustfsda(Y, k=5, alpha=0.05, restrfactor=1.3, refsteps=20, plot=TRUE)
#'
#'  ##  Automatic choice of best number of groups for Geyser data ------------------------
#'  ##
#'  data(geyser2)
#'  maxk <- 6
#'  CLACLA <- matrix(0, nrow=maxk, ncol=2)
#'  CLACLA[,1] <- 1:maxk
#'  MIXCLA <- MIXMIX <- CLACLA
#'
#'  for(j in 1:maxk) {
#'      out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5)
#'      CLACLA[j, 2] <- out$CLACLA
#'  }
#'
#'  for(j in 1:maxk) {
#'      out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5, mixt=2)
#'      MIXMIX[j ,2] <- out$MIXMIX
#'      MIXCLA[j, 2] <- out$MIXCLA
#'  }
#'
#'  oldpar <- par(mfrow=c(1,3))
#'  plot(CLACLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="CLACLA")
#'  plot(MIXMIX[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXMIX")
#'  plot(MIXCLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXCLA")
#'  par(oldpar)
#'
#'
#'  ##  Monitoring examples ------------------------------------------
#'
#'  ##  Monitoring using Geyser data
#'
#'  ##  Monitoring using Geyser data (all default options)
#'  ##  alpha and restriction factor are not specified therefore alpha=c(0.10, 0.05, 0)
#'  ##  is used while the restriction factor is set to c=12
#'  out <- tclustfsda(geyser2, k=3, monitoring=TRUE)
#'
#'  ##  Monitoring using Geyser data with alpha and c specified.
#'  out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.10, 0, by=-0.01), monitoring=TRUE)
#'
#'  ##  Monitoring using Geyser data with plot argument specified as a list.
#'  ##      The trimming levels to consider in this case are 31 values of alpha
#'  ##
#'  out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,
#'          plot=list(alphasel=c(0.2, 0.10, 0.05, 0.01)), trace=TRUE)
#'
#'  ##  Monitoring using Geyser data with argument UnitsSameGroup
#'  ##
#'  ##      Make sure that group containing unit 10 is in a group which is labelled "group 1"
#'  ##      and group containing unit 12 is in group which is labelled "group 2"
#'  ##
#'  ##      Mixture model is used (mixt=2)
#'  ##
#'  out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,
#'          mixt=2, UnitsSameGroup=c(10, 12), trace=TRUE)
#'
#'  ##  Monitoring using M5 data
#'  data(M5data)
#'
#'  ##  alphavec=vector which contains the trimming levels to consider
#'  ##  in this case 31 values of alpha are considered
#'  alphavec <- seq(0.10, 0, by=-0.02)
#'  out <- tclustfsda(M5data[, 1:2], 3, alpha=alphavec, restrfac=1000, monitoring=TRUE,
#'      nsamp=1000, plots=TRUE)
#'  }
#' @export
#' @author FSDA team, \email{valentin.todorov@@chello.at}

tclustfsda <- function(x, k, alpha, restrfactor=12, monitoring=FALSE, plot=FALSE,
        nsamp, refsteps=15, reftol=10e-14, equalweights=FALSE, mixt=0, msg=FALSE, nocheck=FALSE,
        startv1=1, RandNumbForNini, restrtype= c("eigen", "deter"),
        UnitsSameGroup, numpool, cleanpool,
        trace=FALSE, ...)
{
	if(is.data.frame(x))
	    x <- data.matrix(x)
	else if(!is.matrix(x))
	    x <- matrix(x, length(x), 1,
			dimnames = list(names(x), deparse(substitute(x))))
    if(!is.numeric(x)) stop("x is not a numeric")
    storage.mode(x) <- "double"

    dx <- dim(x)
    xn <- (dnx <- dimnames(x))[[2]]
    xn <- if (!is.null(xn))
        xn
    else if (dx[2] > 1)
        paste("X", 1:dx[2], sep = "")
    else if(dx[2])
        "X"
    dimnames(x) <- list(dnx[[1]], xn)

    n <- nrow(x)
    p <- ncol(x)


    ## We do not need a 'control' argument for now
    ##  if(missing(control))
    ##      control <- list(...)
    control <- list(...)

    xplots <- 0
    if(is.logical(plot))
        xplots <- ifelse(plot, 1, 0)
    else  if(is.numeric(plot) && plot >= 0 && plot <= 2)
        xplots <- plot
    else if(is.character(plot))
    {
        cx <- c("contourf", "contour", "ellipse", "boxplotb")
        if(plot %in% cx)
            xplots <- plot
        else
            stop("Invalid parameter 'plot'. If it is a character, it should contain one or more of the following: ",  paste0(cx, collapse=","))
    }else  if(is.list(plot))
    {
        if(!monitoring)
        {
            cx <- c("type", "cmap", "conflev")
            if(all(names(plot) %in% cx))
                xplots <- plot
            else
                stop("Invalid parameter 'plot'. If it is a list, it should contain one or more of the following: ",  paste0(cx, collapse=","))
        } else {
            cx <- c("name", "alphasel")
            if(all(names(plot) %in% cx))
                xplots <- plot
            else
                stop("Invalid parameter 'plot'. If it is a list, it should contain one or more of the following: ",  paste0(cx, collapse=","))
        }

        ##  VT::30.07.2021 - fixed an issue with colormaps - check that the supploed argument plots$cmap is a matrix with three columns
        if("cmap" %in% names(xplots))
        {
            xplots$cmap <- if(is.data.frame(xplots$cmap)) data.matrix(xplots$cmap) else xplots$cmap
            if(!is.numeric(xplots$cmap) || !is.matrix(xplots$cmap) || ncol(xplots$cmap) != 3)
                stop("'plot$cmap' must be a three-column matrix of values in the range [0,1] where each row is an RGB triplet that defines one color!")
        }
    }else
        stop("Invalid parameter 'plot'. Should be TRUE/FALSE or 0, 1 or character name of a plot or a list with graphical parameters")

    control$plots <- xplots
    if(!missing(nsamp)){
        control$nsamp <- nsamp
        if(is.matrix(nsamp) && !missing(RandNumbForNini))
            control$RandNumbForNini <- RandNumbForNini
    }
    control$refsteps <- refsteps
    control$reftol <- reftol
    control$equalweights <- equalweights
    control$mixt <- mixt
    if(!monitoring)
        control$Ysave <- 1

    if(monitoring)
    {
        if(!missing(UnitsSameGroup))
            control$UnitsSameGroup <- UnitsSameGroup
        if(!missing(numpool))
            control$numpool <- numpool
        if(!missing(cleanpool))
            control$cleanpool <- ifelse(cleanpool, 1, 0)
    }

    xmsg <- 0
    if(is.logical(msg))
        xmsg <- ifelse(msg, 1, 0)
    else  if(is.numeric(msg) && msg >= 0 && msg <= 2)
        xmsg <- msg
    else
        stop("Invalid parameter 'msg'. Should be TRUE/FALSE or 0, 1, 2.")
    control$msg <- xmsg
    xtemp <- 0
    if(is.logical(nocheck))
        xtemp <- ifelse(nocheck, 1, 0)
    else  if(is.numeric(nocheck) && nocheck >= 0 && nocheck <= 1)
        xtemp <- nocheck
    else
        stop("Invalid parameter 'nocheck'. Should be TRUE/FALSE or 0, 1.")
    control$nocheck <- xtemp
    control$startv1 <- startv1

    control$restrtype <- match.arg(restrtype)

    outclass <- if(monitoring) "tclusteda" else "tclustfsda"

    if(missing(alpha))
        alpha <- if(!monitoring) 0 else c(0.1, 0.05, 0)

    ## ES 27.06.2018: parameters that are mandatory to the MATLAB function
    ## cannot be put into the MATLAB function because they have to be supplied
    ## to the function individually and not in (name, value) pairs

    parlist = c(.jarray(x, dispatch=TRUE))
    parlist = c(parlist, .jnew("java/lang/Double", as.double(k)))           # k
    if(length(alpha) == 1)
        parlist = c(parlist, .jnew("java/lang/Double", as.double(alpha)))   # alpha
    else
        parlist = c(parlist, .jarray(alpha, dispatch=TRUE))

    parlist = c(parlist, .jnew("java/lang/Double", as.double(restrfactor))) # restrfactor

    paramNames = names(control)
    if(trace)
        print(control)

    if(length(paramNames) > 0)
    {
        for (i in 1:length(paramNames)) {
            paramName = paramNames[i]
            paramValue = control[[i]]

            matlabValue = rType2MatlabType(paramName, paramValue)
            parlist = c(parlist, .jnew("java/lang/String", paramName), matlabValue)
        }
    }

    if(!monitoring)
    {
        out <- callFsdaFunction("tclust", "[Ljava/lang/Object;", 1, parlist)
        if(is.null(out))
            return(NULL)

        arr1 = .jcast(out[[1]], "com/mathworks/toolbox/javabuilder/MWStructArray")
        arr = .jnew("org/jrc/ipsc/globesec/sitaf/fsda/FsdaMWStructArray", arr1)

        if(trace) {
            cat("\nReturning from MATLAB tclust().  Fields returned by MATLAB: \n")
            print(arr$fieldNames())
        }

        ## we will reuse them for the elements of emp, if it is structure
        ##  (in case of non-convergence)
        xcolnames <- if(is.null(dimnames(x)[[2]])) paste0("X", 1:ncol(x)) else dimnames(x)[[2]]
        xrownames <- paste0("C", 1:k)

        muopt = .jevalArray(arr$get("muopt", as.integer(1)), "[[D", simplify = TRUE)
        dimnames(muopt) <- list(xrownames, xcolnames)
        muopt <- t(muopt)

        sigmaopt = .jevalArray(arr$get("sigmaopt", as.integer(1)), "[[D", simplify = TRUE)
        if(k == 1)
            dimnames(sigmaopt) <- list(xcolnames, xcolnames)
        else
            dimnames(sigmaopt) <- list(xcolnames, xcolnames, xrownames)

        idx = as.vector(as.matrix(.jevalArray(arr$get("idx", as.integer(1)), "[[D", simplify = TRUE)))

        size = as.matrix(.jevalArray(arr$get("siz", as.integer(1)), "[[D", simplify = TRUE))

        if(notconverged <- (nrow(size) < k + ifelse(alpha > 0, 1, 0))) {
            cat("\nNumber of requested clusters =", k,
                "\nNumber of estimated clusters =", nrow(size) - ifelse(alpha > 0, 1, 0), "\n")
            warning("The total number of estimated clusters is smaller than the number supplied")
        }

        size_rows <- paste0("C", size[,1])
        size_cols <- c("Cluster", "Size", "Percent")
        dimnames(size) <- list(size_rows, size_cols)

        postprob <- as.matrix(.jevalArray(arr$get("postprob", as.integer(1)), "[[D", simplify = TRUE))
        rows <- if(is.null(dimnames(x)[[1]])) 1:nrow(x) else dimnames(x)[[1]]
        cols <- paste0("C", 1:k)
        dimnames(postprob) <- list(rows, cols)

        MIXMIX <- if(as.integer(arr$hasField("MIXMIX", as.integer(1))) != 1) NULL
                     else as.vector(as.matrix(.jevalArray(arr$get("MIXMIX", as.integer(1)), "[[D", simplify = TRUE)))
        MIXCLA <- if(as.integer(arr$hasField("MIXCLA", as.integer(1))) != 1) NULL
                     else as.vector(as.matrix(.jevalArray(arr$get("MIXCLA", as.integer(1)), "[[D", simplify = TRUE)))
        CLACLA <- if(as.integer(arr$hasField("CLACLA", as.integer(1))) != 1) NULL
                     else as.vector(as.matrix(.jevalArray(arr$get("CLACLA", as.integer(1)), "[[D", simplify = TRUE)))

        emp <- .jevalArray(arr$get("emp", as.integer(1)), "[[D", simplify = TRUE)

        ## VT::18.08.2022
        ##  When the number of clusters estimated is less than the number
        ##  of clusters requested, emp is not a double (emp=0) but is a structure,
        ##  containing the empirical counterparts of
        ##  idx, muopt, sigmaopt and size
        if(typeof(emp) == "double")
            emp <- as.vector(as.matrix(emp))
        else {
            emp <- unwrapComplexNumericCellArray(emp)

            ## a list with 4 elements:
            ##  1. idxemp (n-by-1 vector)
            ##  2. muemp (k-by-p matrix)
            ##  3. sigmaemp (p-by-p-by-k array)
            ##  4. sizemp (Matrix of size (k+1)-by-3)

            idxemp <- as.vector(emp[[1]])
            muemp <- matrix(emp[[2]], ncol=p)
            dimnames(muemp) <- list(xrownames, xcolnames)
            muemp <- t(muemp)

            sigmaemp <- array(emp[[3]], dim=c(p, p, k))
            if(k == 1)
                dimnames(sigmaemp) <- list(xcolnames, xcolnames)
            else
                dimnames(sigmaemp) <- list(xcolnames, xcolnames, xrownames)

            sizeemp <- matrix(emp[[4]], ncol=3)
            size_rows <- paste0("C", 0:k)
            size_cols <- c("Cluster", "Size", "Percent")
            dimnames(sizeemp) <- list(size_rows, size_cols)

            emp <- list(idxemp=idxemp, muemp=muemp, sigmaemp=sigmaemp, sizeemp=sizeemp)
        }

        notconver <- as.vector(as.matrix(.jevalArray(arr$get("notconver", as.integer(1)), "[[D", simplify = TRUE)))
        bs <- as.vector(as.matrix(.jevalArray(arr$get("bs", as.integer(1)), "[[D", simplify = TRUE)))
        obj <- as.vector(as.matrix(.jevalArray(arr$get("obj", as.integer(1)), "[[D", simplify = TRUE)))
        equalweights <- as.vector(as.matrix(.jevalArray(arr$get("equalweights", as.integer(1)), "[[D", simplify = TRUE)))

        h <- as.vector(as.matrix(.jevalArray(arr$get("h", as.integer(1)), "[[D", simplify = TRUE)))
        fullsol <- as.vector(as.matrix(.jevalArray(arr$get("fullsol", as.integer(1)), "[[D", simplify = TRUE)))
        Y <- if(as.integer(arr$hasField("Y", as.integer(1))) != 1) NULL
                    else as.matrix(.jevalArray(arr$get("Y", as.integer(1)), "[[D", simplify = TRUE))

        ans <- list(call=match.call(), alpha=alpha, k=k, muopt=muopt, sigmaopt=sigmaopt, idx=idx, size=size, postprob=postprob, emp=emp,
                MIXMIX=MIXMIX, MIXCLA=MIXCLA, CLACLA=CLACLA,
                notconver=notconver, bs=bs, obj=obj, equalweights=equalweights, h=h, fullsol=fullsol, X=Y)

        ans$MIXMIX <- MIXMIX
        ans$MIXCLA <- MIXCLA
        ans$CLACLA <- CLACLA

    } else
    {
        out <- callFsdaFunction("tclusteda", "[Ljava/lang/Object;", 1, parlist)
        if(is.null(out))
            return(NULL)

        arr1 = .jcast(out[[1]], "com/mathworks/toolbox/javabuilder/MWStructArray")
        arr = .jnew("org/jrc/ipsc/globesec/sitaf/fsda/FsdaMWStructArray", arr1)

        if(trace) {
            cat("\nReturning from MATLAB tclusteda().  Fields returned by MATLAB: \n")
            print(arr$fieldNames())
        }

        cols <- if(is.null(dimnames(x)[[2]])) paste0("X", 1:ncol(x)) else dimnames(x)[[2]]
        rows <- if(is.null(dimnames(x)[[1]])) 1:nrow(x) else dimnames(x)[[1]]
        alphas <- paste0("alpha_", alpha)

        MU = .jevalArray(arr$get("MU", as.integer(1)), "[[D", simplify = TRUE)

        ## VT::27.07.2021 - if alpha is a single number, the array MU will be two-dimensional,
        ##  not three-dimensional as expected by the following setting of dimnames. Change it
        ##  to trhee-dimensional array, with the third dimension having just one element.
        if(length(alpha) == 1)
             MU <- array(MU, c(dim(MU), 1))

        dimnames(MU) <- list(paste0("C", 1:k), cols, alphas)

        ## Convert a cell array with the length of alpha to a list. Each cell contains
        ##  a 3D-array with dimension p x p x k
        SIGMA = cell2list(.jevalArray(arr$get("SIGMA", as.integer(1)), "[[D", simplify = TRUE))
        names(SIGMA) <- alphas
        for(i in 1:length(SIGMA))
            dimnames(SIGMA[[i]]) <- list(cols, cols, paste0("C", 1:k))

        IDX = as.matrix(.jevalArray(arr$get("IDX", as.integer(1)), "[[D", simplify = TRUE))
        dimnames(IDX) <- list(rows, paste0("alpha_", alpha))

        ## VT::27.07.2021 - if alpha is a single number, the matrix Amon contains 0 rows (i.e. length(alpha)-1).
        ##  Actually the returned matrix is also with 0 columns. Therefore do not set the dimnames.
        Amon = as.matrix(.jevalArray(arr$get("Amon", as.integer(1)), "[[D", simplify = TRUE))
        if(nrow(Amon)[1] > 0)
            dimnames(Amon) <- list(1:nrow(Amon), c("alpha", "ARI", "dist_centers", "dist_covs"))

        Y <- if(as.integer(arr$hasField("Y", as.integer(1))) != 1) NULL
                    else as.matrix(.jevalArray(arr$get("Y", as.integer(1)), "[[D", simplify = TRUE))

        ans = list(call=match.call(), alpha=alpha, k=k, restrfactor=restrfactor, MU=MU, SIGMA=SIGMA, IDX=IDX, Amon=Amon, X=Y)
    }
    freeMatlabResources(out)

    class(ans) <- outclass
    return (ans)
}

Try the fsdaR package in your browser

Any scripts or data that you put into this service are public.

fsdaR documentation built on March 31, 2023, 8:18 p.m.