R/clustering.R
In clustord: Cluster Ordinal Data via Proportional Odds or Ordered Stereotype

Documented in clustord rerun

#' Likelihood-based clustering using Ordered Stereotype Models (OSM), Proportional
#' Odds Models (POM) or Binary Models
#'
#' Likelihood-based clustering with parameters fitted using the EM algorithm.
#' You can perform clustering on rows or columns of a data matrix, or biclustering
#' on both rows and columns simultaneously. You can include any number of
#' covariates for rows and covariates for columns.
#' Ordinal models used in the package are Ordered Stereotype Model (OSM),
#' Proportional Odds Model (POM) and a dedicated Binary Model for binary data.
#'
#' You can select your own input parameters, or starting values will be
#' generated by running kmeans or by fitting simpler models and feeding the
#' outputs into the final model as starting values.
#'
#' The starting point for clustering is a data matrix of response values that
#' are binary or categorical. You may also have a data frame of covariates that
#' are linked to the rows of the data matrix, and may also have a data frame of
#' covariates that are linked to the columns of the data matrix.
#'
#' For example, if clustering data from fishing trawls, where the rows are trawl
#' events and columns are species caught, then you could also supply a gear
#' covariate linked to the rows, representing gear used on each trawl event, and
#' could additionally supply species covariates linked to the columns,
#' representing auxiliary information about each species. There is no
#' requirement to provide any covariates, and you can provide only row
#' covariates, or only column covariates.
#'
#' Before running \code{clustord}, you need to run \code{\link{mat2df}} to
#' convert the data matrix into a long form data frame. The data frame needs to
#' have at least three columns, \code{Y} and \code{ROW} and \code{COL}. Each row
#' in the data frame corresponds to a single cell in the original data matrix;
#' the response value in that cell is given by \code{Y}, and the row and column
#' indices of that cell in the matrix are given by \code{ROW} and \code{COL}.
#'
#' \code{\link{mat2df}} also allows you to supply data frames of row or column
#' covariates which will be incorporated into \code{long.df}.
#'
#' Then, to run the \code{clustord} function, you need to enter your chosen
#' formula and model, and the number of clusters you want to fit. The formula
#' model_structure is akin to that for \code{glm}, but with a few restrictions. You
#' can include any number of covariates in the same way as for a multiple
#' regression model, though unlike for \code{glm}, you can include both row and
#' column covariates.
#'
#' Note that, unlike \code{glm}, you should not specify a \code{family}
#' argument; the \code{model} argument is used instead.
#'
#' \code{formula} \strong{argument details}
#'
#' In the following description of different models, the Binary model is used
#' for simplicity when giving the mathematical descriptions of the models, but
#' you can use any of the following models with the Ordered Stereotype or
#' Proportional Odds Models as well.
#'
#' In the \code{formula} argument, the response must be exactly \code{Y}. You
#' cannot use any functions of \code{Y} as the response, nor can you include
#' \code{Y} in any terms on the right hand side of the formula. \code{Y} is the
#' name in \code{clustord} of the response values in the original data matrix.
#'
#' The \code{formula} argument has 4 special variables: \code{ROWCLUST},
#' \code{COLCLUST}, \code{ROW} and \code{COL}. There are some restrictions on
#' how these can be used in the formula, as they are not covariates, but instead
#' act as indicators of the clustering model_structure you want to use.
#'
#' All other variables in the formula will be any covariates that you want to
#' include in the model, and these are unrestricted, and can be used in the same
#' way as in \code{glm}.
#'
#' \code{ROWCLUST} and \code{COLCLUST} are used to indicate what row clustering
#' model_structure you want, and what column clustering model_structure you want,
#' respectively. The inclusion of \code{ROWCLUST} as a single term indicates
#' that you want to include a row clustering effect in the model. In the
#' simplest row clustering model, for Binary data with \strong{row clustering}
#' effects only, the basic function call would be
#'
#' \code{clustord(Y ~ ROWCLUST, model="Binary", long.df=long.df)}
#'
#' and the model fitted would have the form:
#'
#' Logit(P(Y = 1)) = mu + rowc_coef_r
#'
#' where mu is the intercept term, and rowc_coef_r is the row cluster effect
#' that will be applied to every row from the original data matrix that is a
#' member of row cluster r. The inclusion of \code{ROWCLUST} corresponds to the
#' inclusion of rowc_coef_r.
#'
#' Note that we are not using notation involving greek letters, because (a) we
#' ran out of letters for all the different types of parameters in the model and
#' (b) with this many parameters, it would be difficult to remember which ones
#' are which.
#'
#' Similarly to row clustering, the formula \code{Y ~ COLCLUST} would perform
#' \strong{column clustering}, with model Logit(P(Y = 1)) = mu + colc_coef_c,
#' where colc_coef_c is the column cluster effect that will be applied to every
#' column from the original data matrix that is a member of column cluster c.
#'
#' Including both \code{ROWCLUST} and \code{COLCLUST} in the same formula
#' indicates that you want to perform biclustering, i.e. you want to cluster the
#' rows and the columns of the original data matrix simultaneously. If included
#' without interaction, then the terms just correspond to including rowc_coef_r
#' and colc_coef_c in the model:
#'
#' The formula
#'
#' \code{Y ~ ROWCLUST + COLCLUST}
#'
#' is the simplest possible \strong{biclustering} model,
#' Logit(P(Y = 1)) = mu + rowc_coef_r + colc_coef_c
#'
#' If you want to include interaction between the rows and columns, i.e. you
#' want to perform block biclustering where each block corresponds to a row
#' cluster r and a column cluster c, then that model has a matrix of parameters
#' indexed by r and c.
#'
#' \code{clustord(Y ~ ROWCLUST*COLCLUST, model="Binary", ...)} has the model
#' Logit(P(Y = 1)) = mu + rowc_colc_coef_rc
#'
#' This model can instead be called using the equivalent formula
#' \code{Y ~ ROWCLUST + COLCLUST + ROWCLUST:COLCLUST}.
#'
#' You can instead use the formula \code{Y ~ ROWCLUST:COLCLUST}. Mathematically,
#' this is equivalent to the previous two. In regression, the models would not
#' be equivalent but in clustering, they are equivalent, and have the same
#' number of independent parameters overall. If you include the main effects,
#' then that reduces the number of independent parameters in the interaction
#' term compared to if you just use the interaction term (see below section about
#' \code{initvect}).
#'
#' You cannot include just one of the main effects alongside the interaction
#' term, i.e. you cannot use \code{Y ~ ROWCLUST + ROWCLUST:COLCLUST} or
#' \code{Y ~ COLCLUST + ROWCLUST:COLCLUST}. This is for simplicity in the code,
#' and to avoid confusion when interpreting the results.
#'
#' However, \code{clustord} allows a lot more flexibility than this. The
#' variables \code{ROW} and \code{COL} are used to indicate that you want to
#' also include \strong{individual row or column effects}, respectively.
#'
#' For example, if you are clustering binary data that indicates the presence/
#' absence of different species (columns) at different trawl events (rows), and
#' you know that one particular species is incredibly common, then you can
#' include column effects in the model, which will allow for the possibility
#' that two columns may correspond to species with different probabilities of
#' appearing in the trawl.
#'
#' You can add individual column effects along with
#' row clustering, or you can add individual row effects along with column clustering.
#' The formula for row clustering with individual column effects (without
#' interaction) is
#'
#' \code{Y ~ ROWCLUST + COL}
#'
#' which corresponds to Binary model
#'
#' Logit(P(Y = 1)) = mu + rowc_coef_r + col_coef_j
#'
#' So if two cells from the data matrix are in the same row cluster, but in
#' different columns, they will not have the same probability of Y = 1.
#'
#' You can also add interaction between the individual row/column effects and
#' the clustering effects.
#'
#' If you still want to be able to see the row cluster and column effects
#' separately, then you use \code{Y ~ ROWCLUST*COL} or
#' \code{Y ~ ROWCLUST + COL + ROWCLUST:COL} (these are both the same), which
#' have model
#'
#' Logit(P(Y = 1)) = mu + rowc_coef_r + col_coef_j + rowc_col_coef_rj
#'
#' As before, rowc_coef_r and col_coef_j are the row cluster effects and
#' individual column effects, and rowc_col_coef_rj are the interaction terms.
#'
#' Alternatively, you can use the mathematically-equivalent formula
#'
#' \code{Y ~ ROWCLUST:COL} which has model
#'
#' Logit(P(Y = 1)) = mu + rowc_col_coef_rj
#'
#' where the row cluster effects and individual column effects are absorbed into
#' the matrix rowc_col_coef_rj. These models are the same mathematically, the
#' only differences between them are in how they are constrained (see below in
#' the section about the \code{initvect} argument) and how they should be
#' interpreted.
#'
#' Note that if you were using covariates, then it would not be equivalent to
#' leave out the main effects and just use the interaction terms, but the
#' clustering models don't work quite the same as regression models with
#' covariates.
#'
#' Equivalently, if you want to cluster the columns, you can include individual
#' row effects alongside the column clusters, i.e.
#'
#' \code{Y ~ COLCLUST + ROW} or \code{Y ~ COLCLUST + ROW + COLCLUST:ROW},
#'
#' depending on whether you want the interaction terms or not.
#'
#' You are \strong{not} able to include individual row effects with row clusters,
#' or include individual column effects with column clusters, because there is
#' not enough information in ordinal or binary data to fit these models. As a
#' consequence, you cannot include individual row or column effects if you are
#' doing biclustering, e.g.
#'
#' \code{Y ~ ROWCLUST + COLCLUST + ROW} or \code{Y ~ ROwCLUST + COLCLUST + COL}
#'
#' are not permitted.
#'
#' From version 1 of the package, you can now also include \strong{covariates}
#' alongside the clustering patterns. The basic way to do this is include them
#' as additions to the clustering model_structure. For example, including one row
#' covariate \code{xr} to a row clustering model would have the formula
#'
#' \code{Y ~ ROWCLUST + xr}
#'
#' with Binary model Logit(P(Y = 1)) = mu + rowc_coef_r + row_coef_1*xr_i
#'
#' where row_coef_1 is the coefficient of xr_i, just as in a typical regression
#' model.
#'
#' Additional row covariates can also be included, and you can include
#' interactions between them, and functions of them, as in regression models, e.g.
#'
#' \code{Y ~ ROWCLUST + xr1*log(xr2)}
#'
#' which would have the Binary model
#'
#' Logit(P(Y = 1)) = mu + rowc_coef_r + row_coef1*xr1_i + row_coef2*log(xr2_i) +
#'                          row_coef3*xr1_i*log(xr2_i)
#'
#' If instead you want to add column covariates to the model, they work in the
#' same way after they've been added to the \code{long.df} data frame using
#' \code{\link{mat2df}}, but they are indexed by j instead of i. Simplest model,
#' with one single column covariate xc, would have formula
#'
#' \code{Y ~ ROWCLUST + xc}
#'
#' with Binary model Logit(P(Y = 1)) = mu + rowc_coef_r + col_coef1*xc_j
#'
#' You can use any functions of or interactions between column covariates, just
#' as with row covariates. You can similarly add row or column covariates to
#' column clustering or biclustering models.
#'
#' You can include \strong{interactions between covariates} and \code{ROWCLUST}
#' or \code{COLCLUST} in the formula. But these are \strong{not quite} the same
#' as interactions between covariates. The formula
#'
#' \code{Y ~ ROWCLUST*xr}
#'
#' where \code{xr} is some row covariate, corresponds to the Binary model
#'
#' Logit(P(Y = 1)) = mu + rowc_coef_r + cov_coef*xr_i + rowc_row_coef_r1*xr_i
#'
#' What this means is that there is a term in the linear predictor that involves
#' the row covariate xr (which has the index i because it is a row covariate),
#' and each cluster (indexed by r) has a different coefficient for
#' that covariate (as distinct from the non-interaction covariate models above,
#' which have the same coefficients for the covariates regardless of which
#' cluster the row is in).
#'
#' This is different from interaction terms involving only covariates, where two
#' or more covariates appear multiplied together in the model and then have a
#' shared coefficient term attached to them. In a clustering/covariate
#' interaction, the row or column clustering pattern controls the coefficients
#' rather than adding a different type of covariate.
#'
#' Note that the pure cluster effect rowc_coef_r is also included in the model
#' automatically, in the same way that a regression formula \code{Y ~ x1*x2}
#' would include the individual x1 and x2 effects as well as the interaction
#' between x1 and x2.
#'
#' The coefficients for row clusters interacting with row coefficients are named
#' \code{row_cluster_row_coef} in the output of \code{clustord} because you
#' can also have coefficients for interactions between row clustering and column
#' covariates, or column clustering and row covariates, or column clustering and
#' column covariates. Row clustering interacting with column covariates would
#' look something like
#'
#' \code{Y ~ ROWCLUST*xc}
#'
#' with Binary model Logit(P(Y = 1)) = mu + rowc_coef_r + rowc_col_coef_r1*xc_j
#'
#' The other combinations of clustering and covariates work similarly.
#' \code{rowc_col_coef_rl} and the other similar coefficients have two indices.
#' Their first index is the index of the cluster, and their second index is the
#' index of the covariate among the list of covariates interacting with that
#' direction of clustering. So if there are two row covariates \code{xr1} and
#' \code{xr2} interacting with three row clusters, that gives you 6
#' coefficients:
#'
#' \code{rowc_col_coef_11, rowc_col_coef_12,
#' rowc_col_coef_21, rowc_col_coef_22,
#' rowc_col_coef_31, rowc_col_coef_32}.
#'
#' and you can also have a three-way interaction between row cluster and those
#' two covariates, which would add the coefficients \code{rowc_col_coef_r3}
#' for the \code{xr1:xr2} term.
#'
#' You can instead add covariates that interact with column clusters, which will
#' have parameters \code{colc_row_coef_cm}, where \code{m} here indexes just the
#' covariates interacting with column cluster.
#'
#' If you have covariates interacting with row clusters and other covariates
#' interacting with column clusters, then you will have parameters
#' \code{rowc_cov_coef_rl} \strong{and} \code{colc_cov_coef_cm}.
#'
#' An example of this is the model
#'
#' \code{Y ~ ROWCLUST + xr1 + ROWCLUST:xr1 + xc1 + COLCLUST + COLCLUST:log(xc1)}
#'
#' This has main effects for row clusters and column clusters, i.e.
#' \code{ROWCLUST} and \code{COLCLUST}. It also has two covariate terms not
#' interacting with clusters, \code{xr1} and \code{xc1}. It also has 1 covariate
#' term interacting with row clusters, \code{xr1}, with coefficients
#' \code{rowc_cov_coef_r1}, and 1 covariate term interacting with column
#' clusters, \code{log(xc1)}, with coefficients \code{colc_cov_coef_c1}.
#'
#' \strong{Restrictions on \code{formula}}
#'
#' The primary restriction on the \code{formula} argument is that that you
#' \strong{cannot} use functions of \code{ROW}, \code{COL}, \code{ROWCLUST} or
#' \code{COLCLUST}, such as \code{log(ROW)} or I(COLCLUST^2). That is because
#' they are not covariates, and cannot be manipulated like that; instead, they
#' are indicators for particular elements of the clustering model_structure.
#'
#' If performing biclustering, i.e. if \code{ROWCLUST} and \code{COLCLUST} are
#' both in the model, and you want to include the interaction between them, then
#' you can use the interaction between them on its own, or you can include both
#' main effects, but you are not allowed to use just one main effect alongside
#' the interaction. That is, you can use
#'
#' \code{Y ~ ROWCLUST + COLCLUST + ROWCLUST:COLCLUST} or \code{Y ~ ROWCLUST*COLCLUST},
#'
#' or you can use \code{Y ~ ROWCLUST:COLCLUST}, and these two types of
#' biclustering model will have different parameter constraints (see below under
#' \code{initvect} details), but you \strong{cannot} use
#'
#' \code{Y ~ ROWCLUST + ROWCLUST:COLCLUST} or \code{Y ~ COLCLUST + ROWCLUST:COLCLUST}
#'
#' As stated above, you also cannot include individual row effects alongside
#' row clustering, and you cannot use individual column effects alongside
#' column clustering, i.e. if \code{ROWCLUST} is in the formula, then \code{ROW}
#' \strong{cannnot} be in the formula, and if \code{COLCLUST} is in the formula
#' then \code{COL} \strong{cannot} be in the formula.
#'
#' If you are including \code{COL} with \code{ROWCLUST}, then you can include
#' the interaction between them but that is the \strong{only} permitted interaction
#' term that involves \code{COL}, and similarly the interaction between
#' \code{ROW} and \code{COLCLUST} is the \strong{only} permitted interaction
#' term that involves \code{ROW}. But you can include those interactions in the
#' form
#'
#' \code{Y ~ ROWCLUST + COL + ROWCLUST:COL} or as \code{Y ~ ROWCLUST*COL}, or as
#' \code{Y ~ ROWCLUST:COL}.
#'
#' These are the only permitted uses of the \code{COL} term, and there are
#' equivalent constraints on the inclusion of \code{ROW}.
#'
#' As stated above, you can include interactions between \code{ROWCLUST} or
#' \code{COLCLUST} and covariates, but you \strong{cannot} include three-way
#' interactions between \code{ROWCLUST}, \code{COLCLUST} and one or more
#' covariates are \strong{not permitted} in \code{clustord}, mostly because
#' of the prohibitive number of parameter values that would need to be fitted,
#' and the difficulty of interpreting such a model. That is, you cannot use
#' formulae such as \code{Y ~ ROWCLUST*COLCLUST*xr}, which would have Binary
#' model Logit(P(Y = 1)) = mu + bi_cluster_row_coef_rc1*xr_i.
#'
#' \code{model} \strong{argument details}
#'
#' The three models available in \code{clustord} are the Binary model, which
#' is a Bernoulli model equivalent to the binary model in the package
#' \code{clustglm}, the Proportional Odds Model (POM) and the Ordered Stereotype
#' Model (OSM).
#'
#' Many Binary model examples have been given above, which have the general
#' form
#'
#' logit(P(Y = 1)) = mu + <<linear terms>>
#'
#' where the linear terms can include row or column clustering effects,
#' individual row or column effects, and row or column covariates, with or
#' without interactions with row or column clustering.
#'
#' The Proportional Odds Model and the Ordered Stereotype Model have the same
#' model_structure for the linear terms, but the overall model equation is different.
#'
#' The Proportional Odds Model (\code{model = "POM"}) has the form
#'
#' logit(P(Y <= k)) = log(P(Y <= k)/P(Y > k)) = mu_k - <<linear terms>>
#'
#' So the simplest POM for row clustering would be
#'
#' logit(P(Y <= k)) = mu_k - rowc_coef_r
#'
#' and the model including individual column effects and no interactions would be
#'
#' logit(P(Y <= k)) = mu_k - rowc_coef_r - col_coef_j
#'
#' Note that the linear-term coefficients have negative signs for the
#' Proportional Odds Models. This is so that as the row cluster index increases,
#' or as the column index increases, Y is more likely to fall at higher values
#' (see Ch4 of Agresti, 2010).
#'
#' The Ordered Stereotype model (\code{model = "OSM"}) has the form
#'
#' log(P(Y = k)/P(Y = 1)) = mu_k + phi_k(<<linear terms>>)
#'
#' So the simplest OSM for row clustering would be
#'
#' log(P(Y = k)/P(Y = 1)) = mu_k + phi_k*rowc_coef_r
#'
#' and the model including individual column effects and no interactions would be
#'
#' log(P(Y = k)/P(Y = 1)) = mu_k + phi_k(rowc_coef_r + col_coef_j)
#'
#' Note that the OSM is \strong{not} a cumulative logit model, unlike the POM.
#' The model describes the log of the kth level relative to the first level,
#' which is the baseline category, but the patterns for k = 2 may be different
#' than the patterns for k = 3. They are linked, because the linear terms will
#' be the same, but they may not have the same shape. In this sense, the OSM is
#' more flexible/less restrictive than the POM.
#'
#' See Anderson (1984) for the original definition of the ordered stereotype
#' model, and see Fernández et al. (2016) for the application to clustering.
#'
#' The phi_k parameters may be treated as "score" parameters. After fitting the
#' OSM, the fitted phi_k values can give some indication of what the true
#' separation is between the categories. Even if the default labelling of the
#' categories is from 1 to n, that doesn't mean that the categories are actually
#' equally spaced in reality. But the fitted phi_k values from the OSM can be
#' treated as data-derived numerical labels for the categories. Moreover, if two
#' categories have very similar fitted phi_k values, e.g. if phi_2 = 0.11 and
#' phi_3 = 0.13, that suggests that there is not enough information in the data
#' to distinguish between categories 2 and 3, and so you might as well merge
#' them into a single category to simplify the model-fitting process and the
#' interpretation of the results.
#'
#' \code{initvect} \strong{argument details}
#'
#' Initvect is the vector of starting values for the parameters, made up of
#' sections for each different type of parameter in the model. Note that the
#' length of each section of initvect is the number of \strong{independent}
#' parameter values, not the overall number of parameter values of that type.
#'
#' If you want to supply a vector of starting values for the EM algorithm, you
#' need to be careful how many values you supply, and the order in which you
#' include them in \code{initvect}, and you should \strong{CHECK} the output
#' list of parameters (which is the full set of parameter values, including
#' dependent ones, broken up into each type of parameter) to check that your
#' initvect model_structure is correct for the formula you have specified.
#'
#' For example, the number of \code{mu} values will always be 1 fewer than the
#' number of categories in the data, and the remaining value of mu is dependent
#' on those q-1 values. In the OSM for data with 3 categories, the first value
#' of mu for category 1 will be 0, and then the other 2 values of mu for
#' categories 2 and 3 will be the independent values of mu. For the POM for data
#' with 5 categories, the first 4 values of mu will be the independent values
#' and then the last value of mu is infinity, because the probability of Y being
#' in category 5 is defined as 1 minus the sum of the probabilities for the
#' other 4 levels.
#'
#' \code{q} is the number of levels in the values of y, \code{n} is the number
#' of rows in the original data matrix, and \code{p} is the number of columns in
#' the original data matrix.
#'
#' For Binary,
#' There is one independent value for \code{mu}, i.e. q = 2.
#'
#' Ignore \code{phi}, which is not used in the Binary model.
#'
#' For OSM,
#' The starting values for \code{mu_k} are length \code{q-1}, and the model
#' has \code{mu_1} = 0 always, so the initvect values for \code{mu} will become
#' \code{mu_2}, \code{mu_3}, etc. up to \code{mu_q}.
#'
#' The starting values for \code{phi_k} are length \code{q-2}.
#'
#' Note that the starting values for \code{phi} do not correspond directly to
#' \code{phi}, because \code{phi} is restricted to being increasing and between
#' 0 and 1, so instead the starting values are treated as elements
#' \code{u[2:q-1]} of a vector \code{u} which can be between \code{-Inf} and
#' \code{+Inf}, and then
#'
#'     \code{phi[2] <- expit(u[2])} and
#'
#'     \code{phi[k] <- expit(u[2] + sum(exp(u[3:k])))} for k between 3 and q-1
#'
#'     \code{(phi[1] = 0 and phi[q] = 1)}.
#'
#' For POM,
#' The starting values for \code{mu_k} are length \code{q-1}, but the starting
#' values do not correspond directly to \code{mu_k}, because \code{mu_k} is
#' restricted to being increasing, i.e. the model has to have
#'     \code{mu_1} <= \code{mu_2} <= ... \code{mu_q} = \code{+Inf}
#'
#' So instead of using the initvect values directly for \code{mu_k}, the 2nd to
#' (q-1)th elements of initvect are used to construct \code{mu_k} as follows:
#'
#'     \code{mu_1 <- initvect[1]}
#'
#'     \code{mu_2 <- initvect[1] + exp(initvect[2])}
#'
#'     \code{mu_3 <- initvect[1] + exp(initvect[2]) + exp(initvect[3])}
#'
#'     ... and so on up to \code{mu_{k-1}}, and \code{mu_k} is infinity, because
#'     it is not used directly to construct the probability of Y = q.
#'
#' Thus the values that are used to construct \code{mu_k} can be unconstrained,
#' which makes it easier to specify initvect and easier to optimize the parameter
#' values.
#'
#' Ignore \code{phi}, which is not used in POM.
#'
#' For \strong{all three models},
#'
#' The starting values for \code{rowc_coef_r} are length \code{nclus.row-1},
#' where \code{nclus.row} is the number of row clusters. The final row cluster
#' parameter is dependent on the others (see the input parameter info for
#' constraint_sum_zero), whereas if it were independent it would be colinear
#' with the \code{mu_k} parameters and thus not identifiable.
#'
#' Similarly the starting values for \code{colc_coef_c} are length
#' \code{nclus.column-1}, where \code{nclus.column} is the number of column
#' clusters, to avoid problems of colinearity and nonidentifiability.
#'
#' If you have biclustering with an interaction term between row clusters and
#' column clusters, then the number of independent values in the matrix of
#' interaction terms depends on whether you include the main effects of row and
#' column clusters separately. That is, if you use the biclustering model
#'
#' \code{Y ~ ROWCLUST + COLCLUST + ROWCLUST:COLCLUST}, or equivalently
#'
#' \code{Y ~ ROWCLUST*COLCLUST},
#'
#' then the main effect term \code{ROWCLUST} has \code{nclus.row-1} independent
#' parameters in \code{initvect}, and \code{COLCLUST} has \code{nclus.column-1}
#' independent parameters in \code{initvect}, and \code{ROWCLUST:COLCLUST} will
#' have \code{(nclus.row - 1)*(nclus.column - 1)} independent parameter values.
#' The final matrix of interaction terms will be constrained to have its last
#' row equal to the negative sum of the other rows, and the last column equal to
#' the negative sum of the other columns.
#'
#' On the other hand, if you want to use only the interaction term and not the
#' main effects (which for the clustering model is mathematically equivalent),
#' i.e.
#'
#' \code{Y ~ ROWCLUST:COLCLUST},
#'
#' then that matrix of interaction terms will have \code{nclus.row*nclus.column - 1}
#' independent parameters, i.e. more independent parameters than if you included
#' the main effects.
#'
#' If you have column effects alongside row clusters (they are not permitted
#' alongside column clusters), without interactions, i.e. the formula
#' \code{Y ~ ROWCLUST + COL} with Binary model Logit(P(Y = 1)) = mu +
#' rowc_coef_r + col_coef_j
#' then the row cluster coefficients have \code{nclus.row - 1} independent
#' parameters, and the column effect coefficients have \code{p - 1} independent
#' parameters, where p is the number of columns in the original data matrix,
#' i.e. the maximum value of \code{long.df$COL}.
#'
#' If you include the interaction term, then the number of independent parameters
#' again depends on whether you just use the interaction term, or include the
#' main effects.
#'
#' In the formula \code{Y ~ ROWCLUST + COL + ROWCLUST:COL} or its equivalent with
#' "*", the interaction term will have \code{(nclus.row - 1)*(p-1)} independent
#' parameters.
#'
#' If you instead use the formula \code{Y ~ ROWCLUST:COL}, then the interaction
#' term will have \code{nclus.row*p - 1} independent parameters. Either way, the
#' total number of independent parameters in the model will be \code{nclus.row*p}.
#'
#' Similarly, if you have row effects alongside column clusters, without
#' interactions, i.e. the formula
#'
#' \code{Y ~ COLCLUST + ROW},
#'
#' with Binary model Logit(P(Y = 1)) = mu + colc_coef_c + row_coef_i
#'
#' then the column cluster coefficients will have \code{nclus.column - 1}
#' independent parameters, and the row coefficients will have \code{n-1}
#' independent parameters, where n is the number of rows in the original data
#' matrix, i.e. the maximum value of \code{long.df$ROW}.
#'
#' If you include the interaction term alongside the main effects, i.e.
#'
#' \code{Y ~ COLCLUST + ROW + COLCLUST:ROW}, or its equivalent with "*", the
#' interaction term will have \code{(nclus.column - 1)*(n-1)} independent
#' parameters.
#'
#' If you instead use the formula \code{Y ~ COLCLUST:ROW}, that interaction
#' coefficient matrix will have \code{nclus.column*n - 1} independent parameters.
#'
#' Any covariate terms included in the formula will be split up by
#' \code{clustord} into the covariates that interact with row clusters, the
#' covariates that interact with column clusters, and the covariates that do not
#' interact with row or column clusters.
#'
#' The number of independent parameters for row-cluster-interacting covariates
#' will be \code{nclus.row*L}, where \code{L} is the number of terms involving
#' row clusters and covariates after any "*" terms have been expanded.
#'
#' So in this formula, for example,
#'
#' \code{Y ~ ROWCLUST*xr1 + xr2 + ROWCLUST:log(xc1)}
#'
#' where xr1 and xr2 are row covariates, and xc1 is a column covariate, the fully
#' expanded formula would be
#'
#' \code{Y ~ ROWCLUST + xr1 + xr2 + ROWCLUST:xr1 + ROWCLUST:log(xc1)}
#'
#' and the terms interacting with \code{ROWCLUST} would be \code{ROWCLUST:xr1}
#' and \code{ROWCLUST:log(xc1)}, so there would be \code{nclus.row*2}
#' independent coefficients for those covariates.
#'
#' The number of independent parameters for column-cluster-interacting
#' covariates will be \code{nclus.column*M}, where \code{M} is the number of
#' terms involving column clusters and covariates after any "*" terms have been
#' expanded.
#'
#' So this formula, for example,
#'
#' \code{Y ~ I(xr1^2) + COLCLUST*xc1 + COLCLUST:xc2:xc3 + COLCLUST*xr1}
#'
#' would be expanded as
#'
#' \code{Y ~ COLCLUST + xr1 + I(xr1^2) + xc1 + COLCLUST:xc1 + COLCLUST:xc2:xc3 + COLCLUST:xr1}
#'
#' and the terms interacting with \code{COLCLUST} would be \code{COLCLUST:xc1},
#' \code{COLCLUST:xc2:xc3} and \code{COLCLUST:xr1}, so there would be
#' \code{nclus.column*3} independent coefficients for those covariates.
#'
#' The number of independent parameters for covariates that do not interact with
#' row or column clusters will be the same as the number of those covariate terms,
#' after any "*" terms have been expanded.
#'
#' So this formula, for example,
#'
#' \code{Y ~ ROWCLUST*xr1 + xr2 + ROWCLUST:log(xc1) + COLCLUST*xc1}
#'
#' would be expanded as
#'
#' \code{Y ~ ROWCLUST +COLCLUST + xr1 + xr2 + xc1 + ROWCLUST:xr1 + ROWCLUST:log(xc1) + COLCLUST:xc1},
#'
#' so there would be 3 independent coefficients for the terms \code{xr1, xr2, xc1}.
#'
#' Note that there are \strong{no intercept} terms for the coefficients,
#' because those are incorporated into the parameters \code{mu_k}.
#'
#' The \strong{order of the} \code{initvect} \strong{entries} is as follows, and
#' any entries that are not included in the formula will be ignored and not
#' included in \code{initvect}. That is, you should NOT provide values in
#' \code{initvect} for components that are not included in the formula.
#'
#' 1) mu (or values used to construct mu, POM only)
#' 2) values used to construct phi (OSM only)
#' 3) row cluster coefficients
#' 4) column cluster coefficients
#' 5) [matrix] bicluster coefficients (i.e. interaction between row and column clusters)
#' 6) individual row coefficients
#' 7) individual column coefficients
#' 8) [matrix] interactions between row clusters and individual column coefficients
#' 9) [matrix] interactions between column clusters and individual row coefficients
#' 10) [matrix] row-cluster-specific coefficients for covariates interacting with row clusters
#' 11) [matrix] column-cluster-specific coefficients for covariates interacting with column clusters
#' 12) coefficients for covariates that do not interact with row or column clusters
#'
#' Any entries marked as [matrix] will be constructed into matrices by filling
#' those matrices row-wise, e.g. if you want starting values 1:6 for a matrix of
#' 2 row clusters and 3 covariates interacting with those row clusters, the
#' matrix of coefficients will become
#'
#' \code{1 2 3
#' 4 5 6}
#'
#' For the formula \code{Y ~ ROWCLUST*COLCLUST}, where the matrix of interactions
#' between row and column clusters has \code{(nclus.row - 1)*(nclus.column - 1)}
#' independent parameters, the last row and column of the matrix will be the
#' negative sums of the rest, so e.g. if you have 2 row clusters and 3 column
#' clusters, there will only be 2 independent values, so if you provide the
#' starting values -0.5 and 1.2, the final matrix of parameters will be:
#'
#' \code{                column cluster 1   column cluster 2   column cluster 3
#' row cluster 1   -0.5               1.2                -0.7
#' row cluster 2   0.5                -1.2               0.7}
#'
#' If the matrix is a matrix relating to row clusters, then the row clusters are
#' in the rows, and if it's a matrix relating to column clusters but not row
#' clusters, then the column clusters are in the rows, i.e. the matrix of
#' coefficients for column clusters interacting with individual row effects will
#' have the rows of the matrix corresponding to the clusters, i.e. the matrix
#' would be indexed colc_row_coef_ci, c being the column cluster index and i
#' being the row index.
#'
#' Similarly, if the matrix is a matrix relating to column clusters and covariates,
#' then the rows of the matrix will correspond to the column clusters, i.e. the
#' matrix would be indexed colc_cov_coef_cl, c being the column cluster index
#' and l being the covariate index.
#'
#' If using biclustering with interaction between row and column clusters, then
#' the row clusters will be the rows and the column clusters will be the columns,
#' i.e. the matrix would be indexed rowc_colc_coef_rc, r being the row cluster
#' index and c being the column cluster index.
#'
#' @param formula model formula (see 'Details').
#' @param model \code{"OSM"} for Ordered Stereotype Model or \code{"POM"} for
#'     Proportional Odds Model or \code{"Binary"} for binary data model.
#' @param nclus.row number of row clustering groups.
#' @param nclus.column number of column clustering groups.
#' @param long.df data frame with at least three columns, \code{Y} and \code{ROW}
#'     and \code{COL}. Each row in the data frame corresponds to a single cell
#'     in the original data matrix; the response value in that cell is given by
#'     \code{Y}, and the row and column indices of that cell in the matrix are
#'     given by \code{ROW} and \code{COL}. Use \code{\link{mat2df}} to create
#'     this data frame from your data matrix of responses.
#'     \code{\link{mat2df}} also allows you to supply data frames of row or
#'     column covariates which will be incorporated into \code{long.df}.
#'
#' @param initvect (default NULL) vector of starting parameter values for the model.
#'     Note: if the user enters an initial vector of parameter values, it is
#'     \strong{strongly recommend} that the user also check the values of
#'     \code{parlist.init} in the output object, to \strong{make sure that the
#'     constructed parameters are as expected}.
#'
#'     If \code{NULL}, starting parameter values will be generated automatically.
#'
#'     See 'Details' for definitions of the parameters used for different models.
#'
#' @param pi.init (default \code{NULL}) starting parameter values for the proportions
#'     of observations in the different row clusters.
#'
#'     If \code{NULL}, starting values will be generated automatically.
#'
#'     User-specified values of \code{pi.init} must be of length \code{(nclus.row-1)}
#'     because the final value will be automatically calculated so that the
#'     values of \code{pi} sum to 1.
#' @param kappa.init (default \code{NULL}) starting parameter values for the
#'     proportions of observations in the different column clusters.
#'
#'     If \code{NULL}, starting values will be generated automatically.
#'
#'     User-specified values of \code{kappa.init} must be of length
#'     \code{(nclus.column-1)} because the final value will be automatically
#'     calculated so that the values of \code{kappa} sum to 1.
#' @param EM.control (default = \code{list(EMcycles=50, EMlikelihoodtol=1e-4,
#'     EMparamtol=1e-2, paramstopping=TRUE, startEMcycles=10, keepallparams=FALSE,
#'     epsilon=1e-6)})
#'     list of parameters controlling the EM algorithm.
#'
#'     \code{EMcycles} controls how many EM iterations of the main EM algorithm
#'     are used to fit the chosen submodel.
#'
#'     \code{EMlikelihoodtol} is the tolerance for the stopping criterion for
#'     the \strong{log-likelihood} in the EM algorithm. The criterion is the
#'     absolute change in the \strong{incomplete} log-likelihood since the
#'     previous iteration, scaled by the size of the dataset \code{n*p}, where
#'     \code{n} is the number of rows in the data matrix and \code{p} is the
#'     number of columns in the data matrix. The scaling is applied because the
#'     incomplete log-likelihood is predominantly affected by the dataset size.
#'
#'     \code{EMparamtol} is the tolerance for the stopping criterion for the
#'     \strong{parameters} in the EM algorithm. This is a tolerance for the
#'     \strong{sum} of the scaled parameter changes from the last iteration,
#'     i.e. the tolerance is not for any individual parameter but for the sum of
#'     changes in all the parameters. Thus the default tolerance is 1e-2.
#'     The individual parameter criteria are the absolute differences between
#'     the exponentiated absolute parameter value at the current timestep and
#'     the exponentiated absolute parameter value at the previous timestep, as a
#'     proportion of the exponentiated absolute parameter value at the current
#'     timestep. The exponentiation is to rescale parameter values that are
#'     close to zero.
#'
#'     there are around 5 independent parameter values, then at the point of
#'     convergence using default tolerances for the log-likelihood and the
#'     parameters, each parameter will have a scaled absolute change since the
#'     previous iteration of about 1e-4; if there are 20 or 30 independent
#'     parameters, then each will have a scaled aboslute change of about 1e-6.
#'
#'     \code{paramstopping}: if \code{FALSE}, indicates that the EM algorithm
#'     should only check convergence based on the change in incomplete-data
#'     log-likelihood, relative to the current difference between the
#'     complete-data and incomplete-data log-likelihoods, i.e.
#'     \code{abs(delta_lli)/abs(llc[iter] - lli[iter])};
#'     if \code{TRUE}, indicates that as well as checking the likelihood
#'     criterion, the EM algorithm should also check whether the relative change
#'     in the exponentials of the absolute values of the current parameters is
#'     below the tolerance \code{EMstoppingpar}, to see whether the parameters
#'     and the likelihood have both converged.
#'
#'     \code{startEMcycles} controls how many EM iterations are used when
#'     fitting the simpler submodels to get starting values for fitting models
#'     with interaction.
#'
#'     \code{keepallparams}: if \code{TRUE}, keep a record of parameter values
#'     (including pi_r and kappa_c) for every EM iteration.
#'
#'     \code{rerunestepbeforelli}: if \code{TRUE}, and only when using biclustering,
#'     rerun the E-step before calculating the incomplete-data log-likelihood.
#'     The EM algorithm runs the E-step to estimate the cluster membership
#'     probabilities and then runs the M-step to estimate the parameters by
#'     maximising the complete-data log-likelihood (LLC), and then recalculates
#'     the estimated incomplete-data log-likelihood (LLI) to check for
#'     convergence. The biclustering LLI approximation uses the cluster
#'     membership probabilities so will be slightly more accurate if these
#'     cluster membership probabilities are recalculated using the very latest
#'     parameter estimates. So the \code{TRUE} setting for this control
#'     recalculates the cluster memberships before calculating the LLI approx.
#'
#'     \code{uselatestlli}: Kept for backwards compatibility with original
#'     version of clustord algorithm. Original version of the algorithm had this
#'     set to \code{FALSE}, which keeps the best LLI from any previous iteration
#'     rather than the latest LLI. The default, \code{TRUE}, instead keeps the
#'     latest LLI even if there was a better LLI in a previous iteration. This
#'     is appropriat: In row clustering the exact LLI is used and the EM
#'     algorithm creators proved the LLI should always not decrease with each
#'     iteration. In biclustering the approximation to the LLI may be very
#'     inaccurate when the algorithm is a long way from convergence, so even
#'     when the value of LLI appears to be very high in early iterations this
#'     may not be accurate, so it is better to take the latest value.
#'
#'     For \code{columnclustering}, the parameters saved from each iteration
#'     will NOT be converted to column clustering format, and will be in the row
#'     clustering format, so \code{alpha} in
#'     \code{EM.status$params.every.iteration} will correspond to beta_c and
#'     \code{pi} will correspond to kappa.
#'
#'     \code{epsilon}: default 1e-6, small value used to adjust values of pi,
#'     kappa and theta that are too close to zero so that taking logs of them
#'     does not create infinite values.
#'
#' @param optim.method (default "L-BFGS-B") method to use in optim within the M
#'     step of the EM algorithm. Must be one of 'L-BFGS-B', 'BFGS', 'CG' or
#'     'Nelder-Mead' (i.e. not the SANN method).
#' @param optim.control control list for the \code{optim} call within the M step
#'     of the EM algorithm. See the control list Details in the \code{optim}
#'     manual for more info.
#'     Please note that although \code{optim}, by default, uses \code{pgtol=0}
#'     and \code{factr=1e7} in the L-BFGS-B method, \code{clustord}, by default,
#'     alters these to \code{pgtol=1e-4} and \code{factr=1e11}, but you can use
#'     this \code{optim.control} argument in \code{clustord} to revert them to
#'     the defaults if you want. The reason for the change is that the chosen
#'     values in \code{clustord} reduce the tolerance on the log-likelihood
#'     function optimization in order to speed up the algorithm, and because the
#'     log-likelihood is on the scale of 1e4 for <100 rows in the data matrix
#'     and 1e6 for 5000 rows in the data matrix, tolerance at the default
#'     \code{optim} scale is not as important as the choice of model type and
#'     structure or the number of starting points. If one model is better than
#'     another, it will probably have a likelihood that is better by about the
#'     size of the data matrix, which is far larger than the tolerance in the
#'     optimization. If one starting point is better than another, it will
#'     probably have a likelihood that is better on about 1/10th or 1/100th the
#'     size of the data matrix, which is still far larger than the tolerance in
#'     the optimization.
#'     If you need accurate parameter estimates, firstly make sure to try more
#'     starting points, then perform model selection first, and then finally
#'     rerun the chosen model with finer tolerance, e.g. the \code{optim}
#'     defaults, \code{pgtol=0} and \code{factr=1e7}.
#' @param constraint_sum_zero (default \code{TRUE}) if \code{TRUE}, use constraints
#'     that cluster effects sum to zero; if \code{FALSE}, use constraints that
#'     the cluster effect for the first cluster will be 0.
#'     Both versions have the same constraints for joint row-column cluster
#'     effects: these effects are described by a matrix of parameters gamma_rc,
#'     indexed by row cluster and column cluster indices, and the constraints
#'     are that the final column of gamma_rc is equal to the negative sum of the
#'     other columns (so \code{gamma} columns sum to zero) and first row of
#'     gamma_rc is equal to the negative sum of the other rows (so \code{gamma}
#'     rows sum to zero).
#' @param start_from_simple_model (default \code{TRUE}) if \code{TRUE}, fit a
#'     simpler clustering model first and use that to provide starting values for
#'     all parameters for the model with interactions;
#'     if \code{FALSE}, use the more basic models to provide starting values only
#'     for \code{pi.init} and \code{kappa.init}.
#'     If the full model has interaction terms, then simpler models are ones
#'     without the interactions. If the model has individual row/column effects
#'     alongside the clusters, then simpler models are ones without the individual
#'     row/column effects. If the full model has covariates, then simpler models
#'     are ones without the covariates (to get starting values for the cluster
#'     parameters), and ones with the covariates but no clustering (to get
#'     starting values for the covariates).
#' @param parallel_starts (default FALSE) if TRUE, by generating multiple random
#'   starts, those random starts will be parallelised over as many cores as are
#'   available. For example, on a personal computer this will be one fewer than
#'   the number of cores in the machine, to make sure one is left for system
#'   tasks external to R.
#' @param nstarts (default 5) number of random starts to generate, if generating
#'     random starting points for the EM algorithm.
#' @param verbose (default \code{FALSE}) changes how much is reported to the
#'   console during the algorithm's progress. If \code{TRUE}, reports the
#'   incomplete-data log-likelihood at every EM algorithm iteration and the
#'   trace information from nnet::multinom() during the process of fitting
#'   initial values for the \code{mu} parameters. If \code{FALSE}, the
#'   incomplete log-likelihood is only reported every 10 iterations of the EM
#'   algorithm and the initial fitting reporting is suppressed. Regardless of
#'   the verbosity setting the algorithm reports each of the random starts and
#'   whenever it finds a better log-likelihood than all previous starts, and it
#'   also reports when it is fitting simpler models to find starting values for
#'   the parameters vs fitting the final, more complex model.
#'   If wanting the detailed output from \code{optim()}, use
#'   \code{optim.control=list(trace=X)}, where X is 1 to 6, with 6 being the
#'   highest level of verbosity for the L-BFGS-B algorithm.
#' @returns
#' A \code{clustord} object, i.e. a list with components:
#'
#'     \code{info}: Basic info n, p, q, the number of parameters, the number of
#'     row clusters and the number of column clusters, where relevant.
#'
#'     \code{model}: The model used for fitting, "OSM" for Ordered Stereotype
#'     Model, "POM" for Proportional Odds Model, or "Binary" for Binary model.
#'
#'     \code{EM.status}: a list containing the latest iteration \code{iter},
#'     latest incomplete-data and complete-data log-likelihoods \code{new.lli}
#'     and \code{new.llc}, the best incomplete-data log-likelihood \code{best.lli}
#'     and the corresponding complete-data log-likelihood, \code{llc.for.best.lli},
#'     and the parameters for the best incomplete-data log-likelihood,
#'     \code{params.for.best.lli}, indicator of whether the algorithm converged
#'     \code{converged}, and if the user chose to keep all parameter values from
#'     every iteration, also \code{params.every.iteration}.
#'
#'     Note that for \strong{biclustering}, i.e. when \code{ROWCLUST} and
#'     \code{COLCLUST} are both included in the model, the \strong{incomplete}
#'     log-likelihood is calculated using the entropy approximation, and this
#'     may be \strong{inaccurate} unless the algorithm has converged or is close
#'     to converging. So beware of using the incomplete log-likelihood and the
#'     corresponding AIC value \strong{unless the EM algorithm has converged}.
#'
#'     \code{criteria}: the calculated values of AIC, BIC,
#'     etc. from the best incomplete-data log-likelihood.
#'
#'     \code{epsilon}: the very small value (default 1e-6) used to adjust values
#'     of pi and kappa and theta that are too close to zero, so that taking logs
#'     of them does not produce infinite values. Use the EM.control argument to
#'     adjust epsilon.
#'
#'     \code{constraints_sum_zero}: the chosen value of constraints_sum_zero.
#'
#'     \code{param_lengths}: vector of total number of parameters/coefficients
#'     for each part of the model, labelled with the names of the components.
#'     The value is 0 for each component that is not included in the model, e.g.
#'     if there are no covariates interacting with row clusters then the
#'     \code{rowc_cov_coef} value will be 0. If the component is included, then
#'     the value given will include any dependent parameter/coefficient values,
#'     so if column clusters are included then the \code{colc_coef} value will
#'     be \code{nclus.column}, whereas the number of independent values will be
#'     \code{nclus.column - 1}.
#'
#'     \code{initvect}: the initial \emph{vector} of parameter values, either
#'     specified by the user or generated automatically. This vector has only
#'     the \strong{independent} values of the parameters, not the full set.
#'
#'     \code{outvect}: the final \emph{vector} of parameter values, containing
#'     only the independent parameter values from \code{parlist.out}.
#'
#'     \code{parlist.init}: the initial list of parameters, constructed from
#'     the initial parameter vector \code{initvect}. Note that if the initial
#'     vector has been incorrectly specified, the values of \code{parlist.init}
#'     may not be as expected, and they should be checked by the user.
#'
#'     \code{parlist.out}: fitted values of parameters.
#'
#'     \code{pi}, \code{kappa}: fitted values of pi and kappa, where relevant.
#'
#'     \code{ppr}, \code{ppc}: the posterior probabilities of membership of the
#'     row clusters and the column clusters, where relevant.
#'
#'     \code{rowc_mm}, \code{colc_mm}, \code{cov_mm}: the model matrices for,
#'     respectively, the covariates interacting with row clusters, the covariates
#'     interacting with column clusters, and the covariates not interacting with
#'     row or column clusters (i.e. the covariates with constant coefficients).
#'     Note that one row of each model matrix corresponds to one row of long.df.
#'
#'     \code{RowClusters}, \code{ColumnClusters}: the assigned row and column
#'     clusters, where relevant, where each row/column is assigned to a cluster
#'     based on maximum posterior probability of cluster membership (\code{ppr}
#'     and \code{ppc}).
#'
#'     \code{RowClusterMembers}, \code{ColumnClusterMembers}: vectors of
#'     assigned members of each row or column cluster, where each row/column is
#'     assigned to a cluster based on maximum posterior probability of cluster
#'     membership (\code{ppr} and \code{ppc})
#'
#' @references Fernandez, D., Arnold, R., & Pledger, S. (2016). Mixture-based clustering for the ordered stereotype model. \emph{Computational Statistics & Data Analysis}, 93, 46-75.
#'
#' @references Anderson, J. A. (1984). Regression and ordered categorical variables. \emph{Journal of the Royal Statistical Society: Series B (Methodological)}, 46(1), 1-22.
#'
#' @references Agresti, A. (2010). \emph{Analysis of ordinal categorical data} (Vol. 656). John Wiley & Sons.
#'
#' @examples
#' set.seed(1)
#' long.df <- data.frame(Y=factor(sample(1:3,5*20,replace=TRUE)),
#'                ROW=factor(rep(1:20,times=5)),COL=rep(1:5,each=20))
#'
#' # Model Log(P(Y=k)/P(Y=1))=mu_k+phi_k*rowc_coef_r with 3 row clustering groups:
#' clustord(Y~ROWCLUST,model="OSM",3,long.df=long.df,
#'              EM.control=list(EMcycles=2,startEMcycles=2), nstarts=2)
#'
#' # Model Log(P(Y=k)/P(Y=1))=mu_k+phi_k*(rowc_coef_r + col_coef_j) with 3 row clustering groups:
#' clustord(Y~ROWCLUST+COL,model="OSM",3,long.df=long.df,
#'              EM.control=list(EMcycles=2,startEMcycles=2), nstarts=2)
#'
#' # Model Logit(P(Y <= k))=mu_k-rowc_coef_r-col_coef_j-rowc_col_coef_rj with 2 row clustering groups:
#' clustord(Y~ROWCLUST*COL,model="POM",nclus.row=2,long.df=long.df,
#'              EM.control=list(EMcycles=2,startEMcycles=2), nstarts=2)
#'
#' # Model Log(P(Y=k)/P(Y=1))=mu_k+phi_k*(colc_coef_c) with 3 column clustering groups:
#' clustord(Y~COLCLUST,model="OSM",nclus.column=3,long.df=long.df,
#'              EM.control=list(EMcycles=2,startEMcycles=2), nstarts=2)
#'
#' # Model Log(P(Y=k)/P(Y=1))=mu_k+phi_k*(colc_coef_c + row_coef_i) with 3 column clustering groups:
#' clustord(Y~COLCLUST+ROW,model="OSM",nclus.column=3,long.df=long.df,
#'              EM.control=list(EMcycles=2,startEMcycles=2), nstarts=2)
#'
#' # Model Log(P(Y=k)/P(Y=1))=mu_k+phi_k*(rowc_coef_r + colc_coef_c)
#' #    with 3 row clustering groups and 2 column clustering groups:
#' clustord(Y~ROWCLUST+COLCLUST,model="OSM",nclus.row=3,nclus.column=2,long.df=long.df,
#'              EM.control=list(EMcycles=2), nstarts=1)
#'
#' # Model Logit(P(Y<=k))=mu_k-rowc_coef_r-colc_coef_c-rowc_colc_coef_rc
#' #    with 2 row clustering groups and 4 column clustering groups, and
#' #    interactions between them:
#' clustord(Y~ROWCLUST*COLCLUST, model="POM", nclus.row=2, nclus.column=4,
#'              long.df=long.df,EM.control=list(EMcycles=2), nstarts=1,
#'              start_from_simple_model=FALSE)
#' @export
clustord <- function(formula,
                     model,
                     nclus.row=NULL,
                     nclus.column=NULL,
                     long.df,
                     initvect=NULL,
                     pi.init=NULL,
                     kappa.init=NULL,
                     EM.control=default.EM.control(),
                     optim.method="L-BFGS-B",
                     optim.control=default.optim.control(),
                     constraint_sum_zero=TRUE,
                     start_from_simple_model=TRUE,
                     parallel_starts=FALSE,
                     nstarts=5,
                     verbose=FALSE){

    validate.inputs(formula=formula, model=model,
                    nclus.row=nclus.row, nclus.column=nclus.column,
                    long.df=long.df, initvect=initvect,
                    pi.init=pi.init, kappa.init=kappa.init,
                    EM.control=EM.control, optim.method=optim.method,
                    constraint_sum_zero=constraint_sum_zero,
                    start_from_simple_model=start_from_simple_model,
                    parallel_starts=parallel_starts,
                    nstarts=nstarts, verbose=verbose)

    ## If ROW and COL are factors, convert them to their numeric values before
    ## running clustering
    long.df <- check.factors(long.df)

    RG <- nclus.row
    CG <- nclus.column

    model_structure <- check.formula(formula, model, long.df, RG, CG)

    if (sum(model_structure$param_lengths) > nrow(long.df)/2) stop("You are trying to fit a model with more parameters than half the number of cells in the data matrix. You should try a simpler model.")

    ## Replace defaults with user-provided values, so that any control parameters
    ## the user did not specify are not left blank:
    default.EM.control <- default.EM.control()
    EM.control <- replacedefaults(default.EM.control, EM.control)

    if (verbose) cat(paste("EM algorithm for",model))

    start.par <- NULL

    if (!is.null(RG) && !is.null(CG)) {
        EM.control$biclustering <- TRUE
        if (is.null(initvect) | is.null(pi.init) | is.null(kappa.init)) {
            ## generate.start will keep using whichever of initvect and pi.init and
            ## kappa.init are not null
            start.par <- generate.start.bicluster(long.df, model=model,
                                                  model_structure=model_structure,
                                                  RG=RG, CG=CG, initvect=initvect,
                                                  pi.init=pi.init, kappa.init=kappa.init,
                                                  EM.control=EM.control,
                                                  optim.method=optim.method,
                                                  optim.control=optim.control,
                                                  constraint_sum_zero=constraint_sum_zero,
                                                  start_from_simple_model=start_from_simple_model,
                                                  parallel_starts=parallel_starts,
                                                  nstarts=nstarts,
                                                  verbose=verbose)
            initvect <- start.par$initvect
            pi.init <- start.par$pi.init
            kappa.init <- start.par$kappa.init
        }
        results <- run.EM.bicluster(invect=initvect, model=model, long.df=long.df,
                                    rowc_mm=model_structure$rowc_mm, colc_mm=model_structure$colc_mm,
                                    cov_mm=model_structure$cov_mm,
                                    pi_v=pi.init, kappa_v=kappa.init,
                                    param_lengths=model_structure$param_lengths,
                                    constraint_sum_zero=constraint_sum_zero,
                                    EM.control=EM.control,
                                    optim.method=optim.method, optim.control=optim.control,
                                    verbose=verbose)
    } else if (!is.null(RG)) {
        EM.control$biclustering <- FALSE
        if (is.null(initvect) | is.null(pi.init)) {
            ## generate.start will keep using whichever of initvect and pi.init is not null
            start.par <- generate.start.rowcluster(long.df, model=model,
                                                   model_structure=model_structure, RG=RG,
                                                   initvect=initvect, pi.init=pi.init,
                                                   EM.control=EM.control,
                                                   optim.method=optim.method,
                                                   optim.control=optim.control,
                                                   constraint_sum_zero=constraint_sum_zero,
                                                   start_from_simple_model=start_from_simple_model,
                                                   parallel_starts=parallel_starts,
                                                   nstarts=nstarts,
                                                   verbose=verbose)
            initvect <- start.par$initvect
            pi.init <- start.par$pi.init
        }
        results <- run.EM.rowcluster(invect=initvect, model=model, long.df=long.df,
                                     rowc_mm=model_structure$rowc_mm, colc_mm=model_structure$colc_mm,
                                     cov_mm=model_structure$cov_mm,
                                     pi_v=pi.init, param_lengths=model_structure$param_lengths,
                                     constraint_sum_zero=constraint_sum_zero,
                                     EM.control=EM.control,
                                     optim.method=optim.method, optim.control=optim.control,
                                     verbose=verbose)
    } else if (!is.null(CG)) {
        EM.control$biclustering <- FALSE
        RG <- nclus.column
        pi.init <- kappa.init
        long.df.transp <- long.df
        long.df.transp$ROW <- long.df$COL
        long.df.transp$COL <- long.df$ROW

        # Convert the model structure to row clustering instead of column clustering
        model_structure.transp <- convert.model.row.to.column(model_structure)

        if (is.null(initvect) | is.null(pi.init)) {
            ## generate.start will keep using whichever of initvect and kappa.init is not null
            start.par <- generate.start.rowcluster(long.df.transp, model=model,
                                                   model_structure=model_structure.transp, RG=RG,
                                                   initvect=initvect, pi.init=kappa.init,
                                                   EM.control=EM.control,
                                                   optim.method=optim.method,
                                                   optim.control=optim.control,
                                                   constraint_sum_zero=constraint_sum_zero,
                                                   start_from_simple_model=start_from_simple_model,
                                                   parallel_starts=parallel_starts,
                                                   nstarts=nstarts,
                                                   verbose=verbose)
            initvect <- start.par$initvect
            pi.init <- start.par$pi.init
        }

        results <- run.EM.rowcluster(invect=initvect, long.df=long.df.transp,
                                     model=model,
                                     rowc_mm=model_structure.transp$rowc_mm,
                                     colc_mm=model_structure.transp$colc_mm,
                                     cov_mm=model_structure.transp$cov_mm,
                                     pi_v=pi.init,
                                     param_lengths=model_structure.transp$param_lengths,
                                     constraint_sum_zero=constraint_sum_zero,
                                     EM.control=EM.control,
                                     optim.method=optim.method, optim.control=optim.control,
                                     verbose=verbose)

        ## Now convert the results back to row clustering ----
        column.info <- results$info
        column.info[c('nclus.column','n','p')] <- column.info[c('nclus.row','p','n')]
        column.info <- column.info[-which(names(column.info) == "nclus.row")]

        column.parlist <- convert.output.row.to.column(results$parlist.out)

        column.best.parlist <- convert.output.row.to.column(results$EM.status$params.for.best.lli)

        column.EM.status <- results$EM.status
        column.EM.status$params.for.best.lli <- column.best.parlist

        # Note: keep row-clustering-format param_lengths for
        column.results <- list(info=column.info,
                               model=results$model,
                               clustering_mode="column clustering",
                               EM.status=column.EM.status,
                               criteria=results$criteria,
                               numerical.correction.epsilon=results$numerical.correction.epsilon,
                               constraint_sum_zero=results$constraint_sum_zero,
                               param_lengths=model_structure$param_lengths,
                               initvect=initvect,
                               parlist.out=column.parlist,
                               kappa.init=results$pi.init,
                               kappa.out=results$pi.out,
                               ppc=results$ppr,
                               colc_mm=model_structure$colc_mm,
                               cov_mm=model_structure$cov_mm,
                               ColumnClusters=results$RowClusters,
                               ColumnClusterMembers=results$RowClusterMembers,
                               rowc_format_param_lengths=results$param_lengths,
                               rowc_format_outvect=results$outvect,
                               rowc_format_parlist.init=results$parlist.init,
                               rowc_format_rowc_mm=model_structure.transp$rowc_mm)

        results <- column.results
    } else {
        stop("Both nclus.row and nclus.col are NULL. Please set one or both to integers and try again.")
    }

    ## If the model contains individual row or column effects, rename them accordingly
    results <- tidy.output(results, long.df)

    if (results$EM.status$converged) message("EM algorithm has successfully converged.")
    else message("EM algorithm has not converged. Please try again, or with a different random seed, or with more starting points.")

    results$start.par <- start.par
    results$call <- match.call()
    results$formula <- formula
    results$terms <- terms(formula)

    class(results) <- "clustord"

    return(results)
}

#' @export
print.clustord <- function(x, digits = max(3L, getOption("digits") - 3L), ...)
{
    cat("\nCall:\n",
        paste(deparse(x$call), sep = "\n", collapse = "\n"), "\n\n", sep = "")

    cat("Clustering mode:\n",
        x$clustering_mode)

    cat("\n\nConverged:\n",
        x$EM.status$converged)

    cat("\n\nCluster sizes:\n")
    if ("RowClusterMembers" %in% names(x)) {
        cat("Row clusters: ", sapply(x$RowClusterMembers, length), "\n")
    }
    if ("ColumnClusterMembers" %in% names(x)) {
        cat("Column clusters: ", sapply(x$ColumnClusterMembers, length), "\n")
    }

    invisible(x)
}

#' @export
summary.clustord <- function (object, ...)
{
    z <- object

    ans <- z[c("call", "terms", "formula", "model", "clustering_mode")]

    ans$convergence <- z$EM.status$converged

    ans$numiter <- z$EM.status$iter
    ans$loglikelihood <- z$EM.status$best.lli
    ans$loglikelihood_type <- ifelse(z$clustering_mode == "biclustering","approximate","exact")
    ans$AIC <- z$criteria$AIC
    ans$BIC <- z$criteria$BIC

    ans$estimates <- z$parlist.out

    if("RowClusterMembers" %in% names(z)) {
        ans$rowclustersizes <- sapply(z$RowClusterMembers, length)
    }
    if ("ColumnClusterMembers" %in% names(z)) {
        ans$colclustersizes <- sapply(z$ColumnClusterMembers, length)
    }

    class(ans) <- "summary.clustord"
    ans
}

#' @export
print.summary.clustord <- function (x, digits = max(3L, getOption("digits") - 3L), ...) {
    cat("\nCall:\n", # S has ' ' instead of '\n'
        paste(deparse(x$call), sep="\n", collapse = "\n"), "\n\n", sep = "")

    cat("Clustering mode:\n",
        x$clustering_mode)

    cat("\n\nConverged:\n",
        x$convergence)
    cat("\nAIC:", x$AIC, "\n")
    cat("\nBIC:", x$BIC, "\n")

    cat("\n\nCluster sizes:\n")
    if ("rowclustersizes" %in% names(x)) {
        cat("Row clusters: ", x$rowclustersizes, "\n")
    }
    if ("colclustersizes" %in% names(x)) {
        cat("Column clusters: ", x$colclustersizes, "\n")
    }

    cat("\nParameter estimates:\n")
    print.default(x$estimates, digits=digits, print.gap=1L, quote=FALSE)

    cat("\n")
    invisible(x)
}

#' Rerun clustord using the results of a previous run as the starting point.
#'
#' This function is designed for two purposes.
#' (1) You tried to run clustord and the results did not converge. You can supply
#' this function with the previous results and the previous data object, and it
#' will carry on running clustord from the endpoint of the previous run, which
#' is quicker than starting the run again from scratch with more iterations.
#'
#' (2) The previous result converged, but you have changed the dataset slightly,
#' and want to rerun from the previous endpoint to save time.
#'
#' Either way, you call the function in the same way, supplying the previous
#' results object and a dataset, and optionally a new number of iterations
#' (`EM.control=list(EMcycles=XXX)`, where `XXX` is the new number of iterations.)
#'
#' The output parameters of the old result will be used as the new initial
#' parameters.
#'
#' @param results.original The results of the previous run that you want to use
#'    as a starting point. The model, number of clusters, and final parameter
#'    values will be used, and the cluster controls such as EMcycles will be
#'    reused unless the user specifies new values. But the row cluster and/or
#'    column cluster memberships will NOT be reused, and nor will the dataset,
#'    so you can change the dataset slightly and the rest of the details will
#'    be applied to this new dataset.
#' @param long.df The dataset to use for this run, which may be slightly
#'    different to the original. Please note that the only compatibility check
#'    performed is comparing the sizes of the original and new datasets, and it
#'    is up to the user to check that the new dataset is sufficiently similar
#'    to the old one.
#' @param EM.control Options to use for this run such as EMcycles (number of EM
#'    iterations). Note that "startEMcycles" will not be relevant as this run
#'    will not generate random starts, it will run from the end parameters of
#'    the other run. See \link{clustord} documentation for more info.
#' @param optim.control Options to use for this run within \code{optim()}, which
#'    is used to estimate the parameters during each M-step. See \link{clustord}
#'    documentation for more info.
#' @param verbose (default \code{FALSE}) changes how much is reported to the
#'    console during the algorithm's progress. See \link{clustord}
#'    documentation for more info.
#' @returns An object of class \code{clustord}. See \link{clustord} for more info.
#' @examples
#' set.seed(1)
#' long.df <- data.frame(Y=factor(sample(1:3,5*20,replace=TRUE)),
#'                ROW=factor(rep(1:20,times=5)),COL=rep(1:5,each=20))
#' results.original <- clustord(Y ~ ROWCLUST, model="OSM", nclus.row=4,
#'                              long.df=long.df, EM.control=list(EMcycles=2))
#' results.original$EM.status$converged
#' # FALSE
#'
#' ## Since original run did not converge, rerun from that finishing point and
#' ## allow more iterations this time
#' results.new <- rerun(results.original, long.df, EM.control=list(EMcycles=10))
#'
#' ## Alternatively, if dataset has changed slightly then rerun from the
#' ## previous finishing point to give the new results a helping hand
#' long.df.new <- long.df[-c(4,25,140),]
#' results.new <- rerun(results.original, long.df.new)
#' @export
rerun <- function(results.original, long.df, EM.control=NULL, verbose=FALSE, optim.control=NULL) {
    if (!("clustord" %in% class(results.original))) stop("results.original must be a clustord results object.")

    model <- results.original$model
    formula <- results.original$formula

    if ("nclus.row" %in% names(results.original$info)) nclus.row <- results.original$info['nclus.row']
    else nclus.row <- NULL
    if ("nclus.column" %in% names(results.original$info)) nclus.column <- results.original$info['nclus.column']
    else nclus.column <- NULL

    # Normalise pi/kappa before resupplying, just in case the outputs don't quite
    # add up to 1.
    if ("pi.out" %in% names(results.original)) pi.init <- results.original$pi.out/sum(results.original$pi.out)
    else pi.init <- NULL
    if ("kappa.out" %in% names(results.original)) kappa.init <- results.original$kappa.out/sum(results.original$kappa.out)
    else kappa.init <- NULL

    if ("outvect" %in% names(results.original)) initvect <- results.original$outvect
    else if ("rowc_format_outvect" %in% names(results.original)) initvect <- results.original$rowc_format_outvect
    else initvect <- NULL

    if (abs(length(unique(long.df$ROW)) - results.original$info['n'])/results.original$info['n'] > 0.01) warning("The number of rows in the data matrix has changed by more than 1%. This run may not be compatible with the original run.")
    if (abs(length(unique(long.df$COL)) - results.original$info['p'])/results.original$info['p'] > 0.01) warning("The number of columns in the data matrix has changed by more than 1%. This run may not be compatible with the original run.")

    results.new <- clustord(formula=formula, model=model, nclus.row=nclus.row,
                            nclus.column=nclus.column, long.df=long.df,
                            initvect=initvect, pi.init=pi.init, kappa.init=kappa.init,
                            EM.control=EM.control, verbose=verbose,
                            optim.control=optim.control)
}