get_IMIFA_results: Extract results, conduct posterior inference and compute...

View source: R/Diagnostics.R

get_IMIFA_results    R Documentation

Extract results, conduct posterior inference and compute performance metrics for MCMC samples of models from the IMIFA family

Description

This function post-processes simulations generated by mcmc_IMIFA for any of the IMIFA family of models. This includes accounting for label switching, and accounting for rotational invariance via Procrustean methods. It can be re-run at little computational cost in order to extract different models explored by the sampler used for sims, without having to re-run the model itself. New results objects using different numbers of clusters and different numbers of factors (if visited by the model in question), or using different model selection criteria (if necessary), can be generated with ease. Posterior predictive checking of the appropriateness of the fitted model is also facilitated.

Usage

get_IMIFA_results(sims = NULL,
                  burnin = 0L,
                  thinning = 1L,
                  G = NULL,
                  Q = NULL,
                  criterion = c("bicm", "aicm", "dic", "bic.mcmc", "aic.mcmc"),
                  G.meth = c("mode", "median"),
                  Q.meth = c("mode", "median"),
                  conf.level = 0.95,
                  error.metrics = TRUE,
                  vari.rot = FALSE,
                  z.avgsim = FALSE,
                  zlabels = NULL,
                  nonempty = TRUE,
                  ...)

## S3 method for class 'Results_IMIFA'
print(x,
      ...)

## S3 method for class 'Results_IMIFA'
summary(object,
        MAP = TRUE,
        ...)

Arguments

sims

An object of class "IMIFA" generated by mcmc_IMIFA.

burnin

Optional additional number of iterations to discard. Defaults to 0, corresponding to no additional burnin. See mixfaControl for the default burnin settings used previously by mcmc_IMIFA.

thinning

Optional interval for extra thinning to be applied. Defaults to 1, corresponding to no additional thinning. See mixfaControl for the default thinning settings used previously by mcmc_IMIFA.
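
As a rough illustration of how extra burnin and thinning combine, the sketch below selects which stored iterations would be kept. It is a minimal base R example under stated assumptions: n.store and keep are hypothetical names, and the indexing used internally by get_IMIFA_results may differ.

```r
# Hypothetical setting: 100 iterations were stored by the sampler.
n.store  <- 100L
burnin   <- 20L   # discard the first 20 stored iterations
thinning <- 4L    # then keep every 4th remaining iteration

# Indices of the stored iterations that survive both steps.
keep <- seq(from = burnin + 1L, to = n.store, by = thinning)
length(keep)
head(keep)
```

With the defaults burnin = 0L and thinning = 1L, the same expression keeps every stored iteration.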

G

If this argument is not specified, results will be returned with the optimal number of clusters. If different numbers of clusters were explored in sims for the "MFA" or "MIFA" methods, supplying an integer value allows pulling out a specific solution with G clusters, even if the solution is sub-optimal.

Similarly, this allows retrieval of samples corresponding to a solution, if visited, with G clusters for the "OMFA", "OMIFA", "IMFA" and "IMIFA" methods.

Q

If this argument is not specified, results will be returned with the optimal number of factors. If different numbers of factors were explored in sims for the "FA", "MFA", "OMFA" or "IMFA" methods, this allows pulling out a specific solution with Q factors, even if the solution is sub-optimal.

Similarly, this allows retrieval of samples corresponding to a solution, if visited, with Q factors for the "IFA", "MIFA", "OMIFA" and "IMIFA" methods. Can be supplied as a scalar or a vector of values for each cluster.

criterion

The criterion to use for model selection, where model selection is only required if more than one model was run under the "FA", "MFA", "MIFA", "OMFA" or "IMFA" methods when sims was created via mcmc_IMIFA. Defaults to bicm, but note that these are all calculated; this argument merely indicates which one will form the basis of the construction of the output.

Note that the first three options here might exhibit bias in favour of zero-factor models for the finite factor "FA", "MFA", "OMFA" and "IMFA" methods and might exhibit bias in favour of one-cluster models for the "MFA" and "MIFA" methods. The aic.mcmc and bic.mcmc criteria will only be returned for finite factor models.

G.meth

If the object in sims arises from the "OMFA", "OMIFA", "IMFA" or "IMIFA" methods, this argument determines whether the optimal number of clusters is given by the mode or median of the posterior distribution of G. Defaults to "mode". Often the mode and median will agree in any case.

Q.meth

If the object in sims arises from the "IFA", "MIFA", "OMIFA" or "IMIFA" methods, this argument determines whether the optimal number of latent factors is given by the mode or median of the posterior distribution of Q. Defaults to "mode". Often the mode and median will agree in any case.
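
The difference between the two options for G.meth and Q.meth can be seen on a toy posterior sample. The sketch below uses base R only; G.samples is a made-up vector of sampled cluster counts, not output from mcmc_IMIFA, and it is deliberately skewed so that the mode and median disagree.

```r
# Hypothetical posterior sample of the number of non-empty clusters.
G.samples <- c(2, 2, 2, 2, 3, 4, 5, 5, 6, 6, 7)

# Mode: the most frequently visited value.
G.tab  <- table(G.samples)
G.mode <- as.integer(names(G.tab)[which.max(G.tab)])

# Median: the middle value of the sorted sample.
G.med  <- median(G.samples)

c(mode = G.mode, median = G.med)
```

For a well-mixed chain with a concentrated posterior the two summaries usually coincide, as noted above.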

conf.level

The confidence level to be used throughout for credible intervals for all parameters of inferential interest, and error metrics if error.metrics=TRUE. Defaults to 0.95.
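
To make the role of conf.level concrete, the sketch below builds an equal-tailed interval from the quantiles of a simulated chain. This is an assumption made for illustration: the documentation does not state which type of credible interval is constructed, and theta is a hypothetical vector of posterior draws.

```r
set.seed(1)
# Hypothetical retained draws of a single parameter.
theta <- rnorm(5000, mean = 2, sd = 0.5)

conf.level <- 0.95
alpha      <- 1 - conf.level

# Equal-tailed interval: cut off alpha/2 of the draws in each tail.
ci <- quantile(theta, probs = c(alpha / 2, 1 - alpha / 2))
ci
```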

error.metrics

A logical activating or deactivating posterior predictive checking, i.e. controlling whether the following metrics should be computed: a) the posterior predictive reconstruction error (PPRE) between bin counts of the data and bin counts of replicate draws from the posterior distribution, and b) the error between the empirical and estimated covariance matrices. These are computed for every valid retained iteration (see Details). Defaults to TRUE, but can be time-consuming for models which achieve clustering. These error metrics, and the uncertainty associated with them, can be visualised via plot.Results_IMIFA. Depending on what parameters were stored when calling mcmc_IMIFA, not all error metrics may be available to compute.

The Frobenius norm is used in the computation of the PPRE, by default, but the type of norm can be changed via the ... construct below. So too can the breakpoints (dbreaks) used to bin the data and the posterior predictive replicate data sets. Some caution is advised in the latter case.
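
The idea of comparing bin counts under a norm can be sketched in base R for a single variable. This is only an illustration under assumptions: x.rep stands in for one replicate draw from the posterior predictive distribution, the shared breakpoints are built with pretty purely for convenience, and the package's own scaling and aggregation of the PPRE across variables may differ.

```r
set.seed(42)
x     <- rnorm(200)   # one observed, standardised variable
x.rep <- rnorm(200)   # hypothetical posterior predictive replicate

# Shared breakpoints spanning both samples, so the counts are comparable.
dbreaks <- pretty(range(c(x, x.rep)), n = 10)

counts.x   <- hist(x,     breaks = dbreaks, plot = FALSE)$counts
counts.rep <- hist(x.rep, breaks = dbreaks, plot = FALSE)$counts

# Frobenius norm of the difference in bin counts (type = "F").
ppre <- norm(as.matrix(as.numeric(counts.x - counts.rep)), type = "F")
ppre
```

Passing a different type to norm (for example "1" or "M") changes how discrepancies across bins are aggregated.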

vari.rot

Logical indicating whether the loadings matrix/matrices template(s) should be varimax rotated first, prior to the Procrustes rotation steps. Defaults to FALSE. Not necessary at all for clustering purposes, or inference on the covariance matrix, but useful if interpretable inferences on the loadings matrix/matrices are desired. Arguments to varimax can be passed via the ... construct, but note that the argument normalize here defaults to FALSE.
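
The claim that varimax rotation is unnecessary for inference on the covariance matrix can be checked directly with stats::varimax. In the sketch below, L is a made-up loadings matrix rather than package output, and normalize = FALSE mirrors the default described above.

```r
set.seed(7)
# Hypothetical 6 x 2 loadings matrix.
L <- matrix(rnorm(12), nrow = 6, ncol = 2)

vm    <- varimax(L, normalize = FALSE)
L.rot <- L %*% vm$rotmat   # rotated loadings

# The rotation matrix is orthogonal, so L L' is unchanged.
all.equal(tcrossprod(L), tcrossprod(L.rot))
```

Only the interpretation of the individual loadings changes; the implied covariance structure does not.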

z.avgsim

Logical (defaults to FALSE) indicating whether, in addition to the MAP estimate, the clustering should also be summarised via a call to Zsimilarity, which returns the clustering with minimum mean squared error to the similarity matrix obtained by averaging the stored adjacency matrices.

Note that the MAP clustering is computed conditional on the estimate of the number of clusters (whether that be the modal estimate or the estimate according to criterion), and other parameters are extracted conditional on this estimate of G. In contrast, the number of distinct clusters in the summarised labels obtained by specifying z.avgsim=TRUE may not necessarily coincide with the MAP estimate of G. It may nonetheless provide a useful alternative summary of the partitions explored during the chain, and the user is free to call get_IMIFA_results again with the newly suggested G value.

Please be warned that this feature requires loading the mcclust package. This is liable to take considerable time to compute, and may not even be possible if the number of observations and/or the number of stored iterations is large and the resulting matrix isn't sufficiently sparse. When z.avgsim=TRUE, both the summarised clustering and the similarity matrix are stored: the latter can be visualised as part of a call to plot.Results_IMIFA.
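
The averaging of adjacency matrices can be sketched in base R on a tiny example. This is not the implementation in Zsimilarity or mcclust: z.store and adjacency are hypothetical names, and the search is restricted to the sampled partitions for simplicity.

```r
# Hypothetical stored labels: 4 iterations (rows) x 5 observations (columns).
z.store <- rbind(c(1, 1, 2, 2, 2),
                 c(1, 1, 2, 2, 1),
                 c(2, 2, 1, 1, 1),   # same partition as row 1, labels switched
                 c(1, 1, 1, 2, 2))

# Adjacency matrix: entry (i, j) is 1 if observations i and j share a cluster.
adjacency <- function(z) outer(z, z, "==") * 1

# Similarity matrix: the average of the adjacency matrices over iterations.
adj.list <- lapply(seq_len(nrow(z.store)), function(i) adjacency(z.store[i, ]))
sim      <- Reduce(`+`, adj.list) / nrow(z.store)

# Sampled partition whose adjacency matrix is closest to the average.
mse   <- vapply(adj.list, function(A) mean((A - sim)^2), numeric(1))
z.avg <- z.store[which.min(mse), ]
z.avg
```

Because adjacency matrices ignore the label values themselves, rows 1 and 3 contribute identically, which is why this summary is unaffected by label switching. The similarity matrix has one entry per pair of observations, which is the source of the computational cost mentioned above.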

zlabels

For any method that performs clustering, the true labels can be supplied if they are known in order to compute clustering performance metrics. This also has the effect of ordering the MAP labels (and thus the ordering of cluster-specific parameters) to most closely correspond to the true labels if supplied.

nonempty

For "MFA" and "MIFA" models ONLY: a logical indicating whether only iterations with non-empty components should be retained. Defaults to TRUE, but may lead to empty chains - conversely, FALSE may lead to empty components and related errors.

x, object, MAP, ...

Arguments required for the print.Results_IMIFA and summary.Results_IMIFA functions: x and object are objects of class "Results_IMIFA" resulting from a call to get_IMIFA_results. MAP is a logical which governs whether a table of the MAP classification is printed, while ... gathers additional arguments to those functions.

Users can also pass the type argument to the norm function when isTRUE(error.metrics) and the posterior predictive reconstruction error (PPRE) is calculated. By default the Frobenius norm (type="F") is employed.

Finally, the ... construct also allows arguments to varimax to be passed to get_IMIFA_results itself, when isTRUE(vari.rot), or arguments to hist when isTRUE(error.metrics), in order to guide construction of the bins. Additionally, by passing the argument dbreaks via the ... construct, the bins can be specified directly. However, caution is advised in doing so; in particular, the bins must be constructed on data which has been standardised in the same way as the data modelled within mcmc_IMIFA.

Details

The function also performs post-hoc corrections for label switching, as well as post-hoc Procrustes rotation of loadings matrices and scores, in order to ensure sensible posterior parameter estimates, computes error metrics, constructs credible intervals, and generally transforms the raw sims object into an object of class "Results_IMIFA" in order to prepare the results for plotting via plot.Results_IMIFA.

For the infinite factor methods, iterations where the maximum number of factors was greater than or equal to the maximum of the estimated cluster-specific factors are retained for posterior summaries of the scores, in order to preserve the estimated dimension of the scores matrices. Similarly, these are also the valid iterations used for the computation of the averages and credible intervals for the error metrics. For the finite factor models, all retained iterations are used in both instances (i.e. both for the scores and the error metrics).

In all cases, only iterations with G non-empty components are retained.
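
The post-hoc Procrustes step mentioned above can be illustrated with the standard orthogonal Procrustes solution computed via the singular value decomposition. The sketch below uses base R only; it is not the package's Procrustes function, and template, sampled and R.hat are hypothetical names.

```r
set.seed(3)
# Hypothetical template loadings (6 variables, 2 factors).
template <- matrix(rnorm(12), nrow = 6, ncol = 2)

# A sampled loadings matrix that differs from the template by a rotation only.
theta   <- pi / 5
R.true  <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
sampled <- template %*% R.true

# Orthogonal Procrustes: find the orthogonal R minimising
# the Frobenius norm of (sampled %*% R - template).
sv      <- svd(crossprod(sampled, template))
R.hat   <- sv$u %*% t(sv$v)
aligned <- sampled %*% R.hat

max(abs(aligned - template))
```

Aligning every retained sample to a common template in this way is what makes averaging the loadings across iterations meaningful despite rotational invariance.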

Value

An object of class "Results_IMIFA" to be passed to plot.Results_IMIFA for visualising results. Dedicated print and summary functions also exist for objects of this class. The object, say x, is a list of lists, the most important components of which are:

Clust

Everything pertaining to clustering performance can be found here for all but the "FA" and "IFA" methods (or models where the estimated number of clusters is 1), in particular x$Clust$MAP, the MAP summary of the posterior clustering, the last valid sample of cluster labels x$Clust$last.z, the matrix of posterior cluster membership probabilities x$Clust$post.prob, and the posterior confusion matrix x$Clust$PCM.

More detail is given if known zlabels are supplied: performance is always evaluated against the MAP clustering, with additional evaluation against the alternative clustering computed if z.avgsim=TRUE. Posterior summaries of the mixing proportions, and the concentration/discount parameters, if necessary, are also included here, as well as the last valid samples of each.

Error

Everything pertaining to model fit assessment can be found here, incl. the distribution of the PPRE values and associated bin counts for the replicate draws, as well as average error metrics (e.g. MSE, RMSE), and credible intervals quantifying the associated uncertainty, between the empirical and estimated covariance matrix/matrices, both of which are also included.
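
As an indication of what the covariance error metrics measure, the sketch below compares an empirical covariance matrix with a covariance matrix implied by a factor-analytic decomposition. All objects are simulated for illustration; Lambda and Psi are hypothetical estimates, and the exact definitions used by the package may differ.

```r
set.seed(11)
# Hypothetical data: 100 observations on 3 variables.
X     <- matrix(rnorm(300), nrow = 100, ncol = 3)
S.emp <- cov(X)

# Hypothetical estimated covariance: loadings outer product plus uniquenesses.
Lambda <- matrix(c(0.3, 0.2, 0.1), nrow = 3, ncol = 1)
Psi    <- diag(c(0.90, 0.95, 1.00))
S.est  <- tcrossprod(Lambda) + Psi

# Element-wise error between the two matrices.
err  <- S.emp - S.est
mse  <- mean(err^2)
rmse <- sqrt(mse)
c(MSE = mse, RMSE = rmse)
```

Computing such metrics at each valid retained iteration yields a distribution, from which averages and credible intervals can be formed.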

GQ.results

Everything pertaining to model choice can be found here, incl. posterior summaries for the estimated number of clusters and estimated number of factors, if applicable to the method employed. Model selection criterion values are also accessible here.

Means

Posterior summaries for the means, after conditioning on G.

Loadings

Posterior summaries for the factor loadings matrix/matrices, after conditioning on G and Q. Posterior mean loadings given by x$Loadings$post.load are given the loadings class for printing purposes and thus the manner in which they are displayed can be modified.

The number of iterations retained for posterior summaries of the loadings may vary for different clusters for the infinite factor methods, corresponding to iterations where the cluster-specific number of factors was greater than or equal to the modal estimate of the cluster-specific number of factors.

Scores

Posterior summaries for the latent factor scores, after conditioning on the maximum of the estimated number of cluster-specific factors. Summaries are given for the single matrix of factor scores. See scores_MAP to decompose these summaries into sub-matrices according to the MAP partition (for models which achieve clustering).

For the infinite factor methods, iterations where the maximum number of factors was greater than or equal to the maximum of the estimated cluster-specific factors are retained for posterior summaries of the scores, in order to preserve the estimated dimension of the scores matrices.

Uniquenesses

Posterior summaries for the uniquenesses, after conditioning on G.

The objects Means, Loadings, Scores and Uniquenesses (if stored when calling mcmc_IMIFA!) contain, in addition to the posterior summaries, the entire chain of valid samples of each and, for convenience, the last valid sample of each (after conditioning on the modal G and Q values, and accounting for label switching and for rotational invariance via Procrustes rotation).

Note

For the "IMIFA", "IMFA", "OMIFA", and "OMFA" methods, the retained mixing proportions are renormalised after conditioning on the modal G. This is especially necessary for the computation of the error.metrics. Note, however, that the values on which posterior inference is conducted will consequently differ ever so slightly from the values actually sampled.
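
The effect of renormalising the mixing proportions can be seen in a small base R sketch. The values are made up, and it is assumed purely for illustration that the G retained components are the first G entries of the sampled vector.

```r
# Hypothetical sampled mixing proportions over 5 components.
pi.sample <- c(0.45, 0.30, 0.20, 0.04, 0.01)

# Condition on G = 3 and rescale so the retained proportions sum to 1.
G    <- 3
pi.G <- pi.sample[seq_len(G)] / sum(pi.sample[seq_len(G)])

rbind(sampled = pi.sample[seq_len(G)], renormalised = pi.G)
```

The renormalised values are slightly larger than those sampled, since the small mass on the discarded components is redistributed proportionally.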

Due to the way the offline label-switching correction is performed, different runs of this function may give very slightly different results in terms of the cluster labellings (and by extension the parameters, which are permuted in the same way), but only if the chain was run for an extremely small number of iterations, well below the number required for convergence, and samples of the cluster labels match poorly across iterations (particularly if the number of clusters suggested by those sampled labels is high).

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie>

References

Murphy, K., Viroli, C., and Gormley, I. C. (2020) Infinite mixtures of infinite factor analysers, Bayesian Analysis, 15(3): 937-963. <doi:10.1214/19-BA1179>.

See Also

plot.Results_IMIFA, mcmc_IMIFA, Zsimilarity, scores_MAP, sim_IMIFA_model, Procrustes, varimax, norm

Examples

# data(coffee)
# data(olive)

# Run a MFA model on the coffee data over a range of clusters and factors.
# simMFAcoffee  <- mcmc_IMIFA(coffee, method="MFA", range.G=2:3, range.Q=0:3, n.iters=1000)

# Accept all defaults to extract the optimal model.
# resMFAcoffee  <- get_IMIFA_results(simMFAcoffee)

# Instead let's get results for a 3-cluster model, allowing Q to be chosen by aic.mcmc.
# resMFAcoffee2 <- get_IMIFA_results(simMFAcoffee, G=3, criterion="aic.mcmc")

# Run an IMIFA model on the olive data, accepting all defaults.
# simIMIFAolive <- mcmc_IMIFA(olive, method="IMIFA", n.iters=10000)

# Extract optimum results
# Estimate G & Q by the median of their posterior distributions
# Construct 90% credible intervals and try to return the similarity matrix.
# resIMIFAolive <- get_IMIFA_results(simIMIFAolive, G.meth="median", Q.meth="median",
#                                    conf.level=0.9, z.avgsim=TRUE)
# summary(resIMIFAolive)

# Simulate new data from the above model
# newdata       <- sim_IMIFA_model(resIMIFAolive)

IMIFA documentation built on Dec. 28, 2022, 1:58 a.m.