tclustregIC: Computes 'tclustreg' for different number of groups 'k' and...


tclustregIC    R Documentation

Computes tclustreg for different numbers of groups k and restriction factors c.

Description

The last two letters stand for 'Information Criterion'. This function computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA), for different values of k (number of groups) and different values of c (restriction factor for the variances of the residuals), for a prespecified level of trimming. In order to minimize randomness, given k, the same subsets are used for each value of c.

Usage

tclustregIC(
  y,
  x,
  alphaLik = 0,
  alphaX = 1,
  intercept = TRUE,
  whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"),
  kk = 1:5,
  cc = c(1, 2, 4, 8, 16, 32, 64, 128),
  ccSigmaX = 12,
  plot = FALSE,
  nsamp,
  refsteps = 10,
  reftol = 1e-13,
  equalweights = FALSE,
  we,
  msg = TRUE,
  nocheck = FALSE,
  RandNumbForNini,
  startv1 = 1,
  UnitsSameGroup,
  commonslope = FALSE,
  Ysave = TRUE,
  trace = FALSE,
  ...
)

Arguments

y

Response variable. A vector with n elements that contains the response variable.

x

An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
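The row exclusion described above can be reproduced by hand, for example to know in advance how many observations will enter the computations. The following is a minimal base-R sketch, not the package's internal code: drop_incomplete_rows is a hypothetical helper, the data are made up for illustration, and dropping the matching elements of y is an assumption made here to keep y and x aligned.

```r
# Hypothetical helper (not part of fsdaR): keep only the rows of x whose
# entries are all finite. is.finite() is FALSE for NA, NaN, Inf and -Inf.
# The same positions are dropped from y (an assumption of this sketch).
drop_incomplete_rows <- function(y, x) {
  x <- as.matrix(x)
  keep <- rowSums(!is.finite(x)) == 0
  list(y = y[keep], x = x[keep, , drop = FALSE], keep = keep)
}

# Made-up data: row 2 contains an NA and row 4 contains an Inf
x_toy <- matrix(c(1, NA, 3, 4, 5,
                  2, 1, 0, Inf, 7), ncol = 2)
y_toy <- c(10, 20, 30, 40, 50)

clean <- drop_incomplete_rows(y_toy, x_toy)
nrow(clean$x)   # number of rows that survive the check
```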

alphaLik

Trimming level: a number between 0 and 0.5, or an integer specifying the number of observations to be trimmed. If alphaLik=0, there is no trimming. In more detail, if 0 < alphaLik < 1, clustering is based on h = floor(n * (1 - alphaLik)) observations. If alphaLik is an integer greater than 1, clustering is based on h = n - floor(alphaLik) observations. The likelihood contributions are sorted and the units associated with the smallest n - h contributions are trimmed.
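The two formulas for h given above can be checked with a few lines of base R. This is only a sketch of the arithmetic: trimmed_subset_size is a hypothetical helper, the likelihood contributions are made-up numbers, and the behaviour for values not covered by the text (for example alphaLik exactly equal to 1) is left undefined here.

```r
# Hypothetical helper (not part of fsdaR): number of observations h on which
# clustering is based, following the two formulas in the description.
trimmed_subset_size <- function(n, alphaLik) {
  if (alphaLik == 0) {
    return(n)                              # no trimming
  }
  if (alphaLik > 0 && alphaLik < 1) {
    return(floor(n * (1 - alphaLik)))      # alphaLik is a proportion
  }
  if (alphaLik > 1) {
    return(n - floor(alphaLik))            # alphaLik is a number of units
  }
  stop("case not covered by the description of alphaLik")
}

h_none  <- trimmed_subset_size(200, 0)
h_prop  <- trimmed_subset_size(200, 0.25)
h_count <- trimmed_subset_size(200, 5)

# Made-up likelihood contributions for n = 5 units; with alphaLik = 2 the
# n - h = 2 units with the smallest contributions are trimmed.
contrib <- c(0.9, 0.1, 0.5, 0.05, 0.7)
h <- trimmed_subset_size(length(contrib), 2)
trimmed <- order(contrib)[seq_len(length(contrib) - h)]
```

With these inputs, h is 3 and the units with the two smallest contributions are the ones flagged for trimming.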

alphaX

Second-level trimming or constrained weighted model for x.

  • If alphaX=0 there is no second-level trimming.

  • If alphaX is in the interval [0, 0.5] it indicates the fixed proportion of units subject to second-level trimming. In this case alphaX is usually smaller than alphaLik. For further details see Garcia-Escudero et al. (2010).

  • If alphaX is in the interval [0.5, 1], it indicates a Bonferronized confidence level to be used to identify the units subject to second-level trimming. In this case the proportion of units subject to second-level trimming is not fixed a priori, but is determined adaptively. For further details see Torti et al. (2018).

  • If alphaX=1, a constrained weighted model (CWM) for X is assumed (Gershenfeld, 1997). The CWM estimator is able to take into account different distributions for the explanatory variables across groups, thus overcoming an intrinsic limitation of mixtures of regressions, in which the explanatory variables are implicitly assumed to be equally distributed. Note that if alphaX=1 it is also possible to constrain the covariance matrices of the explanatory variables using restrfactor(2). For further details about CWM see Garcia-Escudero et al. (2017) or Torti et al. (2018).

intercept

Whether to use a constant term (default is intercept=TRUE).

whichIC

A character value which specifies which information criteria must be computed for each k (number of groups) and each value of the restriction factor c. Possible values for whichIC are:

  • "MIXMIX": a mixture model is fitted and the mixture likelihood is used to compute the information criterion. This option corresponds to the use of the Bayesian Information Criterion (BIC). In the output just the matrix MIXMIX is given.

  • "MIXCLA": a mixture model is fitted but to compute the information criterion the classification likelihood is used. This option corresponds to the use of the Integrated Complete Likelihood (ICL). In the output just the matrix MIXCLA is given.

  • "CLACLA": everything is based on the classification likelihood. This information criterion will be called CLA. In the output just the matrix CLACLA is given.

  • "ALL": both classification and mixture likelihood are used. In this case all three information criteria CLA, ICL and BIC are computed. In the output all three matrices MIXMIX, MIXCLA and CLACLA are given.
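The correspondence between whichIC and the components present in the returned object, as listed in this help page, can be summarised in a small lookup. The sketch below is illustrative only: ic_output_names is a hypothetical helper, and resolving the default with match.arg() is an assumption suggested by the vector default in Usage, not a confirmed detail of the implementation.

```r
# Hypothetical helper (not part of fsdaR): names of the IC matrices and
# assignment arrays that this help page lists for each value of whichIC.
ic_output_names <- function(whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA")) {
  whichIC <- match.arg(whichIC)   # with no argument, picks the first: "ALL"
  switch(whichIC,
    MIXMIX = c("MIXMIX", "IDXMIX"),
    MIXCLA = c("MIXCLA", "IDXMIX"),
    CLACLA = c("CLACLA", "IDXCLA"),
    ALL    = c("CLACLA", "IDXCLA", "MIXMIX", "MIXCLA", "IDXMIX")
  )
}

names_default <- ic_output_names()
names_icl     <- ic_output_names("MIXCLA")
```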

kk

An integer vector specifying the numbers of mixture components (clusters) for which the information criteria are to be calculated. By default kk=1:5.

cc

A vector specifying the values of the restriction factor which have to be considered for the variances of the residuals of the regression lines. By default cc=c(1, 2, 4, 8, 16, 32, 64, 128).
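Together, kk and cc define the grid of models that is explored: one fit per pair (k, c). A short base-R sketch of the default grid and of the shape of the resulting information criterion matrices follows; the variable names grid, n_fits and ic_template are illustrative only.

```r
# Default values of kk and cc, copied from the Usage section
kk <- 1:5
cc <- c(1, 2, 4, 8, 16, 32, 64, 128)

# One row per (k, c) combination
grid <- expand.grid(k = kk, c = cc)
n_fits <- nrow(grid)

# Empty matrix with the layout described in the Value section:
# rows index the values of k, columns the values of c
ic_template <- matrix(NA_real_,
                      nrow = length(kk), ncol = length(cc),
                      dimnames = list(as.character(kk), as.character(cc)))
dim(ic_template)
```

With the defaults this gives 40 combinations, which is why the computing time grows quickly when kk or cc are extended.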

ccSigmaX

A number specifying the value of the restriction factor which has to be considered for the covariance matrices of the explanatory variables. The default value is ccSigmaX=12. Note that this option is used only if alphaX=1, that is, if a constrained weighted model (CWM) for x is assumed.

plot

If plot=FALSE (default) or plot=0 no plot is produced. If plot=TRUE a plot with the final allocation is shown (using the spmplot function). If X is 2-dimensional, the lines associated to the groups are shown too.

nsamp

If a scalar, it contains the number of subsamples which will be extracted. If nsamp=0 all subsets will be extracted. Remark: if the number of all possible subsets is smaller than 300 the default is to extract all subsets, otherwise just 300. If nsamp is a matrix, it contains in the rows the indexes of the subsets which have to be extracted; in this case nsamp can be conveniently generated by function subsets(). nsamp must have k * p columns. The first p columns are used to estimate the regression coefficients of group 1, ..., the last p columns are used to estimate the regression coefficients of group k.
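The column layout of a matrix-valued nsamp can be illustrated with base R. The sketch below does not use subsets(); it draws the unit indexes with sample.int() purely to show the k * p layout. The sizes n, p, k and n_subsets are made up, and requiring distinct units within each row is an assumption of this sketch.

```r
set.seed(1)        # only to make the illustration reproducible

n <- 50            # number of observations (made up)
p <- 2             # columns used per group, as in the description
k <- 3             # number of groups
n_subsets <- 4     # number of pre-extracted subsets (rows of nsamp)

# Each row holds k * p unit indexes, drawn without replacement
nsamp_toy <- t(replicate(n_subsets, sample.int(n, k * p)))

# Columns 1..p belong to group 1, columns p+1..2p to group 2, and so on
group_columns <- split(seq_len(k * p), rep(seq_len(k), each = p))

# Units that would initialize the regression of group 2 in the first subset
units_group2_subset1 <- nsamp_toy[1, group_columns[[2]]]
```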

refsteps

Number of refining iterations in each subsample. Default is refsteps=10. refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance of the refining steps. The default value is 1e-13.

equalweights

A logical specifying whether cluster weights in the concentration and assignment steps shall be considered. If equalweights=TRUE we are (ideally) assuming equally sized groups; if equalweights=FALSE (default) we allow for different group weights. Please check in the given references which functions are maximized in both cases.

we

Weights. A vector of size n-by-1 containing application-specific weights. The default is a vector of ones.

msg

Controls whether to display messages on the screen. If msg=TRUE (default) messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.

nocheck

Check input arguments. If nocheck=TRUE no check is performed on matrix X. The default is nocheck=FALSE.

RandNumbForNini

Pre-extracted random numbers to initialize proportions. Matrix of size k-by-nrow(nsamp) containing the random numbers which are used to initialize the proportions of the groups. This option is effective only if nsamp is a matrix which contains pre-extracted subsamples. The purpose of this option is to enable the user to replicate the results when the function tclustreg() is called using a parfor instruction (as happens, for example, in routine IC, where tclustreg() is called through a parfor for different values of the restriction factor). By default RandNumbForNini is empty, in which case uniform random numbers are used.

startv1

How to initialize centroids and covariance matrices. Scalar. If startv1=1 then initial centroids and covariance matrices are based on (p+1) observations randomly chosen, else each centroid is initialized taking a random row of the input data matrix and covariance matrices are initialized with identity matrices. The default value is startv1=1.

Remark 1: in order to start with a routine which is in the required parameter space, eigenvalue restrictions are immediately applied.

Remark 2: option startv1 is used only if nsamp is a scalar (for more details see the help associated with nsamp).

UnitsSameGroup

List of the units which must (whenever possible) have a particular label. For example, UnitsSameGroup=c(20, 26) means that the group which contains unit 20 is always labelled with number 1. Similarly, the group which contains unit 26 is always labelled with number 2 (unless it is found that unit 26 already belongs to group 1). In general, the group which contains unit UnitsSameGroup(r), where r=2, ..., length(kk)-1, is labelled with number r (unless it is found that unit UnitsSameGroup(r) has already been assigned to groups 1, 2, ..., r-1). The default value of UnitsSameGroup is an empty list, that is, consistent labels are not imposed.
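The relabelling rule described above can be mimicked on a plain vector of group labels. The following is a sketch of one way to implement it, not the code used by the package: relabel_groups is a hypothetical helper, the label vectors are made up, and treating label 0 as "trimmed unit" is an assumption of this sketch.

```r
# Hypothetical helper (not part of fsdaR): relabel groups so that the group
# containing units[r] gets label r, unless that unit already sits in a group
# labelled 1, ..., r-1 by an earlier step.
relabel_groups <- function(idx, units) {
  new_idx <- idx
  for (r in seq_along(units)) {
    g <- new_idx[units[r]]          # current label of the unit's group
    if (g == 0 || g < r) next       # trimmed (assumed label 0) or already fixed
    if (g != r) {                   # swap labels g and r
      is_g <- new_idx == g
      is_r <- new_idx == r
      new_idx[is_g] <- r
      new_idx[is_r] <- g
    }
  }
  new_idx
}

# Made-up labels for six units falling in three groups
idx_toy <- c(2, 2, 1, 1, 3, 3)

# Unit 1 must be in group 1 and unit 5 in group 2
relabelled <- relabel_groups(idx_toy, c(1, 5))

# Unit 2 already belongs to the group of unit 1, so only the first swap applies
relabelled_same <- relabel_groups(idx_toy, c(1, 2))
```

Swapping labels, rather than overwriting them, keeps the partition itself unchanged; only the numbering of the groups is affected.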

commonslope

Whether to impose a constraint of common slope on the regression coefficients. If commonslope=TRUE, the groups are forced to have the same regression coefficients (apart from the intercepts). The default value of commonslope is commonslope=FALSE.

Ysave

Whether to save in the output the input response variable y and the matrix of predictors x.

trace

Whether to print intermediate results. Default is trace=FALSE.

...

Potential further arguments passed to lower level functions.

Value

An S3 object of class tclustregic which is basically a list with the following components:

  • call the matched call

  • CLACLA A matrix of size 5-times-8 if kk and cc are not specified, else a matrix of size length(kk)-times-length(cc), containing the value of the penalized classification likelihood. This output is present only if whichIC="CLACLA" or whichIC="ALL".

  • IDXCLA An array of size 5-times-8 if kk and cc are not specified, else an array of size length(kk)-times-length(cc). Each element of the array is a list with one element, which is a vector of length n containing the assignment of each unit using the classification model. This output is present only if whichIC="CLACLA" or whichIC="ALL".

  • MIXMIX A matrix of size 5-times-8 if kk and cc are not specified, else a matrix of size length(kk)-times-length(cc), containing the value of the penalized mixture likelihood. This output is present only if whichIC="MIXMIX" or whichIC="ALL".

  • MIXCLA A matrix of size 5-times-8 if kk and cc are not specified, else a matrix of size length(kk)-times-length(cc), containing the value of the ICL. This output is present only if whichIC="MIXCLA" or whichIC="ALL".

  • IDXMIX An array of size 5-times-8 if kk and cc are not specified, else an array of size length(kk)-times-length(cc). Each element of the array is a list with one element, which is a vector of length n containing the assignment of each unit using the mixture model. This output is present only if whichIC="MIXMIX", whichIC="MIXCLA" or whichIC="ALL".

  • kk a vector containing the values of k (number of components) which have been considered. This vector is identical to the argument kk (default is kk=1:5).

  • cc a vector containing the values of c (values of the restriction factor) which have been considered for the variance of the residuals. This vector is identical to the argument cc (default is cc=c(1, 2, 4, 8, 16, 32, 64, 128)).

  • ccSigmaX values of the restriction factor which have been considered for the covariance matrices of the explanatory variables. This vector is identical to the argument ccSigmaX.

  • alpha the trimming level which has been used in the likelihood (it stores the value of input alphaLik).

  • alphaX second-level trimming or constrained weighted model for X.

  • X original data matrix of explanatory variables. Present if Ysave=TRUE.

  • y original vector containing the response. Present if Ysave=TRUE.
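Once one of the matrices CLACLA, MIXMIX or MIXCLA is available, the pair (k, c) with the best value can be located with which(). The sketch below works on a made-up matrix with the default 5-times-8 layout, so the numbers carry no meaning; it also assumes that smaller values of the criterion are preferable, which should be checked against the references before relying on it.

```r
kk <- 1:5
cc <- c(1, 2, 4, 8, 16, 32, 64, 128)

# Made-up criterion values, constructed so that the minimum is at k = 2, c = 8.
# In practice this matrix would be, for example, the MIXMIX component.
ic_toy <- outer(kk, cc, function(k, cval) (k - 2)^2 + abs(log2(cval) - 3))
dimnames(ic_toy) <- list(as.character(kk), as.character(cc))

# Row and column position of the smallest entry (assumption: smaller is better)
pos <- which(ic_toy == min(ic_toy), arr.ind = TRUE)

best_k <- kk[pos[1, 1]]   # rows index the values of k
best_c <- cc[pos[1, 2]]   # columns index the values of c
```

If several cells share the minimum, pos has more than one row and only the first is used here.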

Author(s)

FSDA team, valentin.todorov@chello.at

References

Torti F., Perrotta D., Riani M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data, Advances in Data Analysis and Classification, Vol. 13, pp. 227-257.

Examples

 ## Not run: 
 ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013).
 ## The dataset presents two parallel components without contamination.

 data(X)
 y1 = X[, ncol(X)]
 X1 = X[,-ncol(X), drop=FALSE]

 (out <- tclustregIC(y1, X1, plot=TRUE))

 tclustICplot(out, whichIC="MIXMIX")

 
## End(Not run)

fsdaR documentation built on May 20, 2026, 1:07 a.m.