tclustreg: Computes robust linear grouping analysis
In fsdaR: Robust Data Analysis Through Monitoring and Dynamic Visualization

tclustreg

R Documentation

Computes robust linear grouping analysis

Description

Performs robust linear grouping analysis.

Usage

tclustreg(
  y,
  x,
  k,
  alphaLik,
  alphaX,
  restrfactor = 12,
  intercept = TRUE,
  plot = FALSE,
  nsamp,
  refsteps = 10,
  reftol = 1e-13,
  equalweights = FALSE,
  mixt = 0,
  wtrim = 0,
  we,
  msg = TRUE,
  RandNumbForNini,
  trace = FALSE,
  ...
)

Arguments

`y`	Response variable. A vector with `n` elements that contains the response variable.
`x`	An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
`k`	Number of groups.
`alphaLik`	Trimming level, a scalar between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If `alphaLik=0`, there is no trimming. More in detail, if `0 < alphaLik < 1` clustering is based on `h = floor(n * (1 - alphaLik))` observations. If `alphaLik` is an integer greater than 1 clustering is based on `h = n - floor(alphaLik)`. More in detail, likelihood contributions are sorted and the units associated with the smallest `n - h` contributions are trimmed.
`alphaX`	Second-level trimming or constrained weighted model for `x`.
`restrfactor`	Restriction factor for regression residuals and covariance matrices of the explanatory variables. Scalar or vector with two elements. If `restrfactor` is a scalar it controls the differences among group scatters of the residuals. The value 1 is the strongest restriction. If `restrfactor` is a vector with two elements the first element controls the differences among group scatters of the residuals and the second the differences among covariance matrices of the explanatory variables. Note that `restrfactor[2]` is used just if `alphaX=1`, that is if constrained weighted model for `x` is assumed.
`intercept`	wheather to use constant term (default is `intercept=TRUE`
`plot`	If `plot=FALSE` (default) or `plot=0` no plot is produced. If `plot=TRUE` a plot with the final allocation is shown (using the spmplot function). If `X` is 2-dimensional, the lines associated to the groups are shown too.
`nsamp`	If a scalar, it contains the number of subsamples which will be extracted. If `nsamp = 0` all subsets will be extracted. Remark - if the number of all possible subset is greater than 300 the default is to extract all subsets, otherwise just 300. If `nsamp` is a matrix it contains in the rows the indexes of the subsets which have to be extracted. `nsamp` in this case can be conveniently generated by function `subsets()`. `nsamp` must have `k * p` columns. The first `p` columns are used to estimate the regression coefficient of group 1, ..., the last `p` columns are used to estimate the regression coefficient of group `k`.
`refsteps`	Number of refining iterations in each subsample. Default is `refsteps=10`. `refsteps = 0` means "raw-subsampling" without iterations.
`reftol`	Tolerance of the refining steps. The default value is 1e-14
`equalweights`	A logical specifying wheather cluster weights in the concentration and assignment steps shall be considered. If `equalweights=TRUE` we are (ideally) assuming equally sized groups, else if `equalweights = false` (default) we allow for different group weights. Please, check in the given references which functions are maximized in both cases.
`mixt`	Specifies whether mixture modelling or crisp assignment approach to model based clustering must be used. In the case of mixture modelling parameter mixt also controls which is the criterion to find the untrimmed units in each step of the maximization. If `mixt>=1` mixture modelling is assumed else crisp assignment. The default value is `mixt=0`, i.e. crisp assignment. Please see for details the provided references. The parameter `mixt` also controls the criterion to select the units to trim. If `mixt = 2` the `h` units are those which give the largest contribution to the likelihood, else if `mixt=1` the criterion to select the `h` units is exactly the same as the one which is used in crisp assignment.
`wtrim`	How to apply the weights on the observations - a flag taking values in c(0, 1, 2, 3, 4). If `wtrim==0` (no weights), the algorithm reduces to the standard `tclustreg` algorithm. If `wtrim==1`, trimming is done by weighting the observations using values specified in vector `we`. In this case, vector `we` must be supplied by the user. If `wtrim==2`, trimming is again done by weighting the observations using values specified in vector `we`. In this case, vector `we` is computed from the data as a function of the density estimate pdfe. Specifically, the weight of each observation is the probability of retaining the observation, computed as `pretain_{ig} = 1-pdfe_{ig}/max_{ig}(pdfe_{ig})` If `wtrim==3`, trimming is again done by weighting the observations using values specified in vector `we`. In this case, each element wei of vector `we` is a Bernoulli random variable with probability of success `pdfe_{ig}`. In the clustering framework this is done under the constraint that no group is empty. If `wtrim==4`, trimming is done with the tandem approach of Cerioli and Perrotta (2014).
`we`	Weights. A vector of size n-by-1 containing application-specific weights Default is a vector of ones.
`msg`	Controls whether to display or not messages on the screen If `msg==TRUE` (default) messages are displayed on the screen. If `msg=2`, detailed messages are displayed, for example the information at iteration level.
`RandNumbForNini`	pre-extracted random numbers to initialize proportions. Matrix of size k-by-nrow(nsamp) containing the random numbers which are used to initialize the proportions of the groups. This option is effective only if `nsamp` is a matrix which contains pre-extracted subsamples. The purpose of this option is to enable the user to replicate the results when the function `tclustreg()` is called using a parfor instruction (as it happens for example in routine IC, where `tclustreg()` is called through a parfor for different values of the restriction factor). The default is that `RandNumbForNini` is empty - then uniform random numbers are used.
`trace`	Whether to print intermediate results. Default is `trace=FALSE`.
`...`	potential further arguments passed to lower level functions.

Value

An S3 object of class tclustreg.object

Author(s)

FSDA team, valentin.todorov@chello.at

References

Mayo-Iscar A. (2016). The joint role of trimming and constraints in robust estimation for mixtures of gaussian factor analyzers, Computational Statistics and Data Analysis", Vol. 99, pp. 131-147.

Garcia-Escudero, L.A., Gordaliza, A., Greselin, F., Ingrassia, S. and Mayo-Iscar, A. (2017), Robust estimation of mixtures of regressions with random covariates, via trimming and constraints, Statistics and Computing, Vol. 27, pp. 377-402.

Garcia-Escudero, L.A., Gordaliza A., Mayo-Iscar A., and San Martin R. (2010). Robust clusterwise linear regression through trimming, Computational Statistics and Data Analysis, Vol. 54, pp.3057-3069.

Cerioli, A. and Perrotta, D. (2014). Robust Clustering Around Regression Lines with High Density Regions. Advances in Data Analysis and Classification, Vol. 8, pp. 5-26.

Torti F., Perrotta D., Riani, M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data, Advances in Data Analysis and Classification, Vol. 13, pp 227-257.

Examples

 ## Not run: 
 ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013).
 ## The dataset presents two parallel components without contamination.

 data(X)
 y1 = X[, ncol(X)]
 X1 = X[,-ncol(X), drop=FALSE]

 (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=5, plot=TRUE, trace=TRUE))

 (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=2,
         mixt=2, plot=TRUE, trace=TRUE))

 ##  Examples with fishery data

 data(fishery)
 X <- fishery

 ## some jittering is necessary because duplicated units are not treated:
 ## this needs to be addressed
 X <- X + 10^(-8) * abs(matrix(rnorm(nrow(X)*ncol(X)), ncol=2))

 y1 <- X[, ncol(X)]
 X1 <- X[, -ncol(X), drop=FALSE]

 (out <- tclustreg(y1, X1, k=3, restrfact=50, alphaLik=0.04, alphaX=0.01, trace=TRUE))
 ## Example 2:

 ## Define some arbitrary weightssome arbitrary weights for the units
     we <- sqrt(X1)/sum(sqrt(X1))

 ##  tclustreg required parameters
     k <- 2; restrfact <- 50; alpha1 <- 0.04; alpha2 <- 0.01

 ##  Now tclust is run on each combination of mixt and wtrim options

     cat("\nmixt=0; wtrim=0",
         "\nStandard tclustreg, with classification likelihood and without thinning\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, wtrim=0, trace=TRUE))

     cat("\nmixt=2; wtrim=0",
         "\nMixture likelihood, no thinning\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, wtrim=0, trace=TRUE))

     cat("\nmixt=0; wtrim=1",
         "\nClassification likelihood, thinning based on user weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, we=we, wtrim=1, trace=TRUE))

     cat("\nmixt=2; wtrim=1",
         "\nMixture likelihood, thinning based on user weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, we=we, wtrim=1, trace=TRUE))

     cat("\nmixt=0; wtrim=2",
         "\nClassification likelihood, thinning based on retention probabilities\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, wtrim=2, trace=TRUE))

     cat("\nmixt=2; wtrim=2",
         "\nMixture likelihood, thinning based on retention probabilities\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, wtrim=2, trace=TRUE))

     cat("\nmixt=0; wtrim=3",
         "\nClassification likelihood, thinning based on bernoulli weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, wtrim=3, trace=TRUE))

     cat("\nmixt=2; wtrim=3",
         "\nMixture likelihood, thinning based on bernoulli weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, wtrim=3, trace=TRUE))

     cat("\nmixt=0; wtrim=4",
         "\nClassification likelihood, tandem thinning based on bernoulli weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, wtrim=4, trace=TRUE))

     cat("\nmixt=2; wtrim=4",
         "\nMixture likelihood, tandem thinning based on bernoulli weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, wtrim=4, trace=TRUE))

 
## End(Not run)

fsdaR documentation built on May 29, 2024, 5:35 a.m.