Updated Classification Method using Labeled and Unlabeled Data

Description

This function implements the EM algorithm by iterating over the E-step and M-step. Initial parameter values are obtained from the labeled data; both steps are then iterated over the complete data, that is, the labeled and unlabeled data combined.

Usage

upclassifymodel(Xtrain, cltrain, Xtest, cltest = NULL,
modelName = "EEE", tol = 10^-5, iterlim = 1000, 
Aitken = TRUE, ...)

Arguments

Xtrain

A numeric matrix of observations where rows correspond to observations and columns correspond to variables. The group membership of each observation is known - labeled data.

cltrain

A numeric vector with distinct entries representing a classification of the corresponding observations in Xtrain.

Xtest

A numeric matrix of observations where rows correspond to observations and columns correspond to variables. The group membership of each observation may not be known - unlabeled data.

cltest

A numeric vector with distinct entries representing a classification of the corresponding observations in Xtest. By default, these are not supplied and the function sets out to obtain them.

modelName

A character string indicating the model, with default "EEE". The models available for selection are described in modelvec.

tol

A positive number, with default 10^-5, giving the convergence tolerance. Smaller values define convergence more strictly.

iterlim

A positive integer, with default 1000, which is the desired limit on the maximum number of iterations.

Aitken

A logical value, with default TRUE, indicating whether convergence is tested using Aitken acceleration. If set to FALSE, convergence is tested by comparing tol to the change in log-likelihood between two consecutive iterations. For further information on Aitken acceleration, see Aitken.

...

Arguments passed to or from other methods.
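
The two convergence tests selected by the Aitken argument can be sketched as follows. This is a minimal illustration of the standard Aitken-accelerated estimate of the limiting log-likelihood, not the package's own Aitken function; the helper names aitken_estimate and has_converged are hypothetical, and the exact stopping rule used by the package may differ.

```r
# Hypothetical helper: Aitken-accelerated estimate of the limiting
# log-likelihood from the three most recent values
# ll = c(l[k-1], l[k], l[k+1]).
aitken_estimate <- function(ll) {
  stopifnot(length(ll) == 3)
  step_old <- ll[2] - ll[1]
  step_new <- ll[3] - ll[2]
  # Ratio of successive increases; set to 0 if the previous step stalled.
  a <- if (step_old > 0) step_new / step_old else 0
  # Extrapolated limit; undefined (taken as Inf) when a >= 1.
  linf <- if (a < 1) ll[2] + step_new / (1 - a) else Inf
  list(a = a, linf = linf)
}

# Hypothetical helper: TRUE when the chosen criterion falls below tol.
has_converged <- function(ll, tol = 1e-5, aitken = TRUE) {
  if (aitken) {
    # Assumed rule: compare the extrapolated limit with the current value.
    abs(aitken_estimate(ll)$linf - ll[3]) < tol
  } else {
    # Plain rule: change in log-likelihood between consecutive iterations.
    abs(ll[3] - ll[2]) < tol
  }
}

res <- aitken_estimate(c(-110, -105, -102.5))
```

For a log-likelihood sequence that approaches its limit geometrically, the extrapolated value linf recovers that limit from three iterates, which is why the Aitken test can be stricter than the plain difference test when progress is slow.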

Details

This is an updated approach to typical classification methods. Initially, the M-step is performed on the labeled (training) data to obtain parameter estimates for the model. These are used in an E-step to obtain group membership probabilities for the unlabeled (test) data. The training data labels and the new probability estimates for the test data labels are combined to form the complete data. From here, the M-step and E-step are iterated over the complete data, updating the estimates at each pass until convergence is reached. This has been shown to result in lower misclassification rates, particularly in cases where only a small proportion of the total data is labeled (Dean, Murphy and Downey, 2006).
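
The procedure described above can be sketched in base R for the "EEE" model, in which all groups share one covariance matrix. This is an illustrative sketch under stated assumptions, not the code used by upclassifymodel: the helper names log_joint, m_step and semi_em are hypothetical, convergence is tested on the plain change in log-likelihood rather than with Aitken acceleration, and the data are simulated.

```r
# Hypothetical helper: N x G matrix of log(pro_k) + log N(x_i | mean_k, Sigma).
log_joint <- function(X, par) {
  d <- ncol(X)
  logdet <- as.numeric(determinant(par$Sigma, logarithm = TRUE)$modulus)
  lj <- sapply(seq_along(par$pro), function(k) {
    log(par$pro[k]) -
      0.5 * (d * log(2 * pi) + logdet +
             mahalanobis(X, par$mean[, k], par$Sigma))
  })
  matrix(lj, nrow = nrow(X))
}

# Hypothetical helper: M-step for a common-covariance Gaussian mixture.
m_step <- function(X, Z) {
  nk <- colSums(Z)
  mean <- crossprod(X, Z) / rep(nk, each = ncol(X))  # kth column = kth mean
  Sigma <- matrix(0, ncol(X), ncol(X))
  for (k in seq_along(nk)) {
    R <- sweep(X, 2, mean[, k])                # centre on the kth mean
    Sigma <- Sigma + crossprod(R * sqrt(Z[, k]))
  }
  list(pro = nk / nrow(X), mean = mean, Sigma = Sigma / nrow(X))
}

# Hypothetical driver: EM over labeled and unlabeled data combined.
semi_em <- function(Xtrain, cltrain, Xtest, tol = 1e-5, iterlim = 1000) {
  groups <- sort(unique(cltrain))
  # Known labels as a 0/1 indicator matrix; these rows are never updated.
  Ztrain <- diag(length(groups))[match(cltrain, groups), , drop = FALSE]
  X <- rbind(Xtrain, Xtest)
  par <- m_step(Xtrain, Ztrain)                # initial M-step: labeled data only
  ll_old <- -Inf
  converged <- FALSE
  for (iter in seq_len(iterlim)) {
    # E-step on the unlabeled rows (log-sum-exp for numerical stability).
    lj <- log_joint(Xtest, par)
    m <- apply(lj, 1, max)
    lse <- m + log(rowSums(exp(lj - m)))
    Ztest <- exp(lj - lse)
    # Observed log-likelihood: labeled rows use their known group.
    ll <- sum(Ztrain * log_joint(Xtrain, par)) + sum(lse)
    converged <- abs(ll - ll_old) < tol
    if (converged) break
    ll_old <- ll
    par <- m_step(X, rbind(Ztrain, Ztest))     # M-step on the complete data
  }
  list(parameters = par, z = Ztest,
       cl = groups[max.col(Ztest, ties.method = "first")],
       iter = iter, converged = converged, ll = ll)
}

# Simulated two-group data (an assumption made purely for illustration).
set.seed(1)
Xtrain <- rbind(matrix(rnorm(40), ncol = 2),
                matrix(rnorm(40, mean = 6), ncol = 2))
cltrain <- rep(1:2, each = 20)
Xtest <- rbind(matrix(rnorm(100), ncol = 2),
               matrix(rnorm(100, mean = 6), ncol = 2))
fit <- semi_em(Xtrain, cltrain, Xtest)
```

In this sketch the rows of the indicator matrix for the labeled data stay fixed throughout, while the membership probabilities for the unlabeled rows are re-estimated at every pass, which is the sense in which the unlabeled data update the classification rule.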

Value

The return value is a list with the following components:

call

The function call from upclassifymodel.

Ntrain

The number of observations in the training data.

Ntest

The number of observations in the test data.

d

The dimension of the data.

G

The number of groups in the data.

iter

The number of iterations required to reach convergence. If convergence was not obtained, this is equal to iterlim.

converged

A logical value where TRUE indicates convergence was reached and FALSE means iter reached iterlim without obtaining convergence.

modelName

A character string identifying the model (same as the input argument).

parameters

A list with the following components:

pro

A vector whose kth component is the mixing proportion for the kth component of the mixture model. If the model includes a Poisson term for noise, there should be one more mixing proportion than the number of Gaussian components.

mean

The mean for each component. If there is more than one component, this is a matrix whose kth column is the mean of the kth component of the mixture model.

variance

A list of variance parameters for the model. The components of this list depend on the model specification.

train/test

Lists of results for the training and test data, with the following components:

z

A matrix whose [i,k]th entry is the conditional probability of the ith observation belonging to the kth component of the mixture.

cl

A numeric vector with distinct entries representing a classification of the corresponding observations in Xtrain/Xtest.

rate

The number of misclassified observations.

Brierscore

The Brier score measuring the accuracy of the probabilities (zs) obtained.

tab

A table of actual and predicted group classifications.

ll

The log-likelihood for the data in the mixture model.

bic

The Bayesian Information Criterion for the model.
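
As a rough guide to how the classification summaries and the BIC relate to the quantities above, the sketch below computes them from a matrix of membership probabilities and a vector of known labels. The helper names classification_summary and bic_eee are hypothetical. Two conventions are assumed rather than taken from the package: the Brier score is the mean over observations of the squared differences between z and the 0/1 label indicators, and the BIC follows the convention 2 * ll - npar * log(n), so that larger values are better. The package may scale or sign either quantity differently.

```r
# Hypothetical helper: summarise soft classifications z (N x G) against
# known labels. Columns of z are assumed to follow the sorted label order.
classification_summary <- function(z, truth) {
  groups <- sort(unique(truth))
  stopifnot(ncol(z) == length(groups), nrow(z) == length(truth))
  pred <- groups[max.col(z, ties.method = "first")]   # most probable group
  onehot <- diag(length(groups))[match(truth, groups), , drop = FALSE]
  list(
    cl = pred,
    rate = sum(pred != truth),                        # number misclassified
    Brierscore = mean(rowSums((z - onehot)^2)),       # assumed scaling
    tab = table(actual = truth, predicted = pred)
  )
}

# Hypothetical helper: BIC for a G-group, d-dimensional "EEE" model.
# Free parameters: G - 1 proportions, G * d means, d(d + 1)/2 covariance terms.
bic_eee <- function(ll, n, d, G) {
  npar <- (G - 1) + G * d + d * (d + 1) / 2
  2 * ll - npar * log(n)                              # assumed sign convention
}

z <- rbind(c(0.9, 0.1), c(0.2, 0.8), c(0.6, 0.4))
truth <- c(1, 2, 2)
out <- classification_summary(z, truth)
```

Here the third observation is assigned to group 1 although it belongs to group 2, so it contributes one misclassification and most of the Brier score.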

Author(s)

Niamh Russell

References

Fraley, C. and Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97:611-631.

Fraley, C. and Raftery, A.E. (2006). MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering, Technical Report no. 504, Department of Statistics, University of Washington.

Dean, N., Murphy, T.B. and Downey, G. (2006). Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society: Series C 55 (1), 1-14.

See Also

upclassify, Aitken, modelvec

Examples

# This function is not designed to be used on its own,
# but to be called by upclassify
data(wine, package = "gclus")
X <- as.matrix(wine[, -1])
cl <- unclass(wine[, 1])
indtrain <- sort(sample(1:178, 120))
indtest <- setdiff(1:178, indtrain)

fitup <- upclassifymodel(X[indtrain,], cl[indtrain], X[indtest,], cl[indtest])