Obtaining the Best Model for Data Classification Using an Updated Classification Method

Description

This function performs upclassifymodel over a range of different models and finds the model that best fits the data by comparing the BIC values.
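The selection step can be pictured with a minimal base R sketch. Everything in it is illustrative: the log-likelihoods and parameter counts are made-up numbers rather than output from upclassify, and it assumes the "larger is better" form of BIC used by mclust, which may not match the package's internal computation exactly.

```r
# Hypothetical fits for four candidate models (illustrative numbers only,
# not produced by upclassify).
fits <- data.frame(
  modelName = c("EII", "VII", "VEI", "EVI"),
  ll        = c(-310.2, -295.7, -250.4, -248.9),  # log-likelihoods
  nparams   = c(15, 17, 20, 24),                  # illustrative parameter counts
  stringsAsFactors = FALSE
)
n <- 150  # number of observations used in the fit

# Assumed convention: BIC = 2 * logLik - nparams * log(n), larger is better.
fits$bic <- 2 * fits$ll - fits$nparams * log(n)

# Sort from best to worst and take the first model.
fits <- fits[order(fits$bic, decreasing = TRUE), ]
best <- fits$modelName[1]
print(fits)
print(best)
```

A model with a higher log-likelihood can still lose if it needs many more parameters, which is what happens to the last model in this made-up table.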

Usage

upclassify(Xtrain, cltrain, Xtest, cltest = NULL, 
modelscope = NULL, tol = 10^-5, iterlim = 1000, 
Aitken = TRUE, ...)

Arguments

Xtrain

A numeric matrix of observations where rows correspond to observations and columns correspond to variables. The group membership of each observation is known - labeled data.

cltrain

A numeric vector with distinct entries representing a classification of the corresponding observations in Xtrain.

Xtest

A numeric matrix of observations where rows correspond to observations and columns correspond to variables. The group membership of each observation may not be known - unlabeled data.

cltest

A numeric vector with distinct entries representing a classification of the corresponding observations in Xtest. By default, these are not supplied and the function sets out to obtain these.

modelscope

A character vector naming the models to be tested. With the default NULL, all available models are tested. The models available for univariate and multivariate data are described in modelvec.

tol

A non-negative number, with default 10^-5, giving the convergence tolerance. Smaller values impose a stricter convergence criterion.

iterlim

A non-negative integer, with default 1000, giving the maximum number of iterations.

Aitken

A logical value, with default TRUE, indicating whether convergence is tested using Aitken acceleration. If set to FALSE, convergence is tested by comparing tol to the change in log-likelihood between two consecutive iterations. For further information on Aitken acceleration, see Aitken.

...

Arguments passed to or from other methods.
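The two convergence tests described for the Aitken argument can be sketched in base R as follows. This is one common formulation of an Aitken-based stopping rule, written for illustration; the helper names aitken_limit, aitken_converged and simple_converged are hypothetical and are not functions exported by the package, whose own Aitken function may differ in detail.

```r
# Estimate the limiting log-likelihood from the last three values
# of a log-likelihood sequence using Aitken acceleration.
aitken_limit <- function(ll) {
  k <- length(ll)
  if (k < 3) return(NA_real_)          # three consecutive values are needed
  d_new <- ll[k] - ll[k - 1]           # most recent change
  d_old <- ll[k - 1] - ll[k - 2]       # previous change
  if (d_new == 0) return(ll[k])        # sequence has stopped moving
  a <- d_new / d_old                   # estimated linear convergence rate
  ll[k - 1] + d_new / (1 - a)          # extrapolated limit
}

# Aitken-based test: stop when the extrapolated limit is within tol
# of the current log-likelihood.
aitken_converged <- function(ll, tol = 1e-5) {
  if (length(ll) < 3) return(FALSE)
  abs(aitken_limit(ll) - ll[length(ll)]) < tol
}

# Simple test: stop when two consecutive log-likelihoods differ by less than tol.
simple_converged <- function(ll, tol = 1e-5) {
  k <- length(ll)
  k >= 2 && abs(ll[k] - ll[k - 1]) < tol
}

# Illustrative sequence that converges geometrically towards -100.
ll_short <- -100 - 50 * 0.5^(0:9)
ll_long  <- -100 - 50 * 0.5^(0:30)

print(aitken_limit(ll_short))      # extrapolated limit after 10 iterations
print(aitken_converged(ll_short))
print(aitken_converged(ll_long))
print(simple_converged(ll_long))
```

For a sequence that converges linearly, the extrapolated limit can be accurate long before consecutive values stop changing, which is why the Aitken test compares against the estimated limit rather than against the previous iteration.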

Value

An object of class "upclassfit" providing a list of output components for each model in modelscope, with the Best model (according to BIC) first.

The output components for each model are as follows:

call

How to call the function and the order of its arguments.

Ntrain

The number of observations in the training set.

Ntest

The number of observations in the test set.

d

The dimension of the data.

G

The number of groups in the training set.

iter

The number of iterations taken.

converged

Whether or not the algorithm has converged. If converged is FALSE, then iter will equal the maximum number of iterations.

modelName

The model considered in this run of the algorithm.

parameters

A list of the final model parameters estimated by the algorithm.

pro

The proportion of the data to be found in each group.

mean

Mean vectors for each group.

variance

The variances and covariances produced by Mclust.

train

A list of information about the training data. This will not have changed from before the run.

z

A matrix containing the estimated probabilities that each observation in the training data belongs to each group.

cl

A vector containing the labels of the training data.

misclass

The number of misclassifications of the training data.

rate

The misclassification rate expressed as a percentage.

Brier

The Brier score expressed as a percentage.

tab

The misclassification table for the training data.

test

A list of information about the test data.

z

A matrix containing the estimated probabilities that each observation in the test data belongs to each group.

cl

A vector containing the new labels of the test data.

misclass

The number of misclassifications of the test data, provided the correct labels have been supplied.

rate

The misclassification rate expressed as a percentage, provided the correct labels have been supplied.

Brier

The Brier score expressed as a percentage.

tab

The misclassification table for the test data, provided the correct labels have been supplied.

ll

The log-likelihood of the data.

bic

The Bayesian information criterion (BIC) for the specified model.
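To make the z, cl, misclass, rate, tab and Brier components more concrete, the following base R sketch derives analogous quantities from a small made-up probability matrix. The data are invented for illustration, and the Brier-type score shown (mean squared distance between z and the 0/1 group indicators) is an assumed definition; the package's own scaling, for example its conversion to a percentage, may differ.

```r
# Made-up membership probabilities for five observations and three groups
# (illustrative only, not output from upclassify).
z <- matrix(c(0.90, 0.05, 0.05,
              0.10, 0.80, 0.10,
              0.20, 0.30, 0.50,
              0.60, 0.30, 0.10,
              0.05, 0.15, 0.80),
            ncol = 3, byrow = TRUE)
true_cl <- c(1, 2, 3, 2, 3)            # made-up correct labels

# Assign each observation to its most probable group.
pred_cl <- apply(z, 1, which.max)

# Cross-tabulate true labels against assigned labels.
tab <- table(true_cl, pred_cl)

# Count the misclassified observations and express them as a percentage.
misclass <- sum(pred_cl != true_cl)
rate <- 100 * misclass / length(true_cl)

# Assumed Brier-type score: average squared difference between the
# probabilities and the 0/1 indicator of the true group.
indicator <- diag(ncol(z))[true_cl, ]
brier <- sum((z - indicator)^2) / nrow(z)

print(tab)
print(c(misclass = misclass, rate = rate, brier = brier))
```

The misclassification count only looks at the most probable group, whereas a Brier-type score also penalises confident but wrong probabilities, so the two can rank models differently.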

Author(s)

Niamh Russell

References

Fraley, C. and Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97, 611-631.

Fraley, C. and Raftery, A.E. (2006). MCLUST Version for R: Normal Mixture Modeling and Model-Based Clustering. Technical Report no. 504, Department of Statistics, University of Washington.

Dean, N., Murphy, T.B. and Downey, G. (2006). Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society: Series C 55 (1), 1-14.

See Also

upclassifymodel, modelvec, Aitken

Examples

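library(upclass)

# Use the four numeric iris measurements as data and the species as labels
data(iris)
X <- as.matrix(iris[, -5])
cl <- unclass(iris[, 5])

# Randomly select 110 observations as the labeled training set
indtrain <- sort(sample(1:150, 110))
Xtrain <- X[indtrain, ]
cltrain <- cl[indtrain]

# The remaining 40 observations form the test set
indtest <- setdiff(1:150, indtrain)
Xtest <- X[indtest, ]
cltest <- cl[indtest]

# Restrict the search to four candidate models
modelscope <- c("EII", "VII", "VEI", "EVI")

# Fit every model in modelscope; the best one by BIC is stored first
fitupmodels <- upclassify(Xtrain, cltrain, Xtest, cltest, modelscope)
fitupmodels$Best$modelName    # What is the best model?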