This function can be used to learn a classification model from semi-supervised data. This type of data includes observations for which the class label is known as well as observations for which the class is unknown. The function implements a strategy known as self-training to cope with this semi-supervised learning problem. It can be applied to any classification algorithm that is able to produce class probabilities when asked to classify a set of test cases (see the Details section).
SelfTrain(form, data, learner, predFunc, thrConf, maxIts, percFull, verbose)
form
A formula describing the prediction problem.

data
A data frame containing the available training set, which is supposed to contain some rows for which the value of the target variable is unknown (i.e. equal to NA).

learner
An object of class learner (see class?learner for details) describing the base classification algorithm to be used on each iteration of the self-training process.

predFunc
A string with the name of a function that will carry out the probabilistic classification tasks required during the self-training process (see the Details section).

thrConf
A number between 0 and 1 indicating the classification confidence required for an unlabelled case to be added to the labelled data set with the label predicted by the classification algorithm.

maxIts
The maximum number of iterations of the self-training process.

percFull
A number between 0 and 1. If the percentage of labelled cases reaches this value, the self-training process is stopped.

verbose
A boolean indicating the verbosity level of the function.
Self-training (e.g. Yarowsky, 1995) is a well-known strategy for handling classification problems where a subset of the available training data has an unknown class label. The general idea is an iterative process that, at each step, tries to augment the set of labelled cases by "asking" the current classification model to label the unlabelled cases and adding to the labelled set the cases for which the model is most confident in the assigned label. With this extended set of labelled cases a new classification model is learned, and the process is repeated until certain termination criteria are met.
This implementation of the self-training algorithm is generic in the sense that it can be used with any baseline classification learner, provided the resulting model is able to produce confidence scores for its predictions. The user needs to take care of the learning of the models and of the classification of the unlabelled cases. This is done as follows. The user supplies a learner object (see class?learner for details) in the parameter learner to represent the algorithm used to obtain the classification models on each iteration of the self-training process. Furthermore, the user should create a function, whose name should be given in the parameter predFunc, that takes care of the classification of the currently unlabelled cases on each iteration. This function should be written so that it receives the learned classification model (obtained with the current training set) as its first argument and a data frame with test cases as its second argument. The function should return a data frame with two columns and as many rows as there are rows in the given test set. The first column should contain the class label assigned by the provided classification model to the respective test case, and the second column should contain the confidence (a number between 0 and 1) associated with that classification. See the Examples section for an illustration of such a user-defined function.
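The same contract applies regardless of the base learner. As an illustration, below is a minimal sketch of such a function assuming e1071's naiveBayes is used as the base classifier; the function name nbPredFunc is a hypothetical choice for this example.

library(e1071)

## Hypothetical predFunc for a naiveBayes model: for each test case it
## returns the predicted class label and the associated posterior probability
nbPredFunc <- function(model, testCases) {
  post <- predict(model, testCases, type = "raw")   # matrix of class posteriors
  data.frame(cl = colnames(post)[max.col(post)],    # most probable class label
             p  = apply(post, 1, max))              # its probability (0..1)
}

Such a function could then be passed by name to SelfTrain, together with a learner object wrapping naiveBayes, in the same way the rpartXse-based function is used in the Examples section.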
This function implements the iterative process of self-training. On each iteration the provided learner is called with the set of labelled cases within the given data set. Unlabelled cases should have the value NA in the column of the target variable. The obtained classification model is then passed to the user-defined predFunc function together with the subset of the data that is unlabelled. As mentioned above, this function returns a set of predicted class labels and the respective confidences. Test cases with confidence above the user-specified threshold (parameter thrConf) are added to the labelled training set, with the label assigned by the current model. With this new training set a new classification model is obtained and the overall process is repeated.
The self-training process stops when there are no classifications that reach the required confidence level, when the maximum number of iterations is reached, or when the size of the current labelled training set has already reached the target percentage of the given data set.
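To make this process concrete, the following is a schematic sketch of such a self-training loop written directly in R. It is illustrative only and is not the package's actual implementation; the name selfTrainSketch is a hypothetical choice, baseLearner is here assumed to be a plain function taking a formula and a data frame (a simplification of the learner objects used by SelfTrain), and predFunc follows the two-column contract described above.

## Schematic sketch of the self-training loop described above (not the package source)
selfTrainSketch <- function(form, data, baseLearner, predFunc,
                            thrConf, maxIts, percFull) {
  tgt <- all.vars(form)[1]                          # name of the target variable
  for (it in seq_len(maxIts)) {
    labelled <- !is.na(data[[tgt]])
    if (mean(labelled) >= percFull) break           # enough cases are already labelled
    model <- baseLearner(form, data[labelled, ])    # learn on the labelled cases
    preds <- predFunc(model, data[!labelled, ])     # classify the unlabelled ones
    sure  <- preds[[2]] >= thrConf                  # keep only confident predictions
    if (!any(sure)) break                           # nothing confident enough: stop
    data[[tgt]][which(!labelled)[sure]] <- as.character(preds[[1]][sure])
  }
  baseLearner(form, data[!is.na(data[[tgt]]), ])    # final model on the extended set
}

The key design point is that only predictions meeting the thrConf threshold are adopted on each iteration, so a weak early model does not flood the labelled set with unreliable labels.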
This function returns a classification model. This will be an object of the same class as the object returned by the base classification learner provided by the user.
Luis Torgo ltorgo@dcc.fc.up.pt
Torgo, L. (2010) Data Mining using R: learning with case studies, CRC Press (ISBN: 9781439810187).
http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 189-196.
## Small example with the Iris classification data set
data(iris)
## Dividing the data set into train and test sets
idx <- sample(150,100)
tr <- iris[idx,]
ts <- iris[-idx,]
## Learn a tree with the full train set and test it
stdTree <- rpartXse(Species~ .,tr,se=0.5)
table(predict(stdTree,ts,type='class'),ts$Species)
## Now let us create another training set with most of the target
## variable values unknown
trSelfT <- tr
nas <- sample(100,70)
trSelfT[nas,'Species'] <- NA
## Learn a tree using only the labelled cases and test it
baseTree <- rpartXse(Species~ .,trSelfT[-nas,],se=0.5)
table(predict(baseTree,ts,type='class'),ts$Species)
## The user-defined function that will be used in the self-training process
f <- function(m,d) {
    l <- predict(m,d,type='class')     # predicted class labels
    c <- apply(predict(m,d),1,max)     # probability of the predicted class
    data.frame(cl=l,p=c)
}
## Self train the same model using the semi-supervised data and test the
## resulting model
treeSelfT <- SelfTrain(Species~ .,trSelfT,learner('rpartXse',list(se=0.5)),'f')
table(predict(treeSelfT,ts,type='class'),ts$Species)
Loading required package: lattice
Loading required package: grid

             setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         15         4
  virginica       0          1        12

             setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         15         4
  virginica       0          1        12

             setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         15         4
  virginica       0          1        12