cleanData: Rejection of new instances based on their distance to...
In semiArtificial: Generator of Semi-Artificial Data


View source: R/forestDataGen.R

The function provides three data cleaning methods. The first two reject instances whose distance to their nearest neighbors in the existing data is too small or too large: the first checks distances between instances disregarding class, while the second takes only instances from the same class into account. The third method reassigns the response variable using the prediction model stored in the generator teObject.


cleanData(teObject, newdat, similarDropP=NA, dissimilarDropP=NA,
          similarDropPclass=NA, dissimilarDropPclass=NA,
          nearestInstK=1, reassignResponse=FALSE, cleaningObject=NULL)

`teObject`	An object of class `TreeEnsemble` containing a generator structure as returned by `treeEnsemble`. The `teObject` contains the generator's training instances, from which we compute a distribution of distances from instances to their `nearestInstK` nearest instances. This distance distribution, computed on the training data of the generator, serves as a criterion for rejecting new instances from `newdat`, i.e., based on the parameters below, we reject the instances that are too close to or too far away from their nearest neighbors in the generator's training data. The computed distance distributions are stored and returned as the `cleaningObject` component of the returned list. Providing this component on subsequent calls reduces the computational load.
`newdat`	A `data.frame` object with the (newly generated) data to be cleaned.
`similarDropP`	With the numeric parameters `similarDropP` and `dissimilarDropP` (default value NA, valid values in the range [0, 1]) one removes instances in `newdat` that are too close to the generator's training instances or too far away from them. The distance distribution is computed from the instances stored in `teObject`. For each instance in `teObject` we store the distance to its `nearestInstK` nearest instances (disregarding identical instances). These distances are sorted and represent a distribution of nearest distances for all training instances. The values `similarDropP` and `dissimilarDropP` represent a proportion of allowed smaller/larger distances computed on the generator's training data contained in the `teObject`.
`dissimilarDropP`	See `similarDropP`.
`similarDropPclass`	For classification problems only. Similarly to `similarDropP` and `dissimilarDropP` above, with `similarDropPclass` and `dissimilarDropPclass` (also in the [0, 1] range) we remove instances in `newdat` that are too close to the generator's training instances or too far away from them, but taking only near instances from the same class into account. The `similarDropPclass` contains either a single numeric value giving the threshold for all class values or a vector of thresholds, one for each class. If the vector is of insufficient length, it is replicated using the function `rep`. The generated distance distributions are stored in the `cleaningObject` component of the returned list.
`dissimilarDropPclass`	See `similarDropPclass`.
`nearestInstK`	An integer (default value 1) controlling how many of the generator's training instances are taken into account when computing the distance distribution of nearest instances.
`reassignResponse`	A `logical` value controlling whether the response variable of `newdat` is set anew using a random forest prediction model or kept as it is. The default value `reassignResponse=FALSE` means that the values of the response are not changed.
`cleaningObject`	A list object with precomputed distance distributions and a predictor from previous runs of the same function. If provided, this saves computation time.
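
The replication of per-class thresholds described for `similarDropPclass` can be illustrated with a small base R sketch. The exact `rep` call used inside the package is not shown in this page, so `length.out` is an assumption here, and `classLevels` and `perClass` are illustrative names only.

```r
# Assumption: a too-short threshold vector is recycled to one value per class.
classLevels <- levels(iris$Species)           # three classes in iris
similarDropPclass <- c(0.05, 0.10)            # only two thresholds supplied

# rep() with length.out recycles the vector until it covers all classes
perClass <- rep(similarDropPclass, length.out = length(classLevels))
names(perClass) <- classLevels
perClass
```

Under this assumption the first threshold is reused for the third class, so supplying one value per class explicitly is the less surprising choice.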

The function uses the training instances stored in the generator teObject to compute the distribution of distances from instances to their nearestInstK nearest instances. For classification problems the distributions can also be computed using only instances from the same class. Using these near-distance distributions, the function rejects all new instances that are too close to or too far away from existing instances.

The default value of similarDropP, dissimilarDropP, similarDropPclass, and dissimilarDropPclass is NA, which means that near/far instances are not rejected. The value 0 for similarDropP and similarDropPclass, and the value 1 for dissimilarDropP and dissimilarDropPclass, have the same effect.
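
The rejection rule described above can be sketched in base R. This is not the package's implementation: it assumes numeric attributes, Euclidean distance, the k-th nearest distance as the stored value, and an interpretation of the thresholds as quantiles of the training distance distribution. The helpers `kthNearestDist` and `sketchClean` are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: distance from each row of x to its k-th nearest row in ref.
# Assumes numeric matrices and Euclidean distance.
kthNearestDist <- function(x, ref, k = 1, dropZero = FALSE) {
  apply(x, 1, function(row) {
    d <- sqrt(colSums((t(ref) - row)^2))
    if (dropZero) d <- d[d > 0]          # skip identical instances (and self)
    if (length(d) == 0) return(NA_real_)
    sort(d)[min(k, length(d))]
  })
}

# Hypothetical sketch of the class-agnostic cleaning step.
sketchClean <- function(train, newdat, similarDropP = NA, dissimilarDropP = NA, k = 1) {
  # distribution of nearest distances on the training data
  trainDist <- kthNearestDist(train, train, k, dropZero = TRUE)
  # nearest distances of the new instances to the training data
  newDist <- kthNearestDist(newdat, train, k)
  # NA, 0 (similar) and 1 (dissimilar) all mean "reject nothing" on that side
  lower <- if (is.na(similarDropP) || similarDropP <= 0) -Inf else
    quantile(trainDist, similarDropP, names = FALSE, na.rm = TRUE)
  upper <- if (is.na(dissimilarDropP) || dissimilarDropP >= 1) Inf else
    quantile(trainDist, dissimilarDropP, names = FALSE, na.rm = TRUE)
  newdat[newDist >= lower & newDist <= upper, , drop = FALSE]
}

# Usage on perturbed iris measurements (illustrative data only)
train <- as.matrix(iris[, 1:4])
set.seed(1)
cand <- train[sample(nrow(train), 50), ] + matrix(rnorm(200, sd = 0.3), ncol = 4)
kept <- sketchClean(train, cand, similarDropP = 0.05, dissimilarDropP = 0.95)
nrow(kept)
```

With both thresholds left at NA the sketch returns all candidate rows, mirroring the documented default.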

The method returns a list object with two components:

`cleanData`	is a `data.frame` containing the instances from `newdat` left after rejection of too close or too distant instances.
`cleaningObject`	is a `list` containing the computed distributions of nearest distances (also class-based for classification problems), and possibly a predictor used for reassigning the response variable.

Marko Robnik-Sikonja

treeEnsemble, newdata.TreeEnsemble.

library(semiArtificial)

# inspect properties of the iris data set
plot(iris, col=iris$Species)
summary(iris)

# build a tree ensemble generator from the iris data
irisEnsemble <- treeEnsemble(Species~., iris, noTrees=10)

# use the generator to create new data
irisNewEns <- newdata(irisEnsemble, size=150)

# inspect properties of the new data
plot(irisNewEns, col=irisNewEns$Species) # plot generated data
summary(irisNewEns)

# reject generated instances too close to or too far from the training data
clObj <- cleanData(irisEnsemble, irisNewEns, similarDropP=0.05, dissimilarDropP=0.95,
                   similarDropPclass=0.05, dissimilarDropPclass=0.95,
                   nearestInstK=1, reassignResponse=FALSE, cleaningObject=NULL)
head(clObj$cleanData)
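
The arguments section notes that passing a previously returned `cleaningObject` saves computation. A minimal sketch of that reuse follows, assuming the semiArtificial package is installed; it uses only the calls shown on this page, and the variable names `firstRun` and `secondRun` are illustrative.

```r
library(semiArtificial)

irisEnsemble <- treeEnsemble(Species ~ ., iris, noTrees = 10)

# first call computes the distance distributions and returns them
firstRun <- cleanData(irisEnsemble, newdata(irisEnsemble, size = 150),
                      similarDropP = 0.05, dissimilarDropP = 0.95)

# second call cleans a fresh batch and reuses the stored distributions
secondRun <- cleanData(irisEnsemble, newdata(irisEnsemble, size = 150),
                       similarDropP = 0.05, dissimilarDropP = 0.95,
                       cleaningObject = firstRun$cleaningObject)
head(secondRun$cleanData)
```

Reuse is presumably only meaningful when the generator and `nearestInstK` are the same in both calls, since the stored distributions were computed for them.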