RFdist: Unsupervised Random Forest


Description

Computes a dissimilarity matrix between observations by training a randomForest (RF) classifier to discriminate between the original data and a synthetic version of it. The original data is labeled "True.Data" and the synthetic data is labeled "Synthetic.Data". The random forest proximity matrix between observations in the original data is then extracted, converted to a distance, and returned. The synthetic data is generated by taking a random sample from each dimension of the true data, with or without replacement.
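The procedure described above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the package's implementation: `make_synthetic` is a hypothetical helper, and the `sqrt(1 - proximity)` conversion follows Shi and Horvath (2006) and may differ from what RFdist actually uses. The forest step runs only if the randomForest package is installed.

```r
# Hypothetical helper (not part of UnsupRF): build a synthetic copy of a data set
# by resampling every column independently.
make_synthetic <- function(data, syn.type = c("emperical", "permute")) {
  syn.type <- match.arg(syn.type)
  data <- as.data.frame(data)
  # "emperical": sample with replacement; "permute": sample without replacement
  with_replacement <- identical(syn.type, "emperical")
  syn <- lapply(data, function(col) sample(col, length(col), replace = with_replacement))
  as.data.frame(syn, stringsAsFactors = FALSE)
}

set.seed(1)
dat <- iris[, -5]
syn <- make_synthetic(dat, "permute")

# Stack original and synthetic rows and label them as in the description
x <- rbind(dat, syn)
y <- factor(rep(c("True.Data", "Synthetic.Data"), each = nrow(dat)))

if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(x = x, y = y, ntree = 50, proximity = TRUE)
  n <- nrow(dat)
  # Keep only the proximities among the original observations
  prox <- rf$proximity[seq_len(n), seq_len(n)]
  # Assumed conversion to a distance (Shi and Horvath, 2006)
  d <- as.dist(sqrt(1 - prox))
}
```

Resampling each column on its own keeps the marginal distributions but breaks the dependence between variables, which is what gives the classifier something to discriminate.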

Usage

RFdist(data, ...)

## Default S3 method:
RFdist(data, mtry = floor(sqrt(ncol(data))), ntree,
  no.rep, syn.type = "emperical", importance = FALSE, nodesize = 1,
  parallel = c("forests", "trees", "no"), ...)

## S3 method for class 'RFdist'
print(x, ...)

## S3 method for class 'RFdist'
plot(x, ...)

Arguments

data

data.frame or matrix

...

further arguments passed to randomForest.

mtry

mtry in randomForest

ntree

number of trees

no.rep

number of repetitions or forests

syn.type

type of synthetic data generator: "emperical" samples each dimension from the empirical distribution of the original data (sampling with replacement), while "permute" takes a random permutation of each dimension (sampling without replacement).

importance

(logical) whether to compute variable importance

nodesize

node size in randomForest

parallel

character vector specifying the type of parallel run: 'forests' - run the no.rep random forests in parallel; 'trees' - run each RF in parallel over its ntree trees and combine the results; 'no' - serial computation.

x

object of class RFdist

Details

Methods

  1. print : print OOB error and convergence summary

  2. plot : plots the convergence of the RF proximities, measured by the MSE, against the number of forests (no.rep)

Value

A list with elements:

  1. RFdist: RF proximity converted to a distance object

  2. err: error rate

  3. UnsupRFvarimp: Unsupervised RF variable importance

  4. proxConver: a matrix containing three convergence measures

    1. Max.prox = max( abs( aveprox(N)- aveprox(N-1)))

    2. MSE.prox = mean( (aveprox(N)- aveprox(N-1))^2)

    3. Mean = mean(aveprox(N)), where N is the number of forests (no.rep).
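As an illustration of the three measures, the sketch below computes them from a list of per-forest proximity matrices. It assumes that aveprox(N) is the running average of the proximity matrices over the first N forests and that the statistics are taken over all matrix entries; the page does not state either detail. `conv_measures` and `prox_list` are hypothetical names, not part of the package.

```r
# Hypothetical helper: convergence measures between successive running averages.
# prox_list is a list of proximity matrices, one per forest (at least two).
conv_measures <- function(prox_list) {
  stopifnot(length(prox_list) >= 2)
  ave_prev <- prox_list[[1]]
  out <- matrix(NA_real_, nrow = length(prox_list) - 1, ncol = 3,
                dimnames = list(NULL, c("Max.prox", "MSE.prox", "Mean")))
  for (N in 2:length(prox_list)) {
    # Running mean over the first N forests: aveprox(N)
    ave_cur <- ave_prev + (prox_list[[N]] - ave_prev) / N
    delta <- ave_cur - ave_prev
    out[N - 1, ] <- c(max(abs(delta)), mean(delta^2), mean(ave_cur))
    ave_prev <- ave_cur
  }
  out
}

# Toy example with two 2 x 2 proximity matrices
p1 <- matrix(c(1, 0.2, 0.2, 1), nrow = 2)
p2 <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)
conv <- conv_measures(list(p1, p2))
```

Under these assumptions, Max.prox and MSE.prox shrinking toward zero as forests are added indicates that the averaged proximities have stabilized.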

References

Tao Shi and Steve Horvath (2006). Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics, 15(1), 118-138.

Examples

## Not run: 
set.seed(12345)
data(iris)
dat <- iris[, -5]
RF.dist <- RFdist(data=dat, ntree = 10, no.rep=20, syn.type = "permute", 
               importance=TRUE, parallel = "no")
# print OOB error and convergence summary
print(RF.dist, digits = 3)
# plot convergence of the RF proximities over the forests
plot(RF.dist)
# plot variable importance 
UnsupRFvarImpPlot(RF.dist, sort=TRUE)

## End(Not run)

nguforche/UnsupRF documentation built on May 5, 2019, 4:51 p.m.