The unsupervised package provides unsupervised random forest methods to compute the following, and to predict them on new data:
Dissimilarity and proximity matrices (useful for clustering observations and visualizing the dissimilarities)
Outlier detection (by outlyingness)
Missing data imputation
The package implements the unsupervised random forest method in three steps:
Create synthetic data by sampling each covariate.
Classify actual versus synthetic data using a random forest model.
Obtain proximity between observations by counting the number of times a pair of observations occur together in a terminal node of a tree.
For details, see Tao Shi and Steve Horvath (2006).
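The first two steps can be illustrated with a minimal Python sketch. It is not the package's code: the helper names `make_synthetic` and `label_actual_vs_synthetic` are hypothetical, and sampling each column with replacement is an assumption about how "sampling each covariate" is done. Fitting the random forest classifier itself is left out.

```python
import random


def make_synthetic(rows, seed=0):
    """Build synthetic rows by sampling every covariate independently."""
    rng = random.Random(seed)
    n = len(rows)
    columns = list(zip(*rows))  # column-wise view of the data
    # Sampling each column on its own (with replacement, an assumption)
    # keeps the marginal values but breaks the links between covariates.
    sampled = [[rng.choice(col) for _ in range(n)] for col in columns]
    return [list(r) for r in zip(*sampled)]


def label_actual_vs_synthetic(rows, seed=0):
    """Stack actual and synthetic rows with a two-class label."""
    synthetic = make_synthetic(rows, seed)
    data = [list(r) for r in rows] + synthetic
    labels = ["actual"] * len(rows) + ["synthetic"] * len(synthetic)
    return data, labels


# Toy data in which the second covariate is ten times the first.
actual = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
data, labels = label_actual_vs_synthetic(actual)
```

The labelled data would then be passed to a random forest classifier, whose trees supply the terminal nodes used in the third step.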
The following outputs might be obtained from this method:
Distance or proximity matrix: This might be used to cluster the observations using hierarchical clustering, PAM (partitioning around medoids), DBSCAN and other clustering methods that work with distance or dissimilarity matrices. Low-dimensional embedding methods like MDS (multidimensional scaling) and t-SNE allow visualizing the dissimilarities. Quoting from Liaw and Wiener (2002): "The idea is that real data points that are similar to one another will frequently end up in the same terminal node of a tree, exactly what is measured by the proximity matrix".
Variable importance: The synthetic data destroys the relationships among the covariates. The random forest classifier tries to distinguish the two classes, actual and synthetic, based on the covariates. A high OOB error indicates a lack of relationship or interaction among the covariates. When the OOB error is low enough, a variable importance measure indicates a set of covariates with strong interactions among some subsets of themselves.
Outlyingness: The outlyingness of the jth observation is calculated as the reciprocal of the sum of squared proximities between that observation and all other observations in the same class (from Liaw and Wiener (2002)).
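The outlyingness definition above can be written out as a short Python sketch. The function name `outlyingness` is hypothetical, and the sketch follows only the sentence above: any rescaling or normalization that the package may apply afterwards is omitted.

```python
def outlyingness(proximity, classes=None):
    """Reciprocal of the summed squared proximities to same-class observations."""
    n = len(proximity)
    if classes is None:
        classes = [0] * n  # assumption: treat all observations as one class
    scores = []
    for j in range(n):
        total = sum(
            proximity[j][k] ** 2
            for k in range(n)
            if k != j and classes[k] == classes[j]
        )
        # An observation with zero proximity to everything else is
        # treated as maximally outlying.
        scores.append(float("inf") if total == 0 else 1.0 / total)
    return scores


# Observations 0 and 1 are close to each other; observation 2 is isolated.
proximity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
scores = outlyingness(proximity)
```

In this toy matrix the isolated third observation gets a much larger score than the other two.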
Impute missing data: In the first step, each column is imputed by its median or mode. In each subsequent step, the proximity matrix from the previous step is used to estimate each missing value as a proximity-weighted average. The iterations continue until the values change by less than a threshold or a maximum number of iterations is reached.
When the imputation is run in 'predict' mode, the random forest model built during train is utilized to estimate proximities.
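The iteration described above might look like the following Python sketch for a single numeric column. The names `impute_numeric` and `proximity_fn` are hypothetical. `proximity_fn` stands in for refitting the forest on the currently imputed data and returning a proximity matrix, and weighting only the observed (non-missing) values is an assumption.

```python
from statistics import median


def impute_numeric(values, proximity_fn, max_iter=10, tol=1e-6):
    """Iteratively impute None entries using proximity-weighted averages."""
    missing = [i for i, v in enumerate(values) if v is None]
    observed = [i for i, v in enumerate(values) if v is not None]
    # First step: fill every gap with the column median.
    start = median(values[i] for i in observed)
    filled = [start if v is None else v for v in values]
    for _ in range(max_iter):
        prox = proximity_fn(filled)  # proximities from the previous step
        largest_change = 0.0
        updated = list(filled)
        for i in missing:
            total = sum(prox[i][k] for k in observed)
            if total == 0:
                continue  # no similar observed rows; keep the current value
            new = sum(prox[i][k] * values[k] for k in observed) / total
            largest_change = max(largest_change, abs(new - filled[i]))
            updated[i] = new
        filled = updated
        if largest_change <= tol:
            break  # values stopped changing beyond the threshold
    return filled


# Toy stand-in: a fixed proximity matrix instead of a refitted forest.
fixed_proximity = [
    [1.0, 0.5, 0.2, 0.0],
    [0.5, 1.0, 0.5, 0.0],
    [0.2, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
values = [1.0, None, 3.0, 100.0]
imputed = impute_numeric(values, lambda filled: fixed_proximity)
```

Here the missing value starts at the median and is then pulled toward the two observations it is close to, ignoring the distant one.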
The randomForest package also provides proximity matrices when run in unsupervised mode. The unsupervised package differs from the implementation in the randomForest package in these ways:
The random forest is computed using the 'ranger' package.
Provides a predict method. This may be used to compute the pairwise distances between observations of a new dataset, identify outliers, and impute missing values by learning from the training dataset. When the predict method is "terminalNodes", the observations of the new data traverse the trees built by the random forest on the training dataset, and the pairwise distance is computed by counting the number of times a pair lands in the same terminal node.
The predict method can be run with any ranger (trained by ranger) or randomForest (trained by
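The terminal-node counting used for new data can be sketched in Python as below. The function names are hypothetical. The input is assumed to be a table of terminal node ids with one row per observation and one column per tree, and dividing by the number of trees and using 1 - proximity as the distance are assumptions rather than the package's documented behaviour.

```python
def proximity_from_terminal_nodes(node_ids):
    """node_ids[i][t] is the terminal node reached by observation i in tree t."""
    n = len(node_ids)
    n_trees = len(node_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Count the trees in which both observations share a terminal node.
            same = sum(1 for a, b in zip(node_ids[i], node_ids[j]) if a == b)
            prox[i][j] = prox[j][i] = same / n_trees
    return prox


def to_distance(prox):
    """Turn proximities into dissimilarities (assumed here: 1 - proximity)."""
    return [[1.0 - p for p in row] for row in prox]


# Three new observations passed through a forest of three trees.
node_ids = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 5, 3],
]
prox = proximity_from_terminal_nodes(node_ids)
dist = to_distance(prox)
```

The first two observations share a terminal node in two of three trees, so they end up closer to each other than either is to the third.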
Maintainer: KS Srikanth
Tao Shi and Steve Horvath (2006). Unsupervised Learning With Random Forest Predictors. <doi:10.1198/106186006X94072>
Andy Liaw and Matthew Wiener (2002). Classification and Regression by randomForest. R News, Vol. 2/3, December 2002, page 18.