View source: R/cluster_RF_function.R
perform_clustering
performs hierarchical DBSCAN (HDBSCAN) clustering on numeric
data. The cluster assignments are then used as the response variable for a
random forest classifier.
perform_clustering(df, stratifierColumn, train_sample_size,
  min_cluster_size, num_trees)
df
The input data frame of observations to be clustered and modeled.
Data frame should consist of numeric columns only and can be output of
stratifierColumn
The column within the data frame that controls stratified random sampling. This column is excluded from clustering and modeling. Must be a factor variable.
train_sample_size
The fraction of total observations to sample for clustering, given as a decimal (for example, 0.1 for 10%). The clustering algorithm is RAM-intensive, so this parameter may need tuning to keep the process from failing due to memory limits.
min_cluster_size
The minimum number of observations that constitute a valid cluster. This is the only required input for the HDBSCAN clustering algorithm.
num_trees
The number of trees used to build the random forest model. Increasing the number of trees may increase the stability of the model solution.
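To make the interaction between stratifierColumn and train_sample_size concrete, here is a minimal base R sketch of stratified sampling. It is not the package's own code: the helper stratified_sample_idx and the toy data frame are hypothetical and only illustrate drawing the same fraction of rows from every level of a factor.

```r
# Hypothetical helper (not part of the package): return row indices that
# sample the same fraction of observations from each level of a factor.
stratified_sample_idx <- function(strata, frac) {
  stopifnot(is.factor(strata), frac > 0, frac <= 1)
  # Group row positions by factor level, dropping unused levels.
  idx_by_level <- split(seq_along(strata), strata, drop = TRUE)
  picked <- lapply(idx_by_level, function(idx) {
    # Take at least one row per stratum so small groups are not lost.
    n <- max(1L, round(length(idx) * frac))
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
# Toy data: an unbalanced factor plus two numeric columns.
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), times = c(100, 50, 10))),
  x = rnorm(160),
  y = rnorm(160)
)

idx <- stratified_sample_idx(toy$group, frac = 0.1)
table(toy$group[idx])  # counts per level in the sample
```

Sampling within each level keeps the group proportions of the full data in the training subset, which a simple random sample of the same size does not guarantee for small groups.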
This function wraps clustering and modeling into a single procedure. It clusters numeric data using HDBSCAN, then uses the cluster assignments as class labels to train a random forest model. The resulting random forest object can classify novel observations.
The order of operations for this function is:
1. Stratified sampling of input data by stratifierColumn and train_sample_size
2. HDBSCAN clustering using min_cluster_size
3. Random forest modeling using num_trees as number of trees in the forest
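The three steps can be sketched end to end as follows. This is an illustration under stated assumptions, not the function's actual source: it assumes the CRAN packages dbscan and randomForest are installed and that they are suitable stand-ins for the clustering and modeling steps; the toy data and all variable names are made up for the example.

```r
# Assumption: the CRAN packages 'dbscan' and 'randomForest' are installed.
library(dbscan)
library(randomForest)

set.seed(1)
# Toy data: two well-separated Gaussian blobs plus a factor to stratify on.
toy <- data.frame(
  site = factor(rep(c("north", "south"), each = 200)),
  x = c(rnorm(200, 0, 0.3), rnorm(200, 5, 0.3)),
  y = c(rnorm(200, 0, 0.3), rnorm(200, 5, 0.3))
)

# 1. Stratified sample: 50% of the rows within each level of 'site'.
idx <- unlist(
  lapply(split(seq_len(nrow(toy)), toy$site),
         function(i) i[sample.int(length(i), round(length(i) * 0.5))]),
  use.names = FALSE
)
train <- toy[idx, c("x", "y")]  # the stratifier column is left out

# 2. HDBSCAN on the sampled numeric columns (cluster 0 denotes noise).
cl <- hdbscan(as.matrix(train), minPts = 20)

# 3. Random forest trained to reproduce the cluster labels.
rf <- randomForest(x = train, y = factor(cl$cluster), ntree = 200)

# The fitted forest can label observations that were never clustered.
held_out <- toy[-idx, c("x", "y")]
pred <- predict(rf, held_out)
table(pred)
```

Only the sampled rows go through the memory-hungry clustering step; the remaining rows are labeled by the much cheaper random forest prediction.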
The output is a list containing orig_df, the original data frame; df_sample, the randomly selected observations used for clustering and for training the random forest model; cluster_obj, the cluster object; and randomForest_model, the random forest model object.
stratifier = df[, 1]  # stratify by values in first column
train = 0.1  # train on 10% of total observations
min_cluster_size = 1000  # minimum size of a valid cluster is 1000 points
num_trees = 2000  # grow 2000 trees in the random forest model
out = perform_clustering(df, stratifierColumn = stratifier, train_sample_size = train, min_cluster_size = min_cluster_size, num_trees = num_trees)
## Not run:
randomF_model = out$randomForest_model
train_samples = out$df_sample  # sampled observations, as named in the returned list
# assumes the sampled rows keep the row names of the original data frame
test_samples = df[!(rownames(df) %in% rownames(train_samples)), ]
cluster_solution = out$cluster_obj
## End(Not run)