perform_clustering: Quickly perform clustering and modeling with a command-line...
In dwalke44/customerClusters: Quickly Data Processing, Clustering, and Modeling


View source: R/cluster_RF_function.R

perform_clustering performs hierarchical DBSCAN (HDBSCAN) clustering on numeric data. The cluster assignments are then used as the class labels (the dependent variable) for a random forest classifier.

perform_clustering(df, stratifierColumn, train_sample_size, min_cluster_size, num_trees)

`df`	The input data frame of observations to be clustered and modeled. Data frame should consist of numeric columns only and can be output of `process_data`.
`stratifierColumn`	The column within the data frame that controls stratified random sampling. This column will be excluded from clustering and modeling. Must be a factor variable.
`train_sample_size`	The fraction of total observations to be sampled for clustering, given as a decimal (e.g. 0.1 for 10%). The clustering algorithm is RAM-intensive, so this parameter may need tuning to ensure the process does not fail due to memory restrictions.
`min_cluster_size`	The minimum number of observations that constitute a valid cluster. This is the only required input for the H-DBSCAN clustering algorithm.
`num_trees`	The number of constituent trees used to build the random forest model. Increasing the number of trees may increase the stability of the model solution.
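
To make the interaction between `stratifierColumn` and `train_sample_size` concrete, here is a minimal base-R sketch of stratified sampling by fraction. It is not the package's own code: the helper `stratified_sample_idx` and the toy data frame are hypothetical, and the package may round or sample differently.

```r
# Hypothetical helper (not part of customerClusters): draw the same
# fraction of row indices from every level of a factor.
stratified_sample_idx <- function(strata, frac) {
  stopifnot(is.factor(strata), frac > 0, frac <= 1)
  # Row indices grouped by stratum; unused factor levels are dropped
  idx_by_stratum <- split(seq_along(strata), strata, drop = TRUE)
  sampled <- lapply(idx_by_stratum, function(idx) {
    n_take <- max(1, round(length(idx) * frac))  # at least one row per stratum
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(sampled, use.names = FALSE))
}

set.seed(42)
# Toy data: one factor column to stratify on, two numeric columns
toy <- data.frame(
  segment = factor(rep(c("a", "b", "c"), times = c(100, 50, 50))),
  x = rnorm(200),
  y = rnorm(200)
)

idx <- stratified_sample_idx(toy$segment, frac = 0.1)
counts <- table(toy$segment[idx])  # rows drawn per stratum
```

With a fraction of 0.1, each stratum contributes roughly 10% of its rows, so small strata stay represented in the clustering sample.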

This function wraps clustering and modeling into a single procedure. It clusters numeric data using HDBSCAN, then uses the cluster assignments as class labels to train a random forest model. The resulting random forest object can then classify novel observations.

The order of operations for this function is:

Stratified sampling of input data by stratifierColumn and train_sample_size
H-DBSCAN clustering using min_cluster_size
Random forest modeling using num_trees as number of trees in the forest
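
The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not the function's actual source: it assumes the `dbscan` package's `hdbscan()` and the `randomForest` package's `randomForest()` as the underlying implementations (the documentation does not name them), and the toy data, sampling fraction, and parameter values are made up.

```r
library(dbscan)        # assumed HDBSCAN implementation: hdbscan()
library(randomForest)  # assumed random forest implementation: randomForest()

set.seed(1)
n <- 300
# Toy data: a stratifier factor plus two well-separated numeric groups
toy <- data.frame(
  segment = factor(sample(c("a", "b"), n, replace = TRUE)),
  x = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 8)),
  y = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 8))
)

# Step 1: stratified sampling, here 50% of the rows in each stratum
idx <- unlist(lapply(split(seq_len(n), toy$segment), function(i) {
  i[sample.int(length(i), round(length(i) * 0.5))]
}), use.names = FALSE)
train <- toy[idx, c("x", "y")]  # the stratifier column is excluded

# Step 2: HDBSCAN; minPts plays the role of min_cluster_size
cl <- hdbscan(as.matrix(train), minPts = 10)

# Step 3: random forest trained on the cluster labels
# (in dbscan, cluster 0 marks noise points and becomes its own class here)
labels <- factor(cl$cluster)
rf <- randomForest(x = train, y = labels, ntree = 200)

# The fitted forest can label rows that were not used for clustering
pred <- predict(rf, newdata = toy[-idx, c("x", "y")])
```

Training a classifier on the cluster labels is what lets the result generalize: HDBSCAN itself only labels the sampled rows, while the forest can assign any new row to one of the discovered clusters.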

The output is a list containing the original data frame `orig_df`, the randomly selected observations used for clustering and for training the random forest model `df_sample`, the cluster object `cluster_obj`, and the random forest model object `randomForest_model`.

 stratifier = df[, 1] #stratify by values in first column
 train = 0.1 #train on 10% of total observations
 min_cluster_size = 1000 #minimum size of valid clusters is 1000 points
 num_trees = 2000 #grow 2000 trees, each on a bootstrap sample, in the random forest model
 out = perform_clustering(df, stratifierColumn = stratifier, train_sample_size = train, min_cluster_size = min_cluster_size, num_trees = num_trees)


 ## Not run: 
   randomF_model = out$randomForest_model
   train_samples = out$df_split
   test_samples = df[-train_samples, ]
   cluster_solution = out$cluster_obj
 
## End(Not run)