process_data: Quickly perform data pre-processing with a command-line...
In dwalke44/customerClusters: Quickly Data Processing, Clustering, and Modeling


View source: R/process_data_function.R

process_data returns several data frames with various levels of pre-processing.

process_data(df)

`df`	The input data frame for processing. Data frame should consist of numeric columns only.

This function wraps several individual pre-processing steps into a single function and is driven by user input at the command line. Its purpose is to quickly ensure that data is in the proper format for clustering and random forest modeling. It returns three data frames with varying degrees of pre-processing.

The returned data frames are 1) drop_corr_var, the input data frame with correlated variables removed; 2) corr_removed_cs, the centered and scaled data frame without correlated variables; and 3) corr_present_cs, the centered and scaled data frame including any correlated variables. The content of the output is dependent on selections made by the user.

Most clustering algorithms, such as the included H-DBSCAN, require numeric input data. This function returns an error if the input data frame contains columns of any type other than numeric.
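The numeric test can be approximated in base R. The sketch below is an illustration only, not the package's actual code: the helper name `check_all_numeric` and the error message are assumptions made for this example.

```r
# Hypothetical helper (not part of customerClusters): stop with an error
# if any column of the data frame is not numeric.
check_all_numeric <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  if (!all(is_num)) {
    stop("Non-numeric columns found: ",
         paste(names(df)[!is_num], collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data with one character column, so the check fails.
toy <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))
result <- tryCatch(check_all_numeric(toy),
                   error = function(e) conditionMessage(e))
```

Running a check like this before calling process_data shows which columns need to be converted or dropped.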

After the input data frame has passed the numeric test, this function tests for the presence of NAs. If NAs are detected, the function assumes they represent zeroes and replaces them with 0.
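The NA handling described here can be sketched in base R as follows. The helper name `replace_na_with_zero` is hypothetical, and the package may implement the replacement differently.

```r
# Hypothetical helper (not part of customerClusters): replace every NA
# in a numeric data frame with 0.
replace_na_with_zero <- function(df) {
  df[is.na(df)] <- 0
  df
}

# Toy data with one NA in each column.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
cleaned <- replace_na_with_zero(toy)
```

If NAs in your data do not represent zeroes, impute or remove them before passing the data frame to process_data.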

Once NAs have been replaced, the function tests for correlated predictors. It identifies highly correlated predictors (default setting: > 95% correlation) and asks the user to identify which predictors to remove. If no predictors are considered candidates for removal, users should enter 0.
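To preview which predictors might be flagged, a correlation screen can be sketched with base R's `cor`. This is an assumption-laden illustration: the helper `find_correlated_pairs` and its `cutoff` argument are hypothetical, and the package's own detection method may differ.

```r
# Hypothetical helper (not part of customerClusters): list pairs of columns
# whose absolute Pearson correlation exceeds a cutoff.
find_correlated_pairs <- function(df, cutoff = 0.95) {
  cm <- abs(cor(df))
  # Zero the lower triangle and diagonal so each pair is reported once
  # and self-correlations are ignored.
  cm[lower.tri(cm, diag = TRUE)] <- 0
  idx <- which(cm > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    abs_cor = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data: y is almost a multiple of x, z is only weakly related to either.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 10.5),
  z = c(5, 1, 4, 2, 3)
)
high_pairs <- find_correlated_pairs(toy)

# Drop one member of the flagged pair, as a user would choose at the prompt.
reduced <- toy[, setdiff(names(toy), "y"), drop = FALSE]
```

Here the choice to drop `y` rather than `x` is arbitrary; in process_data that choice is left to the user.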

The final processing step in this function is to center and scale the predictors using the base-R scale function. Centered and scaled predictors are usually required for clustering, and often required for ML algorithms.
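As a brief illustration of this step, base R's `scale` subtracts each column's mean and divides by its standard deviation. The toy data and the conversion back to a data frame are assumptions for this example, not the package's code.

```r
# Toy numeric data frame with columns on very different scales.
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# scale() returns a matrix, so convert back to a data frame.
scaled <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

# After scaling, each column has mean 0 and standard deviation 1
# (up to floating-point error).
col_means <- colMeans(scaled)
col_sds <- vapply(scaled, sd, numeric(1))
```

Putting columns on a common scale keeps distance-based clustering from being dominated by the predictor with the largest raw range.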

  out = process_data(df)

  ## Not run: 
  #example selection from output
  df1 = out$drop_corr_var
  df2 = out$corr_removed_cs
  df3 = out$corr_present_cs
  
## End(Not run)