Speed Up Exploratory Data Analysis (EDA) with correlationfunnel
The goal of correlationfunnel is to help data scientists speed up Exploratory Data Analysis (EDA). EDA can be an incredibly time-consuming process.
Traditional approaches to EDA are labor-intensive: the data scientist reviews each feature (predictor) in the data set for its relationship to the target (i.e. the goal or response). Manually building many visualizations and searching them for relationships can take hours.
Correlation Analysis on data that has been preprocessed (more on this shortly) can drastically speed up EDA by identifying key features that relate to the target. The key is getting the features into the "right format". This is where correlationfunnel helps.
The correlationfunnel package includes a streamlined 3-step process for preparing data and performing visual Correlation Analysis. The visualization it produces uncovers insights by elevating high-correlation features and lowering low-correlation features. The shape looks like a funnel (hence the name "Correlation Funnel"), which makes it easy to see which features are most likely to provide business insights and to work well in a machine learning model.
Speeds Up Exploratory Data Analysis - You can drastically increase the speed at which you perform Exploratory Data Analysis (EDA) by using Correlation Analysis to focus on key features (rather than investigating all features).
Improves Feature Selection - You can use correlation to check whether you have good features before spending significant time developing Machine Learning Models.
Gets You To Business Insights Faster - Understanding how features are related to a target variable can help you develop the story in the data (aka business insights).
The Correlation Funnel process uses 3 functions:
Transform the data into a binary format with binarize() - This step converts semi-processed data into the format (binary) best suited to correlation analysis.
Perform correlation analysis using correlate() - This step correlates the "binarized" data (binary features) with the target.
Visualize the feature-target relationships using plot_correlation_funnel() - This step produces the visualization from which we can get business insights.
We'll step through an example to understand which features are related to Customer Churn.
Load the necessary libraries.
library(correlationfunnel)
library(dplyr)
Get the customer_churn_tbl dataset. The dataset contains a number of features related to a telecommunications company's customer base and whether or not each customer has churned. The target is "Churn".
data("customer_churn_tbl")
customer_churn_tbl %>% glimpse()
We use the binarize() function to produce a feature set of binary (0/1) variables. Numeric data are binned (using n_bins) into categorical data, then all categorical data are one-hot encoded to produce binary features. To prevent low-frequency categories (common in high-cardinality features) from increasing the dimensionality (width) of the resulting data frame, we use thresh_infreq = 0.01 and name_infreq = "OTHER" to group the excess categories.
customer_churn_binarized_tbl <- customer_churn_tbl %>%
    # Drop the ID column, which carries no predictive information
    select(-customerID) %>%
    # Fill missing TotalCharges with MonthlyCharges
    mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
    binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)

customer_churn_binarized_tbl %>% glimpse()
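To make the idea of binarization more concrete, here is a minimal base R sketch of the three ideas described above: quantile binning, grouping infrequent categories, and one-hot encoding. It is a conceptual illustration only and is not the implementation used inside binarize(). The toy data, the one_hot() helper, the 0.15 threshold, and all column names are assumptions made up for this example.

```r
# Conceptual sketch in base R; NOT the internal implementation of binarize().
# Hypothetical toy data: one numeric and one categorical feature.
toy_tbl <- data.frame(
  tenure  = c(1, 3, 5, 12, 24, 36, 48, 60, 66, 72),
  payment = c("check", "check", "card", "card", "card",
              "check", "card", "check", "card", "wire"),
  stringsAsFactors = FALSE
)

# Step A: bin the numeric column at quantile boundaries (5 bins here).
n_bins <- 5
breaks <- unique(quantile(toy_tbl$tenure, probs = seq(0, 1, length.out = n_bins + 1)))
toy_tbl$tenure_bin <- cut(toy_tbl$tenure, breaks = breaks, include.lowest = TRUE)

# Step B: group categories whose share of rows falls below a threshold.
# (0.15 is chosen only so that the tiny example has a rare category.)
thresh_infreq <- 0.15
shares <- prop.table(table(toy_tbl$payment))
rare   <- names(shares)[shares < thresh_infreq]
toy_tbl$payment_lumped <- ifelse(toy_tbl$payment %in% rare, "OTHER", toy_tbl$payment)

# Step C: one-hot encode a categorical vector into 0/1 columns.
# one_hot() is a hypothetical helper written for this sketch.
one_hot <- function(x, prefix) {
  x <- factor(x)
  m <- model.matrix(~ x - 1)
  colnames(m) <- paste0(prefix, "__", levels(x))
  m
}

toy_binarized <- cbind(
  one_hot(toy_tbl$tenure_bin, "tenure"),
  one_hot(toy_tbl$payment_lumped, "payment")
)

# Each row has exactly one 1 per original feature.
print(toy_binarized)
```

Every original feature becomes a group of 0/1 columns, which is what makes a single correlation measure comparable across numeric and categorical inputs.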
Next, we use correlate() to correlate the binary features to a target (in our case Customer Churn).
customer_churn_corr_tbl <- customer_churn_binarized_tbl %>%
    correlate(Churn__Yes)

customer_churn_corr_tbl
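As a rough illustration of what correlating binary features with a target means, the base R sketch below computes the Pearson correlation of each 0/1 column with a 0/1 target using stats::cor() and sorts the result by absolute value. It assumes Pearson correlation and uses hypothetical data and column names; it is not a substitute for correlate() and may differ from it in details such as output format.

```r
# Conceptual sketch in base R; hypothetical binarized data with a 0/1 target.
binary_tbl <- data.frame(
  churn_yes        = c(1, 1, 1, 0, 0, 0, 1, 0),
  contract_monthly = c(1, 1, 1, 0, 0, 1, 1, 0),
  tech_support_yes = c(0, 0, 1, 1, 1, 1, 0, 1),
  gender_female    = c(1, 0, 1, 0, 1, 0, 0, 1)
)

# Pearson correlation of every column (including the target itself) with the target.
corr <- cor(binary_tbl, binary_tbl$churn_yes)[, 1]

# Arrange from strongest to weakest relationship, ignoring sign.
corr_tbl <- data.frame(feature = names(corr), correlation = unname(corr))
corr_tbl <- corr_tbl[order(abs(corr_tbl$correlation), decreasing = TRUE), ]

print(corr_tbl)
```

In this toy data the contract feature is positively related to the target, the tech support feature is negatively related, and gender is unrelated, so it sorts to the bottom.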
Finally, we visualize the correlation using the plot_correlation_funnel() function.
customer_churn_corr_tbl %>% plot_correlation_funnel()
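The funnel shape comes from placing correlation on the x-axis and ordering features by the strength of their relationship to the target. The base graphics sketch below reproduces that layout with made-up correlation values; the feature names and numbers are hypothetical, and the real plot_correlation_funnel() output will look different in its details.

```r
# Conceptual sketch in base R; correlation values are made up for illustration.
funnel_corr <- c(target = 1.00, feature_a = 0.62, feature_b = -0.48,
                 feature_c = 0.21, feature_d = -0.09, feature_e = 0.02)

# Order by absolute correlation, ascending, so the strongest features
# are drawn at the top of the plot and the weakest at the bottom.
ord <- order(abs(funnel_corr))
funnel_sorted <- funnel_corr[ord]

# Points spread wide at the top and narrow toward zero at the bottom.
plot(funnel_sorted, seq_along(funnel_sorted),
     xlim = c(-1, 1), yaxt = "n", pch = 19,
     xlab = "Correlation with target", ylab = "")
axis(2, at = seq_along(funnel_sorted), labels = names(funnel_sorted),
     las = 1, cex.axis = 0.7)
abline(v = 0, lty = 2)
```

Features near the top are the ones worth investigating first; those clustered around the dashed zero line carry little linear relationship to the target.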
We can see that the following features are correlated with Churn:
We can also see that the following features are correlated with Staying (No Churn):
We can then develop a strategy to retain high risk customers:
The correlationfunnel package provides a 3-step workflow that streamlines the EDA process, helps with feature selection, and makes it easier to obtain Business Insights.
To learn about the inner workings of correlationfunnel and the key considerations for its use, please read the Key Considerations and FAQs.