Solution

Correlation Analysis on data that has been preprocessed (more on this shortly) can drastically speed up EDA by identifying key features that relate to the target. The key is getting the features into the “right format”. This is where correlationfunnel helps.
The correlationfunnel package includes a streamlined 3-step process for preparing data and performing visual Correlation Analysis. The visualization produced uncovers insights by elevating high-correlation features and lowering low-correlation features. The shape looks like a funnel (hence the name “Correlation Funnel”), making it efficient to see which features are most likely to provide business insights and lend themselves well to a machine learning model.
Example - Customer Churn
We’ll step through an example of understanding what features are related to Customer Churn.
Load the necessary libraries.
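The setup is not shown here, so the following is a minimal sketch of the libraries the rest of the example appears to rely on. It assumes both packages are installed: correlationfunnel supplies binarize(), correlate(), and plot_correlation_funnel(), while dplyr supplies %>%, glimpse(), select(), and mutate().

```r
# Assumed setup for the code in this example (both packages must be installed)
library(correlationfunnel)  # binarize(), correlate(), plot_correlation_funnel()
library(dplyr)              # %>%, glimpse(), select(), mutate()
```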
Get the customer_churn_tbl dataset. The dataset contains a number of features related to a telecommunications company’s customer-base and whether or not the customer has churned. The target is “Churn”.
data("customer_churn_tbl")
customer_churn_tbl %>% glimpse()
#> Observations: 7,043
#> Variables: 21
#> $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795…
#> $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Fe…
#> $ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No"…
#> $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No"…
#> $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58,…
#> $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", …
#> $ MultipleLines <chr> "No phone service", "No", "No", "No phone servi…
#> $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fib…
#> $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Y…
#> $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "N…
#> $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "N…
#> $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No"…
#> $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No…
#> $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No"…
#> $ Contract <chr> "Month-to-month", "One year", "Month-to-month",…
#> $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", …
#> $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed che…
#> $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10…
#> $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50…
#> $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "N…
Step 1 - Prepare Data as Binary Features
We use the binarize() function to produce a feature set of binary (0/1) variables. Numeric data are binned (using n_bins) into categorical data, then all categorical data are one-hot encoded to produce binary features. To keep infrequent categories (common in high-cardinality features) from inflating the width of the resulting data frame, we use thresh_infreq = 0.01 and name_infreq = "OTHER" to lump categories rarer than the threshold into a single "OTHER" category.
customer_churn_binarized_tbl <- customer_churn_tbl %>%
    select(-customerID) %>%
    mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
    binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)
customer_churn_binarized_tbl
#> # A tibble: 7,043 x 60
#> gender__Female gender__Male SeniorCitizen__0 SeniorCitizen__1
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 1 0
#> 2 0 1 1 0
#> 3 0 1 1 0
#> 4 0 1 1 0
#> 5 1 0 1 0
#> 6 1 0 1 0
#> 7 0 1 1 0
#> 8 1 0 1 0
#> 9 1 0 1 0
#> 10 0 1 1 0
#> # … with 7,033 more rows, and 56 more variables: Partner__No <dbl>,
#> # Partner__Yes <dbl>, Dependents__No <dbl>, Dependents__Yes <dbl>,
#> # `tenure__-Inf_6` <dbl>, tenure__6_20 <dbl>, tenure__20_40 <dbl>,
#> # tenure__40_60 <dbl>, tenure__60_Inf <dbl>, PhoneService__No <dbl>,
#> # PhoneService__Yes <dbl>, MultipleLines__No <dbl>,
#> # MultipleLines__No_phone_service <dbl>, MultipleLines__Yes <dbl>,
#> # InternetService__DSL <dbl>, InternetService__Fiber_optic <dbl>,
#> # InternetService__No <dbl>, OnlineSecurity__No <dbl>,
#> # OnlineSecurity__No_internet_service <dbl>, OnlineSecurity__Yes <dbl>,
#> # OnlineBackup__No <dbl>, OnlineBackup__No_internet_service <dbl>,
#> # OnlineBackup__Yes <dbl>, DeviceProtection__No <dbl>,
#> # DeviceProtection__No_internet_service <dbl>,
#> # DeviceProtection__Yes <dbl>, TechSupport__No <dbl>,
#> # TechSupport__No_internet_service <dbl>, TechSupport__Yes <dbl>,
#> # StreamingTV__No <dbl>, StreamingTV__No_internet_service <dbl>,
#> # StreamingTV__Yes <dbl>, StreamingMovies__No <dbl>,
#> # StreamingMovies__No_internet_service <dbl>,
#> # StreamingMovies__Yes <dbl>, `Contract__Month-to-month` <dbl>,
#> # Contract__One_year <dbl>, Contract__Two_year <dbl>,
#> # PaperlessBilling__No <dbl>, PaperlessBilling__Yes <dbl>,
#> # `PaymentMethod__Bank_transfer_(automatic)` <dbl>,
#> # `PaymentMethod__Credit_card_(automatic)` <dbl>,
#> # PaymentMethod__Electronic_check <dbl>,
#> # PaymentMethod__Mailed_check <dbl>, `MonthlyCharges__-Inf_25.05` <dbl>,
#> # MonthlyCharges__25.05_58.83 <dbl>, MonthlyCharges__58.83_79.1 <dbl>,
#> # MonthlyCharges__79.1_94.25 <dbl>, MonthlyCharges__94.25_Inf <dbl>,
#> # `TotalCharges__-Inf_265.32` <dbl>, TotalCharges__265.32_939.78 <dbl>,
#> # TotalCharges__939.78_2043.71 <dbl>,
#> # TotalCharges__2043.71_4471.44 <dbl>, TotalCharges__4471.44_Inf <dbl>,
#> # Churn__No <dbl>, Churn__Yes <dbl>
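To make the binning and one-hot encoding idea concrete, here is a rough base R sketch on a toy data frame. It is not the package's implementation: the data values, the one_hot() helper, and the resulting column names are hypothetical, and it skips the lumping of infrequent categories.

```r
# Toy data (hypothetical values, not taken from customer_churn_tbl)
toy_tbl <- data.frame(
    tenure   = c(1, 34, 2, 45, 8, 22, 62, 70),
    contract = c("Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Two year", "Two year", "Two year"),
    stringsAsFactors = FALSE
)

# Step A: bin the numeric column into 4 quantile-based intervals,
# opening the outer bins to -Inf / Inf so every value falls in a bin
breaks <- quantile(toy_tbl$tenure, probs = seq(0, 1, length.out = 5))
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf
toy_tbl$tenure_bin <- cut(toy_tbl$tenure, breaks = breaks)

# Step B: one-hot encode a categorical vector into 0/1 columns
# (hypothetical helper; one column per level, named prefix__level)
one_hot <- function(x, prefix) {
    x <- as.factor(x)
    out <- sapply(levels(x), function(lvl) as.numeric(x == lvl))
    colnames(out) <- paste(prefix, levels(x), sep = "__")
    out
}

toy_binary <- cbind(
    one_hot(toy_tbl$tenure_bin, "tenure"),
    one_hot(toy_tbl$contract, "contract")
)

toy_binary
```

Each row ends up with exactly one 1 per original column, which is the property that lets every bin or category be correlated with the target on its own.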
Step 2 - Correlate to the Target
Next, we use correlate() to correlate the binary features to a target (in our case Customer Churn).
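The call for this step is not shown, so the following is a sketch of how it would likely look. It assumes the target is passed as the unquoted binary column Churn__Yes (one of the columns in the binarized output above); the name customer_churn_corr_tbl is chosen here for illustration. The Step 1 code is repeated so the block runs on its own.

```r
library(correlationfunnel)
library(dplyr)

data("customer_churn_tbl")

# Step 1 (repeated from above): binarize the features
customer_churn_binarized_tbl <- customer_churn_tbl %>%
    select(-customerID) %>%
    mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
    binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)

# Step 2: correlate every binary feature with the target column
# (assumption: Churn__Yes is the target of interest)
customer_churn_corr_tbl <- customer_churn_binarized_tbl %>%
    correlate(Churn__Yes)

customer_churn_corr_tbl
```

Choosing Churn__Yes rather than Churn__No means positive correlations point toward churning and negative ones toward staying.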
Step 3 - Plot the Correlation Funnel
Finally, we visualize the correlation using the plot_correlation_funnel() function.
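As a sketch of the full workflow, the three steps can be chained into a single pipeline that ends in the plot. The name funnel_plot is hypothetical, and the plotting function is called with its default arguments.

```r
library(correlationfunnel)
library(dplyr)

data("customer_churn_tbl")

# Steps 1-3 in one chain: binarize, correlate to the target, plot the funnel
funnel_plot <- customer_churn_tbl %>%
    select(-customerID) %>%
    mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
    binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>%
    correlate(Churn__Yes) %>%
    plot_correlation_funnel()

# Printing the object draws the plot
funnel_plot
```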

Business Insights
We can see that the following features are correlated with Churn:
- “Month to Month” Contract Type
- No Online Security
- No Tech Support
- Customer tenure less than 6 months
- Fiber Optic internet service
- Pays with electronic check
We can also see that the following features are correlated with Staying (No Churn):
- “Two Year” Contract Type
- Customer Purchases Online Security
- Customer Purchases Tech Support
- Customer tenure greater than 60 months (5 years)
- DSL internet service
- Pays with automatic credit card
We can then develop a strategy to retain high-risk customers:
- Promotions for 2 Year Contract, Online Security, and Tech Support
- Loyalty Bonuses to incentivize tenure
- Incentives for setting up an automatic credit card payment
Conclusion
The correlationfunnel package provides a 3-step workflow that streamlines the EDA process, helps with feature selection, and makes Business Insights easier to obtain.