Methodology, Key Considerations, and FAQs

The purpose of the Methodology, Key Considerations, and FAQs is to answer common questions about the correlationfunnel methodology and to highlight key considerations before and after each step in the correlationfunnel process.

Methodology

The method is to first perform binarization, the process of converting numeric and categorical variables to binary variables, and then to perform correlation analysis (the Pearson method is used by default, as the correlate() function wraps stats::cor()). This method lends itself well to quickly understanding relationships in data.
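The two-step workflow can be sketched in base R. This is a conceptual illustration with hypothetical data, not the package's actual implementation; binarize() and correlate() handle many more details internally.

```r
# Conceptual sketch of the correlationfunnel methodology using base R only.
set.seed(123)
df <- data.frame(
  spend  = rnorm(200, mean = 50, sd = 10),           # numeric predictor
  region = sample(c("north", "south"), 200, TRUE)    # categorical predictor
)
df$target <- as.integer(df$spend + rnorm(200, sd = 5) > 50)

# Step 1: binarization -- numeric columns are cut into quantile bins,
# categorical columns are expanded into one 0/1 flag per level.
spend_bins <- cut(df$spend,
                  breaks = quantile(df$spend, probs = seq(0, 1, length.out = 5)),
                  include.lowest = TRUE)
binary_tbl <- cbind(
  model.matrix(~ spend_bins - 1),          # one flag per numeric bin
  model.matrix(~ region - 1, data = df),   # one flag per category level
  target = df$target
)

# Step 2: Pearson correlation of every binary flag against the target.
correlations <- cor(binary_tbl)[, "target"]
head(sort(correlations, decreasing = TRUE))
```

Sorting the flag-versus-target correlations is what produces the "funnel" ordering in the plot.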

During Binarization

The binarize() function performs the transformation from numeric/categorical data to binary data. Use the following parameters to ensure that the correlation step goes well.

Specify the Number of Bins (Numeric Data) - Use n_bins to control the number of bins that numeric data is binned into. Too many bins and the data becomes overly discrete; too few and trends can be missed. Typically 4 or 5 bins is adequate.
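For intuition, here is a base-R sketch of how the bin count changes granularity. The quantile binning shown here mimics what binarize() does internally, but it is an illustration, not the package's exact code.

```r
# Effect of n_bins on a binned numeric variable (base-R sketch).
set.seed(42)
x <- rnorm(1000)
for (n_bins in c(2, 4, 10)) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))
  bins   <- cut(x, breaks = breaks, include.lowest = TRUE)
  cat(n_bins, "bins ->", nlevels(bins), "binary flags after one-hot encoding\n")
}
```

Each extra bin becomes an extra binary column in the correlation step, so the choice of n_bins directly trades resolution against dimensionality.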

Eliminate Infrequent Categories (Categorical Data) - Use thresh_infreq to control the lumping of infrequent categories. This speeds up the correlation step and prevents the dimensionality (number of columns) from getting out of control. Typically a threshold of 0.01 is adequate.

Prior to Correlation Step

Address Data Imbalance - The correlation analysis is susceptible to data imbalance. Try reducing the number of majority-class rows to reach roughly a 75%-to-25% majority-to-minority split.
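One simple way to do this is random downsampling of the majority class. The sketch below uses base R and hypothetical data; the column name target is an assumption for illustration.

```r
# Downsample the majority class to a 75/25 split (base-R sketch).
set.seed(1)
df <- data.frame(target = rep(c(1, 0), times = c(100, 900)),  # 10% minority
                 x = rnorm(1000))

minority <- df[df$target == 1, ]
majority <- df[df$target == 0, ]

# Keep 3 majority rows per minority row, giving a 75%/25% split.
keep <- sample(nrow(majority), size = 3 * nrow(minority))
balanced <- rbind(minority, majority[keep, ])

prop.table(table(balanced$target))
```

Downsampling discards rows, so on small data sets consider whether the remaining sample is still large enough for stable correlations.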

After Plotting the Correlation Funnel

Garbage In, Garbage Out - Using the correlationfunnel package on data with little underlying relationship will not yield good results. If you are not getting good results after following the aforementioned “Key Considerations”, then your data may simply contain little relationship. This is still useful information: it is a sign that you may need to collect better data.

Something Bad Happened or You Think Something Good Should Have Happened - File an issue on GitHub if you have a question, believe that correlationfunnel should be reporting something differently, or have found an error.

FAQs

1. How does the Correlation Funnel Find Relationships in Numeric Data?

The approach to numeric data is to bin it. This works well when non-linear relationships are present, at the expense of a slight loss in linear-relationship identification. We’ll see examples using synthetic data to illustrate this point.

1.1 Linear Relationships

Let’s make some sample data for Sales versus a highly correlated Macroeconomic Predictor.
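The original code chunk is not reproduced here, but a hedged reconstruction in base R looks like the following. The names and coefficients are illustrative, not the vignette's exact values.

```r
# Synthetic linear example: Sales driven by a macroeconomic predictor.
set.seed(123)
macro_tbl <- data.frame(macro_indicator = rnorm(500, mean = 100, sd = 10))
macro_tbl$sales <- 5000 + 150 * macro_tbl$macro_indicator + rnorm(500, sd = 500)

# Linear relationship: the raw Pearson correlation is very high.
cor(macro_tbl$sales, macro_tbl$macro_indicator)

# Binarized view: flag the top quartile of each variable and correlate the flags.
top_sales <- as.integer(macro_tbl$sales >= quantile(macro_tbl$sales, 0.75))
top_macro <- as.integer(macro_tbl$macro_indicator >=
                          quantile(macro_tbl$macro_indicator, 0.75))
cor(top_sales, top_macro)
```

The flag-versus-flag correlation is somewhat lower than the raw correlation, which is the "slight loss in linear relationship identification" that binning costs.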

We can see that the strongest relationship is between the highest sales bin and the highest macroeconomic indicator bin. The magnitude of the correlation is lower than for the unbinned data, but it’s still relatively high at approximately 0.8.

When we visualize with plot_correlation_funnel(), the macroeconomic predictor trend shows up, indicating that the highest macroeconomic indicator bin is highly correlated with the highest sales bin.

1.2 Nonlinear Relationships

Nonlinear relationships are where binning shines. A raw linear correlation between age and product purchases can look weak, but when we bin the data, the relationship is exposed. The 31-36 age bin has a 0.25 correlation, which is quite high for real data and indicates predictive power. The below-25 bin is negatively correlated at -0.18, and likewise the above-46 bin is negatively correlated. This tells the story that the 31-36 age range is the group most likely to purchase products.
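This nonlinear case can be reproduced with synthetic data. The sketch below is hedged: the age ranges and purchase probabilities are chosen to mimic the pattern described, not taken from the vignette's dataset.

```r
# Synthetic nonlinear example: purchase probability peaks at ages 31-36.
set.seed(456)
age   <- sample(18:65, 2000, replace = TRUE)
p_buy <- ifelse(age >= 31 & age <= 36, 0.45,
                ifelse(age < 25 | age > 46, 0.10, 0.25))
purchased <- rbinom(2000, 1, p_buy)

# The raw linear correlation is weak and hides the relationship...
cor(age, purchased)

# ...but the binary flag for the 31-36 bin exposes it.
flag_31_36 <- as.integer(age >= 31 & age <= 36)
cor(flag_31_36, purchased)
```

A single linear coefficient cannot represent a hump-shaped relationship, but a per-bin flag can, which is why binning recovers the signal here.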

2.2 Highly Skewed Data

At some point, binning becomes impossible because only two unique values exist. Rather than drop the feature, binarize() converts it to a factor() and the one-hot encoding process takes over, using the thresh_infreq argument to convert any low-frequency factor levels into a generic “-OTHER” category.

We can see that the “PDAYS” feature is highly skewed, with almost all values equal to -1.
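A hedged sketch of what such a feature looks like, using synthetic values that mimic “PDAYS” (binarize()'s internal handling may differ in detail):

```r
# A highly skewed feature: nearly all values are -1.
set.seed(7)
pdays <- sample(c(-1, 30, 90), size = 1000, replace = TRUE,
                prob = c(0.95, 0.03, 0.02))

# Quantile breaks collapse: most cutpoints equal -1, so numeric binning
# cannot produce distinct bins.
quantile(pdays, probs = seq(0, 1, by = 0.25))

# Fallback: treat the feature as categorical and one-hot encode it.
pdays_fct <- factor(pdays)
flags <- model.matrix(~ pdays_fct - 1)
colMeans(flags)  # proportion of rows carrying each flag
```

The rare non-(-1) levels would then be candidates for lumping into the “-OTHER” category via thresh_infreq.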

3.1 One-Hot Encoding vs Dummy Encoding

The binarize() function uses One-Hot Encoding by default. The Dummy Encoding method has a key flaw for correlation analysis: it does not show all of the categorical levels. This affects the Correlation Funnel plot, causing potentially highly correlated bins to be lost from the visualization.

One-Hot Encoding is the process of converting a categorical feature into a series of binary features. The default in binarize() is to perform one-hot encoding, which returns a number of columns equal to the number of levels in the category. This creates a series of binary flags to use in the correlation analysis.
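In base R, one-hot encoding can be illustrated with model.matrix(). This is a sketch of the concept only; binarize() produces analogous 0/1 columns with its own naming scheme.

```r
# One-hot encoding: one 0/1 column per categorical level.
education <- factor(c("primary", "secondary", "tertiary",
                      "unknown", "primary", "secondary"))
one_hot <- model.matrix(~ education - 1)  # "- 1" removes the intercept
one_hot
# Every row has exactly one 1, and the number of columns (4) equals
# the number of levels.
```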

Another popular method in statistical analysis is Dummy Encoding. The only real difference is that the number of columns is equal to the number of levels minus one. When zeros are present in each of the new columns (secondary, tertiary, unknown), such as in Row 9, the value means not secondary, tertiary, or unknown, which in turn means primary.
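For contrast, here is the same toy factor under dummy encoding, which is model.matrix()'s default behavior when an intercept is present (again a sketch, not the package's code):

```r
# Dummy encoding: n_levels - 1 columns; one level becomes the reference.
education <- factor(c("primary", "secondary", "tertiary", "unknown"))
dummy <- model.matrix(~ education)[, -1]  # drop the intercept column
dummy
# Row 1 ("primary") is all zeros: "primary" never gets its own column,
# so a highly correlated "primary" bin could never appear in the funnel plot.
```

This missing reference-level column is exactly the flaw described above for correlation analysis.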

Dimensionality reduction is important for the correlation analysis. Categorical data can quickly get out of hand, causing the width (number of columns) to increase, a situation known as High Dimensionality. The one-hot encoding process will add a lot of features if not kept in check.

High Dimensionality is a two-fold problem. First, having many features adds to the computation time needed to analyze the data. This is particularly painful on large data sets (5M+ rows). Second, adding infrequent features (low occurrence) typically adds little predictive value to the modeling process.

To prevent this High Dimensionality situation, the binarize() function contains a thresh_infreq argument that lumps infrequent categories together based on a threshold of their proportion within the rows of the data. For example, thresh_infreq = 0.01 lumps any factor levels present in less than 1% of the data into a new category named by name_infreq = "-OTHER" (the name of the lumped category can be changed).
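The lumping behavior can be approximated in base R. The helper below is hypothetical, written only to show the idea; binarize() handles this internally via thresh_infreq and name_infreq.

```r
# Lump levels occurring in less than `thresh` of rows into "-OTHER".
lump_infrequent <- function(x, thresh = 0.01, other = "-OTHER") {
  props <- prop.table(table(x))
  rare  <- names(props)[props < thresh]
  factor(ifelse(x %in% rare, other, as.character(x)))
}

set.seed(9)
job <- sample(c(rep("admin", 500), rep("technician", 480),
                rep("astronaut", 5), rep("lion-tamer", 3)))
lumped <- lump_infrequent(job, thresh = 0.01)
table(lumped)
```

The two rare levels collapse into a single "-OTHER" flag, so the one-hot step produces 3 columns instead of 4.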

Let’s examine the “JOB” categorical feature from the marketing_campaign_tbl, which has 12 levels, one for each category of job that the customer falls into. We can see that using binarize() creates 12 new features.