Consumer Complaint / Support Ticket Classification

Document classification experiment using Multiclass Decision Forests, with data taken from the CFPB to model support ticket classification.

# Summary
This experiment uses a multiclass decision forest to categorize documents. The data comes from a
publicly available dataset of consumer complaints about financial products, which makes it a
close analogue of the common task of classifying customer support tickets.
# Description
**Data**
Data is taken from the Consumer Financial Protection Bureau's online database of consumer
complaints about different financial products. The overall dataset contains over 600,000
complaints, over 90,000 of which contain a "Consumer complaint narrative" --
open-ended text describing the problem the consumer is reporting. These complaints span 11
different types of financial products. Because the number of complaints varies substantially across
product types, here we limit ourselves to the 5 most frequent product types:
- Bank accounts
- Credit cards
- Credit reporting
- Debt collection
- Mortgages

Additionally, we restrict ourselves here to complaints with narratives at least 500 characters long,
to ensure a sufficiently rich training set. This yields a dataset of 36,958 consumer complaints.
Besides the product type, complaint narrative, and complaint ID, the dataset includes numerous
other metadata columns that are not used here.
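
The filtering described above can be sketched in plain Python. This is an illustrative sketch, not the experiment's own code: the function name `filter_complaints` and its parameters are hypothetical, the column names are taken from the description above, and ranking products by frequency among rows that have a narrative is an assumption (the description does not say at which stage frequency was measured).

```python
from collections import Counter

# Column names as they appear in the description above.
NARRATIVE = "Consumer complaint narrative"
PRODUCT = "Product"


def filter_complaints(rows, top_n=5, min_chars=500):
    """Keep long-narrative complaints from the top_n most frequent products.

    `rows` is an iterable of dicts, for example from csv.DictReader.
    """
    # Keep only rows that actually contain narrative text.
    with_text = [r for r in rows if (r.get(NARRATIVE) or "").strip()]

    # Assumption: product frequency is measured among rows with a narrative.
    counts = Counter(r[PRODUCT] for r in with_text)
    top_products = {product for product, _ in counts.most_common(top_n)}

    # Apply both restrictions: frequent product and long enough narrative.
    return [
        r for r in with_text
        if r[PRODUCT] in top_products and len(r[NARRATIVE]) >= min_chars
    ]
```

With the downloaded CSV, the rows could be supplied by `csv.DictReader(open("complaints.csv", newline="", encoding="utf-8"))`, where `complaints.csv` is a placeholder file name.
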
**Pipeline**
1. The data goes through a series of transformation and cleaning steps:
   - Converting the "Product" column to be a categorical variable
   - Setting the "Complaint ID" column to be ignored as a predictive feature
   - Subsetting the data to contain only the Complaint ID, Product, and
     Consumer Complaint Narrative columns
   - Using an R script to convert narrative text to lower case
2. Feature extraction is done using AML's native Feature Hashing module,
here set to fairly conservative parameters of unigram features and 12-bit
hashing.
3. A multiclass decision forest model is used. This is one of several
multiclass models available directly in AML.
4. In addition to training a model, the experiment runs cross-validation
(5-fold by default). Summary statistics for cross-validation can be
viewed directly via the output port of the Evaluate Model node, and
predictions from the cross-validation run (collapsed across folds) are
also exported to CSV for inspection.
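
To make step 2 concrete, here is a minimal sketch of unigram feature hashing with 12 bits, which maps every token into one of 2^12 = 4,096 buckets. It only illustrates the idea: `hash_unigrams` is a hypothetical helper, the tokenizer is a simple regular expression, and MD5 is a stand-in hash chosen because it is deterministic across runs. None of this reproduces the internals of AML's Feature Hashing module.

```python
import hashlib
import re


def hash_unigrams(text, bits=12):
    """Map a text to a fixed-length vector of hashed unigram counts."""
    n_buckets = 1 << bits  # 12 bits -> 4096 buckets
    vector = [0] * n_buckets

    # Lower-case and split into simple word tokens (assumed tokenizer).
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        # Stand-in hash: first 4 bytes of MD5, reduced to a bucket index.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_buckets
        vector[index] += 1  # distinct tokens may collide in one bucket

    return vector
```

The vector length is fixed no matter how large the vocabulary grows, at the cost of occasional collisions between unrelated words. Fewer bits mean a smaller feature space and more collisions.
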
**Model Performance**
Cross-validation results show that this model achieves about 82% accuracy on this data.
The nature of the classification errors is particularly interesting. As can be seen in the
confusion matrix displayed by the model evaluation step, errors are concentrated in three
highly plausible areas:
- Bank accounts are mistaken for credit cards
- Credit cards are mistaken for credit reporting
- Debt collection is mistaken for credit reporting

All three of these error types involve topic areas that overlap in real life. Many credit cards are bank-issued,
credit reporting is an integral step in securing a card, and credit reporting is also almost always directly
implicated in the debt collection process. These errors may well indicate that the labels applied to
this data are not entirely mutually exclusive, an important consideration in any classification task.
The errors are also not symmetrical, which may indicate further structural properties of the data.
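
The asymmetry can be checked directly from the exported cross-validation predictions. The sketch below assumes the true and predicted labels have already been read from the CSV into two parallel lists. The function names are hypothetical, and the exact column names of the export are not given here, so loading is left out.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(counts):
    """Fraction of predictions where the predicted label equals the actual one."""
    total = sum(counts.values())
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / total if total else 0.0


def asymmetric_pairs(counts):
    """List (a, b, a_as_b, b_as_a) for every label pair with any confusion."""
    labels = sorted({label for pair in counts for label in pair})
    result = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            a_as_b = counts[(a, b)]  # actual a, predicted b
            b_as_a = counts[(b, a)]  # actual b, predicted a
            if a_as_b or b_as_a:
                result.append((a, b, a_as_b, b_as_a))
    return result
```

A pair where `a_as_b` is much larger than `b_as_a` is an error that runs mostly in one direction, which is the pattern described above.
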