We use the data set from Kaggle's Give Me Some Credit competition, which contains 150,000 observations with 10 features. The objective is to predict whether a borrower will have a serious delinquency (90+ days past due) within 2 years.

Using the methods described, we were able to attain an AUC score of 0.866752 on the private leaderboard.

There's someone aged 0, which should be an error. Let's set it to 21, the minimum age that makes sense. We could also set it to the median age, but with 150,000 observations the choice makes no practical difference.
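A minimal sketch of the fix, with a toy frame standing in for the real data (the real notebook works on the full `df`):

```python
import pandas as pd

# Toy frame standing in for the real dataset's 'age' column
df = pd.DataFrame({'age': [0, 45, 38, 62]})

# Replace the impossible age 0 with 21, the minimum plausible age
df.loc[df['age'] == 0, 'age'] = 21
```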

The values only start going crazy at some point after the 99.8th percentile, so there are fewer than 300 "irregular" values.

In [13]:

freq.quantile(q=[0.998])

Out[13]:

              0
0.998  2.761009

In [14]:

df[df['MonthlyIncome'] >= 2.761009].head()

Out[14]:

|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 3 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 4 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |

With a median of 0.154 and a mean of 6.04, there is huge positive skewness. It's possible that there is some inconsistency in the data: even accounts that are victims of fraud will not have utilization rates in the thousands. It is possible that some values were recorded as dollar amounts, or that when credit card limits were unavailable, 1 was substituted in the denominator.

Regardless, we should deal with this problem in some way, such as censoring all amounts above some threshold. We will play it safe and use 2.5.
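A minimal sketch of the censoring step, on a toy series standing in for RevolvingUtilizationOfUnsecuredLines (the values are made up for illustration):

```python
import pandas as pd

# Toy series standing in for RevolvingUtilizationOfUnsecuredLines
util = pd.Series([0.15, 0.80, 1.20, 2.90, 4600.0])

# Censor (cap) everything above the 2.5 threshold
util_capped = util.clip(upper=2.5)
```

Capping rather than dropping keeps all 150,000 observations while taming the handful of implausible values.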

There are some outliers in these values. While it may be feasible, if unlikely, for someone to have been past due nearly 100 times, in the actual data these people with huge numbers of defaults don't seem any more likely to default than anyone else.

We can see that the correlation with delinquency within 2 years increases with the number of days of lateness.

Past lateness alone is not a very accurate predictor of future delinquency, but it is a valuable contributor. In a linear model, the probability of delinquency conditional on prior lateness would be higher depending on the degree of past lateness. In a decision tree, we can imagine, for example, that a path with a larger number of prior late notices, low income, high debt, and a larger number of dependents would be very likely to end in a serious delinquency. Conditional probability is one of the most powerful concepts in the universe.

There is no reason to suspect any of these numbers, even the 2 outliers with 13 and 20 dependents; those numbers are not out of the question at all. We could set the 2 outliers to 10, but we'll leave them be for now.

There is missing data, which we will fill with the median, which turns out to be just 0.
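A minimal sketch of the median fill, assuming the column in question is NumberOfDependents (toy values for illustration; here, as in the real data, the median is 0):

```python
import numpy as np
import pandas as pd

# Toy column with missing values standing in for NumberOfDependents
deps = pd.Series([0.0, 2.0, np.nan, 0.0, 1.0, np.nan, 0.0])

# Fill missing entries with the column median
deps_filled = deps.fillna(deps.median())
```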

Like unsecured credit utilization, there is a small number of outliers that we can choose to censor; let's set 15 as the threshold. These higher numbers of real estate loans don't necessarily give us much relevant information.

As with revolving utilization, the ratio is not very informative when the denominator is very low. A value of 329664 doesn't make much sense as a debt ratio, but it makes a lot of sense as a dollar amount. Indeed, the documentation describes DebtRatio as debt payments, alimony, and living costs divided by monthly gross income. We can surmise that when income was 0 or missing, 1 was substituted.

This actually gives us an interesting opportunity to create a monthly payments feature. But first, let's look at the accounts with a monthly income of 1. It's very interesting that there are significantly fewer delinquencies among people with an income of 1 than among those with 0, and fewer even than among those with higher incomes. This could simply mean that those with lower incomes receive fewer loans, or that those recorded with 0 income did not have 0 income when the loans were taken.

It's probably best to leave these values as they are. While, for example, 1000 might have mistakenly been entered as 1, overall the non-zero low-income observations have very low delinquency rates, and it would be unwise to make any assumptions. The "1" can mean anything: unknown income, 0 income, people with trust funds, people with non-traditional incomes, etc. But they are a very small group (605 observations) with an exceptionally low rate of serious delinquency.

From monthly income and debt ratio, we can determine the monthly payments. This could be extremely valuable, since people with the same debt ratio may be in vastly different economic situations. Imagine 3 cases with a 1.0 debt ratio:

\$200 income, \$200 payment

\$2000 income, \$2000 payment

\$20000 income, \$20000 payment

The first case seems like a college student with a part time job and limited expenses, but who has their room and board covered by their parents. The second case could be a working class person struggling to make ends meet, while the third case could be someone with a high income who is borrowing for investment purposes, has a gambling problem, or got hit with a nasty divorce settlement. Each person has their own distinct challenges when it comes to repaying their loans. Consider a situation where all three lose their jobs: the third person is going to have the most to make up for.

These values can also interact with age: for the first case, the person being 21 or 22 would be a vastly different situation than if it was a 50 year old. In short, the separate dollar amounts can give us different information than the ratios.
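Since DebtRatio is payments divided by income, the payments feature can be recovered by a simple multiplication. A minimal sketch using the three hypothetical cases above:

```python
import pandas as pd

# The three hypothetical 1.0-debt-ratio cases from the text
df = pd.DataFrame({'DebtRatio': [1.0, 1.0, 1.0],
                   'MonthlyIncome': [200.0, 2000.0, 20000.0]})

# MonthlyPayments = DebtRatio * MonthlyIncome
df['MonthlyPayments'] = df['DebtRatio'] * df['MonthlyIncome']
```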

# Since log 1 is 0 and we're "setting" all 0 values to 1 anyway, this code is
# effectively the same as that without prematurely filling in the missing values with 0. A win-win-win
# Also do the same to the monthly payments
df['logMI'] = [np.log(v) if v > 0 else v for v in df['MonthlyIncome']]
df['logMP'] = [np.log(v) if v > 0 else v for v in df['MonthlyPayments']]

Ignoring the 0s, there is something very "normal" looking about log wages, which should not be shocking to anyone in a dataset of this size. Log payment is negatively skewed. In any case, the log transformations make these data much more manageable.

Let's transform the n feature dimensions into 2. t-SNE emphasizes placing similar points near each other, as opposed to methods like PCA, which focus on maximizing variance and separating differences. We can see that t-SNE clusters a large number of observations related to delinquency together.

tsne = TSNE(early_exaggeration=4.0, learning_rate=25.0, n_iter=250)
# this stuff takes a long time, let's take a smaller sample
X_samp, X_notused, y_samp, y_notused = train_test_split(X, y, test_size=0.33)
transtsne = tsne.fit_transform(X_samp)
tsnedf = pd.DataFrame(transtsne, columns=['1', '2'])
y_samp.index = range(0, len(tsnedf))
tsnedf['class'] = y_samp
sns.lmplot(x='1', y='2', data=tsnedf, hue='class', size=10, palette="Set2",
           fit_reg=False, scatter_kws={'alpha': 0.2})

We can see that there are significant overlaps, but there are regions with clearly more class 0s than 1s. In the t-SNE plot in particular, there is a major high-risk concentration near the center of the cloud.

We can see that logistic regression performs noticeably better on the test set than random forest. It is our experience that in many binary classification problems - especially in credit risk problems - random forest severely overfits the training set. In fact, there usually exists some sufficiently high number of trees (n_estimators) such that random forest achieves 100% training set accuracy. Of course, such a model would not be a good predictor.

Let's try to set an optimal risk threshold for a lender, since lending to someone with a slightly below 50% chance of default is probably not a great idea. We will stick with the gradient boosted model since that's what we ultimately used for our Kaggle submission.

Compared to simply giving everyone a loan, we deny 1547 bad borrowers at the cost of denying 5381 good borrowers. With more data, it would be possible to go further and estimate the present values of these loans to predict a monetary value for a particular lending scheme. It is reasonable to assume that the average amount lost from a delinquency greatly exceeds the discounted profit from a good loan.
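A minimal sketch of how such a threshold rule plays out; the probabilities here are made up and stand in for the gradient boosted model's predict_proba output:

```python
import numpy as np

# Hypothetical predicted default probabilities and true outcomes
probs = np.array([0.05, 0.30, 0.55, 0.80, 0.10, 0.65])
y_true = np.array([0, 0, 1, 1, 0, 0])

threshold = 0.5              # deny a loan when P(default) exceeds this
deny = probs > threshold

denied_bad = int((deny & (y_true == 1)).sum())   # bad borrowers correctly denied
denied_good = int((deny & (y_true == 0)).sum())  # good borrowers denied as collateral damage
```

Sweeping `threshold` over a grid and costing out each (denied_bad, denied_good) pair is the natural next step when loss and profit estimates are available.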

We can use a more traditional method such as weight of evidence (WoE) to assess age.

WoE = log(% non-events / % events), where the "event" in question is delinquency.
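A concrete sketch of the WoE calculation on a toy binned table (bin labels and counts are made up for illustration); the Information Value (IV) discussed below sums the WoE weighted by the gap between the two distributions:

```python
import numpy as np
import pandas as pd

# Toy age bins with counts of events (delinquencies) and non-events
tab = pd.DataFrame({'bin': ['<30', '30-50', '50+'],
                    'events': [40, 50, 10],
                    'non_events': [160, 450, 290]})

pct_event = tab['events'] / tab['events'].sum()
pct_non_event = tab['non_events'] / tab['non_events'].sum()

# WoE per bin: log of the non-event share over the event share
tab['WoE'] = np.log(pct_non_event / pct_event)

# IV: distribution gap weighted by WoE, summed over bins
iv = ((pct_non_event - pct_event) * tab['WoE']).sum()
```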

Age is technically already binned, since time is a continuous variable and we're rounding down to whole numbers. Still, grouping them further may make it more useful. As we saw from the income analysis above, there is evidence that income groups can be usefully binned. However, we will stick with age for now.

In [117]:

plt.figure(figsize=(8, 8))
df['age'].hist(bins=100)
plt.title('Number of loans by age')

It seems that age by itself is not worth binning, given such low IV scores. However, these scores only apply to age categories on their own. It is entirely possible that they can be the basis for useful features. A few that come to mind include the log difference between an individual's income and their age group average, or a similar measure of dependents relative to the group average.
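A minimal sketch of the log-difference-from-group-average idea, on a toy frame (the `ageband` column and values are hypothetical stand-ins for binned ages):

```python
import numpy as np
import pandas as pd

# Toy frame; 'ageband' stands in for a binned age column
df = pd.DataFrame({'ageband': ['20s', '20s', '30s', '30s'],
                   'MonthlyIncome': [1500.0, 2500.0, 4000.0, 6000.0]})

# Each row's age-group mean income, broadcast back to row level
group_mean = df.groupby('ageband')['MonthlyIncome'].transform('mean')

# Negative = earning below one's age group, positive = above
df['logIncomeDiff'] = np.log(df['MonthlyIncome']) - np.log(group_mean)
```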

Measures such as these may give additional information about an individual's economic situation and ability to pay, and they would be useful to explore in any further research.

In terms of the baseline models, logistic regression performs very well. This is not very surprising, as it is the traditional method of risk classification, with nearly 100 years of theoretical development and practical use. It is also a deterministic model that requires relatively little computing power, and so it can easily and quickly be reproduced on even a mobile device. Being deterministic, its results are 100% consistent for any given set of observations and features. However, on average, the results from boosting methods are clearly superior. Without using XGBoost or more ensemble methods, we managed to achieve a score of 0.866752 on the Kaggle private leaderboard with just a randomized cross-validated gradient boosting model.

The log transformations removed many of the outliers, but it's not clear how much of a difference this made in our final model, since decision tree methods tend to be robust to monotonic transformations (which are order-preserving). However, they did make the data more palatable and easier to visualize. In addition to the age-based features mentioned above, we could try traditional methods like including squared features or trying different interactions and ratios between the existing features. We could also try to improve monthly income estimation, which is in itself a large topic. Finding more uses for age could be a major component of that as well.
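A quick sketch of the squared and interaction features mentioned, on a toy frame (the column choices are illustrative, not the ones we actually tested):

```python
import pandas as pd

# Toy frame with two of the dataset's columns
df = pd.DataFrame({'age': [25, 40, 55],
                   'DebtRatio': [0.2, 0.5, 0.9]})

# Squared term and a simple pairwise interaction
df['age_sq'] = df['age'] ** 2
df['age_x_debt'] = df['age'] * df['DebtRatio']
```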

The prize winning team used over 20 features and was developed over the course of several months. Intellectually, it would have been fascinating to have been a fly on the wall during their deliberations. But in a more practical sense, we have to consider the tradeoff between time and money, and the diminishing returns from spending a lot more time to make relatively minor improvements in the model.
