Predicting borrowers' chance of defaulting on credit loans

Junjie Liang

Abstract

Credit score prediction is of great interest to banks, because the outcome of the prediction algorithm is used to determine whether borrowers are likely to default on their loans, which in turn affects whether a loan is approved. In this report I describe an approach to credit score prediction using random forests. The dataset was provided by Kaggle as part of the contest "Give Me Some Credit". My model based on random forests made rather good predictions of the probability of a loan becoming delinquent. I was able to get an AUC score of , placing me at position 122 in the contest.

1 Introduction

Banks often rely on credit prediction models to determine whether to approve a loan request. To a bank, a good prediction model is necessary so that it can provide as much credit as possible without exceeding a risk threshold. For this project, I took part in a competition hosted by Kaggle in which a labelled training dataset of 150,000 anonymous borrowers is provided, and contestants are asked to label a test set of 100,000 borrowers by assigning each borrower a probability of defaulting on their loans within two years. The features given for each borrower are described in Table 1.

Table 1: List of features provided by Kaggle

SeriousDlqin2yrs
    Person experienced 90 days past due delinquency or worse. Type: Y/N
RevolvingUtilizationOfUnsecuredLines
    Total balance on credit cards and personal lines of credit (except real estate and no installment debt like car loans) divided by the sum of credit limits. Type: percentage
Age
    Age of borrower in years.
NumberOfTime30-59DaysPastDueNotWorse
    Number of times borrower has been 30-59 days past due but no worse in the last 2 years.
DebtRatio
    Monthly debt payments, alimony and living costs divided by monthly gross income. Type: percentage
MonthlyIncome
    Monthly income. Type: real
NumberOfOpenCreditLinesAndLoans
    Number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g. credit cards).
NumberOfTimes90DaysLate
    Number of times borrower has been 90 days or more past due.
NumberRealEstateLoansOrLines
    Number of mortgage and real estate loans, including home equity lines of credit.
NumberOfTime60-89DaysPastDueNotWorse
    Number of times borrower has been 60-89 days past due but no worse in the last 2 years.
NumberOfDependents
    Number of dependents in family excluding themselves (spouse, children etc.).

2 Method

Random forests is a popular ensemble method invented by Breiman and Cutler. It was chosen for this contest because of the many advantages it offers. Random forests run efficiently on large databases and, by their ensemble nature, do not require much supervised feature selection to work well. Most importantly, they support not just classification but regression outputs as well. However, there are still parameters that need to be tuned to improve the performance of the random forest.

2.1 Data Imputation

The data provided by Kaggle were anonymous data taken from a real-world source, so the input can be expected to contain errors. From observation, I determined that there were three main types of data that required imputation:

1. Errors from user input: These errors probably came from typos during data entry. For example, some of the borrowers' ages were listed as 0 in the dataset, suggesting that the values might have been entered wrongly.

2. Coded values: Some of the quantitative values within the dataset were actually coded values with qualitative meanings. For example, under the column NumberOfTime30-59DaysPastDueNotWorse, a value of 96 represents "Others", while a value of 98 represents "Refused to say". These values need to be replaced so that their large quantitative values do not skew the entire dataset.

3. Missing values: Lastly, some entries in the dataset were simply listed as NA, so these values need to be filled in before running the prediction. In particular, NA values were found for the features NumberRealEstateLoansOrLines and NumberOfDependents.

Since the choice of imputation method could significantly affect the outcome of the prediction algorithm, I tested several ways of doing the imputation:

1. Just leave the data alone: This method can only be used for errors in the first two categories. Since the random forest algorithm works by finding appropriate splits in the data, it is not adversely affected even if the input contains coded keys with large quantitative values. However, this does not work for the NA entries in the data, so for this method I removed the features NumberRealEstateLoansOrLines and NumberOfDependents completely from the dataset.

2. Substitute with the median value: Another way to handle the missing data is to use the median of all valid values within the feature vector. In effect, this fills in the missing entry with a sample from the dataset that should not affect predictions greatly.

3. Coded value of -1: Since the randomforest library that I was using would not take NA as an input entry, I simply replaced each NA with a value of -1 (essentially, a value that did not appear anywhere else), so that NA is seen as a separate value in the data.

As will be described below, tests showed that method 3 (filling in missing entries with -1) gave the best prediction performance.
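As an illustration of imputation methods 2 and 3, here is a minimal Python sketch. It is not the code used for the report: the helper names impute_median and impute_coded, the use of None to stand for NA, and the toy column are all assumptions made for the example.

```python
from statistics import median

NA = None  # assumption: a missing (NA) entry is represented as None


def impute_median(column):
    """Method 2: replace missing entries with the median of the valid values."""
    observed = [v for v in column if v is not NA]
    fill = median(observed)
    return [fill if v is NA else v for v in column]


def impute_coded(column, code=-1):
    """Method 3: replace missing entries with a sentinel used nowhere else."""
    if any(v == code for v in column if v is not NA):
        raise ValueError("sentinel value already appears in the column")
    return [code if v is NA else v for v in column]


# Hypothetical NumberOfDependents column with two missing entries.
dependents = [0, 2, NA, 1, NA, 3]
print(impute_median(dependents))  # [0, 2, 1.5, 1, 1.5, 3]
print(impute_coded(dependents))   # [0, 2, -1, 1, -1, 3]
```

The check in impute_coded reflects the requirement stated above that the coded value must not collide with a genuine value, so that a tree split can isolate the rows that were originally NA.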

Lastly, the labels in the dataset were skewed: there were about 14 times more non-defaulters than defaulters. This is to be expected, since most people do not default on their loans, or try not to. However, it will tend to skew the predictions made by the random forest as well. To overcome this problem, I "rebalanced" the input by repeating each defaulter row 14 times, so that overall the input file contained roughly equal numbers of defaulters and non-defaulters.

2.2 Parameter Tuning

One of the nice features of random forests is that relatively few parameters need to be tuned. In particular, the parameters that I tested for optimization were:

1. Sample size: Size of the sample to draw at each iteration of the split when building the random forests.

2. Number of trees: The number of trees to grow. This value cannot be too small, to ensure that every input row gets predicted at least a few times. In theory, setting this value to a large number will not hurt, as the random forest should converge. However, in my tests a value that was too large appeared to reduce the accuracy of the prediction, possibly due to overfitting.

2.3 Experimental Setup

For the test setup, I built a data pipeline that first took the raw input and passed it through the data imputation stage. Thereafter, the parameters of the random forest were chosen, and a 4-fold cross-validation was done on the labeled inputs. An AUC score was calculated for each of the 4 runs, and the average AUC was used as an estimate of the performance of the algorithm on the actual test data. The experimental setup is shown in Figure 1.

[Figure 1: Experimental workflow]

3 Results

3.1 Methods of data imputation

I shall now present some of the findings from my experiments. Among the three methods of doing data imputation, it was clear that the method which used coded values gave the best results. This could be because borrowers who had NA in their entries were grouped together and used to predict each other's credit reliability, which implicitly means that a nearest-neighbor-like match was being done. The results of the three methods can be seen in Figure 2.
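The rebalancing step of Section 2.1 and the averaged 4-fold AUC of Section 2.3 can be sketched in plain Python as follows. This is an illustrative sketch, not the pipeline used for the report. All function names are hypothetical, fit_centroid is a toy stand-in for the random forest, and the folds are dealt out in row order, which assumes the rows are already shuffled. The sketch also rebalances only the training part of each fold so that repeated rows never appear in the held-out part; the report does not say whether its pipeline did this.

```python
def oversample(rows, labels, positive=1):
    """Repeat each positive (defaulter) row so both classes are roughly even."""
    n_pos = sum(1 for y in labels if y == positive)
    if n_pos == 0:
        raise ValueError("no positive rows to repeat")
    repeats = max(1, round((len(labels) - n_pos) / n_pos))
    out_rows, out_labels = [], []
    for x, y in zip(rows, labels):
        copies = repeats if y == positive else 1
        out_rows.extend([x] * copies)
        out_labels.extend([y] * copies)
    return out_rows, out_labels


def auc(labels, scores):
    """AUC: chance that a positive scores above a negative (ties count half).

    Quadratic in the number of rows; fine for a sketch, slow for 150,000 rows.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one row of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=4):
    """Deal row indices 0..n-1 into k folds (assumes rows are pre-shuffled)."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validated_auc(rows, labels, fit, k=4):
    """Average AUC over k folds; fit(rows, labels) returns a scoring function."""
    fold_aucs = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train_x = [rows[i] for i in range(len(rows)) if i not in held]
        train_y = [labels[i] for i in range(len(rows)) if i not in held]
        model = fit(*oversample(train_x, train_y))  # rebalance training part only
        scores = [model(rows[i]) for i in held_out]
        fold_aucs.append(auc([labels[i] for i in held_out], scores))
    return sum(fold_aucs) / k


def fit_centroid(rows, labels):
    """Toy model (not a random forest): score by distance to the class means."""
    pos = [x for x, y in zip(rows, labels) if y == 1]
    neg = [x for x, y in zip(rows, labels) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: abs(x - mean_neg) - abs(x - mean_pos)


# Toy data: one feature that separates the classes perfectly.
labels = [1 if i % 5 == 0 else 0 for i in range(40)]
rows = [float(y) for y in labels]
print(cross_validated_auc(rows, labels, fit_centroid))
```

To try this with a real model, fit would wrap the training call of whichever random forest library is used and return a function that maps one row to a predicted default probability.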

[Figure 2: Comparing different ways of doing data imputation. Average AUC from 4-fold cross-validation for the leave-alone, median and coded-value methods, with 500 and 1000 trees.]

3.2 Number of inputs sampled at each split

This number refers to the number of inputs sampled at each iteration when building the forest. I tested the prediction performance with different sample sizes, all using the coded-value heuristic for data imputation (since it gave the best performance among the three methods). The results show that a larger sample size gives less accurate predictions. This could be because of overfitting to the dataset. The results of the tests can be seen in Figure 3.

[Figure 3: Comparing different sampling sizes]

3.3 Number of trees grown

Lastly, using the best performers for the previous two parameters, I ran tests with 500 and 1000 trees grown. Most of the learning occurred within the first 350 trees that were grown, and only incremental changes appeared thereafter. Nonetheless, the performance appeared to be slightly better with 500 trees than with 1000 trees. Again, I suspect this was due to overfitting of the data. Results are shown in Figure 4.

[Figure 4: Comparing number of trees built in the forest. Average AUC from 4-fold cross-validation.]

3.4 Submission to Kaggle

The above results were calculated by doing a 4-fold cross-validation on the labeled data provided by Kaggle. After the best set of parameters had been chosen, the model was trained with those parameters on the full labeled training data and used to predict the unlabelled test data. The predictions were then submitted to Kaggle, which did the final AUC scoring. Using the parameters chosen above, I got an AUC of . As a reference, the top team got a much better result, an AUC score of .

4 Discussion

Simply by tweaking a few parameters, I was able to get rather good prediction results on the dataset. This is an example of the strength of random forests: the default parameters are generally quite good. From discussions that took place on Kaggle after the competition was over, some of the teams that also used random forests were able to get much better results, with smarter ways of doing data imputation and by combining random forests with other ensemble methods such as gradient boosting.
