CREDIT CARD DEFAULT PREDICTION



The Database

The data for this project was obtained from the UCI Machine Learning Repository. The dataset is named "Default of credit card clients Data Set". It has 30000 instances and 24 attributes. A binary variable, default payment (Yes = 1, No = 0), is the response variable.

Tools Used

RSTUDIO: VERSION 0.98.1091
RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. It works on Windows, Mac, and Linux.

Underlying Concept Used in the Algorithm

A credit card gives its owner the convenience of spending on credit. A lack of cash at the time of purchase is therefore not much of a concern for a card holder, who can buy on credit and pay at a later date. Before extending credit to a borrower, a bank decides whether the borrower is bad (defaulter) or good (non-defaulter). Predicting borrower status, i.e. whether a borrower will be a defaulter or non-defaulter in the future, is a challenging task for a bank. Loan defaulter prediction is a binary classification problem.

MADE BY: ANKITA PAL

R LANGUAGE: R VERSION 3.3.1
R is a programming language and software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. R and its libraries implement a wide variety of statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and others.

GENETIC ALGORITHM
A Genetic Algorithm (GA) is a method for solving both constrained and unconstrained optimization problems, based on a natural selection process that mimics biological evolution. The algorithm repeatedly modifies a population of individual solutions. At each step, it randomly selects individuals from the current population and uses them as parents to produce the children for the next generation. Over successive generations, the population "evolves" toward an optimal solution.

SVM (SUPPORT VECTOR MACHINES)
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. It is mostly used for classification.
In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is performed by finding the hyperplane that best separates the two classes. The Support Vector Machine is the frontier (a hyperplane or line) that best segregates the two classes.

PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The goal is to reduce the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d), increasing computational efficiency while retaining most of the information.

ARTIFICIAL NEURAL NETWORKS
An artificial neuron is a mathematical function conceived as a model of biological neurons. Artificial neurons are the constitutive units of an artificial neural network. The artificial neuron receives one or more inputs (representing dendrites) and sums them to produce an output (representing a neuron's axon). Usually each input is weighted, and the sum is passed through a transfer function. Transfer functions usually have a sigmoid shape, but they may also take the form of other non-linear functions, piecewise linear functions, or step functions.
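The artificial neuron described above can be sketched in a few lines. This is a minimal illustration in Python rather than the project's R code, and it assumes the common formulation in which each input is multiplied by a weight and a bias term is added before the transfer function is applied. The function names (`sigmoid`, `step`, `neuron`) and the example numbers are hypothetical and are not taken from the project.

```python
import math


def sigmoid(z):
    """Sigmoid-shaped (logistic) transfer function; output lies in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def step(z):
    """Step transfer function: 1.0 when the summed input is non-negative."""
    return 1.0 if z >= 0 else 0.0


def neuron(inputs, weights, bias=0.0, transfer=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through a transfer function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return transfer(total)


if __name__ == "__main__":
    # Illustrative values only (assumed, not from the dataset).
    x = [0.5, -1.0, 2.0]
    w = [0.4, 0.3, -0.2]
    print(neuron(x, w, bias=0.1))                 # sigmoid output
    print(neuron(x, w, bias=0.1, transfer=step))  # step output
```

Swapping the `transfer` argument is enough to compare a sigmoid neuron with a step neuron on the same inputs, which mirrors the choice of transfer functions listed in the text.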

Packages used in the project:

CARET
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, variable importance estimation, and other functionality.

DOPARALLEL
The doParallel package is a "parallel backend" for the foreach package. It provides the mechanism needed to execute foreach loops in parallel. The foreach package must be used in conjunction with a package such as doParallel in order to execute code in parallel.

DPLYR
dplyr is a package for data manipulation, written and maintained by Hadley Wickham. It provides easy-to-use functions that are very handy for exploratory data analysis and manipulation.

PLYR
plyr is a set of tools that solves a common set of problems: a big problem is broken down into manageable pieces, each piece is operated on, and then all the pieces are put back together.

NEURALNET
The package allows flexible settings through custom choice of error and activation function. Furthermore, the calculation of generalized weights is implemented.

Outputs
The output screenshots show that the SVM classifier does not classify the customers correctly: it identifies only 3 out of 690 actual defaulters. The Neural Network predictor performs better, identifying 254 of them correctly.

[Output screenshots: SVM; Neural Networks]

[Figure: Flowchart]
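The two counts reported under Outputs can be turned into a comparable rate. The sketch below, written in Python for illustration rather than in the project's R, computes recall (the share of actual defaulters that a classifier identified) from the figures in the text: 3 and 254 correct out of 690 actual defaulters. The helper names are hypothetical. Overall accuracy or precision cannot be derived this way, because the transcript does not give the number of non-defaulters or of false alarms.

```python
def recall(true_positives, actual_positives):
    """Share of actual positives (here: defaulters) that were correctly identified."""
    if actual_positives <= 0:
        raise ValueError("actual_positives must be positive")
    if not 0 <= true_positives <= actual_positives:
        raise ValueError("true_positives must lie between 0 and actual_positives")
    return true_positives / actual_positives


# Counts taken from the Outputs text.
ACTUAL_DEFAULTERS = 690
CORRECTLY_IDENTIFIED = {"SVM": 3, "Neural network": 254}


def recall_by_model(correct_counts, actual_positives):
    """Recall for each model, keyed by model name."""
    return {name: recall(tp, actual_positives) for name, tp in correct_counts.items()}


if __name__ == "__main__":
    for name, value in recall_by_model(CORRECTLY_IDENTIFIED, ACTUAL_DEFAULTERS).items():
        print(f"{name}: {value:.1%} of actual defaulters identified")
```

On these counts the SVM recovers about 0.4% of the actual defaulters and the neural network about 36.8%, which quantifies the gap described in the text.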