Machine Learning with Spark – Part 4 : Determining Credibility of a Customer

In the earlier posts of Machine Learning with Spark, we saw what our data looks like, along with its headers. However, that description is not sufficient to provide a complete business view. To fully grasp the problem, we should understand every attribute of the data. The following mapping from each attribute's numerical value to its actual categorical value tells us which attribute value corresponds to what significance in the actual business.

Attributes:

Now, let's do a quick check on the average account balance, credit amount, and loan duration for each creditability class.

# register the Customers frame as a table
Customers.registerTempTable("credit")

# query the credit table to check the average balance, average loan amount and
# average duration for each class of customer, i.e. 1 and 0
results = sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, \
    avg(amount) as avgamt, avg(duration) as avgdur FROM credit GROUP BY creditability")

# check the result of the query
results.show()

+-------------+------------------+------------------+------------------+
|creditability|        avgbalance|            avgamt|            avgdur|
+-------------+------------------+------------------+------------------+
|          1.0|2.8657142857142857| 2985.442857142857|19.207142857142856|
|          0.0|1.9033333333333333|3938.1266666666666|             24.86|
+-------------+------------------+------------------+------------------+

Similarly, we can do a quick check on the statistical summary of all numerical columns in the data frame, as shown below.

Customers.describe().show()

It is very difficult to read the summary for every column at once in this combined output. Therefore, we can view the summary of each column separately.

The above stats do not give a complete view of the data: although all the values look numerical, not all of them are continuous. Hence, these stats may not be meaningful for every column. Barring 'creditability', 'age' and 'amount', every other column is categorical, i.e. its values are binned into pre-defined categories. You can check those categories in the mapping table above. Stats such as mean, min, max and standard deviation are useful for continuous variables; for categorical variables, we will have to break the data into groups before we can get a useful summary. First, let's take a look at the distinct categories of each attribute in the data.

+-------+------------------+
|summary|            amount|
+-------+------------------+
|  count|              1000|
|   mean|          3271.248|
| stddev|2822.7517598956515|
|    min|             250.0|
|    max|           18424.0|
+-------+------------------+

Since the goal of this analysis is to find the creditability of a customer using various attributes, let's keep our EDA creditability-oriented. Let's find out how the savings column relates to creditability, i.e. the count of customers who fall into the various savings buckets and, among them, how many are credible for the loan and how many are not.

People who are not credible for the loan are mostly those who do not have any savings. The same information can be shown more concisely using Spark's pivot functionality, which also looks more presentable.

Customers.groupBy("creditability").pivot("savings").count().show()

+-------------+---+---+---+---+---+
|creditability|1.0|2.0|3.0|4.0|5.0|
+-------------+---+---+---+---+---+
|          0.0|217| 34| 11|  6| 32|
|          1.0|386| 69| 52| 42|151|
+-------------+---+---+---+---+---+

Similarly, we can see the summary of creditable people according to their account balance.

Customers.groupBy("creditability").pivot("balance").count().show()

+-------------+---+---+---+---+
|creditability|1.0|2.0|3.0|4.0|
+-------------+---+---+---+---+
|          0.0|135|105| 14| 46|
|          1.0|139|164| 49|348|
+-------------+---+---+---+---+

We can check the summary of people as per their payment history as well.

Customers.groupBy("creditability").pivot("history").count().show()

+-------------+---+---+---+---+---+
|creditability|0.0|1.0|2.0|3.0|4.0|
+-------------+---+---+---+---+---+
|          0.0| 25| 28|169| 28| 50|
|          1.0| 15| 21|361| 60|243|
+-------------+---+---+---+---+---+

We can also check the summary of the purpose of loans, as shown below.

+-------------+---+---+---+
|creditability|1.0|2.0|3.0|
+-------------+---+---+---+
|          0.0| 70|186| 44|
|          1.0|109|528| 63|
+-------------+---+---+---+

We can do numerous statistical reviews of the data like the ones shown above. In our next post, we will look at how to summarize the data with visualizations.

Hope this post has been helpful in understanding the various attributes of the data. In case of any queries, feel free to comment below and we will get back to you at the earliest.
Stay tuned to our blog for more posts on Big Data and other technologies.


Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.