For companies that make money on the interest from loans held by their customers, it's always about growing the bottom line. Being able to assess the risk of loan applications can save a lender the cost of holding too many risky assets. It is the data scientist's job to analyze the customer data and craft business rules that will directly impact loan approval.

The data scientists who spend their time building these machine learning models are a scarce resource, and far too often they are siloed into a sandbox:

Although they work with data day in and day out, they are dependent on the data engineers to obtain up-to-date tables.

With data growing at an exponential rate, they are dependent on the infrastructure team to provision compute resources.

Once the model-building process is done, they must trust software developers to correctly translate their model code into production-ready code.

This is where the Databricks Unified Analytics Platform can help bridge those gaps in the workflow chain and reduce friction among data scientists, data engineers, and software engineers.

In addition to reducing operational friction, Databricks is a central location to run the latest machine learning models. Users can leverage the native Spark MLlib package or download any open source Python or R ML package. With Databricks Runtime for Machine Learning, Databricks clusters are preconfigured with XGBoost, scikit-learn, and numpy, as well as popular deep learning frameworks such as TensorFlow, Keras, Horovod, and their dependencies.

Once you have downloaded the data locally, you can create a database and table within the Databricks workspace to load this dataset. For more information, refer to the Create a Table section of the Databricks Documentation (User Guide > Databases and Tables) for AWS or Azure.

In this case, we have created the Databricks database amy and the table loanstats_2012_2017. The following code snippet allows you to access this table within a Databricks notebook via PySpark.
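A minimal sketch of that access, using the database and table names above (the spark session is provided automatically in Databricks notebooks):

# Read the table into a Spark DataFrame
loan_stats = spark.table("amy.loanstats_2012_2017")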

Explore your Data

With the Databricks display command, you can make use of the Databricks native visualizations.

# View bar graph of our data
display(loan_stats)

In this case, we can view the asset allocations by reviewing the loan grade and the loan amount.

Munging your data with the PySpark DataFrame API

As noted in Cleaning Big Data (Forbes), 80% of a data scientist's work is data preparation, and it is often the least enjoyable aspect of the job. But with PySpark, you can write Spark SQL statements or use the PySpark DataFrame API to streamline your data preparation tasks. Below is a code snippet to simplify the filtering of your data.
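A sketch of what that filtering can look like, assuming the Lending Club column names (loan_status, int_rate, term); the notebook itself applies additional cleansing steps:

from pyspark.sql.functions import col, regexp_replace, trim

# Keep only loans that have reached a terminal state, derive a binary bad_loan label,
# and cast the percentage and term strings to numeric types
loan_stats = (loan_stats
  .filter(col("loan_status").isin(["Default", "Charged Off", "Fully Paid"]))
  .withColumn("bad_loan", (~col("loan_status").isin(["Fully Paid"])).cast("string"))
  .withColumn("int_rate", regexp_replace(col("int_rate"), "%", "").cast("float"))
  .withColumn("term", trim(regexp_replace(col("term"), "months", "")).cast("int")))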

After this ETL process is completed, you can use the display command again to review the cleansed data in a scatterplot.

# View scatterplot of our cleansed data
display(loan_stats)

To view this same asset data broken out by state on a map visualization, you can use the display command combined with the PySpark DataFrame API, using groupBy statements with agg (aggregations), as in the following code snippet.
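A sketch of that aggregation, assuming addr_state and loan_amnt are the relevant columns:

from pyspark.sql import functions as F

# Sum loan amounts by state, then render with the native map visualization
display(loan_stats
  .groupBy("addr_state")
  .agg(F.sum("loan_amnt").alias("total_loan_amnt")))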

Training our ML model using XGBoost

While we can quickly visualize our asset data, we would like to see if we can create a machine learning model that will allow us to predict whether a loan is good or bad based on the available parameters. As sketched in the code following this list, we will predict bad_loan (defined as label) by building our ML pipeline as follows:

Execute an imputer to fill in missing values within the numeric attributes (output is numerics_out).

Use indexers to handle the categorical values and then convert them to vectors using OneHotEncoder via oneHotEncoders (output is categoricals_class).

Define the features for our ML pipeline by combining categoricals_class and numerics_out.

Assemble the features together by executing the VectorAssembler.

As noted previously, establish our label (i.e., what we are going to try to predict) as the bad_loan column.
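A minimal sketch of that pipeline, assuming Spark 3.x, that the numeric columns have already been cast to numeric types, and an illustrative split of attributes (the notebook's actual column lists may differ):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, OneHotEncoder, StringIndexer, VectorAssembler

# Illustrative attribute lists; adjust to the columns present in loan_stats
categoricals = ["home_ownership", "purpose", "verification_status", "addr_state"]
numerics = ["loan_amnt", "annual_inc", "dti", "int_rate"]

# Fill in missing values within the numeric attributes
numerics_out = [c + "_out" for c in numerics]
imputers = Imputer(inputCols=numerics, outputCols=numerics_out)

# Index the categorical values, then convert them to vectors via one-hot encoding
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in categoricals]
oneHotEncoders = OneHotEncoder(inputCols=[c + "_idx" for c in categoricals],
                               outputCols=[c + "_class" for c in categoricals])
categoricals_class = [c + "_class" for c in categoricals]

# Combine the categorical and numeric features, then assemble them into a single vector
featureCols = categoricals_class + numerics_out
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")

# Establish bad_loan as the label (i.e., what we are going to try to predict)
labelIndexer = StringIndexer(inputCol="bad_loan", outputCol="label")

pipeline = Pipeline(stages=indexers + [imputers, oneHotEncoders, assembler, labelIndexer])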

While the previous code snippets are in Python, the following code examples are written in Scala to allow us to utilize XGBoost4J-Spark. The notebook series includes Python code that saves the data in Parquet and subsequently reads the data in Scala.

Tune Model using MLlib Cross Validation

We can try to tune our model using MLlib cross validation via CrossValidator, as noted in the following code snippet. We first establish our parameter grid so we can execute multiple runs over different combinations of parameter values. Then, using the same BinaryClassificationEvaluator that we used to test model efficacy, we apply it at this larger scale by combining the BinaryClassificationEvaluator and ParamGridBuilder and passing them to our CrossValidator().
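A sketch of that tuning loop, shown in PySpark for consistency with the earlier snippets (the notebook performs this step in Scala against the XGBoost4J-Spark estimator); GBTClassifier stands in for the XGBoost classifier here, and train_df is an assumed training split containing the assembled features and label columns:

from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Stand-in estimator; the notebook uses an XGBoost4J-Spark classifier here
gbt = GBTClassifier(labelCol="label", featuresCol="features")

# Parameter grid so we can execute multiple runs with different parameter values
paramGrid = (ParamGridBuilder()
  .addGrid(gbt.maxDepth, [3, 5, 7])
  .addGrid(gbt.maxIter, [10, 20])
  .build())

# The same evaluator used to test model efficacy (area under ROC by default)
evaluator = BinaryClassificationEvaluator(labelCol="label")

# Combine the parameter grid and evaluator in k-fold cross validation
cv = CrossValidator(estimator=gbt,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=5)

cvModel = cv.fit(train_df)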