We also walked through a detailed credit card fraud detection use case, from how the data typically gets collected to data wrangling, building a model, tuning the model, and operationalizing the model for a business to use in their production environment.

The goal of this blog post is to share with you more details on this end-to-end credit card fraud detection solution that we built using Python and Azure Data Science Virtual Machine.

Business Scenario

Recent advancements in computing technologies along with the increasing popularity of eCommerce platforms have radically amplified the risk of online fraud for financial services companies and their customers. Failing to properly recognize and prevent fraud results in billions of dollars of loss per year for the financial industry. This trend has urged companies to look into many popular artificial intelligence (AI) techniques, including deep learning for fraud detection. Deep learning can uncover patterns in tremendously large data sets and independently learn new concepts from raw data without extensive manual feature engineering. For this reason, deep learning has shown superior performance in domains such as object recognition and image classification.

Data Set

For this solution we used a sample data set from Kaggle that contains transactions made by credit cards in September 2013 by European cardholders. These transactions occurred over two days, with 492 frauds out of 284,807 transactions; the positive class (frauds) accounts for only 0.172% of all transactions.

The data set can be summarized as follows:

· Features V1, V2, … V28: the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’

· Feature Time: the seconds elapsed between each transaction and the first transaction in the data set

· Feature Amount: the transaction amount

· Feature Class: the response variable; it takes the value 1 in case of fraud and 0 otherwise

Machine Learning Approach

For this scenario we used a specific type of neural network called an autoencoder. An autoencoder is trained to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input.

The network may be viewed as consisting of two parts:

· an encoder function h=f(x)

· a decoder that produces a reconstruction r = g(h)

We optimize the parameters of our autoencoder so that a specific kind of error, the reconstruction error, is minimized.
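The reconstruction error can be made concrete with a small numerical sketch in plain NumPy. The linear encoder and decoder below use random weights purely for illustration; a real autoencoder learns these weights by gradient descent:

```python
import numpy as np

# Toy input: 5 transactions with 4 features each
x = np.random.RandomState(0).randn(5, 4)

# Hypothetical linear encoder f and decoder g (random weights for
# illustration; training would learn W_enc and W_dec)
rng = np.random.RandomState(1)
W_enc = rng.randn(4, 2)   # compress 4 features into a 2-dimensional code h
W_dec = rng.randn(2, 4)   # reconstruct 4 features from the code

h = x @ W_enc             # h = f(x), the code
r = h @ W_dec             # r = g(h), the reconstruction

# Mean squared reconstruction error: the quantity training minimizes
reconstruction_error = np.mean((x - r) ** 2)
```

At test time, transactions that the trained network reconstructs poorly (high reconstruction error) are flagged as likely frauds, since the network has mostly learned to reconstruct normal transactions.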

Environment Set Up

To build our solution, we used a Data Science Virtual Machine (DSVM), an Azure virtual machine (VM) image that comes preinstalled and configured with several tools for data analytics and machine learning. The Data Science Virtual Machine jump-starts your analytics project, and you can work on tasks in various languages, including R, Python, SQL, and C#.

v. Subscription. If you have more than one subscription, select the one on which the machine is to be created and billed.

vi. Resource Group. You can create a new one or use an existing group.

vii. Location. Select the data center that’s most appropriate. For fastest network access, it’s the data center that has most of your data or is closest to your physical location.

b. Size. Select one of the server types that meets your functional requirements and cost constraints. For more choices of VM sizes, select View All.

c. Settings:

i. Use Managed Disks. Choose Managed if you want Azure to manage the disks for the VM. If not, you need to specify a new or existing storage account.

ii. Other parameters. You can use the default values. If you want to use nondefault values, hover over the informational link for help on the specific fields.

d. Summary. Verify that all the information you entered is correct. Select Create.

Model Development

We developed our model using Python and Azure Notebooks. First of all, you need to prepare your environment and import the necessary components:
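The exact imports from the original notebook are not reproduced here; a typical setup for this workflow (pandas and NumPy for data handling, scikit-learn for preprocessing and splitting) might look like the following, with a fixed seed for reproducibility:

```python
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

np.random.seed(42)  # make results reproducible across runs
```

The deep learning framework itself (Keras in the sketches below) is imported later, where the autoencoder is defined.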

You can now enter the credentials to access the data from the cloud and then download the file for analysis:
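The original credentials are, of course, not shown. As an illustrative sketch, assuming the file sits in Azure Blob Storage at a public or SAS-signed URL, the download can be done with the standard library; the account, container, and blob names below are placeholders:

```python
from urllib.request import urlretrieve

def blob_url(account: str, container: str, blob: str, sas_token: str = "") -> str:
    """Build the HTTPS URL of a blob in Azure Blob Storage.

    Pass a SAS token when the container is not publicly readable.
    """
    url = f"https://{account}.blob.core.windows.net/{container}/{blob}"
    return f"{url}?{sas_token}" if sas_token else url

# Placeholder names -- replace with your own storage account details
url = blob_url("mystorageaccount", "fraud-data", "creditcard.csv")
# urlretrieve(url, "creditcard.csv")  # uncomment to download the file
```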

Import the credit card data set:
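Assuming the file was downloaded as creditcard.csv, loading it with pandas is a one-liner; checking the class balance is also worthwhile, given how rare frauds are:

```python
import pandas as pd

def load_transactions(path: str) -> pd.DataFrame:
    """Read the Kaggle credit card data set (Time, V1..V28, Amount, Class)."""
    return pd.read_csv(path)

# df = load_transactions("creditcard.csv")
# print(df["Class"].value_counts())  # 0 = legitimate, 1 = fraud
```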

For the modelling piece, you first need to exclude the variable ‘Time’. Since the spread of the variable ‘Amount’ is large, this variable is standardized. Then you define the architecture of the autoencoder, and compile and fit it using the training data:
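The original code is not shown here; a minimal sketch of those steps with scikit-learn and Keras could look like the following. The layer sizes and hyperparameters are illustrative choices, not the post's exact architecture:

```python
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

def preprocess(df):
    """Drop 'Time' and standardize 'Amount', as described above."""
    data = df.drop(columns=["Time"]).copy()
    data["Amount"] = StandardScaler().fit_transform(data[["Amount"]])
    return data

def build_autoencoder(n_features: int, code_dim: int = 14) -> Model:
    """Encoder f: x -> h and decoder g: h -> r, trained to minimize
    the mean squared reconstruction error."""
    inp = Input(shape=(n_features,))
    h = Dense(code_dim, activation="relu")(inp)      # encoder
    out = Dense(n_features, activation="linear")(h)  # decoder
    model = Model(inputs=inp, outputs=out)
    model.compile(optimizer="adam", loss="mse")
    return model

# X = preprocess(df).drop(columns=["Class"]).values
# model = build_autoencoder(X.shape[1])
# model.fit(X, X, epochs=10, batch_size=32, validation_split=0.1)
```

Note that the autoencoder is fit with X as both input and target, since the training objective is to reconstruct the input.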

Finally, you can save your model:
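Keras's built-in serialization handles this; the snippet below shows a save-and-reload round trip with a stand-in model (in practice you would save the trained autoencoder from the previous step, and the file name is a placeholder):

```python
import os
import tempfile
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Dense

# Stand-in model; replace with the trained autoencoder from the previous step
inp = Input(shape=(29,))
model = Model(inp, Dense(29)(Dense(14, activation="relu")(inp)))
model.compile(optimizer="adam", loss="mse")

# Save architecture, weights, and optimizer state, then reload
path = os.path.join(tempfile.mkdtemp(), "fraud_autoencoder.h5")
model.save(path)
restored = load_model(path)
```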

Conclusion

In this blog post, we dived into a specific credit card fraud detection use case. Most importantly, we showed how the right cloud analytics environment, such as an Azure Data Science Virtual Machine, makes it easy for any organization to collect data, analyze it, experiment, and build a model ready for use in a production environment.