A primer on machine learning for fraud detection

Michael Manapat

Michael leads work on Stripe’s machine learning products, including Radar. Prior to Stripe, he was an engineer at Google and a postdoctoral fellow in and lecturer on applied mathematics at Harvard. He received a Ph.D. in mathematics from MIT.

Introduction

Stripe builds products that enable millions of e-commerce companies, SaaS businesses, on-demand marketplaces, nonprofits, and platforms to conduct business online. One inescapable facet of online commerce—and one that, unfortunately, frequently comes as an unpleasant surprise—is fraud. Unlike businesses that accept payments in person, internet businesses are liable for fraudulent purchases—this despite the fact that they are no more experts on fraud than their brick-and-mortar counterparts. As a result, many internet businesses have had to build up teams of fraud analysts and expend engineering effort on fraud detection systems. At Stripe, we want to help businesses focus on their product and customer experiences and not on fraud, so we’ve developed Stripe Radar, a suite of modern tools for fraud detection and prevention.

The goal of this guide is to provide more detail on the machine learning that powers the core of Radar, explain how we think about the efficacy and performance of fraud detection systems, and describe how other tools in the Radar suite can help businesses optimize their outcomes. Before getting into details, though, it’s worth spending a little time on the nature and consequences of online fraud.

Credit card fraud

Online payments fraud at a basic level involves a fraudster obtaining someone else’s credit card number and using it to make unauthorized purchases. For example, the fraudster might buy a high value item—say a watch—from an internet business for $1,000 and then resell it on eBay or Craigslist for $200. The real cardholder will eventually discover the unauthorized transaction and initiate a dispute (also known as a “chargeback”) with his or her bank. (While there are different types of disputes, our focus here is on payments that are disputed with fraud as the specified reason.)

If the cardholder’s bank decides that the transaction actually was fraudulent, then the cardholder is made whole but the business is left responsible for the cost of the fraud. This cost is not only the value of the item sold (however much the business paid for the watch) but also any additional fees levied for the dispute. When a business is being targeted by fraudsters, these costs can add up and have a significant impact on the business’s financials. False negatives—or fraud that is not identified and prevented before a dispute occurs—are not the only way in which fraud can have a real, financial impact on a business.

False positives—or legitimate transactions that are blocked by a fraud detection system—are also costly: when a customer tries to make a purchase but is prevented from doing so, the business takes both a gross profit and a reputational hit.

We’ll examine this more closely later, but there is a tradeoff between false negatives and false positives—the fewer you have of the former, the more you need to tolerate of the latter (and vice versa). Businesses need to decide how to trade off the two. Each false negative incurs a certain cost (the cost of goods sold and the fee for disputes), as does each false positive (the margin on the goods sold). If a business’s margins are small, a false negative is very costly and a false positive is not so costly and so the business should lean towards casting a wide net when trying to stop potential fraud (even if that means more false positives). If margins are large, the reverse is true.

It’s important to note, however, that businesses are not always truly free to control this tradeoff. If a business’s fraud rate is greater than 1%, card networks like Visa and Mastercard may revoke its ability to process any credit card payments. The exact definition of “fraud rate” and how long it needs to be above 1% before serious action is taken varies between card brands, but the important point is that businesses only have room to make tradeoffs per their utility functions if their fraud rate is below 1%. Once it is above 1%, they must reduce the rate at all costs, even if that means accepting a false positive rate that is higher than is economically rational for the business. Doing otherwise exposes the business to the existential risk of losing the ability to accept payments.

Stripe Radar and the Stripe “network”

Radar is a suite of products from Stripe for detecting and preventing fraud. Radar’s core is powered by adaptive machine learning, the result of years of data science and infrastructure work by Stripe’s machine learning teams. Radar’s algorithms evaluate every transaction for fraud risk and take action appropriately. High risk payments are blocked by default, but Radar provides tools so that users can specify when other actions should be taken.

Radar is built directly into Stripe and works out of the box. Other fraud prevention solutions generally require a substantial amount of both upfront and ongoing investment. First, businesses must integrate with the product. This involves engineering work to send data on relevant events and payments and to process the assessments returned by the service. Second—at least for fraud detection systems that involve machine learning—businesses must spend several weeks manually labelling payments, explicitly indicating for each of their payments whether it is fraudulent or not, before the fraud system has a chance to be effective. And because fraud patterns change, a diligent business will need to keep labelling transactions indefinitely. Radar, on the other hand, receives information directly from the usual Stripe payment flow and taps into data from card networks and issuers.

We can be more concrete about the benefits of Radar in these two areas. Because Radar is an integral part of Stripe, aggregate data relevant to fraud from all of Stripe’s millions of businesses—collected automatically because these businesses have already integrated with Stripe for payments—is used to improve our fraud detection ability. While properties that can be “read off” a single credit card payment—for example, the country in which the card was issued, the IP address from which the payment was made, and the user’s email domain—provide valuable signal when predicting whether the payment is likely to be fraudulent, we’ve found that some of the most predictive signals come from behavioral data aggregated over time. For example, the number of countries in which a card was used (across all the businesses working with Stripe) in the recent past is a very precise indicator of fraud.

Radar analyzes every single credit card payment made across all of the millions of Stripe users to build up this sort of behavioral data. When a business using Stripe sees a card for the first time, there’s an 80% chance that we’ve seen the card elsewhere on the Stripe network in the past. Previous encounters with a card offer a significant amount of data to inform our risk assessments.
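To make the idea of a behavioral aggregate more concrete, here is a minimal Python sketch of the "number of countries in which a card was used in the recent past" signal. The record layout (`card_id`, `country`, `created`), the one-day window, and the sample data are assumptions made for illustration; they are not a description of how Radar stores or computes this signal.

```python
from datetime import datetime, timedelta

def countries_in_window(payments, card_id, as_of, window=timedelta(days=1)):
    """Count distinct countries in which `card_id` was used during `window` before `as_of`."""
    start = as_of - window
    return len({
        p["country"]
        for p in payments
        if p["card_id"] == card_id and start <= p["created"] < as_of
    })

# Hypothetical payments seen across many businesses (illustrative values only).
payments = [
    {"card_id": "card_A", "country": "US", "created": datetime(2024, 1, 1, 9, 0)},
    {"card_id": "card_A", "country": "BR", "created": datetime(2024, 1, 1, 13, 0)},
    {"card_id": "card_A", "country": "RU", "created": datetime(2024, 1, 1, 17, 0)},
    {"card_id": "card_B", "country": "US", "created": datetime(2024, 1, 1, 10, 0)},
    {"card_id": "card_A", "country": "US", "created": datetime(2023, 12, 20, 10, 0)},
]

now = datetime(2024, 1, 1, 18, 0)
card_a_countries = countries_in_window(payments, "card_A", now)  # three countries in one day
card_b_countries = countries_in_window(payments, "card_B", now)  # a single country
```

A scan over all payments is only workable for a toy data set; at scale, an aggregate like this would need to be maintained incrementally.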

The fact that Radar receives information on disputes directly from Stripe’s banking and financial partners means that ongoing labelling of transactions by businesses is not necessary. Labelling is a labor-intensive process that also “bakes in” the biases of the people doing the labelling. A particularly aggressive (or lenient) analyst can detrimentally affect the performance of a fraud detection system that relies on these manual labels. Radar learns about the “ground truth” of fraud directly from comprehensive and authoritative dispute information from card issuers, and its machine learning starts providing protection on Day 1.

Let’s get started with a more detailed look at machine learning and how we use it at Stripe.

The basics of machine learning

Machine learning refers to a body of techniques for taking large amounts of data and using that data to produce models that encode the relationships in that data.

One of the main applications of machine learning is prediction: we want to predict the value of some output (in this case, a boolean value that is true if the payment is fraudulent and false otherwise) given some input values (for example, the country the card was issued in and the number of distinct countries the card was used from across the Stripe network in the past day). We determine how to make that prediction based on previous examples of input and output data. We can further divide the prediction problem into two types of tasks:

For classification tasks, we want to predict if a sample is in one class or another (for example, we'd like to predict whether a transaction is in the “class” of fraudulent payments or in the “class” of legitimate ones).

For regression tasks, we want to predict a numerical output value associated with the inputs (for example, we might want to predict the lifetime value, in dollars, of a new customer).

While the distinction between classification and regression is helpful in practice, they involve fundamentally similar techniques. Classification can be viewed as the regression problem of predicting the probability that a sample is in a class.

The data that is used to train (or generate) the models consists of records (often obtained from historical data) with both the output value and the various input values as we have in the following (highly simplified) example:

While there are only three inputs in this example, in practice machine learning models often have hundreds or thousands of inputs. The output of the machine learning algorithm might be a model like the following decision tree:

When we observe a new transaction, we can look at the country in which the card was issued, the number of countries from which the card was used in the past day, and the payment amount in USD and traverse the tree “20-questions style” until we reach one of its “leaves.” Each leaf consists of all the samples in the data set (the table above) satisfying the question-answer pairs along the path we followed down the tree, and the probability that we think the new transaction is fraudulent is the number of samples in the leaf that are fraudulent divided by the total number of samples in the leaf. Put another way, the tree answers the question, “of transactions in our data set with properties similar to the transaction we’re examining now, what fraction were actually fraudulent?” The machine learning part is concerned with the construction of the tree—what questions do we ask, in what order, to maximize the chances that we can distinguish between the two classes accurately? Decision trees are particularly easy to visualize and reason about, but there are many different learning algorithms, and they each have their own unique ways of representing the relationships we are trying to model.
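The traversal described above can be sketched in a few lines of Python. The records, the field names, and the hand-written two-question tree below are all hypothetical stand-ins chosen for illustration; they are not the table or tree from this guide.

```python
# Hypothetical training records (illustrative values only).
FIELDS = ("card_country", "countries_past_day", "amount_usd", "is_fraud")
rows = [
    ("US", 1, 25.0, False), ("US", 1, 900.0, False), ("CA", 1, 40.0, False),
    ("GB", 2, 75.0, False), ("GB", 1, 2000.0, True), ("US", 3, 1200.0, True),
    ("CA", 4, 300.0, True), ("GB", 3, 60.0, True), ("US", 3, 500.0, False),
    ("US", 3, 15.0, False),
]
samples = [dict(zip(FIELDS, row)) for row in rows]

# A hand-written toy tree. Internal nodes hold a yes/no question; a leaf is an empty dict.
LEAF = {}
tree = {
    "question": lambda s: s["countries_past_day"] >= 3,
    "yes": {
        "question": lambda s: s["amount_usd"] >= 50.0,
        "yes": LEAF,
        "no": LEAF,
    },
    "no": LEAF,
}

def leaf_fraud_probability(node, samples, txn):
    """Walk the tree for `txn`, keeping only the samples that gave the same answers."""
    while "question" in node:
        ask = node["question"]
        answer = ask(txn)
        samples = [s for s in samples if ask(s) == answer]
        node = node["yes"] if answer else node["no"]
    if not samples:
        raise ValueError("no training samples reached this leaf")
    # Fraction of samples in the leaf that were actually fraudulent.
    return sum(s["is_fraud"] for s in samples) / len(samples)

new_txn = {"card_country": "US", "countries_past_day": 3, "amount_usd": 250.0}
p_fraud = leaf_fraud_probability(tree, samples, new_txn)  # 3 of the 4 samples in this leaf are fraud
```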

Today machine learning models are prevalent—powering, behind the scenes, many of the products we interact with frequently—and generally much more sophisticated than the toy model above:

Google, Apple, Amazon, and Microsoft use machine learning to power their intelligent assistants (Google Assistant, Siri, Alexa, and Cortana). Machine learning is used in a number of ways—from predicting the text represented by a fragment of speech (with models generated from input data consisting of a huge number of samples of text in both written and spoken form) to extracting the semantic intent of the question.

Amazon uses machine learning for its recommendation system: based on what people have purchased after buying a particular item, Amazon can determine what three or four other items to recommend to people to maximize the probability of a follow-up purchase.

Google and Facebook use machine learning to determine what ads to show you for a given search query or in your news feed. In both cases, inputs include the content of the ad and other contextual information (like the query text or the content of an adjacent news feed post) and the model output is the probability that you click on the ad (which both Google and Facebook would like to maximize).

And, most relevant to this discussion, machine learning is the basis for Stripe Radar, which seeks to predict which of your payments are fraudulent.

How does machine learning work?

Academic machine learning courses will usually focus on the modeling process—the methods for translating data (e.g., the table above) into the models (e.g., the decision tree), which are the algorithms that tell you how input values (the country in which the card was issued, the number of countries from which the card was used, etc.) map to output values (was the transaction fraudulent or not?). The process that takes the input data table above and produces the “best” tree is an example of a particular machine learning method. Modeling involves a number of steps, and while we won’t go into too much detail, a high-level overview follows.
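As a rough illustration of what "producing the best tree" involves, the sketch below picks the single best question of the form "is this feature at least t?" by minimizing Gini impurity, one common splitting criterion. The choice of criterion, the function names, and the data are assumptions for illustration; this guide does not say which criterion is used in practice, and a real tree learner repeats this step recursively across many features.

```python
def gini(labels):
    """Gini impurity of a list of booleans (0.0 means every label is the same)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best question `value >= threshold`."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        below = [y for v, y in zip(values, labels) if v < t]
        at_or_above = [y for v, y in zip(values, labels) if v >= t]
        if not below or not at_or_above:
            continue  # a question that does not separate anything is useless
        impurity = (len(below) * gini(below)
                    + len(at_or_above) * gini(at_or_above)) / len(values)
        if impurity < best[1]:
            best = (t, impurity)
    return best

# Hypothetical feature column and labels (illustrative values only).
countries_past_day = [1, 1, 1, 2, 3, 3, 4, 3]
is_fraud = [False, False, False, False, True, True, True, False]

threshold, impurity = best_threshold(countries_past_day, is_fraud)
```

On this toy data the best question is "was the card used from at least 3 countries in the past day?", because it separates the two classes more cleanly than any other threshold.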

First, we need to obtain training data. Before we can begin automatically detecting fraud, we need to have seen plenty of examples of fraud. For each example, we need to have recorded (or be able to compute retrospectively) a range of input properties that could be useful in making future predictions about the output value. These input properties are called features and the collection of inputs together for a given sample a feature vector. In our example above, the feature vector had a length of three (the country in which the card was issued, the number of countries from which the card was used in the past day, and the payment amount in USD). However, feature vectors with hundreds or thousands of features are not uncommon. In fact, Radar uses hundreds of features and most of them are “behavioral aggregates” computed from across the Stripe network. The output value—in our running example, the boolean as to whether or not the transaction was fraudulent—is often called a target or label. The training data thus consists of a large number of feature vectors and their corresponding output values. Coming up with good features is a difficult data science and engineering problem that we talk more about in the next section.

Second, we need to train a model. Given the training data, we need a method for producing our predictive model. As mentioned above, there are two general classes of prediction tasks in machine learning: regression and classification. In the former, we are trying to predict a numerical output (e.g., the amount of fraud loss). In the latter, we are trying to predict the class to which the sample belongs (e.g., is this payment in the class of fraudulent payments?). Machine learning classifiers (which are models for the classification problem) generally do not just output a class label; they typically assign probabilities that the given sample belongs to each possible class. For example, the output of a fraud classifier might be an assessment that the payment has a 65% chance of being fraudulent and a 35% chance of being legitimate.

There are many machine learning techniques for each of the two types of tasks. For regression, one could use traditional models like linear regression or regression trees. For classification, one could use logistic regression, decision trees, or random forests, among others.

Neural nets and deep learning, inspired by the architecture of neurons in the brain, are also applicable to both types of tasks. (They are responsible for many of the stunning recent advances in the field, including AlphaGo’s defeat of Lee Sedol.) However, for most industrial machine learning applications, traditional models do just fine. Though we constantly iterate on and experiment with our modeling process at Stripe, we’ve found that random forests (a generalization of decision trees like the one above) work well for a wide swath of the machine learning problems we face.

Feature engineering

Machine learning courses (and Kaggle competitions) would have you believe that the hard work is done once you’ve produced a machine learning model from data—they don’t cover where the data comes from or what you have to do to operationalize the model for your business. Before the model training process, you have to build the data infrastructure to take the data from your business (wherever it may be stored) and consolidate it in a way that’s useful to train models. There’s also heavy lifting after the model has been produced: you still need to integrate the new model into your business’s operational flows.

One of the most involved parts of industrial machine learning is feature engineering. Feature engineering consists of both the formulation (based on extensive knowledge of the problem domain) of features that have predictive value and the engineering to make the values of those features available both for model training and for model evaluation in “production.”

For example, a Stripe data scientist may have a hunch that a useful feature is a boolean indicating whether the card payment is coming from one of the two IP addresses from which we’ve seen the card across Stripe most frequently in the past (two because people use their cards both from home and from work). In this case the idea is intuitive but generally these hunches come from examining thousands of cases of fraud.

Once we have the feature idea, we need to compute its historical values so that we can train a new model including the feature—this is the process of adding a new column to the “table” of data we use to produce our model. To do this for our candidate feature, for every payment in Stripe’s history, we need to compute the two most frequent IP addresses from which preceding payments were made with the card. We might do this in a distributed fashion with a Hadoop job, but even then we may find that the job takes too much time (or memory). We might then try optimizing the computation by using a space-saving probabilistic data structure. Even for features that are intuitively simple, there can be a substantial amount of work to produce the data for model training.

Once we have a model that incorporates the feature (we’ll talk about how we determine if the model is effective in the next section), we need to deploy it to production. While we have all feature values for historical payments from jobs like the one described above, we need to be able to compute the value of every feature for every new payment in real-time, when an API call is made to Stripe to create the payment, since we want to be able to block all transactions that our classifier believes are likely to be fraudulent. This computation is entirely separate from the one used to produce training data—we need to maintain up-to-date state on the two most frequently used IP addresses for every card ever seen at Stripe, and fetching and updating those counts needs to be fast because those operations happen as part of the Stripe API flow.
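The two computations described above, backfilling historical feature values and maintaining up-to-date state for new payments, can be sketched with one small Python class. Everything here is a simplified assumption: the class and function names are hypothetical, exact in-memory counters stand in for whatever storage or probabilistic data structure a production system would use, and the IP addresses come from a documentation-only range.

```python
from collections import Counter, defaultdict

class TopIpTracker:
    """Keeps per-card IP counts and answers: is this IP one of the card's top N so far?"""

    def __init__(self, top_n=2):
        self.top_n = top_n
        self._counts = defaultdict(Counter)

    def is_frequent_ip(self, card_id, ip):
        top = self._counts[card_id].most_common(self.top_n)
        return ip in {seen_ip for seen_ip, _ in top}

    def record(self, card_id, ip):
        self._counts[card_id][ip] += 1

def backfill(payments):
    """Replay payments in time order so each feature value uses only preceding payments."""
    tracker = TopIpTracker()
    values = []
    for card_id, ip in payments:
        values.append(tracker.is_frequent_ip(card_id, ip))  # read before updating state
        tracker.record(card_id, ip)
    return values

# Hypothetical payment history for one card, oldest first.
history = [
    ("card_1", "203.0.113.1"), ("card_1", "203.0.113.1"),
    ("card_1", "203.0.113.2"), ("card_1", "203.0.113.2"),
    ("card_1", "203.0.113.3"), ("card_1", "203.0.113.1"),
]
feature_column = backfill(history)
```

Reading the feature before recording the payment matters: if the current payment were counted first, the training data would include information that is not available at decision time.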

Lastly, it would be terribly inefficient if we had to go through this process for every feature idea. Ideally, there would be a way to specify a feature in a declarative way, and supporting infrastructure should automatically make the historical values of the feature available for training and the current values of the features available in production with suitably low latency. This is one of the infrastructure problems Stripe’s machine learning engineers work on.

If you’re interested in working on machine learning products at Stripe, get in touch!

Evaluating machine learning models

Once we’ve developed a machine learning classifier for fraud that uses hundreds of features and assigns a probability (or score) that the payment is fraud to every incoming transaction, we need to determine how effective the model is at detecting fraud.

Let’s start by supposing we’ve created a policy to block a payment if the machine learning model assigns the transaction a probability of being fraudulent greater than 0.7. (We write this as P(fraud)>0.7.) Here are some quantities useful for reasoning about the performance of our model and policy:

Precision: The precision of our policy is the fraction of transactions that we block that are actually fraudulent. The higher the precision is, the fewer false positives there are. Let’s say out of 10 transactions, P(fraud)>0.7 for 6 and, of those 6, 4 are actually fraudulent. The precision is then 4/6≈0.67.

Recall: Also known as sensitivity or the true positive rate, recall is the fraction of all fraud that is caught by our policy, i.e., the fraction of fraud for which P(fraud)>0.7. The higher the recall is, the fewer false negatives there are. Let’s say out of 10 transactions, 5 are actually fraudulent. If 4 of these transactions are assigned a P(fraud)>0.7 by our model, then recall is 4/5=0.8.

False positive rate: The false positive rate is the fraction of all legitimate payments that are incorrectly blocked by our policy. Let’s say out of 10 transactions, 5 are legitimate. If 2 of these transactions are assigned a P(fraud)>0.7 by our model, then the false positive rate is 2/5=0.4.

While there are other quantities that are used when evaluating a classifier, we’ll focus on these three.
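The three definitions above can be checked against each other with a short Python sketch. The ten scores below are made up so that they reproduce the counts in the worked examples (6 transactions blocked, 4 of them fraudulent, 5 fraudulent and 5 legitimate overall); the function name and data layout are assumptions for illustration.

```python
def policy_metrics(scores, is_fraud, threshold=0.7):
    """Precision, recall, and false positive rate for the policy: block if score > threshold."""
    tp = fp = fn = tn = 0
    for score, fraud in zip(scores, is_fraud):
        blocked = score > threshold
        if blocked and fraud:
            tp += 1   # fraud that was blocked
        elif blocked:
            fp += 1   # legitimate payment that was blocked
        elif fraud:
            fn += 1   # fraud that was allowed
        else:
            tn += 1   # legitimate payment that was allowed
    # This sketch assumes at least one blocked, one fraudulent, and one legitimate payment.
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical model scores: first five payments are fraudulent, last five are legitimate.
scores = [0.95, 0.90, 0.85, 0.80, 0.30, 0.90, 0.75, 0.40, 0.20, 0.10]
is_fraud = [True] * 5 + [False] * 5

metrics = policy_metrics(scores, is_fraud)
# precision = 4/6, recall = 4/5, false positive rate = 2/5
```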

Precision-recall and ROC curves

The next natural question is what good values are for the precision, recall, and false positive rate. In a theoretically ideal world, precision would be 1.0 (that is, 100% of transactions that you classify as fraud are actually fraud), which would make your false positive rate 0 (you didn’t incorrectly classify a single legitimate transaction as fraudulent), and recall would also be 1.0 (100% of fraud is identified as such).

In reality, there is a tradeoff between precision and recall—as you increase the probability threshold for blocking, precision will increase (since the criterion for blocking is more stringent) and recall will decrease (since fewer transactions match the high probability criterion). As you decrease the probability threshold, the reverse is generally true: precision will decrease and recall will increase. For a given model, a precision-recall curve captures the tradeoff between precision and recall as the policy threshold is varied:

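[Figure: Precision-recall curve. Recall is on the x-axis (0.0 to 1.0) and precision is on the y-axis (0.0 to 1.0). Three example curves are labeled Great, Good, and Okay, and annotations mark the ends of the curve that correspond to a high and a low score threshold for blocking.]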

As our model gets better overall—because we add features that are good predictors of fraud, tweak other model parameters, use more and more data from across the Stripe network as training inputs, and so forth—the precision-recall curve will change, as depicted in the example above. As it controls the tradeoff for businesses on Stripe, we closely monitor the impact on the precision-recall curve when our data scientists and machine learning engineers modify models.

When considering a precision-recall graph, it’s important to distinguish between the two notions of “performance.” On its own, a model is better overall the closer it hugs the top-right of the chart (that is, where precision and recall are both 1.0). However, operationalizing a model usually requires the selection of an operating point on the precision-recall curve (in our case, the policy threshold for blocking a transaction), which controls the concrete impact using the model has on a business.

Put simplistically, there are two problems: the data science problem of producing a good machine learning model by adding the right features (which controls the shape of the precision-recall curve) and the business problem of picking a policy for actioning a given machine learning model’s outputs (which controls where on the curve we're operating).
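To see how moving the policy threshold traces out a precision-recall curve, here is a minimal Python sketch that recomputes precision and recall at several thresholds. The scores, labels, and threshold grid are hypothetical; each returned point is one possible operating point for the same model.

```python
def precision_recall_points(scores, is_fraud, thresholds):
    """Return (threshold, precision, recall) for each threshold that blocks something."""
    total_fraud = sum(is_fraud)
    points = []
    for t in thresholds:
        blocked_labels = [fraud for s, fraud in zip(scores, is_fraud) if s > t]
        if not blocked_labels:
            continue  # precision is undefined when nothing is blocked
        true_positives = sum(blocked_labels)
        points.append((t, true_positives / len(blocked_labels), true_positives / total_fraud))
    return points

# Hypothetical model scores: first five payments are fraudulent, last five are legitimate.
scores = [0.95, 0.90, 0.85, 0.80, 0.30, 0.90, 0.75, 0.40, 0.20, 0.10]
is_fraud = [True] * 5 + [False] * 5

curve = precision_recall_points(scores, is_fraud, [0.1, 0.3, 0.5, 0.7, 0.9])
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.1 to 0.9 moves precision from 5/9 up to 1.0 while recall falls from 1.0 to 0.2. Improving the model changes the whole set of points; changing the threshold only moves along it.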

Another curve that is examined when evaluating a machine learning model is the ROC curve. (ROC is short for “receiver operating characteristic,” a relic of the curve’s origin in signal processing applications.) The ROC curve plots the true positive rate (which is the same as the recall) on the y-axis against the false positive rate on the x-axis for various values of the policy threshold.

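[Figure: ROC curve example. The false positive rate is on the x-axis (0.0 to 1.0) and the true positive rate (recall) is on the y-axis (0.0 to 1.0). Three example curves are labeled Great, Good, and Okay, and annotations mark the ends of the curve that correspond to a high and a low score threshold for blocking.]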

The ideal ROC will hug the top left of the graph (where recall is 1.0 and the false positive rate is 0.0), and as the model improves the ROC will move more and more in that direction. One way to capture the overall quality of the model is by computing the area under the curve (or AUC); in the ideal case, the AUC will be 1.0. When developing our models, we look to see how the precision-recall curve, the ROC curve, and the AUC change.
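A small Python sketch can make the ROC curve and the AUC concrete. It relies on a standard equivalence: the area under the ROC curve equals the probability that a randomly chosen fraudulent payment scores higher than a randomly chosen legitimate one, with ties counted as one half. The function names and data are hypothetical, and the pairwise loop is only practical for small samples.

```python
def roc_points(scores, is_fraud, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    n_fraud = sum(is_fraud)
    n_legit = len(is_fraud) - n_fraud
    points = []
    for t in thresholds:
        tp = sum(1 for s, fraud in zip(scores, is_fraud) if s > t and fraud)
        fp = sum(1 for s, fraud in zip(scores, is_fraud) if s > t and not fraud)
        points.append((fp / n_legit, tp / n_fraud))
    return points

def roc_auc(scores, is_fraud):
    """AUC via pairwise comparison of fraudulent and legitimate scores (ties count 0.5)."""
    fraud_scores = [s for s, fraud in zip(scores, is_fraud) if fraud]
    legit_scores = [s for s, fraud in zip(scores, is_fraud) if not fraud]
    wins = 0.0
    for fs in fraud_scores:
        for ls in legit_scores:
            if fs > ls:
                wins += 1.0
            elif fs == ls:
                wins += 0.5
    return wins / (len(fraud_scores) * len(legit_scores))

# Hypothetical model scores: first five payments are fraudulent, last five are legitimate.
scores = [0.95, 0.90, 0.85, 0.80, 0.30, 0.90, 0.75, 0.40, 0.20, 0.10]
is_fraud = [True] * 5 + [False] * 5

points = roc_points(scores, is_fraud, [0.9, 0.7, 0.5, 0.3, 0.1])
auc = roc_auc(scores, is_fraud)
```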

Score distributions

Imagine that we have a model that randomly assigns a probability of fraud between 0.0 and 1.0 to a transaction. Practically, this model does nothing to discriminate between legitimate and fraudulent transactions and is of little use to us. This randomness is captured by the score distribution of the model—the fraction of transactions getting each possible score. In the completely random case, the score distribution would be close to uniform:

A model will have a uniform score distribution like the above if, for example, the model has no features that are even remotely predictive of fraud. As a model is improved—by adding predictive features, training on more data, and so forth—its power to discriminate between the fraudulent and legitimate classes will increase and the score distribution will become more bimodal, with peaks around the scores of 0.0 and 1.0.

On its own, a bimodal distribution does not tell you that a model is good. (A vacuous model that randomly assigns probabilities of just 0.0 and 1.0 would also have a bimodal score distribution.) However, in the presence of evidence that transactions with a low score are not fraudulent and transactions with a high score are fraudulent, an increasingly bimodal distribution is a sign of improved efficacy for a model.
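A score distribution is simply a normalized histogram of model scores. The Python sketch below buckets scores into equal-width bins and contrasts a roughly uniform set of scores with a bimodal one; the bin count and both score lists are invented for illustration.

```python
def score_distribution(scores, bins=5):
    """Fraction of scores falling into each of `bins` equal-width buckets over [0.0, 1.0]."""
    counts = [0] * bins
    for s in scores:
        index = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bucket
        counts[index] += 1
    return [c / len(scores) for c in counts]

# Hypothetical scores from an uninformative model and from a more discriminating one.
uniform_like = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
bimodal_like = [0.02, 0.03, 0.05, 0.08, 0.10, 0.50, 0.90, 0.93, 0.97, 0.99]

uniform_hist = score_distribution(uniform_like)   # equal mass in every bucket
bimodal_hist = score_distribution(bimodal_like)   # mass concentrated near 0.0 and 1.0
```

As noted above, the shape alone says nothing about correctness; it has to be read alongside the labels for the payments in each bucket.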

Computing precision and recall

We can compute the metrics above in two different contexts: during model training, using the historical data that drives the model development process, and after model deployment, using production data—i.e., data from the world when the model is already being used to take action by, say, blocking transactions if P(fraud)>0.7.

For the former, data scientists will typically take the training data they have (recall the table from above) and randomly assign some fraction of the records to a training set and the other records to a validation set. One could imagine that the first 80% of rows go into the former and the last 20% to the latter, for example.

The training set is the data fed into a machine learning method to produce a model as described above. Once we have a candidate model, we can then use it to assign scores to each sample in the validation set. The validation set scores together with their output values are used to compute the ROC and precision-recall curves, the score distributions, and so forth. We split data into separate training and validation sets so that we have metrics that are an accurate measure of the predictive power of the model. Every sample in the training set (by virtue of being data on which the model is trained) is in some sense “baked in” to the model, and thus the predictive performance of the model on the training set will generally be better than it is on new data that hasn’t been seen before. Testing on a validation set makes our model assessment more accurate.
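The split described above can be sketched in Python as a shuffle followed by a cut. The 80/20 ratio comes from the example in the text; the function name, the fixed seed, and the placeholder records are assumptions for illustration.

```python
import random

def train_validation_split(records, validation_fraction=0.2, seed=0):
    """Shuffle a copy of `records` and split it into (training set, validation set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split reproducible
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Placeholder records; in practice each would be a feature vector plus its label.
records = list(range(100))
training_set, validation_set = train_validation_split(records)
```

The model is then fit on `training_set` only, and every metric is computed on `validation_set`, which the model has not seen.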

Once we put a model into production—because the validation set metrics suggest we can use the model effectively—there is the separate but related question of how we continuously monitor the performance of our model-policy pair. For payments that have scores below the threshold for blocking, we can observe the ultimate outcome—was the transaction disputed by the cardholder as fraud? Payments that have scores above the threshold, however, are blocked, and so we can’t know what their outcomes would have been. Computing the full production precision-recall or ROC curve is thus more involved than computing the validation curves because it involves counterfactual analysis—we need to obtain statistically sound estimates of what would have happened even to the payments we blocked. Stripe has developed methods to do this, which you can learn more about from the referenced talk.

We’ve just described a few of the measures of model efficacy that data scientists look at when developing machine learning models. Next, we’ll talk about how businesses should think about fraud prevention—much of what we discussed above (for example, the inverse relationship between precision and recall) will be useful for obtaining an accurate picture of how fraud affects your business.

Reasoning about fraud prevention system performance

Fraud prevention platforms will often advertise a single number to capture performance, but fixating just on one number can result in choices that are not optimal for your business. (This is perhaps unsurprising since the metric that fraud teams own is usually one related to fraud rate—they’re not responsible for lost revenue, which is harder to identify than fraud losses.)

Here are some illustrations of the real-world implications of the false positive-false negative (or precision-recall) tradeoff:

If you’re only optimizing the dispute rate, you could trivially reduce it to zero by blocking every payment. Needless to say, this would be disastrous for your business. While this ridiculous “model” would stop all fraud, it would do so at the expense of a terrible false positive rate (all legitimate transactions are also blocked).

Focusing just on a low false positive rate in isolation can also lead you astray. You can trivially make the false positive rate zero by not blocking any payments at all, but more seriously, a low false positive rate does you no good if your business’s dispute rate nears 1% (which is the upper limit imposed across almost all card networks). In that situation, you need to reduce fraud significantly even if it means tolerating many false positives.

While these examples might seem contrived, we’ve found that businesses will often overemphasize false negatives—they’re very concerned about fraud that is missed—and underemphasize false positives. This often results in generally ineffective and costly brute-force measures like blocking all international cards, or all IP addresses from a certain region, or all cards of a certain type. Machine learning systems aren’t immune to biases like this—ones that depend primarily on human labelling learn to reproduce those biased decisions. The fact that Stripe gets all dispute information directly from card networks and issuers means that this is not an issue for Radar.

We’ll give one more example of the subtlety involved in reasoning about system efficacy: is the existence of your fraud prevention system (even if it may be catching a substantial amount of fraud) actually making you money? It’s possible that it’s not if the system results in more legitimate transactions getting blocked (per fraudulent transaction blocked) than is appropriate for your business.

Let’s start with a simple calculation—imagine we’re selling a widget for $10 that costs us $4 to make. For a legitimate sale, our profit is $6. On the other hand, a fraudulent transaction costs us $4 (the cost of producing the widget) as well as a $15 chargeback fee (a total of $19 lost).

Given these numbers, we should be willing to forgo $19/$6≈3.17 legitimate transactions if that means avoiding 1 fraudulent transaction. Put another way, as long as at least 1 out of every 4.17 transactions we block is actually fraudulent, then we are increasing our profit by having the fraud detection and blocking system in place. For the numbers in this simplified example, we call 1/(1+3.17)≈0.24 the break-even precision. If the system’s precision is less than 0.24, you are actually making less money from having it in place, even if it has lowered your chargeback rate substantially!

While there might still be times when you’ll have to tolerate a precision lower than the break-even value, such as if your business is nearing the critical 1% chargeback rate limit, in general you should be thinking about how all the various performance measures relate and what the right tradeoffs are given your particular circumstances.
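The break-even calculation above can be written as a short Python sketch using the widget numbers from the text ($10 price, $4 cost, $15 chargeback fee). The function names and the net-gain helper are assumptions added for illustration, and the model deliberately ignores anything beyond these three numbers, such as reputational costs.

```python
def break_even_precision(price, unit_cost, dispute_fee):
    """Precision at which blocking neither makes nor loses money."""
    profit_per_legit_sale = price - unit_cost   # forgone when a legitimate payment is blocked
    loss_per_fraud = unit_cost + dispute_fee    # avoided when a fraudulent payment is blocked
    return profit_per_legit_sale / (profit_per_legit_sale + loss_per_fraud)

def net_gain_from_blocking(n_blocked, precision, price, unit_cost, dispute_fee):
    """Money saved on blocked fraud minus profit lost on blocked legitimate payments."""
    fraud_blocked = n_blocked * precision
    legit_blocked = n_blocked * (1 - precision)
    return fraud_blocked * (unit_cost + dispute_fee) - legit_blocked * (price - unit_cost)

threshold_precision = break_even_precision(price=10, unit_cost=4, dispute_fee=15)  # 6/25 = 0.24
gain_at_half = net_gain_from_blocking(100, 0.5, 10, 4, 15)    # 50*19 - 50*6 = 650
gain_at_tenth = net_gain_from_blocking(100, 0.1, 10, 4, 15)   # 10*19 - 90*6 = -350
```

Note that 6/(6+19) is the same quantity as 1/(1+19/6) in the text; it is just rearranged to avoid the intermediate rounding to 3.17.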

Improving performance with rules and manual reviews

Alongside the more automatic machine learning algorithms, Stripe Radar also lets individual businesses compose customized rules (for example, “block all transactions above $1,000 when the IP country does not match the card’s country”) and manually review flagged payments in the Dashboard.

Such rules can be seen as simple “models” (they can be represented as decision trees, after all!) and they should be evaluated in the same way as models, with a full consideration of the tradeoff between precision and recall. When you create a rule with Radar, we’ll present historical statistics on the number of matching transactions that were actually disputed, refunded, or accepted to help with these calculations.
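To illustrate evaluating a rule like a model, the Python sketch below expresses the example rule as a plain function and measures its precision and recall against past payments. This is not Radar's rule syntax or API; the function, the field names, and the six historical payments are all hypothetical.

```python
def rule_matches(payment):
    """Example rule: amount above $1,000 and IP country differs from the card's country."""
    return payment["amount_usd"] > 1000 and payment["ip_country"] != payment["card_country"]

# Hypothetical historical payments with their eventual outcomes.
history = [
    {"amount_usd": 1500, "ip_country": "US", "card_country": "US", "disputed": False},
    {"amount_usd": 2500, "ip_country": "RU", "card_country": "US", "disputed": True},
    {"amount_usd": 1200, "ip_country": "BR", "card_country": "US", "disputed": True},
    {"amount_usd": 1800, "ip_country": "FR", "card_country": "US", "disputed": False},
    {"amount_usd": 80, "ip_country": "RU", "card_country": "US", "disputed": True},
    {"amount_usd": 40, "ip_country": "US", "card_country": "US", "disputed": False},
]

matched = [p for p in history if rule_matches(p)]
caught = sum(p["disputed"] for p in matched)

rule_precision = caught / len(matched)                       # share of matches that were disputed
rule_recall = caught / sum(p["disputed"] for p in history)   # share of all disputes that matched
```

Here the rule would have blocked one legitimate traveler and missed one small fraudulent payment, which is the same precision-recall tradeoff discussed for models.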

Just as important, rules and manual review allow users to change the shape of the precision-recall curve in their favor by adding in proprietary, business-specific logic (rules) or by expending some additional effort (manual review).

If you realize that the machine learning algorithms are frequently missing a certain type of fraud particular to your business (and that fraud is easily identifiable to you), you can compose a rule to automatically block it. That specific intervention will increase recall with little cost to precision, in effect moving the operating point along a less steep, more favorable precision-recall curve.

By sending some classes of transactions to manual review instead of blocking them outright, you can gain precision without a hit to recall. Similarly, by sending some transactions to manual review instead of allowing them outright, you can gain recall without a hit to precision.

Of course, in these cases, you are paying for these gains with additional human work (and exposing yourself to the accuracy of your team’s assessments), but having manual review and rules as additional tools gives you another lever to optimize fraud outcomes.

Next steps

We hope this guide helps you understand how machine learning is applied to fraud prevention at Stripe and how to gauge the efficacy of your fraud systems. You can learn more about Radar’s features or explore our docs.

If you have any questions or feedback about this guide or Stripe Radar, or are interested in working on machine learning products like Radar, please reach out!