How to build better fraud detection systems with ML


I started working in the fraud domain as a scientist at AWS in 2017. The primary job of the science team was to build the intelligence systems, which were largely model-based (although that wasn't always the case!). After some time there, I went on to help build the Amazon Fraud Detector service to empower customers to build customized Machine Learning (ML)-powered fraud detection solutions with a few clicks. Through this opportunity I was exposed to a number of different fraud and fraud-adjacent problems. Although the purview of the term “fraud” is vast and diverse, its intersection with ML is a fascinating success story of the business value of ML. There are many interesting lessons about how to adapt methods to make something work well in a particular domain, and about what it takes to integrate into a real production system rather than building a model in isolation. I therefore wanted to take some time and write down what I've learned so far, with the hope that it gives you a valuable perspective on fraud or ML.

This will be a 4-part series. This first part discusses the challenges of non-ML fraud systems and why ML is a strong tool for the job. The key takeaway is that ML-powered fraud detection doesn't just automate data mining; it also enables a trust-score paradigm that changes the game. Part 2 will discuss the challenges of modeling with Personally Identifiable Information (PII), and Part 3 will dive into the problem of label noise inherent in these systems. Finally, Part 4 will dive deep into using graph representations of fraud data to unlock new detection capabilities, and how this intersects with the hot topic of ML on graphs.

What is “fraud”?

Fraud can be loosely defined as the misrepresentation of facts as a means to gain access to something that would otherwise not be granted without deception. The obvious cases are included, like transaction fraud (e.g., no intent to pay or a stolen payment instrument), account take-over (e.g., a bad actor gains access to a legitimate account), insurance claim fraud…etc. We also include illegal activity that results from collusion among actors, like money laundering. We often see “fraud-adjacent” problems tackled with similar methods, like abuse, phishing, spam, and fake news. You even see “intent to pay” modeling blur the line between fraud and credit use cases.

Each of these problem areas is distinct and has different characteristics. For example, trying to identify accounts created by a bad actor who has automated account creation to take advantage of promotions is quite different from trying to find cases where a bad actor has taken over a legitimate account in good standing via credential stuffing. This is to say that there is no such thing as a “one size fits all” approach, and the details matter. That said, we will talk about fraud generally, as if it were all one thing, despite knowing this isn't quite true.

Traditional (non-ML) fraud detection systems

Before diving into ML, let's start by describing typical fraud detection systems that do not use ML.


Regardless of detection techniques, somewhere there will typically be humans that represent the final authority on what constitutes fraud, as the precise definition of fraud will depend on the business. A user failing to pay their bill may be a strong indication of fraud, for example, but there are also a number of non-fraud reasons this may occur (e.g., a good-faith user is not able to pay). Investigators represent the oracle that gives us the ground truth.


The backbone for finding potential fraud is the team of analysts. They are responsible for combing through the data to either surface potential fraud to the human investigators or make highly confident automated decisions. Analysts often perform this job by writing SQL queries to assemble relevant data, and then search for patterns that help discern fraud from legitimate instances. Through this process, analysts can discover manipulations of the data that are useful for fraud detection, called features. This might be, for example, the number of transactions in the last minute, hour, day and month. These features are a result of the domain knowledge of the team along with the data munging and statistical expertise of the analysts. In addition to features, one of the valuable outputs of these efforts is a rule which uses a combination of factors to make an explicit fraud-related decision.
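
As a sketch of what such a velocity feature might look like in practice, here is one way to compute a trailing 24-hour transaction count with pandas. The column names and data are made up for illustration:

```python
import pandas as pd

# Hypothetical transaction log: one row per transaction, indexed by time.
txns = pd.DataFrame(
    {"user_id": ["u1", "u1", "u2", "u1"],
     "amount": [10.0, 25.0, 5.0, 99.0]},
    index=pd.to_datetime([
        "2023-01-01 12:00", "2023-01-01 12:30",
        "2023-01-01 13:00", "2023-01-03 12:00",
    ]),
)

# Velocity feature: each user's transaction count over the trailing 24 hours.
txns["txn_count_24h"] = (
    txns.groupby("user_id")["amount"]
        .transform(lambda s: s.rolling("24h").count())
)
```

The same pattern extends to other windows (minute, hour, month) and other aggregates (sum of amounts, distinct IPs), which is exactly the kind of feature family analysts tend to discover by hand.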

Rules Engine

A rules engine contains a collection of rules and is often used to evaluate incoming events to make fast decisions in a low latency setting. In other words, the rules codify the domain knowledge of the fraud group and make decisions on new events, as opposed to finding fraud in events that already occurred. A rule might look something like:

if (
    IpCountry == 'Vietnam'
    and BillingAddressCountry == 'Canada'
    and CreditCardBank == 'Chase'
):
    fraud = True

These rules are often highly specific to a particular fraud pattern because the business needs to ensure that the rule does not negatively impact many legitimate customers (i.e., “false positives”). It is therefore the analysts' job to create candidate rules and then backtest them on historical data until they find one that strikes the appropriate balance between catching the fraud and not negatively impacting legitimate customers. The rules engine is the primary infrastructure for representing and applying the intelligence of the analysts.
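
That backtesting loop can be sketched in a few lines of Python. The events, labels, and candidate rule below are illustrative, not real data:

```python
# Labeled historical events (illustrative, not real data).
history = [
    {"IpCountry": "Vietnam", "BillingAddressCountry": "Canada",
     "CreditCardBank": "Chase", "is_fraud": True},
    {"IpCountry": "Canada", "BillingAddressCountry": "Canada",
     "CreditCardBank": "Chase", "is_fraud": False},
    {"IpCountry": "Vietnam", "BillingAddressCountry": "Canada",
     "CreditCardBank": "Chase", "is_fraud": False},
]

def candidate_rule(event):
    """The candidate rule under test: flag the risky country/bank combo."""
    return (event["IpCountry"] == "Vietnam"
            and event["BillingAddressCountry"] == "Canada"
            and event["CreditCardBank"] == "Chase")

flagged = [e for e in history if candidate_rule(e)]
total_fraud = sum(e["is_fraud"] for e in history)

# The two quantities the analyst is trading off.
catch_rate = sum(e["is_fraud"] for e in flagged) / total_fraud
false_positives = sum(not e["is_fraud"] for e in flagged)
```

A rule ships only when the catch rate is worth the false positives it incurs on legitimate customers.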

Block/Allow Lists

Another tool that helps apply insights are block and allow lists, which contain explicit values that are either automatically blocked or passed without friction if an incoming event has a matching value. A block list could be used, for example, if it's found that a particular IP address is repeatedly defrauding the business. Or an allow list might be used if there's e.g., a corporate credit card for a business's biggest customer so as to ensure there's never friction from a fraud system when processing transactions from this customer.
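
In code, these lists amount to fast membership checks consulted before any other logic. A minimal sketch, with made-up values:

```python
# Example values only; real lists would be maintained by the fraud team.
BLOCK_LIST = {"203.0.113.7"}       # e.g., an IP repeatedly tied to fraud
ALLOW_LIST = {"198.51.100.20"}     # e.g., a big customer's corporate gateway

def screen(ip_address):
    """Return a hard decision on a list match, or None to keep evaluating."""
    if ip_address in BLOCK_LIST:
        return "block"
    if ip_address in ALLOW_LIST:
        return "pass"
    return None  # fall through to the rules engine
```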

Challenges of these systems

Poor generalization means labor intensive

How do these systems typically respond to an emerging fraud threat, defined as a new pattern of fraud that is subverting the current system? The threat will often be noticed either by the investigation team, if a subset of this fraud is detected by existing systems, or through a business metric associated with fraud loss, such as the number of chargebacks. Once discovered, an analyst is tasked with diving deeper to understand and define the emerging threat. Ideally, a new rule will be crafted to fill the hole in the detection system and stop future fraud. In other words: there is a cat and mouse game in which fraudsters evolve their tactics to subvert the current system, and the fraud team reacts by building a protection against this newest pattern, making the system incrementally stronger.

The nature of the challenge is that the detection systems have poor generalization to new patterns and therefore the system is labor intensive.

Let's consider the two primary automated detection systems: block lists and rules. Block lists can be subverted by simply changing the data element that's on the list, whether it's an IP address, an e-mail address, or something else. This adds friction, but it doesn't take a genius to get around them. Rules are highly targeted by their very nature of being responses to emerging threats. This minimizes the impact on legitimate customers, but makes the rules less able to generalize to evolving fraud behavior. For example, reconsider the rule above. If the fraudster is using a VPN, they might simply change regions so that the IP location switches from Vietnam to Indonesia, which breaks the rule. Some portion of fraudsters will be motivated enough to overcome these systems if the reward is high.

Duplication means hard to scale

Next, let's consider how these systems are used to make decisions across the lifetime of an entity. For illustration, let's assume there are two types of touch-points where a business wants to assess risk: new account creation and transactions. Each of these will have quite different data available. For new account creation you might have form data that is meant to identify the person, along with data collected about the device. For a transaction, you will have the transaction data itself, the history of transactions for the user, the account creation data, and perhaps many other things. Because the data and use cases are different, the rules that evaluate these two events will be quite different. We therefore need to build a distinct version of the system for each context where we want to assess risk. If the team needs to support multiple rules engines, it will likely duplicate risky patterns across them. For example, if it's known that Vietnamese IPs and Chase Bank are a risky combination, that pattern may be duplicated across several rules engines and then further refined according to the context and available data.

Singular outcomes have little nuance

An additional problem of duplication is that it's not clear how to communicate risk from one event to another with this approach. For each evaluation we make a singular choice among a handful of options–investigate, shut down, pass…etc. Since all accounts below the risk threshold are given “pass”, for example, there's no way to tell the difference between those that were just below the risk threshold versus those that were confidently legitimate. This is why, in some sense, we have the duplication problem above. If we had a way to communicate risk from the registration rules engine, we could directly use that as a factor in assessing transactions rather than needing to duplicate the underlying patterns associated with risk.

In summary, these systems are highly targeted to particular fraud patterns because 1) they are reactive to particular threats and 2) teams want to minimize friction for legitimate customers. Highly targeted detection also means poor generalization, and an adversarial game is therefore set up in which fraudsters make small changes to avoid detection and fraud teams react with new rules. When these systems need to be duplicated for different contexts, and there's no natural way to communicate risk between them, the required effort is multiplied.

Fraud detection is tailor-made for ML

Now let's shift our focus to Machine Learning. Many times, ML is a hammer looking for a nail, and the actual business needs are not well aligned with the strengths of ML. Let's start by describing why fraud applications are generally well-aligned with the strengths and requirements of successful ML projects.

Lots of data. Fraud systems typically have a lot of data to work with, and much of it is useful for purposes of fraud detection. An online business, for example, likely has website click data, authentication events, form data in account registration flows, purchase/usage data…etc. Each of these provides an opportunity for detection since fraudsters will typically have unique characteristics. For example, whereas a legitimate user might be referred based on an ad and then navigate a website to learn about the products/services before signing up, a fraudster might come directly to the registration page and create an account. Furthermore, behavior data like time on page, keystroke speed…etc, might be very different, particularly if a sophisticated fraudster has automated the account creation process using software.

Lots of decisions. Businesses typically need to make far more decisions regarding fraud than they can possibly scale with human review. In all but the most extreme examples, businesses cannot have a human review every user interaction for fraud risk. We therefore have a need to make a high volume of systematic decisions.

Accuracy is valuable. In some cases, it may not be clear how better prediction performance translates into business value. For example, if your cat detector for images increases in accuracy from 89% to 92%, what is the business value? Or if you can better predict which customers are going to churn, how does that help you retain more customers? In the case of fraud, it's quite clear: the more fraud you can detect and the sooner you can do it, the more cost you can save. In extreme cases like wire transfers, a single transaction can be worth millions of dollars and accurate predictions can therefore be highly valuable.

Computers are better than humans. Given the proper data, computers are much better than humans at combing through lots of data to find patterns and statistical relationships. Not only are they faster, but they're far more systematic. An analyst, for example, might tackle this problem by writing a SQL query against their favorite data sources, and then visually inspecting the results to look for patterns they can exploit. Once they have a candidate, they can backtest it on historical data to assess performance. Different analysts, or even the same analyst on different days, might produce totally different patterns. ML, in contrast, follows an algorithm and will produce similar results from run to run.

ML addresses key challenges in fraud detection

Trained as a supervised fraud classification system, ML models perform the same task as the analyst: comb through the data and find combinations of features that help discern fraud from legitimate populations. Given a population of fraud that meets the aforementioned rule criteria and a legitimate population, a properly prepared dataset and an off-the-shelf method like XGBoost will readily find that population, ending up with decision trees that look similar to the rule a human created. The first advantage, therefore, is that ML gets a computer to do the job of finding patterns in the data, and it can do so faster and more systematically than a human.
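
As a minimal illustration, here is a sketch using scikit-learn's gradient boosting as a stand-in for XGBoost, trained on synthetic data where fraud follows exactly the risky combination from the rule above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic events with two categorical fields; fraud occurs exactly when
# the risky combination (Vietnam IP + Chase card) appears.
countries = rng.choice(["Vietnam", "Canada", "US"], size=500)
banks = rng.choice(["Chase", "RBC"], size=500)
y = ((countries == "Vietnam") & (banks == "Chase")).astype(int)

# Simple one-hot encoding of the categorical inputs.
X = np.column_stack([
    countries == "Vietnam", countries == "Canada",
    countries == "US", banks == "Chase",
]).astype(float)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

risky = [[1.0, 0.0, 0.0, 1.0]]   # Vietnam IP + Chase card
safe = [[0.0, 1.0, 0.0, 0.0]]    # Canada IP + other bank
```

The learned trees recover the same country/bank interaction the hand-written rule encoded, but from data rather than from an analyst's investigation.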

The second advantage is that if the ML data has a variety of fraud patterns to learn from, it has a better chance to generalize, since it will give a proper statistical treatment to the various combinations of features. The ML model might therefore generalize in the example above where the fraudster changed their IP address to be located in Indonesia instead of Vietnam to subvert the rule. It could learn, for example, that both Vietnam and Indonesia are similarly unlikely to be used by your typical customer base, and that, when combined with a North American billing address and bank, they produce similarly risky scores. Note that this can be learned even if the model has never before seen an example of fraud matching this exact pattern. Instead, what is required is that the model has seen fraud from both these IP countries and learns they have similar risk profiles. Subverting the model therefore becomes more difficult for fraudsters. If fraudsters do successfully subvert the model, the team can simply re-train and re-deploy the model once they have appropriately labeled examples to use for training. Since the primary intelligence task is now computer-driven, this entire process of improving detection in response to newly discovered fraud can be automated with the proper tooling.

Similar to rules engines, we will need separate models for distinct contexts. Since the model building process is much faster and can be automated, however, the incremental maintenance effort is small once the initial model is designed. Importantly, since the outcome of a model is a continuous value rather than a distinct outcome, risk can be naturally communicated across events. In other words, the risk level at account creation can be directly used when evaluating other events like transactions. The powerful implications of a “trust score” will be discussed in more depth below.
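
One way this communication might look in code (all field names here are hypothetical): the score produced by the registration model is simply carried forward as an input feature to the transaction model.

```python
def transaction_features(txn, account):
    """Assemble inputs for a transaction-risk model.

    The registration model's score is carried forward as a feature, so the
    risk assessed at signup informs every later decision. All field names
    are illustrative.
    """
    return {
        "amount": txn["amount"],
        "account_age_days": txn["day"] - account["created_day"],
        "registration_trust_score": account["registration_score"],
    }

features = transaction_features(
    {"amount": 250.0, "day": 40},
    {"created_day": 10, "registration_score": 0.92},
)
```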

A paradigm for ML-powered fraud detection systems

Beyond simply replacing humans for the task of pattern discovery for fraud detection, an ML powered fraud system allows a totally different approach that can be a powerful mindset shift. This change is enabled by the concept of a trust score, or a number that quantifies the associated trust of an entity or event (or inversely: risk). In contrast to rules, which make binary fraud/not-fraud decisions, a trust score gives the business a continuous notion of risk, and this can be operationalized in a number of impactful ways. Let's define a few concepts to help us think about a trust-score based system.


Our detection system is an ecosystem of models, analysts, rules and investigators. The models assign trust scores and the analysts produce rules that consume this score and have the option of further refining them to address specific needs. Automatic decisions can be made in cases of high confidence, but the investigators still provide the final decisions for tough calls. Critically, instead of making independent decisions for each distinct context, we can continually re-evaluate the same entity across its lifetime and update its trust score based on its evolving behavior. This gives us a singular reference to entity trust/risk that can be used in a broad set of contexts.
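
As one simple sketch of this continual re-evaluation, an entity could carry a running trust score that each new model evaluation updates. The exponential-smoothing update rule below is an illustrative choice, not a prescription:

```python
class Entity:
    """Carries a single trust score for an entity across its lifetime.

    Each new model evaluation is blended into the running score;
    exponential smoothing is just one simple choice of update rule.
    """

    def __init__(self, initial_score=0.5):
        self.trust_score = initial_score

    def observe(self, model_score, weight=0.3):
        # Blend the newest model score into the running trust score.
        self.trust_score = (1 - weight) * self.trust_score + weight * model_score

entity = Entity()
entity.observe(0.9)  # an event the model scored as low-risk
entity.observe(0.9)  # trust grows as legitimate behavior accumulates
```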


The ideal of fraud systems is the notion of prevention, in which fraud is stopped before the loss is incurred. In the case of transactions, this would mean blocking the transaction from going through, or preventing a new account from being created at all. This is a challenge because the business has less data to make a decision than in the case where fraud has already been committed, but it is by far the most valuable place for an intervention. The goal is to move as much detection as possible to the prevention stage. Since these are often real-time decisions, we can evaluate the event using the current trust score in combination with a rules engine. Trust scores make this easier since they provide a holistic view of the entity at the time of evaluation.


When we use the trust score to limit what an entity can do, we are using the containment strategy. For example, high-trust entities will be given the ideal low-friction experience, but low-trust entities will have a human review any transaction exceeding $100 in value. This gives the business a way to operationalize risk-adjusted decisions. Without an ML-powered trust score, it's difficult to imagine how to build such a system unless it was crude, like “all entities created in the last 30 days need human review for transactions exceeding $100”. Such a rule would effectively treat all new entities as low trust and all old entities as high trust, but this would ignore the wealth of data an ML model could use. Containment strategies drive fraudsters to behave more like legitimate customers, because that's the only way they can escape containment.
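
A containment policy like this might be sketched as a simple mapping from trust score and transaction value to an action. The thresholds below are arbitrary examples, not recommendations:

```python
def containment_decision(trust_score, amount):
    """Map a trust score and transaction amount to an action.

    Thresholds are illustrative only; a real policy would be tuned
    against business metrics.
    """
    if trust_score >= 0.8:
        return "pass"  # high trust: ideal low-friction experience
    if trust_score >= 0.5:
        return "pass" if amount <= 100 else "human_review"
    return "human_review" if amount <= 100 else "block"
```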


Once we have detected fraud, we take remediation actions, targeted to the threat and risk level. As examples, we may block a transaction or shut down an account at high risk of fraud, or, if we fear an account take-over scenario, we may force a password change or lock down the account until the user demonstrates their identity to customer service. Having a trust score allows us to take an action that's in line with the risk presented. In all of these cases, we record the actions taken and the final results, which are then used as feedback into our model-building process. This closes the feedback loop, and a well-designed system will make better decisions over time.


In sum, ML leads to systems that are more generalizable and less labor intensive since the core challenge of finding risk patterns is better accomplished by computers. Because of their ability to understand risk systematically and statistically, trust-score-based systems have a nuanced understanding of risk. With this nuance, we can make risk-adjusted decisions in a broad set of contexts that prevent fraud in the ideal case, and greatly constrain its impact in others.

These systems are, however, a feedback loop: the models create the trust scores, the trust scores drive the investigations, and the investigation results provide the labels used to build the models. While powerful, there are a number of challenges inherent in such a system, and the next couple of blog articles will dive deeper into them.

Zak Jost
ML Scientist @ AWS; Blogger; YouTuber