Fraud · AML · AI in Fintech · May 20, 2026 · 12 min read

Why AI / ML Solutions Fail In Production Payments: Seven Patterns I See Every Year

Most AI/ML projects in payments fail in production for reasons that have nothing to do with model accuracy. They fail because the team optimised for a leaderboard metric, the operating environment moved, the labels were wrong, or the audit cycle the model now lives inside was not part of the design. Seven patterns I see every year.

By Rizwan Zafar

Published: May 20, 2026


The standard model post-mortem in payments reads like a war story: the model trained well in development and scored 0.92 AUC, it was deployed to production, and within 60 days the team is back in the room asking why the false-positive rate doubled, merchant complaints spiked, and the regulator has a question.

In every case I have reviewed, the answer is not "the model was bad" or "the data was bad". The answer is one of seven structural patterns that the team did not address before deployment. The patterns are not surprises; every senior ML practitioner can name them. What is surprising is how routinely they get skipped because the team is optimising for the wrong thing: a leaderboard metric, a launch date, a board demo, a vendor pitch.

This is the operator's catalogue of the seven patterns: what each one looks like in payments specifically, and what the senior PM does about it at design time, not at incident time.

Pattern 1: Concept drift the team did not measure for

The story. A fraud-detection model trained on 18 months of transactions is deployed. Six months later, the false-positive rate has climbed 40%. The team retunes. Two months later, another spike. The cycle repeats.

The root cause. The behaviour the model learned to detect is not stationary. Fraud patterns evolve, bad actors adapt to the model, the legitimate-traffic mix changes (new market, new merchant category), the regulator pushes a behavioural shift (PSD2 SCA, tokenisation), and the schemes change the rules. The model that captured the world in 2022 is fitting a fading distribution by 2024.

The payments specifics. Fraud models drift faster than most ML domains because the adversary is intentional and the regulatory environment is unstable. A model that drifts gracefully in image classification can drift catastrophically in payments fraud.

What the senior PM ships at design time. Concept-drift monitoring as a first-class deliverable, not an afterthought. Population stability index (PSI) per feature, KS-test on score distributions, drift alarms wired into incident response. Model retraining cadence agreed upfront (typically every 30–90 days for fraud, longer for credit scoring). Champion-challenger architecture so the next-generation model is always in shadow mode.
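A minimal sketch of the two drift statistics named above, in plain Python. The function names, the quantile binning, and the `eps` floor for empty bins are illustrative assumptions, not a prescribed implementation. A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the alarm threshold should be set per feature from your own history.

```python
import math
from bisect import bisect_right


def psi(reference, current, bins=10, eps=1e-6):
    """Population stability index of `current` against `reference`.

    Bin edges are quantiles of the reference sample, so each reference
    bin holds roughly the same share of the data.
    """
    ref = sorted(reference)
    # Interior cut points at the 1/bins, 2/bins, ... quantiles of the reference.
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def shares(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps keeps the logarithm finite when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = shares(reference), shares(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs, checked at every observed value."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

In use, `reference` would be the training-window values of one feature (or the training-window scores) and `current` the most recent production window; both statistics are zero when the two samples are identical and grow as the distributions separate.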

Pattern 2: Label leakage

The story. The model performs astonishingly well in development (0.96 AUC). Deployed. Performance collapses in production (0.72). The team is mystified.

The root cause. A feature that was available at training time was implicitly using information that would not be available at inference time. The classic example: a "transaction was reviewed by the fraud team" feature that is only set after the review. The model effectively cheated.

The payments specifics. Label leakage is endemic in payments because the labels (fraud / not fraud) arrive days to weeks after the transaction, the data warehouses are built for analytics not for ML training, and the feature pipelines often include flags set during the dispute lifecycle.

What the senior PM ships at design time. Strict point-in-time correctness on the training data pipeline. Every feature is timestamped; the training pipeline rejects any feature value with a timestamp later than the transaction it describes. Model-development environments use the same feature pipeline as production (the "online-offline parity" discipline). Validation includes a "future-proof" replay against the most recent month of transactions where labels are still being collected.
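The timestamp rule above can be sketched as a filter in the training pipeline. This is an illustration under stated assumptions: `FeatureValue`, `observed_at`, and `point_in_time_row` are hypothetical names, and a real feature store would apply the same rule inside its join logic rather than row by row.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FeatureValue:
    name: str
    value: float
    observed_at: datetime  # when this value became known to the platform


def point_in_time_row(transaction_time, feature_values):
    """Build one training row using only values known at transaction time.

    Returns (row, rejected): `row` maps each feature name to its latest
    value observed at or before the transaction; `rejected` lists every
    value stamped after the transaction, which would leak the future.
    """
    row, latest_seen, rejected = {}, {}, []
    for fv in feature_values:
        if fv.observed_at > transaction_time:
            rejected.append(fv)
            continue
        if fv.name not in latest_seen or fv.observed_at > latest_seen[fv.name]:
            latest_seen[fv.name] = fv.observed_at
            row[fv.name] = fv.value
    return row, rejected
```

A flag such as "reviewed by the fraud team", stamped two days after the transaction, lands in `rejected` and never reaches the model. Logging the rejected list per training run is a cheap way to see which upstream tables are carrying dispute-lifecycle information.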

Pattern 3: The ops integration was bolted on

The story. Model ships. Predictions are accurate. The fraud-ops team complains that the false-positive cases the model flagged are coming through with no context, and the case-investigation time has doubled.

The root cause. The model produced a score; the team treated the deployment as complete; the operations workflow that consumes the score was an afterthought. Investigators get a transaction with a "high-risk" flag and no explanation. Each case takes longer because the human investigator is now doing the model's interpretive work.

The payments specifics. Fraud cases require investigation. AML cases require investigation. Disputes require investigation. The model's score is one signal in a process; the integration into the case-management workflow is what decides whether the model saves ops time or adds to it.

What the senior PM ships at design time. The case-management workflow co-designed with the ops team before model training begins. Investigator-facing explanations (which features drove the score, what historical patterns match). Threshold tuning that is operator-controlled, not buried in a config file. SLA on case-investigation throughput; the model launch is gated on the workflow being ready, not on the model being trained.
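One way to give investigators the "which features drove the score" view is sketched below. It assumes, purely for illustration, a linear log-odds model, where each feature's contribution is its weight times its distance from a baseline value; the feature names and weights are hypothetical, and non-linear models need a dedicated attribution method instead.

```python
def top_reasons(weights, baseline, features, k=3):
    """Rank features by their contribution to a linear log-odds score.

    contribution = weight * (feature value - baseline value)
    Returns the k largest contributions by absolute size.
    """
    contributions = {
        name: weights[name] * (features[name] - baseline[name])
        for name in weights
    }
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]
```

The output is meant to sit next to the case in the investigator's queue, for example "account is 10 days old against a typical 400" rather than a bare high-risk flag.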

Pattern 4: The model has no governance layer

The story. A year after deployment, the regulator asks: "tell us how your fraud model makes decisions." The team's answer is "AUC was 0.91". The regulator asks for the model documentation, change log, performance monitoring, and bias testing. The team has none of the four.

The root cause. The model was treated as an engineering artefact, not as a governed asset. The team that owned the model did not include risk management; the risk function did not have a role in deployment; the model documentation was a Confluence page nobody owned.

The payments specifics. Models in payments touch protected categories (consumer credit, account access, payment authorisation). Most jurisdictions now have explicit model-risk-management expectations: SR 11-7 in the US, PRA SS1/23 in the UK, EBA guidelines in the EU, and equivalents in MENA and Asia. The regulator does not ask "is the model good"; they ask "is the model governed".

What the senior PM ships at design time. Model governance as part of the launch package. Documentation that names the model purpose, training data, validation methodology, performance monitoring, change log, bias and fairness assessment, rollback plan. Sign-off from the model risk management function before production. Annual review schedule. The senior PM treats this as launch hygiene; the junior team treats it as compliance overhead.

Pattern 5: Performance metric mismatch

The story. Model scores 0.91 AUC. Deployed. Within 60 days the fraud team is asking why their losses are up while the false-positive rate is also up. Both metrics moved in the wrong direction simultaneously.

The root cause. AUC is a rank-ordering metric. It says "the model can distinguish fraud from non-fraud", but it does not say what the operating point should be. At the operating point chosen for deployment, the model traded off precision and recall in a way that did not match the business cost. A model with 0.91 AUC can absolutely have worse production performance than a model with 0.86 AUC if the operating point on the first model is wrong.

The payments specifics. The cost of a false positive (declined transaction → merchant complaint → potential churn) and the cost of a false negative (fraud loss + chargeback) are usually wildly asymmetric and merchant-specific. A model deployed with one operating point across the portfolio produces over-aggressive declines for some merchants and under-aggressive for others.

What the senior PM ships at design time. Operating points chosen per merchant cohort, not portfolio-wide. Expected-value calibration (value of fraud caught minus false positives × cost of a wrongly declined transaction, evaluated at each candidate threshold) as the operating-point selection metric. Quarterly re-calibration. The model launch is not "model A is now in production"; it is "model A with these operating points per cohort is in production".
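A minimal sketch of per-cohort operating-point selection by expected value. The function names, the `(score, is_fraud, amount)` row layout, and the single cost-of-decline figure per cohort are assumptions for illustration; a production version would also price chargeback fees and customer churn.

```python
def expected_value(threshold, scored, cost_of_false_decline):
    """Net value of declining everything scored at or above `threshold`.

    `scored` is an iterable of (score, is_fraud, amount). A declined
    fraud saves its amount; a declined legitimate transaction costs
    `cost_of_false_decline`.
    """
    value = 0.0
    for score, is_fraud, amount in scored:
        if score >= threshold:
            value += amount if is_fraud else -cost_of_false_decline
    return value


def best_threshold(scored, cost_of_false_decline, candidates):
    """Candidate threshold with the highest expected value."""
    return max(
        candidates,
        key=lambda t: expected_value(t, scored, cost_of_false_decline),
    )


def thresholds_by_cohort(scored_by_cohort, cost_by_cohort, candidates):
    """One operating point per merchant cohort instead of one portfolio-wide."""
    return {
        cohort: best_threshold(rows, cost_by_cohort[cohort], candidates)
        for cohort, rows in scored_by_cohort.items()
    }
```

The same scored history produces different thresholds as the cost of a false decline changes, which is why a single portfolio-wide operating point over-declines for some merchants and under-declines for others.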

Pattern 6: Training data does not represent the production population

The story. Model trained on the platform's historical traffic. Deployed. Performance is decent in the first market but collapses when the platform launches in market 2.

The root cause. The training data represented the first market's transaction mix, MCC distribution, demographic profile, BIN composition, fraud typology. The second market's distribution differs on every axis. The model is being asked to generalise to a population it has not seen.

The payments specifics. Payments traffic is heterogeneous across markets in ways most other domains are not. Card BIN distributions, merchant categories, fraud typologies, regulatory exemption usage, all differ. A model that performs well in UAE is not automatically a good fit in Pakistan, even within the same operator's portfolio.

What the senior PM ships at design time. Per-market validation as a launch gate. Population profile comparison between training data and target market. Either re-training per market, or a model architecture that explicitly conditions on market features. Cohort-level performance monitoring in production.
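Per-market validation as a launch gate can be sketched as below. The `min_auc` bar and the `(score, is_fraud)` row layout are hypothetical, and AUC is used here only as a rank-ordering check; as Pattern 5 argues, the operating point still needs its own per-cohort validation.

```python
def auc(scores, labels):
    """Probability that a random fraud case outscores a random
    non-fraud case, with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one fraud and one non-fraud example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def launch_gate(rows_by_market, min_auc):
    """Return {market: (auc, passed)} for labelled validation rows.

    Each row is (score, is_fraud). A market passes only if the model
    rank-orders that market's own traffic at or above `min_auc`.
    """
    report = {}
    for market, rows in rows_by_market.items():
        value = auc([s for s, _ in rows], [y for _, y in rows])
        report[market] = (value, value >= min_auc)
    return report
```

A model can pass comfortably on the market it was trained on and fail on the new one; the gate makes that visible before launch rather than in the first month of production traffic.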

Pattern 7: No clear failure escalation

The story. Model produces an unusual cluster of high-risk scores in a 30-minute window. Fraud team notices. Nobody knows whether to retrain, roll back, alert the regulator, or wait. By the time the decision is made, four hours have passed and the regulator has independently noticed.

The root cause. The model was deployed without an incident response runbook. The model is a real-time decision system; it can fail in real time; the team did not pre-plan how to respond.

The payments specifics. Models in payments are decision systems that affect customer transactions and merchant relationships in real time. A model incident is operationally equivalent to a payment-system incident, but most teams treat model performance as analytics rather than as operational status.

What the senior PM ships at design time. A model incident response runbook. Defined triggers (PSI threshold breach, score-distribution shift, sudden cluster of unusual scores). Defined actions (roll back to the challenger model, raise the threshold to a fail-safe level, alert the risk-management function, notify the regulator under defined conditions). A drill practised at least quarterly. The model is operated like a payment system, because it is one.
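The trigger-to-action mapping in a runbook can be written down as data, so that the response is decided before the incident rather than during it. The sketch below is illustrative only: the threshold values, the `ModelHealth` fields, and the action names are hypothetical and would come from your own runbook.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ModelHealth:
    max_feature_psi: float   # worst PSI across monitored features
    score_ks: float          # KS statistic of current vs baseline scores
    high_risk_share: float   # share of high-risk scores in the current window


@dataclass(frozen=True)
class Trigger:
    name: str
    fires: Callable[[ModelHealth], bool]
    actions: Tuple[str, ...]


# Hypothetical thresholds; each one should be agreed with risk before launch.
RUNBOOK = (
    Trigger("psi_breach", lambda h: h.max_feature_psi > 0.25,
            ("alert_risk_management",)),
    Trigger("score_shift", lambda h: h.score_ks > 0.15,
            ("alert_risk_management", "rollback_to_challenger")),
    Trigger("high_risk_cluster", lambda h: h.high_risk_share > 0.10,
            ("raise_threshold_to_fail_safe", "alert_risk_management")),
)


def evaluate(health, runbook=RUNBOOK):
    """Return (fired trigger names, de-duplicated actions in runbook order)."""
    fired, actions = [], []
    for trigger in runbook:
        if trigger.fires(health):
            fired.append(trigger.name)
            for action in trigger.actions:
                if action not in actions:
                    actions.append(action)
    return fired, actions
```

Regulator notification is deliberately left out of the automated actions in this sketch; the conditions under which it applies are a policy decision that belongs in the written runbook.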

What ties the seven together

The pattern in the patterns: each one is a design-time discipline that gets sacrificed for launch-time velocity. The team is under pressure to ship the model; the governance, the ops integration, the per-cohort calibration, the drift monitoring, the incident runbook are all "we'll add it after launch". They never are.

The senior PM running an AI / ML programme in payments treats the model itself as ~30% of the deliverable. The other 70% is the seven disciplines above. The model that ships with all seven addressed is the model that survives 18 months in production. The model that ships with three or four of them addressed is the model that produces the post-mortem.

What this means for build / buy / partner

Two practical implications for the build / buy / partner decision:

Most teams below ~$5B TPV should not build their own fraud / risk ML. The seven disciplines above are not what makes the model accurate; they are what makes the model survive. Vendors that have shipped models across multiple platforms have learned the patterns the hard way. Building from scratch means re-learning all seven.

At scale, build for the bespoke layer; buy for the foundation. A $20B TPV acquirer-processor benefits from a custom challenger-scoring model on top of vendor-supplied foundation scoring (sanctions screening, device fingerprinting, baseline fraud score). The seven disciplines apply to the custom layer; the foundation layer carries them from the vendor.

The senior-PM tell

The interview question that distinguishes senior payments PMs on AI / ML programmes: "your model has been in production for six months. It is still performing. What are you measuring?"

The junior answer talks about AUC and accuracy. The senior answer reads: drift indicators (PSI per feature, score-distribution KS-test, label-distribution shift), operating-point performance against the calibration baseline, per-cohort performance, case-investigation throughput by ops, incident frequency, governance posture (have we hit the annual review?). The model itself is one input; the seven disciplines are the dashboard.

That answer is the difference between a programme that survives and a programme that has the second post-mortem.

FAQ

Are these patterns specific to AI / ML, or do they apply to rule engines too? Partially. Concept drift applies to rules (rules become stale); label issues apply to rules (rules are tuned to past labels); ops integration applies to any decision system; governance applies to any system that touches a regulator. The patterns generalise; the cadence at which they bite is faster for ML.

Does buying from a vendor solve all seven? Mostly the first four. The last three (performance metric calibration, training-data representativeness, failure escalation) still need to be done at the platform level even with a vendor model, because the operating points, the cohorts, and the platform's incident response are platform-specific.

Are the patterns the same for LLM applications? The seven generalise. Concept drift becomes prompt / input distribution shift. Label leakage becomes data contamination in the foundation model. Ops integration is the same. Governance includes the model card and the foundation-model provider's terms. Performance metric mismatch is harder for LLMs (the metrics are less standard). Training data representativeness becomes out-of-domain risk: "the foundation model knows what it knows". Failure escalation is the same.

How much engineering do the seven disciplines add to a model launch? Roughly 30–50% of the engineering effort of the model itself, distributed across data engineering, MLOps, and model governance. Teams that include this in their estimate ship reliably; teams that exclude it ship the model and then spend the next two quarters retrofitting.

Who owns model governance: risk, engineering, or product? Joint. Risk owns the policy and the audit posture. Engineering owns the implementation. Product owns the integration into the workflow and the operating-point selection. Senior payments organisations have all three at the design-time review.

Does this map to credit / lending ML? Strongly. Credit scoring has stricter regulatory expectations (fair lending, model risk management, adverse action explainability) but the seven patterns apply identically. Credit ML programmes that ignore them produce regulator findings, not just incidents.

If this resonated, also read Where ML Beats AI: Six Payment Problems an LLM Cannot Touch, AI Fraud Detection vs Rule Engines, and AI in Payments: Three Production Systems and One Banking Pilot.
