RAG for Merchant Integration Support: A Production Playbook
Merchant integration support is the cleanest place to put a GenAI bot in a payments platform. The questions are patterned, the answers live in docs you already maintain, and the failure mode (wrong answer) has a cheap recovery (human takes over).
We shipped this surface at Simpaisa and cut merchant integration support time by ~65%. This is the playbook.
Why RAG, not fine-tuning
Two reasons. First, your integration docs change every sprint: new endpoints, new error codes, new SDKs. Fine-tuned models go stale fast, and re-fine-tuning is slow and expensive. RAG just re-indexes.
Second, you need citations. Every answer the bot gives a merchant should be traceable to a doc page. Without citations you cannot defend the answer in a post-incident review.
The corpus is the product
Spend more time on the corpus than on the model. Three principles:
- Curate, don't dump. Index the docs you actually want the bot to use. If your old v1 docs are still on the site, exclude them. The retrieval layer cannot tell stale from fresh.
- Structure the error catalogue. Decline codes, integration error codes and webhook failure modes should be in a single canonical store with: code, plain-English meaning, common causes, suggested fix, links to relevant doc sections.
- Add the integration playbook. A merchant going from zero to first successful transaction follows the same 8–12 steps. Make that an explicit document, not implicit across 40 pages.
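As a rough illustration of the structured error catalogue described above, one record per code could look like the sketch below. The class name, field names, example code and doc path are all hypothetical; only the five fields themselves come from the list above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ErrorCatalogueEntry:
    """One canonical record per code. Field names are illustrative."""
    code: str
    meaning: str                 # plain-English meaning
    common_causes: list[str]
    suggested_fix: str
    doc_links: list[str] = field(default_factory=list)

    def to_index_text(self) -> str:
        """Flatten the record into a single chunk so retrieval returns it whole."""
        causes = "; ".join(self.common_causes)
        links = ", ".join(self.doc_links)
        return (
            f"Error code {self.code}: {self.meaning}\n"
            f"Common causes: {causes}\n"
            f"Suggested fix: {self.suggested_fix}\n"
            f"See: {links}"
        )


# Hypothetical entry: the code and doc path are made up for this example.
entry = ErrorCatalogueEntry(
    code="WEBHOOK_SIGNATURE_INVALID",
    meaning="The webhook signature did not match the payload.",
    common_causes=["wrong signing secret", "body re-serialised before verification"],
    suggested_fix="Verify against the raw request body using the current secret.",
    doc_links=["/docs/webhooks#signatures"],
)
print(entry.to_index_text())
```

Indexing each entry as one self-contained chunk is a design choice, not a requirement: it keeps the meaning, causes and fix together so the bot does not have to stitch them back from separate retrieval hits.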
Citation discipline is non-negotiable
Every answer must show the doc page(s) it came from. Two enforcement points:
- Prompt the LLM to refuse if it cannot ground the answer in retrieved context.
- Post-process the response: parse out citation references; if zero, drop the answer and hand off to human.
This costs you 5–10% of answer volume in the early weeks. It saves you orders of magnitude more in trust.
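The post-processing check above can be sketched in a few lines. This assumes the prompt tells the model to cite sources as `[doc:<id>]` and to emit a fixed refusal marker when it cannot ground an answer; both conventions, and every name below, are assumptions made for the example. The sketch also goes one step further than counting citations: it only accepts citations that point at chunks that were actually retrieved.

```python
import re
from dataclasses import dataclass

# Assumed citation format: the prompt asks the model to cite as [doc:<id>].
CITATION_RE = re.compile(r"\[doc:([A-Za-z0-9_\-/]+)\]")
# Assumed refusal marker the prompt asks the model to emit when it cannot ground.
REFUSAL_MARKER = "CANNOT_GROUND"


@dataclass
class Decision:
    action: str            # "answer" or "handoff"
    citations: list[str]
    reason: str = ""


def enforce_citations(llm_output: str, retrieved_ids: set[str]) -> Decision:
    """Drop any answer that is not backed by at least one retrieved doc."""
    if REFUSAL_MARKER in llm_output:
        return Decision("handoff", [], "model could not ground the answer")
    cited = CITATION_RE.findall(llm_output)
    # De-duplicate in order, and keep only ids that were really retrieved.
    valid = [c for c in dict.fromkeys(cited) if c in retrieved_ids]
    if not valid:
        return Decision("handoff", [], "no valid citations")
    return Decision("answer", valid)
```

For example, `enforce_citations("Use POST /v2/payments [doc:payments/create].", {"payments/create"})` yields an `answer` decision, while an uncited reply or a citation to a doc that was never retrieved yields `handoff`.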
Fallback paths designed before launch
Three fallback paths to design before you ship:
- Low-confidence handoff. If retrieval similarity is low or the LLM signals uncertainty, route to a human with the question + retrieved context attached.
- Out-of-scope handoff. Questions about billing, contracts, dispute outcomes or anything else money-affecting go to a human. The bot says "I'll route this" and does.
- Explicit human request. If the merchant types "human" or "rep", hand off instantly, with no friction.
If you don't design these paths, the bot will improvise, and that's where damage happens.
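One way to make the three paths explicit is a small routing function that runs before any answer is sent. The sketch below is deliberately naive: the keyword lists and the similarity threshold are placeholders to be tuned, and a production system would more likely use a classifier for scope detection. All names are hypothetical.

```python
import re
from enum import Enum


class Route(Enum):
    BOT = "bot"
    HUMAN_REQUESTED = "human_requested"
    OUT_OF_SCOPE = "out_of_scope"
    LOW_CONFIDENCE = "low_confidence"


HUMAN_WORDS = {"human", "rep"}
# Illustrative keyword list only; real scope detection needs more than substrings.
OUT_OF_SCOPE_TERMS = ("billing", "contract", "dispute", "refund")
MIN_SIMILARITY = 0.75  # placeholder threshold, tune against your own retrieval scores


def route(message: str, top_similarity: float, llm_uncertain: bool) -> Route:
    """Pick a path; explicit human requests win over everything else."""
    text = message.lower()
    words = set(re.findall(r"[a-z]+", text))
    if words & HUMAN_WORDS:
        return Route.HUMAN_REQUESTED
    if any(term in text for term in OUT_OF_SCOPE_TERMS):
        return Route.OUT_OF_SCOPE
    if top_similarity < MIN_SIMILARITY or llm_uncertain:
        return Route.LOW_CONFIDENCE
    return Route.BOT


def handoff_payload(message: str, chosen: Route, retrieved_chunks: list[str]) -> dict:
    """Attach the question and retrieved context so the human starts with both."""
    return {
        "question": message,
        "reason": chosen.value,
        "context": list(retrieved_chunks),
    }
```

The ordering is the design choice worth noting: the explicit request is checked first so a merchant asking for a person is never held back by a confident retrieval score.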
Operating model: who owns the bot?
Three teams co-own it: DevRel/Docs (corpus quality), product engineering (retrieval + LLM ops), and partner success (fallback handling + feedback loop). They meet weekly to review the top questions, the lowest-confidence answers, the fallback rate trend, and a citation accuracy spot-check.
Common failure modes
- Stale doc detected too late. The bot keeps answering with an old endpoint that's been deprecated. Mitigation: a quarterly doc audit + tagging deprecated content explicitly.
- Code-block hallucinations. LLMs love to invent SDK method names. Mitigation: any code block in an answer must be matched against the actual SDK API surface; if no match, strip and hand off.
- Overconfidence on currency conversions or settlement timing. Add specific guardrails for these topics: never let the bot quote a conversion rate, an amount or a settlement date.
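The code-block mitigation above can be approximated with a static check before the answer is sent. This sketch assumes the SDK is called through an object named `client` and that the real API surface is available as a set of dotted method names (for instance, generated from the SDK source or an OpenAPI spec). The `client` name and the surface entries below are hypothetical.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fenced block
CODE_BLOCK_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)
# Assumes SDK calls look like client.<dotted.name>(...); adjust to your SDK.
SDK_CALL_RE = re.compile(r"\bclient\.([A-Za-z_][A-Za-z0-9_.]*)\s*\(")

# Hypothetical API surface; in practice, generate this from the SDK itself.
SDK_SURFACE = {"payments.create", "payments.retrieve", "webhooks.verify"}


def unknown_sdk_calls(answer: str) -> list[str]:
    """Return SDK calls used in code blocks that are not in the known surface."""
    unknown: list[str] = []
    for block in CODE_BLOCK_RE.findall(answer):
        for call in SDK_CALL_RE.findall(block):
            if call not in SDK_SURFACE and call not in unknown:
                unknown.append(call)
    return unknown


def strip_code_blocks(answer: str) -> str:
    """Replace every fenced block with a placeholder before handing off."""
    return CODE_BLOCK_RE.sub("[code removed pending human review]", answer)
```

If `unknown_sdk_calls` returns anything, the answer would be stripped with `strip_code_blocks` and routed to a human. A regex check like this only catches invented method names; it will not catch wrong arguments or a real method used incorrectly.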
What good looks like at 6 months
- 80%+ first-turn resolution on patterned questions
- <2% factual error rate (sample-audited weekly)
- Average time-to-first-successful-API-call for new merchants halved
- Doc team gets a feed of the "bot said X, human corrected to Y" pairs and uses it to improve the docs
The compound effect (better bot → better docs → better bot) is the actual unlock.
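The first two targets above are simple ratios over an audited sample, so they are easy to track once reviewers label conversations consistently. A minimal sketch, assuming each sampled conversation has been restricted to patterned questions and carries three reviewer-set flags (the record shape and names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class AuditedConversation:
    resolved_first_turn: bool
    handed_off: bool
    factual_error: bool  # set by a human reviewer during the weekly audit


def weekly_metrics(sample: list[AuditedConversation]) -> dict[str, float]:
    """Compute the headline rates over one week's audited sample."""
    n = len(sample)
    if n == 0:
        raise ValueError("audit sample is empty")
    return {
        "first_turn_resolution": sum(c.resolved_first_turn for c in sample) / n,
        "fallback_rate": sum(c.handed_off for c in sample) / n,
        "factual_error_rate": sum(c.factual_error for c in sample) / n,
    }


def meets_targets(metrics: dict[str, float]) -> bool:
    """Targets from the list above: 80%+ first-turn resolution, <2% factual errors."""
    return (
        metrics["first_turn_resolution"] >= 0.80
        and metrics["factual_error_rate"] < 0.02
    )
```

With small weekly samples these rates are noisy, so the trend across several weeks is usually more informative than any single reading.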
FAQ
Do I need a vector DB? Yes. Start with whatever is already in your stack (pgvector, Pinecone, Weaviate). The DB choice matters far less than corpus quality.
Closed-model APIs or open-source LLMs? Either works. For merchant-facing surfaces with sensitive data, open-source self-hosted has the edge on data egress. For internal-only tooling, vendor APIs are faster to ship.
How do you measure quality? Weekly: first-turn resolution rate, fallback rate, citation accuracy on a sampled audit. Monthly: merchant CSAT delta vs. control. Quarterly: drift / regression audit on a held-out question set.
What's the biggest mistake teams make? Treating the corpus as a one-time setup. The corpus is the product. Treat it like one.