Last month, I ordered food on Swiggy. Clicked "Pay ₹847" via UPI. Payment went through in 1.2 seconds.
What I didn't see: my payment was first sent to Razorpay's primary gateway, which timed out. It was then instantly rerouted through Paytm's gateway, which also failed. Finally, PhonePe's gateway processed it successfully.
All of this happened in under 2 seconds, and I had no idea three different systems were involved.
This is payment gateway routing—one of those invisible backend systems that seems simple but is actually a fascinating distributed systems problem. After building my own routing system to understand how it works, I want to share what happens behind that "Pay" button.
Why Can't You Just Use One Payment Gateway?
When I first thought about payment systems, my mental model was simple: your app talks to Razorpay, Razorpay talks to your bank, money moves. Done.
Reality is messier.
In India, a typical payment company integrates with 10-15 different gateways: Razorpay, Juspay, PayU, Paytm, PhonePe Switch, Cashfree, CCAvenue, BillDesk, and more.
Why so many? Because each gateway has different strengths.
Gateway Strengths
- Gateway A might have: 97% success rate for HDFC UPI, 82% for Axis Bank cards, ~800ms latency, ₹2 per transaction.
- Gateway B might have: 89% success for HDFC UPI, 95% for Axis Bank cards, ~1200ms latency, ₹1.5 per transaction.
If you're processing an Axis Bank card payment, you'd want Gateway B. For HDFC UPI, you'd want Gateway A. But how do you decide this in real-time, at scale? That's the routing problem.
What Happens When You Click "Pay"
Let me walk through what actually happens during a payment, based on what I learned building this system.
Step 1: The Scoring Decision (~5 milliseconds)
The moment your payment request arrives, the routing system needs to pick which gateway to try first. It doesn't just round-robin or randomly pick—it scores every available gateway based on multiple factors:
Scoring Factors
- Success Rate (35% weight): In the last 5 minutes, how many UPI payments from HDFC accounts succeeded on this gateway?
- Latency (25% weight): What's the P95 latency? 95% of requests complete faster than this time.
- Method Affinity (15% weight): Is this gateway good at UPI vs Cards vs Netbanking?
- Bank Affinity (12% weight): An HDFC card through HDFC's gateway tends to have higher success rates.
- Amount Fit (8% weight): Is this gateway optimized for small (₹100-500) or large (₹10,000+) payments?
- Time-of-Day Penalty (5% weight): Does historical data show this gateway struggles at 12 PM or month-end?
All of this gets combined into a single score. The gateway with the highest score wins and gets tried first.
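A minimal sketch of that weighted sum in Rust, not the project's actual code. The weights come from the list above; the struct, the field names, and the assumption that every signal is already normalized to 0.0..=1.0 (higher is better) are illustrative.

```rust
/// Per-gateway signals, each assumed to be normalized to 0.0..=1.0
/// (higher is better). How raw latency or amount is mapped into that
/// range is left out of this sketch.
#[derive(Debug, Clone)]
pub struct GatewaySignals {
    pub name: &'static str,
    pub success_rate: f64,
    pub latency_score: f64,
    pub method_affinity: f64,
    pub bank_affinity: f64,
    pub amount_fit: f64,
    pub time_of_day: f64, // 1.0 means no time-of-day penalty
}

/// Weighted sum using the weights listed above (they add up to 100%).
pub fn score(g: &GatewaySignals) -> f64 {
    0.35 * g.success_rate
        + 0.25 * g.latency_score
        + 0.15 * g.method_affinity
        + 0.12 * g.bank_affinity
        + 0.08 * g.amount_fit
        + 0.05 * g.time_of_day
}

/// Returns gateway names ordered best-first; the first entry is tried first.
pub fn rank(gateways: &[GatewaySignals]) -> Vec<&'static str> {
    let mut scored: Vec<(&'static str, f64)> =
        gateways.iter().map(|g| (g.name, score(g))).collect();
    // Sort descending by score.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(name, _)| name).collect()
}
```

Returning the full ranking rather than only the winner gives the retry loop its fallback order for free.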
Step 2: The Circuit Breaker Check (~1 millisecond)
Before actually calling the gateway, there's a critical safety check: Is this gateway even healthy?
Circuit breakers track three states:
- CLOSED (Healthy): Everything's normal. Success rate above 60%, timeout rate low. All requests go through.
- OPEN (Unhealthy): Gateway is having issues. Maybe success rate dropped to 35%, or 10 requests in a row timed out. Don't try this gateway—route to backup immediately.
- HALF-OPEN (Recovery): After 30 seconds of being OPEN, send 10% of traffic as probes. If probes succeed, return to CLOSED. If they fail, go back to OPEN.
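The three states can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not the project's implementation: the thresholds (60% success rate, 10 consecutive timeouts, 30-second cooldown, 10% probe traffic) come from the text above, while the type names, the caller-supplied clock and random roll, and the rule that a single successful probe closes the circuit are assumptions made to keep the example short.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Closed,
    Open { since: Duration },
    HalfOpen,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Success,
    Failure,
    Timeout,
}

pub struct CircuitBreaker {
    state: State,
    consecutive_timeouts: u32,
}

const MIN_SUCCESS_RATE: f64 = 0.60;
const MAX_CONSECUTIVE_TIMEOUTS: u32 = 10;
const COOLDOWN: Duration = Duration::from_secs(30);
const PROBE_FRACTION: f64 = 0.10;

impl CircuitBreaker {
    pub fn new() -> Self {
        CircuitBreaker { state: State::Closed, consecutive_timeouts: 0 }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// Should this request be sent to the gateway?
    /// `now` is time since process start; `probe_roll` is a uniform random
    /// number in [0, 1) supplied by the caller.
    pub fn allow(&mut self, now: Duration, probe_roll: f64) -> bool {
        if let State::Open { since } = self.state {
            if now.saturating_sub(since) >= COOLDOWN {
                self.state = State::HalfOpen;
            }
        }
        match self.state {
            State::Closed => true,
            State::Open { .. } => false,
            State::HalfOpen => probe_roll < PROBE_FRACTION,
        }
    }

    /// Record the result of a call. `window_success_rate` is the gateway's
    /// recent success rate, computed elsewhere.
    pub fn record(&mut self, now: Duration, outcome: Outcome, window_success_rate: f64) {
        let current = self.state;
        match current {
            State::HalfOpen => {
                if outcome == Outcome::Success {
                    // Assumption: one good probe is enough to close the circuit.
                    self.consecutive_timeouts = 0;
                    self.state = State::Closed;
                } else {
                    self.state = State::Open { since: now };
                }
            }
            State::Closed => {
                if outcome == Outcome::Timeout {
                    self.consecutive_timeouts += 1;
                } else {
                    self.consecutive_timeouts = 0;
                }
                if window_success_rate < MIN_SUCCESS_RATE
                    || self.consecutive_timeouts >= MAX_CONSECUTIVE_TIMEOUTS
                {
                    self.state = State::Open { since: now };
                }
            }
            // Results that arrive while OPEN do not change the state.
            State::Open { .. } => {}
        }
    }
}
```

Passing the clock and the random roll in as arguments keeps the state machine deterministic, which makes threshold tuning easier to test.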
Circuit Breaker Example
12:00 PM - Razorpay gateway processing normally (CLOSED)
12:15 PM - Bank maintenance starts, success rate drops to 38%
12:16 PM - Circuit breaker trips to OPEN
12:16-12:46 PM - All payments route to backup gateways (Paytm, PhonePe); each HALF-OPEN probe round fails while maintenance continues, so the circuit keeps returning to OPEN
12:46 PM - Circuit enters HALF-OPEN again, sends 10% probe traffic
12:48 PM - Probes succeed (bank maintenance is over)
12:48 PM - Circuit returns to CLOSED
The user never knew anything was wrong.
Step 3: The Actual Gateway Call (200ms - 3000ms)
When you make an HTTP request to a payment gateway, three things can happen:
1. Success (HTTP 200): Clear outcome. Money moved.
2. Clear Failure (e.g. "Insufficient funds"): Transaction failed for a known reason.
3. Timeout (No response after 2.5 seconds): This is the nightmare scenario. You don't know if the gateway received your request and processed it, failed, or never received it. Retry immediately and you might double-charge. Don't retry and you might lose a legitimate transaction.
Step 4: The Retry Decision (~2 milliseconds)
When a gateway call fails, the system classifies every error:
- Retryable (e.g. NETWORK_ERROR, HTTP_500): Try a different gateway.
- Timeout-like: Don't retry on another gateway. Mark the payment as PENDING_VERIFICATION and enqueue it for the status-check worker.
- User errors (e.g. INSUFFICIENT_FUNDS): Don't retry, show error to user.
Example:
- Attempt 1 → Razorpay returns NETWORK_ERROR (retryable) → try the next gateway.
- Attempt 2 → Paytm returns TIMEOUT → don't retry; mark PENDING_VERIFICATION; return "Payment processing..." to the user.
- Later → a background worker calls Paytm's status API and updates the payment.
This prevents double-charging while handling ambiguous states.
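The classification and the retry loop can be sketched like this. The error names mirror the ones in the text; the enum and function names, and the closure standing in for the real gateway call, are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GatewayError {
    NetworkError,
    Http500,
    Timeout,
    InsufficientFunds,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum NextStep {
    TryNextGateway,
    MarkPendingVerification,
    FailToUser,
}

/// Maps each error to one of the three buckets described above.
pub fn classify(err: GatewayError) -> NextStep {
    match err {
        GatewayError::NetworkError | GatewayError::Http500 => NextStep::TryNextGateway,
        GatewayError::Timeout => NextStep::MarkPendingVerification,
        GatewayError::InsufficientFunds => NextStep::FailToUser,
    }
}

#[derive(Debug, PartialEq)]
pub enum PaymentStatus {
    Succeeded { gateway: &'static str },
    PendingVerification { gateway: &'static str },
    Failed,
}

/// Walks the ranked gateways in order. `call` stands in for the real HTTP
/// call to a gateway.
pub fn run_attempts(
    gateways: &[&'static str],
    mut call: impl FnMut(&'static str) -> Result<(), GatewayError>,
) -> PaymentStatus {
    for &gw in gateways {
        match call(gw) {
            Ok(()) => return PaymentStatus::Succeeded { gateway: gw },
            Err(e) => match classify(e) {
                NextStep::TryNextGateway => continue,
                // Ambiguous outcome: stop here and let a background worker
                // ask this gateway for the final status.
                NextStep::MarkPendingVerification => {
                    return PaymentStatus::PendingVerification { gateway: gw }
                }
                NextStep::FailToUser => return PaymentStatus::Failed,
            },
        }
    }
    // Every gateway failed with a retryable error.
    PaymentStatus::Failed
}
```

Because the timeout branch returns immediately, no later gateway is ever called for a payment whose outcome is still unknown, which is what rules out the double charge.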
Step 5: The Latency Budget
Users won't wait forever. The system has a latency budget—typically 8-10 seconds total for all retry attempts. When the budget is running low, it strategically skips slow gateways and tries faster ones. That's why your payment might succeed on the third try in 4 seconds instead of failing after 12.
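One way to express "skip slow gateways when the budget is running low" is to compare each candidate's P95 latency with the time that is left. This is a sketch under that assumption; the text does not say which latency figure the real system compares against, and the names here are illustrative.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub struct Candidate {
    pub name: &'static str,
    pub p95_latency: Duration,
}

/// Returns the first candidate (in ranked order) whose P95 latency still
/// fits in the remaining budget, or None if the budget is spent or nothing
/// fits.
pub fn next_within_budget<'a>(
    ranked: &'a [Candidate],
    total_budget: Duration,
    elapsed: Duration,
) -> Option<&'a Candidate> {
    // checked_sub returns None when elapsed exceeds the budget.
    let remaining = total_budget.checked_sub(elapsed)?;
    ranked.iter().find(|c| c.p95_latency <= remaining)
}
```

With an 8-second budget and 7 seconds already spent, a gateway with a 1200ms P95 is skipped in favor of one at 800ms, even if the slower one scored higher.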
The Metrics Pipeline: How the System Learns
Payment Attempt → Save to DB → Publish to Event Stream → Metrics Worker. The worker updates sliding-window metrics (1 min, 5 min, 15 min, 60 min), keeping hot metrics in Redis and historical metrics in PostgreSQL. Routing primarily uses the 5-minute window: recent enough to be relevant, stable enough not to overreact to noise.
What Gets Tracked
- Success rate, timeout rate, latency percentiles (P50, P95, P99), error distribution, total volume—per (gateway, payment_method, bank). Data updates every second.
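A sliding-window success rate per (gateway, payment_method, bank) segment could look roughly like this. It is an in-memory sketch, not the Redis-backed version the post describes; the type names are hypothetical, and it assumes events are recorded in timestamp order.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::Duration;

/// (gateway, payment_method, bank)
pub type Segment = (&'static str, &'static str, &'static str);

pub struct SlidingWindow {
    window: Duration,
    events: VecDeque<(Duration, bool)>, // (timestamp, succeeded)
}

impl SlidingWindow {
    pub fn new(window: Duration) -> Self {
        SlidingWindow { window, events: VecDeque::new() }
    }

    pub fn record(&mut self, at: Duration, success: bool) {
        self.events.push_back((at, success));
    }

    /// Success rate over the window ending at `now`; None if there is no data.
    pub fn success_rate(&mut self, now: Duration) -> Option<f64> {
        // Evict events that have fallen out of the window.
        while let Some(&(t, _)) = self.events.front() {
            if now.saturating_sub(t) > self.window {
                self.events.pop_front();
            } else {
                break;
            }
        }
        if self.events.is_empty() {
            return None;
        }
        let ok = self.events.iter().filter(|e| e.1).count();
        Some(ok as f64 / self.events.len() as f64)
    }
}

pub struct Metrics {
    window: Duration,
    by_segment: HashMap<Segment, SlidingWindow>,
}

impl Metrics {
    pub fn new(window: Duration) -> Self {
        Metrics { window, by_segment: HashMap::new() }
    }

    pub fn record(&mut self, seg: Segment, at: Duration, success: bool) {
        let window = self.window;
        self.by_segment
            .entry(seg)
            .or_insert_with(|| SlidingWindow::new(window))
            .record(at, success);
    }

    pub fn success_rate(&mut self, seg: Segment, now: Duration) -> Option<f64> {
        self.by_segment.get_mut(&seg)?.success_rate(now)
    }
}
```

Returning None for an empty window lets the scorer tell "no recent data" apart from "0% success", which matters for newly added gateways.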
A/B Testing and Thompson Sampling
Companies run experiments (e.g. 90% control, 10% treatment) with deterministic assignment by customer_id, then statistical testing and auto-stop guardrails if treatment harms users.
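Deterministic assignment by customer_id usually means hashing the ID into a bucket, so the same customer always lands in the same arm. A sketch, assuming an FNV-1a hash and a 0-99 bucket; the hash choice, the experiment-name salt, and the function names are illustrative and not taken from the project.

```rust
/// 64-bit FNV-1a. Written out by hand so the result stays stable across
/// Rust versions and machines (std's DefaultHasher makes no such promise).
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Variant {
    Control,
    Treatment,
}

/// Puts `treatment_percent` percent of customers into Treatment.
/// Salting with the experiment name keeps assignments independent
/// across experiments.
pub fn assign(experiment: &str, customer_id: &str, treatment_percent: u64) -> Variant {
    let key = format!("{}:{}", experiment, customer_id);
    // Use the high 32 bits, which are better mixed than the low bits
    // in a multiplicative hash like FNV.
    let bucket = (fnv1a(key.as_bytes()) >> 32) % 100;
    if bucket < treatment_percent {
        Variant::Treatment
    } else {
        Variant::Control
    }
}
```

For the 90/10 split above, `assign("new-scoring", "cust_001", 10)` returns the same variant on every call and on every server, with no assignment table to store.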
Thompson Sampling (multi-armed bandit): For each gateway, maintain a Beta(alpha, beta) belief about success rate. When routing, sample a random success probability from each distribution and pick the highest. After the payment, update alpha on success, beta on failure. Gateways with little data get explored; good ones get exploited. The system converges automatically.
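The sample-pick-update loop fits in a few lines. This sketch stays std-only, so it carries its own small random number generator (SplitMix64) and draws Beta samples from sums of exponentials, which is exact only for integer alpha and beta. Both choices are assumptions made for the example; a production system would more likely use an established random-number crate.

```rust
/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no crates.
pub struct Rng(pub u64);

impl Rng {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Uniform in (0, 1], so it is safe to pass to ln().
    fn uniform(&mut self) -> f64 {
        1.0 - (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Gamma(shape = k, scale = 1) for integer k >= 1: a sum of k
    /// exponential samples.
    fn gamma_int(&mut self, k: u32) -> f64 {
        (0..k).map(|_| -self.uniform().ln()).sum()
    }

    /// Beta(a, b) for integers a, b >= 1, as X / (X + Y) with
    /// X ~ Gamma(a) and Y ~ Gamma(b).
    pub fn beta(&mut self, a: u32, b: u32) -> f64 {
        let x = self.gamma_int(a);
        let y = self.gamma_int(b);
        x / (x + y)
    }
}

/// One gateway's belief about its success rate. Start at alpha = beta = 1,
/// which is a uniform prior.
pub struct Arm {
    pub name: &'static str,
    pub alpha: u32,
    pub beta: u32,
}

/// Samples a plausible success rate for each arm and returns the index
/// of the highest sample.
pub fn pick(arms: &[Arm], rng: &mut Rng) -> usize {
    let mut best = 0;
    let mut best_sample = f64::MIN;
    for (i, arm) in arms.iter().enumerate() {
        let sample = rng.beta(arm.alpha, arm.beta);
        if sample > best_sample {
            best = i;
            best_sample = sample;
        }
    }
    best
}

/// After the payment completes: success bumps alpha, failure bumps beta.
pub fn update(arm: &mut Arm, success: bool) {
    if success {
        arm.alpha += 1;
    } else {
        arm.beta += 1;
    }
}
```

An arm with little data has a wide Beta distribution, so it occasionally produces the highest sample and gets tried; that is where the exploration comes from.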
The Architecture
A request flows through these stages:
1. Check idempotency.
2. Build the payment context.
3. Resolve experiments.
4. Score all gateways.
5. Apply bandit reordering (if enabled).
6. Run the retry loop: for each gateway, check the circuit breaker, call the API, update the circuit breaker, classify the error, and check the latency budget.
7. Persist the payment, routing decision, attempts, and outbox event.
8. In the background: metrics worker, verification worker, experiment analyzer, webhooks.
What This Achieves
- Reliability: Failover in <500ms, circuit breakers prevent cascading failures.
- Performance: Sub-10ms routing decisions, intelligent latency budgeting.
- Safety: Idempotency, timeout handling, full audit trail.
- Intelligence: Learns which gateways work best; A/B tests and bandits for optimization.
The Hard Parts
- Timeouts are inherently ambiguous. You can't eliminate them, only mitigate them with aggressive timeouts, async verification, and idempotency.
- Circuit breaker thresholds are highly sensitive; too aggressive or too conservative both hurt.
- Metrics need context—segment by (gateway, method, bank), not just gateway.
- Cold start: New gateways have no data; use a bootstrap period (e.g. 10% forced traffic for first hour).
- Idempotency edge cases: Same key, different amount → hash request body and reject if mismatch.
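The idempotency edge case in the last bullet can be sketched as a lookup keyed by the Idempotency-Key that also stores a hash of the request body. This is an in-memory illustration with hypothetical names. It uses std's DefaultHasher, which is fine inside one process; a store that is persisted or shared between servers would need a hash with a stable, specified output.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
pub enum IdempotencyCheck {
    /// First time this key is seen: process the payment.
    New,
    /// Same key, same body: return the stored payment instead of charging again.
    Replay { payment_id: u64 },
    /// Same key, different body (for example a different amount): reject.
    Conflict,
}

pub struct IdempotencyStore {
    seen: HashMap<String, (u64, u64)>, // key -> (body hash, payment id)
}

fn hash_body(body: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    body.hash(&mut hasher);
    hasher.finish()
}

impl IdempotencyStore {
    pub fn new() -> Self {
        IdempotencyStore { seen: HashMap::new() }
    }

    pub fn check(&mut self, key: &str, body: &str, new_payment_id: u64) -> IdempotencyCheck {
        let body_hash = hash_body(body);
        // Copy the stored entry out so the map is free to be mutated below.
        let existing = self.seen.get(key).copied();
        match existing {
            Some((stored_hash, payment_id)) if stored_hash == body_hash => {
                IdempotencyCheck::Replay { payment_id }
            }
            Some(_) => IdempotencyCheck::Conflict,
            None => {
                self.seen.insert(key.to_string(), (body_hash, new_payment_id));
                IdempotencyCheck::New
            }
        }
    }
}
```

Hashing the raw body means two requests that differ only in whitespace or field order count as different; canonicalizing the JSON first would be a reasonable refinement.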
Why I Built This
I'm a backend engineer who'd never worked on payments. Building this helped me understand each layer: intelligent routing, circuit breakers, metrics pipelines, retry orchestration, A/B testing. It's a distributed systems problem that's business-critical—if routing is down, payments fail; if slow, users bounce; if it double-charges, you lose trust.
The Tech Stack
I built this in Rust for type safety, performance (sub-10ms routing decisions), fearless concurrency, and explicit error handling. Key libraries: Axum (web), SQLx (SQL), Redis (hot metrics + circuit breaker state), Tokio (async). The system has one API server, three background workers (metrics, verification, experiments), PostgreSQL, and Redis.
Try It Yourself
The full codebase is on GitHub. Run locally with Docker, then start the API server and workers. Example request:
```bash
curl -X POST http://localhost:3000/payments \
  -H 'Idempotency-Key: test-123' \
  -H 'Content-Type: application/json' \
  -d '{"amount_minor": 100000, "currency": "INR", "payment_method": "UPI", "merchant_id": "m_001", "customer_id": "cust_001", "instrument": {"type": "UPI", "vpa": "test@okhdfcbank"}}'
```
The README has detailed docs on API endpoints, architecture, and configuration.
Payment routing seems simple on the surface but reveals layers of complexity. The naive solution (try A, then B) works until you hit scale. Then you need intelligence, speed, safety, resilience, and optimization. Building this taught me how companies like Razorpay keep success rates high: it's careful engineering across every layer.
GitHub: https://github.com/Bhup-GitHUB/payments-gateway
LinkedIn: https://www.linkedin.com/in/bhupesh-k-185327366/