Observability in Distributed Payment Systems: Following One Payment Across Every Service
Observability, Payments, Distributed Systems, OpenTelemetry, Backend

How metrics, traces, logs, OpenTelemetry, and structured context turn a failed distributed payment from scattered noise into a complete story.

June 14, 2026 · 9 min read

Let me tell you what a payment failure actually looks like from where I sit.

The customer sees: Payment failed.

I see: a request that touched eight different services, crossed three network boundaries, hit a message broker, and died somewhere in the middle — and now I have to figure out where.

That's the actual problem. Not "our system has no data." Our system has too much data, scattered across too many places, completely disconnected from each other. This article is about how I'd wire everything together so that one failed payment becomes a story you can actually follow.

The System I'm Working With

Here's the simplified flow I'll use throughout:

  1. Merchant
  2. API Gateway
  3. Payment Service
  4. Risk Service
  5. Payment Router
  6. Bank Adapter
  7. Ledger Service
  8. Payment Event (broker)
  9. Notification Worker

Every box is a separate process. Each one can have multiple replicas. The bank adapter is calling an external provider you don't control. The notification worker is consuming events asynchronously — possibly minutes after the original request completed.

From the customer's perspective, this is one transaction. From mine, it's a distributed workflow with eight failure surfaces.

The Debugging Problem Nobody Talks About Honestly

Here's what actually happens when a payment takes four seconds.

The gateway sees: POST /api/payments — 4000ms. That's it. That's what you have without proper observability.

Now you're digging. You search gateway logs. Then payment-service logs. Then bank adapter logs — except those are on three different replicas, and there were 40,000 other payments happening in that same window.

The breakdown you need to see looks like this:

Latency Breakdown

API Gateway:      20ms
Payment Service:  40ms
Risk Service:     30ms
Payment Router:   15ms
Bank Adapter:     3800ms  ← here's your problem
Ledger Service:   45ms

Without distributed tracing, you're manually correlating log timestamps across services and hoping the clocks are in sync. At any reasonable traffic volume, that process is somewhere between painful and impossible.

The issue isn't missing data. The issue is that the data has no connective tissue.
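To make that breakdown concrete, here is a minimal Python sketch that takes the per-service durations from the table above and picks out the dominant span. The `Span` record and the `slowest_span` helper are hypothetical names for illustration, not part of any tracing library; a real trace backend does this work for you once the spans are linked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record: one unit of work in one service."""
    service: str
    duration_ms: int


def slowest_span(spans: list) -> tuple:
    """Return the slowest span and its share of the summed span time."""
    total = sum(s.duration_ms for s in spans)
    worst = max(spans, key=lambda s: s.duration_ms)
    return worst, worst.duration_ms / total


# Durations copied from the latency breakdown above.
trace = [
    Span("api-gateway", 20),
    Span("payment-service", 40),
    Span("risk-service", 30),
    Span("payment-router", 15),
    Span("bank-adapter", 3800),
    Span("ledger-service", 45),
]

worst, share = slowest_span(trace)
print(f"{worst.service}: {worst.duration_ms}ms ({share:.0%} of span time)")
```

The arithmetic is trivial. The hard part is getting six services to agree that these six numbers belong to the same request, which is what the rest of this article is about.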

The Three Signals (and What Each One Is Actually For)

I see teams obsess over one of these and neglect the other two. Don't do that. They solve fundamentally different problems.

Metrics tell you that something is wrong. Numbers aggregated over time — success rate, failure rate, P95 latency, provider error count. Great for alerting. Useless for understanding a specific failure.

Traces tell you where the time went and which operation failed. The complete journey of one request, broken into spans. This is what transforms "the payment was slow" into "the bank adapter took 3.8 seconds."

Logs tell you what exactly happened at a specific moment. The error message, the state, the context. Not useful in isolation, but essential when you're staring at a failed span and need to know why.

Miss any of these, and you're operating blind in one dimension.

Traces: The Part Most Teams Get Wrong

The concept is simple: every service creates a span for the work it does, and all those spans are linked into one trace. In practice, two things trip people up.

1. Context Propagation

Creating spans isn't enough. The services have to agree that their spans belong to the same trace. That happens through headers.

When the gateway calls the payment service, it sends trace context in the HTTP headers. The payment service extracts that context, creates its span as a child, and passes it forward. Every hop in the chain does the same thing.

  1. Gateway trace context
  2. Payment Service (child span)
  3. Risk Service (child span)
  4. Bank Adapter (child span)
  5. Ledger Service (child span)

By default, OpenTelemetry uses the W3C Trace Context standard for this: a traceparent header that carries the trace ID and the caller's span ID. Without it, every service starts its own orphaned trace and you're back to guessing.
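Here is a rough sketch of what each hop does with that header, written in plain Python so the mechanics are visible. In a real service the OpenTelemetry SDK's propagators handle this for you. The function names `start_trace` and `child_traceparent` are my own, and the validation is simplified (for example, it does not reject an all-zero trace ID, which the W3C spec treats as invalid).

```python
import re
import secrets
from typing import Optional

# Simplified W3C traceparent: version 00, 32-hex trace-id, 16-hex parent-id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_trace() -> str:
    """Build a traceparent for a brand-new trace with the sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: Optional[str]) -> str:
    """Return the header this service sends downstream for its own span."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        # No usable upstream context: this is how orphaned traces get created.
        return start_trace()
    trace_id, _caller_span_id, flags = match.groups()
    # Keep the trace-id, mint a new span id for this service's span.
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


gateway = start_trace()
payment = child_traceparent(gateway)
risk = child_traceparent(payment)
print(gateway, payment, risk, sep="\n")
```

All three printed headers share the same trace ID and differ only in the span ID. A service that drops or mangles the header falls into the `start_trace()` branch, and its work shows up as a separate trace.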

2. Async Boundaries

This is where I see even experienced teams drop the ball.

After a payment succeeds, the payment service publishes a payment.completed event. The notification worker picks it up and sends an email. The HTTP request is long gone by then.

If you don't carry trace context through the message, the notification work is invisible. You'll never know if notifications are failing or slow unless something breaks badly enough to be obvious.

The pattern is straightforward:

Publishing:

  1. Read current trace context
  2. Inject it into the message headers
  3. Publish

Consuming:

  1. Extract trace context from message headers
  2. Create a consumer span with that context as its parent
  3. Process

The result: one trace that shows the entire payment lifecycle, HTTP and async included.

  1. Payment request
  2. Risk evaluation
  3. Bank authorization
  4. Ledger write
  5. Event publication
  6. Event consumption
  7. Notification delivery

That's genuinely useful. You can follow a payment from API request all the way to the email the customer received.
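The publish and consume steps can be sketched like this. It is a minimal illustration, assuming an in-memory `queue.Queue` as a stand-in for the broker and a message shape (`type`, `headers`, `payload`) that I made up for the example. The traceparent value is the sample from the W3C Trace Context spec, and a real consumer would validate the header rather than split it blindly.

```python
import queue
import secrets

# In-memory stand-in for the broker; a real system would use Kafka, RabbitMQ, etc.
broker = queue.Queue()


def publish(event_type: str, payload: dict, current_traceparent: str) -> None:
    """Inject the publisher's trace context into the message headers, then publish."""
    broker.put({
        "type": event_type,
        "headers": {"traceparent": current_traceparent},
        "payload": payload,
    })


def consume() -> dict:
    """Read one message and derive the consumer span's context from its headers."""
    message = broker.get_nowait()
    _version, trace_id, producer_span_id, _flags = message["headers"]["traceparent"].split("-")
    return {
        "trace_id": trace_id,                 # same trace as the HTTP request
        "parent_span_id": producer_span_id,   # the span that published the event
        "span_id": secrets.token_hex(8),      # new id for the notification work
        "payment_id": message["payload"]["payment_id"],
    }


publish(
    "payment.completed",
    {"payment_id": "pay_123"},
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
)
span = consume()
print(span["trace_id"], span["parent_span_id"], span["span_id"])
```

The key design choice is that the context travels in message headers, not in the payload, so consumers that don't care about tracing never have to know it exists.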

Metrics: The Label Trap

Metrics are powerful but they have one sharp edge: cardinality.

This seems useful:

payments_total{payment_id="pay_123", trace_id="abc456", merchant_id="merchant_789"}

It's actually a disaster. Payment IDs, trace IDs, merchant IDs — these are unbounded. You're telling Prometheus to maintain a separate time series for every unique combination. That's potentially millions of series, and your monitoring system will buckle.

Metric labels should be bounded categorical values:

payments_total{status="failed", currency="INR", provider="bank_a", risk_decision="rejected"}

Simple rule: if the value can grow without a predictable ceiling, it doesn't belong on a metric label. Payment IDs go in traces and logs. Aggregated counts go in metrics.
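A quick way to feel the difference is to count the series yourself. This sketch simulates 10,000 payments with made-up label values and counts how many distinct label sets (and therefore time series) each labeling scheme produces. `series_key` and `count_series` are hypothetical helpers, not a Prometheus API.

```python
def series_key(metric: str, labels: dict) -> tuple:
    """A time series is identified by the metric name plus its full label set."""
    return (metric, tuple(sorted(labels.items())))


def count_series(samples: list) -> int:
    """Count distinct series that the given label sets would create."""
    return len({series_key("payments_total", labels) for labels in samples})


# 10,000 simulated payments across 2 statuses and 3 providers (made-up values).
payments = [
    {
        "payment_id": f"pay_{i}",
        "status": ("succeeded", "failed")[i % 2],
        "provider": ("bank_a", "bank_b", "bank_c")[i % 3],
    }
    for i in range(10_000)
]

with_ids = count_series(payments)
bounded = count_series(
    [{k: v for k, v in p.items() if k != "payment_id"} for p in payments]
)
print(f"with payment_id: {with_ids} series, bounded labels only: {bounded} series")
```

With `payment_id` as a label, every payment creates its own series, so the count grows with traffic. With only `status` and `provider`, the count is capped at 2 × 3 = 6 no matter how many payments flow through.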

For a payment platform, the metrics that actually matter:

  • Payment throughput and success rate
  • P50 / P95 / P99 latency (per provider, not just overall)
  • Provider error rates
  • Risk rejection rate
  • Ledger write failures
  • Event queue depth

That last one, queue depth, is something a lot of teams add too late. If your notification worker stops consuming, payments are succeeding but customers aren't getting confirmations. CPU, memory, and HTTP status codes will all look healthy. A growing queue is what catches it.
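The "per provider, not just overall" point deserves a worked example. The sketch below uses made-up latencies and a simple nearest-rank percentile to show how one failing provider can hide inside a healthy overall P95. In Prometheus you would typically get the same view from a latency histogram with a provider label rather than from raw samples.

```python
from collections import defaultdict


def percentile(values: list, p: int) -> int:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


# Made-up latencies in ms: bank_a is healthy, bank_b is timing out.
samples = [("bank_a", 300)] * 96 + [("bank_b", 3500)] * 4

by_provider = defaultdict(list)
for provider, latency_ms in samples:
    by_provider[provider].append(latency_ms)

overall_p95 = percentile([ms for _, ms in samples], 95)
per_provider_p95 = {name: percentile(vals, 95) for name, vals in by_provider.items()}
print(overall_p95, per_provider_p95)
```

Here the overall P95 stays at 300ms because bank_b carries only 4% of traffic, while bank_b's own P95 is 3500ms. An alert on the overall number would stay quiet while every bank_b customer waits.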

Structured Logs: Small Change, Massive Payoff

This is probably the cheapest high-value improvement most teams can make.

A plain log:

Payment failed for pay_123 because bank request failed

A structured log:

{
  "level": "ERROR",
  "service": "bank-adapter",
  "message": "bank authorization failed",
  "trace_id": "4fd0bfa1b74312d89c3e5a7016b2f4d0",
  "span_id": "ae21c30b19f44851",
  "payment_id": "pay_123",
  "bank": "demo_bank",
  "error": "authorization_declined"
}

The difference isn't readability — it's queryability. With the structured version, your investigation workflow becomes:

  1. Alert fires: payment failures spiked
  2. Open metric dashboard: failures concentrated on demo_bank
  3. Pull an example trace for a failed payment
  4. Copy the trace_id, search Loki
  5. Every log line from every service that touched that payment comes back
  6. Root cause is usually obvious at this point

Without the trace_id in the log, step 4 is "grep for the timestamp and hope." At scale, that doesn't work.

One more thing: use consistent field names across services. If one service logs payment_id and another logs paymentId and another logs transaction_id, you're doing extra work every time you query. Boring consistency pays dividends during incidents.
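Here is a minimal sketch of how that could look with nothing but Python's standard `logging` module. The `JsonFormatter` class is my own illustration, the field names come from the structured log example above, and the ID values are placeholders. In practice you would pull `trace_id` and `span_id` from the active span instead of passing them by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of field names."""

    FIELDS = ("trace_id", "span_id", "payment_id", "bank", "error")

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for name in self.FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="bank-adapter"))

logger = logging.getLogger("bank-adapter")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in our stream only

logger.error(
    "bank authorization failed",
    extra={
        "trace_id": "4fd0bfa1b74312d89c3e5a7016b2f4d0",
        "span_id": "ae21c30b19f44851",
        "payment_id": "pay_123",
        "bank": "demo_bank",
        "error": "authorization_declined",
    },
)
print(stream.getvalue())
```

Putting the field list in one shared formatter is also the easiest way to enforce naming consistency: a service cannot log `paymentId` if the formatter only knows `payment_id`.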

What Goes in a Span (and What Absolutely Cannot)

Span attributes make traces readable:

payment.id
payment.amount
payment.currency
payment.status
risk.decision
bank.name
provider.response_code

What must never go in a span:

  • Card numbers
  • CVV
  • Full account numbers
  • Auth tokens
  • Raw provider payloads

Observability systems are usually accessible to a large number of engineers and retain data for weeks or months. If you're not careful, your tracing backend becomes an uncontrolled second store for sensitive financial data. That's a compliance problem and a security problem simultaneously.

Treat telemetry data with the same discipline you'd apply to any production datastore.
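One way to enforce that discipline in code is an allowlist: only attributes you have explicitly approved ever reach a span. This is a sketch under my own assumptions. The `safe_attributes` helper and the `card.number` and `provider.raw_payload` keys are hypothetical, and the card number shown is a well-known test value.

```python
# Attribute names taken from the list of safe span attributes above.
ALLOWED_ATTRIBUTES = frozenset({
    "payment.id",
    "payment.amount",
    "payment.currency",
    "payment.status",
    "risk.decision",
    "bank.name",
    "provider.response_code",
})


def safe_attributes(raw: dict) -> dict:
    """Keep only allowlisted keys; anything unknown is dropped, not masked."""
    return {key: value for key, value in raw.items() if key in ALLOWED_ATTRIBUTES}


candidate = {
    "payment.id": "pay_123",
    "payment.currency": "INR",
    "card.number": "4111111111111111",       # must never reach telemetry
    "provider.raw_payload": '{"pan": "..."}',  # raw payloads can hide anything
}
print(safe_attributes(candidate))
```

An allowlist fails safe: a new sensitive field someone adds next quarter is dropped by default, whereas a denylist would let it through until somebody remembers to update it.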

Error Modeling: Not Everything Broken Is an Error

This comes up constantly in payment systems.

When the risk engine rejects a transaction, that's not a bug. That's the system working correctly. The risk service ran, evaluated, and returned a decision:

risk.decision = rejected

The span should not be marked as an error. In OpenTelemetry terms, leave its status Unset (or set it to Ok). The payment outcome is rejected, but the service didn't fail.

Compare that to a bank adapter that throws an exception because the upstream provider returned an unparseable response. That's a genuine error. The span should be marked error, and the exception should be recorded.

Why does this matter? Because if you mark business outcomes as errors, your error rate becomes noise. Alerts fire constantly. Engineers stop responding to them. When something actually breaks, it gets buried.

Model the difference explicitly:

  • Infrastructure failure → error status on span
  • Business rule outcome → appropriate attribute, span status left Unset or set to Ok
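The distinction can be sketched with a tiny stand-in for a span. Nothing here is the OpenTelemetry API: `Span`, `run_in_span`, and `UnparseableProviderResponse` are hypothetical names, and the status strings simply mirror OpenTelemetry's Unset and Error values. A span that is not an error stays Unset unless you explicitly set Ok.

```python
from dataclasses import dataclass, field
from typing import Callable


class UnparseableProviderResponse(Exception):
    """Hypothetical error: the upstream bank returned something we could not parse."""


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span; not the OpenTelemetry API."""
    name: str
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)
    recorded_exceptions: list = field(default_factory=list)


def run_in_span(name: str, operation: Callable[[Span], None]) -> Span:
    """Run the operation; only an exception turns the span into an error."""
    span = Span(name)
    try:
        operation(span)
    except Exception as exc:
        span.status = "ERROR"
        span.recorded_exceptions.append(repr(exc))
    return span


def evaluate_risk(span: Span) -> None:
    # The service ran correctly; the rejection is a business outcome.
    span.attributes["risk.decision"] = "rejected"


def authorize_with_bank(span: Span) -> None:
    raise UnparseableProviderResponse("provider returned an unparseable response")


risk_span = run_in_span("risk.evaluate", evaluate_risk)
bank_span = run_in_span("bank.authorize", authorize_with_bank)
print(risk_span.status, risk_span.attributes)
print(bank_span.status, bank_span.recorded_exceptions)
```

The rejected payment is still fully visible through the `risk.decision` attribute, so you can count and chart rejections without polluting the error rate.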

Infrastructure vs. Business Health: You Need Both

I've seen this exact situation too many times:

CPU usage:       normal
Memory:          normal
HTTP 200s:       normal
Payment success: 12%

An infrastructure-only monitoring setup would page nobody. Business metrics would be screaming.

The service is technically running fine. It's accepting requests, returning 200s, operating within resource limits. It's just failing to actually process payments because a downstream bank integration is broken.

This is why payment observability specifically needs both dimensions. Infrastructure tells you the process is alive. Business metrics tell you the product is working.
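A business-level check for this can be very small. The sketch below is an assumption-heavy illustration: the 95% target and the 50-payment minimum volume are numbers I picked for the example, and in a real setup this logic would live in an alert rule over your payment counters rather than in application code.

```python
def success_rate(succeeded: int, failed: int) -> float:
    """Fraction of payments that succeeded; treat an empty window as healthy."""
    total = succeeded + failed
    return succeeded / total if total else 1.0


def should_page(
    succeeded: int,
    failed: int,
    min_rate: float = 0.95,   # assumed target, tune per business
    min_volume: int = 50,     # assumed floor to avoid paging on tiny samples
) -> bool:
    """Page only when there is enough traffic to trust the ratio and it is below target."""
    enough_traffic = (succeeded + failed) >= min_volume
    return enough_traffic and success_rate(succeeded, failed) < min_rate


# A window matching the snapshot above: HTTP 200s look normal, 12% of payments succeed.
print(should_page(succeeded=12, failed=88))    # True
print(should_page(succeeded=990, failed=10))   # False
```

Note that nothing in this check looks at CPU, memory, or HTTP status. It only asks whether payments are actually completing, which is the question the customer cares about.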

The OpenTelemetry Architecture (Why It's Worth the Upfront Cost)

The last time I had to migrate a monitoring backend, I understood why vendor-neutral instrumentation matters.

The approach I'd use:

  1. Services (instrumented with OTel SDK)
  2. OpenTelemetry Collector
  3. Prometheus (metrics)
  4. Tempo (traces)
  5. Loki (logs)
  6. Grafana (dashboards + correlation)
  7. Alertmanager (alerts)

The Collector is the key piece. Without it, every service needs to know about every backend — addresses, auth, retry behavior, sampling rules. That's configuration debt that compounds fast.

With the Collector, each service knows exactly one thing: where the Collector is. The Collector handles batching, retry, sampling, routing, and backend-specific configuration. If you want to change backends, you change Collector config, not application code.

It's more moving parts upfront. It's significantly less pain over time.

What Good Looks Like During an Incident

Without proper observability, an incident sounds like this:

"The payment API is slow. Maybe the database? Maybe the risk service? Maybe the bank? Let's check every service's logs."

That's a 45-minute investigation for a problem that might take 5 minutes to fix.

With connected telemetry, it sounds like this:

"Payment P95 latency spiked at 14:32. Affects only requests routed to demo_bank. The bank.authorize span went from 300ms to 3.5 seconds. Our internal services are normal. Provider timeout rate is up. Payments on bank_b are fine."

That's a different conversation entirely. You're not guessing. You're reading what the system is telling you.

What This Doesn't Solve

I want to be honest about scope. A well-instrumented payment system is not a complete payment platform. A real production deployment also needs:

  • Tail-based sampling (you cannot store every trace at scale)
  • Data redaction at the Collector level
  • Role-based access to dashboards
  • Long-term retention policies and cost controls
  • Dead-letter queues and reconciliation
  • Idempotency across retries
  • Exactly-once financial effects
  • Audit trails
  • Compliance requirements (PCI-DSS changes what you can log)

Observability is the operational layer on top of all of that. It makes the system understandable, not correct. You still have to build the correctness.

Final Thought

By their nature, distributed systems make failures harder to understand. A single customer operation touches multiple services, external providers, databases, and async workers. The failure modes multiply with every boundary you cross.

The point of observability isn't pretty dashboards. It's reducing the time between "something is wrong" and "we know exactly what's wrong and where."

Metrics give you the signal. Traces give you the path. Logs give you the detail. OpenTelemetry connects them across every boundary.

Once that context is preserved end-to-end — across HTTP, external providers, databases, message brokers, async consumers — a failed payment stops being a mystery. It becomes a complete story you can follow from the first byte received at the gateway to the last event consumed by the notification worker.

That's what you're building toward. Everything else is just tooling.

Bhupesh Kumar

Backend engineer building scalable APIs and distributed systems with Node.js, TypeScript, and Go.