Payment Processing Failure Modes
What breaks, how to detect it, and how to fix it. Each failure mode includes detection signals, mitigations, and an impact estimate.
Network timeout between gateway and card network
A single timed-out charge can result in a double charge (if retried) or a lost payment (if abandoned). At a $50 average charge, even a 0.01% incidence across 432M daily transactions is 43,200 affected charges, or $2.16M at risk.
Constraint: card network calls take 200-500ms normally. During degradation, latency climbs past 2s and our 2-second timeout fires. We sent the authorization to Visa, but we do not know whether the card was charged; the issuing bank may have approved the charge after the timeout. What breaks: if we retry, we risk a double charge. If we mark it failed, we risk a lost payment (the customer's card may have been charged with no successful record on our side). Detection: card network adapter timeout rate exceeding 1%; transaction 'unknown' state count growing. Recovery: mark as 'unknown', reconcile at T+1 against the settlement file.
- Mark timed-out transactions as 'unknown' (not failed, not succeeded). Never retry a timed-out charge.
- Reconciliation job at T+1 compares our records against the card network's settlement file. If the charge went through, update to 'authorized'. If not, the hold auto-expires within 7 days.
- If one card network (Visa) has elevated timeouts, route new charges to a backup network (Mastercard) for dual-network cards.
- Alert on-call if timeout rate exceeds 5%. Consider pausing new charges to that network until latency recovers.
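The timeout-and-reconcile flow above can be sketched as follows. The function name and the settlement-file shape (a list of rows with a `txn_id` field) are illustrative assumptions; real settlement formats are network-specific.

```python
# Sketch of the T+1 reconciliation job for 'unknown' transactions.

def reconcile_unknown(unknown_txn_ids, settlement_rows):
    """Resolve each 'unknown' transaction against the T+1 settlement file.

    Present in the file  -> the charge went through: mark 'authorized'.
    Absent from the file -> the hold will auto-expire: mark 'expired'.
    """
    settled = {row["txn_id"] for row in settlement_rows}
    return {
        txn_id: ("authorized" if txn_id in settled else "expired")
        for txn_id in unknown_txn_ids
    }
```

A timed-out charge only ever leaves the 'unknown' state through this job, never through a retry.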
Double charge from idempotency failure
Zero tolerance. A single double charge triggers a customer refund, merchant notification, regulatory reporting, and a root cause analysis. A database-level UNIQUE constraint rules out duplicates as long as the schema is correct.
Constraint: two identical charge requests arrive within 1ms of each other (client retry, load balancer duplicate). If idempotency is tracked in Redis, separate from the transaction store (PostgreSQL), a Redis failure between the idempotency write and the charge commit can lose the idempotency record while the first charge still goes through. The second request then passes the now-empty check and creates a duplicate. What breaks: the customer is charged twice, merchant trust is damaged, and a regulatory violation must be reported. Recovery: a PostgreSQL UNIQUE constraint in the same transaction as the charge makes the double insert impossible.
- PostgreSQL UNIQUE constraint on (merchant_id, idempotency_key) enforced in the same ACID transaction as the charge INSERT. Database-level enforcement is immune to application bugs.
- If UNIQUE violation occurs, return the original transaction (not an error). The client sees the same successful response regardless of retries.
- Monitor idempotency constraint violation rate. A spike indicates a client integration issue (same key, different params) or a retransmission storm.
- 24-hour idempotency key retention. Purge older keys via time-based table partitioning.
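A minimal sketch of the UNIQUE-constraint pattern, using SQLite in place of PostgreSQL so the example is self-contained (the constraint semantics are the same; `sqlite3.IntegrityError` stands in for PostgreSQL's unique-violation error, and the table layout is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE charges (
        id              INTEGER PRIMARY KEY,
        merchant_id     TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        amount_cents    INTEGER NOT NULL,
        UNIQUE (merchant_id, idempotency_key)
    )
    """
)

def create_charge(conn, merchant_id, key, amount_cents):
    """Insert the charge; on a duplicate key, return the original row.

    The INSERT and the constraint check happen in one transaction, so no
    application-level race can produce two rows for the same key.
    """
    try:
        with conn:  # one transaction: commit on success, rollback on error
            cur = conn.execute(
                "INSERT INTO charges (merchant_id, idempotency_key, amount_cents)"
                " VALUES (?, ?, ?)",
                (merchant_id, key, amount_cents),
            )
            return cur.lastrowid, False        # (charge id, is_duplicate)
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT id FROM charges WHERE merchant_id = ? AND idempotency_key = ?",
            (merchant_id, key),
        ).fetchone()
        return row[0], True                    # same id the first call returned
```

The first call and any retry with the same key return the same charge id, so the client sees one successful response regardless of retries.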
Webhook delivery failure to merchant
Unfulfilled orders lead to customer complaints and merchant churn. At 1% failure rate: 4.3M undelivered webhooks/day. DLQ and retry pipeline prevent permanent loss, but delayed notification affects merchant operations.
Constraint: merchant servers are outside our control. They may return 500, timeout, or be completely down. At 15K webhooks/sec, even a 1% failure rate means 150 failures/sec. What breaks: the merchant does not know a charge succeeded, fails to fulfill an order, and the customer contacts support. At scale, undelivered webhooks create a backlog of unfulfilled orders. Detection: webhook delivery success rate dropping below 99.9%. DLQ depth growing. Recovery: exponential backoff + DLQ.
- Exponential backoff: retry at 1 min, 5 min, 30 min, 2 hours, 24 hours. Five retries span ~26 hours.
- After 5 failed retries, move event to dead letter queue (DLQ). Notify merchant via email.
- Merchant dashboard allows replaying events from the DLQ. Events include an event_id for dedup on the merchant side.
- Circuit breaker: if a merchant's endpoint fails 10 times consecutively, pause delivery and alert the merchant. Resume on manual reset.
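The retry schedule and circuit breaker above can be sketched together; the class and function names are illustrative, not our delivery pipeline's actual API:

```python
# Retry delays in minutes: 1m, 5m, 30m, 2h, 24h (1,596 min total, ~26.6h).
RETRY_DELAYS_MIN = [1, 5, 30, 120, 1440]

def next_retry_delay(failed_attempts):
    """Minutes until the next attempt, or None once the event goes to the DLQ."""
    if failed_attempts < 1 or failed_attempts > len(RETRY_DELAYS_MIN):
        return None
    return RETRY_DELAYS_MIN[failed_attempts - 1]

class EndpointCircuitBreaker:
    """Pause delivery to a merchant endpoint after consecutive failures.

    Resumes only on manual reset, per the runbook above.
    """

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.open = False          # open = delivery paused

    def record_result(self, delivered):
        if delivered:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.open = True   # alert the merchant; wait for manual reset

    def manual_reset(self):
        self.consecutive_failures = 0
        self.open = False
```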
Card network returns ambiguous response
Ambiguous responses at scale create merchant confusion and reconciliation overhead. At 0.1% ambiguity rate: 432K ambiguous transactions/day requiring manual or automated resolution.
Constraint: card networks return structured responses, but some are ambiguous. Code '05' (Do Not Honor) means temporary decline at one bank, permanent block at another. Some networks return '00' (Approved) with a partial amount. What breaks: we mark the charge as authorized, but the settlement file shows a different status. Reconciliation catches it at T+1, but the merchant already fulfilled the order.
- Maintain a per-network response code mapping table. Classify each code as: approved, declined, retry-safe, or ambiguous.
- For ambiguous responses, mark the transaction as 'unknown' and reconcile at T+1, same as timeout handling.
- For partial approvals, capture only the approved amount and notify the merchant of the shortfall.
- Alert on-call when reconciliation delta for any network exceeds $1,000 in a single day.
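The mapping table can be sketched as a dict keyed by (network, code). The entries below are illustrative, not the maintained table; the important property is the default, so that any unmapped code falls into the same 'unknown'-and-reconcile path as a timeout:

```python
# (network, response code) -> classification. Illustrative entries only.
RESPONSE_CLASS = {
    ("visa", "00"): "approved",
    ("visa", "05"): "ambiguous",    # Do Not Honor: semantics vary by issuer
    ("visa", "51"): "declined",     # insufficient funds
    ("visa", "91"): "retry-safe",   # issuer unavailable
    ("mastercard", "00"): "approved",
    ("mastercard", "05"): "ambiguous",
}

def classify_response(network, code):
    """Classify a network response; unmapped codes are treated as ambiguous."""
    return RESPONSE_CLASS.get((network, code), "ambiguous")
```

Defaulting to 'ambiguous' is the safe failure mode: a mis-mapped code costs a reconciliation pass, not an incorrect authorization.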
Database failover during in-flight transaction
WAL and synchronous replication prevent data loss for committed transactions. Idempotency keys enable safe retries for in-flight transactions. Customer impact: 1-5 seconds of elevated latency during failover.
Constraint: PostgreSQL primary handles all writes. If the primary fails during a COMMIT, the transaction may or may not be committed. The WAL (Write-Ahead Log) ensures durability for committed transactions, but an in-flight transaction (COMMIT sent, ack not received) is ambiguous. What breaks: the client does not know if the charge was created. The standby is promoted to primary, but the in-flight transaction may not have replicated. Detection: PostgreSQL primary health check fails. Replication lag alert fires. Unconfirmed transaction count grows.
- Synchronous replication (synchronous_commit = on with a standby listed in synchronous_standby_names): COMMIT does not return until the WAL is flushed on both primary and standby, so any transaction that committed, even if the ack never reached the client, survives failover.
- Client retries with the same idempotency key. If the transaction was committed on the new primary, the UNIQUE constraint returns the existing record. If not, a fresh transaction is created.
- Connection pool automatically reconnects to the new primary via a floating VIP (virtual IP) or DNS failover.
- Alert on-call immediately. Verify no data loss by comparing transaction count before and after failover.
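The client-side retry path can be sketched as below. `submit_charge` is a hypothetical transport call that raises `ConnectionError` during the failover window; the key point is that the same idempotency key is reused on every attempt:

```python
import time

def charge_with_retry(submit_charge, merchant_id, idempotency_key,
                      amount_cents, max_attempts=3, base_delay_s=0.5):
    """Retry a charge across a failover, reusing one idempotency key.

    If the original COMMIT survived on the new primary, the retry hits the
    UNIQUE constraint and returns the existing record; otherwise a fresh
    transaction is created. Either way, exactly one charge exists.
    """
    for attempt in range(max_attempts):
        try:
            return submit_charge(merchant_id, idempotency_key, amount_cents)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # failover took longer than our retry budget
            # 0.5s, 1s, ... backoff spans the 1-5s failover window
            time.sleep(base_delay_s * 2 ** attempt)
```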
PCI audit log gap
A PCI audit log gap does not affect runtime operations, but it threatens compliance certification. Loss of PCI-DSS certification means the company cannot process card payments.
Constraint: PCI-DSS requires continuous audit logging of all access to cardholder data. If the logging pipeline (application -> Kafka -> audit store) drops events, the audit trail has a gap. This can happen if Kafka partitions are unavailable, the audit store is full, or the application silently swallows logging errors. What breaks: the next PCI audit finds a gap, potentially resulting in compliance failure and loss of card processing privileges. Detection: audit event count for a time window is lower than expected based on transaction volume.
- Write audit events synchronously to a local file before sending to Kafka. If Kafka is down, the local file serves as a buffer. A catch-up process replays local files when Kafka recovers.
- Reconcile audit event count against transaction count daily. Any gap triggers an alert and investigation.
- Audit store uses append-only tables with no DELETE permissions. Even database admins cannot remove audit entries.
- Retain audit logs for 1 year per PCI-DSS requirement. Archive to cold storage (S3 Glacier) after 90 days.
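The buffer-first write can be sketched as follows. `send_to_kafka` is a stand-in for the real producer call, and the file format (one JSON line per event) is an assumption:

```python
import json
import os

def log_audit_event(event, buffer_path, send_to_kafka):
    """Append the event to a local file, fsync, then attempt Kafka.

    The local write is the durability guarantee: if Kafka is down, the
    event stays in the buffer and the catch-up process replays it later.
    """
    line = json.dumps(event, sort_keys=True)
    with open(buffer_path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # on disk before we even try Kafka
    try:
        send_to_kafka(line)
    except Exception:
        pass  # intentionally swallowed: the buffer already holds the event

def replay_buffer(buffer_path, send_to_kafka):
    """Catch-up: re-send every buffered line once Kafka recovers."""
    with open(buffer_path) as f:
        for line in f:
            send_to_kafka(line.rstrip("\n"))
```

Note the contrast with the failure mode above: the only place an error is swallowed is after the event is already durable on local disk.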