
Payment Processing Failure Modes

What breaks, how to detect it, and how to fix it. Each failure mode includes detection metrics, mitigations, and a severity rating.

Failure Modes
CRITICAL

Network timeout between gateway and card network

A single timed-out charge can result in a double charge (if retried) or lost payment (if abandoned). At $50 avg, even 0.01% of 432M daily transactions = 43,200 affected charges = $2.16M.

Constraint: card network calls take 200-500ms normally. During degradation, latency climbs to 2s+ and our 2-second timeout fires. We sent the authorization to Visa, but we do not know whether the card was charged; the issuing bank may have approved the charge after the timeout.
What breaks: if we retry, we risk a double charge. If we mark it failed, we risk a lost payment (the customer may have been charged without our knowing).
Detection: card network adapter timeout rate exceeding 1%. Growing count of transactions in 'unknown' state.
Recovery: mark as 'unknown', reconcile at T+1 against the settlement file.

Detection
Card network adapter timeout rate exceeding 1%. Growing count of transactions in 'unknown' state. Card network status page showing degradation.
Mitigation
  1. Mark timed-out transactions as 'unknown' (not failed, not succeeded). Never retry a timed-out charge.
  2. Reconciliation job at T+1 compares our records against the card network's settlement file. If the charge went through, update to 'authorized'. If not, the hold auto-expires within 7 days.
  3. If one card network (Visa) has elevated timeouts, route new charges to a backup network (Mastercard) for dual-network cards.
  4. Alert on-call if timeout rate exceeds 5%. Consider pausing new charges to that network until latency recovers.
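The timeout-and-reconcile flow above can be sketched as follows. This is a minimal illustration, not production code: `send_to_network` is a hypothetical adapter call, and the set of settlement IDs stands in for the T+1 settlement file.

```python
UNKNOWN, AUTHORIZED, FAILED = "unknown", "authorized", "failed"

def authorize(charge, send_to_network, timeout_s=2.0):
    """Attempt authorization; on timeout, park the charge as 'unknown'.

    send_to_network is a hypothetical adapter call that raises
    TimeoutError when the card network exceeds the deadline.
    """
    try:
        approved = send_to_network(charge, timeout=timeout_s)
    except TimeoutError:
        # The card may have been charged after our deadline; only the
        # T+1 settlement file can tell us. Never retry from here.
        return UNKNOWN
    return AUTHORIZED if approved else FAILED

def reconcile(unknown_charges, settlement_ids):
    """T+1 job: promote 'unknown' charges that appear in the settlement
    file; the rest stay 'unknown' and their holds expire within 7 days."""
    return {
        c["id"]: AUTHORIZED if c["id"] in settlement_ids else UNKNOWN
        for c in unknown_charges
    }
```

The key property is that a timeout never maps to 'failed' and never triggers a retry; resolution is deferred to reconciliation.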
CRITICAL

Double charge from idempotency failure

Zero tolerance. A single double charge triggers: customer refund, merchant notification, regulatory reporting, and root cause analysis. The PostgreSQL UNIQUE constraint makes this physically impossible when the schema is correct.

Constraint: two identical charge requests arrive within 1ms (client retry, load balancer duplicate). If idempotency checks live in Redis, separate from the PostgreSQL transaction store, a crash between the Redis write and the PostgreSQL commit loses the idempotency record while the charge still processes. The second request finds no record, passes the check, and creates a duplicate.
What breaks: customer charged twice, merchant trust damaged, regulatory violation.
Recovery: a PostgreSQL UNIQUE constraint makes the double insert physically impossible.

Detection
Idempotency constraint violation count in PostgreSQL logs. Duplicate transaction alerts from reconciliation. Customer support tickets reporting double charges.
Mitigation
  1. PostgreSQL UNIQUE constraint on (merchant_id, idempotency_key) enforced in the same ACID transaction as the charge INSERT. Database-level enforcement is immune to application bugs.
  2. If UNIQUE violation occurs, return the original transaction (not an error). The client sees the same successful response regardless of retries.
  3. Monitor idempotency constraint violation rate. A spike indicates a client integration issue (same key, different params) or a retransmission storm.
  4. 24-hour idempotency key retention. Purge older keys via time-based table partitioning.
HIGH

Webhook delivery failure to merchant

Unfulfilled orders lead to customer complaints and merchant churn. At 1% failure rate: 4.3M undelivered webhooks/day. DLQ and retry pipeline prevent permanent loss, but delayed notification affects merchant operations.

Constraint: merchant servers are outside our control. They may return 500, time out, or be down entirely. At 15K webhooks/sec, even a 1% failure rate means 150 failures/sec.
What breaks: the merchant does not know a charge succeeded, fails to fulfill the order, and the customer contacts support. At scale, undelivered webhooks create a backlog of unfulfilled orders.
Detection: webhook delivery success rate dropping below 99.9%. DLQ depth growing.
Recovery: exponential backoff plus a dead letter queue.

Detection
Webhook delivery success rate below 99.9%. Dead letter queue depth exceeding 10,000 events. Individual merchant webhook failure rate exceeding 10%.
Mitigation
  1. Exponential backoff: retry at 1 min, 5 min, 30 min, 2 hours, 24 hours. Five retries span ~26 hours.
  2. After 5 failed retries, move event to dead letter queue (DLQ). Notify merchant via email.
  3. Merchant dashboard allows replaying events from the DLQ. Events include an event_id for dedup on the merchant side.
  4. Circuit breaker: if a merchant's endpoint fails 10 times consecutively, pause delivery and alert the merchant. Resume on manual reset.
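The retry-then-DLQ pipeline above can be sketched as follows. `post` is a hypothetical delivery callable returning an HTTP status, and the synchronous loop stands in for a real scheduler that would sleep between attempts.

```python
RETRY_DELAYS_MIN = [1, 5, 30, 120, 1440]  # 1m, 5m, 30m, 2h, 24h (~26h total)

def deliver_with_retries(event, post, dlq):
    """Initial attempt plus five retries; exhausted events go to the DLQ.

    post is a hypothetical delivery callable returning an HTTP status;
    dlq is any list-like dead letter queue. A real implementation
    schedules the delays instead of looping synchronously.
    """
    for attempt in range(1 + len(RETRY_DELAYS_MIN)):
        if post(event) == 200:
            return "delivered"
        # a real scheduler waits RETRY_DELAYS_MIN[attempt] minutes here
        # (no wait after the final attempt)
    dlq.append(event)  # merchant notified by email; replayable from dashboard
    return "dlq"
```

Events carry an event_id, so a merchant replaying from the DLQ can dedup on their side.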
HIGH

Card network returns ambiguous response

Ambiguous responses at scale create merchant confusion and reconciliation overhead. At 0.1% ambiguity rate: 432K ambiguous transactions/day requiring manual or automated resolution.

Constraint: card networks return structured responses, but some are ambiguous. Code '05' (Do Not Honor) means temporary decline at one bank, permanent block at another. Some networks return '00' (Approved) with a partial amount.
What breaks: we mark the charge as authorized, but the settlement file shows a different status. Reconciliation catches it at T+1, but the merchant already fulfilled the order.

Detection
Reconciliation delta exceeding $0 for specific card networks. Response code distribution shifting from historical baseline. Merchant-reported discrepancies between charge status and actual settlement.
Mitigation
  1. Maintain a per-network response code mapping table. Classify each code as: approved, declined, retry-safe, or ambiguous.
  2. For ambiguous responses, mark the transaction as 'unknown' and reconcile at T+1, same as timeout handling.
  3. For partial approvals, capture only the approved amount and notify the merchant of the shortfall.
  4. Alert on-call when reconciliation delta for any network exceeds $1,000 in a single day.
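A sketch of the per-network mapping table from step 1. The entries shown are illustrative ISO 8583-style codes, not our production table, which is built from each network's specification.

```python
# Hypothetical per-network mapping; production tables come from each
# network's specification and are far larger.
RESPONSE_CODE_MAP = {
    ("visa", "00"): "approved",
    ("visa", "05"): "ambiguous",   # Do Not Honor: meaning varies by issuer
    ("visa", "51"): "declined",    # insufficient funds
    ("visa", "91"): "retry-safe",  # issuer or switch unavailable
}

def classify(network, code):
    """Unmapped codes default to 'ambiguous': the transaction is marked
    'unknown' and reconciled at T+1, the same path as a timeout."""
    return RESPONSE_CODE_MAP.get((network, code), "ambiguous")
```

Defaulting unmapped codes to 'ambiguous' is the safe choice: it never asserts success or failure the settlement file might contradict.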
MEDIUM

Database failover during in-flight transaction

WAL and synchronous replication prevent data loss for committed transactions. Idempotency keys enable safe retries for in-flight transactions. Customer impact: 1-5 seconds of elevated latency during failover.

Constraint: PostgreSQL primary handles all writes. If the primary fails during a COMMIT, the transaction may or may not be committed. The WAL (Write-Ahead Log) ensures durability for committed transactions, but an in-flight transaction (COMMIT sent, ack not received) is ambiguous.
What breaks: the client does not know if the charge was created. The standby is promoted to primary, but the in-flight transaction may not have replicated.
Detection: PostgreSQL primary health check fails. Replication lag alert fires. Unconfirmed transaction count grows.

Detection
PostgreSQL primary health check failure. Replication lag exceeding 1 second. Connection pool errors from the payment service. Elevated idempotency constraint violations (retried requests hitting the new primary).
Mitigation
  1. Synchronous replication (synchronous_commit = on with synchronous_standby_names configured): WAL writes are confirmed on both primary and standby before COMMIT returns, so every acknowledged transaction survives failover.
  2. Client retries with the same idempotency key. If the transaction was committed on the new primary, the UNIQUE constraint returns the existing record. If not, a fresh transaction is created.
  3. Connection pool automatically reconnects to the new primary via a floating VIP (virtual IP) or DNS failover.
  4. Alert on-call immediately. Verify no data loss by comparing transaction count before and after failover.
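Client-side, the recovery in steps 2-3 amounts to retrying with a single idempotency key. A minimal sketch, where `submit` is a hypothetical client call that raises ConnectionError while the standby is being promoted:

```python
def charge_with_failover(request, submit, max_attempts=3):
    """Retry one charge across a database failover with a single key.

    submit is a hypothetical client call that raises ConnectionError
    while the standby is being promoted. Because every attempt carries
    the same idempotency key, the UNIQUE constraint on the new primary
    either returns the already-committed record or accepts a fresh
    insert -- never a duplicate.
    """
    key = request["idempotency_key"]
    for _ in range(max_attempts):
        try:
            return submit(request, idempotency_key=key)
        except ConnectionError:
            continue  # pool reconnects to the new primary via VIP/DNS
    raise RuntimeError("charge still failing after failover retries")
```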
MEDIUM

PCI audit log gap

A PCI audit log gap does not affect runtime operations, but it threatens compliance certification. Loss of PCI-DSS certification means the company cannot process card payments.

Constraint: PCI-DSS requires continuous audit logging of all access to cardholder data. If the logging pipeline (application -> Kafka -> audit store) drops events, the audit trail has a gap. This can happen if Kafka partitions are unavailable, the audit store is full, or the application silently swallows logging errors.
What breaks: the next PCI audit finds a gap, potentially resulting in compliance failure and loss of card processing privileges.
Detection: audit event count for a time window is lower than expected based on transaction volume.

Detection
Audit event count diverging from transaction count by more than 0.01%. Kafka consumer lag for audit topic exceeding 10 minutes. Audit store disk utilization above 90%.
Mitigation
  1. Write audit events synchronously to a local file before sending to Kafka. If Kafka is down, the local file serves as a buffer. A catch-up process replays local files when Kafka recovers.
  2. Reconcile audit event count against transaction count daily. Any gap triggers an alert and investigation.
  3. Audit store uses append-only tables with no DELETE permissions. Even database admins cannot remove audit entries.
  4. Retain audit logs for 1 year per PCI-DSS requirement. Archive to cold storage (S3 Glacier) after 90 days.
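The local-file-first write and catch-up replay from step 1 can be sketched as follows. `send_to_kafka` is a hypothetical producer call, and the event shape is illustrative.

```python
import json
import os

def write_audit_event(event, log_path, send_to_kafka):
    """Append the event to a local file first, then attempt Kafka.

    send_to_kafka is a hypothetical producer call; if it fails, the
    event is still on disk for the catch-up replayer.
    """
    line = json.dumps(event)
    with open(log_path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # durable on disk before we talk to Kafka
    try:
        send_to_kafka(line)
    except Exception:
        pass  # buffered locally; replayed when Kafka recovers

def replay(log_path, send_to_kafka, acked):
    """Catch-up process: resend every buffered event Kafka has not acked."""
    with open(log_path) as f:
        for line in f:
            if json.loads(line)["id"] not in acked:
                send_to_kafka(line.strip())
```

The ordering matters: the fsync completes before the Kafka call, so a crash at any point leaves the event recoverable from the local buffer.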