Payment Processing Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

Idempotency Key: DB UNIQUE Constraint Prevents Double Charges

The client sends an Idempotency-Key header (UUID) with every charge request. PostgreSQL enforces a UNIQUE constraint on (merchant_id, idempotency_key). The check and INSERT happen in one ACID transaction: if the key exists, return the original response. Why not Redis dedup? A crash between Redis check and PostgreSQL insert would lose the idempotency record. The UNIQUE constraint makes double-insert physically impossible at the database level. Cost: ~2ms per request for B-tree index scan. At $50 avg transaction, a single prevented double charge pays for years of the 2ms overhead.

💡 UNIQUE(merchant_id, idempotency_key) in same ACID txn as INSERT. 2ms overhead. Zero double charges.

⚠ Checking idempotency in Redis and inserting in PostgreSQL separately. A crash between the two steps allows a double charge.
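
A minimal sketch of the check-and-insert pattern described above. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere; the table layout and the `open_db` and `charge` names are assumptions for illustration, not a real schema.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory table with the UNIQUE constraint (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE charges (
               merchant_id     TEXT NOT NULL,
               idempotency_key TEXT NOT NULL,
               amount_cents    INTEGER NOT NULL,
               response        TEXT NOT NULL,
               UNIQUE (merchant_id, idempotency_key)
           )"""
    )
    return conn


def charge(conn: sqlite3.Connection, merchant_id: str,
           idempotency_key: str, amount_cents: int) -> dict:
    """Insert the charge, or return the stored response if the key was seen."""
    response = json.dumps({"status": "authorized", "amount_cents": amount_cents})
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute(
                "INSERT INTO charges VALUES (?, ?, ?, ?)",
                (merchant_id, idempotency_key, amount_cents, response),
            )
        return {"replayed": False, **json.loads(response)}
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected the duplicate: replay the original.
        row = conn.execute(
            "SELECT response FROM charges "
            "WHERE merchant_id = ? AND idempotency_key = ?",
            (merchant_id, idempotency_key),
        ).fetchone()
        return {"replayed": True, **json.loads(row[0])}
```

Calling `charge` twice with the same (merchant_id, idempotency_key) leaves exactly one row in the table; the second call returns the first call's stored response.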

Payment State Machine: 5 States, No Backward Transitions

Every payment is in exactly one of 5 states: pending, authorized, captured, settled, or failed. The happy path is pending -> authorized -> captured -> settled; failed is a terminal side exit, not a step after settled. No backward transitions allowed. Pending can move to authorized or failed. Authorized can move to captured or failed (void). Captured can move to settled. Each transition is recorded as an event in the ledger. A simple status field would allow illegal transitions (settled back to pending), creating accounting errors. The state machine enforces valid transitions in code. A background job scans for authorized transactions older than 6 days and warns merchants before the 7-day capture window expires.

💡 5 states, forward-only transitions, each change logged in ledger. 7-day capture window.

⚠ Using a single mutable status field instead of a state machine. A developer could set 'settled' back to 'pending', breaking the ledger.
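
One way to enforce the transitions listed above is a lookup table of allowed next states. This is a sketch, not a production design: the `Payment` class, the `IllegalTransition` exception, and the in-memory `events` list (standing in for ledger events) are assumptions.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"


# Allowed next states, taken from the transition rules in the text.
ALLOWED = {
    State.PENDING: {State.AUTHORIZED, State.FAILED},
    State.AUTHORIZED: {State.CAPTURED, State.FAILED},  # FAILED here = void
    State.CAPTURED: {State.SETTLED},
    State.SETTLED: set(),  # terminal
    State.FAILED: set(),   # terminal
}


class IllegalTransition(Exception):
    pass


class Payment:
    def __init__(self, payment_id: str) -> None:
        self.payment_id = payment_id
        self.state = State.PENDING
        self.events: list[tuple[State, State]] = []  # stand-in for ledger events

    def transition(self, new_state: State) -> None:
        """Move to new_state or raise if the table does not allow it."""
        if new_state not in ALLOWED[self.state]:
            raise IllegalTransition(f"{self.state.value} -> {new_state.value}")
        self.events.append((self.state, new_state))
        self.state = new_state
```

Because the only way to change `state` is through `transition`, an attempt to move settled back to pending raises instead of silently corrupting the record.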

Two-Phase Capture: Authorize First, Capture Within 7 Days

Authorization places a hold on the cardholder's credit without moving money. The merchant captures within 7 days (card network rule). Hotels authorize at check-in, capture at checkout. E-commerce authorizes at order, captures at shipment. Capture amount can be less than or equal to the authorized amount (partial capture) but never more. If the merchant does not capture in time, the authorization expires and the hold is released. Single-step charges would require refunding the difference, which takes 5-10 business days.

💡 Authorize = hold, Capture = move money. 7-day window. Partial capture allowed.

⚠ Always charging immediately in one step. Hotels, car rentals, and restaurants need to adjust the final amount after authorization.
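
The two capture rules above (amount at most the authorized amount, inside the 7-day window) can be checked in a few lines. The function name and the use of integer cents are assumptions for illustration.

```python
from datetime import datetime, timedelta

CAPTURE_WINDOW = timedelta(days=7)  # window stated in the text


def validate_capture(authorized_cents: int, capture_cents: int,
                     authorized_at: datetime, now: datetime) -> None:
    """Raise ValueError if the capture breaks the window or amount rule."""
    if now - authorized_at > CAPTURE_WINDOW:
        raise ValueError("authorization expired; hold has been released")
    if capture_cents <= 0:
        raise ValueError("capture amount must be positive")
    if capture_cents > authorized_cents:
        raise ValueError("capture cannot exceed the authorized amount")
```

A hotel that authorizes $100.00 at check-in and captures $80.00 three days later passes; capturing $120.00, or capturing on day 8, raises.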

Tokenization: Raw Card Numbers Never Touch App Servers

Raw PANs (Primary Account Numbers) are replaced with random tokens (tok_abc123) at the edge before reaching application servers. A dedicated token vault with its own network segment, AES-256 encryption at rest, and HSM-backed key management is the only system that sees plaintext card numbers. This reduces PCI-DSS audit scope from hundreds of servers to one isolated vault. The vault round-trip adds ~5ms, negligible compared to card network latency (200-500ms). Application-layer encryption still exposes PANs in app server memory during encrypt/decrypt, which PCI-DSS considers in-scope.

💡 Tokenize at edge. Vault is the only PAN handler. Audit scope: 1 vault, not 100 servers.

⚠ Encrypting card numbers in the application layer. The app server still handles raw PANs in memory during encryption, which PCI-DSS considers in-scope.
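
A toy illustration of the token-for-PAN swap. It shows only the mapping idea: the `TokenVault` class is hypothetical, keeps everything in memory, and has none of the encryption at rest, HSM key management, or network isolation that the real vault described above needs.

```python
import secrets


class TokenVault:
    """In-memory stand-in for the vault. NOT secure; illustration only."""

    def __init__(self) -> None:
        self._pan_by_token: dict[str, str] = {}

    def tokenize(self, pan: str) -> str:
        # The token is random, so it reveals nothing about the PAN.
        token = "tok_" + secrets.token_hex(16)
        self._pan_by_token[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        # Only the vault can map a token back to the card number.
        return self._pan_by_token[token]
```

Application servers would store and pass around only the `tok_...` value; `detokenize` is called inside the vault's network segment when the card network request is built.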

Double-Entry Ledger: Every Cent Accounted Twice

Every transaction creates two ledger entries: a debit and a credit. Customer pays $50: debit customer_funds $50, credit merchant_revenue $50. Refunds create two new entries (not reversals): debit merchant_revenue, credit customer_funds. The ledger is append-only: no UPDATE, no DELETE. Sum of all debits must equal sum of all credits at all times. At 432M txns/day with 2 entries each: 864M ledger rows/day at ~200B each = 173 GB/day. Trade-off: double write volume, but the ability to reconcile to the penny is non-negotiable.

💡 Debit + credit for every txn. Append-only. 864M rows/day. Debits = Credits always.

⚠ Using single-entry accounting (one row per transaction). When discrepancies arise, there is no way to trace where money went without the matching debit/credit pair.
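
A small sketch of an append-only double-entry ledger, plus the storage arithmetic from the text. The `Entry` and `Ledger` names and the integer-cents representation are assumptions; a real ledger would live in a database table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an entry cannot be modified once written
class Entry:
    txn_id: str
    account: str
    debit_cents: int
    credit_cents: int


class Ledger:
    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def record(self, txn_id: str, debit_account: str,
               credit_account: str, amount_cents: int) -> None:
        """Append one debit and one matching credit. No update or delete API."""
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        self._entries.append(Entry(txn_id, debit_account, amount_cents, 0))
        self._entries.append(Entry(txn_id, credit_account, 0, amount_cents))

    def entries(self) -> tuple[Entry, ...]:
        return tuple(self._entries)

    def is_balanced(self) -> bool:
        debits = sum(e.debit_cents for e in self._entries)
        credits = sum(e.credit_cents for e in self._entries)
        return debits == credits


# Storage estimate from the text: 2 rows per transaction at ~200 bytes each.
ROWS_PER_DAY = 432_000_000 * 2        # 864M rows/day
BYTES_PER_DAY = ROWS_PER_DAY * 200    # 172.8e9 bytes, about 173 GB/day
```

A $50 payment is `record("txn_1", "customer_funds", "merchant_revenue", 5000)`; its refund is a new pair with the accounts swapped, so the ledger grows to four rows and still balances.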

Network Timeout: 2-Second Cap with Automatic Reversal

Card network calls have a 2-second timeout (card network SLA). On timeout, we do not know if the card was charged. Retrying would risk a double charge. Our recovery: mark the transaction as unknown and run a reconciliation job at T+1 against the card network's settlement file. If the charge succeeded, update to authorized. If not, the hold auto-expires within 7 days. The customer may see a pending hold for up to 24 hours on a failed charge. The safe action on timeout is always do nothing and reconcile later.

💡 2s timeout. Never retry on timeout. Mark unknown, reconcile T+1. Holds auto-expire.

⚠ Retrying a timed-out charge request. The original may have succeeded at the issuing bank, and retrying creates a double charge.
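
The "do nothing and reconcile later" rule can be sketched as two functions. `call_network` is a hypothetical callable that raises `TimeoutError` when the 2-second cap is hit, and marking a charge that never appears in the settlement file as "failed" is an assumption; the text only says the hold auto-expires.

```python
from typing import Callable


def authorize(call_network: Callable[[float], bool],
              timeout_s: float = 2.0) -> str:
    """Call the card network once. On timeout, record 'unknown' and stop."""
    try:
        approved = call_network(timeout_s)
    except TimeoutError:
        return "unknown"  # no retry: the original may have succeeded
    return "authorized" if approved else "failed"


def resolve_unknown(statuses: dict[str, str],
                    network_approved_ids: set[str]) -> dict[str, str]:
    """T+1 job: settle every 'unknown' using the network's settlement file."""
    resolved = dict(statuses)
    for txn_id, status in statuses.items():
        if status != "unknown":
            continue
        if txn_id in network_approved_ids:
            resolved[txn_id] = "authorized"
        else:
            resolved[txn_id] = "failed"  # assumption: hold expires on its own
    return resolved
```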

Webhook Retry: Exponential Backoff + Dead Letter Queue

We deliver 1.3B webhooks/day (15K/sec) to merchant endpoints. Merchant servers are unreliable. On failure: retry with exponential backoff at 1 min, 5 min, 30 min, 2 hours, 24 hours. After 5 retries (~26 hours), move to a dead letter queue (DLQ). Merchants replay from the DLQ via dashboard. Each webhook includes an event_id for merchant-side dedup and an HMAC-SHA256 signature for authenticity verification. Why at-least-once (not exactly-once)? Exactly-once across an unreliable network requires the merchant to implement idempotency anyway.

💡 Exponential backoff: 1m, 5m, 30m, 2h, 24h. DLQ after 5 retries. HMAC signature.

⚠ Delivering webhooks synchronously in the charge response path. If the merchant endpoint is slow (5s), the customer waits 5s for their charge to complete.
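
The retry schedule and signature above can be sketched with the standard library. The function names and the hex encoding of the signature are assumptions; the delays come straight from the text and sum to 95,760 seconds, about 26.6 hours.

```python
import hashlib
import hmac
from typing import Optional

# Delays before retries 1..5: 1 min, 5 min, 30 min, 2 hours, 24 hours.
RETRY_DELAYS_S = [60, 300, 1800, 7200, 86400]


def sign(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 signature of the webhook body, hex encoded."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Merchant-side check; compare_digest avoids timing leaks."""
    return hmac.compare_digest(sign(secret, body), signature)


def next_action(retries_done: int) -> tuple[str, Optional[int]]:
    """After a failed delivery, decide between another retry and the DLQ."""
    if retries_done < len(RETRY_DELAYS_S):
        return ("retry", RETRY_DELAYS_S[retries_done])
    return ("dead_letter", None)
```

On the merchant side, `verify` runs before any processing, and the event_id is then checked against already-handled events, since at-least-once delivery means the same event can arrive twice.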

PCI-DSS: Encrypt at Rest + In Transit + Audit Log

PCI-DSS Level 1 requires: (1) Encryption at rest: AES-256 for all stored card data. Only the token vault stores card data. (2) Encryption in transit: TLS 1.2+ for all network communication, including internal services. (3) Audit log: every access to cardholder data is logged with timestamp, user, action, and result. Logs are immutable and retained for 1 year. (4) Network segmentation: the token vault lives in its own VPC with no direct internet access. Only the payment service can reach it via a private link. Yearly audit cost: $50K-$500K depending on scope. Tokenization reduces scope to 1 vault.

💡 AES-256 at rest, TLS 1.2+ in transit, immutable audit logs, vault in isolated VPC.

⚠ Storing encrypted card numbers in the main application database. Even encrypted, PCI-DSS considers the database in-scope because the app server handles decryption keys.

Reconciliation: Batch Compare Gateway vs Network at T+1

Each day at T+1, the card network sends a settlement file listing every transaction with amounts and statuses. Our reconciliation job compares this file against our ledger_entries table line by line. Any mismatch triggers an alert for manual review. The job processes 432M × 900B ≈ 389 GB of data per run, partitioned by merchant_id across 50 workers, completing in under 2 hours. This catches double charges, missing captures, and timeout ambiguities. Settlement (actual money movement) takes T+2 to T+3 business days.

💡 T+1 settlement file. Line-by-line comparison. 389 GB/run. 50 parallel workers. T+2/T+3 money movement.

⚠ Trusting your own ledger without comparing it against the card network's records. Two independent systems must agree on every transaction.
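
A sketch of the comparison step, assuming both sides have already been loaded into dictionaries keyed by transaction id with (amount_cents, status) values. The record shape and the mismatch labels are assumptions; the real job would stream files partitioned by merchant_id.

```python
Record = tuple[int, str]  # (amount_cents, status), a hypothetical shape


def reconcile(ledger: dict[str, Record],
              settlement: dict[str, Record]) -> list[tuple[str, str]]:
    """Return (txn_id, reason) for every transaction the two sides disagree on."""
    mismatches: list[tuple[str, str]] = []
    for txn_id in sorted(ledger.keys() | settlement.keys()):
        ours = ledger.get(txn_id)
        theirs = settlement.get(txn_id)
        if ours is None:
            mismatches.append((txn_id, "missing_in_ledger"))
        elif theirs is None:
            mismatches.append((txn_id, "missing_in_settlement"))
        elif ours != theirs:
            mismatches.append((txn_id, "amount_or_status_mismatch"))
    return mismatches


# Data volume from the text: 432M transactions at ~900 bytes each.
BYTES_PER_RUN = 432_000_000 * 900  # 388.8e9 bytes, about 389 GB
```

Each of the 50 workers can run `reconcile` on its own merchant_id partition independently, since a transaction only ever needs to be compared with its own counterpart.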

Sharding by merchant_id: Co-locates All Merchant Data

We shard by merchant_id across 8 PostgreSQL instances. This co-locates a merchant's transactions, ledger entries, and merchant record on the same shard, so the idempotency check + INSERT + ledger write happen in one local ACID transaction without distributed coordination. Why not shard by transaction_id? Because the idempotency check requires looking up (merchant_id, idempotency_key), which would scatter across shards. Merchant-based sharding also enables per-merchant reconciliation in parallel. Trade-off: hot merchants (high-volume) create shard skew. We mitigate with consistent hashing and merchant migration tooling.

💡 Shard by merchant_id. ACID txn stays on one shard. 8 shards for 10K TPS.

⚠ Sharding by transaction_id. The idempotency check requires a lookup by (merchant_id, idempotency_key), which would require a scatter-gather across all shards.
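
A minimal consistent-hash ring for routing a merchant_id to one of the 8 shards. The shard names, the SHA-256 based hash, and the 64 virtual nodes per shard are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, shards: list[str], vnodes: int = 64) -> None:
        # Each shard owns many points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, merchant_id: str) -> str:
        """First ring point clockwise from the merchant's hash owns it."""
        idx = bisect.bisect_right(self._points, _hash(merchant_id))
        return self._ring[idx % len(self._ring)][1]


SHARDS = [f"pg-{n}" for n in range(8)]  # hypothetical instance names
ring = HashRing(SHARDS)
```

Every query for a given merchant, including the (merchant_id, idempotency_key) lookup, resolves to the same shard via `ring.shard_for(merchant_id)`. When a ninth shard is added, only the merchants whose ring segment the new shard takes over need to migrate; everyone else stays put.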