Payment Processing Anti-Patterns

Common design mistakes candidates make. Learn what goes wrong and how to avoid each trap in your interview.

Storing Raw Card Numbers Instead of Tokens

Very Common

Storing raw PANs in the application database puts every server in PCI-DSS audit scope. Tokenization reduces scope to one isolated vault.

Why: It is the simplest approach: store the card number in a column, encrypt it with AES-256, and call it secure. At prototype scale, this works. But PCI-DSS Level 1 treats any server that handles raw PANs (even encrypted) as in scope. The audit scope expands from 1 vault to hundreds of application servers, databases, and load balancers. Annual audit cost jumps from $50K to $500K+. A single breach exposes millions of card numbers.

WRONG: Store encrypted card numbers in the main application database. Every application server handles decryption keys. PCI-DSS audit scope includes all servers, databases, and network paths. Annual audit cost: $500K+. A breach exposes the encryption key and all card numbers at once.

RIGHT: Use tokenization: replace raw PANs with random tokens at the edge. A dedicated token vault in an isolated VPC is the only system that sees plaintext card numbers. Application servers only see tokens like tok_abc123. PCI-DSS audit scope shrinks to one vault. The vault uses HSM-backed key management and AES-256 at rest. Round-trip adds ~5ms, negligible compared to card network latency (200-500ms). Trade-off accepted: dependency on the vault for every charge, mitigated by vault replication and caching.
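The token-for-PAN swap described above can be sketched in a few lines. This is a minimal illustration, not a production vault: the TokenVault class and its method names are hypothetical, the mapping lives in a plain dict, and the HSM-backed encryption and network isolation are deliberately left out.

```python
import secrets


class TokenVault:
    """Toy in-memory stand-in for an isolated token vault (hypothetical)."""

    def __init__(self):
        # A real vault would encrypt PANs at rest with HSM-backed keys;
        # this dict only illustrates the token-to-PAN mapping.
        self._pan_by_token = {}

    def tokenize(self, pan: str) -> str:
        # The token is random, so it carries no information about the PAN.
        token = "tok_" + secrets.token_hex(12)
        self._pan_by_token[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        # Only the vault can map a token back to the card number.
        return self._pan_by_token[token]


vault = TokenVault()
token = vault.tokenize("4111111111111111")  # widely used test card number
```

Application code would store and pass around only `token`; the call to `detokenize` would happen inside the vault's boundary when a charge is sent to the card network.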

No Idempotency Key on Payment Endpoints

Very Common

Without idempotency keys, a client retry after a timeout or network error double-charges the customer. A PostgreSQL UNIQUE constraint makes double-insert impossible.

Why: In the happy path, every request succeeds on the first try. Developers test the happy path and ship. But networks are unreliable. A customer clicks 'Pay', the charge succeeds, but the response is lost. The client retries. Without idempotency, the server processes a brand-new charge. At a 0.01% retry rate and 432M daily transactions, that is 43,200 double charges per day. At a $50 average, that is $2.16M in erroneous charges daily.
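The double-charge estimate is simple enough to check by hand, and it is worth being able to reproduce in an interview. A quick sketch using the figures from the paragraph above (the variable names are just for illustration):

```python
daily_transactions = 432_000_000
retries_per_million = 100  # 0.01% expressed per million requests
average_charge_usd = 50

# Integer arithmetic avoids floating-point rounding in the estimate.
double_charges_per_day = daily_transactions * retries_per_million // 1_000_000
erroneous_usd_per_day = double_charges_per_day * average_charge_usd

print(double_charges_per_day)  # 43200
print(erroneous_usd_per_day)   # 2160000
```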

WRONG: No idempotency protection. Every incoming charge request creates a new transaction. At a 0.01% retry rate: 43,200 double charges/day, or $2.16M/day in erroneous charges, plus the cost of refunds, support tickets, and lost customer trust.

RIGHT: Require an Idempotency-Key header on every charge request. PostgreSQL UNIQUE constraint on (merchant_id, idempotency_key) enforced in the same ACID transaction as the charge INSERT. If the key exists, return the original response. We chose PostgreSQL UNIQUE over Redis dedup because the check and insert must be atomic. A crash between a Redis check and PostgreSQL insert would lose the idempotency record. Trade-off accepted: 2ms of B-tree index overhead per request.
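A minimal sketch of the UNIQUE-constraint approach follows. To keep it runnable without a database server it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table layout, column names, and the create_charge helper are assumptions for illustration. The point it shows is the same: the database rejects the second insert, and the handler returns the original charge instead of creating a new one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute(
    """
    CREATE TABLE charges (
        id INTEGER PRIMARY KEY,
        merchant_id TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        amount_cents INTEGER NOT NULL,
        UNIQUE (merchant_id, idempotency_key)
    )
    """
)


def create_charge(merchant_id: str, idempotency_key: str, amount_cents: int) -> int:
    """Insert a charge once per (merchant_id, idempotency_key); return its id."""
    try:
        with conn:  # commits on success, rolls back on exception
            cur = conn.execute(
                "INSERT INTO charges (merchant_id, idempotency_key, amount_cents) "
                "VALUES (?, ?, ?)",
                (merchant_id, idempotency_key, amount_cents),
            )
            return cur.lastrowid
    except sqlite3.IntegrityError:
        # The key already exists, so this is a retry: return the original charge.
        row = conn.execute(
            "SELECT id FROM charges WHERE merchant_id = ? AND idempotency_key = ?",
            (merchant_id, idempotency_key),
        ).fetchone()
        return row[0]


first = create_charge("m_1", "key-123", 5000)
retry = create_charge("m_1", "key-123", 5000)  # same key: no second row
```

A fuller version would also store the original response body next to the key so the retry can replay it exactly.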

Synchronous Webhook Delivery Blocking Transaction Response

Very Common

Delivering webhooks in the charge response path adds merchant server latency (1-10s) to every transaction. Async delivery via Kafka keeps charge latency under 2 seconds.

Why: It seems reliable: charge the card, notify the merchant, then return the response. The merchant gets notified before the customer sees success. But merchant servers are unreliable. If the merchant endpoint takes 5 seconds to respond (or times out), the customer waits 5+ seconds for their charge to complete. At 10K TPS, a single slow merchant can back up the charge pipeline. Even worse, if the webhook fails, should the charge fail too? That couples charge success to webhook delivery.

WRONG: In the charge path: charge card -> deliver webhook synchronously -> return response to customer. Merchant endpoint latency (1-10 seconds) adds directly to charge latency. If webhook delivery fails, the charge path must decide: fail the charge (lost revenue) or succeed without notification (merchant confusion).

RIGHT: Charge card and return response to customer synchronously (under 2s). Publish state-change event to Kafka. Webhook service consumes events and delivers asynchronously with exponential backoff. The charge response and webhook delivery are completely decoupled. We chose Kafka over an in-memory queue because webhook retries can span 26 hours and Kafka provides durable retention. Trade-off accepted: the merchant is notified seconds to minutes after the charge, not during it.
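The decoupling can be sketched as two separate functions that share only a queue. This is an illustration under stated assumptions: a queue.Queue stands in for the Kafka topic (so it has none of Kafka's durability), and the function names, event shape, and backoff parameters are hypothetical rather than taken from any real schedule.

```python
import queue

events = queue.Queue()  # in-memory stand-in for a durable Kafka topic


def backoff_delays(base_s, factor, cap_s, attempts):
    """Seconds to wait before each retry: base_s * factor**n, capped at cap_s."""
    return [min(base_s * factor**n, cap_s) for n in range(attempts)]


def charge(amount_cents):
    """Charge path: publish the event and return without calling the merchant."""
    # The card network call would happen here.
    events.put({"type": "charge.captured", "amount_cents": amount_cents})
    return {"status": "captured", "amount_cents": amount_cents}


def deliver(event, send, delays):
    """Webhook worker: try once, then once more per backoff delay."""
    for attempt in range(1, len(delays) + 2):
        if send(event):
            return attempt  # number of attempts it took
        # Between attempts, a real worker would wait the next delay or reschedule.
    return None  # retries exhausted; a real worker would park the event


# Usage: the charge returns immediately; delivery happens later and retries.
response = charge(5000)
outcomes = iter([False, False, True])  # merchant endpoint fails twice, then succeeds
attempts_used = deliver(events.get(), lambda event: next(outcomes), backoff_delays(1, 2, 60, 8))
```

Note that `charge` never calls `send`, so a slow or failing merchant endpoint cannot add latency to the customer's response.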

Using Eventual Consistency for Financial Transactions

Very Common

Eventual consistency means two reads might return different balances. For financial transactions, ACID guarantees via PostgreSQL are non-negotiable.

Why: NoSQL databases (Cassandra, DynamoDB) offer high write throughput and horizontal scalability. Candidates default to these for 'large scale' systems. But eventual consistency means: a charge is written to node A, a balance check on node B does not see it yet, and another charge is approved against the stale balance. The result: overdraft. For non-financial data (user profiles, analytics), eventual consistency is fine. For money movement, it creates accounting errors that require manual reconciliation.

WRONG: Store transactions in Cassandra or DynamoDB for write throughput. At 10K TPS with eventual consistency, two concurrent charges against the same card can both be approved against a stale balance. The ledger shows charges that should not exist. Reconciliation finds phantom debits.

RIGHT: Use PostgreSQL with serializable isolation for the charge path. The idempotency check, transaction INSERT, and ledger entries are in one ACID transaction. 10K TPS is well within PostgreSQL's capability with 8 shards. We chose PostgreSQL over Cassandra because financial transactions require strong consistency, UNIQUE constraints, and multi-row ACID transactions. Cassandra offers tunable consistency and lightweight transactions, but not general multi-row ACID transactions or UNIQUE constraints on non-key columns. Trade-off accepted: PostgreSQL's single-node write throughput is lower than Cassandra's, requiring sharding. But financial correctness is non-negotiable.
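The "one ACID transaction" property is easiest to see when something fails halfway. The sketch below again uses the built-in sqlite3 module as a stand-in for PostgreSQL (it does not demonstrate serializable isolation or sharding); the schema and the record_charge helper are assumptions. It shows that when a ledger insert is rejected, the transaction row written a moment earlier is rolled back with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute(
    "CREATE TABLE transactions (id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE ledger_entries ("
    "txn_id TEXT NOT NULL, account TEXT NOT NULL, "
    "amount_cents INTEGER NOT NULL CHECK (amount_cents > 0))"
)


def record_charge(txn_id, amount_cents):
    """Write the transaction row and both ledger rows, or nothing at all."""
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute(
                "INSERT INTO transactions VALUES (?, ?)", (txn_id, amount_cents)
            )
            conn.execute(
                "INSERT INTO ledger_entries VALUES (?, 'customer_funds', ?)",
                (txn_id, amount_cents),
            )
            conn.execute(
                "INSERT INTO ledger_entries VALUES (?, 'merchant_revenue', ?)",
                (txn_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = record_charge("txn_1", 5000)
rejected = record_charge("txn_2", 0)  # ledger CHECK fails, so txn_2 is rolled back
```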

Single Status Field Instead of State Machine

Common

A mutable status column allows illegal transitions (settled back to pending). A state machine enforces forward-only transitions with every change logged in the ledger.

Why: A status VARCHAR column is simple: UPDATE transactions SET status = 'captured' WHERE id = ?. Any code anywhere can set any status. In early development, this is fast. But without transition rules, a bug can set a settled transaction back to pending. Worse, the UPDATE overwrites the previous status, destroying the history. When did this transaction move from authorized to captured? The current row does not tell you.

WRONG: A single status VARCHAR column with no transition rules. Any service can UPDATE status to any value. A bug sets settled back to pending. The update overwrites history: when did the transaction move from authorized to captured? Unknown. No audit trail.

RIGHT: A 5-state machine (pending, authorized, captured, settled, failed) with forward-only transitions enforced in the payment service. Each transition is logged as a new row in the append-only ledger_entries table. The status column still exists for fast queries, but transitions are validated before writing. We chose append-only ledger events (not UPDATE) because the ledger must be an immutable audit trail for regulatory compliance. Trade-off accepted: more storage (2 ledger rows per transition), but complete history for auditing and reconciliation.
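A forward-only state machine can be as small as a lookup table plus one guarded method. In this sketch the five states come from the text, but the specific edges between them are an assumption, the Transaction class is hypothetical, and the ledger is simplified to one in-memory log row per transition.

```python
# Forward-only transitions. The five states come from the text; the exact
# edges between them are an assumption for illustration.
ALLOWED_TRANSITIONS = {
    "pending": {"authorized", "failed"},
    "authorized": {"captured", "failed"},
    "captured": {"settled"},
    "settled": set(),
    "failed": set(),
}


class Transaction:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.status = "pending"   # kept for fast queries
        self.ledger_entries = []  # append-only transition log

    def transition(self, new_status):
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition: {self.status} -> {new_status}")
        # Log the change first, then update the convenience column.
        self.ledger_entries.append((self.txn_id, self.status, new_status))
        self.status = new_status


txn = Transaction("txn_1")
for step in ("authorized", "captured", "settled"):
    txn.transition(step)
```

After these calls, `txn.transition("pending")` raises ValueError, and `txn.ledger_entries` still holds every step the transaction went through.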

No Timeout on Card Network Calls

Common

Without a timeout, a hung card network connection blocks the charge thread indefinitely. A 2-second timeout with reconciliation resolves ambiguous responses safely.

Why: Many HTTP clients default to a 30-second timeout or none at all. A developer calls the card network API and trusts it to respond. Usually it does (200-500ms). But card networks experience degradation: response times climb to 5s, 10s, 30s. Without a timeout, the thread blocks. At 10K TPS, threads exhaust within seconds. The entire payment service becomes unresponsive. Meanwhile, 10K customers per second see spinning wheels.

WRONG: Call the card network API with no explicit timeout or a 30-second default. During network degradation, threads block for 30 seconds each. Thread pool exhausted in seconds. All charges fail because no threads are available to process them. Cascading failure across the payment service.

RIGHT: Set a 2-second timeout on card network calls (matching the network's own SLA). On timeout, mark the transaction as unknown (not failed, not succeeded). A reconciliation job at T+1 compares against the settlement file. Never retry a timed-out charge because the card may already be charged. We chose 2 seconds (not 5 or 10) because card networks themselves time out at 2 seconds. Waiting longer than the network's SLA wastes threads. Trade-off accepted: some successful charges are initially marked unknown, creating up to 24 hours of pending holds visible to customers.
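The key detail above is that a timeout maps to a third outcome, not to failure. A minimal sketch, assuming a hypothetical call_network client that accepts a timeout argument and raises Python's built-in TimeoutError when it expires (real HTTP libraries raise their own timeout exception types):

```python
NETWORK_TIMEOUT_S = 2.0  # matches the 2-second figure in the text


def authorize(call_network, amount_cents):
    """Return 'authorized', 'failed', or 'unknown' for a single attempt."""
    try:
        approved = call_network(amount_cents, timeout=NETWORK_TIMEOUT_S)
    except TimeoutError:
        # The card may or may not have been charged: do not retry,
        # leave the outcome for the reconciliation job to resolve.
        return "unknown"
    return "authorized" if approved else "failed"


def hung_network(amount_cents, timeout):
    # Simulates a card network that never answers within the timeout.
    raise TimeoutError


result = authorize(hung_network, 5000)
```

Here `result` is "unknown", which is what lets the T+1 reconciliation decide the final state instead of a blind retry.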

Mutable Transaction Records (Updating Instead of Appending)

Common

Updating transaction records destroys history and breaks the audit trail. Append-only ledger entries create an immutable record of every financial event.

Why: UPDATE is the default SQL operation. To refund a $50 charge, a developer writes: UPDATE transactions SET amount = 0, status = 'refunded' WHERE id = ?. The original $50 is gone. The audit trail is broken. When regulators ask 'show me the charge and refund as separate events', the single row cannot distinguish between them. Financial systems must reconstruct any historical state on demand.

WRONG: UPDATE transaction rows to reflect refunds, partial captures, and status changes. Original values are overwritten. No way to reconstruct historical states. Audit trail is broken. Regulators cannot verify that the refund amount matches the original charge. Reconciliation against card network records fails because timestamps of intermediate states are lost.

RIGHT: All financial events are append-only rows in the ledger_entries table. A refund creates two new entries (debit merchant_revenue, credit customer_funds), not an update to the original. The transaction status column is updated for query convenience, but the ledger is the source of truth. We chose append-only because regulatory compliance (PCI-DSS, SOX) requires immutable audit trails where any historical state is reconstructable. Trade-off accepted: 864M ledger rows/day at 173 GB/day, but complete financial history.
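An append-only ledger is sketched below with a plain list standing in for the ledger_entries table. The refund entries follow the text (debit merchant_revenue, credit customer_funds); the charge entries are assumed to be the mirror image, and the helper names are hypothetical.

```python
from collections import namedtuple

LedgerEntry = namedtuple("LedgerEntry", "txn_id account direction amount_cents")

ledger = []  # append-only: entries are added, never updated or deleted


def record_charge(txn_id, amount_cents):
    # Assumed mirror image of the refund entries described in the text.
    ledger.append(LedgerEntry(txn_id, "customer_funds", "debit", amount_cents))
    ledger.append(LedgerEntry(txn_id, "merchant_revenue", "credit", amount_cents))


def record_refund(txn_id, amount_cents):
    # Two new rows; the original charge rows are left untouched.
    ledger.append(LedgerEntry(txn_id, "merchant_revenue", "debit", amount_cents))
    ledger.append(LedgerEntry(txn_id, "customer_funds", "credit", amount_cents))


def balance(account):
    """Credits minus debits for one account, replayed from the full history."""
    return sum(
        e.amount_cents if e.direction == "credit" else -e.amount_cents
        for e in ledger
        if e.account == account
    )


record_charge("txn_1", 5000)
record_refund("txn_1", 5000)
```

The ledger now holds four rows: the charge and the refund remain visible as separate events, and the balances net to zero.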

Trusting Client-Reported Transaction Status

Common

Accepting transaction status from the client allows fraud. The server must be the sole authority on payment state, derived from card network responses and the state machine.

Why: In a rush to ship, a developer exposes a PATCH /charges/{id} endpoint that accepts a status field from the merchant. The merchant reports 'captured' without actually capturing with the card network. Or a compromised client sends 'settled' to trigger a payout without money moving. The root cause is treating the client as a trusted party in the financial state machine.

WRONG: Accept client-reported status updates via API. A malicious or buggy merchant sends status: 'captured' without an actual card network capture. The ledger records a capture that never happened. Reconciliation at T+1 finds a mismatch: our records say captured, the card network says authorized. Manual investigation required for every occurrence.

RIGHT: The server is the sole authority on transaction state. Status transitions happen only in response to card network responses or internal state machine rules. The merchant API exposes actions (capture, refund), not status updates. The payment service executes the action with the card network, receives the response, and transitions the state machine accordingly. We chose action-based APIs (not status-update APIs) because the card network response is the source of truth, not the merchant's claim. Trade-off accepted: merchants cannot override payment state, which occasionally requires support intervention for edge cases.
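The difference between an action API and a status-update API can be sketched as a service with no way to write a status directly. Everything here is hypothetical: the PaymentService class, the network_capture callable standing in for the card network client, and the register_authorized helper that shortcuts a prior authorization to keep the example small.

```python
class PaymentService:
    """Exposes actions; status is only ever set from the network's answer."""

    def __init__(self, network_capture):
        self._network_capture = network_capture  # hypothetical card network client
        self._status = {}

    def register_authorized(self, txn_id):
        # Stand-in for a completed authorization, to keep the sketch short.
        self._status[txn_id] = "authorized"

    def capture(self, txn_id):
        """Merchant-facing action: ask the network, then record what it said."""
        if self._status.get(txn_id) != "authorized":
            raise ValueError("only authorized transactions can be captured")
        if self._network_capture(txn_id):
            self._status[txn_id] = "captured"
        return self._status[txn_id]


# Fake network that approves one transaction and declines the other.
service = PaymentService(network_capture=lambda txn_id: txn_id == "txn_ok")
service.register_authorized("txn_ok")
service.register_authorized("txn_declined")
captured = service.capture("txn_ok")
declined = service.capture("txn_declined")
```

The merchant can request a capture, but `declined` stays "authorized" because the network said no; there is no call that lets the merchant claim otherwise.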