
Bloom Filter

We have 10 billion URLs in our database. Before inserting a new long URL, we need to check whether it already has a short code to avoid duplicates.
10 bits per element at a 1% false positive rate
12 GB of memory for the filter
4x more memory for a Counting Bloom filter
The constraint: a database lookup on the long_url column takes 5 ms per query. At 29K writes per second, that is 29,000 deduplication lookups per second; at 5 ms each, that is 145 seconds of query time per wall-clock second, which would saturate our primary MySQL instance and leave no headroom for actual inserts. We chose a Bloom filter (not a database index on long_url) because a Bloom filter answers membership queries in constant time with zero disk I/O.
Using 10 bits per element, 10 billion URLs require only 12 GB of memory to achieve a 1% false positive rate. Implication: 12 GB fits in a single server's RAM, so we avoid distributing the filter across nodes.
The filter answers "definitely not in the set" or "probably in the set." When the filter says a URL does not exist, we skip the database check entirely and proceed to insert. When it says the URL probably exists (wrong 1% of the time), we fall back to a database lookup to confirm.
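The write path above can be sketched as follows. This is a minimal illustration under stated assumptions, not production code: `BloomFilter` is a toy implementation (a real 12 GB filter would use an optimized bit array), and the `db` dict is a hypothetical stand-in for the MySQL lookup and insert.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus k hash functions."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by salting a single hash function with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False -> definitely absent; True -> probably present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def insert_url(bloom: BloomFilter, db: dict, long_url: str, short_code: str) -> str:
    if bloom.might_contain(long_url):
        # "Probably present" (wrong ~1% of the time): confirm with the database.
        if long_url in db:
            return db[long_url]  # genuine duplicate: reuse the existing code
    # Definitely new, or a false positive the lookup disproved: insert directly.
    db[long_url] = short_code
    bloom.add(long_url)
    return short_code
```

At 10 bits per element the optimal hash count is about 7, which is why `num_hashes=7` is the natural choice when constructing the filter.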
That 1% false positive rate means only 1 in 100 new URLs triggers an unnecessary database check, reducing redundant write-path database traffic by roughly 85%. Trade-off: Bloom filters cannot delete entries.
If a URL is removed from the database, the filter still reports it as "probably present," causing unnecessary database lookups for that URL forever. Switching to a Counting Bloom filter solves this but uses 4x more memory (48 GB instead of 12 GB).
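A Counting Bloom filter replaces each bit with a small counter (typically 4 bits, which is where the roughly 4x memory overhead comes from), so a delete can decrement what an insert incremented. A sketch, using 8-bit counters for simplicity:

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, num_counters: int, num_hashes: int):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = bytearray(num_counters)  # one 8-bit counter per slot

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_counters

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < 255:  # saturate rather than overflow
                self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only safe for items known to be in the set; decrementing for an
        # item that was never added can corrupt other entries.
        if self.might_contain(item):
            for pos in self._positions(item):
                if self.counters[pos] > 0:
                    self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

A membership check passes only while every counter for the item is nonzero, so removing the item restores "definitely not in the set" for it.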
We accept this limitation because URL deletion is rare in our system. What if the interviewer asks: why not use a hash set instead?
A hash set for 10 billion entries would require roughly 80 GB (8 bytes per hash), 6.5x more than our 12 GB Bloom filter, and would not fit on a single node.
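The comparison is back-of-envelope arithmetic, and it is conservative: a real hash set also carries bucket and pointer overhead beyond the raw 8-byte hashes.

```python
n = 10_000_000_000              # 10 billion URLs

hash_set_gb = n * 8 / 1e9       # 8 bytes per hash -> 80 GB
bloom_gb = n * 10 / 8 / 1e9     # 10 bits per element -> ~12.5 GB
ratio = hash_set_gb / bloom_gb  # ~6.4x, roughly the 6.5x cited above
```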

Formula & tradeoffs

Formula
\text{Memory} = -\frac{n \ln p}{(\ln 2)^2} \approx 10n \text{ bits at } p = 1\%
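Plugging in n = 10 billion and p = 1% as a sanity check of the numbers quoted above:

```python
import math

n = 10_000_000_000  # 10 billion URLs
p = 0.01            # 1% false positive rate

bits_per_element = -math.log(p) / math.log(2) ** 2  # ~9.6, rounded to 10 in the text
total_gb = n * bits_per_element / 8 / 1e9           # ~12 GB
k_optimal = -math.log2(p)                           # ~6.6 -> use 7 hash functions
```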
Why it matters in interviews
Checking whether a long URL already has a short code requires a query on a non-primary-key column, which is expensive at scale. Explaining the false positive trade-off and deriving the 12 GB memory requirement shows we understand probabilistic data structures beyond the name.