
Checkpointing

The constraint: a full web crawl takes 4 weeks to complete. If a crawler node crashes on day 12 without any saved state, we lose 12 days of progress and must re-crawl billions of already-visited pages. That wasted compute alone costs thousands of dollars in bandwidth and server time.
We solve this with checkpointing: a periodic snapshot of the URL frontier state plus the visited URL set, written to durable storage every 10 minutes. On crash recovery, the node loads the last checkpoint and resumes from where it left off, losing at most 10 minutes of work instead of weeks.
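The snapshot-and-recover loop can be sketched as below. This is a minimal illustration, not the walkthrough's actual implementation: the file path, state shape, and use of `pickle` are all assumptions. The one detail worth copying is the write-to-temp-then-rename pattern, which keeps the previous checkpoint intact if the node crashes mid-write.

```python
import os
import pickle
import tempfile

CHECKPOINT_PATH = "frontier.ckpt"  # hypothetical durable-storage location

def write_checkpoint(state: dict, path: str = CHECKPOINT_PATH) -> None:
    """Snapshot crawler state atomically: write to a temp file, fsync,
    then rename, so a crash mid-write never corrupts the last checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # force the bytes to durable storage
    os.replace(tmp, path)      # atomic rename: old checkpoint stays valid until now

def recover(path: str = CHECKPOINT_PATH) -> dict:
    """On restart, resume from the last completed checkpoint, or start fresh."""
    if not os.path.exists(path):
        return {"frontier": [], "visited_bits": b"", "positions": {}}
    with open(path, "rb") as f:
        return pickle.load(f)
```

A scheduler would call `write_checkpoint` every 10 minutes; on boot, the node calls `recover` and replays from the restored frontier position, losing at most one interval of work.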
The checkpoint includes the frontier queues (both front and back), the Bloom filter's bit array, and the current crawl position within each domain's queue. We chose periodic snapshotting every 10 minutes (not a Write-Ahead Log for every frontier mutation) because at 6,200 pages per second, logging every URL insert and dequeue (two durable writes per fetched page) generates roughly 12,000 write operations per second to durable storage.
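The checkpoint's contents and the WAL arithmetic above can be made concrete with a short sketch. The dataclass field names here are illustrative, not the design's actual schema; the constants come from the numbers in the text.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class CrawlCheckpoint:
    """State snapshotted every 10 minutes (field names are illustrative)."""
    front_queues: list[deque]          # priority-ordered frontier queues
    back_queues: list[deque]           # per-domain politeness queues
    bloom_bits: bytes                  # the visited-set Bloom filter's bit array
    domain_positions: dict[str, int]   # current crawl position per domain queue

# Why not a WAL? Per-mutation logging at the design's crawl rate would need:
PAGES_PER_SECOND = 6_200
OPS_PER_PAGE = 2  # one frontier insert + one dequeue per fetched page
print(PAGES_PER_SECOND * OPS_PER_PAGE)  # 12400 durable writes/s, i.e. roughly 12k
```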
Trade-off: we gave up per-operation durability (losing up to 10 minutes of work on crash) in exchange for eliminating the I/O overhead of continuous logging. The checkpoint file size is proportional to the frontier size: a 120 GB Bloom filter plus a 50 GB frontier snapshot means each checkpoint is roughly 170 GB.
Implication: writing 170 GB to SSD takes about 2 minutes, so our 10-minute interval gives us an 8-minute buffer between writes. Google's distributed crawler uses a WAL (Write-Ahead Log) pattern for its most critical frontier mutations while batching less critical state into periodic snapshots, a hybrid approach we would adopt if per-operation durability became a requirement.
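The 2-minute write time and 8-minute buffer fall out of simple arithmetic. The SSD bandwidth figure below is an assumption (roughly 1.4 GB/s sequential writes, NVMe-class) chosen to be consistent with the 2-minute claim in the text:

```python
CHECKPOINT_GB = 170          # 120 GB Bloom filter + 50 GB frontier snapshot
SSD_WRITE_GB_PER_S = 1.4     # assumed sequential write bandwidth (NVMe-class)
INTERVAL_MIN = 10            # checkpoint interval from the design

write_min = CHECKPOINT_GB / SSD_WRITE_GB_PER_S / 60   # time to persist one snapshot
buffer_min = INTERVAL_MIN - write_min                 # idle headroom between writes
print(round(write_min), round(buffer_min))            # 2 8
```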
Internet Archive checkpoints its crawl state to S3 every 15 minutes. What if the interviewer asks: 'Why not checkpoint every minute for less data loss?' Because overlapping writes become a risk: if the 170 GB snapshot takes 2 minutes to write and we checkpoint every minute, the next snapshot starts before the previous one finishes, causing I/O contention and potential corruption.
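The overlap argument reduces to one inequality: the interval must exceed the snapshot's write duration, or the next write starts before the previous one finishes. A minimal check, using the document's own numbers:

```python
def interval_is_safe(interval_min: float, snapshot_write_min: float) -> bool:
    """Snapshots must not overlap: the next checkpoint may only begin
    after the previous write has fully completed."""
    return interval_min > snapshot_write_min

print(interval_is_safe(10, 2))  # True: our 10-minute interval vs a 2-minute write
print(interval_is_safe(1, 2))   # False: 1-minute interval overlaps the 2-minute write
```

In practice we would also want headroom above the bare minimum, since write bandwidth is shared with the crawl itself.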
Why it matters in interviews
Checkpointing tests whether we think about crash recovery in long-running systems. Quantifying the cost of losing nearly two weeks of crawl progress versus the 170 GB checkpoint overhead shows we reason about durability trade-offs concretely.
Related concepts