Web Crawler Failure Modes
What breaks, how to detect it, and how to fix it. Every failure mode includes detection metrics, mitigations, and a severity rating.
Crawler node crash mid-crawl loses in-flight URLs
Without leased dequeue, crashed-node URLs are permanently lost. With it, the impact is limited to a brief throughput dip.
Constraint: each of 50 fetcher nodes holds 500 URLs dequeued but not yet fetched. A node OOM-kills. What breaks: those 500 URLs vanish, removed from the frontier on dequeue but never processed. At 6,200 pages/sec across 50 nodes, each handles 124 pages/sec. A 10-minute outage forfeits ~74,400 pages of that node's throughput (124/sec x 600s) plus the 500 in-flight URLs. Restoring from an 8-minute-old checkpoint means roughly 2,976,000 pages (6,200/sec x 480s) get re-crawled. Detection: heartbeat failure (3 missed beats at 10s intervals), Kafka consumer lag spike, fetch rate drops to ~6,076/sec. User impact: throughput dips ~2% for 60s, no external impact.
- Lease-based dequeue: URLs are 'leased' from the frontier for 5 minutes. If not marked as fetched within that window, they return to the queue automatically. We chose leases (not permanent dequeue) to prevent URL loss on crashes.
- Checkpointing every 10 minutes to HDFS: maximum 3.72M pages of lost progress, recoverable in 10 minutes.
- Kubernetes auto-restarts the crashed pod within 60 seconds. Its domain queues are redistributed to surviving nodes within 30 seconds via consistent hashing.
- Over-provision by 10%: we run 55 nodes so losing 5 still meets the 6,200 pages/sec target.
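The lease-and-reap cycle behind the first mitigation can be sketched in a few lines. This is a minimal in-memory model, not the production frontier: `LeasedFrontier`, `lease()`, `ack()`, and `reap_expired()` are hypothetical names, and only the 5-minute lease window comes from the mitigation above.

```python
import time
from collections import deque

LEASE_SECONDS = 300  # 5-minute lease, per the mitigation above

class LeasedFrontier:
    """Minimal in-memory sketch of lease-based dequeue (hypothetical API)."""

    def __init__(self):
        self._queue = deque()  # pending URLs
        self._leases = {}      # url -> lease expiry timestamp

    def enqueue(self, url):
        self._queue.append(url)

    def lease(self, now=None):
        """Dequeue a URL under a lease instead of removing it permanently."""
        now = now if now is not None else time.time()
        if not self._queue:
            return None
        url = self._queue.popleft()
        self._leases[url] = now + LEASE_SECONDS
        return url

    def ack(self, url):
        """Mark a leased URL as fetched; it never returns to the queue."""
        self._leases.pop(url, None)

    def reap_expired(self, now=None):
        """Return URLs whose lease expired (e.g. a crashed node) to the queue."""
        now = now if now is not None else time.time()
        expired = [u for u, exp in self._leases.items() if exp <= now]
        for url in expired:
            del self._leases[url]
            self._queue.append(url)
        return expired
```

A node that crashes simply never calls `ack()`; a periodic reaper returns its leased URLs to the queue within one lease window, which is why the worst case is a throughput dip rather than URL loss.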
DNS resolver becomes a bottleneck
DNS is a single point of failure. If resolution fails, fetching stops entirely for affected domains.
Constraint: local DNS cache holds 1M entries but we hit 100K+ unique domains per hour. What breaks: long-tail domains send ~1,860 lookups/sec to the external resolver at a 30% cache miss rate (6,200 pages/sec x 30%). The resolver caps at 1,000 queries/sec, returning SERVFAIL for the excess 860. Failed fetches retry, creating a feedback loop that amplifies DNS pressure. Detection: resolution latency spiking from 5ms to 500ms+, timeout rate above 5%, fetcher threads blocked on DNS, crawl rate below 5,000 pages/sec. User impact: crawl rate drops 20-50% until DNS capacity is restored.
- Run a local recursive DNS resolver (Unbound) per crawler machine, bypassing external resolvers for cached domains. We chose per-machine resolvers (not a centralized DNS proxy) to avoid a single point of failure.
- Pre-resolve DNS for domains in the frontier 30 seconds before fetching: batch resolution during idle fetcher cycles.
- Multiple upstream resolvers (8.8.8.8, 1.1.1.1, corporate resolver) with round-robin to spread the load.
- Increase local DNS cache to 5M entries (290 MB). Covers 95%+ of domains across a full crawl cycle.
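The cache-then-fallback path combining the last two mitigations can be sketched as below. This is an illustrative model, not a real resolver: `CachingResolver` and its `upstreams` interface (a callable that returns an IP string or raises `OSError` on SERVFAIL/timeout) are assumptions, and a real deployment would sit behind Unbound as described above.

```python
import itertools
import time
from collections import OrderedDict

class CachingResolver:
    """Sketch of a bounded TTL+LRU DNS cache with round-robin upstream
    fallback. `upstreams` is a list of resolver callables (hypothetical
    interface: resolver(domain) -> ip, raising OSError on failure)."""

    def __init__(self, upstreams, max_entries=5_000_000, ttl=3600):
        self.upstreams = upstreams
        self.max_entries = max_entries
        self.ttl = ttl
        self._cache = OrderedDict()  # domain -> (ip, expiry timestamp)
        self._rr = itertools.cycle(range(len(upstreams)))

    def resolve(self, domain, now=None):
        now = now if now is not None else time.time()
        hit = self._cache.get(domain)
        if hit and hit[1] > now:
            self._cache.move_to_end(domain)  # LRU touch
            return hit[0]
        # Cache miss: try each upstream once, starting at the round-robin cursor.
        start = next(self._rr)
        last_err = None
        for i in range(len(self.upstreams)):
            resolver = self.upstreams[(start + i) % len(self.upstreams)]
            try:
                ip = resolver(domain)
                break
            except OSError as err:
                last_err = err
        else:
            raise last_err
        self._cache[domain] = (ip, now + self.ttl)
        self._cache.move_to_end(domain)
        if len(self._cache) > self.max_entries:  # evict least recently used
            self._cache.popitem(last=False)
        return ip
```

Round-robin spreads the miss load across upstreams, and a failed upstream is skipped transparently rather than surfacing SERVFAIL to the fetcher.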
Crawler trap consumes entire crawl budget
Wastes crawl budget but does not crash the system. Detected by monitoring per-domain crawl counts.
Constraint: 20 outlinks per page, no per-domain cap. What breaks: the crawler hits a calendar at /events?date=2024-01-01. Each page links to the next/previous day and month: 4 outlinks, each unique. After 1,000 pages the frontier holds 4,000 URLs from this domain. After 24 hours: 100K pages of near-identical views, wasting storage and diverting 16,000 seconds of fetcher time. Detection: per-domain count above 10K in one cycle, repetitive URL patterns, SimHash showing 90%+ near-duplicates. User impact: legitimate domains starved of crawl budget.
- Hard cap of 10K pages per domain per crawl cycle. After 10K pages, the domain is skipped for the remainder of the cycle.
- Depth limit of 15 hops from seed URLs. Calendar traps are typically 50+ hops deep.
- URL pattern detection: if 100+ URLs match the same regex pattern, blacklist the pattern and remove matching URLs from the frontier.
- Human-reviewed domain blacklist for known trap sites, updated monthly.
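The per-domain cap and pattern-detection mitigations can be sketched together. Assumed here: a simple normalization where runs of digits collapse to `N` so trap URLs share one pattern; `TrapGuard`, `admit()`, and `url_pattern()` are hypothetical names, while the 100-URL pattern limit and 10K domain cap come from the mitigations above.

```python
import re
from collections import Counter
from urllib.parse import urlparse

PATTERN_LIMIT = 100  # blacklist a pattern once 100+ URLs match it
DOMAIN_CAP = 10_000  # hard per-domain cap per crawl cycle

def url_pattern(url):
    """Collapse variable parts (digits) so trap URLs share one pattern,
    e.g. /events?date=2024-01-01 -> /events?date=N-N-N."""
    parts = urlparse(url)
    shape = re.sub(r"\d+", "N", parts.path + "?" + parts.query)
    return parts.netloc + shape

class TrapGuard:
    """Sketch of trap detection via pattern frequency and per-domain caps."""

    def __init__(self):
        self.pattern_counts = Counter()
        self.domain_counts = Counter()
        self.blacklisted = set()

    def admit(self, url):
        """Return True if the URL may enter the frontier this cycle."""
        domain = urlparse(url).netloc
        pattern = url_pattern(url)
        if pattern in self.blacklisted or self.domain_counts[domain] >= DOMAIN_CAP:
            return False
        self.pattern_counts[pattern] += 1
        self.domain_counts[domain] += 1
        if self.pattern_counts[pattern] > PATTERN_LIMIT:
            self.blacklisted.add(pattern)
        return True
```

A production version would also purge already-enqueued URLs matching a newly blacklisted pattern, per the third mitigation.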
Politeness violation triggers mass IP banning
IP bans can block crawling of entire CDN networks. Recovery requires manual outreach to domain operators.
Constraint: 1 req/sec/domain via per-node rate limiters in memory. What breaks: a rolling deployment restarts all 50 nodes in 5 minutes. Limiters reset to zero. For 30-60 seconds there is no politeness enforcement. A popular domain gets 50 concurrent requests/sec. Its WAF bans our IP range, blocking crawls to that domain and all domains behind the same CDN (Cloudflare protects 20%+ of the web). Detection: HTTP 429/403 rate above 10%, robots.txt failures for accessible domains, crawler IP on blocklists. User impact: entire CDN networks become uncrawlable, recovery takes days.
- Persist rate limiter state to disk. On restart, reload the last known state instead of resetting to zero. We chose disk persistence (not Redis) because each node's rate limiter is local and the state is small (1M domains x 12 bytes = 12 MB).
- Global per-domain rate tracking (not per-node): a centralized rate limiter in Redis that all nodes check before fetching.
- Gradual restart: rolling deployments restart 1 node at a time with 5-minute gaps, never resetting more than 2% of rate limiters simultaneously.
- IP rotation: use multiple IP ranges across data centers. If one range is banned, others continue.
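The first mitigation can be sketched as a minimal disk-persisted limiter, assuming a fixed 1-second interval per domain and a JSON snapshot file. `PersistentRateLimiter` and its file layout are illustrative, not the actual on-disk format.

```python
import json
import os
import time

class PersistentRateLimiter:
    """Sketch: per-domain 1 req/sec limiter whose last-request timestamps
    survive restarts via a small JSON snapshot (hypothetical layout)."""

    def __init__(self, state_path, min_interval=1.0):
        self.state_path = state_path
        self.min_interval = min_interval
        self.last_request = {}
        if os.path.exists(state_path):  # reload instead of resetting to zero
            with open(state_path) as f:
                self.last_request = json.load(f)

    def try_acquire(self, domain, now=None):
        now = now if now is not None else time.time()
        if now - self.last_request.get(domain, 0.0) < self.min_interval:
            return False  # still inside the politeness window
        self.last_request[domain] = now
        return True

    def snapshot(self):
        """Persist atomically so a crash mid-write cannot corrupt state."""
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.last_request, f)
        os.replace(tmp, self.state_path)
```

The key property: a freshly restarted node that reloads the snapshot still refuses to fetch a domain hit moments before the restart, closing the 30-60 second enforcement gap.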
SimHash false positive marks unique page as duplicate
Individual pages are missed, not a systemic failure. Caught by coverage audits comparing crawled vs expected page counts.
Constraint: 3-bit threshold for near-duplicate detection. What breaks: two pages with different content produce SimHash fingerprints differing by 2 bits. Dedup marks the second as duplicate and skips it. Random collision on 64-bit SimHash is rare (the odds of two random fingerprints differing by 3 bits or fewer are about 43,745 / 2^64, roughly 2.4 x 10^-15), but real pages share templates and boilerplate, so their fingerprints are far from random. Template-heavy sites hide legitimate differences behind matching nav/footer. Detection: coverage audits showing missing pages for high-value domains, bit-difference histogram spiking at 2-3 bits. User impact: pages missing from the search index.
- Use a stricter threshold: flag as duplicate only if fewer than 2 bits differ (instead of 3). Reduces false positives at the cost of storing more near-duplicates.
- Two-phase dedup: SimHash as a fast filter, then exact content comparison (MD5 of stripped boilerplate) for pages flagged as near-duplicates.
- Strip boilerplate (navigation, headers, footers) before computing SimHash. Libraries like boilerpipe extract main content, making SimHash more accurate.
- Domain-specific thresholds: e-commerce sites get a stricter threshold (1 bit) because product pages share templates but have unique content.
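The two-phase check from the second mitigation can be sketched with a toy SimHash. `simhash64` and `is_duplicate` are illustrative names, the token hashing is simplified, and tokens are assumed to already have boilerplate stripped per the third mitigation.

```python
import hashlib

def simhash64(tokens):
    """Toy 64-bit SimHash: sum signed bit votes from each token's hash."""
    weights = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming(a, b):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")

def is_duplicate(tokens_a, tokens_b, threshold=3):
    """Two-phase dedup: SimHash as a fast filter, exact hash as confirmation.
    Pages within the bit threshold are only dropped if content is identical,
    which eliminates the 2-bit false-positive case described above."""
    if hamming(simhash64(tokens_a), simhash64(tokens_b)) > threshold:
        return False  # clearly different, skip phase two
    exact = lambda toks: hashlib.md5(" ".join(toks).encode()).hexdigest()
    return exact(tokens_a) == exact(tokens_b)
```

Phase two costs one extra hash only for the small fraction of pairs the SimHash filter flags, so the fast path stays fast.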
URL frontier overflow under link explosion
Uncontrolled frontier growth degrades the entire crawl pipeline. Enqueue slowdowns cascade to fetcher throughput.
Constraint: frontier sized for 200M pending URLs on disk. What breaks: each fetched page adds 20 new URLs on average. After 1M pages: 20M URLs enqueued minus 1M dequeued = 19M net growth. After 1 week the frontier holds 500M+ URLs, consuming 50 GB of SSD. RocksDB compaction time rises with the size of the LSM tree. Enqueue latency climbs from 0.1ms to 10ms, throttling the entire pipeline. Detection: frontier growing faster than the drain rate (net positive over 24 hours), compaction exceeding 30% of I/O, enqueue p99 above 5ms. User impact: crawl throughput degrades as the pipeline stalls on frontier I/O.
- Frontier size cap: when pending URLs exceed 200M, stop enqueueing low-priority URLs (priority < 3). Resume when the frontier drops below 150M.
- Aggressive dedup at enqueue time: reject URLs from domains already well-represented in the frontier (> 1K pending URLs per domain).
- Prioritized eviction: drop the lowest-priority 10% of the frontier when it exceeds the cap, rather than blocking enqueue entirely.
- Scale out the frontier across more nodes via consistent hashing to distribute the disk I/O load.
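The watermark-based backpressure from the first mitigation can be sketched as follows. `BoundedFrontier` and its tiny in-memory heap stand in for the RocksDB-backed frontier; the watermarks are scaled down in the usage below, and the priority-3 floor comes from the mitigation above.

```python
import heapq

HIGH_WATERMARK = 200_000_000  # stop enqueueing low-priority URLs above this
LOW_WATERMARK = 150_000_000   # resume normal enqueue below this
MIN_PRIORITY = 3              # URLs below this priority are shed while capped

class BoundedFrontier:
    """Sketch of frontier backpressure: shed low-priority URLs above the
    high watermark, resume normal enqueue below the low watermark."""

    def __init__(self, high=HIGH_WATERMARK, low=LOW_WATERMARK):
        self.high, self.low = high, low
        self.heap = []        # (-priority, url): highest priority pops first
        self.shedding = False

    def enqueue(self, url, priority):
        # Hysteresis between the two watermarks avoids flapping at the cap.
        if len(self.heap) >= self.high:
            self.shedding = True
        elif len(self.heap) <= self.low:
            self.shedding = False
        if self.shedding and priority < MIN_PRIORITY:
            return False  # backpressure: drop the low-priority URL
        heapq.heappush(self.heap, (-priority, url))
        return True

    def dequeue(self):
        if not self.heap:
            return None
        _, url = heapq.heappop(self.heap)
        return url
```

The same structure extends naturally to the third mitigation: instead of rejecting new enqueues, drop the bottom of the heap when the cap is exceeded.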