
DNS Resolution Cache

The constraint: at 6,200 pages per second, every millisecond of latency in the fetch pipeline matters. Each uncached DNS lookup adds 50 to 200ms of blocking time while the resolver walks the DNS hierarchy.
Implication: sequential resolution cannot sustain that rate, so DNS becomes the bottleneck before network bandwidth does. The fix is massive parallelism plus a local DNS cache keyed by hostname that stores resolved IP addresses with TTL-based refresh.
Without caching, 6,200 lookups per second at a 100 ms average would accrue 620 seconds of cumulative DNS wait for every second of wall-clock time.
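That back-of-envelope figure is worth being able to reproduce on the spot:

```python
# Back-of-envelope: DNS wait accrued per wall-clock second without caching.
pages_per_sec = 6_200
avg_lookup_ms = 100                        # within the 50-200 ms range above
wait_sec = pages_per_sec * avg_lookup_ms / 1000
print(wait_sec)                            # 620.0 seconds of wait per second
```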
We chose an in-process DNS cache (not relying on the OS resolver) because the OS resolver's cache is shared across all processes and its eviction policy does not prioritize crawler-critical domains. Trade-off: we gave up OS-level cache sharing in exchange for crawler-specific TTL tuning and cache warmth control.
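A minimal sketch of such an in-process cache, with TTL honored per entry. The `resolver` callable is an injected stand-in for a real DNS client (a real crawler would wrap an async resolver), and the hit/miss counters exist only to make cache behavior observable:

```python
import time

class DnsCache:
    """In-process DNS cache keyed by hostname with per-entry TTL expiry.

    `resolver` is a hypothetical callable: hostname -> (ip, ttl_seconds).
    """

    def __init__(self, resolver):
        self._resolver = resolver
        self._cache = {}        # hostname -> (ip, expires_at)
        self.hits = 0
        self.misses = 0

    def resolve(self, hostname, now=None):
        now = time.monotonic() if now is None else now
        entry = self._cache.get(hostname)
        if entry is not None and now < entry[1]:
            self.hits += 1
            return entry[0]     # warm path: no network round-trip
        # Miss or expired entry: re-resolve and honor the returned TTL.
        self.misses += 1
        ip, ttl = self._resolver(hostname)
        self._cache[hostname] = (ip, now + ttl)
        return ip
```

Because the cache is ours rather than the OS resolver's, the TTL handling (and any crawler-specific floor or ceiling on it) is fully under our control.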
The cache hit rate for a well-tuned crawler exceeds 95% because web pages overwhelmingly link to domains we have already visited. On a cache miss, we trigger an async DNS resolution that runs in parallel with other fetches, preventing one slow lookup from blocking the entire pipeline.
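The async miss path can be sketched with asyncio. The names here (`cache`, `in_flight`, `resolver`) are illustrative, and the single-flight map is an assumption about how concurrent misses for the same host would be coalesced into one lookup:

```python
import asyncio

async def resolve_async(hostname, cache, in_flight, resolver):
    """Resolve hostname without letting one slow lookup block other fetches.

    `resolver` is a hypothetical async callable: hostname -> ip.
    """
    if hostname in cache:
        return cache[hostname]              # warm path: no await at all
    task = in_flight.get(hostname)
    if task is None:
        # Single-flight: concurrent misses for one host share one lookup.
        task = asyncio.create_task(resolver(hostname))
        in_flight[hostname] = task
    ip = await task                         # peers keep running meanwhile
    cache[hostname] = ip
    in_flight.pop(hostname, None)
    return ip
```

While one coroutine awaits a slow lookup, the event loop keeps driving every other fetch, which is exactly the "one slow lookup must not block the pipeline" property described above.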
We pre-resolve DNS for all seed domains before starting a crawl batch, so the first wave of requests never blocks on cold DNS. Google runs its own DNS infrastructure (Google Public DNS at 8.8.8.8) partly because its crawler needs sub-millisecond lookups at scale; the Internet Archive's crawler, which feeds the Wayback Machine, uses the same pre-resolution strategy.
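Pre-resolution is then just a concurrent batch over the seed set before the crawl loop starts. `prewarm` and the injected `resolver` are illustrative names, not a real API:

```python
import asyncio

async def prewarm(seed_hosts, resolver):
    """Resolve every seed domain concurrently before the first crawl batch,
    so the opening wave of requests never blocks on a cold lookup.

    `resolver` is a hypothetical async callable: hostname -> ip.
    """
    ips = await asyncio.gather(*(resolver(host) for host in seed_hosts))
    return dict(zip(seed_hosts, ips))   # seeds the hostname -> ip cache
```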
What if the interviewer asks: 'What if a domain's IP changes during a crawl?' We honor the DNS TTL and re-resolve when it expires. For long crawls spanning weeks, a typical 300-second TTL means we re-check every 5 minutes, which is frequent enough to catch IP migrations without adding meaningful overhead.
Why it matters in interviews
DNS is a hidden bottleneck that most candidates overlook. Mentioning the 50-200ms per lookup cost and calculating why caching is mandatory at 6,200 pages per second shows we think about infrastructure-level performance, not application logic alone.