Standard walkthrough
URL Frontier
Where do the next URLs to crawl come from, and how do we prevent high-value pages from waiting behind obscure blogs? The constraint: a single FIFO queue treats cnn.com the same as a random personal homepage, and it has no way to prevent us from hammering one domain with 100 concurrent requests.
We solve both problems with a two-level queue architecture called the URL frontier. The front queues handle priority: we rank URLs by PageRank score, freshness signals, and domain authority, so high-value pages get crawled first. The back queues handle politeness: each back queue maps to exactly one domain, guaranteeing at most one request per second per site.
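The two levels can be sketched as a priority heap that drains into per-domain FIFO queues, with a timer heap gating when each domain may be fetched next. This is a minimal illustration, not a production implementation; the class and method names (`Frontier`, `next_url`, etc.) are invented for the example, and host extraction is deliberately naive.

```python
import heapq
import time
from collections import defaultdict, deque

class Frontier:
    """Illustrative two-level frontier: a priority front heap feeds
    per-domain back queues; a timer heap enforces politeness."""

    def __init__(self, politeness_delay=1.0):
        self.front = []                 # min-heap of (-priority, seq, url)
        self.back = defaultdict(deque)  # domain -> FIFO of that domain's URLs
        self.ready = []                 # min-heap of (next_fetch_time, domain)
        self.delay = politeness_delay
        self._seq = 0                   # tie-breaker for equal priorities

    def add(self, url, priority):
        heapq.heappush(self.front, (-priority, self._seq, url))
        self._seq += 1

    def _domain(self, url):
        return url.split('/')[2]        # naive host extraction for the sketch

    def _refill(self):
        # Drain the front heap into domain-specific back queues.
        while self.front:
            _, _, url = heapq.heappop(self.front)
            dom = self._domain(url)
            if not self.back[dom]:      # domain not yet in the timer heap
                heapq.heappush(self.ready, (0.0, dom))
            self.back[dom].append(url)

    def next_url(self, now=None):
        """Return the next URL whose domain is cleared to fetch, else None."""
        self._refill()
        now = time.monotonic() if now is None else now
        if self.ready and self.ready[0][0] <= now:
            _, dom = heapq.heappop(self.ready)
            url = self.back[dom].popleft()
            if self.back[dom]:          # reschedule domain after the delay
                heapq.heappush(self.ready, (now + self.delay, dom))
            return url
        return None
```

Note how a slow or rate-limited domain never blocks the rest: its entry simply sits in the timer heap with a future timestamp while other domains keep draining.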
We start with a set of seed URLs and expand outward using breadth-first traversal, discovering new links at each level. At Google's scale, the frontier holds hundreds of millions of URLs.
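The seed-and-expand traversal is plain breadth-first search with a visited set for deduplication. A minimal sketch, where `fetch_links` is a hypothetical stand-in for the real download-and-parse step:

```python
from collections import deque

def crawl_bfs(seeds, fetch_links, max_pages=1000):
    """Breadth-first discovery from seed URLs.
    `fetch_links(url)` is a placeholder returning a page's outgoing links."""
    seen = set(seeds)       # dedupe: never enqueue a URL twice
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

In the real system the `queue` here is exactly what the frontier replaces: BFS order alone gives no priority and no politeness, which is the gap the two-level design fills.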
Implication: that volume cannot comfortably fit in memory, so we choose a disk-backed frontier (not a purely in-memory queue): even at a modest 100 bytes per URL, 500 million URLs require 50 GB of storage. Trade-off: we give up sub-millisecond queue operations in exchange for capacity that scales with disk, not RAM.
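The sizing arithmetic is worth checking explicitly, since it drives the disk-backed decision:

```python
# Back-of-envelope frontier sizing (figures from the text above).
urls = 500_000_000
bytes_per_url = 100           # URL string plus minimal metadata
total_gb = urls * bytes_per_url / 1e9
print(total_gb)               # 50.0 GB: too large for a comfortable in-RAM queue
```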
We model this after Google's Mercator-style frontier, in which a front-queue selector feeds URLs into domain-specific back queues, each with a heap-based timer enforcing crawl intervals. Common Crawl processes over 3 billion pages per month using a similar two-level design.
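The heap-based timer on its own is small enough to show in isolation. A sketch, with invented names (`CrawlTimer`, `next_fetch`), assuming a single monotonic clock passed in by the caller:

```python
import heapq

class CrawlTimer:
    """Min-heap of (next_fetch_time, domain) enforcing a per-domain delay.
    Illustrative only; names are not from Mercator or any real codebase."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.heap = []

    def add_domain(self, domain, now=0.0):
        heapq.heappush(self.heap, (now, domain))

    def next_fetch(self, now):
        """Return a domain cleared to fetch (rescheduling it), or None."""
        if self.heap and self.heap[0][0] <= now:
            _, domain = heapq.heappop(self.heap)
            heapq.heappush(self.heap, (now + self.delay, domain))
            return domain
        return None
```

Peeking the heap top costs O(1) and rescheduling costs O(log d) in the number of domains, so the politeness check stays cheap even with millions of active domains.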
What if the interviewer asks: 'Why not use a priority queue with domain-level locking instead of two separate queue levels?' Because coupling priority and politeness in one structure means a lock on a slow domain blocks dequeue operations for all domains behind it in priority order. Separating the concerns lets us drain the priority queue at full speed while the politeness layer independently gates per-domain throughput.
Related concepts