
Politeness / robots.txt

The constraint: a high-throughput crawler sending 100 concurrent requests to a small site will get our IP banned within minutes and potentially take the site offline. Before we build any fetching logic, we need a hard rate limit. Politeness caps our crawler at one request per second per domain, enforced by the back queue layer in the frontier.
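A minimal sketch of the per-domain limit described above, assuming a single-process crawler. The class name `DomainRateLimiter` and its methods are hypothetical, not part of any real library; the real back queue layer would likely be distributed and more involved.

```python
import time
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Tracks the earliest time each domain may be fetched again."""

    def __init__(self, default_delay=1.0, clock=time.monotonic):
        self.default_delay = default_delay  # seconds between requests per domain
        self.clock = clock                  # injectable so the logic can be tested
        self.next_allowed = {}              # domain -> earliest next fetch time

    def wait_time(self, url):
        """Seconds the caller must still wait before fetching this URL."""
        domain = urlsplit(url).netloc.lower()
        return max(0.0, self.next_allowed.get(domain, 0.0) - self.clock())

    def record_fetch(self, url, delay=None):
        """Mark a fetch as issued now and reserve the domain for `delay` seconds."""
        domain = urlsplit(url).netloc.lower()
        gap = self.default_delay if delay is None else delay
        self.next_allowed[domain] = self.clock() + gap
```

A fetcher would call `wait_time(url)` before each request, sleep or requeue if the result is positive, and call `record_fetch(url)` once the request is issued. Keying on the domain means a slow site never blocks requests to other domains.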
But respecting rate limits is not enough. Site owners also need to declare which paths are off-limits. The robots.txt standard, created by Martijn Koster in 1994, lets owners specify disallowed paths. Many sites also set a Crawl-delay directive, a widely used but non-standard extension that not every crawler honors.
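To make the robots.txt rules concrete, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the `ExampleCrawler` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body; a real crawler would download this from
# https://<domain>/robots.txt before requesting any page on the domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Path checks: anything under /private/ is off-limits to every user agent.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/blog/post-1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# crawl_delay() returns None when the file declares no Crawl-delay.
delay = parser.crawl_delay(USER_AGENT)
```

The crawler would skip any URL for which `can_fetch` returns `False`, and could feed `delay` into its per-domain rate limit when it is larger than the default.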
Before issuing any page requests to a domain, we fetch its robots.txt and cache it for 24 hours. We chose a 24-hour cache TTL (rather than re-fetching on every request) because robots.txt changes rarely, and re-fetching it alongside each of our 6,200 page requests per second would consume bandwidth we need for actual page crawls.
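The caching behavior can be sketched as a small TTL cache. This is an in-memory illustration under stated assumptions: `RobotsCache` is a hypothetical name, and `fetch` stands in for whatever function actually downloads a domain's robots.txt.

```python
import time

ROBOTS_TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL described above


class RobotsCache:
    """Caches robots.txt text per domain and re-fetches only after the TTL."""

    def __init__(self, fetch, ttl=ROBOTS_TTL_SECONDS, clock=time.monotonic):
        self.fetch = fetch    # hypothetical callable: domain -> robots.txt text
        self.ttl = ttl
        self.clock = clock    # injectable so expiry can be tested
        self.entries = {}     # domain -> (expires_at, robots_text)

    def get(self, domain):
        entry = self.entries.get(domain)
        if entry is not None and self.clock() < entry[0]:
            return entry[1]   # still fresh: no network request
        text = self.fetch(domain)
        self.entries[domain] = (self.clock() + self.ttl, text)
        return text
```

With this shape, each domain costs at most one robots.txt fetch per 24 hours regardless of how many of its pages are crawled, at the price of rules that can be up to a day stale.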
Trade-off: we gave up real-time robots.txt accuracy in exchange for eliminating redundant fetches. The politeness module sits between the frontier and the HTTP fetcher, acting as a rate limiter keyed by domain.
If a site specifies Crawl-delay: 5, we wait 5 seconds between requests to that domain, even if the frontier has thousands of URLs queued for it. Large crawlers apply the same principle: Google reportedly crawls roughly 10 billion pages per day yet keeps its load on any individual site to about 1 request per second, and Bing and the Internet Archive likewise throttle per-site traffic.
What if the interviewer asks: 'What happens if a site has no robots.txt?' We treat a missing robots.txt as permission to crawl all paths but still enforce our per-domain rate limit.
A 404 on robots.txt is not the same as a 5xx: on server errors, we back off and retry later rather than assuming open access.
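The distinction between a missing robots.txt and a failing server can be sketched as a small decision function. The names `RobotsPolicy` and `policy_for_status` are hypothetical, and the handling of status codes other than 2xx, 404, and 5xx is an assumption, since the text does not cover them.

```python
from enum import Enum


class RobotsPolicy(Enum):
    PARSE_RULES = "parse_rules"   # robots.txt fetched: apply its rules
    ALLOW_ALL = "allow_all"       # no robots.txt: crawl all paths, still rate-limited
    RETRY_LATER = "retry_later"   # server trouble: back off, do not assume access


def policy_for_status(status_code):
    """Map the HTTP status of a robots.txt fetch to a crawl policy."""
    if 200 <= status_code < 300:
        return RobotsPolicy.PARSE_RULES
    if status_code == 404:
        return RobotsPolicy.ALLOW_ALL
    if 500 <= status_code < 600:
        return RobotsPolicy.RETRY_LATER
    # Assumption: anything else (redirects, 401/403, ...) is treated
    # conservatively here; a real crawler would define these cases explicitly.
    return RobotsPolicy.RETRY_LATER
```

In every branch the per-domain rate limit still applies; the policy only decides which paths, if any, may be requested.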
Why it matters in interviews
This is the simplest concept, but ignoring it breaks everything. Interviewers want to hear that politeness is not optional. Explaining the 24-hour robots.txt cache and per-domain rate limiting shows that we respect real-world constraints before optimizing for throughput.