Medium · Data Systems · 40 min
Design a Web Crawler
Design a web crawler that systematically browses the web to index pages for a search engine.
Google · Microsoft · Amazon · Apple
Functional Requirements
- Crawl billions of web pages
- Politeness: respect robots.txt and rate limits
- Handle duplicate content detection
- Prioritize important/fresh pages
- Fault tolerance: resume after failures
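The politeness requirement is concrete enough to sketch: before fetching a URL, the crawler checks the site's robots.txt rules and honors any crawl-delay. A minimal sketch using Python's standard-library `urllib.robotparser` (the robots.txt content and crawler name here are illustrative assumptions, not fetched from a real site):

```python
from urllib import robotparser

# Hypothetical robots.txt for an example site (assumption: parsed from a
# string rather than fetched over the network, to keep the sketch self-contained).
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check whether our (hypothetical) crawler may fetch each URL.
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))    # False
print(rp.crawl_delay("MyCrawler"))                                   # 2
```

In a real crawler the parsed rules would be cached per host, and the crawl-delay enforced by the per-host request scheduler.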
Step 1: Clarify Requirements
Scale: crawl 1 billion pages/month ≈ 400 pages/sec. Average page size: 500KB. Storage: 1B × 500KB = 500TB/month. Content scope: HTML only, or also PDFs and images? Freshness: how often must pages be re-crawled?
Key Points
- 1 billion pages per month → ~400 pages/sec
- 500TB raw HTML per month
- Need to respect robots.txt and crawl-delay
- Re-crawl important pages more frequently
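The throughput and storage estimates above follow from simple arithmetic over a 30-day month, which can be checked directly:

```python
# Back-of-envelope capacity estimate for the crawl.
SECONDS_PER_MONTH = 30 * 24 * 3600        # ~2.6 million seconds
pages_per_month = 1_000_000_000           # 1 billion pages
avg_page_kb = 500                         # average raw HTML size

pages_per_sec = pages_per_month / SECONDS_PER_MONTH
storage_tb = pages_per_month * avg_page_kb / 1_000_000_000  # KB -> TB

print(f"{pages_per_sec:.0f} pages/sec")   # ≈ 386, i.e. ~400
print(f"{storage_tb:.0f} TB/month")       # 500
```

Rounding 386 up to ~400 pages/sec gives headroom for retries and re-crawls of frequently updated pages.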