Medium · Data Systems · 40 min
Design a Web Crawler
Design a web crawler that systematically browses the web to index pages for a search engine.
Google · Microsoft · Amazon · Apple
Functional Requirements
- Crawl billions of web pages
- Politeness: respect robots.txt and rate limits
- Handle duplicate content detection
- Prioritize important/fresh pages
- Fault tolerance: resume after failures
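The politeness requirement is concrete enough to sketch: before fetching a URL, the crawler checks the site's robots.txt rules and honors any crawl-delay. A minimal sketch using Python's standard-library `urllib.robotparser` (the robots.txt content and crawler name here are illustrative assumptions, not fetched from a real site):

```python
from urllib import robotparser

# Hypothetical robots.txt for an example site (assumption: parsed from a
# string rather than fetched over the network, to keep the sketch self-contained).
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check whether our (hypothetical) crawler may fetch each URL.
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))    # False
print(rp.crawl_delay("MyCrawler"))                                   # 2
```

In a real crawler the parsed rules would be cached per host, and the crawl-delay enforced by the per-host request scheduler.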
Step 1: Clarify Requirements
Scale: crawl 1 billion pages/month ≈ 400 pages/sec. Average page size: 500KB. Storage: 1B × 500KB = 500TB/month. Content scope: HTML only, or also PDFs and images? Freshness: how often must pages be re-crawled?
Key Points
- 1 billion pages per month → ~400 pages/sec
- 500TB raw HTML per month
- Need to respect robots.txt and crawl-delay
- Re-crawl important pages more frequently
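The throughput and storage estimates above follow from simple arithmetic over a 30-day month, which can be checked directly:

```python
# Back-of-envelope capacity estimate for the crawl.
SECONDS_PER_MONTH = 30 * 24 * 3600        # ~2.6 million seconds
pages_per_month = 1_000_000_000           # 1 billion pages
avg_page_kb = 500                         # average raw HTML size

pages_per_sec = pages_per_month / SECONDS_PER_MONTH
storage_tb = pages_per_month * avg_page_kb / 1_000_000_000  # KB -> TB

print(f"{pages_per_sec:.0f} pages/sec")   # ≈ 386, i.e. ~400
print(f"{storage_tb:.0f} TB/month")       # 500
```

Rounding 386 up to ~400 pages/sec gives headroom for retries and re-crawls of frequently updated pages.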