Medium · Data Systems · 40 min

Design a Web Crawler

Design a web crawler that systematically browses the web to index pages for a search engine.

Asked at: Google, Microsoft, Amazon, Apple

Functional Requirements

  • Crawl billions of web pages
  • Politeness: respect robots.txt and rate limits
  • Handle duplicate content detection
  • Prioritize important/fresh pages
  • Fault tolerance: resume after failures
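For the duplicate-content requirement above, the simplest approach is to hash each fetched page body and skip bodies already seen. This is only a sketch: the in-memory set stands in for a shared store (e.g. a key-value database), and production crawlers typically add near-duplicate detection (e.g. SimHash) on top of exact hashing.

```python
import hashlib

# Exact-duplicate detection by hashing page bodies.
# `seen_hashes` is a stand-in for a shared, persistent store.
seen_hashes = set()

def is_duplicate(body: bytes) -> bool:
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

print(is_duplicate(b"<html>hello</html>"))  # False (first time seen)
print(is_duplicate(b"<html>hello</html>"))  # True  (exact repeat)
```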


Step 1: Clarify Requirements

Scale: Crawl 1 billion pages/month ≈ ~400 pages/sec. Average page size: 500KB. Storage: 1B × 500KB = 500TB/month. Content types: HTML only, or also PDFs and images? Freshness: how often should pages be re-crawled?
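The estimates above can be verified with quick arithmetic (assuming a 30-day month):

```python
# Back-of-envelope numbers from the stated requirements.
PAGES_PER_MONTH = 1_000_000_000
AVG_PAGE_BYTES = 500 * 1_000          # 500 KB average page
SECONDS_PER_MONTH = 30 * 24 * 3600    # ~2.6M seconds

pages_per_sec = PAGES_PER_MONTH / SECONDS_PER_MONTH
storage_tb = PAGES_PER_MONTH * AVG_PAGE_BYTES / 1e12

print(f"{pages_per_sec:.0f} pages/sec")  # 386, i.e. roughly 400
print(f"{storage_tb:.0f} TB/month")      # 500 TB/month
```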

Key Points

  • 1 billion pages per month → ~400 pages/sec
  • 500TB raw HTML per month
  • Need to respect robots.txt and crawl-delay
  • Re-crawl important pages more frequently
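The robots.txt and crawl-delay checks noted above can be handled with Python's standard-library parser. A minimal sketch follows; the robots.txt body would normally be fetched from each host (and cached per host), but it is inlined here so the example is self-contained. The user-agent name is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Inlined robots.txt body; a real crawler fetches and caches this per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check URLs against the rules before fetching.
print(rp.can_fetch("MyCrawlerBot", "https://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawlerBot", "https://example.com/private/x"))    # False

# Honor the site's requested delay between requests to this host.
print(rp.crawl_delay("MyCrawlerBot"))  # 2
```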