Scalability Playbook

10 essential scaling strategies with diagrams, trade-offs, and interview talking points. Know when and how to apply each.

Quick Decision Guide

QPS > 10K → Horizontal scaling + Load balancer
R:W > 10:1 → Caching (Redis) + Read replicas
Data > 1TB → Database sharding
Latency > 200ms → Async processing + Message queues
Global users → CDN + Multi-region deployment
Variable load → Auto-scaling + Connection pooling
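The guide above can be sketched as a rule function. The thresholds are the illustrative ones from the table, not universal constants, and the parameter names are invented for this sketch:

```python
def recommend(qps, read_write_ratio, data_tb, p99_latency_ms,
              global_users=False, variable_load=False):
    """Map rough workload metrics to candidate scaling strategies."""
    recs = []
    if qps > 10_000:
        recs.append("horizontal scaling + load balancer")
    if read_write_ratio > 10:
        recs.append("caching (Redis) + read replicas")
    if data_tb > 1:
        recs.append("database sharding")
    if p99_latency_ms > 200:
        recs.append("async processing + message queues")
    if global_users:
        recs.append("CDN + multi-region deployment")
    if variable_load:
        recs.append("auto-scaling + connection pooling")
    return recs
```

In practice several rules usually fire at once; the strategies compose rather than compete.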
  Client
    │
    ▼
┌──────────┐
│    LB    │
└──┬──┬──┬─┘
   │  │  │
   ▼  ▼  ▼
  S1  S2  S3   ← Stateless servers
   │  │  │
   └──┴──┘
      │
      ▼
   Database

How It Works

  1. Load balancer distributes requests across N identical servers
  2. Servers must be stateless (no local session data)
  3. Auto-scaling adjusts N based on CPU/memory/QPS metrics
  4. New servers register with the load balancer automatically
  5. Health checks remove unhealthy servers from the pool
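Steps 1, 4, and 5 can be sketched with a minimal balancer that routes only across the healthy subset of a registered pool (a toy model, not a production LB):

```python
class LoadBalancer:
    """Minimal load balancer: registration, health checks, request spreading."""

    def __init__(self, servers):
        self.servers = list(servers)       # step 4: servers registered with the LB
        self.healthy = set(self.servers)

    def mark_unhealthy(self, server):
        self.healthy.discard(server)       # step 5: failed health check -> out of pool

    def mark_healthy(self, server):
        if server in self.servers:
            self.healthy.add(server)       # recovered server rejoins the pool

    def route(self, request_id):
        # Step 1: spread requests across identical servers. Because the
        # servers are stateless (step 2), any healthy one can take any request.
        pool = [s for s in self.servers if s in self.healthy]
        if not pool:
            raise RuntimeError("no healthy servers")
        return pool[request_id % len(pool)]
```

Real balancers use smarter policies (least connections, latency-aware), but the health-checked pool is the core idea.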

When to Use

  • Request rate exceeds what a single server can handle
  • You need high availability (no single point of failure)
  • Traffic is unpredictable and needs elastic scaling
  • During capacity planning when vertical scaling is maxed out

Pitfalls

  • ! Session data can't live on individual servers
  • ! Need distributed caching for shared state
  • ! Database becomes the bottleneck if it isn't scaled as well
  • ! More servers = more complexity in deployment and monitoring
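The first two pitfalls are usually solved together by externalizing session state. A minimal sketch, with a plain dict standing in for a shared store such as Redis:

```python
# A dict stands in for a shared store like Redis; in production this is a
# network call, so every app server sees the same session data.
shared_sessions = {}

def handle_login(server_id, user, token):
    # Wrong: keeping the session in server-local memory breaks as soon as
    # the load balancer sends the user's next request to a different server.
    # Right: write to the shared store so any server can resume the session.
    shared_sessions[token] = {"user": user, "logged_in_via": server_id}

def handle_request(server_id, token):
    # Any server, including one that never saw the login, can serve this.
    session = shared_sessions.get(token)
    return session["user"] if session else None
```

This is what "stateless servers" means in practice: the servers still handle stateful workflows, but the state lives elsewhere.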

Real World

  • Netflix: thousands of EC2 instances behind ELBs
  • Uber: auto-scales based on real-time demand
  • This is the default pattern for cloud-native applications generally

Say This in Interview

"We'll add stateless app servers behind a load balancer and auto-scale based on CPU utilization, targeting 60-70%."
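The utilization target translates directly into replica counts via the proportional formula Kubernetes' HPA uses: desired = ceil(current × currentUtilization / target). A sketch, with 65% as the midpoint of the 60-70% target:

```python
import math

def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=65):
    """Proportional auto-scaling: size the pool so average CPU lands near target."""
    return max(1, math.ceil(current_replicas * current_cpu_pct / target_cpu_pct))
```

For example, 4 servers running at 90% CPU against a 65% target scale out to ceil(4 × 90 / 65) = 6 replicas; the same formula scales back in when utilization drops.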