Scalability Playbook
10 essential scaling strategies with diagrams, trade-offs, and interview talking points. Know when and how to apply each.
Quick Decision Guide
QPS > 10K → Horizontal scaling + Load balancer
R:W > 10:1 → Caching (Redis) + Read replicas
Data > 1TB → Database sharding
Latency > 200ms → Async processing + Message queues
Global users → CDN + Multi-region deployment
Variable load → Auto-scaling + Connection pooling
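The decision guide above can be sketched as a simple rule function. The thresholds (10K QPS, 10:1 read/write ratio, 1 TB, 200 ms) come straight from the table; the parameter names and returned strategy strings are illustrative, not a real API.

```python
def scaling_strategies(qps, read_write_ratio, data_tb, p99_latency_ms,
                       global_users=False, variable_load=False):
    """Map observed system metrics to the scaling strategies in the guide."""
    strategies = []
    if qps > 10_000:
        strategies.append("horizontal scaling + load balancer")
    if read_write_ratio > 10:
        strategies.append("caching (Redis) + read replicas")
    if data_tb > 1:
        strategies.append("database sharding")
    if p99_latency_ms > 200:
        strategies.append("async processing + message queues")
    if global_users:
        strategies.append("CDN + multi-region deployment")
    if variable_load:
        strategies.append("auto-scaling + connection pooling")
    return strategies

# A service doing 15K QPS with a 20:1 read/write ratio matches two rules.
print(scaling_strategies(qps=15_000, read_write_ratio=20,
                         data_tb=0.5, p99_latency_ms=120))
```

In practice several conditions usually fire at once, which is why real systems layer these strategies rather than picking one.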
Horizontal Scaling

Client
│
▼
┌──────────┐
│ LB │
└──┬──┬──┬─┘
│ │ │
▼ ▼ ▼
S1 S2 S3 ← Stateless servers
│ │ │
└──┴──┘
│
▼
Database

How It Works
- Load balancer distributes requests across N identical servers
- Servers must be stateless (no local session data)
- Auto-scaling adjusts N based on CPU/memory/QPS metrics
- New servers register with the load balancer automatically
- Health checks remove unhealthy servers from the pool
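The steps above can be sketched as a minimal load balancer: servers register into a pool, health checks prune the unhealthy ones, and requests are distributed across whatever remains. The `LoadBalancer` class and string server names are stand-ins, not a real proxy's API.

```python
class LoadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)      # registered pool
        self.healthy = set(self.servers)  # currently passing health checks

    def register(self, server):
        """New servers register with the LB automatically when they boot."""
        self.servers.append(server)
        self.healthy.add(server)

    def health_check(self, is_alive):
        """Remove servers that fail the check; re-add ones that recover."""
        self.healthy = {s for s in self.servers if is_alive(s)}

    def route(self, request_id):
        """Distribute requests across healthy servers (request-id modulo)."""
        pool = sorted(self.healthy)
        if not pool:
            raise RuntimeError("no healthy servers in pool")
        return pool[request_id % len(pool)]

lb = LoadBalancer(["s1", "s2", "s3"])
lb.health_check(lambda s: s != "s2")    # s2 fails its health check
print([lb.route(i) for i in range(4)])  # traffic flows only to s1 and s3
```

Because the servers are stateless, it does not matter which one a given request lands on; that is exactly what lets the pool shrink and grow freely.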
When to Use
- ✓ Request rate exceeds what a single server can handle
- ✓ You need high availability (no single point of failure)
- ✓ Traffic is unpredictable and needs elastic scaling
- ✓ During capacity planning when vertical scaling is maxed out
Pitfalls
- ! Session data can't live on individual servers
- ! Need distributed caching for shared state
- ! The database becomes the bottleneck if it isn't scaled as well
- ! More servers = more complexity in deployment and monitoring
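The first two pitfalls share one fix: move session state into a shared store so any server can handle any request. Below is a sketch with an in-memory `SessionStore` standing in for Redis (in production you would use a Redis client and let the server expire keys via TTL); the class and session IDs are illustrative.

```python
import time

class SessionStore:
    """In-memory stand-in for a shared session store such as Redis."""

    def __init__(self):
        self._data = {}

    def set(self, session_id, value, ttl_seconds=3600):
        # Store the value with an expiry time, mimicking a Redis TTL.
        self._data[session_id] = (value, time.time() + ttl_seconds)

    def get(self, session_id):
        item = self._data.get(session_id)
        if item is None:
            return None
        value, expires_at = item
        if time.time() > expires_at:    # expired session: evict and miss
            del self._data[session_id]
            return None
        return value

store = SessionStore()                  # shared by every app server
store.set("sess-42", {"user": "ada"})   # server S1 writes the session
print(store.get("sess-42"))             # server S3 reads it on the next request
```

With sessions externalized, any server can be killed or added at any time without logging users out, which is what makes auto-scaling safe.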
Real World
- → Netflix: thousands of EC2 instances behind ELBs
- → Uber: auto-scales based on real-time demand
- → All major cloud-native applications use this
Say This in Interview
"We'll add stateless app servers behind a load balancer and auto-scale based on CPU utilization, targeting 60-70%."