
Database Sharding

Horizontally partitions data across multiple database instances to handle large datasets and high throughput.

**Sharding** splits a database into smaller pieces (shards), each stored on a separate server.

**Sharding strategies:**

  • **Range-based**: Shard by value ranges (A-M on shard 1, N-Z on shard 2). Range queries stay on one shard, but popular ranges become hotspots.
  • **Hash-based**: Hash the shard key and take it modulo the number of shards. Even distribution, but range queries must hit every shard.
  • **Directory-based**: A lookup table maps keys to shards. Flexible, but adds lookup overhead on every request.
  • **Geographic**: Shard by user region. Good for locality, but shard sizes can be uneven.

**Choosing a shard key:**

  • Should distribute data evenly
  • Should minimize cross-shard queries
  • Should align with access patterns (most queries hit only one shard)

**Challenges:**

  • Cross-shard joins (expensive or impossible)
  • Rebalancing data when adding or removing shards
  • Distributed transactions
  • Auto-increment IDs need a distributed ID generator
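The hash-based strategy above can be sketched in a few lines (the shard count and key names are hypothetical). Note that the hash must be stable across servers; Python's built-in `hash()` is salted per process and would route the same key differently on different machines.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key: str) -> int:
    """Hash-based routing: hash the shard key, mod by the shard count.

    MD5 is used here only as a cheap, stable hash -- not for security.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Keys spread roughly evenly regardless of their lexical order ...
counts = [0] * NUM_SHARDS
for i in range(1000):
    counts[shard_for(f"user_{i}")] += 1

# ... but a range query ("all users A-M") has no single home shard:
# it must be scattered to every shard and the results merged.
```

Routing is deterministic, so any application server computes the same shard for the same key with no coordination.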

Common Use Cases

  • User data partitioned by user_id
  • Multi-tenant SaaS (shard by tenant)
  • Time-series data (shard by time range)
  • Geographic data (shard by region)
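The time-series use case above is a natural fit for range-based sharding: each shard owns a contiguous time range, so a window query only touches the shards whose ranges overlap it. A minimal sketch, assuming hypothetical monthly shard boundaries for 2024:

```python
from bisect import bisect_right
from datetime import date

# Hypothetical boundaries: shard i holds rows with
# timestamp >= boundaries[i] and timestamp < boundaries[i + 1].
boundaries = [date(2024, m, 1) for m in range(1, 13)]


def shard_for_timestamp(ts: date) -> int:
    """Range-based routing: find the shard whose range contains ts."""
    return bisect_right(boundaries, ts) - 1


def shards_for_window(start: date, end: date) -> list[int]:
    """A time-window query only needs the shards overlapping [start, end]."""
    return list(range(shard_for_timestamp(start), shard_for_timestamp(end) + 1))
```

Unlike hash-based routing, recent data concentrates on the newest shard, so write hotspots are the price paid for cheap range scans.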

Advantages

  • Handles datasets larger than a single server's capacity
  • Increases read/write throughput roughly linearly with shard count
  • Smaller per-shard indexes (faster queries)
  • Isolates failures to individual shards

Disadvantages

  • Cross-shard queries are expensive
  • Application complexity increases significantly
  • Rebalancing shards is operationally difficult
  • Distributed transactions add latency and complexity
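The cross-shard cost above can be illustrated with a scatter-gather sketch (in-memory dicts stand in for separate database servers; all names are hypothetical):

```python
# Hypothetical "shards": in practice, separate DB servers over the network.
shards = [
    {"alice": 120, "carol": 75},
    {"bob": 200},
    {"dave": 50, "erin": 310},
]


def top_spenders(limit: int) -> list[tuple[str, int]]:
    """Scatter-gather: no single shard can answer a global top-N query.

    The application queries every shard (paying one round-trip per shard
    and waiting on the slowest), then merges and re-sorts the partial
    results itself -- work an unsharded database would do internally.
    """
    partials = []
    for shard in shards:  # scatter
        local_top = sorted(shard.items(), key=lambda kv: kv[1], reverse=True)[:limit]
        partials.extend(local_top)
    return sorted(partials, key=lambda kv: kv[1], reverse=True)[:limit]  # gather
```

This is why a good shard key matters: queries that include the key are routed to one shard, while queries that omit it degrade into fan-outs like this one.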
