Data Partitioning (Sharding)
Splitting a large dataset across multiple machines. Each partition (shard) holds a subset of the data, enabling horizontal scaling beyond a single machine's capacity.
**Data partitioning** distributes rows/documents across multiple database instances to scale writes and storage.
**Partitioning strategies:**
1. **Range-based**: Partition by key range (A-M on shard 1, N-Z on shard 2)
- Good for range queries
- Risk: hot spots if data isn't uniformly distributed
2. **Hash-based**: Apply hash function to partition key
- Uniform distribution
- Loses efficient range queries (adjacent keys scatter across shards)
- Use consistent hashing to minimize rebalancing
3. **Directory-based**: Lookup table maps keys to shards
- Most flexible
- Lookup table is a single point of failure and must itself be replicated
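The hash-based strategy, and the consistent hashing mentioned above, can be sketched in a few lines of Python. This is illustrative only: the class name, shard names, and virtual-node count are assumptions, and md5 stands in for whatever stable hash a real system would use.

```python
import hashlib
from bisect import bisect


class ConsistentHashRing:
    """Minimal consistent-hash ring (a sketch, not production code).

    Each shard is placed on the ring at many virtual-node positions;
    a key belongs to the first shard point clockwise from its hash.
    Removing a shard only remaps the keys that lived on that shard."""

    def __init__(self, shards, vnodes=64):
        self._points = []                       # sorted (hash, shard) pairs
        for shard in shards:
            for i in range(vnodes):             # virtual nodes even out load
                self._points.append((self._hash(f"{shard}#{i}"), shard))
        self._points.sort()

    @staticmethod
    def _hash(value: str) -> int:
        # md5 is stable across processes, unlike Python's randomized hash()
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def remove(self, shard: str) -> None:
        self._points = [p for p in self._points if p[1] != shard]

    def shard_for(self, key: str) -> str:
        # first ring point at or after the key's hash, wrapping around
        idx = bisect(self._points, (self._hash(key),)) % len(self._points)
        return self._points[idx][1]
```

With a plain `hash(key) % N` scheme, changing `N` remaps almost every key; with the ring, removing a shard moves only the keys that were on it, which is what keeps rebalancing cheap.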
**Choosing a partition key:**
- High cardinality (many unique values)
- Even distribution (no hot spots)
- Matches query patterns (queries should hit 1-2 shards, not all)
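A candidate key can be sanity-checked against the "even distribution" criterion by hashing a sample of real keys and measuring the spread. `candidate_shard` and `skew` are hypothetical helpers for illustration; the shard count is an assumption.

```python
import hashlib
from collections import Counter


def candidate_shard(key: str, num_shards: int = 4) -> int:
    """Stable hash of a candidate partition key (md5, not Python's
    randomized built-in hash)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards


def skew(sample_keys, num_shards: int = 4) -> float:
    """Busiest shard's load relative to a perfectly even split:
    1.0 is ideal, large values signal a hot spot."""
    counts = Counter(candidate_shard(k, num_shards) for k in sample_keys)
    return max(counts.values()) / (len(sample_keys) / num_shards)
```

A high-cardinality key such as a user id stays near 1.0; a low-cardinality key such as a country code concentrates most traffic on one or two shards, which this metric makes visible before the schema is locked in.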
**Challenges:**
- **Cross-shard queries**: Joins across shards are expensive (scatter-gather)
- **Rebalancing**: Adding/removing shards requires data migration
- **Transactions**: ACID across shards requires two-phase commit (2PC) or the Saga pattern
- **Auto-increment IDs**: Need distributed ID generation (Snowflake, ULID)
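The distributed-ID challenge is commonly solved with a Snowflake-style generator, which a single class can sketch. The bit layout follows the original Snowflake scheme (41-bit millisecond timestamp, 10-bit machine id, 12-bit sequence); the custom epoch value is an assumption, and clock rollback is deliberately not handled here.

```python
import threading
import time


class SnowflakeId:
    """Sketch of a Snowflake-style 64-bit ID generator.

    Layout: 41 bits of milliseconds since a custom epoch, 10 bits of
    machine id, 12 bits of per-millisecond sequence. IDs are unique per
    machine and roughly time-ordered, with no cross-shard coordination.
    (A real implementation must also handle the clock going backwards.)"""

    EPOCH_MS = 1_577_836_800_000  # assumed custom epoch: 2020-01-01 UTC

    def __init__(self, machine_id: int):
        if not 0 <= machine_id < 1024:
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now == self._last_ms:
                self._seq = (self._seq + 1) & 0xFFF
                if self._seq == 0:          # 4096 ids this ms: spin to next ms
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._seq = 0
            self._last_ms = now
            return ((now - self.EPOCH_MS) << 22) | (self._machine_id << 12) | self._seq
```

Because the timestamp occupies the high bits, sorting by ID approximates sorting by creation time, which database auto-increment can no longer provide once rows live on different shards.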
Common Use Cases
- Scaling databases beyond single-machine limits
- Geographic data distribution (shard by region)
- Multi-tenant SaaS (shard by tenant_id)
- Time-series data (shard by time range)
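The time-series use case pairs naturally with range-based partitioning: route each event to a per-month shard, so a time-bounded query touches only the shards covering that window. The `events_YYYY_MM` naming scheme and monthly granularity are illustrative assumptions.

```python
import datetime as dt


def shard_for_event(ts: dt.datetime) -> str:
    """Range-based (time) routing: one shard per calendar month."""
    return f"events_{ts.year}_{ts.month:02d}"


def shards_for_range(start: dt.date, end: dt.date) -> list[str]:
    """A range query only needs the shards covering [start, end]."""
    names = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        names.append(f"events_{y}_{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return names
```

This also makes retention cheap: expiring old data is dropping an entire shard rather than running a bulk delete.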
Advantages
- Horizontal scaling for both reads and writes
- Each shard can be on separate hardware
- Failure of one shard doesn't affect others
- Can place shards near users geographically
Disadvantages
- Cross-shard queries are slow and complex
- Rebalancing shards is operationally expensive
- Application complexity increases significantly
- Distributed transactions are hard
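The first disadvantage is easiest to see in code: a query that can't be routed by partition key must be scattered to every shard and the partial results gathered and merged. This is a generic sketch; `shards` and `query` are placeholders for any per-shard data source and per-shard query function.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(shards, query):
    """Fan a query out to every shard in parallel and merge the partial
    results. Total latency is bounded by the slowest shard, which is
    why queries that can't be routed to a single shard get expensive."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(query, shards)              # scatter
    return [row for part in partials for row in part]   # gather + merge
```

By contrast, a query filtered on the partition key skips the fan-out entirely and runs against exactly one shard, which is why matching the key to query patterns matters so much.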