
Data Partitioning (Sharding)

Splitting a large dataset across multiple machines. Each partition (shard) holds a subset of the data, enabling horizontal scaling beyond a single machine's capacity.

**Data partitioning** distributes rows/documents across multiple database instances to scale writes and storage.

**Partitioning strategies:**

1. **Range-based**: partition by key range (A-M on shard 1, N-Z on shard 2)
   - Good for range queries
   - Risk: hot spots if data isn't uniformly distributed
2. **Hash-based**: apply a hash function to the partition key
   - Uniform distribution
   - Loses range-query ability
   - Use consistent hashing to minimize rebalancing
3. **Directory-based**: a lookup table maps keys to shards
   - Most flexible
   - The lookup table is a single point of failure

**Choosing a partition key:**

- High cardinality (many unique values)
- Even distribution (no hot spots)
- Matches query patterns (queries should hit 1-2 shards, not all of them)

**Challenges:**

- **Cross-shard queries**: joins across shards are expensive (scatter-gather)
- **Rebalancing**: adding/removing shards requires data migration
- **Transactions**: ACID across shards needs 2PC or a Saga
- **Auto-increment IDs**: need distributed ID generation (Snowflake, ULID)
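To make the hash-based strategy concrete, here is a minimal consistent-hashing router sketch in Python (shard names, the MD5-based hash, and the virtual-node count are all illustrative assumptions, not a specific library's API):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    """Consistent-hashing router: each shard owns many points ("virtual
    nodes") on a ring, so adding or removing a shard moves only ~1/N of
    the keys instead of nearly all of them (as `hash(key) % N` would)."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
keys = [f"user:{n}" for n in range(1000)]
before = {k: ring.shard_for(k) for k in keys}

# Add a fourth shard: only roughly a quarter of the keys move.
ring2 = HashRing(["shard-1", "shard-2", "shard-3", "shard-4"])
moved = sum(before[k] != ring2.shard_for(k) for k in keys)
```

With naive `hash(key) % N` routing, growing from 3 to 4 shards would remap almost every key; the ring limits the migration to the keys that now fall into the new shard's arcs.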

Common Use Cases

  • Scaling databases beyond single-machine limits
  • Geographic data distribution (shard by region)
  • Multi-tenant SaaS (shard by tenant_id)
  • Time-series data (shard by time range)
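These use cases all run into the auto-increment problem noted above: a per-shard sequence can't produce globally unique IDs. A Snowflake-style generator packs a timestamp, a worker/shard ID, and a per-millisecond sequence into one 64-bit integer. A minimal sketch, assuming Python and an arbitrary custom epoch (real deployments also handle clocks going backwards, which this sketch does not):

```python
import threading
import time

class SnowflakeId:
    """Sketch of a Snowflake-style 64-bit ID:
    41 bits of ms since a custom epoch | 10-bit worker id | 12-bit sequence.
    IDs are unique per worker and roughly time-ordered, with no
    cross-shard coordination required."""

    EPOCH_MS = 1_600_000_000_000  # illustrative custom epoch

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
                if self.seq == 0:  # sequence exhausted: spin to the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.seq

gen = SnowflakeId(worker_id=7)
a, b = gen.next_id(), gen.next_id()
```

Because the timestamp occupies the high bits, these IDs sort roughly by creation time, which keeps B-tree inserts sequential on each shard.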

Advantages

  • Horizontal scaling for both reads and writes
  • Each shard can be on separate hardware
  • Failure of one shard doesn't affect others
  • Can place shards near users geographically

Disadvantages

  • Cross-shard queries are slow and complex
  • Rebalancing shards is operationally expensive
  • Application complexity increases significantly
  • Distributed transactions are hard
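The scatter-gather cost behind the first disadvantage can be sketched directly. This toy example (the in-memory `SHARDS` dict and row shape are hypothetical stand-ins for real per-shard connections) queries every shard in parallel, then merges and re-sorts the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for three separate database instances (illustrative data).
SHARDS = {
    "shard-1": [{"user": "ana", "total": 30}],
    "shard-2": [{"user": "bo", "total": 75}],
    "shard-3": [{"user": "cy", "total": 12}],
}

def query_shard(name, predicate):
    # In practice: a network round-trip to one shard's database.
    return [row for row in SHARDS[name] if predicate(row)]

def scatter_gather(predicate):
    # Scatter: run the same query on every shard concurrently.
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda n: query_shard(n, predicate), SHARDS)
    # Gather: merge the partial results and re-sort globally, since
    # each shard only returns its own rows in its own order.
    merged = [row for part in parts for row in part]
    return sorted(merged, key=lambda r: -r["total"])

scatter_gather(lambda r: r["total"] > 20)
# → [{'user': 'bo', 'total': 75}, {'user': 'ana', 'total': 30}]
```

Every query that can't be routed by the partition key pays this fan-out: latency is set by the slowest shard, and global sorts, limits, and aggregates must be redone at the coordinator.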