Why Shard a Database?
- When you outgrow a database.
- You need more writes/second.
- Storage is approaching a limit
- Database in general is straining.
- Sharding is useful to scale horizontally, meaning more databases rather than bigger databases.
What is Sharding
- Splitting the data across multiple databases.
- The shards together form the full dataset.
- As you grow, you add more shards, not larger databases.
How to Shard Data
This raises questions:
- What to shard by (shard key - how to group data)
- How to distribute data ()
Choosing Shard Keys
Good shard keys
- High cardinality (field with lots of unique values)
- even distribution (values should naturally spread, so each shard has an even amount of data)
- aligns with queries (1 query hits one shard)
Good examples
- Split by user ID (shard 1 has users 0, 10m; shard 2 has users 10m, 20m; etc)
- Split by order ID (shard 1 has orders 0, 10m; shard 2 has orders 10m, 20m; etc)
- Think about what users are requesting frequently.
What to Distribute By
- Range-based (user IDs, order IDs)
- Hash-based distribution.
- Take a uniqueID, hash it (to randomise) and run a modulus operation on how many databases you have. The problem is, if you add a new shard.
- Consistent-hashing.
- The coin spinning analogy.
- Eliminates the giant reshuffling problem.
- Directory-based Sharding