MongoDB Sharding Overview
Scaling
Before talking about sharding, let’s first look at scaling.
In general, when a system grows larger, scaling becomes necessary. Scaling methods are broadly divided into two types: vertical scaling and horizontal scaling.
Vertical scaling
Vertical scaling is a method of replacing CPU, RAM, or storage with higher-performance components. Scaling is simple, but there are limits to CPU and memory in the first place, so there is a practical performance limit.
Horizontal scaling
Horizontal scaling divides a system by data set and processes data in parallel from multiple servers. Overall performance can be better than vertical scaling, but the more it scales, the more complex it becomes, making management difficult.
What is sharding?
Sharding is a horizontal scaling structure that distributes data across multiple servers.
Sharding has three major benefits: read/write processing becomes faster and performance is easier to improve, storage is easier to expand, and availability is high. Each is described below. MongoDB supports sharding, or horizontal scaling, as a standard feature.
Reads and writes
When data is distributed and stored in a shard cluster, read/write processing, or file I/O, can be distributed. File read/write operations are slow, so distributing them enables faster reads and writes. In addition, when read/write load increases, the system can horizontally scale by adding servers.
Storage
Fragmented data is distributed and stored in the shard cluster. Even when the amount of data increases, the system can horizontally scale by adding servers.
High availability
Shard clusters allow partial reads and writes. If a shard cannot be read from or written to, data can also be retrieved from a shard server that can perform the work.
Shard cluster
MongoDB sharding consists of three components: shards, routers, and config servers. Their relationships and explanations are as follows.
Shard
Shard servers store fragmented data, or chunks, from split collections. A shard server can be configured as a replica set.
Router
mongos is a query router that provides an interface from applications to the shard cluster. When accessing a shard cluster, access always goes through mongos. Even collections that are not sharded must be accessed through mongos.
Config server
Config servers store metadata about shard cluster settings. Information such as which shard server holds the data is also stored on these servers. Since MongoDB 3.4, config servers are configured as replica sets.
Shard key
A shard key is information used as a key when distributing data and storing it on shard servers. MongoDB uses the shard key to distribute documents within a collection.
The following figure is a simple example distributed by an x shard key.
… figure …
The important points about shard keys are the following three.
- They consist of immutable fields or groups of fields that exist in every document.
- Once a shard key is chosen and distribution is performed, the shard key cannot be changed again.
- A sharded collection requires an index that includes the shard key.
Needless to say, shard key selection greatly affects performance, efficiency, and scalability. If the shard key is chosen poorly, data can become biased toward specific shard servers, as described above, causing inefficient behavior.
Chunks
As mentioned above, MongoDB distributes data storage by shard key value, but it groups data into units called chunks and distributes those chunks across shard servers. MongoDB automatically moves chunks so that they are balanced within the shard cluster.
The image of splitting data into chunks and distributing them across shards is as follows.
… figure …
Operation constraints on sharded data
There are several restrictions on operations for sharded data.
- The
groupcommand cannot be used. - Use MapReduce instead of aggregate.
updateOneordeleteOnemust specify_id.- An error occurs if the shard key
_idis not included. - A sharded collection requires a unique index.
Sharded and unsharded collections
MongoDB can mix sharded collections and unsharded collections. Sharded collections are distributed and stored in the shard cluster, while unsharded collections are stored on the primary shard.
Data distribution methods
There are two distribution methods for sharding: hashed sharding and ranged sharding.
Hashed sharding
Hashed sharding distributes data using the hash value of the shard key. With hashed sharding, chunks tend to become scattered even when shard key values are close. In other words, it is a distribution method that makes distribution easier when the shard key changes monotonically. On the other hand, because data is easy to distribute, broadcast operations may increase depending on the data structure and retrieval method.
Ranged sharding
Ranged sharding distributes data according to the shard key range. Unlike hashed sharding, when shard key values are close, they are more likely to exist in the same chunk. If the shard key is chosen poorly, load can become concentrated on specific servers even though sharding is being used as a distribution technique.