MongoDB Sharding Overview

Scaling

Before talking about sharding, let’s first look at scaling.

In general, when a system grows larger, scaling becomes necessary. Scaling methods are broadly divided into two types: vertical scaling and horizontal scaling.

Vertical scaling

Vertical scaling is a method of replacing CPU, RAM, or storage with higher-performance components. Scaling is simple, but there are limits to CPU and memory in the first place, so there is a practical performance limit.

Horizontal scaling

Horizontal scaling divides a system by data set and processes data in parallel from multiple servers. Overall performance can be better than vertical scaling, but the more it scales, the more complex it becomes, making management difficult.

What is sharding?

Sharding is a horizontal scaling structure that distributes data across multiple servers.

Sharding has three major benefits: read/write processing becomes faster and performance is easier to improve, storage is easier to expand, and availability is high. Each is described below. MongoDB supports sharding, or horizontal scaling, as a standard feature.

Reads and writes

When data is distributed and stored in a shard cluster, read/write processing, or file I/O, can be distributed. File read/write operations are slow, so distributing them enables faster reads and writes. In addition, when read/write load increases, the system can horizontally scale by adding servers.

Storage

Fragmented data is distributed and stored in the shard cluster. Even when the amount of data increases, the system can horizontally scale by adding servers.

High availability

Shard clusters allow partial reads and writes. If a shard cannot be read from or written to, data can also be retrieved from a shard server that can perform the work.

Shard cluster

MongoDB sharding consists of three components: shards, routers, and config servers. Their relationships and explanations are as follows.

Shard

Shard servers store fragmented data, or chunks, from split collections. A shard server can be configured as a replica set.

Router

mongos is a query router that provides an interface from applications to the shard cluster. When accessing a shard cluster, access always goes through mongos. Even collections that are not sharded must be accessed through mongos.

Config server

Config servers store metadata about shard cluster settings. Information such as which shard server holds the data is also stored on these servers. Since MongoDB 3.4, config servers are configured as replica sets.

Shard key

A shard key is information used as a key when distributing data and storing it on shard servers. MongoDB uses the shard key to distribute documents within a collection.

The following figure is a simple example distributed by an x shard key.

… figure …

The important points about shard keys are the following three.

They consist of immutable fields or groups of fields that exist in every document.
Once a shard key is chosen and distribution is performed, the shard key cannot be changed again.
A sharded collection requires an index that includes the shard key.

Needless to say, shard key selection greatly affects performance, efficiency, and scalability. If the shard key is chosen poorly, data can become biased toward specific shard servers, as described above, causing inefficient behavior.

Chunks

As mentioned above, MongoDB distributes data storage by shard key value, but it groups data into units called chunks and distributes those chunks across shard servers. MongoDB automatically moves chunks so that they are balanced within the shard cluster.

The image of splitting data into chunks and distributing them across shards is as follows.

… figure …

Operation constraints on sharded data

There are several restrictions on operations for sharded data.

The group command cannot be used.
Use MapReduce instead of aggregate.
updateOne or deleteOne must specify _id.
An error occurs if the shard key _id is not included.
A sharded collection requires a unique index.

Sharded and unsharded collections

MongoDB can mix sharded collections and unsharded collections. Sharded collections are distributed and stored in the shard cluster, while unsharded collections are stored on the primary shard.

Data distribution methods

There are two distribution methods for sharding: hashed sharding and ranged sharding.

Hashed sharding

Hashed sharding distributes data using the hash value of the shard key. With hashed sharding, chunks tend to become scattered even when shard key values are close. In other words, it is a distribution method that makes distribution easier when the shard key changes monotonically. On the other hand, because data is easy to distribute, broadcast operations may increase depending on the data structure and retrieval method.

Ranged sharding

Ranged sharding distributes data according to the shard key range. Unlike hashed sharding, when shard key values are close, they are more likely to exist in the same chunk. If the shard key is chosen poorly, load can become concentrated on specific servers even though sharding is being used as a distribution technique.