TiDB Scheduling

Why Scheduling Is Needed

The previous article introduced the TiKV cluster as the distributed KV storage engine of the TiDB database. Data is managed after being replicated into Regions. Each Region has several replicas distributed across different TiKV nodes. Among these replicas, the leader handles reads and writes, and followers synchronize Raft logs sent by the leader. Now consider the following questions.

How can you ensure that several replicas of the same Region are distributed across different nodes? And what if several TiKV instances are started on one machine?
If a TiKV cluster is deployed in several locations for disaster recovery, how can you ensure that several replicas of a Raft group are not lost even if a single data center fails?
How can data from other nodes in the TiKV cluster be moved to a newly added node?
What happens when a node fails? What should the whole cluster do? If a node has a temporary failure, such as a service restart, how should it be handled? What about a long-term failure, such as disk failure or data loss?
Each Raft group must have N replicas. A Raft group may have too few replicas, for example when a node fails and a replica is lost, or too many replicas, for example when a failed node restarts and automatically rejoins the cluster. How can the replica count be configured? Reads and writes are handled by leaders, so what happens if all leaders are gathered on several nodes?
Not all Regions need to be accessed, and hotspots appear to exist in several Regions. What should be done in that case?
During load balancing, the cluster must migrate data. This data movement consumes a lot of network bandwidth, disk I/O, and CPU. Will it affect online services?

Solving each of the problems above one by one is simple, but solving them together becomes difficult. For example, whether to add a replica can be considered by checking whether the count is insufficient and looking at the internal situation of one Raft group. In reality, however, a comprehensive perspective is needed to decide where to add the replica. The entire system changes dynamically. Region splits, node joins, node failures, and hotspot access changes occur constantly. The scheduling system must also keep adjusting to maintain the best state. Without a component that can understand overall information, schedule, and configure the system, meeting these requirements is difficult. Therefore, a central node that controls and coordinates the overall system is needed. This is where the Placement Driver, or PD, module appears.

Tip

**Hotspot** A hotspot generally means a point where something is unusually high compared with its surroundings. The detailed meaning differs by field. For example, it can mean a high-crime area, a high-concentration pollution area, or a particularly popular place in a city. In common usage, a personal hotspot or portable hotspot on a mobile phone means a physical location where internet access is possible through a wireless network, usually Wi-Fi, using a router connected to an internet service provider. In distributed processing here, it can be interpreted as an area on one node with high usage and high execution frequency.

Scheduling Requirements

Let’s classify and organize the questions mentioned above. Broadly, there are two kinds.

A distributed and highly available storage system must satisfy the following requirements.

Create an appropriate number of replicas.
Distribute replicas to different machines.
Move replicas from other nodes after adding a node.
Move data from a node when it goes offline.

An excellent distributed system needs the following optimizations.

Balanced placement of leaders in the cluster.
Balanced placement of storage capacity on each node.
Balanced distribution of hotspot access.
Control of balancing speed so online services are not affected.
Node state management, such as manually switching nodes online/offline or automatically taking failed nodes offline.

If the first set of requirements is satisfied, the system supports disaster recovery with multiple replicas, dynamic scalability, node fault tolerance, and automatic disaster recovery. If the second set of requirements is satisfied, the system has good load balancing and is easy to manage.

To satisfy these requirements, it is first necessary to collect enough information, such as each node’s status, each Raft group’s information, business access, and operational statistics. Then policies must be configured so PD can establish scheduling plans that satisfy the requests above based on this information and the scheduling policy.

Basic Scheduling Operations

The basic scheduling operation is very simple: it can follow scheduling policy. This is the foundation of the whole scheduler.

The scheduler requirements above look complex, but they can be broadly divided into three operations.

Add replica
Remove replica
Move the leader role between different replicas in a Raft group

The Raft protocol happens to satisfy these requirements, and the commands AddReplica, RemoveReplica, and TransferLeader support these three basic operations.

Information Collection

Scheduling depends on collecting information from the whole cluster. Simply put, it must know the status of each TiKV node and each Region. The TiKV cluster reports two kinds of information to PD.

Each TiKV node periodically reports information about the whole node to PD.

There is a heartbeat between the TiKV store and PD. Through the heartbeat, PD checks whether each data store is active and whether a newly added data store exists. The heartbeat also includes status information about that data store. Mainly, it includes the following.

Total disk capacity
Free disk capacity
Number of Regions
Data write speed
Number of snapshots sent and received, because replicas synchronize data by snapshot
Whether it is overloaded
Label information, where labels are a hierarchical set of tags

Each Raft group leader periodically reports to PD.

The leader of each Raft group is connected to PD by heartbeat and reports the Region status as follows.

Leader status
Follower locations
Number of offline replicas
Data read/write speed

Through these two kinds of heartbeats, PD collects and judges information for the whole cluster. PD also obtains additional information through the management interface to make more accurate decisions. For example, if a data store’s heartbeat stops, PD cannot know whether it is temporary or permanent. PD can only wait for a certain time, 30 minutes by default. If there is still no heartbeat after that time, PD determines that the data store is offline and must move all Regions on that data store. However, if operations staff manually set a system offline, they must tell PD through the management interface that the data store is unavailable. In that case, PD immediately moves all Regions from the data store.

Scheduling Policy

After collecting information, PD needs several policies to draw up concrete scheduling plans.

1. The number of replicas in a Region is correct.

If PD discovers through the Region leader heartbeat that the number of replicas in a Region does not satisfy requirements, it fixes the replica count by adding or removing replicas. This can happen in the following cases.

A node fails and loses all data, so some Regions no longer have replicas.
A failed node starts working again and automatically joins the cluster. In this case, redundant replicas exist and must be removed.
An administrator changes the replication policy and maximum replica count settings.

2. Several replicas of one Raft group must not exist in the same location.

Note that this means the same location, not the same node. In general, PD can guarantee that several replicas do not exist on the same node to avoid the problem of losing many replicas when a node fails. In actual deployment, the following requirements may appear.

Several nodes are deployed on the same physical machine.
TiKV nodes are distributed across several servers. The system is assumed to remain available even if a server is shut down.
TiKV nodes are distributed across several IDCs. The system remains available even if a data center is shut down.

Basically, nodes need a common location attribute and form the smallest built-in unit. It is preferable that several Region replicas do not coexist in this unit. In this case, set labels on nodes and configure PD’s location labels to specify which labels identify locations. When deploying replicas, nodes that store several replicas do not have the same location identifier.

3. Replicas are evenly distributed across each data store.

Because the data storage capacity of each replica is fixed, balancing the number of replicas on each node makes the overall load more balanced.

4. The number of leaders is evenly distributed across each data store.

Because the Raft protocol performs reads and writes through leaders, computational load is mainly placed on leaders. Therefore, PD manages leaders by distributing them across different data stores.

5. The number of hotspots is evenly distributed across each data store.

When sending information, each data store and Region leader delivers information about current access load, such as key read/write speed. PD identifies hotspots and distributes them across multiple nodes.

6. The storage space usage rate of each data store is almost the same.

Each data store specifies a capacity parameter at startup, but this represents the storage space limit of the data store. PD considers the remaining space on nodes during scheduling.

7. Control scheduling speed so online services are not affected.

Scheduling operations consume CPU, memory, disk I/O, and network bandwidth, so they must not affect online services. PD controls the number of ongoing operations, and the default speed is conservative. If you want to accelerate scheduling, such as during service upgrade downtime or after adding new nodes, you can manually accelerate it using pd-ctl.

8. Manual support for offline nodes.

If you manually set a node offline with pd-ctl, PD removes data from the node within a specific speed control range. Then it takes the node offline.

Scheduling Implementation

Now look at the scheduling flow.

PD constantly collects information through heartbeats from data stores and leaders, and retrieves detailed cluster data. Based on this information and scheduling policies, PD creates an operation sequence. When it receives a heartbeat sent by a Region leader, it checks whether there is any operation to perform on that Region. PD returns future operations to the Region leader through the heartbeat response message and monitors the execution result in the next heartbeat. These operations are only suggestions to the Region leader, and execution is not guaranteed. Whether and when they are executed is determined by the Region leader according to its current state.

Conclusion

This article introduced what must be considered to build a distributed storage system for scheduling, and how to separate policy and implementation so policies can be extended more flexibly.

I hope these three blog posts, storage, computing, and scheduling, help you understand TiDB’s basic concepts and implementation principles.