Designing Data-Intensive Applications | Chapter 08. The Trouble with Distributed Systems

Presenters: Yoonho Hwang, Euntaek Kim

Working with distributed systems is fundamentally different from writing software that runs on a single computer.
Engineers must build systems that keep doing their job even when everything goes wrong.

Faults and partial failures

In a single-computer environment, two outcomes can usually be predicted.
In a distributed computing environment, failures happen in unpredictable ways. This is called a partial failure.

Single-computer environment

  1. Works normally.
  2. Does not work. -> It may look like “an unlucky day” sometimes, but most cases are the result of incorrectly written software.

Distributed-computer environment

  1. Software runs on multiple computers connected by a network.
  2. In a distributed system, it is normal for some parts of the system to work while other parts fail in unpredictable ways. This is called a partial failure.
  3. Partial failures are nondeterministic, so they are difficult to handle.
  4. Sometimes you cannot even know whether something succeeded.

Cloud computing and supercomputing

High-performance computing:

  • Supercomputers with thousands of CPUs.
  • Usually used for scientific computing tasks with very high computational cost, such as weather forecasting or molecular dynamics.

Cloud computing:

  • Multi-tenant data centers.
  • Commodity computers connected by IP networks such as Ethernet.

Traditional enterprise data centers are somewhere between these two extremes.

  • Jobs running on supercomputers usually store their computational state periodically as checkpoints in durable storage.
  • When a failure occurs, computation restarts from the last checkpoint. This is closer to a single-node computer.

To make a distributed system work, you must accept the possibility of partial failures and add fault-tolerance mechanisms to the software.
You must build a reliable system from unreliable components. Perfect reliability does not exist.
Fault handling must be part of software design, and you must know how the software behaves when faults occur.

Unreliable networks

The distributed systems mainly discussed in the book are shared-nothing systems: many machines connected by a network.
Shared-nothing systems are not the only way to build systems, but for several reasons they have become the main way to build internet services.

They do not require special hardware, so they are relatively inexpensive, can use commodity cloud services, and can achieve high reliability by placing redundancy across geographically distributed data centers.
Each node cannot access another machine’s memory or disk. The network is the only communication mechanism.
In this kind of network, a node can send a message, or packet, to another node, but the network does not guarantee that the message will arrive.

There are many reasons why delivery is not guaranteed:

  1. The request is lost.
  2. The request waits in a queue and is sent later.
  3. The remote node has failed.
  4. The remote node temporarily stopped responding but starts responding again later.
  5. The response is lost on the network.
  6. The response is delayed and sent later.

The sender cannot even distinguish whether a packet was delivered.
Therefore the receiver sends a response message, but the response message can also be lost or delayed.
A common way to handle this problem is a timeout.

Network faults in practice

There is still no complete way to build a perfectly reliable network.
Even in a company’s controlled environment, network problems occur surprisingly often.
Network faults also happen frequently in public cloud services.
Even if network faults are rare, software must be designed with the knowledge that faults can happen and must be handled.

If error handling is not defined and tested, bad things can happen unpredictably.
It is not always necessary to tolerate network faults.
If the network is reliable enough, it may be reasonable to simply show an error message to users when a problem occurs.

However, you must know how the software reacts to network problems and ensure that the system can recover from them.

Fault detection

Many systems must be able to automatically detect faulty nodes.

  • A load balancer must stop sending requests to a dead node.
  • In a distributed database that uses single-leader replication, if the leader fails, one of the followers must be promoted to leader.

Timeouts and unbounded delays

If timeouts are the only reliable way to detect faults, how long should a timeout be? Unfortunately, there is no simple answer.

  • If the timeout is long, it takes longer before a node is declared dead.
  • If the timeout is short, faults are detected quickly, but there is a high risk of declaring a node dead even when its response is only temporarily slow.

When a node is declared dead, its responsibilities must be transferred to another node, adding load to other nodes and the network.
In particular, the node might not actually be dead; it may only be slow because it is overloaded. Moving that load to other nodes can cause cascading failures.

Network congestion and queueing
Variation in packet delay on a network is often caused by queueing.

  1. When several nodes try to send packets to the same destination at the same time, the network switch puts packets into a queue and forwards them one at a time over the network link.
  2. If the network link is busy, a packet may wait until it gets a slot. This is called network congestion.
  3. TCP performs flow control and also limits sending through congestion avoidance and backpressure so the network is not overloaded.
  4. If TCP does not receive an acknowledgment within the timeout, it assumes the packet was lost and retransmits it.

Latency-sensitive applications use UDP instead of TCP.

UDP does not perform flow control and does not retransmit lost packets, removing some causes of high network latency variation.

UDP is useful when delayed data has no value.

There is a trade-off between reliability and latency variability.

Synchronous networks vs. asynchronous networks
Telephone networks have extremely high reliability. End-to-end latency must be low, and there must be enough bandwidth to transmit voice samples. During a phone call, a circuit is established.
A fixed and guaranteed amount of bandwidth is allocated for that call across the whole path between the two people. ISDN networks run at a fixed rate of 4,000 frames per second.
This kind of network is synchronous. Even if data passes through multiple routers, it does not suffer from queueing delays.

Can we simply make network delays predictable?
If data center networks and the internet were circuit-switched networks, they could guarantee a maximum round-trip time. Ethernet and IP are packet-switched protocols affected by queueing, so they have unbounded network delays. These protocols do not have the concept of a circuit.

Why do data center networks and the internet use packet switching? Because it is optimized for bursty traffic. If bursty data transfers used circuits, network capacity would be wasted and transfers would be unnecessarily slow. TCP dynamically adjusts the data transfer rate to match available network capacity.

There is no correct timeout value. It must be determined through experimentation.

Unreliable clocks

  1. Has this request timed out?
  2. What is the 99th percentile response time of this service?
  3. How many queries per second did this service process on average over the last five minutes?
  4. How long did a user spend on our site?
  5. When was this article published?
  6. On what day and time should a reminder email be sent?
  7. When does this cache entry expire?
  8. What is the timestamp of this error message in the log file?

Monotonic clocks vs. time-of-day clocks

Time-of-day clock: Returns a date and time.

Synchronized using NTP (Network Time Protocol), but not highly reliable.

Firewall. Network latency, up to about one second. Leap seconds occur, so a minute may be 59 or 61 seconds long. NTP servers often smear this gradually over a day. Mobile and embedded devices are unreliable.

Monotonic clock: Suitable for measuring durations.

Comparing it with another monotonic clock is meaningless.

Relying on synchronized clocks: problems: The root cause is that you may not notice that a clock is wrong.

Rather than obvious errors, subtle data loss can occur. Nodes whose clocks drift apart must be declared dead and removed. Last write wins (LWW).

Leaderless databases: multi-leader replication, Cassandra, Riak.

Database writes mysteriously disappear. It is important to know this fact. Synchronized clocks for global snapshots: Google case. Spanner and Google’s TrueTime API.

Explicitly reports confidence intervals. If the intervals containing the earliest and latest timestamps do not overlap, then B definitely happened after A. Spanner delays execution of B by the confidence interval to reflect causality. Google places GPS and atomic clocks in each data center, synchronizing within about 7 milliseconds.

Process pauses: Assume a database with one leader per partition, where only the leader accepts writes.

Question: how can the leader node know that it is still the leader?

One method is for the leader to obtain a lease from other nodes. Only one leader can exist at a particular point in time.

Scenario: To remain leader, it must renew the lease periodically. If it fails, it stops renewing the lease, and when it expires another node takes over the leader role.

while (true) {
	request = getIncomingRequest();

	// Always ensure that at least 10 seconds remain on the lease.
  if (lease.expiryTimeMillis - System.currentTimeMillis() < 10_000) {
    lease = lease.renew();
  }

  if (lease.isValid()) {
    process(request);
  }
}

What is the problem?

It relies on synchronized clocks. It assumes only a very short time passes between checking the time and processing the request. What if the program pauses in between? Another node takes over as leader. Nobody tells this thread that it was paused, so it continues processing work. This is unsafe.

Cases where a system pauses: Stop-the-world GC pause. Suspend in a virtualized environment: memory contents are saved and execution resumes later. On laptops, this can also happen during hibernation. Thread context switching. Slow disk I/O operations. On Unix, the SIGSTOP command (^Z).

Useful tools on a single machine:

Mutex. Semaphore. Atomic counter. Lock-free data structures. Blocking queue.

Response-time guarantees: Real-time operating system (RTOS): aircraft, rockets, robots, automotive industry, and so on.

Worst-case execution time is specified. Dynamic memory allocation may be prohibited. A huge amount of testing is required. This strictly limits the range of programming languages, libraries, and tools.

Knowledge, truth, and lies

The network is an unreliable boundary. Eventually, systems can suffer from partial failures, unreliable clocks, and process pauses.
This section introduces concepts that help with building reliable software.

Truth is defined by majority vote

It is difficult to determine whether a node is operating normally by looking at only one node.

Example: a service paused by a full GC is judged dead, but later comes back and continues the work it was doing.

Instead, systems rely on quorums, or voting among nodes. To reduce dependence on a specific node, decisions require a minimum number of votes from multiple nodes.

It is most common to use a majority of nodes as a quorum. The system can still operate normally with one failure out of three nodes or two failures out of five nodes.

Leaders and locks When a system must act alone:

  • To avoid split brain, only one node can be the leader of a database partition.
  • To prevent concurrent writes or corruption of a particular resource or object, only one transaction or client can acquire the lock for that resource or object.
  • Since users must be uniquely identified by username, only one user can register with a particular username.

… image …

Fencing tokens

… image …

If ZooKeeper is used as a lock service, the transaction ID zxid or node version cversion can be used as a fencing token. These have the required property because they are guaranteed to increase monotonically.

Byzantine faults

Distributed system problems become much harder if there is a risk that nodes may “lie”, meaning they may have arbitrary faults or send corrupted responses. This behavior is called a Byzantine fault, and the problem of reaching agreement in such an unreliable environment is called the Byzantine Generals Problem. If a system continues to operate correctly even when some nodes malfunction, do not follow the protocol, or malicious attackers interfere with the network, the system is said to be Byzantine fault-tolerant. Because this is generally not practical, traditional mechanisms such as authentication, access control, encryption, and firewalls are still used as the main means of protection against attackers.

Weak forms of lying

  • Corruption of network packets caused by hardware problems, operating systems, drivers, or router bugs.
  • Usually detected with TCP/UDP checksums.
  • Applications can use checksums to address the problem.
  • Public applications must carefully sanitize user input.
  • Check value ranges, limit input size for appropriate memory allocation, and perform basic sanity checks.
  • It helps NTP clients to configure several server addresses.

System models and reality

System model: a formalization of the types of faults expected to occur in a system.

Timing assumptions

  • Synchronous model: assumes that network delay, process pauses, and clock errors all have bounds.
  • Partially synchronous model: means the system behaves like a synchronous system most of the time, but sometimes exceeds the bounds on network delay, process pauses, and clock drift.
  • Asynchronous model: in this model, algorithms cannot make any timing assumptions.

Node failures

  • Crash-stop fault: if a node suddenly stops responding at some point, that node can never be used again and never returns.
  • Crash-recovery fault: assumes a node may die at some point but will probably start responding again after an unknown amount of time.
  • In the crash-recovery model, a node loses in-memory state, but it is assumed to have stable storage, meaning nonvolatile disk storage, where data remains even after death.
  • Byzantine, or arbitrary, fault: a node can do absolutely anything, including deceiving or misleading other nodes as described in the previous section.

Correctness of algorithms
Describe properties and verify that they are always satisfied.

Example: fencing token

  • Uniqueness: no two fencing token requests return the same value.
  • Monotonic sequence number: if request x returns token tx, request y returns token ty, and x completed before y started, then tx < ty.
  • Availability: a node that requests a fencing token and does not die eventually receives a response.

Safety and liveness

  • Safety
    • If a safety property is violated, you can point to the particular moment when the property was broken. For example, if uniqueness is violated, you can identify the specific operation that returned a duplicate fencing token.
    • Once a safety property has been violated, the violation cannot be undone. The state is already corrupted.
  • Liveness
    • A liveness property works in the opposite way. You may not be able to name a particular moment, for example because a node sent a request but has not yet received a response, but there is always hope that the property can be satisfied in the future.

Safety: must always be satisfied, even if all nodes or the entire network fail, the system must not return an incorrect result.

Liveness: warnings are allowed. If there is a network partition, set a limit on the duration of the partition.

Responding to reality
System models are very useful for reasoning about the correctness of distributed systems, but their limitations are clear.

  • Crash-recovery model: assumes data remains even if a node dies.
    • Disk corruption.
    • Hardware faults.
    • Data loss due to incorrect configuration.

The difference between computer science and computer engineering:

  • In real implementations, you may need to handle events that were assumed to be impossible.
    • printf(“너라서 짜증나”)
    • exit(666)

Nevertheless, abstract system models are important.

  • Complexity of real systems -> extract a manageable set of faults that can be reasoned about -> understand and systematically solve the problem.

Summary

  • Network packets can always be corrupted or delayed.
  • Node clocks cannot be trusted for time.
  • Processes can experience stop-the-world pauses during execution.

Partial failure is a distinct characteristic of distributed systems.

Detection itself is where the difficulty begins. Timeouts are used, but they cannot distinguish network faults from node faults.
It is possible to provide strict real-time response guarantees and network delay bounds, but the cost is very high and resource utilization becomes low.

The next chapter deals with solutions for coping with the problems of all distributed systems.