# Heartbeats in Distributed Systems

**By Arpit Bhayani**

---

In distributed systems, one of the fundamental challenges is knowing whether a node or service is alive and functioning properly. Unlike monolithic applications, where everything runs in a single process, distributed systems span multiple machines, networks, and data centers. The problem becomes even more glaring when the nodes are geographically separated. This is where heartbeat mechanisms come into play.

Imagine a cluster of servers working together to process millions of requests per day. If one server silently crashes, how quickly can the system detect the failure and react? How do we distinguish between a truly dead server and one that is just temporarily slow due to network congestion? These questions form the core of why heartbeat mechanisms matter.

## What are Heartbeat Messages

At its most basic level, a heartbeat is a periodic signal sent from one component in a distributed system to another to indicate that the sender is still alive and functioning. Think of it as a simple message that says "I am alive!"

Heartbeat messages are typically small and lightweight, often containing just a timestamp, a sequence number, or an identifier. The key characteristic is that they are sent regularly at fixed intervals, creating a predictable pattern that other components can monitor.

The mechanism works through a simple contract between two parties: the sender and the receiver.
The sender commits to broadcasting its heartbeat at regular intervals, say every 2 seconds. The receiver monitors these incoming heartbeats and maintains a record of when the last heartbeat was received. If the receiver does not hear from the sender within an expected timeframe, it can reasonably assume something has gone wrong.

```python
class HeartbeatSender:
    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.sequence_number = 0

    def send_heartbeat(self, target):
        message = {
            'node_id': self.get_node_id(),
            'timestamp': time.time(),
            'sequence': self.sequence_number
        }
        send_to(message, target)
        self.sequence_number += 1

    def run(self):
        while True:
            self.send_heartbeat(target_node)
            time.sleep(self.interval)
```

When a node crashes, stops responding, or becomes isolated due to network partitions, the heartbeats stop arriving. The monitoring system can then take appropriate action, such as removing the failed node from a load balancer pool, redirecting traffic to healthy nodes, or triggering failover procedures.

## Core Components of Heartbeat Systems

The first component is the heartbeat sender. This is the node or service that periodically generates and transmits heartbeat signals. In most implementations, the sender runs on a separate thread or as a background task to avoid interfering with the primary application logic.

The second component is the heartbeat receiver or monitor. This component listens for incoming heartbeats and tracks when each heartbeat was received. The monitor maintains state about all the nodes it is tracking, typically storing the timestamp of the last received heartbeat for each node.
When evaluating node health, the monitor compares the current time against the last received heartbeat to determine if a node should be considered failed.

```python
class HeartbeatMonitor:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heartbeats = {}

    def receive_heartbeat(self, message):
        node_id = message['node_id']
        self.last_heartbeats[node_id] = {
            'timestamp': message['timestamp'],
            'sequence': message['sequence'],
            'received_at': time.time()
        }

    def check_node_health(self, node_id):
        if node_id not in self.last_heartbeats:
            return False
        last_heartbeat_time = self.last_heartbeats[node_id]['received_at']
        time_since_heartbeat = time.time() - last_heartbeat_time
        return time_since_heartbeat < self.timeout

    def get_failed_nodes(self):
        failed_nodes = []
        current_time = time.time()
        for node_id, data in self.last_heartbeats.items():
            if current_time - data['received_at'] > self.timeout:
                failed_nodes.append(node_id)
        return failed_nodes
```

The third parameter is the heartbeat interval, which determines how frequently heartbeats are sent. This interval represents a fundamental trade-off in distributed systems. Send heartbeats too frequently, and we waste network bandwidth and CPU cycles. Send them too infrequently, and we will be slow to detect failures. Most systems use intervals ranging from 1 to 10 seconds, depending on the application requirements and network characteristics.

The fourth parameter is the timeout or failure threshold. This defines how long the monitor will wait without receiving a heartbeat before declaring a node as failed.
Note that the timeout must be carefully chosen to balance two competing concerns: fast failure detection versus tolerance for temporary network delays or processing pauses. A typical rule of thumb is to set the timeout to at least 2 to 3 times the heartbeat interval, allowing for some missed heartbeats before declaring failure.

## Deciding Heartbeat Intervals and Timeouts

When a system uses very short intervals, such as sending heartbeats every 500 milliseconds, it can detect failures quickly. However, this comes at a cost. Each heartbeat consumes network bandwidth, and in a large cluster with hundreds or thousands of nodes, the cumulative traffic can become significant. Additionally, very short intervals make the system more sensitive to transient issues like brief network congestion or garbage collection pauses.

Consider a system with 1000 nodes where each node sends heartbeats to a central monitor every 500 milliseconds. This results in 2000 heartbeat messages per second just for health monitoring. In a busy production environment, this overhead can interfere with actual application traffic.

Conversely, if the heartbeat interval is too long, say 30 seconds, the system becomes sluggish in detecting failures. A node could crash, but the system would not notice for 30 seconds or more. During this window, requests might continue to be routed to the failed node, resulting in user-facing errors.

Similarly, the timeout value must also account for network characteristics.
In a distributed system spanning multiple data centers, network latency varies. A heartbeat sent from a node in California to a monitor in Virginia might take 80 milliseconds under normal conditions, but could spike to 200 milliseconds during periods of congestion. If the timeout is set too aggressively, these transient delays trigger false alarms.

A practical approach is to measure the actual round-trip time in the network and use that as a baseline. Many systems follow the rule that the timeout should be at least 10 times the round-trip time. For example, if the average round-trip time is 10 milliseconds, the timeout should be at least 100 milliseconds to account for variance.

```python
def calculate_timeout(round_trip_time_ms, heartbeat_interval_ms):
    # Timeout is 10x the RTT
    rtt_based_timeout = round_trip_time_ms * 10
    # Timeout should also be at least 2-3x the heartbeat interval
    interval_based_timeout = heartbeat_interval_ms * 3
    # Use the larger of the two
    return max(rtt_based_timeout, interval_based_timeout)
```

Another important consideration is requiring multiple missed heartbeats before declaring failure. Rather than marking a node as dead after a single missed heartbeat, systems wait until several consecutive heartbeats are missed. This approach reduces false positives caused by packet loss or momentary delays.

For instance, if we send heartbeats every 2 seconds and require 3 missed heartbeats before declaring failure, a node would need to be unresponsive for at least 6 seconds before being marked as failed. This provides a good balance between quick failure detection and tolerance for transient issues.
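The missed-heartbeat policy can be sketched as a small detector that folds the interval and the allowed number of misses into an effective timeout. This is an illustrative sketch, not code from the original article; the class and method names are mine.

```python
import time

class MissedHeartbeatDetector:
    """Declares a node failed only after several consecutive intervals
    pass with no heartbeat. Illustrative sketch, not a production API."""

    def __init__(self, interval_seconds, allowed_misses):
        self.interval = interval_seconds
        self.allowed_misses = allowed_misses
        self.last_seen = {}  # node_id -> timestamp of last heartbeat

    def record_heartbeat(self, node_id, now=None):
        self.last_seen[node_id] = now if now is not None else time.time()

    def effective_timeout(self):
        # e.g. a 2s interval with 3 allowed misses -> 6s of silence
        return self.interval * self.allowed_misses

    def is_failed(self, node_id, now=None):
        now = now if now is not None else time.time()
        if node_id not in self.last_seen:
            return True  # never heard from this node
        return (now - self.last_seen[node_id]) > self.effective_timeout()
```

With a 2-second interval and 3 allowed misses, a node that last reported at t=100 is still considered alive at t=105 but is marked failed by t=107.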
## Push vs Pull Heartbeat Models

Heartbeat mechanisms can be implemented using two different communication models: push and pull.

In a push model, the monitored node actively sends heartbeat messages to the monitoring system at regular intervals. The node takes responsibility for broadcasting its own health status. The monitored service simply runs a background thread that periodically sends a heartbeat message.

```python
class PushHeartbeat:
    def __init__(self, monitor_address, interval):
        self.monitor_address = monitor_address
        self.interval = interval
        self.running = False

    def start(self):
        self.running = True
        self.heartbeat_thread = threading.Thread(target=self._send_loop)
        self.heartbeat_thread.daemon = True
        self.heartbeat_thread.start()

    def _send_loop(self):
        while self.running:
            try:
                self._send_heartbeat()
            except Exception as e:
                logging.error(f"Failed to send heartbeat: {e}")
            time.sleep(self.interval)

    def _send_heartbeat(self):
        message = {
            'node_id': self.get_node_id(),
            'timestamp': time.time(),
            'status': 'alive'
        }
        requests.post(self.monitor_address, json=message)
```

The push model works well in many scenarios, but it has limitations. If the node itself becomes completely unresponsive or crashes, it obviously cannot send heartbeats. Additionally, in networks with strict firewall rules, the monitored nodes might not be able to initiate outbound connections to the monitoring system.
Examples of push-based heartbeats in practice:

* Kubernetes node heartbeats
* Hadoop YARN NodeManagers push heartbeats to the ResourceManager
* Celery and Airflow workers push heartbeats to the scheduler

In a pull model, the monitoring system actively queries the nodes at regular intervals to check their health. Instead of waiting for heartbeats to arrive, the monitor reaches out and asks, "Are you alive?" The monitored services expose a health endpoint that responds to these queries.

```python
class PullHeartbeat:
    def __init__(self, nodes, interval):
        self.nodes = nodes  # List of nodes to monitor
        self.interval = interval
        self.health_status = {}

    def start(self):
        self.running = True
        self.poll_thread = threading.Thread(target=self._poll_loop)
        self.poll_thread.daemon = True
        self.poll_thread.start()

    def _poll_loop(self):
        while self.running:
            for node in self.nodes:
                self._check_node(node)
            time.sleep(self.interval)

    def _check_node(self, node):
        try:
            response = requests.get(f"http://{node}/health", timeout=2)
            if response.status_code == 200:
                self.health_status[node] = {
                    'alive': True,
                    'last_check': time.time()
                }
            else:
                self.mark_node_unhealthy(node)
        except Exception as e:
            self.mark_node_unhealthy(node)
```

The pull model provides more control to the monitoring system and can be more reliable in some scenarios. Since the monitor initiates the connection, it works better in environments with asymmetric network configurations. However, it also introduces additional load on the monitor, especially in large clusters where hundreds or thousands of nodes need to be polled regularly.
Examples of pull-based health checks in practice:

* Load balancers actively probe backend servers
* Prometheus pulls metrics endpoints on each target
* Redis Sentinel monitors and polls Redis instances with PING

By the way, many real-world systems use a hybrid approach that combines elements of both models. For example, nodes might send heartbeats proactively (push), but the monitoring system also periodically polls critical nodes (pull) as a backup mechanism. This redundancy improves overall reliability.

## Failure Detection Algorithms

While basic heartbeat mechanisms are effective, they struggle with the challenge of distinguishing between actual failures and temporary slowdowns. This is where more sophisticated failure detection algorithms come into play.

The simplest failure detection algorithm uses a fixed timeout. If no heartbeat is received within the specified timeout period, the node is declared failed. While easy to implement, this binary approach is inflexible and prone to false positives in networks with variable latency.

```python
class FixedTimeoutDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeats = {}

    def is_node_alive(self, node_id):
        if node_id not in self.last_heartbeats:
            return False
        elapsed = time.time() - self.last_heartbeats[node_id]
        return elapsed < self.timeout
```

### Phi Accrual Failure Detection

A more sophisticated approach is the [phi accrual failure detector](https://arpitbhayani.me/blogs/phi-accrual), originally developed for the Cassandra database.
Instead of providing a binary output (alive or dead), the phi accrual detector calculates a suspicion level on a continuous scale. The higher the suspicion value, the more likely it is that the node has failed.

The phi value is calculated using statistical analysis of historical heartbeat arrival times. The algorithm maintains a sliding window of recent inter-arrival times and uses this data to estimate the probability distribution of when the next heartbeat should arrive. If a heartbeat is late, the phi value increases gradually rather than jumping immediately to a failure state.

The phi value represents the confidence level that a node has failed. For example, a phi value of 1 corresponds to approximately 90% confidence, a phi of 2 corresponds to 99% confidence, and a phi of 3 corresponds to 99.9% confidence.

## Gossip Protocols for Heartbeats

As distributed systems grow in size, centralized heartbeat monitoring becomes a bottleneck. A single monitoring node responsible for tracking thousands of servers creates a single point of failure and does not scale well. This is where gossip protocols come into play.

Gossip protocols distribute the responsibility of failure detection across all nodes in the cluster. Instead of reporting to a central authority, each node periodically exchanges heartbeat information with a randomly selected subset of peers. Over time, information about the health of every node spreads throughout the entire cluster, much like gossip spreads in a social network.
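Before getting into the gossip mechanics, the phi-style suspicion value from the previous section can be sketched in a few lines. This sketch assumes, for simplicity, that inter-arrival times are exponentially distributed with the observed mean; the class name and window handling are mine, and real implementations such as Cassandra's instead fit a normal distribution over the sliding window.

```python
import math

class PhiAccrualDetector:
    """Simplified phi accrual sketch: phi = -log10(P(heartbeat still
    pending)), under an exponential model of inter-arrival times.
    Illustrative only, not the production algorithm."""

    def __init__(self, window_size=100):
        self.window_size = window_size
        self.intervals = []       # sliding window of inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window_size:]
        self.last_arrival = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # Probability that the next heartbeat is still on its way
        p_later = math.exp(-elapsed / mean)
        return -math.log10(p_later)
```

For a node heartbeating every 2 seconds that then falls silent, phi rises smoothly with the silence instead of flipping to "dead" at a fixed cutoff; comparing phi against a threshold (such as Cassandra's default of 8) turns the accumulated suspicion into a failure decision.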
The basic gossip algorithm: each node maintains a local membership list containing information about all known nodes in the cluster, including their heartbeat counters. Periodically, the node selects one or more random peers and exchanges its entire membership list with them. When receiving a membership list from a peer, the node merges it with its own list, keeping the most recent information for each node.

```python
class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers
        self.membership_list = {}
        self.heartbeat_counter = 0

    def update_heartbeat(self):
        self.heartbeat_counter += 1
        self.membership_list[self.node_id] = {
            'heartbeat': self.heartbeat_counter,
            'timestamp': time.time()
        }

    def gossip_round(self):
        # Update own heartbeat
        self.update_heartbeat()
        # Select random peers to gossip with
        num_peers = min(3, len(self.peers))
        selected_peers = random.sample(self.peers, num_peers)
        # Send membership list to selected peers
        for peer in selected_peers:
            self._send_gossip(peer)

    def _send_gossip(self, peer):
        try:
            response = requests.post(
                f"http://{peer}/gossip",
                json=self.membership_list
            )
            received_list = response.json()
            self._merge_membership_list(received_list)
        except Exception as e:
            logging.error(f"Failed to gossip with {peer}: {e}")

    def _merge_membership_list(self, received_list):
        for node_id, info in received_list.items():
            if node_id not in self.membership_list:
                self.membership_list[node_id] = info
            else:
                # Keep the entry with the higher heartbeat counter
                if info['heartbeat'] > self.membership_list[node_id]['heartbeat']:
                    self.membership_list[node_id] = info

    def detect_failures(self, timeout_seconds):
        failed_nodes = []
        current_time = time.time()
        for node_id, info in self.membership_list.items():
            if node_id != self.node_id:
                time_since_update = current_time - info['timestamp']
                if time_since_update > timeout_seconds:
                    failed_nodes.append(node_id)
        return failed_nodes
```

The gossip protocol eliminates single points of failure since every node participates in failure detection. It scales well because the number of messages each node sends remains constant regardless of cluster size. It is also resilient to node failures since information continues to spread as long as some nodes remain connected.

However, gossip protocols also introduce complexity. Because information spreads gradually, there can be a delay before all nodes learn about a failure. This eventual-consistency model means that different nodes might temporarily have different views of the cluster state. The protocol also generates more total network traffic since information is duplicated across many gossip exchanges, though this is usually acceptable since gossip messages are small.

Many production systems use gossip-based failure detection. Cassandra, for example, uses a gossip protocol where each node gossips with up to three other nodes every second. Nodes track both heartbeat generation numbers and version numbers to handle various failure scenarios. The protocol also includes mechanisms to handle network partitions and prevent split-brain scenarios.

## Implementation Considerations

One important implementation consideration is the transport protocol. Should heartbeats use TCP or UDP?
TCP provides reliable delivery and guarantees that messages arrive in order, but it also introduces overhead and can be slower due to connection establishment and acknowledgment mechanisms.

UDP is faster and more lightweight, but packets can be lost or arrive out of order. Many systems use UDP for heartbeat messages because occasional packet loss is acceptable: the receiver can tolerate missing a few heartbeats without declaring a node dead.

However, TCP is often preferred when heartbeat messages carry critical state information that must not be lost.

Another consideration is network topology. In systems spanning multiple data centers, network latency and reliability vary significantly between different paths. A heartbeat between two nodes in the same data center might have a round-trip time of 1 millisecond, while a heartbeat crossing continents might take 100 milliseconds or more. Systems should account for these differences, potentially using different timeout values for local versus remote nodes.

```python
class AdaptiveHeartbeatConfig:
    def __init__(self):
        self.configs = {}

    def configure_for_node(self, node_id, location):
        if location == 'local':
            config = {
                'interval': 1000,   # 1 second
                'timeout': 3000,    # 3 seconds
                'protocol': 'UDP'
            }
        elif location == 'same_datacenter':
            config = {
                'interval': 2000,   # 2 seconds
                'timeout': 6000,    # 6 seconds
                'protocol': 'UDP'
            }
        else:  # remote_datacenter
            config = {
                'interval': 5000,   # 5 seconds
                'timeout': 15000,   # 15 seconds
                'protocol': 'TCP'
            }
        self.configs[node_id] = config
        return config
```

Another important implementation consideration is to ensure that we do not have blocking operations in the heartbeat processing path.
Heartbeat handlers should execute quickly and defer any expensive operations to separate worker threads.

Resource management is also critical. In a system with thousands of nodes, maintaining separate threads or timers for each node can exhaust system resources. We should prefer event-driven architectures or thread pools to efficiently manage concurrent heartbeat processing. Connection pooling also reduces the overhead of establishing new connections for each heartbeat message.

## Network Partitions and Split-brain

A network partition occurs when network connectivity is disrupted, splitting a cluster into two or more isolated groups. Nodes within each partition can communicate with each other but cannot reach nodes in other partitions.

During a partition, nodes on each side will stop receiving heartbeats from nodes on the other side. This creates an ambiguous situation in which both sides might believe the other has failed. If not handled carefully, this can lead to split-brain scenarios where both sides continue operating independently, potentially leading to data inconsistency or resource conflicts.

Consider a database cluster with three nodes spread across two data centers. If the network connection between data centers fails, the nodes in each data center will form separate partitions. Without proper safeguards, both partitions might elect their own leader, accept writes, and diverge from each other.

To handle network partitions correctly, systems often use quorum-based approaches.
A quorum is the minimum number of nodes that must agree before taking certain actions. For example, a cluster of five nodes might require a quorum of three nodes to elect a leader or accept writes.

During a partition, only the partition containing at least three nodes can continue operating normally. The minority partition recognizes it has lost quorum and stops accepting writes.

```python
class QuorumBasedFailureHandler:
    def __init__(self, total_nodes, quorum_size):
        self.total_nodes = total_nodes
        self.quorum_size = quorum_size
        self.reachable_nodes = set()

    def update_reachable_nodes(self, node_list):
        self.reachable_nodes = set(node_list)

    def has_quorum(self):
        return len(self.reachable_nodes) >= self.quorum_size

    def can_accept_writes(self):
        return self.has_quorum()

    def should_step_down_as_leader(self):
        return not self.has_quorum()
```

## Real-world Applications

Each node in a Kubernetes cluster runs a kubelet agent that periodically sends node status updates to the API server. By default, kubelets send updates every 10 seconds. If the API server does not receive an update within 40 seconds, it marks the node as NotReady.

Kubernetes also implements liveness and readiness probes at the pod level. A liveness probe checks whether a container is running properly, and if the probe fails repeatedly, Kubernetes restarts the container. A readiness probe determines whether a container is ready to accept traffic, and failing readiness probes cause the pod to be removed from service endpoints.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 2
```

Cassandra, a distributed NoSQL database, uses gossip-based heartbeats to maintain cluster membership. Each Cassandra node gossips with up to three other random nodes every second. The gossip messages include heartbeat generation numbers that increment whenever a node restarts and heartbeat version numbers that increment with each gossip round.

Cassandra uses the phi accrual failure detector to determine when nodes are down. The default phi threshold is 8, meaning a node is considered down when the algorithm is about 99.999999% confident it has failed. This adaptive approach allows Cassandra to work reliably across diverse network environments.

etcd, a distributed key-value store used by Kubernetes, implements heartbeats as part of its Raft consensus protocol. The Raft leader sends heartbeat messages to followers every 100 milliseconds by default. If a follower does not receive a heartbeat within the election timeout (typically 1000 milliseconds), it initiates a new leader election.

## Footnotes

Heartbeats are essential to distributed systems.
From simple periodic messages to sophisticated adaptive algorithms, heartbeats enable systems to maintain awareness of component health and respond to failures quickly.

The key to effective heartbeat design lies in balancing competing concerns. Fast failure detection requires frequent heartbeats and aggressive timeouts, but this increases network overhead and sensitivity to transient issues. Slow detection reduces resource consumption and false positives but leaves the system vulnerable to longer outages.

When designing distributed systems, consider heartbeat mechanisms early in the architecture process. The choice of heartbeat intervals, timeout values, and failure detection algorithms significantly impacts system behavior under failure conditions.

No matter what we are building, heartbeats remain an essential tool for maintaining reliability.