
Pulse 2021

Pulse

Pulse is an easy-to-use hybrid failure detection library based on simple heartbeat message exchanges overlaid on a gossip protocol. Failure detectors were proposed by Chandra and Toueg as a way to solve consensus in asynchronous systems with crash failures. In a fully asynchronous system, a failure detector cannot be reliable, because a crashed node is indistinguishable from a slow one. But with a time bound based on the round-trip time (RTT), we can reasonably suspect that an unresponsive node has failed. The simplest way to build a failure detector is to send and receive heartbeat messages among all nodes in the network.
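The timeout idea can be sketched in a few lines of Python. This is only an illustration and not Pulse's actual API: the class name `HeartbeatDetector`, its methods, and the fixed `timeout` value are all assumptions made for the example, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatDetector:
    """Suspects a peer when no heartbeat arrived within `timeout` seconds.

    Hypothetical sketch: names and structure are assumptions, not Pulse's API.
    """

    def __init__(self, timeout):
        self.timeout = timeout  # in practice derived from the observed RTT
        self.last_seen = {}     # peer id -> timestamp of the last heartbeat

    def record_heartbeat(self, peer, now):
        """Remember when we last heard from `peer`."""
        self.last_seen[peer] = now

    def suspected(self, now):
        """Return the peers that have been silent for longer than the timeout."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


fd = HeartbeatDetector(timeout=3.0)
fd.record_heartbeat("a", now=0.0)
fd.record_heartbeat("b", now=0.0)
fd.record_heartbeat("a", now=2.5)

# At t=4.0, "a" has been silent for 1.5 s and "b" for 4.0 s.
print(fd.suspected(now=4.0))  # ['b']
```

A suspicion here is not proof of a crash: a peer whose heartbeat is merely delayed past the timeout would be suspected too, which is the price of working with time bounds.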

The problem with a heartbeat-based failure detector (FD) is that it does not scale. Every node in the network exchanges heartbeat messages with every other node, so the network load grows as O(n^2). For a small number of nodes, fewer than 100, this is a perfectly acceptable way of communicating. However, as the numbers escalate, the cost grows quickly: at 1,000 nodes, each heartbeat round involves roughly 1,000,000 messages (1,000 × 999 = 999,000). This is where a gossip protocol helps us reduce the network load to O(n) per round. In a gossip protocol, every node chooses a random node to exchange messages with (gossip) and piggybacks the status of the other nodes it knows about, as in SWIM.
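To make the message counts concrete, here is a small Python sketch. It assumes a simplified counter-table style of gossip, where each node keeps a heartbeat counter per known node and merges tables by taking the maximum; this is not SWIM's ping/ack probing and not necessarily what Pulse does. All function names are hypothetical.

```python
import random


def heartbeat_messages_per_round(n):
    """All-to-all heartbeats: each of n nodes messages the other n - 1."""
    return n * (n - 1)


def gossip_messages_per_round(n):
    """Gossip: each of n nodes messages one randomly chosen peer."""
    return n


def merge(local, remote):
    """Piggybacked state: keep the highest counter seen for each node."""
    for node, counter in remote.items():
        if counter > local.get(node, -1):
            local[node] = counter


def gossip_round(tables, rng):
    """Run one round over `tables` (node -> {node: counter}); return messages sent."""
    nodes = list(tables)
    sent = 0
    for node in nodes:
        tables[node][node] += 1  # bump our own heartbeat counter
        peer = rng.choice([p for p in nodes if p != node])
        merge(tables[peer], tables[node])  # one message carrying our table
        sent += 1
    return sent


print(heartbeat_messages_per_round(1000))  # 999000
print(gossip_messages_per_round(1000))     # 1000

# Each node starts out knowing only itself.
tables = {i: {i: 0} for i in range(5)}
sent = gossip_round(tables, random.Random(0))
print(sent)  # 5
```

The trade-off is latency: with gossip, news about a node reaches the rest of the cluster over several rounds instead of directly, which is why a stale counter, rather than a single missed message, is what would mark a node as suspect in this style of protocol.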

Pulse uses a simple heartbeat protocol when the number of nodes involved is small (fewer than 100). Once the number of nodes grows past this threshold, which the user can customize, the nodes start disseminating gossip-style messages to relay their liveness. An individual node can opt to keep the heartbeat protocol for nodes of its choosing, so that it receives RTT-bounded updates about them, but the rest of node discovery is done via gossip message exchange.
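The switch between the two modes might look roughly like the following Python sketch. The function `plan_monitoring`, its `threshold` and `pinned` parameters, and the string labels are all hypothetical names chosen for illustration; Pulse's real configuration may be shaped differently.

```python
def plan_monitoring(peers, threshold=100, pinned=()):
    """Decide, per peer, whether to monitor it by heartbeat or by gossip.

    Hypothetical sketch: `threshold` is the user-customizable cluster size,
    and `pinned` holds peers this node always wants direct heartbeats from.
    """
    pinned = set(pinned)
    if len(peers) < threshold:
        # Small cluster: direct heartbeats to everyone are affordable.
        return {peer: "heartbeat" for peer in peers}
    # Large cluster: gossip by default, heartbeats only for pinned peers.
    return {
        peer: "heartbeat" if peer in pinned else "gossip"
        for peer in peers
    }


small = plan_monitoring(["a", "b", "c"])
print(small)  # {'a': 'heartbeat', 'b': 'heartbeat', 'c': 'heartbeat'}

large = plan_monitoring([f"n{i}" for i in range(150)], pinned={"n3"})
print(large["n3"], large["n4"])  # heartbeat gossip
```

Keeping the decision per peer, rather than one global switch, is what lets a node retain RTT-bounded updates for a few important peers while the bulk of the cluster is tracked through gossip.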


© 2022 Thomas Min. All Rights Reserved.