Distributed systems require leader election, and today I am going to try to explain the two most popular algorithms - Paxos and Raft.
Once the problem is defined and some basic building blocks are identified, both algorithms are very easy to understand. The Paxos protocol was first described by Leslie Lamport in 1989 as the basis of the State Machine Replication method. Raft was designed in the last couple of years, from the ground up, specifically for understandability.
Background: The Need for Consensus
Army of clones
In a typical scale-out architecture we have a number of clients making requests to a cluster of servers.
- When these servers are identical and stateless, any server can process any request from any client. That is the basis of "scale-out" architectures. When servers are stateful, the same logic still applies as long as each server has the same state.
- When client requests change the state, we need consistency and atomicity with many concurrent clients talking to many servers. This is where you need to coordinate using consensus protocols.
Electing a leader and following commands
- One very powerful approach is to elect a single leader who makes the changes to the state, and then all other machines follow the exact steps the leader followed to make the state change.
- This change of state is defined by a deterministic state machine (a Mealy machine, to be exact). The same sequence of input commands executed on each node creates identical state changes at each node. The leader broadcasts the input messages of this state machine, and the followers make the local state changes. This is what Leslie Lamport means by "State Machine Replication".
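The replication idea above can be sketched in a few lines. This is a toy illustration (the `KVStateMachine` class is a hypothetical name, not from any real library): because the state machine is deterministic, feeding the same command log to every replica leaves them all in the identical state.

```python
class KVStateMachine:
    """A deterministic state machine: a simple key/value store."""
    def __init__(self):
        self.state = {}

    def apply(self, command):
        # command is ("set", key, value) - deterministic: no clocks, no randomness
        op, key, value = command
        if op == "set":
            self.state[key] = value

# The leader broadcasts the same command log to every follower.
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]

replicas = [KVStateMachine() for _ in range(3)]
for replica in replicas:
    for command in log:
        replica.apply(command)

# All replicas converge to the identical state.
assert all(r.state == {"x": 3, "y": 2} for r in replicas)
```

All the hard parts of a real system - delivering that log reliably and in order despite failures - are exactly what the rest of this article is about.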
Crashes and Dropped Messages
- Life would be easy if there were no crashes, no lost messages, and no network partitions. But failures do happen, and that is exactly why we need complicated protocols.
- The basic assumption of a distributed system is that there is no global shared state; state has to be transferred by passing messages. Messages have delays, and so state changes have delays.
- Every node in the cluster will be slightly different due to the delay between when a command is sent by the leader and when it is received and applied by each follower.
The Basic Requirement for Leader Election
The basic requirement of consensus is that despite crashes and network partitioning there is exactly one elected leader, and everyone knows about it once it is elected.
Toolbox: The Building Blocks
Let's first examine some familiar concepts and tools before we analyze Paxos and Raft. In both algorithms any node can propose, but only one is finally accepted as the leader. A "proposer" proposes and "voters" vote on it. In both algorithms, once the leader is elected the result is broadcast.
Optimistic Concurrency and the Thomas Write Rule:
- Optimistic concurrency works by requiring every write to follow a two-step process: first fetch the latest version of the data, then update it.
- This is just compare-and-swap in a different form. If your update is based on an outdated version, it will be rejected. Thomas' write rule says to ignore outdated writes.
- This approach is "lock-free" and gives very high throughput.
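Here is a minimal sketch of that two-step pattern, assuming a hypothetical versioned `Register` (not a real API): every write carries the version it was based on, and writes based on an outdated version are ignored in the spirit of Thomas' rule.

```python
class Register:
    """A single versioned value with optimistic, compare-and-swap style writes."""
    def __init__(self):
        self.version = 0
        self.value = None

    def read(self):
        # Step 1 of optimistic concurrency: fetch the latest version.
        return self.version, self.value

    def write(self, based_on_version, value):
        # Step 2: the write succeeds only if it is based on the current version.
        # Outdated writes are simply ignored (Thomas' write rule).
        if based_on_version < self.version:
            return False  # rejected - caller must re-read and retry
        self.version = based_on_version + 1
        self.value = value
        return True

reg = Register()
v, _ = reg.read()             # two clients both read version 0
assert reg.write(v, "A")      # first writer wins; version becomes 1
assert not reg.write(v, "B")  # second write is based on version 0 -> ignored
assert reg.value == "A"
```

No locks are held between the read and the write; the loser just retries against the new version, which is why this scales so well under contention.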
Two-Phase Commit
- The classical commit protocol has two phases - a prepare phase and a commit phase. In the prepare phase we solicit votes, and the leader is elected only when the commit is issued and successful.
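The two-phase shape can be sketched as follows. This is a toy model with hypothetical `Participant`/`two_phase_commit` names (a real 2PC implementation would also handle timeouts, crash recovery, and logging): the coordinator solicits votes first and only issues commit if everyone voted yes.

```python
class Participant:
    """A toy participant that votes in the prepare phase."""
    def __init__(self, will_vote_yes=True):
        self.will_vote_yes = will_vote_yes
        self.status = "idle"

    def prepare(self, txn):
        self.status = "prepared" if self.will_vote_yes else "refused"
        return self.will_vote_yes

    def commit(self, txn):
        self.status = "committed"

    def abort(self, txn):
        self.status = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1: prepare - solicit votes from every participant.
    votes = [p.prepare(txn) for p in participants]
    if all(votes):
        # Phase 2: commit - only after a unanimous yes.
        for p in participants:
            p.commit(txn)
        return "committed"
    for p in participants:
        p.abort(txn)
    return "aborted"

assert two_phase_commit([Participant(), Participant()], "txn-1") == "committed"
assert two_phase_commit([Participant(), Participant(False)], "txn-2") == "aborted"
```

Leader election reuses this shape, but as the next section shows, it demands only a majority of yes votes rather than unanimity.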
Why Do We Need a Majority?
- It is possible that two nodes simultaneously believe they are the leader.
- But by definition, and with mathematical certainty, only one of them can win a majority of votes. Therefore there can be only one leader - thus satisfying the basic requirement.
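The "mathematical certainty" is just the pigeonhole principle: any two majorities of the same cluster must share at least one node, and that shared node voted for only one candidate. A quick brute-force check for a five-node cluster:

```python
from itertools import combinations

n = 5
nodes = set(range(n))
majority = n // 2 + 1  # 3 of 5

# Every pair of possible majorities overlaps in at least one node,
# so two candidates can never both collect disjoint majority votes.
for a in combinations(nodes, majority):
    for b in combinations(nodes, majority):
        assert set(a) & set(b)
```

The same argument holds for any cluster size, which is why consensus protocols always talk in terms of majorities (quorums).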
Bidding and Auctions
- Think of leader election as an auction - the highest bidder always wins. A proposal from a higher bidder cancels a voter's previous commitment - the transaction is aborted for the previous bidder/proposal.
- There is always some well-defined way of deciding which candidate has the higher bid.
- This bidding and auctioning process is an integral part of leader election.
Below are the high-level details of how Paxos works:
- At a high level, Paxos combines the "read value" part of optimistic concurrency with the propose phase of two-phase commit in its first step. In the second step it combines the "write value" part with the commit phase.
- In step 1, the proposer "makes up" an integer version number/timestamp and calls it the "proposal number". In the spirit of Thomas' rule, no voter accepts an "outdated version": each voter only ever accepts the highest proposal number it has seen. This looks very much like an auction - the highest bidder always wins.
- If the proposal gets ACKs from a majority of the nodes, the proposer sends the "commit message". The only confusing aspect of the algorithm is that if a value has already been chosen, the proposer commits that same already-chosen value instead of its own. (It sure does feel like an unnecessary no-op though.)
- Again using Thomas' rule, voters reject the request to commit if they have seen a higher-numbered proposal. A proposal to write a higher version cancels a voter's previous commitment - as far as the voter is concerned, the commit is aborted for the earlier proposer.
- When the proposal gets committed ACKs from a majority, the election is over.
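The steps above can be condensed into an acceptor sketch. This is a simplified, in-memory model of single-decree Paxos (real implementations persist state and pass messages over the network; the class and value names here are illustrative): the acceptor never honors a proposal numbered lower than one it has already promised.

```python
class Acceptor:
    """A single Paxos acceptor (voter), simplified to in-memory state."""
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, number):
        """Phase 1: promise, and report any value already accepted."""
        if number > self.promised:
            self.promised = number
            return True, self.accepted
        return False, None  # outdated bid - rejected, per Thomas' rule

    def accept(self, number, value):
        """Phase 2: accept, unless a higher-numbered prepare arrived meanwhile."""
        if number >= self.promised:
            self.promised = number
            self.accepted = (number, value)
            return True
        return False

acceptors = [Acceptor() for _ in range(3)]

# Phase 1: proposal number 1 wins promises from a majority.
promises = [a.prepare(1) for a in acceptors]
assert sum(ok for ok, _ in promises) >= 2

# No value was previously chosen, so the proposer commits its own.
acks = [a.accept(1, "node-A is leader") for a in acceptors]
assert sum(acks) >= 2  # majority accepted: the value is chosen

# A stale proposer with a lower number is now rejected.
assert acceptors[0].prepare(0) == (False, None)
```

If a promise had come back carrying an already-accepted value, the proposer would have had to re-propose that value - the "unnecessary no-op" mentioned above, which is what makes a chosen value stick.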
- Leslie Lamport presented this method through a fictional legislative consensus system used on the island of Paxos in Greece. That is your "Part-Time Parliament". Lamport envisioned that for each new state change a new leader is elected, and everyone follows it only for that single change. For the next state change the process repeats.
- Raft instead elects a dictator for a fixed term. Once elected, everyone follows the leader until a new leader is elected.
- For each state change, the leader first sends a prepare message on behalf of the client. When it receives ACKs on the prepare from a majority, it sends the commit message. The commit is complete when a majority have committed. The cycle continues for the next change.
- A new election is started when the term of the current leader ends or when any node suspects that the leader is dead. A node suspects the leader is dead if it has not received any heartbeat or command from the leader within a given timespan.
- The node suspecting the leader's death then decides to become a candidate. Candidates seek a majority vote. If the candidate doesn't get elected in the first round, it simply tries again until someone is elected. As simple as that: keep asking for votes until either you win or you learn that someone else has already won.
- If a node is not itself a candidate, it always votes for the candidate that has advanced furthest in its processing. Again, this is a kind of auction. In other words, the bid from the candidate with the most up-to-date state always wins.
- But how does Raft avoid, or at least minimize, the chance of the election getting into an infinite loop where no one ever wins a majority? The answer is annoyingly simple: each node waits for a random amount of time before becoming a candidate and asking for votes. It is statistically improbable that two candidates will ask for votes at the same time. By carefully choosing the range of the random wait, progress can be made - this is exactly how Ethernet handles collisions!
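The randomized-timeout trick can be shown with a toy simulation. This is a simplified model, not a real Raft implementation (the function and parameter names are made up for illustration): nodes whose timers fire close together split the vote, but with a wide random range that is rare, so elections converge after a handful of retries.

```python
import random

def run_election(num_nodes, timeout_range=(150, 300), msg_delay=5):
    """One election round; returns the winner's id, or None on a split vote."""
    # Each node waits a random timeout before becoming a candidate.
    timeouts = [random.uniform(*timeout_range) for _ in range(num_nodes)]
    earliest = min(timeouts)
    # Nodes whose timers fire before the first candidate's vote request
    # arrives also become candidates and split the vote.
    candidates = [i for i, t in enumerate(timeouts) if t < earliest + msg_delay]
    if len(candidates) == 1:
        return candidates[0]  # uncontested candidate wins the majority
    return None               # split vote: retry with fresh random timeouts

# Split votes are improbable, so a leader emerges after very few rounds.
rounds = 0
while run_election(5) is None:
    rounds += 1
assert rounds < 50
```

This mirrors the exponential-backoff idea from Ethernet: randomness does not make collisions impossible, it just makes repeated collisions vanishingly unlikely.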
Leader election is a fact of life in distributed systems. Although Paxos and Raft look very different on the surface, deep down they use similar building blocks.
- Paxos Made Simple - http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
- In Search of an Understandable Consensus Algorithm