
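    • Consistency
      • Properties of consensus algorithms
    • Raft states and state transitions
      • Raft definition
      • Raft state transitions
      • Raft instance initialization
    • Some Raft rules
    • Leader Election
      • Candidate election process and response handling
      • Receiver voting policy
    • Log Replication
      • Leader log replication requests and responses
      • Receiver handling of AppendEntries
      • Log replication optimization
      • State persistence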


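The core problem in distributed systems is keeping the replicas on multiple nodes consistent. Consensus protocols are usually built on replicated state machines: every node starts from the same state, applies the same sequence of operations (the log), and therefore ends in the same state.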

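  • State machine: when we talk about consistency, we really mean keeping this state machine consistent. The state machine takes every command from the log and executes them in order; the result is the consistent data the service exposes to clients.
  • Log: records every modification.
  • Consensus module: the consensus algorithm keeps the commands written to the log consistent across nodes; this is the core of the Raft algorithm.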
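
As a minimal sketch of the replicated state machine idea above: two replicas that replay the same log end in the same state. The Command type and the apply function are hypothetical names chosen for this illustration; they are not part of the lab code.

```go
package main

import (
	"fmt"
	"reflect"
)

// Command is a hypothetical "set Key to Value" operation stored in the log.
type Command struct {
	Key   string
	Value int
}

// apply replays every command in the log, in order, onto an empty state machine.
func apply(log []Command) map[string]int {
	state := make(map[string]int)
	for _, c := range log {
		state[c.Key] = c.Value
	}
	return state
}

func main() {
	log := []Command{{"x", 1}, {"y", 2}, {"x", 3}}
	a := apply(log) // replica A
	b := apply(log) // replica B
	// Same starting state plus same log gives the same final state.
	fmt.Println(a, reflect.DeepEqual(a, b))
}
```

The consensus module's only job is to make sure every replica sees the same log; the state machine itself needs no coordination.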


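  • safety: never returns an incorrect result under non-Byzantine conditions (network delays, partitions, packet loss, duplication, and reordering)
  • available: as long as a majority of the nodes are alive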
  • do not depend on timing to ensure the consistency of the logs
  • a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls



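  • Leader: handles all requests. The Leader replica accepts update requests from clients, processes them locally, and then replicates them to the other replicas.
  • Follower: passively applies updates. It receives update requests from the Leader and writes them to its local log.
  • Candidate: if a Follower receives no heartbeat from the Leader for some period, it assumes the Leader may have failed and starts an election. The replica stays in the Candidate state until the election ends.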



type Raft struct {
	mu        sync.Mutex          // Lock to protect shared access to this peer's state
	peers     []*labrpc.ClientEnd // RPC end points of all peers
	persister *Persister          // Object to hold this peer's persisted state
	me        int                 // this peer's index into peers[]

	// Your data here (2A, 2B, 2C).
	// Look at the paper's Figure 2 for a description of what
	// state a Raft server must maintain.
	currentTerm    int         // 2A
	votedFor       int         // 2A
	electionTimer  *time.Timer // 2A
	heartbeatTimer *time.Timer // 2A
	state          NodeState   // 2A

	logs        []LogEntry    // 2B
	commitIndex int           // 2B
	lastApplied int           // 2B
	nextIndex   []int         // 2B
	matchIndex  []int         // 2B
	applyCh     chan ApplyMsg // 2B
}



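  • follower -> candidate: the follower's election timeout (electionTimer) expires
  • candidate -> candidate: the candidate neither wins the election nor receives a heartbeat from a new leader before its election timeout expires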



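  • candidate -> leader: the candidate becomes leader once it receives votes from a majority

After becoming leader: initialize nextIndex[], stop the election timer, send heartbeats to the other servers, and reset the heartbeat timer (heartbeatTimer) once the heartbeats have been sent.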


  • leader -> follower
  • candidate -> follower

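For both leader and candidate, the precondition for stepping down to follower is that the term in a RequestVote RPC or AppendEntries RPC request or response is greater than currentTerm. (A candidate also steps down when it receives an AppendEntries RPC from a leader whose term equals its own.)

Based on the above, we can define a state transition method convertTo that performs the actions required after a transition, and call it wherever a transition occurs.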

func (rf *Raft) convertTo(state NodeState) {
	if rf.state == state {
		return
	}
	DPrintf("Term %d: server %d convert from %v to %v\n",
		rf.currentTerm, rf.me, rf.state, state)
	// update state
	rf.state = state
	switch state {
	case Follower:
		rf.heartbeatTimer.Stop() // if this server was the leader, stop sending heartbeats
		rf.electionTimer.Reset(randomTime(ElectionTimeoutLower, ElectionTimeoutHigher))
		rf.votedFor = -1 // clear votedFor
	case Candidate:
		rf.startElection()
	case Leader:
		for i := range rf.nextIndex {
			rf.nextIndex[i] = len(rf.logs)
		}
		rf.electionTimer.Stop()
		rf.broadcastHeartbeat()
		rf.heartbeatTimer.Reset(HeartbeatInterval)
	}
}

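The election timer (electionTimer) must be reset in the following four cases:

  • a candidate or leader converts to follower
  • a candidate or follower receives a heartbeat from the leader
  • a candidate starts an election
  • the server grants its vote to a candidate other than itself

The heartbeat timer (heartbeatTimer) is reset after the leader has sent its heartbeats.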


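This step initializes the fields of the Raft instance and kicks off a goroutine to handle timer events; remember to take the lock. A few points to note:

  • every Raft instance starts as a Follower
  • logs[] is indexed from 1: rf.logs = make([]LogEntry, 1). Real entries start at index 1, so the entry at index 0 is identical on every node. When a log mismatch is found, nextIndex can be decremented and the request retried without going out of bounds.
  • nextIndex[] is initialized to len(logs)
  • use select to listen on both timers and perform the corresponding action when one fires; hold the lock when reading state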
func Make(peers []*labrpc.ClientEnd, me int,
	persister *Persister, applyCh chan ApplyMsg) *Raft {
	rf := &Raft{}
	rf.peers = peers
	rf.persister = persister
	rf.me = me

	// 2A
	rf.currentTerm = 0
	rf.votedFor = -1
	rf.electionTimer = time.NewTimer(randomTime(ElectionTimeoutLower, ElectionTimeoutHigher))
	rf.heartbeatTimer = time.NewTimer(HeartbeatInterval)
	rf.state = Follower

	// 2B
	rf.applyCh = applyCh
	rf.commitIndex = 0
	rf.lastApplied = 0
	rf.logs = make([]LogEntry, 1)
	rf.matchIndex = make([]int, len(rf.peers))
	rf.nextIndex = make([]int, len(rf.peers))
	for i := range rf.nextIndex {
		rf.nextIndex[i] = len(rf.logs)
	}

	// 2C persist
	// initialize from state persisted before a crash
	rf.mu.Lock()
	rf.readPersist(persister.ReadRaftState())
	rf.mu.Unlock()

	// timer loop: runs for the lifetime of this peer
	go func() {
		for {
			select {
			case <-rf.electionTimer.C:
				rf.mu.Lock()
				if rf.state == Follower { // there are two situations: Follower and Candidate
					rf.convertTo(Candidate)
				} else {
					rf.startElection()
				}
				rf.mu.Unlock()
			case <-rf.heartbeatTimer.C:
				rf.mu.Lock()
				if rf.state == Leader {
					DPrintf("%v its log len is %d", rf, len(rf.logs))
					rf.broadcastHeartbeat()
					rf.heartbeatTimer.Reset(HeartbeatInterval)
				}
				rf.mu.Unlock()
			}
		}
	}()
	return rf
}



Leader Election

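The Leader Election phase must guarantee safety and liveness:

  • safety: at most one leader is elected per term. This holds because (1) Raft ensures each server casts at most one vote per term and persists votedFor, and (2) a candidate needs a majority of votes to win.
  • liveness: some candidate eventually becomes leader. This holds because (1) election timeouts are randomized in [T, 2T], so the server that times out first usually wins before the others time out, which resolves split votes; and (2) T >> broadcast time.

Leader Election can be broken into two parts: the candidate sending vote requests and handling the responses, and each server deciding how to vote.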
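
The code in this post calls randomTime(ElectionTimeoutLower, ElectionTimeoutHigher) without showing it. A minimal sketch of such a helper, assuming it draws uniformly from the closed interval, might look like the following. The concrete bounds (300 ms and 600 ms, i.e. T and 2T) are assumptions for illustration only; the original does not state them.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Assumed bounds for illustration; the original post does not give values.
const (
	ElectionTimeoutLower  = 300 * time.Millisecond // T
	ElectionTimeoutHigher = 600 * time.Millisecond // 2T
)

// randomTime returns a duration chosen uniformly from [lower, upper].
// It assumes upper >= lower.
func randomTime(lower, upper time.Duration) time.Duration {
	return lower + time.Duration(rand.Int63n(int64(upper-lower)+1))
}

func main() {
	// Each server draws its own timeout, so they rarely expire together.
	for server := 0; server < 3; server++ {
		fmt.Printf("server %d: election timeout %v\n",
			server, randomTime(ElectionTimeoutLower, ElectionTimeoutHigher))
	}
}
```

Drawing a fresh value on every reset, rather than once at startup, keeps two servers that happened to collide once from colliding again in the next round.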



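A candidate starts an election as follows:

  1. increment currentTerm
  2. reset the election timer
  3. vote for itself
  4. send RequestVote RPCs to all other servers in parallel

The election ends in one of three ways:

  1. The candidate wins. After receiving votes from a majority, it switches to Leader and periodically sends heartbeats (AppendEntries RPCs carrying no log entries) to all other servers.
  2. Another server becomes leader. If, while waiting for votes, the Candidate receives an AppendEntries RPC from a server claiming to be leader with a term greater than or equal to its own currentTerm, it switches to follower and updates its currentTerm.
  3. No leader is elected. If the vote is split and no candidate receives a majority, no leader is chosen. Each candidate's wait for votes times out, and the candidates then increment currentTerm again and start a new round of leader election.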
func (rf *Raft) startElection() {
	defer rf.persist()
	rf.currentTerm += 1 // term+1
	rf.electionTimer.Reset(randomTime(ElectionTimeoutLower, ElectionTimeoutHigher))
	lastIndex := len(rf.logs) - 1
	args := RequestVoteArgs{
		Term:         rf.currentTerm,
		CandidateId:  rf.me,
		LastLogIndex: lastIndex,
		LastLogTerm:  rf.logs[lastIndex].Term,
	}
	var voteCount int32
	for i := range rf.peers {
		if i == rf.me {
			rf.votedFor = rf.me
			atomic.AddInt32(&voteCount, 1)
			continue
		}
		go func(server int) {
			var reply RequestVoteReply
			if rf.sendRequestVote(server, &args, &reply) {
				rf.mu.Lock()
				DPrintf("%v: got RequestVote response from node %d, VoteGranted=%v, Term=%d",
					rf, server, reply.VoteGranted, reply.Term)
				if reply.VoteGranted && rf.state == Candidate { // term must >= currentTerm
					atomic.AddInt32(&voteCount, 1)
					if atomic.LoadInt32(&voteCount) > int32(len(rf.peers)/2) {
						rf.convertTo(Leader)
					}
				} else {
					if reply.Term > rf.currentTerm {
						rf.currentTerm = reply.Term
						rf.convertTo(Follower)
						rf.persist()
					}
				}
				rf.mu.Unlock()
			} else {
				DPrintf("%v:send request vote to server %d failed", rf, server)
			}
		}(i)
		if rf.state == Leader {
			break
		}
	}
}



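  1. each server votes for at most one candidate per term
  2. votes are granted first-come, first-served
  3. to guarantee safety, the candidate's log must be at least as up-to-date as the receiver's own log (compare lastLogTerm first: the larger is more up-to-date; if the terms are equal, compare log lengths: the longer is more up-to-date)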
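
The "at least as up-to-date" comparison in rule 3 can be isolated into a small pure function. This is a sketch only; candidateUpToDate and its parameter names are hypothetical and do not appear in the lab code, which inlines the inverse of this check.

```go
package main

import "fmt"

// candidateUpToDate reports whether a candidate's log is at least as
// up-to-date as the receiver's: the higher last term wins, and on equal
// last terms the longer log (higher last index) wins.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(candidateUpToDate(3, 5, 2, 9)) // higher last term, shorter log
	fmt.Println(candidateUpToDate(2, 4, 2, 5)) // same last term, shorter log
	fmt.Println(candidateUpToDate(2, 5, 2, 5)) // identical position
}
```

Note that a shorter log can still be more up-to-date when its last entry comes from a later term, which is why the term is compared before the length.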


func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	// lock
	rf.mu.Lock()
	defer rf.mu.Unlock()
	defer rf.persist()
	// Your code here (2A, 2B).
	// candidate term is smaller
	if args.Term < rf.currentTerm {
		reply.Term = rf.currentTerm
		reply.VoteGranted = false
		return
	}
	// Rules 1 and 2: this server has already voted for another candidate
	// (one vote per term, first-come first-served)
	if args.Term == rf.currentTerm && rf.votedFor != -1 && rf.votedFor != args.CandidateId {
		reply.Term = rf.currentTerm
		reply.VoteGranted = false
		return
	}
	// update current server: if the server is candidate or leader, it will be follower
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.convertTo(Follower)
	}
	// Rule 3: candidate log restriction
	lastLogIndex := len(rf.logs) - 1
	if args.LastLogTerm < rf.logs[lastLogIndex].Term ||
		(args.LastLogTerm == rf.logs[lastLogIndex].Term && args.LastLogIndex < lastLogIndex) {
		reply.Term = rf.currentTerm
		reply.VoteGranted = false
		return
	}
	rf.votedFor = args.CandidateId
	reply.Term = rf.currentTerm // no use
	reply.VoteGranted = true
	// Important: reset the election timer
	rf.electionTimer.Reset(randomTime(ElectionTimeoutLower, ElectionTimeoutHigher))
}

Log Replication

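Once a Leader has been elected, it can accept client requests. The leader appends each request to its log as a log entry and then sends AppendEntries RPCs to the other servers. Once the Leader knows that a log entry has been replicated on a majority, it applies the entry to its state machine and returns the result to the client. If a Follower crashes or runs slowly, or the network drops packets, the Leader keeps sending AppendEntries RPCs to that Follower until the logs are consistent.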


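To bring a Follower in line with the Leader, the Follower must find the last LogEntry it shares with the Leader and overwrite everything after it with the Leader's entries. The Leader maintains a nextIndex for every node: the index of the first LogEntry the Leader needs to send to that node. When a Leader is first elected, it initializes every nextIndex to the index just past its own last LogEntry; in other words, it assumes nothing needs to be synced. If the logs are in fact inconsistent, the Leader's AppendEntries RPC fails, and the Leader decrements nextIndex and retries until the logs match. This also explains why logs start at index 1: the entry at index 0 is identical on every node, so the retries never go out of bounds.

For example, suppose nextIndex for follower b starts at 11 and the leader sends b AppendEntriesRPC(6, 10), that is, previous term 6 at previous index 10. b has no entry with term 6 in slot 10 of its log, so it replies with a rejection. The leader decrements nextIndex to 10 and sends AppendEntriesRPC(6, 9); b has no entry with term 6 in slot 9 either. This continues until the leader sends AppendEntriesRPC(4, 4): b finds an entry with term 4 in slot 4 and accepts the request. The leader can then push log entries to b starting from slot 5.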
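
The walk-back in this example can be reproduced with a small simulation. The sketch below keeps only the term of each entry, and the two logs are assumed values chosen to match the numbers in the example (a leader with ten entries ending in term 6, and a follower b holding four entries). findNextIndex is a hypothetical helper, not part of the lab code.

```go
package main

import "fmt"

// findNextIndex simulates the leader's decrement-and-retry loop.
// leader[i] and follower[i] hold the term of entry i; index 0 is the dummy
// entry shared by every server, so the loop always terminates.
// It returns the final nextIndex and the number of rejected probes.
func findNextIndex(leader, follower []int) (nextIndex, rejected int) {
	nextIndex = len(leader) // a new leader starts just past its last entry
	for {
		prev := nextIndex - 1
		// The follower accepts only if it has an entry at prev with the same term.
		if prev < len(follower) && follower[prev] == leader[prev] {
			return nextIndex, rejected
		}
		rejected++
		nextIndex--
	}
}

func main() {
	leader := []int{0, 1, 1, 1, 4, 4, 5, 5, 6, 6, 6} // assumed terms for slots 0..10
	b := []int{0, 1, 1, 1, 4}                        // assumed terms for slots 0..4
	next, rejected := findNextIndex(leader, b)
	fmt.Printf("nextIndex=%d after %d rejections\n", next, rejected)
}
```

Each rejection costs one RPC round trip, which is the motivation for the conflict-term optimization described later in this post.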


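  • committed (the commitIndex field): the part of the log on which the distributed system has reached agreement (the LogEntry has been replicated on a majority)
  • applied (the lastApplied field): the part of the log that has already been applied to the state machine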

Leader log replication requests and responses


func (rf *Raft) broadcastHeartbeat() {
	for i := range rf.peers {
		if i == rf.me {
			continue
		}
		go func(server int) {
			rf.mu.Lock()
			if rf.state != Leader {
				rf.mu.Unlock()
				return
			}
			preLogIndex := rf.nextIndex[server] - 1
			preLogTerm := rf.logs[preLogIndex].Term
			// 1: send a copy of the entries, not a slice of rf.logs
			entries := make([]LogEntry, len(rf.logs[preLogIndex+1:]))
			copy(entries, rf.logs[preLogIndex+1:])
			args := AppendEntriesArgs{
				Term:         rf.currentTerm,
				LeaderId:     rf.me,
				PreLogIndex:  preLogIndex,
				PreLogTerm:   preLogTerm,
				LogEntries:   entries,
				LeaderCommit: rf.commitIndex,
			}
			rf.mu.Unlock()
			var reply AppendEntriesReply
			if rf.sendAppendEntries(server, &args, &reply) {
				rf.mu.Lock()
				if rf.state != Leader {
					rf.mu.Unlock()
					return
				}
				if reply.Success {
					// 2: update matchIndex/nextIndex, then try to advance commitIndex
					rf.matchIndex[server] = args.PreLogIndex + len(args.LogEntries)
					rf.nextIndex[server] = rf.matchIndex[server] + 1
					for i := len(rf.logs) - 1; i > rf.commitIndex; i-- {
						count := 0
						for _, index := range rf.matchIndex {
							if index >= i {
								count += 1
							}
						}
						// a leader only commits entries from its own term by counting replicas
						if count > len(rf.peers)/2 && rf.currentTerm == rf.logs[i].Term {
							DPrintf("-----------%v commit %v", rf, rf.logs)
							rf.setCommitIndex(i)
							break
						}
					}
				} else {
					// 3: either our term is stale or the logs do not match
					if reply.Term > rf.currentTerm {
						rf.currentTerm = reply.Term
						rf.convertTo(Follower)
					} else {
						rf.nextIndex[server] -= 1
					}
				}
				rf.mu.Unlock()
			}
		}(i)
	}
}


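  1. When building the AppendEntries arguments, pass a copy of entries[]. Otherwise the entries may be modified elsewhere while the heartbeat's entries are being encoded. For example: the Leader is sending a heartbeat, steps down to Follower for some reason, and has its log rewritten by the new Leader, so the log is read and written at the same time.
  2. A successful RPC means the log was replicated to the follower, so matchIndex[] and nextIndex[] must be updated. Once an entry is replicated on a majority, the leader advances its commitIndex: search from the last log position down to commitIndex for an N such that a majority of matchIndex[i] >= N and log[N].term == currentTerm, then set commitIndex = N. Only the leader's commitIndex requires a majority; a follower's commitIndex simply follows the leader's commitIndex.
  3. A rejected RPC has two possible causes. Either the term in the reply is greater than currentTerm, in which case the leader updates currentTerm and converts to follower; or the follower and leader fail the Log Matching check, in which case nextIndex is decremented by one.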
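
The commit rule in point 2 can be written as a standalone function, which makes the two conditions (majority and current term) easy to test in isolation. This is a sketch under stated assumptions: advanceCommitIndex is a hypothetical name, the log is reduced to a slice of terms, and matchIndex is assumed to include the leader's own slot set to its last log index.

```go
package main

import "fmt"

// advanceCommitIndex returns the largest N > commitIndex such that a majority
// of matchIndex values are >= N and terms[N] == currentTerm. If no such N
// exists it returns commitIndex unchanged.
// terms[i] is the term of log entry i (index 0 is the dummy entry).
func advanceCommitIndex(terms, matchIndex []int, commitIndex, currentTerm int) int {
	for n := len(terms) - 1; n > commitIndex; n-- {
		if terms[n] != currentTerm {
			continue // entries from older terms are never committed by counting
		}
		count := 0
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count > len(matchIndex)/2 {
			return n
		}
	}
	return commitIndex
}

func main() {
	terms := []int{0, 1, 1, 2, 2}      // leader log: slots 1..4
	matchIndex := []int{4, 4, 3, 1, 0} // five peers; slot 0 is the leader itself
	fmt.Println(advanceCommitIndex(terms, matchIndex, 1, 2))
	// A leader in term 3 with the same log cannot commit anything yet.
	fmt.Println(advanceCommitIndex(terms, matchIndex, 1, 3))
}
```

Older entries still become committed, but only indirectly: committing an entry from the current term commits everything before it.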

apply log

Once commitIndex has been updated, the entries in [lastApplied+1, commitIndex] can be applied to the state machine and lastApplied updated. This can be done in a separate goroutine.

func (rf *Raft) setCommitIndex(index int) {
	rf.commitIndex = index
	DPrintf("%v commit index %d", rf, rf.commitIndex)
	if rf.commitIndex > rf.lastApplied {
		DPrintf("%v apply from index %d to %d", rf, rf.lastApplied+1, rf.commitIndex)
		// copy the entries so the goroutine does not read rf.logs without the lock
		applyLogs := append([]LogEntry{}, rf.logs[rf.lastApplied+1:rf.commitIndex+1]...)
		// new goroutine
		go func(startIndex int, entries []LogEntry) {
			for i, entry := range entries {
				var msg ApplyMsg
				msg.Command = entry.Command
				msg.CommandIndex = startIndex + i
				msg.CommandValid = true
				rf.applyCh <- msg
				rf.mu.Lock()
				if msg.CommandIndex > rf.lastApplied {
					rf.lastApplied = msg.CommandIndex
				}
				rf.mu.Unlock()
			}
		}(rf.lastApplied+1, applyLogs)
	}
}


The numbered comments in the code correspond to the receiver implementation rules of the AppendEntries RPC.

func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	//DPrintf("------curFollower------%d----------", rf.me)
	// Rule 1: reject if the leader's term is stale
	if args.Term < rf.currentTerm {
		reply.Term = rf.currentTerm
		reply.Success = false
		return
	}
	if args.Term >= rf.currentTerm {
		rf.currentTerm = args.Term
		rf.convertTo(Follower)
	}
	// Once rule 1 passes (a heartbeat from the leader has arrived), reset the
	// election timer whether or not log replication succeeds
	rf.electionTimer.Reset(randomTime(ElectionTimeoutLower, ElectionTimeoutHigher))
	// Rule 2: reject if the Log Matching check fails. Two cases: the follower's
	// log is shorter than args.PreLogIndex, or the entry at args.PreLogIndex
	// has a different term
	curLastLogIndex := len(rf.logs) - 1
	if curLastLogIndex < args.PreLogIndex ||
		rf.logs[args.PreLogIndex].Term != args.PreLogTerm {
		reply.Term = rf.currentTerm
		reply.Success = false
		return
	}
	// Rule 3: when the follower's log extends past args.PreLogIndex, later
	// entries may conflict; find the first conflicting position and delete
	// everything from there on
	// find first position of unmatched log entry
	unmachedIndex := -1
	for idx := range args.LogEntries {
		if len(rf.logs)-1 < args.PreLogIndex+1+idx ||
			rf.logs[args.PreLogIndex+1+idx].Term != args.LogEntries[idx].Term {
			unmachedIndex = idx
			break
		}
	}
	// copy log
	// Rule 4: delete conflicting entries and append the new ones
	if unmachedIndex != -1 {
		rf.logs = rf.logs[:args.PreLogIndex+1+unmachedIndex]
		rf.logs = append(rf.logs, args.LogEntries[unmachedIndex:]...)
	}
	// Rule 5: update the follower's commitIndex from the leader's commitIndex, then apply
	if args.LeaderCommit > rf.commitIndex {
		if args.LeaderCommit <= len(rf.logs)-1 {
			rf.setCommitIndex(args.LeaderCommit)
		} else {
			rf.setCommitIndex(len(rf.logs) - 1)
		}
	}
	reply.Success = true
}


Before the optimization, each AppendEntries RPC can rule out only one index. With it, each RPC can rule out a whole term. Two fields are added to the reply.

type AppendEntriesReply struct {
	Term    int  // 2A
	Success bool // 2A

	// OPTIMIZE: see thesis section 5.3
	ConflictTerm  int // term of the follower's entry that conflicts with the Leader's log
	ConflictIndex int // first index of the conflicting term; the next request uses prevLogIndex = ConflictIndex - 1
}


// entries before args.PrevLogIndex might not match
// return false and ask Leader to decrement PrevLogIndex
if len(rf.logs) < args.PrevLogIndex+1 {
	reply.Success = false
	reply.Term = rf.currentTerm
	// optimistically thinks receiver's log matches with Leader's as a subset
	reply.ConflictIndex = len(rf.logs)
	// no conflict term
	reply.ConflictTerm = -1
	return
}
if rf.logs[args.PrevLogIndex].Term != args.PrevLogTerm {
	reply.Success = false
	reply.Term = rf.currentTerm
	// receiver's log in certain term unmatches Leader's log
	reply.ConflictTerm = rf.logs[args.PrevLogIndex].Term
	// expecting Leader to check the former term
	// so set ConflictIndex to the first one of entries in ConflictTerm
	conflictIndex := args.PrevLogIndex
	// apparently, since rf.logs[0] are ensured to match among all servers
	// ConflictIndex must be > 0, safe to minus 1
	for rf.logs[conflictIndex-1].Term == reply.ConflictTerm {
		conflictIndex--
	}
	reply.ConflictIndex = conflictIndex
	return
}


// log unmatch, update nextIndex[server] for the next trial
rf.nextIndex[server] = reply.ConflictIndex
// if term found, override it to
// the first entry after entries in ConflictTerm
if reply.ConflictTerm != -1 {
	for i := args.PrevLogIndex; i >= 1; i-- {
		if rf.logs[i-1].Term == reply.ConflictTerm {
			// in next trial, check if log entries in ConflictTerm matches
			rf.nextIndex[server] = i
			break
		}
	}
}
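
To see how many round trips the optimization saves, the follower-side and leader-side fragments above can be combined into a small simulation over logs reduced to their terms. The function names (check, backUp, probe), the reply type, and the example logs are all illustrative assumptions; only the logic mirrors the fragments above.

```go
package main

import "fmt"

// reply mirrors the two extra fields added to AppendEntriesReply.
type reply struct {
	success       bool
	conflictTerm  int
	conflictIndex int
}

// check is the follower side. logTerms[i] is the term of entry i, and
// index 0 is the dummy entry shared by every server (term 0).
func check(logTerms []int, prevIndex, prevTerm int) reply {
	if len(logTerms) <= prevIndex { // follower log too short: no conflict term
		return reply{conflictTerm: -1, conflictIndex: len(logTerms)}
	}
	if t := logTerms[prevIndex]; t != prevTerm {
		first := prevIndex
		for logTerms[first-1] == t { // walk back to the first entry of term t
			first--
		}
		return reply{conflictTerm: t, conflictIndex: first}
	}
	return reply{success: true}
}

// backUp is the leader side: pick the next nextIndex after a rejection.
func backUp(leaderTerms []int, prevIndex int, r reply) int {
	next := r.conflictIndex
	if r.conflictTerm != -1 {
		for i := prevIndex; i >= 1; i-- {
			if leaderTerms[i-1] == r.conflictTerm {
				next = i
				break
			}
		}
	}
	return next
}

// probe retries until the follower accepts and returns the final nextIndex
// together with the number of rejected requests.
func probe(leader, follower []int) (next, rejected int) {
	next = len(leader)
	for {
		prev := next - 1
		r := check(follower, prev, leader[prev])
		if r.success {
			return next, rejected
		}
		rejected++
		next = backUp(leader, prev, r)
	}
}

func main() {
	leader := []int{0, 1, 1, 1, 4, 4, 5, 5, 6, 6, 6}
	short := []int{0, 1, 1, 1, 4}                       // missing entries
	diverged := []int{0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3} // extra terms 2 and 3
	fmt.Println(probe(leader, short))
	fmt.Println(probe(leader, diverged))
}
```

With these assumed logs, the short follower is located after a single rejection, where decrementing nextIndex one step at a time would need one rejection per missing slot.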


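Three variables must be persisted: log, votedFor, and currentTerm. They are persisted at the following points:

  • when a new LogEntry that needs to be replicated is appended to the Leader
  • when the Leader handles an AppendEntries reply and has to change its own term
  • when a Candidate handles a RequestVote reply and has to change its own term
  • when a Receiver finishes handling an AppendEntries or RequestVote request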
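
As a rough sketch of what persisting these three variables involves, the example below serializes them to a byte slice and reads them back. It uses the standard library's encoding/gob as a stand-in; the lab's own encoder and Persister are not shown in this post, and persistentState, encodeState, decodeState, and the string-typed Command are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// LogEntry here carries a string command; this is an assumption for the sketch.
type LogEntry struct {
	Term    int
	Command string
}

// persistentState groups the three variables that must survive a crash.
type persistentState struct {
	CurrentTerm int
	VotedFor    int
	Logs        []LogEntry
}

// encodeState serializes the state into bytes that could be handed to stable storage.
func encodeState(s persistentState) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeState restores the state from bytes written by encodeState.
func decodeState(data []byte) (persistentState, error) {
	var s persistentState
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&s)
	return s, err
}

func main() {
	before := persistentState{
		CurrentTerm: 3,
		VotedFor:    -1,
		Logs:        []LogEntry{{Term: 0}, {Term: 3, Command: "x=1"}},
	}
	data, err := encodeState(before)
	if err != nil {
		panic(err)
	}
	after, err := decodeState(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", after)
}
```

Fields must be exported for gob to encode them, which is why the struct uses capitalized names even though the Raft struct's own fields are lowercase.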




Q: Are there systems like Raft that can survive and continue to
operate when only a minority of the cluster is active?

A: Not with Raft’s properties. But you can do it with different
assumptions, or different client-visible semantics. The basic problem
is split-brain – the possibility of more than one server acting as
leader. There are two approaches that I know of.

If somehow clients and servers can learn exactly which servers are live
and which are dead (as opposed to live but partitioned by network
failure), then one can build a system that can function as long as one
is alive, picking (say) the lowest-numbered server known to be alive.
However, it’s hard for one computer to decide if another computer is
dead, as opposed to the network losing the messages between them. One
way to do it is to have a human decide – the human can inspect each
server and decide which are alive and dead.

The other approach is to allow split-brain operation, and to have a way
for servers to reconcile the resulting diverging state after partitions
are healed. This can be made to work for some kinds of services, but has
complex client-visible semantics (usually called “eventual
consistency”). Have a look at the Bayou and Dynamo papers which are
assigned later in the course.

Q: In Raft, the service which is being replicated is not available to
the clients during an election process. In practice how much of a
problem does this cause?

A: The client-visible pause seems likely to be on the order of a tenth of a
second. The authors expect failures (and thus elections) to be rare,
since they only happen if machines or the network fails. Many servers
and networks stay up continuously for months or even years at a time, so
this doesn’t seem like a huge problem for many applications.

Q: Are there other consensus systems that don’t have leader-election
pauses?

A: There are versions of Paxos-based replication that do not have a leader
or elections, and thus don’t suffer from pauses during elections.
Instead, any server can effectively act as leader at any time.

Q: The paper mentions that Raft works under all non-Byzantine
conditions. What are Byzantine conditions and why could they make Raft
fail?

A: “Non-Byzantine conditions” means that the servers are fail-stop:
they either follow the Raft protocol correctly, or they halt. For
example, most power failures are non-Byzantine because they cause
computers to simply stop executing instructions; if a power failure
occurs, Raft may stop operating, but it won’t send incorrect results
to clients.

Byzantine failure refers to situations in which some computers execute
incorrectly, because of bugs or because someone malicious is
controlling the computers. If a failure like this occurs, Raft may
send incorrect results to clients.

Most of 6.824 is about tolerating non-Byzantine faults. Correct
operation despite Byzantine faults is more difficult; we’ll touch on
this topic at the end of the term.

Q: What if a client sends a request to a leader, then the leader
crashes before sending the client request to all followers, and the
new leader doesn’t have the request in its log? Won’t that cause the
client request to be lost?

A: Yes, the request may be lost. If a log entry isn’t committed, Raft
may not preserve it across a leader change.

That’s OK because the client could not have received a reply to its
request if Raft didn’t commit the request. The client will know (by
seeing a timeout or leader change) that its request wasn’t served, and
will re-send it.

The fact that clients can re-send requests means that the system has
to be on its guard against duplicate requests; you’ll deal with this
in Lab 3.

Q: If there’s a network partition, can Raft end up with two leaders
and split brain?

A: No. There can be at most one active leader.

A new leader can only be elected if it can contact a majority of servers
(including itself) with RequestVote RPCs. So if there’s a partition, and
one of the partitions contains a majority of the servers, that one
partition can elect a new leader. Other partitions must have only a
minority, so they cannot elect a leader. If there is no majority
partition, there will be no leader (until someone repairs the network).

Q: Suppose a new leader is elected while the network is partitioned,
but the old leader is in a different partition. How will the old
leader know to stop committing new entries?

A: The old leader will either not be able to get a majority of
successful responses to its AppendEntries RPCs (if it’s in a minority
partition), or if it can talk to a majority, that majority must
overlap with the new leader’s majority, and the servers in the overlap
will tell the old leader that there’s a higher term. That will cause
the old leader to switch to follower.

Q: When some servers have failed, does “majority” refer to a majority
of the live servers, or a majority of all servers (even the dead ones)?

A: Always a majority of all servers. So if there are 5 Raft peers in
total, but two have failed, a candidate must still get 3 votes
(including itself) in order to be elected leader.

There are many reasons for this. It could be that the two “failed”
servers are actually up and running in a different partition. From
their point of view, there are three failed servers. If they were
allowed to elect a leader using just two votes (from just the two
live-looking servers), we would get split brain. Another reason is
that we need the majorities of any two leaders to overlap in at least
one server, to guarantee that a new leader sees the previous term
number and any log entries committed in previous terms; this requires
a majority out of all servers, dead and alive.
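
The arithmetic behind this answer is small enough to check directly. The sketch below uses two hypothetical helpers, majority and canElect, to show why three live servers out of five can elect a leader while a two-server partition cannot.

```go
package main

import "fmt"

// majority returns the number of votes needed in a cluster of n servers,
// counting all servers whether or not they are currently reachable.
func majority(n int) int { return n/2 + 1 }

// canElect reports whether a group of reachable servers (candidate included)
// is large enough to elect a leader in a cluster of total servers.
func canElect(total, reachable int) bool { return reachable >= majority(total) }

func main() {
	fmt.Println(majority(5))    // votes needed with 5 peers
	fmt.Println(canElect(5, 3)) // the three live servers
	fmt.Println(canElect(5, 2)) // the two "failed" servers in their own partition
}
```

Because 2 * majority(n) is always greater than n, any two majorities share at least one server, which is the overlap the answer relies on.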

Q: What if the election timeout is too short? Will that cause Raft to
malfunction?

A: A bad choice of election timeout does not affect safety, it only
affects liveness.

If the election timeout is too small, then followers may repeatedly
time out before the leader has a chance to send out any AppendEntries.
In that case Raft may spend all its time electing new leaders, and no
time processing client requests. If the election timeout is too large,
then there will be a needlessly large pause after a leader failure
before a new leader is elected.

Q: Can a candidate declare itself the leader as soon as it receives
votes from a majority, and not bother waiting for further RequestVote
replies?

A: Yes – a majority is sufficient. It would be a mistake to wait
longer, because some peers might have failed and thus not ever reply.

Q: Can a leader in Raft ever stop being a leader except by crashing?

A: Yes. If a leader’s CPU is slow, or its network connection breaks,
or loses too many packets, or delivers packets too slowly, the other
servers won’t see its AppendEntries RPCs, and will start an election.



