Raft Consensus Protocol

atonse · on Nov 4, 2021

I’m going to ask folks here a question.

Recently our consul and nomad clusters both blew up within a day of each other.

In some crazy twist of luck, Amazon removed the underlying instances that happened to run the leaders to both clusters. This was in our QA environment so they were all t2.nanos.

In this situation, shouldn’t we just expect the other two nodes to hold an election and elect a new leader? Isn’t this the most plain vanilla use case there is?

In both situations, the clusters got stuck without leaders indefinitely because they got stuck trying to ping the leader to see if they could hold an election (how could they?? the whole point of the election is because the leader disappeared). And the only way to recover was to do some insane dance of building a JSON file and hard coding IPs.

Based on some research, it seems like we need to hard code IPs to avoid this in the future. This seems like a huge smell and goes against everything I’ve ever read about raft and the idea of self healing clusters.

What am I missing here? I don’t remember ever having to deal with this with mongodb in 2014.

kamikaz1k · on Nov 4, 2021

> A Raft cluster of 3 nodes can tolerate a single node failure while a cluster of 5 can tolerate 2 node failures.

The document supports what you were expecting. So... I too am curious what happened to you.

cyberge99 · on Nov 4, 2021

You can use the Go binding to bind to the first private adapter in bind_addr: "bind_addr": "{{ GetPrivateInterfaces | include \"network\" \"172.31.0.0/24\" | attr \"address\" }}",

It sounds like your team had hard coded IPs and when the consensus was lost, they had to manually edit the peers.json file to remove the failed instance.

Also, how many consul servers were running? If it's an even number, the odds of getting into a split-brain scenario is high.

atonse · on Nov 4, 2021

3 were running and our consul config file does have the server’s IP in it, and is generated at boot time, so joy really hard coded. But we ended up just wiping everything out to re-bootstrap it.

If you know any devops consultants that have solid experience with consul/other hashicorp tools, we’d love to talk to them! (US only)

Intermernet · on Nov 4, 2021

The current leader of consul cluster can be found with with a GET call to `/status/leader`, but that should be lower level than you need to do manually. Consul should be tracking this and updating the cluster. You should not be hardcoding IP addresses at all. Instead you can use Consul's inbuilt DNS load-balancing, or more advanced techniques. Have a look at https://www.hashicorp.com/blog/load-balancing-strategies-for... for more details.

Kinrany · on Nov 4, 2021

Surely the number becomes odd every time a Consul server goes down?

toolz · on Nov 4, 2021

You need an odd number of max nodes, not an odd number of live nodes. Raft nodes require knowing they are in the majority to be healthy, if you have a network partition and half are on each side you get split brain because neither side can know for sure which side should be considered authoritative/healthy.

NavinF · on Nov 4, 2021

I think what he meant to ask is “Are 4 nodes less reliable than 3?” and I’m pretty sure the answer is “no” despite the other comment implying otherwise. AFAIK, even nodes can hurt latency, but not reliability.

mirekrusin · on Nov 4, 2021

Counter-intuitively 4 nodes are less reliable than 3 nodes for majority vote systems. Both can tolerate single failed node. However probability of 2nd node failure is 50% higher for 4 node setup than for 3 node setup.

That's because 2nd node failure probablity for 4 node system is P(a) + P(b) + P(c) = 3P and for 3 node system is P(a) + P(b) = 2P; it's 3P vs 2P probablity.

It's more than 50% overall because similar probability happens for first node failure - it's more likely that one out of 4 nodes will fail than one out of 3.

In other words you have to guarantee one more node running without any reliability gains and that guarantee costs reliability.

eklavya · on Nov 4, 2021

It depends on the network topology (likelihood not theoretical) but with the amount of split brains I have witnessed, I would say definitely yes, 4 is less reliable than 3.

Kinrany · on Nov 4, 2021

Gah, I meant even, not odd!

That is, if having an even number of nodes causes problems, you'll have problems every time one of your odd number of nodes goes down!

tick_tock_tick · on Nov 4, 2021

“Are 4 nodes less reliable than 3?”

The answer is 3 is no less reliable then 4.

sriram_malhar · on Nov 4, 2021

That's a bogus implementation of leader election and consensus. You aren't missing anything. They are.

0des · on Nov 4, 2021

For those interested in a comparison to another popular consensus protocol, Paxos, see the two compared here by Heidi Howard and Richard Mortier, both of Cambridge UK: https://www.ics.uci.edu/~cs237/reading/Paxos_vs_Raft.pdf

Even for the uninitiated, the content is surprisingly digestible. The comparison and a brief description of the two begins in section 3.

wahern · on Nov 4, 2021

As always[1], this paper compares MultiPaxos, not Paxos. Paxos is more like the conceptual underpinning to a family of protocols, one leaf of which arguably includes Raft itself. A summary and evolutionary timeline (up to 2018) of the various algorithms in the Paxos lineage is https://vadosware.io/post/paxosmon-gotta-concensus-them-all/

Raft looks good next to MultiPaxos, but MultiPaxos is hardly the state of the art in the mainline Paxos lineage. MultiPaxos was published in 2001, over a decade before Raft. EPaxos is more of a contemporary to Raft, and arguably more in the spirit of Paxos as originally conceived--truly distributed, leaderless consensus.

[1] Practically every paper discussing Raft, including the original Raft paper(s), aliases Paxos to MultiPaxos.

_benedict · on Nov 4, 2021

Lamport intended MultiPaxos to be called Paxos, and it was described in his original Part Time Parliament paper which was written in 89 and published in 98.

These days I think Paxos refers both to what Lamport thought of as single decree Paxos, or Paxos leader election, and also the lineage of protocols that derive from it, as well as sometimes specifically MultiPaxos. But it’s OK, words can have multiple meanings.

For what it’s worth even EPaxos is long in the tooth now. There have been several advancements in the intervening years, the latest being Accord[1], a protocol the Cassandra community has developed for cross shard distributed transactions.

[1] https://cwiki.apache.org/confluence/download/attachments/188...

(Full disclosure: I’m one of the authors)

hardwaresofton · on Nov 4, 2021

Did I miss something newer from 2018+? I did an update recently thanks to some suggestions from HN:

https://vadosware.io/post/paxosmon-2-the-journey-continues/

I haven't read any papers recently but if you know of some that are worth giving a read, please let me know!

zuj · on Nov 4, 2021

For those looking for how raft works

http://thesecretlivesofdata.com/raft/

cyberge99 · on Nov 4, 2021

This was a very informative visualization of raft. Thank you!

jimsimmons · on Nov 4, 2021

Are there higher throughout versions of Paxos that are not multi-paxos/raft? Closest thing I can think of is Egalitarian Paxos but it's super complex and introduces a bunch of other trade offs.

I don't find multi paxos appealing because it feels like a step back from leaderless vanilla paxos. I have been itching to implement a neat consensus protocol that's a cut above the common ones and has better throughout than etcd. Unfortunately I haven't found anything

_benedict · on Nov 4, 2021

Most of the trade-offs inherent to EPaxos are resolved by Accord[1], the protocol being developed by the Apache Cassandra community (myself included). There's also Tempo[2] and Caesar[3], amongst others, but they retain some of the trade-offs (and introduce their own). All of these protocols are also capable of supporting multi-shard operations, making them a significant advance over MultiPaxos and Raft.

[1] https://cwiki.apache.org/confluence/download/attachments/188...

[2] https://arxiv.org/abs/2104.01142

[3] https://arxiv.org/abs/1704.03319

jimsimmons · on Nov 4, 2021

Didn't find any benchmarks in the Accord paper. Do you have pointers?

_benedict · on Nov 4, 2021

It's still under development, this paper proposes it as an enhancement for Cassandra, so benchmarks are a way off. Given its properties it is likely to have similar performance to Tempo and Caesar for consensus (perhaps lower throughput than Tempo but better latency than Caesar). It is not dissimilar to Calvin for multi-shard execution, trading away the NxN sequencing layer (and poor failure properties) for dependency tracking.

jimsimmons · on Nov 5, 2021

Perfect. Will dig through it! I didn't realise leaderless protocols are less robust and the paper sets up the situation quite well.

_benedict · on Nov 7, 2021

Thanks, glad to hear it is helpful. Do let me know if you have any feedback, particularly that might help clarify things.

dnautics · on Nov 4, 2021

Try VSR (Viewstamped replication).

Also supposedly Heidi Howard was working on a consensus protocol appropriate for geographically replicated datacenters, which have fast almost always reliable networks intra-datacenter but fallible slow networks inter-datacenter... That sounds amazing to me.

jimsimmons · on Nov 4, 2021

I've come across a few hierarchical consensus protocol papers. AFAIU, you'll gain immensely in terms of local latency but lose in terms of global latency.

But yeah, something in this space would be very interesting. For example Kubernetes and nomad deploy multiple container pods per machine. Instead of sending heartbeats from each pod it would make sense to batch heartbeats of all pods in each machine and send a single beat instead.

hawk_ · on Nov 4, 2021

> hierarchical consensus protocol papers

do you have any examples handy? would like to explore that direction.

jimsimmons · on Nov 4, 2021

https://arxiv.org/abs/2004.06215

https://nsl.cs.rpi.edu/nagawiecki_ms_2021.pdf

hawk_ · on Nov 4, 2021

cheers! i will check these out.

ryanworl · on Nov 4, 2021

https://mwhittaker.github.io/publications/compartmentalized_...

This is both recent enough that you've probably not seen it and extremely relevant to what you're interested in!

_vvhw · on Nov 4, 2021

Aleksey Charapko's DistSys Reading Group had an interesting discussion on RAFT, Paxos and Viewstamped Replication recently, where I was able to present James Cowling's 2012 revision of VSR: http://charap.co/reading-group-viewstamped-replication-revis...

RAFT is more within the tradition and family of Viewstamped Replication than Multi-Paxos, and VSR has some optimizations beyond RAFT that are interesting, especially where a real-world storage fault model is at play.

jimsimmons · on Nov 4, 2021

Very interesting, I was under the impression that VSR was but Raft by another name. Not sure what caveats come with the alleged "in memory" nature of VSR but if it indeed doesn't require disk persistence, then it would bring a massive throughout boost. It'd be kinda surprising for the industry and academia to have sidelined it relative to Raft and Multi Paxos in that case.

_vvhw · on Nov 4, 2021

It's interesting as you say, since Brian Oki's VSR would pioneer consensus in 1988, a year before Paxos, and James Cowling's and Barbara Liskov's 2012 revision of VSR would also be ahead of RAFT by 2 years, and with an even more intuitive presentation since the view change in 2012 VSR is completely deterministic and not random.

The talk I gave linked above goes into 2012 VSR and disk persistence in much more detail, taking into account our experience implementing VSR for TigerBeetle: https://www.tigerbeetle.com

You may also especially enjoy these two recent back-to-back interviews with Brian Oki and James Cowling, paying tribute to the pioneering protocol and especially the people behind it, with Barbara Liskov at the center connecting 1988 and 2012: https://youtu.be/_Jlikdtm4OA?t=708

The interviews are a fascinating look at the history of consensus, and the background around the design decisions that went into the protocol over the years. James Cowling also shares some details of his experience leading the Magic Pocket storage infrastructure team at Dropbox that moved Dropbox off AWS.

andreev_io · on Nov 4, 2021

Shameless plug: I implemented a lightweight Raft crate in Rust and am looking for contributors to implement log compaction, one of the few things that's missing. https://github.com/andreev-io/little-raft.

bborud · on Nov 4, 2021

Which Go implementation if Raft do people recommend? (I was led to believe that none of the widely adopted implementations were "smooth sailing" by a colleague who has spent a lot of time trying to make use of Raft).

jekude · on Nov 4, 2021

In general HashiCorp's repos are high quality:

https://github.com/hashicorp/raft

Example application: https://github.com/Jille/raft-grpc-example

butterisgood · on Nov 4, 2021

Can Raft scale to 100s of nodes? It doesn't seem likely, or at least, I'm not seeing people trying to.

I suppose one can have a subset of a cluster participating in Raft, but then the "raft group membership" itself is a consensus problem isn't it?

It's all turtles...

teraflop · on Nov 4, 2021

> I suppose one can have a subset of a cluster participating in Raft, but then the "raft group membership" itself is a consensus problem isn't it?

Yes, this is discussed briefly in the article and more thoroughly in the original Raft paper.

It's possible to do a "membership change" operation, whereby the replicas agree to change the set of nodes in the cluster. It's slightly trickier than achieving consensus on an ordinary state machine operation, because you have to ensure that quorums from both the old and new node lists agree to the change.

With this feature, you can make a "self-healing" cluster. Say you have a pool of 100 physical machines, with 5 of them running Raft nodes. If one node fails, and the failure appears to be transient, you can just wait for it to come back up. Otherwise, you can start a new node, wait for it to come up, and perform a membership change to swap it for the failed one.

This provides availability as long as you don't have a burst of more than N/2 failures faster than the cluster can recover.

butterisgood · on Nov 4, 2021

Thanks! This is a great, thought-provoking answer.

I've only worked with variants of paxos which were greatly simplified, and wondered how Raft actually simplified things here.

I feel like Raft, despite being easier to explain quickly, really doesn't seem much simpler than a basic explanation of Paxos. But perhaps that's because of what I've already worked with.

I've got an itch to explore Viewstamped Replication as well...

hawk_ · on Nov 4, 2021

since each node is a replica, 100s of replica nodes would be wasteful. for scaling say something like 100s of "shards" where each shard is a raft cluster of 3-5 nodes is more appropriate.

butterisgood · on Nov 4, 2021

It depends really... If you need an "offline readable copy" of something, then you need to be able to enter and exit the consensus group somehow. Perhaps that's replaceable though by a client/server relationship with the Raft cluster and keep things separate? (Probably, though one of the systems I've worked with didn't have that as an option - though it likely could have been done as such).

This was only really feasible because the value under consensus was very small (far less than 1MB).

In a system where the value(s) are much larger, I think what you've proposed as a system of Raft clusters hosting shards makes a lot of sense too.

Thanks for the considerate answer!

jordiburgos · on Nov 4, 2021

Any good *practical* tutorial on how to use it?

Preferably in Go.

tubby12345 · on Nov 4, 2021

Fun fact: in the extremely highly cited "In Search of an Understandable Consensus Algorithm (Extended Version)" there's a gap.

In the otherwise very good figure 2:

>If successful: update `nextIndex` and `matchIndex` for follower

but it is never stated how either `matchIndex` or `nextIndex` should be updated (on successful response). In section 5.3 it's only stated that "Eventually `nextIndex` will reach a point where the leader and follower logs match." I went to the TLA spec in Diego's github repo for his dissertation and found the answer to be

  HandleAppendEntriesResponse(i, j, m) ==
      /\ m.mterm = currentTerm[i]
      /\ \/ /\ m.msuccess \* successful
            /\ nextIndex'  = [nextIndex  EXCEPT ![i][j] = m.mmatchIndex + 1]
            /\ matchIndex' = [matchIndex EXCEPT ![i][j] = m.mmatchIndex]

but this is predicated on followers responding with `mmatchIndex`, which is not included in the response as stated in the "In Search of" paper.

I discovered this while TAing a class on consensus protocols, during which I built an implementation (in rust) and though I was confused about `matchIndex` as well, I just did what I thought was the natural thing, i.e. on receipt of a successful `AppendEntries` response I do

  *(self.leader_state.next_index.get_mut(node_id)) = self.get_last_log_index() + 1;
  *(self.leader_state.match_index.get_mut(node_id) = self.get_last_log_index();

I emailed Diego to find out if this was a reasonable solution but he told me he didn't remember the protocol clearly enough to advise. Nice to know everyone forgets things sometimes :)

valzam · on Nov 4, 2021

I took the MIT Distributed Systems course and also got stuck several times on how exactly something should be done. The Raft paper is good but in practice I found it quite difficult to reason about when certain things should be allowed to happen. I.e. when a "supposed" leader receives a ping from another leader in a newer epoch, how often do I check that? Should I just interrupt what I was doing? Only after I finished processing a command?

I am probably overthinking it and should just keep it simple and have the leader produce "wasted" entries until it realises that there is another accepted leader.

I tried implementing Raft without looking at any existing implementations, but that is probably not a good idea. It's very easy to get lost in the weeds.

jimsimmons · on Nov 4, 2021

The moment you see another leader with higher epoch you cecede and become a follower. This should be part of message recieve logic and has highest priority. Any actions you might take are moot anyway because higher epoch leader always the source of truth

valzam · on Nov 4, 2021

That is what I was thinking. But does that mean you just pepper your command processing logic with guards that you are still actually the leader?

I think one mistake I made is trying to run the different parts of the protocol in parallel, i.e. processing a message, accepting messages, sending heartbeats etc. It probably makes more sense for the raft part to basically be single threaded.

jimsimmons · on Nov 4, 2021

> That is what I was thinking. But does that mean you just pepper your command processing logic with guards that you are still actually the leader?

Yep. It's annoying to code it this way but you have to. One way to make this relatively clean is via throwing and catching exceptions if your language has good support for it.

Also agree re single-threadedness. In some ways raft assumes a single state machine loop running on a node at any point. Concurrency introduces additional non determinism to the state machine and isn't really accounted for in the formal spec.

I've heard great things about etcd source and how clean it is. I've never studied it but it might give you some pointers on best practices.

_vvhw · on Nov 4, 2021

If you can avoid multithreading in the design of your consensus protocol, this also means that your implementation becomes more deterministic.

If you can have the design fully deterministic by stubbing out the message passing, storage and time source (by representing time as ticks), then you can do deterministic simulation fuzzing like we do in TigerBeetle, which is a high velocity testing technique for finding/reproducing/fixing interesting (and rare) distributed systems bugs: https://github.com/coilhq/tigerbeetle#simulation-tests

tubby12345 · on Nov 5, 2021

My implementation is multithreaded and parallel (ie I have multiple cores). The way I handle this is to have the leader check if it's still the leader before it does anything. That flag (`iamtheleader`) is set and read behind a mutex. I guess there's a race condition (old leader initiates read right before rpc sets that leader has ceded) but I never hit it. I guess the right way to do it is that the rpc that handles ceding should kill other rpc threads but that seems fraught. I wonder if there's some kind of interrupt request handler type thing in threading models...

ditsing · on Nov 4, 2021

In my implementation, in each AppendEntries request, the leader includes X commits starting from nextIndex. The follower either accepts all X commits, or reject all of them. If the follower accepted all commits, the leader moves nextIndex to (X + nextIndex when the request was sent). See code here (matchIndex = nextIndex + X - 1): https://github.com/ditsing/ruaft/blob/master/src/sync_log_en....