MIT 6.824 Lec5&6. Fault Tolerance: Raft_随笔

MIT 6.824 Lec5&6. Fault Tolerance: Raft

文章目录

Split-Barin
- Fault-tolerant System
- What is Split-brain
- Majority Voting
Raft
- Leader Election
- Raft Log
- Persistence
- Log Compaction and Snapshots
Linearizability

Split-Barin Fault-tolerant System

目前我们已经学习了多个Fault-tolerant System：

MapReduce：会对compute tasks进行复制，但是依赖single master来实现该机制
GFS：GFS会对data进行复制，但是依赖single master来选择primary
VM-FT：VM-FT会对服务进行复制，但是依赖test-and-set single server

单个结点来进行重要决策的好处：可以防止split-brain现象。

What is Split-brain

假设我们将VM-FT中的test-and-set server进行备份，实例如下：

clients	     servers
c1 			 s1
----------------- network partition
c2           s2

假设由client c1,c2和server s1,s2，但是由于网络故障c1只能和s1通信，c2只能和s2通信，这样就会造成c1和c2都认为自己获得了test-and-set server上的primary lock，从而出现两个primary node，这就是所谓的split-brain现象。

Majority Voting

一种多servers间解决split brain的方法是majority voting。任何系统的决策都需要集群间多数servers的同意。Majority Voting能解决split-brain的重要原因是，在network partitioning中，majority partitioning有且仅有一个。

majority voting具有一个重要的性质，就是任何一个leader的决策log中肯定包含之前leader的log，因为每次投同意票的servers中肯定会有相同servers。

基于Majority Voting的系统主要有两类：Paxos和Raft。

Raft Leader Election

Leader保证了所有servers以相同的顺序执行相同的commands。 Raft用term number来标识leader，每个term内只能有一个leader。每个server都有一个random election timeout，定时器时间一到就开始新一轮选举。new leader选出后，通过heart break来抑制新的选举出现（比较term），servers收到heart break会重置election timeout。

选举失败可能由两个原因造成：

less than a majority of servers are reachable
simultaneous candidates split the vote, none gets majority

选举失败时会在election timeout后开始新的一轮选举。election timeout的选择通常需要考虑以下因素：

at least a few heartbeat intervals (in case network drops a heartbeat) to avoid needless elections, which waste time
random part long enough to let one candidate succeed before next starts
short enough to react quickly to failure, avoid long pauses

如果因为network partition导致old leader不知道new leader的存在，可能会导致日志的不一致性。但是由于old leader持有的term小于网络中的大多数servers，因此client发向old leader的请求都不会被处理，所以old leader并不会添加新的log entry。

Raft Log

当leader启动时，client只与leader进行交互，client看不见followers的状态和日志。为了保证一致性，如果任何一个server在index处执行了任何一个log entry，则其他server不会在相同的index处执行其它的log entry。

例子如下所示：

S1: put(k1,v1) | put(k1,v2) 
S2: put(k1,v1) | put(k2,x)

Raft不能允许S1或S2执行第二个命令，因为它们的log entry不一致。

在实际的分布式系统运行中，如果出现leader crash，则可能出现多个server上日志不一致的情况。

S1: 3
S2: 3 3
S3: 3 3
----- after crash -----
S1:  3
S2:  3  3  4
S3:  3  3  5

可以看见出现了相同的log index下不同log entry的情况。Raft通过roll back机制来解决不同servers的冲突日志，roll back机制如下所示：

each live follower deletes tail of log that differs from leader
then each live follower accepts leader’s entries after that point
now followers’ logs are identical to leader’s log

那么roll back会不会丢掉已经提交的log entry呢？Raft通过leader restriction来保证leader包含所有已经提交的日志，这样在roll back时就不会丢掉提交日志 Raft的leader restriction如下:

candidate has higher term in last log entry
candidate has same last term and same length or longer log

只有满足以上两个条件之一的candidate才会获得赞成票，从而有机会成为leader。

示例如下：

  example:
    S1: 5 6 7
    S2: 5 8
    S3: 5 8

S2和S3不会给S1投票，因此新的leader只能是S2或S3，符合Raft限制条件的leader包含了所有可能的提交log entry。

Raft的roll back机制中，为了找到日志的分歧点需要将leader和follower的log entry逐一比对，这十分消耗时间。Raft提出了一种快速进行分歧点查找的方法。下面列出了roll back时可能出现的所有情况。

      Case 1      Case 2       Case 3
  S1: 4 5 5       4 4 4        4
  S2: 4 6 6 6 or  4 6 6 6  or  4 6 6 6

 S2 is leader for term 6, S1 comes back to life, S2 sends AE for last 6, AE has prevLogTerm=6
 rejection from S1 includes:
    XTerm:  term in the conflicting entry (if any)
    XIndex: index of first entry with that term (if any)
    XLen:   log length

对于所有的情况，快速roll back的策略如下。

  Case 1 (leader doesn't have XTerm):
    nextIndex = XIndex
  Case 2 (leader has XTerm):
    nextIndex = leader's last entry for XTerm
  Case 3 (follower's log is too short):
    nextIndex = XLen

Persistence

在Raft中，需要持久化存储一些信息，以便于server crash重启后可以恢复之前的状态继续使用。Raft中持久化保存了一下信息:

log[]：如果server在crash前属于majority，保存log可以保证后续的leader一定包含commit entries。如果不及时的对log进行持久化存储，可能导致commit entries缺失。
currentTerm：保证term的正常增加，从而确保每个term内最多有一个leader，并检测stale data和stale leader。
votedFor：防止出现candidate投票之后crash，restart之后又重新进行投票的情况。

持久化存储通常会成为分布式系统的性能瓶颈，hard disk写入大概需要10ms，SSD大概需要0.1ms。加速持久化存储通常有一下集中方式：

batch多个log一次性写入
使用battery-backed RAM，而不是disk

Log Compaction and Snapshots

随着分布式系统的运行，日志可能会变得越来越大，而很久之前的数据或许并不需要，Raft会丢弃一些数据来压缩日志大小。Raft中以下日志不能丢弃：

un-executed entries – not yet reflected in the state
un-committed entries – might be part of leader’s majority

Raft中每个server定期的构建Snapshot来丢弃多余的日志，并将Snapshot持久化存储到disk。Snapshot中包含了存储的最后一个log index，Raft会丢掉所有该log index之前的日志，server可以在任何时间创建Snapshot。

当server crash并restart后，其执行流程如下：

service reads snapLshot from disk
Raft reads persisted log from disk
service tells Raft to set lastApplied to last included index to avoid re-applying already-applied log entries

如果follower的log落后于leader的start log，nextIndex会被置于start index，此时leader会发送InstallSnapshot RPCs给server。

Linearizability

由于分布式系统中需要并发的处理大量请求，如何判断请求的顺序和返回的结果是正确的？线性一致性是一种对分布式系统输出表现的定义，符合该标准的分布式系统多个client在进行读请求时，看到的结果都是相同的。

线性一致性要求可以找到一种处理operation的顺序，该顺序要求如下：

time constraint： matches real-time (for non-overlapping ops)
value constraint： each read sees the value from the write preceding it in the order.

Raft是一个可以保证线性一致性（强一致性）的公式算法。

下面用几个例子来说明如何判断分布式系统是否符合线性一致性。

example history 1:
  |-Wx1-| |-Wx2-|
    |---Rx2---|
      |-Rx1-|
"Wx1" means "write value 1 to record x"
"Rx1" means "a read of record x yielded value 1"

为了满足time constraint，Wx1必须在Wx2之前执行，为了满足value constraint，Wx1必须在Rx1之前执行，Wx2必须在Rx2之前执行，因此可能的执行序列如下：Wx1 Rx1 Wx2 Rx2。该执行结果符合线性一致性。

example history 2:
  |-Wx1-| |-Wx2-|
    |--Rx2--|
              |-Rx1-|
draw the constraint arrows:
  Wx1 before Wx2 (time)
  Wx2 before Rx2 (value)
  Rx2 before Rx1 (time)
  Rx1 before Wx2 (value)

按照约束，发现出现constraint cycle，无法排出线性执行序列，不符合线性一致性。

example history 3:
|--Wx0--|  |--Wx1--|
            |--Wx2--|
        |-Rx2-| |-Rx1-|
order: Wx0 Wx2 Rx2 Wx1 Rx1

符合线性一致性。

example history 4:
|--Wx0--|  |--Wx1--|
            |--Wx2--|
C1:     |-Rx2-| |-Rx1-|
C2:     |-Rx1-| |-Rx2-|
what are the constraints?
  Wx2 then C1:Rx2 (value)
  C1:Rx2 then Wx1 (value)
  Wx1 then C2:Rx1 (value)
  C2:Rx1 then Wx2 (value)

C1和C2的结果无法形成串行序列，不满足线性一致性。

example history 5:
suppose clients re-send requests if they don't get a reply
in case it was the response that was lost:
  leader remembers client requests it has already seen
  if sees duplicate, replies with saved response from first execution
but this may yield a saved value from long ago -- a stale value!
what does linearizabilty say?
C1: |-Wx3-|          |-Wx4-|
C2:          |-Rx3-------------|
order: Wx3 Rx3 Wx4

该例子要求我们在线性一致性下，需要对重复请求进行正确的处理。即使C2：Rx的 *** 作在Wx4后进行了重复请求，也应该返回第一次读到的x=3，不然就会出现一致性错误。

欢迎分享，转载请注明来源：内存溢出

原文地址: http://www.outofmemory.cn/zaji/5690409.html

MIT 6.824 Lec5&6. Fault Tolerance: Raft

发表评论

评论列表（0条）