Notes on Several keepalived Configuration Issues

Configuration

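keepalived is commonly deployed as a master/backup pair: one node is configured as MASTER and the other as BACKUP. When the master fails, the backup automatically takes over as master. However, once the original master recovers it preempts the role again, causing an unnecessary second failover. To avoid this, configure both keepalived nodes with an initial state of BACKUP, give them different priorities, and set nopreempt on the higher-priority node so that it does not take the role back after it recovers.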

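The configuration below is straightforward: the VIP is 192.168.0.18, both machines start in state BACKUP, machineA has priority 15 and machineB has priority 13, and the script /root/1.sh is used to check whether the service is healthy.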

machineA configuration:

[root@iloqg8n3yb9mje ~]# cat /etc/keepalived/keepalived.conf 
! Configuration File for keepalived
global_defs {
router_id iloqg8n3yb9mje
script_user root
enable_script_security
}

vrrp_script check_mysql {
script "/root/1.sh"
interval 10
timeout 5
weight 5
fall 3
}
vrrp_instance VI_1 {
state BACKUP
nopreempt
interface eth0
virtual_router_id 18
priority 15
advert_int 1 # interval between VRRP advertisements (multicast heartbeats), in seconds; default 1; must be identical on both nodes
authentication {
auth_type PASS
auth_pass 1002
}
track_script {
check_mysql
}
virtual_ipaddress {
192.168.0.18 dev eth0 label eth0:0
}
notify_master "/root/2.sh master"
notify_backup "/root/2.sh backup"
notify_fault "/root/2.sh fault"
notify "/root/2.sh notify..."
}

machineB configuration:

global_defs {
router_id 4n1eq6wnfvdwvj
script_user root
enable_script_security
}

vrrp_script check_mysql {
script "/root/1.sh"
interval 10
timeout 5
weight 5
fall 3
}
vrrp_instance VI_1 {
state BACKUP
nopreempt
interface eth0
virtual_router_id 18
priority 13
advert_int 1
authentication {
auth_type PASS
auth_pass 1002
}
track_script {
check_mysql
}
virtual_ipaddress {
192.168.0.18 dev eth0 label eth0:0
}
notify_master "/root/2.sh master"
notify_backup "/root/2.sh backup"
notify_fault "/root/2.sh fault"
notify "/root/3.sh"
}

/root/1.sh is shown below. It checks whether the service is healthy; for this test it simply exits with 0 when the file /etc/keepalived/down exists and with 1 otherwise:

#!/bin/bash
/bin/test -f /etc/keepalived/down && exit 0 || exit 1
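
Since keepalived judges a check script purely by its exit status, it can help to run the script by hand before wiring it into the configuration. The following is a minimal sketch for doing that; the function name script_verdict is made up for illustration and is not part of keepalived.

```shell
#!/bin/bash
# script_verdict CMD [ARGS...]
# Runs the given check command and prints how its exit status would be read:
# 0 means the check passed, anything else means it failed.
# (Hypothetical helper for manual testing; not a keepalived command.)
script_verdict() {
    if "$@" >/dev/null 2>&1; then
        echo "OK"
    else
        # $? still holds the exit status of the command tested by `if`
        echo "KO (exit status $?)"
    fi
}
```

For example, script_verdict /root/1.sh could be run once with /etc/keepalived/down present and once without it, to confirm the script behaves as described.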

vrrp_script

Configuration

vrrp_script checks whether a service is healthy by running a script. The meaning of its parameters is documented in man keepalived.conf:

vrrp_script <SCRIPT_NAME> {
script <STRING>|<QUOTED-STRING> # path of the script to execute; exit status 0 means healthy
interval <INTEGER> # seconds between script invocations, default 1 second
timeout <INTEGER> # seconds after which script is considered to have failed
weight <INTEGER:-254..254> # adjust priority by this weight, default 0
rise <INTEGER> # required number of successes for OK transition
fall <INTEGER> # required number of failures for KO transition
user USERNAME [GROUPNAME] # user/group names to run script under
# group default to group of user
init_fail # assume script initially is in failed state
}

Take the configuration from above:

vrrp_script check_mysql {
script "/root/1.sh"
interval 10
timeout 5
weight 5
fall 3
}

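After deleting the file /etc/keepalived/down on machineA, the first failed check was logged at 17:45:06. Twenty seconds later (the third consecutive failure, given interval 10 and fall 3) the script was reported as failed and the priority dropped from 20 to 15. In other words, the priority only changes once the number of consecutive failures reaches fall. From the messages log: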

Dec 24 17:45:06 iloqg8n3yb9mje Keepalived_vrrp[109141]: Script `check_mysql` now returning 1
Dec 24 17:45:26 iloqg8n3yb9mje Keepalived_vrrp[109141]: VRRP_Script(check_mysql) failed (exited with status 1)
Dec 24 17:45:26 iloqg8n3yb9mje Keepalived_vrrp[109141]: (VI_1) Changing effective priority from 20 to 15
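
The 20-second gap in this log follows from interval and fall. As a rough sketch of the arithmetic (it ignores the script's own run time, and the function name fall_delay is only illustrative):

```shell
#!/bin/bash
# fall_delay INTERVAL FALL
# Seconds between the first failed check and the moment the script is
# declared failed, assuming every check in between fails as well.
fall_delay() {
    local interval="$1" fall="$2"
    echo $(( (fall - 1) * interval ))
}

fall_delay 10 3   # interval 10, fall 3 as configured above; prints 20
```

The first failure counts as one of the three, so only two more intervals have to pass: 17:45:06 plus 20 seconds gives 17:45:26.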

On machineB the priority was adjusted as soon as the check succeeded at 17:44:45, which shows that rise defaults to 1.

Dec 24 17:44:15 4n1eq6wnfvdwvj Keepalived_vrrp[51077]: /root/1.sh exited with status 1
Dec 24 17:44:25 4n1eq6wnfvdwvj Keepalived_vrrp[51077]: /root/1.sh exited with status 1
Dec 24 17:44:35 4n1eq6wnfvdwvj Keepalived_vrrp[51077]: /root/1.sh exited with status 1
Dec 24 17:44:45 4n1eq6wnfvdwvj Keepalived_vrrp[51077]: VRRP_Script(check_mysql) succeeded
Dec 24 17:44:46 4n1eq6wnfvdwvj Keepalived_vrrp[51077]: VRRP_Instance(VI_1) Changing effective priority from 13 to 18

The logs show that the priorities changed, but nothing else happened: the VIP did not move. Why?

Root cause analysis

According to the article keepalived之vrrp_script详解 ("vrrp_script in keepalived explained"):

When the script in vrrp_script exits with 0, the check is considered successful; any other value counts as a failure.

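  1. When weight is positive, it is added to priority while the check succeeds and is not added while it fails;
    1. Master check fails:
      1. a switch happens when master priority < backup priority + weight.
    2. Master check succeeds:
      1. the master stays master when master priority + weight > backup priority + weight.
  2. When weight is negative, priority is unaffected while the check succeeds and becomes priority - abs(weight) while it fails;
    1. Master check fails:
      1. a switch happens when master priority - abs(weight) < backup priority.
    2. Master check succeeds:
      1. the master stays master when master priority > backup priority.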
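
The priority arithmetic in these rules can be sketched as a small function. This is only an illustration of the quoted rules, not keepalived's actual code; the name effective_priority and the ok/fail argument are assumptions made for the example.

```shell
#!/bin/bash
# effective_priority BASE WEIGHT ok|fail
# Prints the priority a node would advertise under the quoted rules.
# (Hypothetical helper; a weight of 0 never changes the priority. The logs
# further down show the instance entering FAULT on failure in that case.)
effective_priority() {
    local base="$1" weight="$2" result="$3"
    if [ "$weight" -gt 0 ] && [ "$result" = "ok" ]; then
        echo $(( base + weight ))   # positive weight: bonus while healthy
    elif [ "$weight" -lt 0 ] && [ "$result" = "fail" ]; then
        echo $(( base + weight ))   # negative weight: penalty while failing
    else
        echo "$base"                # otherwise the base priority is used
    fi
}
```

With the values from this article, effective_priority 15 5 ok gives 20 and effective_priority 15 5 fail gives 15, which matches the "from 20 to 15" line in machineA's log; effective_priority 13 5 ok gives 18, matching machineB's "from 13 to 18".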

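My tests did not match this conclusion. I suspect the difference comes from a different keepalived version. nopreempt, which is set on both nodes, may also play a part, since it keeps a BACKUP from taking over from a MASTER that is still advertising even when the BACKUP's priority is higher. Either way, the VIP did not move, which is not what I expected.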

On a whim: what happens if weight is not configured in vrrp_script at all? The following logs are all from machineA:

# When the check_mysql script fails, instance VI_1 enters the FAULT state
Dec 27 10:42:08 iloqg8n3yb9mje Keepalived_vrrp[120464]: Script `check_mysql` now returning 1
Dec 27 10:42:28 iloqg8n3yb9mje Keepalived_vrrp[120464]: VRRP_Script(check_mysql) failed (exited with status 1)
Dec 27 10:42:28 iloqg8n3yb9mje Keepalived_vrrp[120464]: (VI_1) Entering FAULT STATE

# When check_mysql recovers, VI_1 enters the BACKUP state because nopreempt is set, even though machineA has the higher priority
Dec 27 10:47:28 iloqg8n3yb9mje Keepalived_vrrp[120464]: Script `check_mysql` now returning 0
Dec 27 10:47:28 iloqg8n3yb9mje Keepalived_vrrp[120464]: VRRP_Script(check_mysql) succeeded
Dec 27 10:47:28 iloqg8n3yb9mje Keepalived_vrrp[120464]: (VI_1) Entering BACKUP STATE

# When machineB fails, machineA takes over and enters the MASTER state
Dec 27 10:48:40 iloqg8n3yb9mje Keepalived_vrrp[120464]: (VI_1) Backup received priority 0 advertisement
Dec 27 10:48:41 iloqg8n3yb9mje Keepalived_vrrp[120464]: (VI_1) Receive advertisement timeout
Dec 27 10:48:41 iloqg8n3yb9mje Keepalived_vrrp[120464]: (VI_1) Entering MASTER STATE
Dec 27 10:48:41 iloqg8n3yb9mje Keepalived_vrrp[120464]: (VI_1) setting VIPs.
Dec 27 10:48:41 iloqg8n3yb9mje Keepalived_vrrp[120464]: Sending gratuitous ARP on eth0 for 192.168.0.18
Dec 27 10:48:41 iloqg8n3yb9mje Keepalived_vrrp[120464]: (VI_1) Sending/queueing gratuitous ARPs on eth0 for 192.168.0.18

This shows that vrrp_script works without a weight value, and that leaving it out is the safer choice because it avoids unexpected behavior.

Also, if you see messages like the following:

Dec 24 17:41:50 iloqg8n3yb9mje Keepalived_vrrp[108697]: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Dec 24 17:41:50 iloqg8n3yb9mje Keepalived_vrrp[108697]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.

They should not affect operation, but adding the following to global_defs makes the messages go away:

script_user root
enable_script_security

Can the check be written inline in vrrp_script as script "test -f /etc/keepalived/down && exit 0 || exit 1"? In my tests this did not work correctly.

notify

How the notify options are used:

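  • notify_master: script run when this node becomes master (typically used to start a service such as nginx or haproxy)
  • notify_backup: script run when this node becomes backup (typically used to stop a service such as nginx or haproxy)
  • notify_fault: script run when this node enters the fault state
  • notify: script run on every state change; it is called after the three scripts above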

According to the documentation, notify is automatically passed the following arguments:

$1 = "GROUP"|"INSTANCE"
$2 = name of the group or instance
$3 = target state of transition ("MASTER"|"BACKUP"|"FAULT")
$4 = priority value

So a notify script does not need arguments in the configuration, unlike the other three:

notify_master "/root/2.sh master"
notify_backup "/root/2.sh backup"
notify_fault "/root/2.sh fault"
notify "/root/3.sh"

The scripts are simple and only write a log line:

[root@4n1eq6wnfvdwvj ~]# cat 2.sh 
#!/bin/bash

echo "`date +"%F %T"` $1" >>/tmp/fdm.txt

[root@4n1eq6wnfvdwvj ~]# cat 3.sh
#!/bin/bash

TYPE=$1    # "GROUP" or "INSTANCE"
NAME=$2    # name of the group or instance
STATE=$3   # target state of the transition
LOG=/tmp/fdm.txt
case "$STATE" in
    "MASTER"|"BACKUP"|"FAULT")
        # every known target state is logged in the same format
        echo "$(date +"%F %T") notify $TYPE $NAME $STATE..." >>"$LOG"
        exit 0
        ;;
    *)
        # unexpected state: log it and signal an error
        echo "$(date +"%F %T") NO TYPE:$TYPE $NAME" >>"$LOG"
        exit 1
        ;;
esac

The resulting log:

2020-12-27 22:31:12 backup
2020-12-27 22:31:12 notify INSTANCE VI_1 BACKUP...
2020-12-27 22:31:12 fault
2020-12-27 22:31:12 notify INSTANCE VI_1 FAULT...

As shown, the notify call comes after notify_backup.
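
The log above uses only the first three arguments. As a variation, the fourth argument (the priority value) can be recorded too. The following is a minimal sketch; the function name format_notify and the output format are illustrative choices, not something keepalived prescribes.

```shell
#!/bin/bash
# format_notify TYPE NAME STATE PRIORITY
# Builds one log line from the four arguments keepalived passes to a
# notify script. Returns 1 for a state other than MASTER/BACKUP/FAULT.
# (Hypothetical helper for illustration.)
format_notify() {
    local type="$1" name="$2" state="$3" prio="$4"
    case "$state" in
        MASTER|BACKUP|FAULT)
            echo "notify $type $name $state prio=$prio"
            ;;
        *)
            echo "unexpected state: $state" >&2
            return 1
            ;;
    esac
}
```

Inside a notify script this could be used as echo "$(date +"%F %T") $(format_notify "$@")" >>/tmp/fdm.txt, which keeps the timestamp prefix used in the log above.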

Split-brain

Everything above covers failovers triggered by a failing service. What does keepalived do when the two machines cannot reach each other?

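VRRP has only one kind of control message, the VRRP advertisement. The advert_int parameter sets the interval between advertisements (advert_int 1 sends one per second, which is also the default). Advertisements are encapsulated in IP multicast packets sent to group address 224.0.0.18 and are confined to the local network segment, which is why a VRID can be reused on different networks. To save bandwidth, only the master router sends advertisements periodically. A backup router starts a new VRRP election if it receives no advertisement for three consecutive advertisement intervals, or if it receives an advertisement with priority 0.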
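
The "three advertisement intervals" rule is stated a little more precisely in the VRRPv2 specification (RFC 3768): a backup waits Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time, where Skew_Time = (256 - Priority) / 256 seconds. The sketch below only reproduces that arithmetic, on the assumption that keepalived follows the RFC formula for VRRPv2; the function name is illustrative.

```shell
#!/bin/bash
# master_down_interval ADVERT_INT PRIORITY
# Prints the RFC 3768 Master_Down_Interval in seconds:
#   3 * Advertisement_Interval + (256 - Priority) / 256
# awk is used because shell arithmetic is integer-only.
master_down_interval() {
    awk -v a="$1" -v p="$2" 'BEGIN { printf "%.3f\n", 3 * a + (256 - p) / 256 }'
}

master_down_interval 1 15   # machineA: advert_int 1, priority 15; prints 3.941
```

With advert_int 1 a backup therefore waits just under four seconds, and a node with a higher priority times out slightly sooner than one with a lower priority.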

Normally only the master sends VRRP advertisements:

[root@4n1eq6wnfvdwvj ~]# tcpdump -i any -nns0 vrrp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
23:00:22.574251 IP 192.168.0.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 18, prio 15, authtype simple, intvl 1s, length 20
23:00:23.574399 IP 192.168.0.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 18, prio 15, authtype simple, intvl 1s, length 20
23:00:24.574420 IP 192.168.0.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 18, prio 15, authtype simple, intvl 1s, length 20
23:00:25.574504 IP 192.168.0.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 18, prio 15, authtype simple, intvl 1s, length 20
23:00:26.574580 IP 192.168.0.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 18, prio 15, authtype simple, intvl 1s, length 20

Now block outgoing VRRP packets on the master with an iptables rule: date +"%F %T";iptables -I OUTPUT -p vrrp -j DROP. The command was run at 2020-12-27 22:53:50. Watch when the backup enters MASTER:

Dec 27 22:53:54 iloqg8n3yb9mje Keepalived_vrrp[123054]: (VI_1) Receive advertisement timeout
Dec 27 22:53:54 iloqg8n3yb9mje Keepalived_vrrp[123054]: (VI_1) Entering MASTER STATE
Dec 27 22:53:54 iloqg8n3yb9mje Keepalived_vrrp[123054]: (VI_1) setting VIPs.

The advertisement timed out and the node entered the MASTER state. In the one-second polling loop below, the VIP shows up about one second later:

[root@iloqg8n3yb9mje ~]# for i in `seq 1 100`;do ip -4 -o addr |grep 192.168.0.18 -q && echo "`date +"%F %T"` have 192.168.0.18" || echo `date +"%F %T"` no~~~;sleep 1;done
2020-12-27 22:53:50 no~~~
2020-12-27 22:53:51 no~~~
2020-12-27 22:53:52 no~~~
2020-12-27 22:53:53 no~~~
2020-12-27 22:53:54 no~~~
2020-12-27 22:53:55 have 192.168.0.18
2020-12-27 22:53:56 have 192.168.0.18
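
The loop above matches the VIP with a plain grep, which would also match an address such as 192.168.0.180. A slightly stricter check compares the whole address field. This is a sketch that assumes the usual ip -4 -o addr layout, where the fourth field is ADDRESS/PREFIX; the function name has_vip is made up for the example.

```shell
#!/bin/bash
# has_vip VIP
# Reads `ip -4 -o addr` output on stdin and succeeds only if one line
# carries exactly VIP as its address (field 4, prefix length stripped).
# (Hypothetical helper for illustration.)
has_vip() {
    awk -v vip="$1" '
        { split($4, a, "/"); if (a[1] == vip) found = 1 }
        END { exit !found }
    '
}
```

It could be used as ip -4 -o addr | has_vip 192.168.0.18 && echo "have 192.168.0.18". Running it on both nodes at the same time is a quick way to spot split-brain: only one of them should report the VIP.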

Typical things to check when troubleshooting split-brain:

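  • virtual_router_id differs between the nodes (it must be identical on both)
  • a firewall is filtering the VRRP multicast packets
  • abnormal load keeps a machine from sending VRRP packets, or from getting enough CPU time to process the ones it receives; in that case try increasing advert_int
  • a faulty NIC, and so on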
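
Because only the master sends advertisements, seeing more than one source address advertising the same vrid is a strong hint that both nodes consider themselves MASTER. The following sketch counts distinct senders in tcpdump output like the capture shown earlier; the function name count_vrrp_senders is hypothetical, and the parsing assumes the "IP <src> > ..." and "vrid <n>," tokens seen in that capture.

```shell
#!/bin/bash
# count_vrrp_senders VRID
# Reads tcpdump VRRP lines on stdin and prints how many distinct source
# addresses advertise the given VRID.
# (Hypothetical helper; relies on the output format shown in this article.)
count_vrrp_senders() {
    awk -v vrid="$1" '
        {
            src = ""; id = ""
            for (i = 1; i < NF; i++) {
                if ($i == "IP" && src == "") src = $(i + 1)   # source address
                if ($i == "vrid") { id = $(i + 1); sub(/,$/, "", id) }
            }
            if (src != "" && id == vrid && !(src in seen)) { seen[src] = 1; n++ }
        }
        END { print n + 0 }
    '
}
```

One way to use it is to capture a few packets with tcpdump -i any -nns0 -c 10 vrrp and pipe the output into count_vrrp_senders 18. A result of 1 is the normal case; 2 suggests split-brain.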

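  • Author: wumingx
  • Link: https://www.wumingx.com/linux/keepalived.html
  • Title: Notes on Several keepalived Configuration Issues (keepalived几个配置问题讲解)
  • Copyright: unless stated otherwise, please credit the source when reposting articles from this site. If anything here infringes your rights, contact me and I will remove it.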