LINUX.ORG.RU
Forum: Admin

Help me figure out an HA cluster failure


I have a two-node cluster on Oracle Linux 6.4 (virtual machines). When the active node went down, the cluster used to move the services over to the other node and everything worked. But this time the failover did not happen, and no matter how much I read the logs I still can't figure out the cause.
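
The stack here is the stock Oracle Linux / RHEL 6 cluster suite (corosync + cman + rgmanager + fenced). A minimal sketch of the standard commands for checking the cluster state on each node (nothing below is specific to this configuration):

clustat              # node and service state as rgmanager sees it
cman_tool status     # quorum and vote information
cman_tool nodes      # membership as cman sees it
fence_tool ls        # fence domain state; pending or failed fencing shows up here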

Log from the first node:

Jul 29 05:02:57 a-svfeOL corosync[4622]:   [TOTEM ] A processor failed, forming new configuration.
Jul 29 05:03:04 a-svfeOL corosync[4622]:   [QUORUM] Members[1]: 1
Jul 29 05:03:04 a-svfeOL corosync[4622]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 29 05:03:04 a-svfeOL corosync[4622]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.15) ; members(old:2 left:1)
Jul 29 05:03:04 a-svfeOL corosync[4622]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 29 05:03:04 a-svfeOL kernel: dlm: closing connection to node 2
Jul 29 05:03:15 a-svfeOL corosync[4622]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 29 05:03:15 a-svfeOL corosync[4622]:   [QUORUM] Members[2]: 1 2
Jul 29 05:03:15 a-svfeOL corosync[4622]:   [QUORUM] Members[2]: 1 2
Jul 29 05:03:15 a-svfeOL corosync[4622]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.15) ; members(old:1 left:0)
Jul 29 05:03:15 a-svfeOL corosync[4622]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 29 05:03:18 a-svfeOL corosync[4622]: cman killed by node 2 because we were killed by cman_tool or other application
Jul 29 05:03:18 a-svfeOL fenced[4683]: telling cman to remove nodeid 2 from cluster
Jul 29 05:03:22 a-svfeOL rgmanager[4856]: #67: Shutting down uncleanly
Jul 29 05:03:22 a-svfeOL gfs_controld[4753]: cluster is down, exiting
Jul 29 05:03:22 a-svfeOL gfs_controld[4753]: daemon cpg_dispatch error 2
Jul 29 05:03:22 a-svfeOL fenced[4683]: daemon cpg_dispatch error 2
Jul 29 05:03:22 a-svfeOL fenced[4683]: cluster is down, exiting
Jul 29 05:03:22 a-svfeOL fenced[4683]: daemon cpg_dispatch error 2
Jul 29 05:03:22 a-svfeOL fenced[4683]: cpg_dispatch error 2
Jul 29 05:03:22 a-svfeOL dlm_controld[4696]: cluster is down, exiting
Jul 29 05:03:22 a-svfeOL dlm_controld[4696]: daemon cpg_dispatch error 2
Jul 29 05:03:22 a-svfeOL dlm_controld[4696]: cpg_dispatch error 2
Jul 29 05:03:22 a-svfeOL rgmanager[5457]: [script] Executing /root/sv_run.sh status
Jul 29 05:03:25 a-svfeOL kernel: dlm: closing connection to node 2
Jul 29 05:03:30 a-svfeOL kernel: dlm: closing connection to node 1
Jul 29 05:03:42 a-svfeOL rgmanager[5517]: [script] Executing /root/sv_run.sh stop
Jul 29 05:04:48 a-svfeOL rgmanager[5618]: [ip] Removing IPv4 address 10.10.60.4/24 from eth4
Jul 29 05:05:35 a-svfeOL kernel: INFO: task rgmanager:5448 blocked for more than 120 seconds.
Jul 29 05:05:35 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 29 05:05:35 a-svfeOL kernel: rgmanager       D ffff8801b415e928     0  5448   4854 0x00000080
Jul 29 05:05:35 a-svfeOL kernel: ffff8801b1c4dc60 0000000000000086 ffffffff8105bbc7 ffff880100000000
Jul 29 05:05:35 a-svfeOL kernel: 0000000000012180 ffff8801b1c4dfd8 ffff8801b1c4c010 0000000000012180
Jul 29 05:05:35 a-svfeOL kernel: ffff8801b1c4dfd8 0000000000012180 ffffffff81791020 ffff8801b415e380
Jul 29 05:05:35 a-svfeOL kernel: Call Trace:
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff8105bbc7>] ? find_busiest_group+0x237/0xae0
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff8150dfaf>] schedule+0x3f/0x60
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff8150fe75>] rwsem_down_failed_common+0xc5/0x160
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff8150ff45>] rwsem_down_read_failed+0x15/0x17
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff81266584>] call_rwsem_down_read_failed+0x14/0x30
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff8150f194>] ? down_read+0x24/0x30
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffffa02d3f9d>] dlm_user_request+0x4d/0x1c0 [dlm]
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff81066de3>] ? perf_event_task_sched_out+0x33/0xa0
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff8115c3a6>] ? kmem_cache_alloc_trace+0x156/0x190
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffffa02e1841>] device_user_lock+0x131/0x140 [dlm]
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff81081853>] ? set_current_blocked+0x53/0x70
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffffa02e1b32>] device_write+0x2e2/0x4f0 [dlm]
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff811722d8>] vfs_write+0xc8/0x190
Jul 29 05:05:35 a-svfeOL kernel: [<ffffffff811724a1>] sys_write+0x51/0x90

Log from the second node:

Jul 29 05:02:56 b-svfeOL corosync[1456]:   [TOTEM ] A processor failed, forming new configuration.
Jul 29 05:02:58 b-svfeOL corosync[1456]:   [QUORUM] Members[1]: 2
Jul 29 05:02:58 b-svfeOL kernel: dlm: closing connection to node 1
Jul 29 05:02:58 b-svfeOL corosync[1456]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 29 05:02:58 b-svfeOL corosync[1456]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.16) ; members(old:2 left:1)
Jul 29 05:02:58 b-svfeOL corosync[1456]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 29 05:02:58 b-svfeOL rsyslogd-2177: imuxsock begins to drop messages from pid 1456 due to rate-limiting
Jul 29 05:02:58 b-svfeOL fenced[1534]: fencing node a-svfeOL
Jul 29 05:02:58 b-svfeOL fenced[1534]: fence a-svfeOL dev 0.0 agent none result: error no method
Jul 29 05:02:58 b-svfeOL fenced[1534]: fence a-svfeOL failed
Jul 29 05:03:04 b-svfeOL rsyslogd-2177: imuxsock lost 15 messages from pid 1456 due to rate-limiting
Jul 29 05:03:04 b-svfeOL fenced[1534]: fencing node a-svfeOL
Jul 29 05:03:04 b-svfeOL fenced[1534]: fence a-svfeOL dev 0.0 agent none result: error no method
Jul 29 05:03:04 b-svfeOL fenced[1534]: fence a-svfeOL failed
Jul 29 05:03:04 b-svfeOL corosync[1456]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 29 05:03:04 b-svfeOL corosync[1456]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.16) ; members(old:1 left:0)
Jul 29 05:03:04 b-svfeOL corosync[1456]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 29 05:03:10 b-svfeOL fenced[1534]: fencing node a-svfeOL
Jul 29 05:03:10 b-svfeOL fenced[1534]: fence a-svfeOL dev 0.0 agent none result: error no method
Jul 29 05:03:10 b-svfeOL fenced[1534]: fence a-svfeOL failed
Jul 29 05:03:16 b-svfeOL corosync[1456]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 29 05:03:16 b-svfeOL corosync[1456]:   [QUORUM] Members[2]: 1 2
Jul 29 05:03:16 b-svfeOL corosync[1456]:   [QUORUM] Members[2]: 1 2
Jul 29 05:03:16 b-svfeOL rsyslogd-2177: imuxsock begins to drop messages from pid 1456 due to rate-limiting
Jul 29 05:03:19 b-svfeOL fenced[1534]: telling cman to remove nodeid 1 from cluster
Jul 29 05:03:22 b-svfeOL rsyslogd-2177: imuxsock lost 170 messages from pid 1456 due to rate-limiting
Jul 29 05:03:39 b-svfeOL corosync[1456]:   [TOTEM ] A processor failed, forming new configuration.
Jul 29 05:03:41 b-svfeOL kernel: dlm: closing connection to node 1
Jul 29 05:03:41 b-svfeOL corosync[1456]:   [QUORUM] Members[1]: 2
Jul 29 05:03:41 b-svfeOL corosync[1456]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 29 05:03:41 b-svfeOL corosync[1456]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.16) ; members(old:2 left:1)
Jul 29 05:03:41 b-svfeOL rsyslogd-2177: imuxsock begins to drop messages from pid 1456 due to rate-limiting
Jul 29 05:03:42 b-svfeOL rsyslogd-2177: imuxsock lost 17 messages from pid 1456 due to rate-limiting
Jul 29 05:06:29 b-svfeOL kernel: INFO: task rgmanager:14569 blocked for more than 120 seconds.
Jul 29 05:06:29 b-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 29 05:06:29 b-svfeOL kernel: rgmanager       D ffff8801b31906e8     0 14569   1994 0x00000080
Jul 29 05:06:29 b-svfeOL kernel: ffff8801b15c7c60 0000000000000082 ffffffff8105bbc7 ffff880100000000
Jul 29 05:06:29 b-svfeOL kernel: 0000000000012180 ffff8801b15c7fd8 ffff8801b15c6010 0000000000012180
Jul 29 05:06:29 b-svfeOL kernel: ffff8801b15c7fd8 0000000000012180 ffffffff81791020 ffff8801b3190140
Jul 29 05:06:29 b-svfeOL kernel: Call Trace:
Jul 29 05:06:29 b-svfeOL kernel: [<ffffffff8105bbc7>] ? find_busiest_group+0x237/0xae0
Jul 29 05:06:29 b-svfeOL kernel: [<ffffffff8150dfaf>] schedule+0x3f/0x60
Jul 29 05:06:29 b-svfeOL kernel: [<ffffffff8150fe75>] rwsem_down_failed_common+0xc5/0x160
Jul 29 05:06:29 b-svfeOL kernel: [<ffffffff8150ff45>] rwsem_down_read_failed+0x15/0x17
Jul 29 05:06:29 b-svfeOL kernel: [<ffffffff81266584>] call_rwsem_down_read_failed+0x14/0x30
Jul 29 05:06:29 b-svfeOL kernel: [<ffffffff8150f194>] ? down_read+0x24/0x30

Yes, no fence agent is configured (there was a case when it went into a loop and kept rebooting the machine over and over). As I understand it, something happened and the second node removed the first one from the cluster, but that did not fully work out either. In the end, when I got to work I found that the service had not failed over: clustat on the second node showed the first node as offline with the service still on it, while clustat on the first node did not work at all because CMAN was down and would not start; only a reboot helped.
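
The repeated "fence a-svfeOL dev 0.0 agent none result: error no method" lines on the second node come from exactly this: fenced has no fence method configured for the node, so fencing can never succeed, and rgmanager will not take over the dead node's services until it does. A minimal sketch of what a fence section in /etc/cluster/cluster.conf could look like for VMware guests, assuming the fence_vmware_soap agent; the node name matches this cluster, but the device address, credentials and attribute values are placeholders and should be checked against the agent's man page:

<clusternode name="a-svfeOL" nodeid="1">
    <fence>
        <method name="vmware">
            <device name="vcenter" port="a-svfeOL"/>
        </method>
    </fence>
</clusternode>
<!-- same block for b-svfeOL with its own VM name -->
<fencedevices>
    <fencedevice name="vcenter" agent="fence_vmware_soap"
                 ipaddr="vcenter.example.com" login="fence_user" passwd="secret" ssl="on"/>
</fencedevices>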

To sort out split-brain situations in a two-node cluster you also need a quorum disk.

anonymous
()
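
A minimal sketch of how a quorum disk is usually added to this stack, assuming a small shared LUN visible to both nodes (the device path, label and heuristic gateway address below are placeholders):

mkqdisk -c /dev/sdc -l svfe_qdisk     # initialize the shared device as a quorum disk (run once)
mkqdisk -L                            # verify the label is visible from both nodes

and in /etc/cluster/cluster.conf, dropping two_node="1" and raising the expected votes:

<cman expected_votes="3"/>
<quorumd label="svfe_qdisk" interval="1" tko="10" votes="1">
    <heuristic program="ping -c1 -w1 10.10.60.1" score="1" interval="2" tko="3"/>
</quorumd>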

INFO: task rgmanager:14569 blocked for more than 120 seconds.

I have one server where this same nonsense makes it stop accepting console logins. What process is that, by the way?

imuxsock begins to drop messages from pid 1456 due to rate-limiting

And which PID was that?

pianolender ★★★
()
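
On the PID question: 1456 in that log is corosync (the syslog lines are tagged corosync[1456]), and rgmanager in the hung-task report is the cluster's resource group manager from the same suite. While a process is still running, the owner of a PID can be checked directly, and the blocked-task report can be re-triggered via sysrq, for example:

ps -p 1456 -o pid,comm,args    # show which command is behind the PID
echo w > /proc/sysrq-trigger   # dump all blocked (D-state) tasks to the kernel log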