Problem with an HA cluster

Here is my situation: a two-node HA cluster. The machines differ only in core count (one single-core, one dual-core). Before this I built a trial cluster in VirtualBox and everything worked fine. When I started moving it to physical machines, the trouble began. Here is what I did:

1. Connecting and synchronizing both nodes: OK

2. Making node 1 the master: OK

3. Creating file systems and mount points on both nodes: OK

4. Mounting the file system on the first node: OK

5. Testing replication between the nodes: OK

6. Starting heartbeat on both nodes: OK

7. Starting httpd on node 1: OK (but the virtual host does not respond to ping and does not work as it should; the new eth1:0 device does not show up in ifconfig)

When heartbeat is restarted on either node, the virtual host comes up for a short while and ping works.

That is glitch number one. Number two: when the primary node is rebooted, the secondary becomes master for a couple of seconds and mounts the disk, but then drops back to slave. Once the first node has booted, the two find each other and synchronize normally (all good), BUT(!) node 1 comes up in the slave state. The result is slave/slave, and I have to set node 1 up by hand all over again.

Number three: sometimes the file system (the sdb disk) on the first node gets unmounted even though the node's status (master) does not change.

ha.cf:

logfacility local0

keepalive 2

deadtime 30

initdead 120

bcast eth0

auto_failback on

node node1.company.ru

node node2.company.ru

respawn hacluster /usr/lib/heartbeat/ipfail

use_logd yes

logfile /var/log/ha.log

debugfile /var/log/ha-debug.log
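
A quick way to sanity-check the timing directives in the ha.cf above is to parse the file and compare the values. This is only a minimal sketch: the helper name `parse_ha_cf` is hypothetical, the values are assumed to be plain seconds (heartbeat also accepts forms such as `500ms`, which this does not handle), and the rule that `initdead` should be at least twice `deadtime` is the guideline from the heartbeat documentation rather than something stated in this post.

```python
# Text of the ha.cf posted above (ligatures in "logfile"/"debugfile" repaired).
HA_CF = """\
logfacility local0
keepalive 2
deadtime 30
initdead 120
bcast eth0
auto_failback on
node node1.company.ru
node node2.company.ru
respawn hacluster /usr/lib/heartbeat/ipfail
use_logd yes
logfile /var/log/ha.log
debugfile /var/log/ha-debug.log
"""


def parse_ha_cf(text):
    """Return {directive: [values...]}; directives such as 'node' may repeat."""
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(" ")
        directives.setdefault(key, []).append(value.strip())
    return directives


cfg = parse_ha_cf(HA_CF)

# Assumption: all three values are whole seconds, as in the posted file.
keepalive = int(cfg["keepalive"][0])
deadtime = int(cfg["deadtime"][0])
initdead = int(cfg["initdead"][0])

# keepalive must be well below deadtime; initdead at least twice deadtime.
timing_ok = keepalive < deadtime and initdead >= 2 * deadtime

print("nodes:", cfg["node"])
print("heartbeat interface:", cfg["bcast"][0])
print("timing looks sane:", timing_ok)
```

With the posted values the timing check passes, so the timers themselves are probably not what breaks failover here.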

haresources:

node1.company.ru IPaddr::192.168.146.4/24 drbddisk::r0 \

Filesystem::/dev/drbd0::/mnt/drbd0::ext3::defaults httpd
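
The haresources entry above is a chain: heartbeat (in this v1, haresources-based mode) starts the resources left to right on takeover and stops them right to left on release, which matches the order seen in the log later in the thread. The sketch below only illustrates that ordering; `parse_haresources_line` is a hypothetical helper, not part of heartbeat.

```python
def parse_haresources_line(entry):
    """Split one haresources entry into (preferred_node, [(script, args), ...])."""
    # A trailing backslash continues the entry on the next line.
    logical = entry.replace("\\\n", " ")
    fields = logical.split()
    node, resources = fields[0], fields[1:]
    parsed = []
    for res in resources:
        # "script::arg1::arg2" -> script name plus its arguments
        script, *args = res.split("::")
        parsed.append((script, args))
    return node, parsed


# The entry from the post, including its line continuation.
entry = (
    "node1.company.ru IPaddr::192.168.146.4/24 drbddisk::r0 \\\n"
    "Filesystem::/dev/drbd0::/mnt/drbd0::ext3::defaults httpd"
)

node, resources = parse_haresources_line(entry)
start_order = [script for script, _ in resources]
stop_order = list(reversed(start_order))

print("preferred node:", node)
print("start order:", " -> ".join(start_order))
print("stop order: ", " -> ".join(stop_order))
```

Because the chain is all-or-nothing, a failure of the last element (httpd) makes heartbeat release everything before it, including the IP address and the DRBD mount.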

drbd.conf:

global { usage-count yes; }

common { syncer { rate 20M; } }

resource r0 {

protocol C;

startup {

}

disk {

on-io-error detach;

}

net {

}

on node1.company.ru {

device /dev/drbd0;

disk /dev/sdb;

address 192.168.146.150:7789;

meta-disk internal;

}

on node2.company.ru {

device /dev/drbd0;

disk /dev/sdb;

address 192.168.146.134:7789;

meta-disk internal;

}

}

Folks... really? Has nobody run into this problem? Does nobody know what could be causing it?

daevaorn
(01.03.10 10:05:17 MSK) topic author

And what is in /var/log/ha.log on both nodes?

SlavikSS ★★
(01.03.10 10:14:51 MSK)

In reply to: comment by SlavikSS 01.03.10 10:14:51 MSK

On the first node there are three repeating lines with different timestamps:

hb_standby[<PID>]: <timestamp> Going standby [foreign]

On the second node there is a whole lot more:

hb_standby[7374]: 2010/02/19_22:35:21 Going standby [foreign].

heartbeat[6300]: 2010/02/27_16:07:29 info: Version 2 support: false

heartbeat[6300]: 2010/02/27_16:07:30 WARN: Logging daemon is disabled --enabling logging daemon is recommended

heartbeat[6300]: 2010/02/27_16:07:30 info: **************************

heartbeat[6300]: 2010/02/27_16:07:30 info: Configuration validated. Starting heartbeat 2.1.3

heartbeat[6301]: 2010/02/27_16:07:30 ERROR: Cannot chdir to [/var/lib/heartbeat/cores]: No such file or directory

heartbeat[6301]: 2010/02/27_16:07:30 info: heartbeat: version 2.1.3

heartbeat[6301]: 2010/02/27_16:07:30 info: Heartbeat generation: 1266597825

heartbeat[6301]: 2010/02/27_16:07:30 info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1

heartbeat[6301]: 2010/02/27_16:07:30 info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1

heartbeat[6301]: 2010/02/27_16:07:30 info: G_main_add_TriggerHandler: Added signal manual handler

heartbeat[6301]: 2010/02/27_16:07:30 info: G_main_add_SignalHandler: Added signal handler for signal 17

heartbeat[6301]: 2010/02/27_16:07:30 info: Local status now set to: 'up'

heartbeat[6303]: 2010/02/27_16:07:31 ERROR: Cannot chdir to [/var/lib/heartbeat/cores]: No such file or directory

heartbeat[6304]: 2010/02/27_16:07:31 ERROR: Cannot chdir to [/var/lib/heartbeat/cores]: No such file or directory

daevaorn
(01.03.10 11:52:58 MSK) topic author

In reply to: comment by daevaorn 01.03.10 11:52:58 MSK

heartbeat[6305]: 2010/02/27_16:07:31 ERROR: Cannot chdir to [/var/lib/heartbeat/cores]: No such file or directory

heartbeat[6301]: 2010/02/27_16:07:31 info: Link node1.akme:eth1 up.

heartbeat[6301]: 2010/02/27_16:07:31 info: Status update for node node1.akme: status active

heartbeat[6301]: 2010/02/27_16:07:31 info: Link node2.akme:eth1 up.

harc[6307]: 2010/02/27_16:07:31 info: Running /etc/ha.d/rc.d/status status

heartbeat[6301]: 2010/02/27_16:07:31 info: Comm_now_up(): updating status to active

heartbeat[6301]: 2010/02/27_16:07:31 info: Local status now set to: 'active'

heartbeat[6301]: 2010/02/27_16:07:31 info: Starting child client "/usr/lib/heartbeat/ipfail" (501,501)

heartbeat[6324]: 2010/02/27_16:07:31 info: Starting "/usr/lib/heartbeat/ipfail" as uid 501 gid 501 (pid 6324)

heartbeat[6301]: 2010/02/27_16:07:32 info: remote resource transition completed.

heartbeat[6301]: 2010/02/27_16:07:32 info: Local Resource acquisition completed. (none)

heartbeat[6301]: 2010/02/27_16:07:32 info: node1.akme wants to go standby [foreign]

heartbeat[6301]: 2010/02/27_16:07:33 info: standby: acquire [foreign] resources from node1.akme

heartbeat[6327]: 2010/02/27_16:07:33 info: acquire local HA resources (standby).

ResourceManager[10175]: 2010/02/27_16:09:45 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/drbd0/ ext3 defaults start

Filesystem[10706]: 2010/02/27_16:09:45 INFO: Running start for /dev/drbd0 on /mnt/drbd0

Filesystem[10695]: 2010/02/27_16:09:45 INFO: Success

ResourceManager[10175]: 2010/02/27_16:09:46 info: Running /etc/rc.d/init.d/httpd start

ResourceManager[10175]: 2010/02/27_16:09:46 ERROR: Return code 1 from /etc/rc.d/init.d/httpd

ResourceManager[10175]: 2010/02/27_16:09:46 CRIT: Giving up resources due to failure of httpd

ResourceManager[10175]: 2010/02/27_16:09:46 info: Releasing resource group: node1.akme IPaddr::192.168.1.4/24 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd0/::ext3::defaults httpd

heartbeat[6301]: 2010/02/27_16:09:46 WARN: node node1.akme: is dead

heartbeat[6301]: 2010/02/27_16:09:46 info: Dead node node1.akme gave up resources.

heartbeat[6301]: 2010/02/27_16:09:46 info: Link node1.akme:eth1 dead.

ipfail[6324]: 2010/02/27_16:09:46 info: Status update: Node node1.akme now has status dead

ResourceManager[10175]: 2010/02/27_16:09:46 info: Running /etc/rc.d/init.d/httpd stop

ResourceManager[10175]: 2010/02/27_16:09:46 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/drbd0/ ext3 defaults stop

Filesystem[11240]: 2010/02/27_16:09:46 INFO: Running stop for /dev/drbd0 on /mnt/drbd0

Filesystem[11240]: 2010/02/27_16:09:47 INFO: Trying to unmount /mnt/drbd0

Filesystem[11240]: 2010/02/27_16:09:47 INFO: unmounted /mnt/drbd0 successfully

Filesystem[11229]: 2010/02/27_16:09:47 INFO: Success

ResourceManager[10175]: 2010/02/27_16:09:47 info: Running /etc/ha.d/resource.d/drbddisk r0 stop

ResourceManager[10175]: 2010/02/27_16:09:47 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.4/24 stop

IPaddr[11428]: 2010/02/27_16:09:47 INFO: ifconfig eth1:0 down

daevaorn
(01.03.10 11:53:36 MSK) topic author

In reply to: comment by daevaorn 01.03.10 11:53:36 MSK

IPaddr[11402]: 2010/02/27_16:09:47 INFO: Success

mach_down[10143]: 2010/02/27_16:09:47 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired

mach_down[10143]: 2010/02/27_16:09:47 info: mach_down takeover complete for node node1.akme.

heartbeat[6301]: 2010/02/27_16:09:47 info: mach_down takeover complete.

ipfail[6324]: 2010/02/27_16:09:47 info: NS: We are dead. :<

ipfail[6324]: 2010/02/27_16:09:47 info: Link Status update: Link node1.akme/eth1 now has status dead

ipfail[6324]: 2010/02/27_16:09:48 info: We are dead. :<

ipfail[6324]: 2010/02/27_16:09:48 info: Asking other side for ping node count.

hb_standby[11686]: 2010/02/27_16:10:17 Going standby [foreign].

heartbeat[6301]: 2010/02/27_16:10:17 info: node2.akme wants to go standby [foreign]

heartbeat[6301]: 2010/02/27_16:10:28 WARN: No reply to standby request. Standby request cancelled.

heartbeat[6301]: 2010/02/27_16:10:31 info: Heartbeat restart on node node1.akme

heartbeat[6301]: 2010/02/27_16:10:31 info: Link node1.akme:eth1 up.

heartbeat[6301]: 2010/02/27_16:10:31 info: Status update for node node1.akme: status init

heartbeat[6301]: 2010/02/27_16:10:31 info: Status update for node node1.akme: status up

ipfail[6324]: 2010/02/27_16:10:31 info: Link Status update: Link node1.akme/eth1 now has status up

ipfail[6324]: 2010/02/27_16:10:31 info: Status update: Node node1.akme now has status init

ipfail[6324]: 2010/02/27_16:10:31 info: Status update: Node node1.akme now has status up

harc[12195]: 2010/02/27_16:10:31 info: Running /etc/ha.d/rc.d/status status

harc[12211]: 2010/02/27_16:10:31 info: Running /etc/ha.d/rc.d/status status

daevaorn
(01.03.10 11:54:10 MSK) topic author

In reply to: comment by daevaorn 01.03.10 11:54:10 MSK

heartbeat[6301]: 2010/02/27_16:10:31 info: Status update for node node1.akme: status active

ipfail[6324]: 2010/02/27_16:10:31 info: Status update: Node node1.akme now has status active

harc[12227]: 2010/02/27_16:10:31 info: Running /etc/ha.d/rc.d/status status

heartbeat[6301]: 2010/02/27_16:10:32 info: remote resource transition completed.

heartbeat[6301]: 2010/02/27_16:10:32 info: node2.akme wants to go standby [foreign]

heartbeat[6301]: 2010/02/27_16:10:32 info: standby: node1.akme can take our foreign resources

heartbeat[12243]: 2010/02/27_16:10:32 info: give up foreign HA resources (standby).

ResourceManager[12256]: 2010/02/27_16:10:32 info: Releasing resource group: node1.akme IPaddr::192.168.1.4/24 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd0/::ext3::defaults httpd

ResourceManager[12256]: 2010/02/27_16:10:32 info: Running /etc/rc.d/init.d/httpd stop

ResourceManager[12256]: 2010/02/27_16:10:32 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/drbd0/ ext3 defaults stop

Filesystem[12335]: 2010/02/27_16:10:33 INFO: Running stop for /dev/drbd0 on /mnt/drbd0

Filesystem[12324]: 2010/02/27_16:10:33 INFO: Success

ResourceManager[12256]: 2010/02/27_16:10:33 info: Running /etc/ha.d/resource.d/drbddisk r0 stop

ResourceManager[12256]: 2010/02/27_16:10:33 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.4/24 stop

IPaddr[12435]: 2010/02/27_16:10:33 INFO: Success

heartbeat[12243]: 2010/02/27_16:10:33 info: foreign HA resource release completed (standby).

heartbeat[6301]: 2010/02/27_16:10:33 info: Local standby process completed [foreign].

heartbeat[6301]: 2010/02/27_16:10:44 WARN: 1 lost packet(s) for [node1.akme] [29:31]

heartbeat[6301]: 2010/02/27_16:10:44 info: remote resource transition completed.

heartbeat[6301]: 2010/02/27_16:10:44 info: No pkts missing from node1.akme!

heartbeat[6301]: 2010/02/27_16:10:44 info: Other node completed standby takeover of foreign resources.

heartbeat[6301]: 2010/02/27_16:11:15 info: node1.akme wants to go standby [foreign]

heartbeat[6301]: 2010/02/27_16:11:15 info: standby: acquire [foreign] resources from node1.akme

heartbeat[12611]: 2010/02/27_16:11:15 info: acquire local HA resources (standby).

heartbeat[12611]: 2010/02/27_16:11:15 info: local HA resource acquisition completed (standby).

heartbeat[6301]: 2010/02/27_16:11:15 info: Standby resource acquisition done [foreign].

heartbeat[6301]: 2010/02/27_16:11:16 info: remote resource transition completed.

heartbeat[6301]: 2010/02/27_16:16:35 info: Heartbeat shutdown in progress. (6301)

heartbeat[13350]: 2010/02/27_16:16:35 info: Giving up all HA resources.

ResourceManager[13363]: 2010/02/27_16:16:35 info: Releasing resource group: node1.akme IPaddr::192.168.1.4/24 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd0/::ext3::defaults httpd

ResourceManager[13363]: 2010/02/27_16:16:35 info: Running /etc/rc.d/init.d/httpd stop

ResourceManager[13363]: 2010/02/27_16:16:35 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/drbd0/ ext3 defaults stop

Filesystem[13442]: 2010/02/27_16:16:35 INFO: Running stop for /dev/drbd0 on /mnt/drbd0

Filesystem[13431]: 2010/02/27_16:16:35 INFO: Success

ResourceManager[13363]: 2010/02/27_16:16:35 info: Running /etc/ha.d/resource.d/drbddisk r0 stop

ResourceManager[13363]: 2010/02/27_16:16:35 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.4/24 stop

IPaddr[13542]: 2010/02/27_16:16:35 INFO: Success

heartbeat[13350]: 2010/02/27_16:16:35 info: All HA resources relinquished.

heartbeat[6301]: 2010/02/27_16:16:36 info: killing /usr/lib/heartbeat/ipfail process group 6324 with signal 15

heartbeat[6301]: 2010/02/27_16:16:38 info: killing HBREAD process 6305 with signal 15

heartbeat[6301]: 2010/02/27_16:16:38 info: killing HBFIFO process 6303 with signal 15

heartbeat[6301]: 2010/02/27_16:16:38 info: killing HBWRITE process 6304 with signal 15

heartbeat[6301]: 2010/02/27_16:16:38 info: Core process 6303 exited. 3 remaining

heartbeat[6301]: 2010/02/27_16:16:38 info: Core process 6304 exited. 2 remaining

heartbeat[6301]: 2010/02/27_16:16:38 info: Core process 6305 exited. 1 remaining

heartbeat[6301]: 2010/02/27_16:16:38 info: node2.akme Heartbeat shutdown complete.

ipfail[6324]: 2010/02/27_16:07:33 ERROR: Cannot chdir to [/var/lib/heartbeat/cores]: No such file or directory

heartbeat[6327]: 2010/02/27_16:07:33 info: local HA resource acquisition completed (standby).

heartbeat[6301]: 2010/02/27_16:07:33 info: Standby resource acquisition done [foreign].

heartbeat[6301]: 2010/02/27_16:07:33 info: Initial resource acquisition complete (auto_failback)

heartbeat[6301]: 2010/02/27_16:07:33 info: remote resource transition completed.

heartbeat[6301]: 2010/02/27_16:07:53 info: node1.akme wants to go standby [foreign]

heartbeat[6301]: 2010/02/27_16:07:54 info: standby: acquire [foreign] resources from node1.akme

heartbeat[6423]: 2010/02/27_16:07:54 info: acquire local HA resources (standby).

heartbeat[6423]: 2010/02/27_16:07:54 info: local HA resource acquisition completed (standby).

heartbeat[6301]: 2010/02/27_16:07:54 info: Standby resource acquisition done [foreign].

heartbeat[6301]: 2010/02/27_16:07:54 info: remote resource transition completed.

heartbeat[6301]: 2010/02/27_16:08:04 info: node1.akme wants to go standby [foreign]

heartbeat[6301]: 2010/02/27_16:08:04 info: standby: acquire [foreign] resources from node1.akme

heartbeat[6481]: 2010/02/27_16:08:04 info: acquire local HA resources (standby).

heartbeat[6481]: 2010/02/27_16:08:04 info: local HA resource acquisition completed (standby).

heartbeat[6301]: 2010/02/27_16:08:04 info: Standby resource acquisition done [foreign].

heartbeat[6301]: 2010/02/27_16:08:05 info: remote resource transition completed.

heartbeat[6301]: 2010/02/27_16:09:39 info: Received shutdown notice from 'node1.akme'.

heartbeat[6301]: 2010/02/27_16:09:39 info: Resources being acquired from node1.akme.

heartbeat[10101]: 2010/02/27_16:09:39 info: acquire local HA resources (standby).

heartbeat[10101]: 2010/02/27_16:09:39 info: local HA resource acquisition completed (standby).

heartbeat[6301]: 2010/02/27_16:09:39 info: Standby resource acquisition done [foreign].

heartbeat[10102]: 2010/02/27_16:09:39 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys node2.akme] to acquire.

harc[10127]: 2010/02/27_16:09:39 info: Running /etc/ha.d/rc.d/status status

mach_down[10143]: 2010/02/27_16:09:39 info: Taking over resource group IPaddr::192.168.1.4/24

ResourceManager[10175]: 2010/02/27_16:09:39 info: Acquiring resource group: node1.akme IPaddr::192.168.1.4/24 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd0/::ext3::defaults httpd

IPaddr[10234]: 2010/02/27_16:09:40 INFO: Resource is stopped

ResourceManager[10175]: 2010/02/27_16:09:40 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.4/24 start

IPaddr[10326]: 2010/02/27_16:09:40 INFO: Using calculated nic for 192.168.1.4: eth1

IPaddr[10326]: 2010/02/27_16:09:40 INFO: Using calculated netmask for 192.168.1.4: 255.255.255.0

IPaddr[10326]: 2010/02/27_16:09:40 INFO: eval ifconfig eth1:0 192.168.1.4 netmask 255.255.255.0 broadcast 192.168.1.255

IPaddr[10300]: 2010/02/27_16:09:40 INFO: Success

ResourceManager[10175]: 2010/02/27_16:09:40 info: Running /etc/ha.d/resource.d/drbddisk r0 start

Filesystem[10623]: 2010/02/27_16:09:45 INFO: Resource is stopped

daevaorn
(01.03.10 11:55:25 MSK) topic author

In reply to: comment by daevaorn 01.03.10 11:55:25 MSK

Note that the settings on both nodes are absolutely identical.

daevaorn
(01.03.10 11:57:05 MSK) topic author

In reply to: comment by daevaorn 01.03.10 11:53:36 MSK

1) heartbeat[6305]: 2010/02/27_16:07:31 ERROR: Cannot chdir to [/var/lib/heartbeat/cores]: No such file or directory
What is going on with this directory?

2)
ResourceManager[10175]: 2010/02/27_16:09:46 info: Running /etc/rc.d/init.d/httpd start
ResourceManager[10175]: 2010/02/27_16:09:46 ERROR: Return code 1 from /etc/rc.d/init.d/httpd
ResourceManager[10175]: 2010/02/27_16:09:46 CRIT: Giving up resources due to failure of httpd

You need to check all the symbolic links and settings.
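
Lines like the ones quoted above are easy to miss in a long ha.log. As a rough sketch (the function name `serious_entries` is made up for illustration, and the regular expression assumes the `process[pid]: date_time LEVEL: message` layout seen in this thread), the ERROR and CRIT entries can be pulled out like this:

```python
import re

# Layout assumed from the log excerpts in this thread:
#   process[pid]: YYYY/MM/DD_HH:MM:SS LEVEL: message
LOG_LINE = re.compile(
    r"^(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Za-z]+):\s+(?P<msg>.*)$"
)
SERIOUS = {"ERROR", "CRIT"}


def serious_entries(lines):
    """Return (timestamp, process, level, message) for ERROR/CRIT lines."""
    found = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        # Lines without a level (e.g. hb_standby output) simply do not match.
        if m and m.group("level").upper() in SERIOUS:
            found.append((m.group("ts"), m.group("proc"),
                          m.group("level"), m.group("msg")))
    return found


# A few lines copied from the log posted in this thread.
sample = [
    "heartbeat[6301]: 2010/02/27_16:07:30 info: Local status now set to: 'up'",
    "heartbeat[6305]: 2010/02/27_16:07:31 ERROR: Cannot chdir to [/var/lib/heartbeat/cores]: No such file or directory",
    "hb_standby[7374]: 2010/02/19_22:35:21 Going standby [foreign].",
    "ResourceManager[10175]: 2010/02/27_16:09:46 ERROR: Return code 1 from /etc/rc.d/init.d/httpd",
    "ResourceManager[10175]: 2010/02/27_16:09:46 CRIT: Giving up resources due to failure of httpd",
    "Filesystem[10695]: 2010/02/27_16:09:45 INFO: Success",
]

problems = serious_entries(sample)
for ts, proc, level, msg in problems:
    print(f"{ts} {level:5} {proc}: {msg}")
```

To run it against the real file, pass `open("/var/log/ha.log")` instead of `sample`.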

SlavikSS ★★
(01.03.10 12:38:26 MSK)

In reply to: comment by SlavikSS 01.03.10 12:38:26 MSK

On the first point: I checked, and these directories do not exist on either node. I also checked the test cluster model, and they are missing there too.

On the second point: I checked, found the error, and fixed it.

daevaorn
(01.03.10 12:49:43 MSK) topic author

In reply to: comment by daevaorn 01.03.10 12:49:43 MSK

I have partly figured it out... the problem has become simpler. Now I have:

1. With both nodes running, Apache (httpd) does not work, same as before.

2. When node 1 is rebooted, node 2 calmly takes over all the processes and everything pings fine. When node 1 comes back up, node 2 calmly hands everything back and goes to slave. But for some reason node 1 boots as slave and does not switch to master. The result is slave/slave.
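
If "slave/slave" here means that DRBD reports Secondary/Secondary (an assumption; the thread does not show any DRBD status output), the state can be confirmed by reading /proc/drbd. The sketch below is hypothetical: the sample status line is illustrative rather than taken from these nodes, and the pattern accepts both the `ro:` field of DRBD 8.3 and the `st:` field of earlier 8.x releases.

```python
import re

# Per-resource status line of /proc/drbd, e.g.
#   " 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r----"
# Older 8.x releases print "st:" where 8.3 prints "ro:".
STATUS = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+"
    r"(?:ro|st):(?P<local>\w+)/(?P<peer>\w+)"
)


def drbd_roles(proc_drbd_text):
    """Return {minor: (connection_state, local_role, peer_role)}."""
    roles = {}
    for line in proc_drbd_text.splitlines():
        m = STATUS.match(line)
        if m:
            roles[int(m.group("minor"))] = (
                m.group("cs"), m.group("local"), m.group("peer"))
    return roles


def without_primary(roles):
    """Minor numbers of devices where neither side is Primary."""
    return [minor for minor, (_, local, peer) in roles.items()
            if "Primary" not in (local, peer)]


# Illustrative sample only, not output from the nodes in this thread.
SAMPLE = " 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r----\n"

roles = drbd_roles(SAMPLE)
stuck = without_primary(roles)
print("roles:", roles)
print("devices with no Primary:", stuck)
```

On a real node the input would be `open("/proc/drbd").read()`. A device listed in `stuck` while heartbeat believes the resources are held by node 1 would suggest that the drbddisk step of the resource chain never ran or failed there.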

daevaorn
(01.03.10 14:58:27 MSK) topic author
