[Linux-ha-jp] Error when starting pm_diskd?


Shinichiro Wada wada.****@jp*****
Mon, 12 Mar 2012 21:59:52 JST


Hello,
this is Wada.

Thank you, as always, for your help.

We run disk monitoring with pm_diskd in an Active/Passive configuration,
but an error seems to have occurred and the resource failed to start.

Looking at the log (though I do not fully understand it...), it appears that
the monitor operation ran before startup had completed, and the resource was
judged "not running".

Also, from reading the source, the diskd daemon appears to exit in the parent
right after forking. So unless the diskd script's start action waits for the
pid file to be created, I am concerned that this kind of race could occur
depending on timing. Is the monitor operation executed as soon as start
completes, without waiting for the interval to elapse?
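
For illustration, the kind of wait I have in mind in the start action would
look roughly like the sketch below. This is only a rough idea, not the actual
pm_diskd code; the pid file path is taken from the "Invoked:" log line below,
and the retry count is arbitrary.

-------------------------------------------------------------------------

# Sketch only -- not the shipped pm_diskd resource agent.
PIDFILE="/var/run/diskd-diskd_set"

wait_for_pidfile() {
    retries=10
    while [ "$retries" -gt 0 ]; do
        # Succeed once the daemon has written its pid file and the
        # process it names is actually alive.
        if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
            return 0
        fi
        sleep 1
        retries=$((retries - 1))
    done
    return 1    # timed out: start should then report a failure
}

# In the start action, after launching the daemon:
#   wait_for_pidfile || exit $OCF_ERR_GENERIC

-------------------------------------------------------------------------

If monitor can indeed run as soon as start returns, waiting like this in
start should close that window.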

When this happened, diskd never started, and nothing was shown under
Failed Actions either.
Is this the correct behavior?
Also, as noted at the end of this mail, I would appreciate advice on whether
there is anything in the configuration that should be reviewed.

The pm_diskd version is 1.0.1.
An excerpt of the log from startup follows.

-------------------------------------------------------------------------

Mar 12 18:00:03 it13 diskd: [21118]: info: Invoked: /usr/lib64/heartbeat/diskd -D -p /var/run//diskd-diskd_set -a diskd_set -i 30 -N /dev/sda1
Mar 12 18:00:03 it13 crmd: [20769]: info: process_lrm_event: LRM operation prmDiskd:0_start_0 (call=16, rc=0, cib-update=44, confirmed=true) ok
Mar 12 18:00:03 it13 crmd: [20769]: info: match_graph_event: Action prmDiskd:0_start_0 (53) confirmed on it13 (rc=0)
Mar 12 18:00:03 it13 crmd: [20769]: info: te_rsc_command: Initiating action 54: monitor prmDiskd:0_monitor_10000 on it13 (local)
Mar 12 18:00:03 it13 crmd: [20769]: info: do_lrm_rsc_op: Performing key=54:1:0:334535ec-732d-47d4-ac94-98cc23fd5911 op=prmDiskd:0_monitor_10000 )
Mar 12 18:00:03 it13 lrmd: [20766]: info: rsc:prmDiskd:0:19: monitor
Mar 12 18:00:03 it13 crmd: [20769]: info: process_lrm_event: LRM operation prmDiskd:0_monitor_10000 (call=19, rc=7, cib-update=45, confirmed=false) not running
Mar 12 18:00:03 it13 crmd: [20769]: WARN: status_from_rc: Action 54 (prmDiskd:0_monitor_10000) on it13 failed (target: 0 vs. rc: 7): Error
Mar 12 18:00:03 it13 crmd: [20769]: WARN: update_failcount: Updating failcount for prmDiskd:0 on it13 after failed monitor: rc=7 (update=value++, time=1331542803)
Mar 12 18:00:03 it13 crmd: [20769]: info: abort_transition_graph: match_graph_event:291 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=prmDiskd:0_monitor_10000, magic=0:7;54:1:0:334535ec-732d-47d4-ac94-98cc23fd5911, cib=0.64.32) : Event failed
Mar 12 18:00:03 it13 crmd: [20769]: info: update_abort_priority: Abort priority upgraded from 0 to 1
Mar 12 18:00:03 it13 crmd: [20769]: info: update_abort_priority: Abort action done superceeded by restart
Mar 12 18:00:03 it13 attrd: [20767]: info: find_hash_entry: Creating hash entry for fail-count-prmDiskd:0
Mar 12 18:00:03 it13 crmd: [20769]: info: match_graph_event: Action prmDiskd:0_monitor_10000 (54) confirmed on it13 (rc=4)
Mar 12 18:00:03 it13 attrd: [20767]: info: attrd_local_callback: Expanded fail-count-prmDiskd:0=value++ to 1
Mar 12 18:00:03 it13 attrd: [20767]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prmDiskd:0 (1)
Mar 12 18:00:03 it13 diskd: [21144]: info: attrd_lazy_update: Connecting to cluster... 5 retries remaining
Mar 12 18:00:03 it13 diskd: [21144]: info: main: Starting diskd
Mar 12 18:00:03 it13 crmd: [20769]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=0, tag=nvpair, id=status-it13-fail-count-prmDiskd:0, magic=NA, cib=0.64.33) : Transient attribute: update
Mar 12 18:00:03 it13 crmd: [20769]: info: update_abort_priority: Abort priority upgraded from 1 to 1000000
Mar 12 18:00:03 it13 crmd: [20769]: info: update_abort_priority: 'Event failed' abort superceeded
Mar 12 18:00:03 it13 attrd: [20767]: info: attrd_perform_update: Sent update 20: last-failure-prmDiskd:0=1331542803
Mar 12 18:00:03 it13 attrd: [20767]: info: find_hash_entry: Creating hash entry for diskd_set
Mar 12 18:00:03 it13 attrd: [20767]: info: attrd_trigger_update: Sending flush op to all hosts for: diskd_set (normal)
Mar 12 18:00:03 it13 crmd: [20769]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=0, tag=nvpair, id=status-it13-last-failure-prmDiskd:0, magic=NA, cib=0.64.34) : Transient attribute: update
Mar 12 18:00:03 it13 attrd: [20767]: info: attrd_perform_update: Sent update 23: diskd_set=normal
Mar 12 18:00:03 it13 crmd: [20769]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=0, tag=nvpair, id=status-it13-diskd_set, magic=NA, cib=0.64.35) : Transient attribute: update
Mar 12 18:00:03 it13 lrmd: [20766]: info: RA output: (prmDrbd:0:start:stdout)
Mar 12 18:00:03 it13 lrmd: [20766]: info: RA output: (prmDrbd:0:start:stdout)
Mar 12 18:00:03 it13 lrmd: [20766]: info: RA output: (prmDrbd:0:start:stdout)
Mar 12 18:00:04 it13 lrmd: [20766]: info: RA output: (prmDrbd:0:start:stdout)
Mar 12 18:00:04 it13 lrmd: [20766]: info: RA output: (prmDrbd:0:start:stdout)
Mar 12 18:00:04 it13 lrmd: [20766]: info: RA output: (prmDrbd:0:start:stdout)
Mar 12 18:00:04 it13 crmd: [20769]: info: process_lrm_event: LRM operation prmDrbd:0_start_0 (call=18, rc=0, cib-update=46, confirmed=true) ok
Mar 12 18:00:04 it13 crmd: [20769]: info: match_graph_event: Action prmDrbd:0_start_0 (25) confirmed on it13 (rc=0)
(snip)
Mar 12 18:00:17 it13 crmd: [20769]: info: do_lrm_rsc_op: Performing key=111:2:0:334535ec-732d-47d4-ac94-98cc23fd5911 op=prmDrbd:0_notify_0 )
Mar 12 18:00:17 it13 lrmd: [20766]: info: rsc:prmDrbd:0:24: notify
Mar 12 18:00:17 it13 crmd: [20769]: info: te_rsc_command: Initiating action 113: notify prmDrbd:1_pre_notify_promote_0 on it14
Mar 12 18:00:17 it13 crmd: [20769]: info: te_rsc_command: Initiating action 73: stop prmDiskd:0_stop_0 on it13 (local)
Mar 12 18:00:17 it13 lrmd: [20766]: info: cancel_op: operation monitor[19] on ocf::diskd::prmDiskd:0 for client 20769, its parameters: CRM_meta_clone=[0] device=[/dev/sda1] name=[diskd_set] CRM_meta_clone_node_max=[1] CRM_meta_clone_max=[2] CRM_meta_notify=[false] CRM_meta_globally_unique=[false] crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] CRM_meta_interval=[10000] CRM_meta_timeout=[60000]  cancelled
Mar 12 18:00:17 it13 crmd: [20769]: info: do_lrm_rsc_op: Performing key=73:2:0:334535ec-732d-47d4-ac94-98cc23fd5911 op=prmDiskd:0_stop_0 )
Mar 12 18:00:17 it13 lrmd: [20766]: info: rsc:prmDiskd:0:25: stop
Mar 12 18:00:17 it13 crmd: [20769]: info: process_lrm_event: LRM operation prmDiskd:0_monitor_10000 (call=19, status=1, cib-update=0, confirmed=true) Cancelled
Mar 12 18:00:17 it13 pengine: [20768]: info: process_pe_message: Transition 2: PEngine Input stored in: /var/lib/pengine/pe-input-13774.bz2
Mar 12 18:00:17 it13 diskd: [21144]: info: diskd_shutdown: Exiting
Mar 12 18:00:17 it13 diskd: [21144]: info: main: Exiting diskd

-------------------------------------------------------------------------

The configuration is almost the same as when I asked a question here before:
Corosync + Pacemaker + DRBD, as shown below.

-------------------------------------------------------------------------

primitive drbd_db ocf:linbit:drbd \
         params drbd_resource="pgsql" \
         op start interval="0s" timeout="240s" on-fail="restart" \
         op monitor interval="11s" timeout="60s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="restart" role="Master" \
         op stop interval="0s" timeout="100s" on-fail="fence"

primitive ip_db ocf:heartbeat:IPaddr2 \
         params ip="192.168.1.175" \
                 nic="eth1" \
                 cidr_netmask="24" \
         op start interval="0s" timeout="90s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="restart" \
         op stop interval="0s" timeout="100s" on-fail="fence"

primitive prmPing ocf:pacemaker:ping \
         params \
                 name="ping_set" \
                 host_list="192.168.1.1 192.168.2.1" \
                 multiplier="100" \
                 dampen="0" \
         meta \
                 migration-threshold="3" \
                 failure-timeout="60s" \
         op start interval="0s" timeout="90s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="restart" \
         op stop interval="0s" timeout="100s" on-fail="ignore"

primitive fs_db ocf:heartbeat:Filesystem \
         params device="/dev/drbd/by-res/pgsql" directory="/data" fstype="ext4" \
         op start interval="0s" timeout="60s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="restart" \
         op stop interval="0s" timeout="60s" on-fail="fence"

primitive prmPg ocf:heartbeat:pgsql \
         params pgctl="/usr/bin/pg_ctl" \
         start_opt="-p 5432" \
         psql="/usr/bin/psql" \
         pgdata="/data/" \
         pgdba="postgres" \
         pgport="5432" \
         pgdb="postgres" \
         op start interval="0s" timeout="120s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="restart" \
         op stop interval="0s" timeout="120s" on-fail="fence"

primitive apache ocf:heartbeat:apache \
         params configfile="/etc/httpd/conf/httpd.conf" \
         port="80" \
         op start interval="0s" timeout="40s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="restart" \
         op stop interval="0s" timeout="60s" on-fail="fence"

primitive prmDiskd ocf:pacemaker:diskd \
         params name="diskd_set" \
         device="/dev/sda1" \
         op start interval="0s" timeout="60s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="restart" \
         op stop interval="0s" timeout="60s" on-fail="ignore"

primitive prmStonith1-1 stonith:external/stonith-helper \
	params \
		priority="1" \
		stonith-timeout="60s" \
		hostlist="it13" \
		dead_check_target="192.168.1.173" \
		run_standby_wait="no" \
	op start interval="0s" timeout="60s" \
	op monitor interval="3600s" timeout="60s" \
	op stop interval="0s" timeout="60s"

primitive prmStonith1-2 stonith:external/ssh \
	params \
		priority="2" \
		stonith-timeout="60s" \
		hostlist="it13" \
	op start interval="0s" timeout="60s" \
	op monitor interval="3600s" timeout="60s" \
	op stop interval="0s" timeout="60s"

primitive prmStonith1-3 stonith:meatware \
	params \
		priority="3" \
		stonith-timeout="600" \
		hostlist="it13" \
	op start interval="0s" timeout="60s" \
	op monitor interval="3600s" timeout="60s" \
	op stop interval="0s" timeout="60s"

primitive prmStonith2-1 stonith:external/stonith-helper \
	params \
		priority="1" \
		stonith-timeout="60s" \
		hostlist="it14" \
		dead_check_target="192.168.1.174" \
		run_standby_wait="no" \
	op start interval="0s" timeout="60s" \
	op monitor interval="3600s" timeout="60s" \
	op stop interval="0s" timeout="60s"

primitive prmStonith2-2 stonith:external/ssh \
	params \
		priority="2" \
		stonith-timeout="60s" \
		hostlist="it14" \
	op start interval="0s" timeout="60s" \
	op monitor interval="3600s" timeout="60s" \
	op stop interval="0s" timeout="60s"

primitive prmStonith2-3 stonith:meatware \
	params \
		priority="3" \
		stonith-timeout="600" \
		hostlist="it14" \
	op start interval="0s" timeout="60s" \
	op monitor interval="3600s" timeout="60s" \
	op stop interval="0s" timeout="60s"

group group_all fs_db ip_db prmPg apache

group grpStonith1 \
	prmStonith1-1 \
	prmStonith1-2 \
	prmStonith1-3

group grpStonith2 \
	prmStonith2-1 \
	prmStonith2-2 \
	prmStonith2-3

ms ms_drbd_db drbd_db \
         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"

clone clnPing prmPing \
         meta clone-max="2" clone-node-max="1"

clone clnDiskd prmDiskd \
         meta clone-max="2" clone-node-max="1"

location group_all-location group_all \
         rule 200: #uname eq it13 \
         rule 100: #uname eq it14 \
         rule -INFINITY: defined ping_set and ping_set lt 200 \
         rule -INFINITY: defined diskd_set and diskd_set eq SUCCESS

location master-location_db ms_drbd_db \
         rule 200: #uname eq it13 \
         rule 100: #uname eq it14 \
         rule role=master -INFINITY: defined ping_set and ping_set lt 200 \
         rule role=master -INFINITY: defined diskd_set and diskd_set eq SUCCESS \
         rule role=master -INFINITY: defined fail-count-fs_db \
         rule role=master -INFINITY: defined fail-count-ip_db \
         rule role=master -INFINITY: defined fail-count-prmPg \
         rule role=master -INFINITY: defined fail-count-apache

location rsc_location-grpStonith1-1 grpStonith1 \
	rule -INFINITY: #uname eq it13

location rsc_location-grpStonith2-1 grpStonith2 \
	rule -INFINITY: #uname eq it14

colocation db_on_drbd INFINITY: group_all ms_drbd_db:Master
colocation clnPing-colocation INFINITY: group_all clnPing
colocation clnDiskd-colocation INFINITY: group_all clnDiskd
order order_db_after_drbd INFINITY: ms_drbd_db:promote group_all:start
order order_clnPing_after_all 0: clnPing group_all symmetrical=false
order order_clnDiskd_after_all 0: clnDiskd group_all symmetrical=false

property no-quorum-policy="ignore" \
	stonith-enabled="true" \
         startup-fencing="false" \
         stonith-timeout="430s"

rsc_defaults resource-stickiness="INFINITY" \
         migration-threshold="1"

-------------------------------------------------------------------------

Thank you in advance.




