F/O behavior when a NIC configured with IPaddr2 is stopped (Linux-ha-jp) - Linux-HA Japan

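Yamauchi-san,

This is Takadate.
Thank you for checking.

Since F/O completes normally after ifdown in your separate environment,
the cause is likely either a crm setting other than IPaddr2 or an environment setting such as the OS.
Thank you for the valuable information.

I am adding the resource agent version to the environment list.
I have also attached the ha-log from the point where ifdown was executed.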
> [Environment]
> CentOS 5.4 (2.6.18-308)
> pacemaker-1.0.12-1.el5
> heartbeat-3.0.5-1.1.el5
resource-agents-3.9.2-90.el5

The full crm configuration is shown below.
It makes the hadoop master node redundant using DRBD.
----------------------------------------------------------------
primitive res_drbd0 ocf:linbit:drbd \
        params drbd_resource="nn00" drbdconf="/etc/drbd.conf" \
        op monitor interval="20s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive res_filesystem ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" fstype="ext3" directory="/drbd/ram" \
        op monitor interval="20s" timeout="40s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive res_jobtracker ocf:hadoop:jobtracker \
        params hadoop_name="jobtracker" hadoop_home="/usr/lib/hadoop" \
        op monitor interval="20s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s"
primitive res_namenode ocf:hadoop:namenode \
        params hadoop_name="namenode" hadoop_home="/usr/lib/hadoop" \
        op monitor interval="20s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s"
primitive res_vip ocf:heartbeat:IPaddr2 \
        params ip="172.17.8.10" cidr_netmask="26" nic="eth0" \
        op monitor interval="10s" timeout="60s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s"
primitive res_vip_chk ocf:heartbeat:VIPcheck \
        params target_ip="172.17.8.10" count="1" wait="10" \
        op start interval="0" timeout="90s" start_delay="4s" \
        op stop interval="0" timeout="60s"
group rg_hadoop_nn00 res_vip_chk res_vip res_filesystem res_namenode res_jobtracker
ms ms_drbd0 res_drbd0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation c_rg_hadoop_nn00_on_drbd0 inf: rg_hadoop_nn00 ms_drbd0:Master
order o_drbd_before_rg_hadoop_nn00 inf: ms_drbd0:promote rg_hadoop_nn00:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.12-066152e" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1342056004"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"
----------------------------------------------------------------

[crm_mon]
----------------------------------------------------------------
============
Last updated: Fri Jul 13 10:20:58 2012
Stack: Heartbeat
Current DC: node02 (f29d7a86-b8ed-4ac8-8852-3c601ed5330b) - partition with quorum
Version: 1.0.12-066152e
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ node02 node01 ]

 Master/Slave Set: ms_drbd0
     Masters: [ node01 ]
     Slaves: [ node02 ]
 Resource Group: rg_hadoop_nn00
     res_vip_chk        (ocf::heartbeat:VIPcheck):      Started node01
     res_vip    (ocf::heartbeat:IPaddr2):       Stopped
     res_filesystem     (ocf::heartbeat:Filesystem):    Stopped
     res_namenode       (ocf::hadoop:namenode): Stopped
     res_jobtracker     (ocf::hadoop:jobtracker):       Stopped

Node Attributes:
* Node node02:
    + master-res_drbd0:1                : 10
    + node01-eth1                 : up
* Node node01:
    + master-res_drbd0:0                : 10000
    + node02-eth1                 : up

Failed actions:
    res_vip_monitor_10000 (node=node01, call=51, rc=6, status=complete): not configured
----------------------------------------------------------------
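The res_vip_monitor result seems to differ between the two environments: "not configured" (rc=6) here versus "not running" (rc=7) in yours.
The hadoop processes (res_namenode, res_jobtracker) were stopped normally,
and drbd is still running normally because processing never reached its stop step.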
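
For reference, the two result codes seen in the Failed actions output can be put side by side. This is only a minimal shell sketch, assuming the standard OCF return code numbering; the function name ocf_rc_name is hypothetical and is not part of resource-agents.

```shell
#!/bin/sh
# Map an OCF resource agent exit code to its symbolic name.
# ocf_rc_name is a hypothetical helper used only for illustration.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;   # reported by crm_mon as "not configured"
        7) echo "OCF_NOT_RUNNING" ;;      # reported by crm_mon as "not running"
        *) echo "unknown" ;;
    esac
}

ocf_rc_name 6   # monitor result on node01 in this environment
ocf_rc_name 7   # monitor result on rh62-test1 in the other environment
```

Pacemaker's documentation (Pacemaker Explained) classifies OCF_ERR_CONFIGURED as a fatal error, so this difference may be related to the stalled F/O. That link is a guess from the return codes and is not confirmed by the logs.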

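After that, suspecting a problem in the protection against double start on a stop failure,
I changed on-fail for stop to "restart" and other values, but the behavior did not change.
The ha-log shows that stop succeeds and that processing simply does not move on to the following steps,
so this setting appears to be unrelated.

That is all.
Thank you in advance.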

2012/7/12 <renay****@ybb*****>

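> Takadate-san,
>
> Hello, this is Yamauchi.
>
> The versions differ slightly, but I tried a similar configuration with Pacemaker 1.0.11 on RHEL 6.2.
> After starting the resources on ACT (rh62-test1), taking eth0 down with ifdown resulted in a normal F/O.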
>
> ============
> Last updated: Thu Jul 12 23:33:55 2012
> Stack: Heartbeat
> Current DC: rh62-test2 (47dc4202-35d8-461b-b8d6-2af59eee98e5) - partition with quorum
> Version: 1.0.11-unknown
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ rh62-test1 rh62-test2 ]
>
>  Resource Group: rg_group
>      res_vip_chk        (ocf::heartbeat:VIPcheck):      Started rh62-test2
>      res_vip    (ocf::heartbeat:IPaddr2):       Started rh62-test2
>      res_filesystem     (ocf::heartbeat:Dummy): Started rh62-test2
>
> Migration summary:
> * Node rh62-test2:
> * Node rh62-test1:
>    res_vip: migration-threshold=1 fail-count=1
>
> Failed actions:
>     res_vip_monitor_10000 (node=rh62-test1, call=7, rc=7, status=complete): not running
>
>
>
> If you do not mind, could you also share the versions of resource-agents and Reusable that you are using, any other logs, and the complete crm file you used?
>
> For reference, the simplified crm file I used is below.
>
> ### Cluster Option ###
> property no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         startup-fencing="false" \
>         stonith-timeout="710s" \
>         crmd-transition-delay="2s"
>
> ### Resource Defaults ###
> rsc_defaults resource-stickiness="INFINITY" \
>         migration-threshold="1"
>
> ### Primitive Configuration ###
> primitive res_filesystem ocf:heartbeat:Dummy \
>         op monitor interval="20s" timeout="40s" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s"
> primitive res_vip ocf:heartbeat:IPaddr2 \
>         params ip="192.168.40.77" cidr_netmask="24" nic="eth0" \
>         op monitor interval="10s" timeout="60s" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="120s"
> primitive res_vip_chk ocf:heartbeat:VIPcheck \
>         params target_ip="192.168.40.77" count="1" wait="10" \
>         op start interval="0" timeout="90s" start_delay="4s" \
>         op stop interval="0" timeout="60s"
> group rg_group res_vip_chk res_vip res_filesystem
>
>
> ### Resource Location ###
> location rsc_location-Dummy1-1 rg_group \
>         rule 200: #uname eq rh62-test1 \
>         rule 100: #uname eq rh62-test2
>
> That is all. Thank you in advance.
>
>
>
> --- On Wed, 2012/7/11, yosuke takadate <taten****@gmail*****> wrote:
>
> >
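> > Hello, my name is Takadate.
> >
> > I have built an HA configuration with pacemaker and am checking its
> > behavior when a NIC is stopped as a failure test. I would like to confirm one point.
> >
> >
> > [Symptom]
> > When I stopped (ifdown) the NIC on which IPaddr2 configures the VIP, F/O did not complete normally.
> > In crm_mon, it appears to stall after the IPaddr2 stop completes and before
> > the stop of the next resource begins.
> > The ha-log also shows that the findif command fails (because the NIC is down),
> > and processing does not continue after that.
> > If I delete the VIP with the ip command instead of using ifdown, findif runs
> > normally and I have confirmed that F/O occurs.
> >
> >
> > [Questions]
> > Is the behavior above expected given the crm and RA configuration?
> > Also, is there a configuration that makes F/O complete normally?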
> >
> >
> > [crm_mon]
> >  Resource Group: rg_group
> >      res_vip_chk        (ocf::heartbeat:VIPcheck):      Started node01
> >      res_vip    (ocf::heartbeat:IPaddr2):       Stopped
> >      res_filesystem     (ocf::heartbeat:Filesystem):    Stopped
> >
> >
> > [ha-log]
> > Jul 11 18:15:44 node01 lrmd: [12523]: info: cancel_op: operation monitor[22] on res_vip for client 12526, its parameters: CRM_meta_interval=[10000] ip=[172.17.8.10] cidr_netmask=[26] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] nic=[eth0]  cancelled
> > Jul 11 18:15:44 node01 crmd: [12526]: info: do_lrm_rsc_op: Performing key=2:22:0:6ea3b006-bc6c-47d0-92d2-dac7100137b0 op=res_vip_stop_0 )
> > Jul 11 18:15:44 node01 lrmd: [12523]: info: rsc:res_vip stop[43] (pid 17340)
> > Jul 11 18:15:44 node01 crmd: [12526]: info: process_lrm_event: LRM operation res_vip_monitor_10000 (call=22, status=1, cib-update=0, confirmed=true) Cancelled
> > Jul 11 18:15:44 node01 lrmd: [12523]: info: RA output: (res_vip:stop:stderr) eth0: unknown interface: No such device /usr/lib64/heartbeat/findif version 2.99.1 Copyright Alan Robertson  Usage: /usr/lib64/heartbeat/findif [-C] Options:     -C: Output netmask as the number of bits rather than as 4 octets. Environment variables: OCF_RESKEY_ip                ip address (mandatory!) OCF_RESKEY_cidr_netmask netmask of interface OCF_RESKEY_broadcast broadcast address for interface OCF_RESKEY_nic          interface to assign to
> > Jul 11 18:15:44 node01 IPaddr2(res_vip)[17340]: WARNING: [/usr/lib64/heartbeat/findif -C] failed
> > Jul 11 18:15:44 node01 lrmd: [12523]: info: operation stop[43] on res_vip for client 12526: pid 17340 exited with return code 0
> > Jul 11 18:15:44 node01 crmd: [12526]: info: process_lrm_event: LRM operation res_vip_stop_0 (call=43, rc=0, cib-update=67, confirmed=true) ok
> > (processing stops here)
> >
> >
> > [crm]
> > primitive res_filesystem ocf:heartbeat:Filesystem \
> >         params device="/dev/drbd0" fstype="ext3" directory="/drbd/ram" \
> >         op monitor interval="20s" timeout="40s" \
> >         op start interval="0" timeout="60s" \
> >         op stop interval="0" timeout="60s"
> > primitive res_vip ocf:heartbeat:IPaddr2 \
> >         params ip="172.17.8.10" cidr_netmask="26" nic="eth0" \
> >         op monitor interval="10s" timeout="60s" \
> >         op start interval="0" timeout="60s" \
> >         op stop interval="0" timeout="120s"
> > primitive res_vip_chk ocf:heartbeat:VIPcheck \
> >         params target_ip="172.17.8.10" count="1" wait="10" \
> >         op start interval="0" timeout="90s" start_delay="4s" \
> >         op stop interval="0" timeout="60s"
> > group rg_group res_vip_chk res_vip res_filesystem
> >
> >
> > [Environment]
> > CentOS 5.4 (2.6.18-308)
> > pacemaker-1.0.12-1.el5
> > heartbeat-3.0.5-1.1.el5
> >
> >
> >
> > That is all.
> > I would appreciate any information you may have.
> > Thank you in advance.
> >
> >
>
> _______________________________________________
> Linux-ha-japan mailing list
> Linux****@lists*****
> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
>
-------------- next part --------------
(ifdown executed)

Jul 13 10:12:40 node01 lrmd: [11309]: info: RA output: (res_vip:monitor:stderr) eth0: unknown interface: No such device  /usr/lib64/heartbeat/findif version 2.99.1 Copyright Alan Robertson  Usage: /usr/lib64/heartbeat/findif [-C] Options:     -C: Output netmask as the number of bits rather than as 4 octets. Environment variables: OCF_RESKEY_ip             ip address (mandatory!) OCF_RESKEY_cidr_netmask netmask of interface OCF_RESKEY_broadcast         broadcast address for interface OCF_RESKEY_nic          interface to assign to
Jul 13 10:12:40 node01 IPaddr2(res_vip)[18621]: ERROR: [/usr/lib64/heartbeat/findif -C] failed
Jul 13 10:12:40 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_monitor_10000 (call=51, rc=6, cib-update=73, confirmed=false) not configured
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_ha_callback: Update relayed from pdpnn02x
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_local_callback: Expanded fail-count-res_vip=value++ to 1
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res_vip (1)
Jul 13 10:12:41 node01 lrmd: [11309]: info: cancel_op: operation monitor[57] on res_jobtracker for client 11312, its parameters: CRM_meta_interval=[20000] hadoop_home=[/usr/lib/hadoop] CRM_meta_timeout=[30000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] hadoop_name=[jobtracker]  cancelled
Jul 13 10:12:41 node01 crmd: [11312]: info: do_lrm_rsc_op: Performing key=46:111:0:968670d1-af60-45a2-8d9b-f02cc64583cc op=res_jobtracker_stop_0 )
Jul 13 10:12:41 node01 lrmd: [11309]: info: rsc:res_jobtracker stop[58] (pid 18650)
Jul 13 10:12:41 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_jobtracker_monitor_20000 (call=57, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_perform_update: Sent update 50: fail-count-res_vip=1
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_ha_callback: Update relayed from pdpnn02x
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res_vip (1342141961)
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_perform_update: Sent update 52: last-failure-res_vip=1342141961
Jul 13 10:12:41 node01 lrmd: [11309]: info: RA output: (res_jobtracker:stop:stdout) Stopping Hadoop jobtracker daemon (hadoop-jobtracker):
Jul 13 10:12:41 node01 lrmd: [11309]: info: RA output: (res_jobtracker:stop:stdout) stopping jobtracker
Jul 13 10:12:41 node01 lrmd: [11309]: info: RA output: (res_jobtracker:stop:stdout) [  OK  ]
Jul 13 10:12:41 node01 lrmd: [11309]: info: RA output: (res_jobtracker:stop:stderr) cat: /var/run/hadoop-0.20/*jobtracker.pid: No such file or directory
Jul 13 10:12:41 node01 lrmd: [11309]: info: operation stop[58] on res_jobtracker for client 11312: pid 18650 exited with return code 0
Jul 13 10:12:41 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_jobtracker_stop_0 (call=58, rc=0, cib-update=74, confirmed=true) ok
Jul 13 10:12:43 node01 lrmd: [11309]: info: cancel_op: operation monitor[55] on res_namenode for client 11312, its parameters: CRM_meta_interval=[20000] hadoop_home=[/usr/lib/hadoop] CRM_meta_timeout=[30000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] hadoop_name=[namenode]  cancelled
Jul 13 10:12:43 node01 crmd: [11312]: info: do_lrm_rsc_op: Performing key=44:112:0:968670d1-af60-45a2-8d9b-f02cc64583cc op=res_namenode_stop_0 )
Jul 13 10:12:43 node01 lrmd: [11309]: info: rsc:res_namenode stop[59] (pid 18703)
Jul 13 10:12:43 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_namenode_monitor_20000 (call=55, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stdout) Stopping Hadoop namenode daemon (hadoop-namenode):
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stdout) stopping namenode
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stdout) [  OK
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stdout) ]
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stderr) cat: /var/run/hadoop-0.20/*namenode.pid: No such file or directory
Jul 13 10:12:43 node01 lrmd: [11309]: info: operation stop[59] on res_namenode for client 11312: pid 18703 exited with return code 0
Jul 13 10:12:43 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_namenode_stop_0 (call=59, rc=0, cib-update=75, confirmed=true) ok
Jul 13 10:12:43 node01 lrmd: [11309]: info: cancel_op: operation monitor[53] on res_filesystem for client 11312, its parameters: CRM_meta_interval=[20000] directory=[/drbd/ram] fstype=[ext3] device=[/dev/drbd0] CRM_meta_timeout=[40000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
Jul 13 10:12:43 node01 crmd: [11312]: info: do_lrm_rsc_op: Performing key=43:112:0:968670d1-af60-45a2-8d9b-f02cc64583cc op=res_filesystem_stop_0 )
Jul 13 10:12:43 node01 lrmd: [11309]: info: rsc:res_filesystem stop[60] (pid 18759)
Jul 13 10:12:43 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_filesystem_monitor_20000 (call=53, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:43 node01 Filesystem(res_filesystem)[18759]: INFO: Running stop for /dev/drbd0 on /drbd/ram
Jul 13 10:12:43 node01 Filesystem(res_filesystem)[18759]: INFO: Trying to unmount /drbd/ram
Jul 13 10:12:43 node01 Filesystem(res_filesystem)[18759]: INFO: unmounted /drbd/ram successfully
Jul 13 10:12:43 node01 lrmd: [11309]: info: operation stop[60] on res_filesystem for client 11312: pid 18759 exited with return code 0
Jul 13 10:12:43 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_filesystem_stop_0 (call=60, rc=0, cib-update=76, confirmed=true) ok
Jul 13 10:12:45 node01 lrmd: [11309]: info: cancel_op: operation monitor[51] on res_vip for client 11312, its parameters: CRM_meta_interval=[10000] ip=[172.17.8.10] cidr_netmask=[26] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] nic=[eth0]  cancelled
Jul 13 10:12:45 node01 crmd: [11312]: info: do_lrm_rsc_op: Performing key=4:112:0:968670d1-af60-45a2-8d9b-f02cc64583cc op=res_vip_stop_0 )
Jul 13 10:12:45 node01 lrmd: [11309]: info: rsc:res_vip stop[61] (pid 18836)
Jul 13 10:12:45 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_monitor_10000 (call=51, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:45 node01 lrmd: [11309]: info: RA output: (res_vip:stop:stderr) eth0: unknown interface: No such device  /usr/lib64/heartbeat/findif version 2.99.1 Copyright Alan Robertson  Usage: /usr/lib64/heartbeat/findif [-C] Options:     -C: Output netmask as the number of bits rather than as 4 octets. Environment variables: OCF_RESKEY_ip                ip address (mandatory!) OCF_RESKEY_cidr_netmask netmask of interface OCF_RESKEY_broadcast         broadcast address for interface OCF_RESKEY_nic          interface to assign to
Jul 13 10:12:45 node01 IPaddr2(res_vip)[18836]: WARNING: [/usr/lib64/heartbeat/findif -C] failed
Jul 13 10:12:45 node01 lrmd: [11309]: info: operation stop[61] on res_vip for client 11312: pid 18836 exited with return code 0
Jul 13 10:12:45 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_stop_0 (call=61, rc=0, cib-update=77, confirmed=true) ok

(processing does not proceed from here)
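
To follow the order of the stop operations in a log like the one above, it can help to keep only the LRM operation results. This is a minimal shell sketch, assuming the crmd log line format shown in this ha-log; the function name lrm_results is hypothetical, and the log path in the usage comment is an assumption that depends on the local logging configuration.

```shell
#!/bin/sh
# Read an ha-log stream on stdin and print only the LRM operation results
# reported by crmd, with the timestamp and process prefix removed.
# lrm_results is a hypothetical helper name used for illustration.
lrm_results() {
    grep 'process_lrm_event: LRM operation' | sed 's/^.*LRM operation //'
}

# Usage (the path is an assumption; adjust it to the local log file):
#   lrm_results < /var/log/ha-log
```

With the log above, the last line printed would be the res_vip_stop_0 result with rc=0, which matches the observation that stop succeeds and nothing follows.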
