yosuke takadate
taten****@gmail*****
Fri Jul 13 11:05:49 JST 2012
Yamauchi-san,

This is Takadate.
Thank you for checking.

Since failover (F/O) works correctly after ifdown in your separate environment,
the cause is likely either a crm setting other than IPaddr2 or an environment
setting such as the OS.
Thank you for the valuable information.

I have added the resource agent version below.
I have also attached the ha-log starting from the ifdown.

> [Environment]
> CentOS 5.4 (2.6.18-308)
> pacemaker-1.0.12-1.el5
> heartbeat-3.0.5-1.1.el5
resource-agents-3.9.2-90.el5

The full crm configuration is as follows.
It makes the Hadoop master node redundant using DRBD.
----------------------------------------------------------------
primitive res_drbd0 ocf:linbit:drbd \
        params drbd_resource="nn00" drbdconf="/etc/drbd.conf" \
        op monitor interval="20s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive res_filesystem ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" fstype="ext3" directory="/drbd/ram" \
        op monitor interval="20s" timeout="40s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive res_jobtracker ocf:hadoop:jobtracker \
        params hadoop_name="jobtracker" hadoop_home="/usr/lib/hadoop" \
        op monitor interval="20s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s"
primitive res_namenode ocf:hadoop:namenode \
        params hadoop_name="namenode" hadoop_home="/usr/lib/hadoop" \
        op monitor interval="20s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s"
primitive res_vip ocf:heartbeat:IPaddr2 \
        params ip="172.17.8.10" cidr_netmask="26" nic="eth0" \
        op monitor interval="10s" timeout="60s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s"
primitive res_vip_chk ocf:heartbeat:VIPcheck \
        params target_ip="172.17.8.10" count="1" wait="10" \
        op start interval="0" timeout="90s" start_delay="4s" \
        op stop interval="0" timeout="60s"
group rg_hadoop_nn00 res_vip_chk res_vip res_filesystem res_namenode res_jobtracker
ms ms_drbd0 res_drbd0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation c_rg_hadoop_nn00_on_drbd0 inf: rg_hadoop_nn00 ms_drbd0:Master
order o_drbd_before_rg_hadoop_nn00 inf: ms_drbd0:promote rg_hadoop_nn00:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.12-066152e" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1342056004"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"
----------------------------------------------------------------

[crm_mon]
----------------------------------------------------------------
============
Last updated: Fri Jul 13 10:20:58 2012
Stack: Heartbeat
Current DC: node02 (f29d7a86-b8ed-4ac8-8852-3c601ed5330b) - partition with quorum
Version: 1.0.12-066152e
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ node02 node01 ]

 Master/Slave Set: ms_drbd0
     Masters: [ node01 ]
     Slaves: [ node02 ]
 Resource Group: rg_hadoop_nn00
     res_vip_chk        (ocf::heartbeat:VIPcheck):      Started node01
     res_vip            (ocf::heartbeat:IPaddr2):       Stopped
     res_filesystem     (ocf::heartbeat:Filesystem):    Stopped
     res_namenode       (ocf::hadoop:namenode):         Stopped
     res_jobtracker     (ocf::hadoop:jobtracker):       Stopped

Node Attributes:
* Node node02:
    + master-res_drbd0:1        : 10
    + node01-eth1               : up
* Node node01:
    + master-res_drbd0:0        : 10000
    + node02-eth1               : up

Failed actions:
    res_vip_monitor_10000 (node=node01, call=51, rc=6, status=complete): not configured
----------------------------------------------------------------

One difference appears to be the result of res_vip_monitor: it reports
"not configured" (rc=6) here, versus "not running" (rc=7) in your environment.

The Hadoop processes (res_namenode, res_jobtracker) were stopped normally,
and DRBD is still running normally because the stop sequence never reached it.

I then suspected a problem with the protection against double start after a
failed stop, and tried changing on-fail for the stop operation to "restart"
and other values, but the behavior did not change.
According to the ha-log, the stop succeeds and processing simply does not
proceed to the next step, so this setting does not seem to be related.

That is all.
Thank you in advance.


2012/7/12 <renay****@ybb*****>

> Takadate-san,
>
> Hello, this is Yamauchi.
>
> The versions differ slightly, but I tried a similar configuration with
> Pacemaker 1.0.11 on RHEL 6.2.
> After starting the resources on the active node (rh62-test1), bringing eth0
> down with ifdown resulted in a normal failover.
>
> ============
> Last updated: Thu Jul 12 23:33:55 2012
> Stack: Heartbeat
> Current DC: rh62-test2 (47dc4202-35d8-461b-b8d6-2af59eee98e5) - partition with quorum
> Version: 1.0.11-unknown
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ rh62-test1 rh62-test2 ]
>
>  Resource Group: rg_group
>      res_vip_chk        (ocf::heartbeat:VIPcheck):      Started rh62-test2
>      res_vip            (ocf::heartbeat:IPaddr2):       Started rh62-test2
>      res_filesystem     (ocf::heartbeat:Dummy):         Started rh62-test2
>
> Migration summary:
> * Node rh62-test2:
> * Node rh62-test1:
>    res_vip: migration-threshold=1 fail-count=1
>
> Failed actions:
>     res_vip_monitor_10000 (node=rh62-test1, call=7, rc=7, status=complete): not running
>
>
> If you don't mind, could you share the versions of resource-agents and
> Reusable you are using, any other logs, and the complete crm file?
>
> For reference, the simplified crm file I used is below.
>
> ### Cluster Option ###
> property no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         startup-fencing="false" \
>         stonith-timeout="710s" \
>         crmd-transition-delay="2s"
>
> ### Resource Defaults ###
> rsc_defaults resource-stickiness="INFINITY" \
>         migration-threshold="1"
>
> ### Primitive Configuration ###
> primitive res_filesystem ocf:heartbeat:Dummy \
>         op monitor interval="20s" timeout="40s" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s"
> primitive res_vip ocf:heartbeat:IPaddr2 \
>         params ip="192.168.40.77" cidr_netmask="24" nic="eth0" \
>         op monitor interval="10s" timeout="60s" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="120s"
> primitive res_vip_chk ocf:heartbeat:VIPcheck \
>         params target_ip="192.168.40.77" count="1" wait="10" \
>         op start interval="0" timeout="90s" start_delay="4s" \
>         op stop interval="0" timeout="60s"
> group rg_group res_vip_chk res_vip res_filesystem
>
>
> ### Resource Location ###
> location rsc_location-Dummy1-1 rg_group \
>         rule 200: #uname eq rh62-test1 \
>         rule 100: #uname eq rh62-test2
>
> Best regards.
>
>
> --- On Wed, 2012/7/11, yosuke takadate <taten****@gmail*****> wrote:
>
> > Hello, my name is Takadate.
> >
> > I have built an HA configuration with Pacemaker and am checking its
> > behavior when a NIC is stopped as a failure test.
> > I would like to confirm one point.
> >
> >
> > [Symptom]
> > When the NIC carrying the VIP configured by IPaddr2 is stopped (ifdown),
> > failover (F/O) does not complete normally.
> > In crm_mon, processing appears to halt after the IPaddr2 stop completes
> > and before the stop of the next resource begins.
> > In the ha-log, the findif command fails (because the NIC is down), and
> > processing does not continue after that.
> > If the VIP is deleted with the ip command instead of using ifdown, findif
> > runs normally and I have confirmed that failover occurs.
> >
> >
> > [Questions]
> > Is the behavior above expected given the crm and RA configuration?
> > Also, is there a configuration that makes failover complete normally?
> >
> >
> > [crm_mon]
> >  Resource Group: rg_group
> >      res_vip_chk        (ocf::heartbeat:VIPcheck):      Started node01
> >      res_vip            (ocf::heartbeat:IPaddr2):       Stopped
> >      res_filesystem     (ocf::heartbeat:Filesystem):    Stopped
> >
> >
> > [ha-log]
> > Jul 11 18:15:44 node01 lrmd: [12523]: info: cancel_op: operation monitor[22] on res_vip for client 12526, its parameters: CRM_meta_interval=[10000] ip=[172.17.8.10] cidr_netmask=[26] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] nic=[eth0] cancelled
> > Jul 11 18:15:44 node01 crmd: [12526]: info: do_lrm_rsc_op: Performing key=2:22:0:6ea3b006-bc6c-47d0-92d2-dac7100137b0 op=res_vip_stop_0 )
> > Jul 11 18:15:44 node01 lrmd: [12523]: info: rsc:res_vip stop[43] (pid 17340)
> > Jul 11 18:15:44 node01 crmd: [12526]: info: process_lrm_event: LRM operation res_vip_monitor_10000 (call=22, status=1, cib-update=0, confirmed=true) Cancelled
> > Jul 11 18:15:44 node01 lrmd: [12523]: info: RA output: (res_vip:stop:stderr) eth0: unknown interface: No such device
> > /usr/lib64/heartbeat/findif version 2.99.1 Copyright Alan Robertson
> > Usage: /usr/lib64/heartbeat/findif [-C]
> > Options:
> >     -C: Output netmask as the number of bits rather than as 4 octets.
> > Environment variables:
> > OCF_RESKEY_ip            ip address (mandatory!)
> > OCF_RESKEY_cidr_netmask  netmask of interface
> > OCF_RESKEY_broadcast     broadcast address for interface
> > OCF_RESKEY_nic           interface to assign to
> > Jul 11 18:15:44 node01 IPaddr2(res_vip)[17340]: WARNING: [/usr/lib64/heartbeat/findif -C] failed
> > Jul 11 18:15:44 node01 lrmd: [12523]: info: operation stop[43] on res_vip for client 12526: pid 17340 exited with return code 0
> > Jul 11 18:15:44 node01 crmd: [12526]: info: process_lrm_event: LRM operation res_vip_stop_0 (call=43, rc=0, cib-update=67, confirmed=true) ok
> > (it stops here)
> >
> >
> > [crm]
> > primitive res_filesystem ocf:heartbeat:Filesystem \
> >         params device="/dev/drbd0" fstype="ext3" directory="/drbd/ram" \
> >         op monitor interval="20s" timeout="40s" \
> >         op start interval="0" timeout="60s" \
> >         op stop interval="0" timeout="60s"
> > primitive res_vip ocf:heartbeat:IPaddr2 \
> >         params ip="172.17.8.10" cidr_netmask="26" nic="eth0" \
> >         op monitor interval="10s" timeout="60s" \
> >         op start interval="0" timeout="60s" \
> >         op stop interval="0" timeout="120s"
> > primitive res_vip_chk ocf:heartbeat:VIPcheck \
> >         params target_ip="172.17.8.10" count="1" wait="10" \
> >         op start interval="0" timeout="90s" start_delay="4s" \
> >         op stop interval="0" timeout="60s"
> > group rg_group res_vip_chk res_vip res_filesystem
> >
> >
> > [Environment]
> > CentOS 5.4 (2.6.18-308)
> > pacemaker-1.0.12-1.el5
> > heartbeat-3.0.5-1.1.el5
> >
> >
> > That is all.
> > I would appreciate any information you may have.
> > Thank you in advance.
> >
>
> _______________________________________________
> Linux-ha-japan mailing list
> Linux****@lists*****
> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
>
-------------- next part --------------
An HTML attachment was saved...
-------------- next part --------------
(ifdown executed)
Jul 13 10:12:40 node01 lrmd: [11309]: info: RA output: (res_vip:monitor:stderr) eth0: unknown interface: No such device
/usr/lib64/heartbeat/findif version 2.99.1 Copyright Alan Robertson
Usage: /usr/lib64/heartbeat/findif [-C]
Options:
    -C: Output netmask as the number of bits rather than as 4 octets.
Environment variables:
OCF_RESKEY_ip            ip address (mandatory!)
OCF_RESKEY_cidr_netmask  netmask of interface
OCF_RESKEY_broadcast     broadcast address for interface
OCF_RESKEY_nic           interface to assign to
Jul 13 10:12:40 node01 IPaddr2(res_vip)[18621]: ERROR: [/usr/lib64/heartbeat/findif -C] failed
Jul 13 10:12:40 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_monitor_10000 (call=51, rc=6, cib-update=73, confirmed=false) not configured
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_ha_callback: Update relayed from pdpnn02x
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_local_callback: Expanded fail-count-res_vip=value++ to 1
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res_vip (1)
Jul 13 10:12:41 node01 lrmd: [11309]: info: cancel_op: operation monitor[57] on res_jobtracker for client 11312, its parameters: CRM_meta_interval=[20000] hadoop_home=[/usr/lib/hadoop] CRM_meta_timeout=[30000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] hadoop_name=[jobtracker] cancelled
Jul 13 10:12:41 node01 crmd: [11312]: info: do_lrm_rsc_op: Performing key=46:111:0:968670d1-af60-45a2-8d9b-f02cc64583cc op=res_jobtracker_stop_0 )
Jul 13 10:12:41 node01 lrmd: [11309]: info: rsc:res_jobtracker stop[58] (pid 18650)
Jul 13 10:12:41 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_jobtracker_monitor_20000 (call=57, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_perform_update: Sent update 50: fail-count-res_vip=1
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_ha_callback: Update relayed from pdpnn02x
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res_vip (1342141961)
Jul 13 10:12:41 node01 attrd: [11311]: info: attrd_perform_update: Sent update 52: last-failure-res_vip=1342141961
Jul 13 10:12:41 node01 lrmd: [11309]: info: RA output: (res_jobtracker:stop:stdout) Stopping Hadoop jobtracker daemon (hadoop-jobtracker):
Jul 13 10:12:41 node01 lrmd: [11309]: info: RA output: (res_jobtracker:stop:stdout) stopping jobtracker
Jul 13 10:12:41 node01 lrmd: [11309]: info: RA output: (res_jobtracker:stop:stdout) [ OK ]
Jul 13 10:12:41 node01 lrmd: [11309]: info: RA output: (res_jobtracker:stop:stderr) cat: /var/run/hadoop-0.20/*jobtracker.pid: No such file or directory
Jul 13 10:12:41 node01 lrmd: [11309]: info: operation stop[58] on res_jobtracker for client 11312: pid 18650 exited with return code 0
Jul 13 10:12:41 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_jobtracker_stop_0 (call=58, rc=0, cib-update=74, confirmed=true) ok
Jul 13 10:12:43 node01 lrmd: [11309]: info: cancel_op: operation monitor[55] on res_namenode for client 11312, its parameters: CRM_meta_interval=[20000] hadoop_home=[/usr/lib/hadoop] CRM_meta_timeout=[30000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] hadoop_name=[namenode] cancelled
Jul 13 10:12:43 node01 crmd: [11312]: info: do_lrm_rsc_op: Performing key=44:112:0:968670d1-af60-45a2-8d9b-f02cc64583cc op=res_namenode_stop_0 )
Jul 13 10:12:43 node01 lrmd: [11309]: info: rsc:res_namenode stop[59] (pid 18703)
Jul 13 10:12:43 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_namenode_monitor_20000 (call=55, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stdout) Stopping Hadoop namenode daemon (hadoop-namenode):
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stdout) stopping namenode
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stdout) [ OK
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stdout) ]
Jul 13 10:12:43 node01 lrmd: [11309]: info: RA output: (res_namenode:stop:stderr) cat: /var/run/hadoop-0.20/*namenode.pid: No such file or directory
Jul 13 10:12:43 node01 lrmd: [11309]: info: operation stop[59] on res_namenode for client 11312: pid 18703 exited with return code 0
Jul 13 10:12:43 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_namenode_stop_0 (call=59, rc=0, cib-update=75, confirmed=true) ok
Jul 13 10:12:43 node01 lrmd: [11309]: info: cancel_op: operation monitor[53] on res_filesystem for client 11312, its parameters: CRM_meta_interval=[20000] directory=[/drbd/ram] fstype=[ext3] device=[/dev/drbd0] CRM_meta_timeout=[40000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
Jul 13 10:12:43 node01 crmd: [11312]: info: do_lrm_rsc_op: Performing key=43:112:0:968670d1-af60-45a2-8d9b-f02cc64583cc op=res_filesystem_stop_0 )
Jul 13 10:12:43 node01 lrmd: [11309]: info: rsc:res_filesystem stop[60] (pid 18759)
Jul 13 10:12:43 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_filesystem_monitor_20000 (call=53, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:43 node01 Filesystem(res_filesystem)[18759]: INFO: Running stop for /dev/drbd0 on /drbd/ram
Jul 13 10:12:43 node01 Filesystem(res_filesystem)[18759]: INFO: Trying to unmount /drbd/ram
Jul 13 10:12:43 node01 Filesystem(res_filesystem)[18759]: INFO: unmounted /drbd/ram successfully
Jul 13 10:12:43 node01 lrmd: [11309]: info: operation stop[60] on res_filesystem for client 11312: pid 18759 exited with return code 0
Jul 13 10:12:43 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_filesystem_stop_0 (call=60, rc=0, cib-update=76, confirmed=true) ok
Jul 13 10:12:45 node01 lrmd: [11309]: info: cancel_op: operation monitor[51] on res_vip for client 11312, its parameters: CRM_meta_interval=[10000] ip=[172.17.8.10] cidr_netmask=[26] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] nic=[eth0] cancelled
Jul 13 10:12:45 node01 crmd: [11312]: info: do_lrm_rsc_op: Performing key=4:112:0:968670d1-af60-45a2-8d9b-f02cc64583cc op=res_vip_stop_0 )
Jul 13 10:12:45 node01 lrmd: [11309]: info: rsc:res_vip stop[61] (pid 18836)
Jul 13 10:12:45 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_monitor_10000 (call=51, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:45 node01 lrmd: [11309]: info: RA output: (res_vip:stop:stderr) eth0: unknown interface: No such device
/usr/lib64/heartbeat/findif version 2.99.1 Copyright Alan Robertson
Usage: /usr/lib64/heartbeat/findif [-C]
Options:
    -C: Output netmask as the number of bits rather than as 4 octets.
Environment variables:
OCF_RESKEY_ip            ip address (mandatory!)
OCF_RESKEY_cidr_netmask  netmask of interface
OCF_RESKEY_broadcast     broadcast address for interface
OCF_RESKEY_nic           interface to assign to
Jul 13 10:12:45 node01 IPaddr2(res_vip)[18836]: WARNING: [/usr/lib64/heartbeat/findif -C] failed
Jul 13 10:12:45 node01 lrmd: [11309]: info: operation stop[61] on res_vip for client 11312: pid 18836 exited with return code 0
Jul 13 10:12:45 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_stop_0 (call=61, rc=0, cib-update=77, confirmed=true) ok
(processing does not proceed from here)
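
The log is long, so here is a rough sketch of how the completed operations and their return codes could be pulled out of an ha-log like this one, to make the rc=6 ("not configured") versus rc=7 ("not running") difference easier to spot. It is only an illustration, not part of Pacemaker: the function name `summarize_lrm_events` and the regular expression are assumptions based on the `process_lrm_event` lines shown in this thread, and the rc names are the standard OCF resource agent return codes.

```python
import re

# Standard OCF resource agent return codes (names as used in the OCF RA API).
OCF_RC_NAMES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}

# Assumed line shape, taken from the crmd lines in this thread:
#   process_lrm_event: LRM operation <op> (call=N, rc=N, ...)
# Cancelled operations carry "status=1" instead of "rc=" and are skipped.
LRM_EVENT = re.compile(
    r"process_lrm_event: LRM operation (?P<op>\S+) "
    r"\(call=(?P<call>\d+), rc=(?P<rc>\d+),"
)


def summarize_lrm_events(log_text):
    """Return (operation, call id, rc, OCF name) for each completed operation."""
    events = []
    for match in LRM_EVENT.finditer(log_text):
        rc = int(match.group("rc"))
        events.append(
            (match.group("op"), int(match.group("call")), rc,
             OCF_RC_NAMES.get(rc, "unknown"))
        )
    return events


# Three lines copied from the attached ha-log.
SAMPLE = """\
Jul 13 10:12:40 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_monitor_10000 (call=51, rc=6, cib-update=73, confirmed=false) not configured
Jul 13 10:12:45 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_monitor_10000 (call=51, status=1, cib-update=0, confirmed=true) Cancelled
Jul 13 10:12:45 node01 crmd: [11312]: info: process_lrm_event: LRM operation res_vip_stop_0 (call=61, rc=0, cib-update=77, confirmed=true) ok
"""

if __name__ == "__main__":
    for op, call, rc, name in summarize_lrm_events(SAMPLE):
        print(f"{op} call={call} rc={rc} ({name})")
```

Running the same function over the full log file, read with `open(path).read()`, lists each operation in the order it completed, which shows the last operation that finished before processing stopped.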