Cause of and remedy for the master-side VIP stopping (Linux-ha-jp) - Linux-HA Japan

大渕-san,
This is 松尾.

> We had been running normally in a master/slave configuration for about three weeks. When the problem recurred I expected the slave to be promoted, but it was not.

I apologize; my earlier answer appears to have been wrong.
I had not taken the type of error code into account.

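For an ordinary failure (error code 1 = OCF_ERR_GENERIC), the resource should fail over.
This time, however, the error code was 6 (= OCF_ERR_CONFIGURED), so Pacemaker concluded that vip-master
could not be started anywhere in the cluster, and the failover did not happen.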
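
To make the difference between the two codes concrete, here is a minimal sketch that maps OCF exit codes to their standard names and prints how Pacemaker is documented to react to codes 1 and 6. The function names ocf_rc_name and ocf_rc_effect are hypothetical helpers for illustration and are not part of Pacemaker or resource-agents.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Pacemaker or resource-agents).

# Print the standard OCF name for a numeric resource agent exit code.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        8) echo "OCF_RUNNING_MASTER" ;;
        9) echo "OCF_FAILED_MASTER" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Print the documented reaction for the two codes discussed in this thread.
ocf_rc_effect() {
    case "$1" in
        1) echo "soft error: restart the resource or move it to another node" ;;
        6) echo "fatal error: stop the resource and do not start it on any node" ;;
        *) echo "not covered by this sketch" ;;
    esac
}

# Example: describe the code seen in the log (rc=6) and the ordinary failure code.
for rc in 6 1; do
    echo "rc=$rc $(ocf_rc_name "$rc"): $(ocf_rc_effect "$rc")"
done
```

The "fatal" case corresponds to the pengine message in the quoted log, "Preventing vip-master from re-starting anywhere in the cluster".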

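I do not know why the interface disappears, but vip-master appears to stop cleanly after the failure occurs,
so as a temporary workaround, how about changing the return code so that failover can take place after the failure?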

 File: /usr/lib/ocf/resource.d/heartbeat/IPaddr2 (lines 440-452)
--------------------
    else
        # findif couldn't find the interface
        if ocf_is_probe; then
            ocf_log info "[$FINDIF] failed"
            exit $OCF_NOT_RUNNING
        elif [ "$__OCF_ACTION" = stop ]; then
            ocf_log warn "[$FINDIF] failed"
            exit $OCF_SUCCESS
        else
            ocf_log err "[$FINDIF] failed"
            exit $rc    # <-- this line
        fi
    fi
--------------------

How about changing

exit $rc

to

exit $OCF_ERR_GENERIC

Note: sorry, I have not verified this on a real machine.
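
As a rough way to reason about the change without touching a live cluster, the following sketch reproduces the decision logic of the branch above, with the suggested change applied, as a standalone function. The function findif_failure_rc and its "yes"/"no" probe argument are hypothetical stand-ins for illustration; the real agent uses ocf_is_probe and $__OCF_ACTION, and gets the OCF_* values from ocf-shellfuncs.

```shell
#!/bin/sh
# Standard OCF exit code values (the real agent sources these from ocf-shellfuncs).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical helper: print the exit code the modified branch would use
# when findif fails.
#   $1 = "yes" if the call is a probe, "no" otherwise (stand-in for ocf_is_probe)
#   $2 = action name such as start, stop or monitor (stand-in for $__OCF_ACTION)
findif_failure_rc() {
    is_probe=$1
    action=$2
    if [ "$is_probe" = "yes" ]; then
        echo "$OCF_NOT_RUNNING"
    elif [ "$action" = "stop" ]; then
        echo "$OCF_SUCCESS"
    else
        # Suggested change: report a generic error instead of passing on $rc,
        # so that a failed recurring monitor can lead to failover.
        echo "$OCF_ERR_GENERIC"
    fi
}

# Show the resulting code for each combination discussed above.
for case_args in "yes monitor" "no stop" "no monitor" "no start"; do
    set -- $case_args
    echo "probe=$1 action=$2 -> rc=$(findif_failure_rc "$1" "$2")"
done
```

Like the change itself, this sketch has not been verified against a running Pacemaker cluster.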

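Also, if you want to track down the root cause, writing various pieces of related OS state to a file
just before the exit above may reveal something later.
For example, the contents of /proc/net/dev, the output of ip addr show, and so on.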
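
One possible shape for such a snapshot is sketched below. The function name dump_net_state and the default output path are assumptions made for illustration; only /proc/net/dev and ip addr show come from the suggestion above. In the agent, it would be called just before the exit in the findif failure branch.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped snapshot of network state to a file.
#   $1 = output file (the default path below is an assumption, adjust as needed)
dump_net_state() {
    out=${1:-/var/log/ha-ipaddr2-netstate.log}
    {
        echo "=== $(date '+%Y/%m/%d_%H:%M:%S') findif failed ==="
        echo "--- /proc/net/dev ---"
        if [ -r /proc/net/dev ]; then
            cat /proc/net/dev
        else
            echo "(not readable)"
        fi
        echo "--- ip addr show ---"
        if command -v ip >/dev/null 2>&1; then
            ip addr show 2>&1
        else
            echo "(ip command not found)"
        fi
    } >> "$out" 2>&1
    # Always succeed so that the snapshot never changes the agent's own exit code.
    return 0
}
```

Usage inside the failure branch would be a single line, dump_net_state, placed before the exit statement. Appending rather than overwriting keeps earlier snapshots available if the problem recurs.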

Best regards.


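On July 8, 2013 at 11:18, 大渕昭夫 <butch****@gmail*****> wrote:
> My name is 大渕.
>
> The problem I asked about earlier has recurred, so I am writing again to ask for your advice.
>
> At around 02:49 on July 7, the state in which eth0 does not exist recurred; vip-master was shown as "not configured" and became inaccessible.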
>
> The output was the same as last time:
>
> Failed actions:
>     vip-master_monitor_10000 (node=ptdb02.localdomain, call=19, rc=6,
> status=complete): not configured
>
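> We had been running normally in a master/slave configuration for about three weeks. When the problem recurred I expected the slave to be promoted, but it was not.
>
> How should I deal with this?
>
> The ha-log contents at the time of the failure are as follows.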
>
> Jul  7 02:49:53 ptdb02 IPaddr2(vip-master)[24562]: ERROR: Unknown interface
> [eth0] No such device.
> IPaddr2(vip-master)[24562]: 2013/07/07_02:49:53 ERROR: Unknown interface
> [eth0] No such device.
> Jul  7 02:49:53 ptdb02 IPaddr2(vip-master)[24562]: ERROR: [findif] failed
> IPaddr2(vip-master)[24562]: 2013/07/07_02:49:53 ERROR: [findif] failed
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: process_lrm_event: LRM operation
> vip-master_monitor_10000 (call=19, rc=6, cib-update=1482, confirmed=false)
> not configured
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: process_lrm_event:
> LRM operation vip-master_monitor_10000 (call=19, rc=6, cib-update=1482,
> confirmed=false) not configured
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: process_graph_event:
> Detected action vip-master_monitor_10000 from a different transition: 5 vs.
> 1411
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: process_graph_event: Detected
> action vip-master_monitor_10000 from a different transition: 5 vs. 1411
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info:
> abort_transition_graph: process_graph_event:489 - Triggered transition abort
> (complete=1, tag=lrm_rsc_op, id=vip-master_monitor_10000,
> magic=0:6;43:5:0:a92e7b56-467f-4f65-8fe8-1fd20bf85003, cib=0.33.10) : Old
> event
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: abort_transition_graph:
> process_graph_event:489 - Triggered transition abort (complete=1,
> tag=lrm_rsc_op, id=vip-master_monitor_10000,
> magic=0:6;43:5:0:a92e7b56-467f-4f65-8fe8-1fd20bf85003, cib=0.33.10) : Old
> event
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: WARN: update_failcount:
> Updating failcount for vip-master on ptdb02.localdomain after failed
> monitor: rc=6 (update=value++, time=1373132993)
> Jul  7 02:49:53 ptdb02 crmd: [8860]: WARN: update_failcount: Updating
> failcount for vip-master on ptdb02.localdomain after failed monitor: rc=6
> (update=value++, time=1373132993)
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: find_hash_entry: Creating hash
> entry for fail-count-vip-master
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info: log-rotate detected
> on logfile /var/log/ha-debug
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_state_transition:
> State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: log-rotate detected on logfile
> /var/log/ha-log
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info: log-rotate detected
> on logfile /var/log/ha-log
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_state_transition:
> All 2 cluster nodes are eligible to run resources.
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info: find_hash_entry:
> Creating hash entry for fail-count-vip-master
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: log-rotate detected on logfile
> /var/log/ha-debug
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
> attrd_local_callback: Expanded fail-count-vip-master=value++ to 1
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_pe_invoke: Query
> 1483: Requesting the current CIB: S_POLICY_ENGINE
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph ]
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-vip-master (1)
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_local_callback: Expanded
> fail-count-vip-master=value++ to 1
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_pe_invoke: Query 1483:
> Requesting the current CIB: S_POLICY_ENGINE
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_trigger_update: Sending
> flush op to all hosts for: fail-count-vip-master (1)
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
> attrd_perform_update: Sent update 71: fail-count-vip-master=1
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_perform_update: Sent
> update 71: fail-count-vip-master=1
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info: find_hash_entry:
> Creating hash entry for last-failure-vip-master
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: find_hash_entry: Creating hash
> entry for last-failure-vip-master
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-vip-master (1373132993)
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_trigger_update: Sending
> flush op to all hosts for: last-failure-vip-master (1373132993)
> Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
> attrd_perform_update: Sent update 74: last-failure-vip-master=1373132993
> Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_perform_update: Sent
> update 74: last-failure-vip-master=1373132993
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info:
> do_pe_invoke_callback: Invoking the PE: query=1483,
> ref=pe_calc-dc-1373132993-1470, seq=2, quorate=1
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_pe_invoke_callback: Invoking
> the PE: query=1483, ref=pe_calc-dc-1373132993-1470, seq=2, quorate=1
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info:
> abort_transition_graph: te_update_diff:150 - Triggered transition abort
> (complete=1, tag=nvpair,
> id=status-2dfbfb70-566a-400c-b378-62917dee7e9e-fail-count-vip-master,
> name=fail-count-vip-master, value=1, magic=NA, cib=0.33.11) : Transient
> attribute: update
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: abort_transition_graph:
> te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair,
> id=status-2dfbfb70-566a-400c-b378-62917dee7e9e-fail-count-vip-master,
> name=fail-count-vip-master, value=1, magic=NA, cib=0.33.11) : Transient
> attribute: update
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info:
> abort_transition_graph: te_update_diff:150 - Triggered transition abort
> (complete=1, tag=nvpair,
> id=status-2dfbfb70-566a-400c-b378-62917dee7e9e-last-failure-vip-master,
> name=last-failure-vip-master, value=1373132993, magic=NA, cib=0.33.12) :
> Transient attribute: update
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: abort_transition_graph:
> te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair,
> id=status-2dfbfb70-566a-400c-b378-62917dee7e9e-last-failure-vip-master,
> name=last-failure-vip-master, value=1373132993, magic=NA, cib=0.33.12) :
> Transient attribute: update
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_pe_invoke: Query 1484:
> Requesting the current CIB: S_POLICY_ENGINE
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_pe_invoke: Query
> 1484: Requesting the current CIB: S_POLICY_ENGINE
> Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_pe_invoke: Query
> 1485: Requesting the current CIB: S_POLICY_ENGINE
> Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_pe_invoke: Query 1485:
> Requesting the current CIB: S_POLICY_ENGINE
> Jul  7 02:49:53 ptdb02 pengine: [8863]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: notice: unpack_config:
> On loss of CCM Quorum: Ignore
> Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: info: unpack_config:
> Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jul  7 02:49:53 ptdb02 pengine: [8863]: info: unpack_config: Node scores:
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jul  7 02:49:53 ptdb02 pengine: [8863]: info: determine_online_status: Node
> ptdb02.localdomain is online
> Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: info:
> determine_online_status: Node ptdb02.localdomain is online
> Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: info:
> determine_online_status: Node ptdb01.localdomain is online
> Jul  7 02:49:53 ptdb02 pengine: [8863]: info: determine_online_status: Node
> ptdb01.localdomain is online
> Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: ERROR: unpack_rsc_op:
> Preventing vip-master from re-starting anywhere in the cluster : operation
> monitor failed 'not configured' (rc=6)
> Jul  7 02:49:53 ptdb02 pengine: [8863]: ERROR: unpack_rsc_op: Preventing
> vip-master from re-starting anywhere in the cluster : operation monitor
> failed 'not configured' (rc=6)
> Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: WARN: unpack_rsc_op:
> Processing failed op vip-master_monitor_10000 on ptdb02.localdomain: not
> configured (6)
> Jul  7 02:49:53 ptdb02 pengine: [8863]: WARN: unpack_rsc_op: Processing
> failed op vip-master_monitor_10000 on ptdb02.localdomain: not configured (6)
> Jul  7 02:49:53 ptdb02 pengine: [8863]: info: find_clone: Internally renamed
> pgsql:0 on ptdb01.localdomain to pgsql:1
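> (end of log)
>
> As a stopgap I ran
> # crm_resource -C -r vip-master -N ptdb02.localdomain
> which restored service, but the problem has just occurred again.
>
> ptdb02 is currently the master. If this problem occurs only on eth0 of the ptdb02 server, would making ptdb01 the master ensure that vip-master no longer disappears even if the problem recurs on ptdb02?
>
> I apologize for troubling you when you are busy, and thank you in advance.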
>
>
>
>
>
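> On June 18, 2013 at 10:18, 大渕昭夫 <butch****@gmail*****> wrote:
>
>> I see.
>>
>> I do not believe we have set any special power-saving options, but I will ask the manufacturer about any known defect reports.
>>
>> Thank you.
>>
>>
>> On June 18, 2013 at 5:17, mlus <mlus****@39596*****> wrote:
>>
>>> This may be off the mark, but...
>>>
>>> Are there any known defect reports for the NIC chip in the hardware you are using?
>>> Also, checking the power-saving settings in the BIOS and the OS might turn up an effective countermeasure.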
>>>
>>>
>>>
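>>> On June 17, 2013 at 14:30, 大渕昭夫 <butch****@gmail*****> wrote:
>>> > Hello, this is my first post.
>>> > My name is 大渕昭夫.
>>> >
>>> > I am writing to ask for advice.
>>> >
>>> > My question concerns the cause of, and the remedy for, the master-side VIP having stopped.
>>> > I am not very knowledgeable technically and am at a loss because I cannot identify the cause.
>>> >
>>> > I am working on making PostgreSQL redundant using the following page as a reference, and have built the system with the same settings and configuration.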
>>> >
>>> > https://github.com/t-matsuo/resource-agents/wiki/PostgreSQL-9.1-%E3%82%B9%E3%83%88%E3%83%AA%E3%83%BC%E3%83%9F%E3%83%B3%E3%82%B0%E3%83%AC%E3%83%97%E3%83%AA%E3%82%B1%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3%E5%AF%BE%E5%BF%9C-%E3%83%AA%E3%82%BD%E3%83%BC%E3%82%B9%E3%82%A8%E3%83%BC%E3%82%B8%E3%82%A7%E3%83%B3%E3%83%88
>>> >
>>> >
>>> >
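>>> > At the moment the production server (ptdb01) is left as is, and a new server (ptdb02) has been built as the master. The plan is to run on ptdb02 alone for a while and, if there are no problems, stop ptdb01, install the same environment on it, add it as the slave, and end up with a Master/Slave configuration like the one in the reference above.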
>>> >
>>> >
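>>> > I installed Pacemaker 1.0.13-1.1 and PostgreSQL 9.2.4 on ptdb02 and confirmed on June 13 that everything was running properly.
>>> > The OS is CentOS 5.
>>> > I also tested changing vip-master with crm configure edit while Pacemaker was running, and at that time the change was applied correctly and the resource kept running.
>>> >
>>> > Access to the database through vip-master also worked without problems.
>>> >
>>> > However, when I checked the monitor this morning, it showed the following and vip-master was no longer accessible.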
>>> >
>>> > ============
>>> > Last updated: Mon Jun 17 09:29:32 2013
>>> > Stack: Heartbeat
>>> > Current DC: ptdb02.localdomain (2dfbfb70-566a-400c-b378-62917dee7e9e) -
>>> > partition with quorum
>>> > Version: 1.0.13-30bb726
>>> > 1 Nodes configured, unknown expected votes
>>> > 4 Resources configured.
>>> > ============
>>> > Online: [ ptdb02.localdomain ]
>>> > vip-slave       (ocf::heartbeat:IPaddr2):       Started
>>> > ptdb02.localdomain
>>> >  Master/Slave Set: msPostgresql
>>> >      Masters: [ ptdb02.localdomain ]
>>> >      Stopped: [ pgsql:1 ]
>>> >  Clone Set: clnPingCheck
>>> >      Started: [ ptdb02.localdomain ]
>>> > Node Attributes:
>>> > * Node ptdb02.localdomain:
>>> >     + default_ping_set                  : 100
>>> >     + master-pgsql:0                    : 1000
>>> >     + pgsql-data-status                 : LATEST
>>> >     + pgsql-master-baseline             : 0000000755000080
>>> >     + pgsql-status                      : PRI
>>> > Failed actions:
>>> >     vip-master_monitor_10000 (node=ptdb02.localdomain, call=19, rc=6,
>>> > status=complete): not configured
>>> >
>>> >
>>> > Checking ha-log, I found that vip-master had stopped at 20:22 on June 15.
>>> > The relevant portion is shown below.
>>> >
>>> > Jun 15 20:22:48 ptdb02 cib: [19850]: info: cib_stats: Processed 2169
>>> > operations (3416.00us average, 1% utilization) in the last 10min
>>> > Jun 15 20:23:28 ptdb02 IPaddr2(vip-master)[30902]: ERROR: Unknown
>>> > interface
>>> > [eth0] No such device.
>>> > IPaddr2(vip-master)[30902]: 2013/06/15_20:23:28 ERROR: Unknown
>>> > interface
>>> > [eth0] No such device.
>>> > Jun 15 20:23:28 ptdb02 IPaddr2(vip-master)[30902]: ERROR: [findif]
>>> > failed
>>> > IPaddr2(vip-master)[30902]: 2013/06/15_20:23:28 ERROR: [findif] failed
>>> > Jun 15 20:23:28 ptdb02 crmd: [19854]: info: process_lrm_event: LRM
>>> > operation
>>> > vip-master_monitor_10000 (call=19, rc=6, cib-update=250,
>>> > confirmed=false)
>>> > not configured
>>> > Jun 15 20:23:28 ptdb02.localdomain crmd: [19854]: info:
>>> > process_lrm_event:
>>> > LRM operation vip-master_monitor_10000 (call=19, rc=6, cib-update=250,
>>> > confirmed=false) not configured
>>> >
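>>> > That is all.
>>> >
>>> > Note that nobody accessed ptdb02 between June 14 and the morning of June 17.
>>> >
>>> > I apologize for troubling you when you are busy, but I would be grateful if you could tell me the cause and how to deal with it.
>>> >
>>> > If you need any other information, please let me know.
>>> >
>>> > Thank you in advance.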
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > Linux-ha-japan mailing list
>>> > Linux****@lists*****
>>> > http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
>>> >
>>
>>
>
>
>
