Cause of and remedy for the master-side VIP stopping (Linux-ha-jp) - Linux-HA Japan

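This is 大渕.

The problem I asked about previously has recurred, so I am writing to ask for your advice again.

At around 02:49 on July 7, the condition in which eth0 cannot be found recurred. vip-master was shown as "not configured" and could no longer be accessed.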

The display was the same as last time:

Failed actions:
    vip-master_monitor_10000 (node=ptdb02.localdomain, call=19, rc=6, status=complete): not configured
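
For reference, the rc value in this display is an OCF resource agent exit code, and rc=6 corresponds to OCF_ERR_CONFIGURED. The following is a minimal Python sketch, not part of the original report, that pulls the fields out of a "Failed actions" entry and attaches the standard OCF name. The function name `parse_failed_action` and the regular expression are illustrative assumptions, not part of any Pacemaker tool.

```python
import re

# Standard OCF resource agent exit codes.
OCF_RETURN_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}

# Hypothetical pattern for one entry as it appears in the display above.
FAILED_ACTION_RE = re.compile(
    r"(?P<op>\S+)\s+\(node=(?P<node>[^,]+),\s*call=(?P<call>\d+),\s*"
    r"rc=(?P<rc>\d+),\s*status=(?P<status>\w+)\):\s*(?P<reason>.+)"
)


def parse_failed_action(text):
    """Parse one 'Failed actions' entry; the entry may be wrapped over lines."""
    joined = " ".join(text.split())  # undo mail line wrapping
    match = FAILED_ACTION_RE.search(joined)
    if match is None:
        return None
    entry = match.groupdict()
    entry["call"] = int(entry["call"])
    entry["rc"] = int(entry["rc"])
    entry["rc_name"] = OCF_RETURN_CODES.get(entry["rc"], "UNKNOWN")
    return entry


sample = """    vip-master_monitor_10000 (node=ptdb02.localdomain, call=19, rc=6,
status=complete): not configured"""
print(parse_failed_action(sample))
```

In the ha-log excerpt further down, pengine reports that it is preventing vip-master from restarting anywhere in the cluster after this rc=6 failure, which appears consistent with the resource not being moved to the other node.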

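The master/slave configuration had been running normally for about three weeks. I expected the slave to be promoted when the problem recurred, but it was not.

How should I deal with this?

The contents of ha-log at the time of the incident are as follows.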

Jul  7 02:49:53 ptdb02 IPaddr2(vip-master)[24562]: ERROR: Unknown interface
[eth0] No such device.
IPaddr2(vip-master)[24562]: 2013/07/07_02:49:53 ERROR: Unknown interface
[eth0] No such device.
Jul  7 02:49:53 ptdb02 IPaddr2(vip-master)[24562]: ERROR: [findif] failed
IPaddr2(vip-master)[24562]: 2013/07/07_02:49:53 ERROR: [findif] failed
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: process_lrm_event: LRM operation
vip-master_monitor_10000 (call=19, rc=6, cib-update=1482, confirmed=false)
not configured
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: process_lrm_event:
LRM operation vip-master_monitor_10000 (call=19, rc=6, cib-update=1482,
confirmed=false) not configured
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: process_graph_event:
Detected action vip-master_monitor_10000 from a different transition: 5 vs.
1411
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: process_graph_event: Detected
action vip-master_monitor_10000 from a different transition: 5 vs. 1411
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info:
abort_transition_graph: process_graph_event:489 - Triggered transition
abort (complete=1, tag=lrm_rsc_op, id=vip-master_monitor_10000,
magic=0:6;43:5:0:a92e7b56-467f-4f65-8fe8-1fd20bf85003, cib=0.33.10) : Old
event
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: abort_transition_graph:
process_graph_event:489 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=vip-master_monitor_10000,
magic=0:6;43:5:0:a92e7b56-467f-4f65-8fe8-1fd20bf85003, cib=0.33.10) : Old
event
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: WARN: update_failcount:
Updating failcount for vip-master on ptdb02.localdomain after failed
monitor: rc=6 (update=value++, time=1373132993)
Jul  7 02:49:53 ptdb02 crmd: [8860]: WARN: update_failcount: Updating
failcount for vip-master on ptdb02.localdomain after failed monitor: rc=6
(update=value++, time=1373132993)
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: find_hash_entry: Creating hash
entry for fail-count-vip-master
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info: log-rotate detected
on logfile /var/log/ha-debug
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_state_transition:
State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: log-rotate detected on logfile
/var/log/ha-log
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info: log-rotate detected
on logfile /var/log/ha-log
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_state_transition:
All 2 cluster nodes are eligible to run resources.
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info: find_hash_entry:
Creating hash entry for fail-count-vip-master
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: log-rotate detected on logfile
/var/log/ha-debug
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
attrd_local_callback: Expanded fail-count-vip-master=value++ to 1
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_pe_invoke: Query
1483: Requesting the current CIB: S_POLICY_ENGINE
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
attrd_trigger_update: Sending flush op to all hosts for:
fail-count-vip-master (1)
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_state_transition: All 2
cluster nodes are eligible to run resources.
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_local_callback: Expanded
fail-count-vip-master=value++ to 1
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_pe_invoke: Query 1483:
Requesting the current CIB: S_POLICY_ENGINE
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-vip-master (1)
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
attrd_perform_update: Sent update 71: fail-count-vip-master=1
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_perform_update: Sent
update 71: fail-count-vip-master=1
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info: find_hash_entry:
Creating hash entry for last-failure-vip-master
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: find_hash_entry: Creating hash
entry for last-failure-vip-master
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
attrd_trigger_update: Sending flush op to all hosts for:
last-failure-vip-master (1373132993)
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-vip-master (1373132993)
Jul 07 02:49:53 ptdb02.localdomain attrd: [8859]: info:
attrd_perform_update: Sent update 74: last-failure-vip-master=1373132993
Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_perform_update: Sent
update 74: last-failure-vip-master=1373132993
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info:
do_pe_invoke_callback: Invoking the PE: query=1483,
ref=pe_calc-dc-1373132993-1470, seq=2, quorate=1
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_pe_invoke_callback: Invoking
the PE: query=1483, ref=pe_calc-dc-1373132993-1470, seq=2, quorate=1
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info:
abort_transition_graph: te_update_diff:150 - Triggered transition abort
(complete=1, tag=nvpair,
id=status-2dfbfb70-566a-400c-b378-62917dee7e9e-fail-count-vip-master,
name=fail-count-vip-master, value=1, magic=NA, cib=0.33.11) : Transient
attribute: update
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: abort_transition_graph:
te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair,
id=status-2dfbfb70-566a-400c-b378-62917dee7e9e-fail-count-vip-master,
name=fail-count-vip-master, value=1, magic=NA, cib=0.33.11) : Transient
attribute: update
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info:
abort_transition_graph: te_update_diff:150 - Triggered transition abort
(complete=1, tag=nvpair,
id=status-2dfbfb70-566a-400c-b378-62917dee7e9e-last-failure-vip-master,
name=last-failure-vip-master, value=1373132993, magic=NA, cib=0.33.12) :
Transient attribute: update
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: abort_transition_graph:
te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair,
id=status-2dfbfb70-566a-400c-b378-62917dee7e9e-last-failure-vip-master,
name=last-failure-vip-master, value=1373132993, magic=NA, cib=0.33.12) :
Transient attribute: update
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_pe_invoke: Query 1484:
Requesting the current CIB: S_POLICY_ENGINE
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_pe_invoke: Query
1484: Requesting the current CIB: S_POLICY_ENGINE
Jul 07 02:49:53 ptdb02.localdomain crmd: [8860]: info: do_pe_invoke: Query
1485: Requesting the current CIB: S_POLICY_ENGINE
Jul  7 02:49:53 ptdb02 crmd: [8860]: info: do_pe_invoke: Query 1485:
Requesting the current CIB: S_POLICY_ENGINE
Jul  7 02:49:53 ptdb02 pengine: [8863]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: notice: unpack_config:
On loss of CCM Quorum: Ignore
Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: info: unpack_config:
Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jul  7 02:49:53 ptdb02 pengine: [8863]: info: unpack_config: Node scores:
'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jul  7 02:49:53 ptdb02 pengine: [8863]: info: determine_online_status: Node
ptdb02.localdomain is online
Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: info:
determine_online_status: Node ptdb02.localdomain is online
Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: info:
determine_online_status: Node ptdb01.localdomain is online
Jul  7 02:49:53 ptdb02 pengine: [8863]: info: determine_online_status: Node
ptdb01.localdomain is online
Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: ERROR: unpack_rsc_op:
Preventing vip-master from re-starting anywhere in the cluster : operation
monitor failed 'not configured' (rc=6)
Jul  7 02:49:53 ptdb02 pengine: [8863]: ERROR: unpack_rsc_op: Preventing
vip-master from re-starting anywhere in the cluster : operation monitor
failed 'not configured' (rc=6)
Jul 07 02:49:53 ptdb02.localdomain pengine: [8863]: WARN: unpack_rsc_op:
Processing failed op vip-master_monitor_10000 on ptdb02.localdomain: not
configured (6)
Jul  7 02:49:53 ptdb02 pengine: [8863]: WARN: unpack_rsc_op: Processing
failed op vip-master_monitor_10000 on ptdb02.localdomain: not configured (6)
Jul  7 02:49:53 ptdb02 pengine: [8863]: info: find_clone: Internally
renamed pgsql:0 on ptdb01.localdomain to pgsql:1
End of log.
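
As an aid to reading an excerpt like the one above, here is a small Python sketch that extracts two items from mail-wrapped ha-log text: the `last-failure-<resource>` epoch value, converted to local time, and any resource that pengine says it is preventing from restarting anywhere in the cluster. The helper names are hypothetical, and the UTC+9 offset is an assumption based on the site being in Japan.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the servers log in Japan Standard Time (UTC+9).
JST = timezone(timedelta(hours=9), "JST")


def normalize(log_text):
    """Collapse mail-wrapped log text into one whitespace-normalized string."""
    return " ".join(log_text.split())


def last_failure_times(log_text):
    """Return {resource: datetime} from 'last-failure-<rsc>=<epoch>' updates."""
    found = {}
    pattern = r"last-failure-([\w-]+)=(\d+)"
    for rsc, epoch in re.findall(pattern, normalize(log_text)):
        found[rsc] = datetime.fromtimestamp(int(epoch), tz=JST)
    return found


def banned_resources(log_text):
    """Return resources pengine refuses to restart anywhere in the cluster."""
    pattern = r"Preventing (\S+) from re-starting anywhere in the cluster"
    return sorted(set(re.findall(pattern, normalize(log_text))))


sample = """Jul  7 02:49:53 ptdb02 attrd: [8859]: info: attrd_perform_update: Sent
update 74: last-failure-vip-master=1373132993
Jul  7 02:49:53 ptdb02 pengine: [8863]: ERROR: unpack_rsc_op: Preventing
vip-master from re-starting anywhere in the cluster : operation monitor
failed 'not configured' (rc=6)"""

print(last_failure_times(sample)["vip-master"].isoformat())
print(banned_resources(sample))
```

With the sample lines taken from the log, the epoch 1373132993 converts to 2013-07-07T02:49:53+09:00, which matches the syslog timestamps, and vip-master is the only resource listed as prevented from restarting.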

As a stopgap, I ran the command
# crm_resource -C -r vip-master -N ptdb02.localdomain
which restored service, but the problem occurred again just a short while ago.
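
Because the monitor failure is triggered by eth0 not being found, it may help to confirm that the interface is visible before clearing the failure. The sketch below is an illustration under assumptions, not a recommended procedure. It checks for the interface under `/sys/class/net` and only then builds the same `crm_resource` cleanup command quoted above. The function names and the `run` parameter are hypothetical.

```python
import os
import subprocess


def interface_exists(name, sysfs_root="/sys/class/net"):
    """Return True if the kernel currently exposes the named interface."""
    return os.path.exists(os.path.join(sysfs_root, name))


def cleanup_command(resource, node):
    """Build the crm_resource cleanup invocation quoted in the message."""
    return ["crm_resource", "-C", "-r", resource, "-N", node]


def cleanup_if_interface_present(resource, node, nic,
                                 sysfs_root="/sys/class/net",
                                 run=subprocess.call):
    """Run the cleanup only when the NIC is visible.

    Returns None without running anything if the interface is missing,
    otherwise returns whatever `run` returns (the exit status by default).
    """
    if not interface_exists(nic, sysfs_root=sysfs_root):
        return None
    return run(cleanup_command(resource, node))
```

A call such as `cleanup_if_interface_present("vip-master", "ptdb02.localdomain", "eth0")` would then skip the cleanup while eth0 is still missing. This does not explain why the device disappears; it only avoids clearing the failure while the underlying condition persists.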

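ptdb02 is currently the master. If this problem occurs only on eth0 of the ptdb02 server, would making ptdb01 the master ensure that vip-master no longer disappears even if the problem recurs on ptdb02?

I apologize for troubling you when you are busy. Thank you in advance.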

On June 18, 2013 at 10:18, 大渕昭夫 <butch****@gmail*****> wrote:

> I see.
>
> I do not believe we have configured anything special such as power saving, but I will ask the manufacturer about known defects.
>
> Thank you.
>
>
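> On June 18, 2013 at 5:17, mlus <mlus****@39596*****> wrote:
>
> This may be off the mark, but...
>>
>> Are there any known defect reports for the NIC chip in the hardware you are using?
>> Also, checking the power-saving settings in the BIOS and the OS might turn up an effective countermeasure.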
>>
>>
>>
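>> On June 17, 2013 at 14:30, 大渕昭夫 <butch****@gmail*****> wrote:
>> > Nice to meet you.
>> > My name is 大渕昭夫.
>> >
>> > I am writing to ask for your advice.
>> >
>> > My question concerns the cause of, and the remedy for, the VIP on the master side stopping.
>> > I am not very knowledgeable technically and cannot work out the cause.
>> >
>> > I am working on making PostgreSQL redundant, using the page below as a reference. I built the system with the same settings and the same configuration.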
>> >
>> https://github.com/t-matsuo/resource-agents/wiki/PostgreSQL-9.1-%E3%82%B9%E3%83%88%E3%83%AA%E3%83%BC%E3%83%9F%E3%83%B3%E3%82%B0%E3%83%AC%E3%83%97%E3%83%AA%E3%82%B1%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3%E5%AF%BE%E5%BF%9C-%E3%83%AA%E3%82%BD%E3%83%BC%E3%82%B9%E3%82%A8%E3%83%BC%E3%82%B8%E3%82%A7%E3%83%B3%E3%83%88
>> >
>> >
>> >
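>> The server currently in production (ptdb01) is being left as it is, and a new server (ptdb02) has been built as the master. The plan is to run on ptdb02 alone for a while and, if there are no problems, to stop ptdb01, install the same environment on it, and add it as the slave, ending up with a Master/Slave configuration like the one in the reference above.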
>> >
>> >
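>> I installed Pacemaker 1.0.13-1.1 and PostgreSQL 9.2.4 on ptdb02 and confirmed on June 13 that everything was running without problems.
>> > The OS is CentOS 5.
>> > Also, while Pacemaker was running, I tested changing vip-master with crm configure edit,
>> > and at that time the change was applied correctly and the resource kept running.
>> >
>> > Access to the database through vip-master also worked without problems.
>> >
>> > However, when I checked the monitor this morning, it showed the following, and vip-master could no longer be accessed.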
>> >
>> > ============
>> > Last updated: Mon Jun 17 09:29:32 2013
>> > Stack: Heartbeat
>> > Current DC: ptdb02.localdomain (2dfbfb70-566a-400c-b378-62917dee7e9e) -
>> > partition with quorum
>> > Version: 1.0.13-30bb726
>> > 1 Nodes configured, unknown expected votes
>> > 4 Resources configured.
>> > ============
>> > Online: [ ptdb02.localdomain ]
>> > vip-slave       (ocf::heartbeat:IPaddr2):       Started
>> ptdb02.localdomain
>> >  Master/Slave Set: msPostgresql
>> >      Masters: [ ptdb02.localdomain ]
>> >      Stopped: [ pgsql:1 ]
>> >  Clone Set: clnPingCheck
>> >      Started: [ ptdb02.localdomain ]
>> > Node Attributes:
>> > * Node ptdb02.localdomain:
>> >     + default_ping_set                  : 100
>> >     + master-pgsql:0                    : 1000
>> >     + pgsql-data-status                 : LATEST
>> >     + pgsql-master-baseline             : 0000000755000080
>> >     + pgsql-status                      : PRI
>> > Failed actions:
>> >     vip-master_monitor_10000 (node=ptdb02.localdomain, call=19, rc=6,
>> > status=complete): not configured
>> >
>> >
>> > Checking ha-log, I found that vip-master had stopped at 20:22 on June 15.
>> > The relevant section is as follows.
>> >
>> > Jun 15 20:22:48 ptdb02 cib: [19850]: info: cib_stats: Processed 2169
>> > operations (3416.00us average, 1% utilization) in the last 10min
>> > Jun 15 20:23:28 ptdb02 IPaddr2(vip-master)[30902]: ERROR: Unknown
>> interface
>> > [eth0] No such device.
>> > IPaddr2(vip-master)[30902]: 2013/06/15_20:23:28 ERROR: Unknown interface
>> > [eth0] No such device.
>> > Jun 15 20:23:28 ptdb02 IPaddr2(vip-master)[30902]: ERROR: [findif]
>> failed
>> > IPaddr2(vip-master)[30902]: 2013/06/15_20:23:28 ERROR: [findif] failed
>> > Jun 15 20:23:28 ptdb02 crmd: [19854]: info: process_lrm_event: LRM
>> operation
>> > vip-master_monitor_10000 (call=19, rc=6, cib-update=250,
>> confirmed=false)
>> > not configured
>> > Jun 15 20:23:28 ptdb02.localdomain crmd: [19854]: info:
>> process_lrm_event:
>> > LRM operation vip-master_monitor_10000 (call=19, rc=6, cib-update=250,
>> > confirmed=false) not configured
>> >
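>> > That is all.
>> >
>> > Note that nobody accessed ptdb02 from June 14 until the morning of June 17.
>> >
>> > I apologize for troubling you when you are busy, but I would be grateful if you could advise me on the cause and the remedy.
>> >
>> > If you need any other information, please let me know.
>> >
>> > Thank you in advance.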
>> >
>> >
>> >
>> > _______________________________________________
>> > Linux-ha-japan mailing list
>> > Linux****@lists*****
>> > http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
>> >
>> _______________________________________________
>> Linux-ha-japan mailing list
>> Linux****@lists*****
>> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
>>
>
>