
I've been testing the Cluster Suite on CentOS 6.4 and had it working fine, but I noticed today [8th August, when this question was originally asked] that it no longer accepts the config that was previously working. I tried to recreate a configuration from scratch using CCS, but that gave validation errors.


Edited 21st August:

I've now reinstalled the box completely from CentOS 6.4 x86_64 minimal install, adding the following packages and their dependencies:

yum install bind-utils dhcp dos2unix man man-pages man-pages-overrides nano nmap ntp rsync tcpdump unix2dos vim-enhanced wget

and

yum install rgmanager ccs

The following commands all worked:

ccs -h ha-01 --createcluster test-ha
ccs -h ha-01 --addnode ha-01
ccs -h ha-01 --addnode ha-02
ccs -h ha-01 --addresource ip address=10.1.1.3 monitor_link=1
ccs -h ha-01 --addresource ip address=10.1.1.4 monitor_link=1
ccs -h ha-01 --addresource ip address=10.110.0.3 monitor_link=1
ccs -h ha-01 --addresource ip address=10.110.8.3 monitor_link=1
ccs -h ha-01 --addservice routing-a autostart=1 recovery=restart
ccs -h ha-01 --addservice routing-b autostart=1 recovery=restart
ccs -h ha-01 --addsubservice routing-a ip ref=10.1.1.3
ccs -h ha-01 --addsubservice routing-a ip ref=10.110.0.3
ccs -h ha-01 --addsubservice routing-b ip ref=10.1.1.4
ccs -h ha-01 --addsubservice routing-b ip ref=10.110.8.3

and resulted in the following config:

<?xml version="1.0"?>
<cluster config_version="13" name="test-ha">
    <fence_daemon/>
    <clusternodes>
        <clusternode name="ha-01" nodeid="1"/>
        <clusternode name="ha-02" nodeid="2"/>
    </clusternodes>
    <cman/>
    <fencedevices/>
    <rm>
        <failoverdomains/>
        <resources>
            <ip address="10.1.1.3" monitor_link="1"/>
            <ip address="10.1.1.4" monitor_link="1"/>
            <ip address="10.110.0.3" monitor_link="1"/>
            <ip address="10.110.8.3" monitor_link="1"/>
        </resources>
        <service autostart="1" name="routing-a" recovery="restart">
            <ip ref="10.1.1.3"/>
            <ip ref="10.110.0.3"/>
        </service>
        <service autostart="1" name="routing-b" recovery="restart">
            <ip ref="10.1.1.4"/>
            <ip ref="10.110.8.3"/>
        </service>
    </rm>
</cluster>

However, if I use ccs_config_validate or try to start the cman service, it fails with:

Relax-NG validity error : Extra element rm in interleave
tempfile:10: element rm: Relax-NG validity error : Element cluster failed to validate content
Configuration fails to validate
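For reference, the same check can be reproduced by hand: ccs_config_validate essentially validates the configuration against the RelaxNG schema that ships with cman. A sketch, assuming the stock CentOS 6 paths:

```shell
# Validate cluster.conf against the RelaxNG schema used by cman
# (paths as on a stock CentOS 6 install; roughly the check that
# ccs_config_validate performs)
xmllint --noout --relaxng /var/lib/cluster/cluster.rng /etc/cluster/cluster.conf
```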

What's going on? This used to work!

Iain Hallam
  • Is your cluster up and running? Is `cman` started on the nodes, and what do `clustat` and `cman_tool status` say? (Asking because you say that you've recreated the config on a previously running cluster.) – Petter H Aug 08 '13 at 11:07
  • Same result whether the cluster is started or stopped. Doesn't ccs just modify a configuration ready for it to be pushed to the cluster? – Iain Hallam Aug 08 '13 at 11:26
  • Just ran `ccs_config_validate` on the old config, and got: `Relax-NG validity error : Extra element rm in interleave` / `tempfile:10: element rm: Relax-NG validity error : Element cluster failed to validate content` / `Configuration fails to validate` – Iain Hallam Aug 08 '13 at 11:36
  • Same when just adding `` to the newly generated config. – Iain Hallam Aug 08 '13 at 11:42
  • Does it work without the `` in ``? – Aug 22 '13 at 10:51
  • No; same failure to validate. – Iain Hallam Aug 22 '13 at 14:30
  • Just in case ... can you remove any standalone section? Like , , , – Nikolaidis Fotis Aug 27 '13 at 16:46
  • what version of cman do you have? what is in your `/var/lib/cluster/cluster.rng` ? – Petter H Aug 28 '13 at 12:39
  • Hi, Petter. Was away for a week; sorry. cman is version 3.0.12.1, release 49.el6_4.2; my cluster.rng is at http://pastebin.com/br4pQ5nS, though it's probably never changed from the one installed by yum. – Iain Hallam Sep 04 '13 at 16:50

2 Answers


I think you are missing the failover domains. To define a service on a Red Hat cluster, you first need to define a failover domain; a failover domain can be shared by many services, or you can use one per service.

For more information about failover domains, see `man clurgmgrd`:

A failover domain is an ordered subset of members to which a service may be bound. The following is a list of semantics governing the options as to how the different configuration options affect the behavior of a failover domain.
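If you want to try one, a failover domain can be added with the same ccs tooling already used in the question (the domain name below is made up for illustration):

```shell
# Hypothetical sketch: create an ordered failover domain and add both nodes,
# with ha-01 preferred (priority 1) over ha-02 (priority 2)
ccs -h ha-01 --addfailoverdomain routing-fd ordered
ccs -h ha-01 --addfailoverdomainnode routing-fd ha-01 1
ccs -h ha-01 --addfailoverdomainnode routing-fd ha-02 2
```

Services can then reference the domain via a `domain="routing-fd"` attribute on their `<service>` element.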

c4f4t0r
  • More info will make this an answer. Can you add to your answer? – Dave M Aug 27 '13 at 12:22
  • Thanks, unfortunately failoverdomains is an optional element, and this configuration used to work without any. – Iain Hallam Aug 27 '13 at 17:15
  • At this point I think you have a problem with your HA software version; I validated your XML cluster config and I can tell it works. I use `xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng cluster.xml` to validate. – c4f4t0r Aug 27 '13 at 18:46
  • I don't have a cluster.ng on my system - presumably because it's CentOS 6.4 and system-config-cluster isn't available. Using /var/lib/cluster/cluster.rng fails validation with the "extra element rm in interleave" error. – Iain Hallam Sep 04 '13 at 16:52
  • @c4f4t0r, would it be helpful to compare the contents of your /usr/share/system-config-cluster/misc/cluster.ng with /var/lib/cluster/cluster.rng on my system, or are they likely to be very different? – Iain Hallam Sep 10 '13 at 09:40
  • @Iain Hallam good idea, but in this moment i don't have access to my redhat cluster – c4f4t0r Sep 10 '13 at 22:06

It's just started working again, after more `yum update` dancing. I have compared the old and new /var/lib/cluster/cluster.rng and, surprise, surprise, there's a difference: the one on the systems that didn't work was missing any definitions for the <ip> element.

The current incarnation of the system was installed from the same minimal CD, and I have a step-by-step procedure of commands to cut and paste. It worked several times while I was developing it, then failed for nearly two months, and now it works again. I've built the box about half a dozen times, so I guess it's not the procedure.

A slip-up on Red Hat's part, perhaps, but I'm not sure how to find out what changes were checked into this file over the last two months.
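A few stock commands can help narrow down what changed and when; a sketch, using the paths and package from this question (the grep pattern is only a rough presence check, not a schema query):

```shell
# What does the changelog of the cman package say about recent updates?
rpm -q --changelog cman | head -20

# yum keeps a transaction history showing when cman was touched
# (fall back to plain 'yum history' if package-list is unsupported here)
yum history package-list cman

# Rough check that the schema mentions the ip resource at all
grep -c '"ip"' /var/lib/cluster/cluster.rng
```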

Iain Hallam