
I've been testing the Cluster Suite on CentOS 6.4 and had it working fine, but I noticed today [8th August, when this question was originally asked] that it no longer accepts the config that was previously working. I tried to recreate a configuration from scratch using CCS, but that gave validation errors.


Edited 21st August:

I've now reinstalled the box completely from CentOS 6.4 x86_64 minimal install, adding the following packages and their dependencies:

yum install bind-utils dhcp dos2unix man man-pages man-pages-overrides nano nmap ntp rsync tcpdump unix2dos vim-enhanced wget

and

yum install rgmanager ccs

The following commands all worked:

ccs -h ha-01 --createcluster test-ha
ccs -h ha-01 --addnode ha-01
ccs -h ha-01 --addnode ha-02
ccs -h ha-01 --addresource ip address=10.1.1.3 monitor_link=1
ccs -h ha-01 --addresource ip address=10.1.1.4 monitor_link=1
ccs -h ha-01 --addresource ip address=10.110.0.3 monitor_link=1
ccs -h ha-01 --addresource ip address=10.110.8.3 monitor_link=1
ccs -h ha-01 --addservice routing-a autostart=1 recovery=restart
ccs -h ha-01 --addservice routing-b autostart=1 recovery=restart
ccs -h ha-01 --addsubservice routing-a ip ref=10.1.1.3
ccs -h ha-01 --addsubservice routing-a ip ref=10.110.0.3
ccs -h ha-01 --addsubservice routing-b ip ref=10.1.1.4
ccs -h ha-01 --addsubservice routing-b ip ref=10.110.8.3

and resulted in the following config:

<?xml version="1.0"?>
<cluster config_version="13" name="test-ha">
    <fence_daemon/>
    <clusternodes>
        <clusternode name="ha-01" nodeid="1"/>
        <clusternode name="ha-02" nodeid="2"/>
    </clusternodes>
    <cman/>
    <fencedevices/>
    <rm>
        <failoverdomains/>
        <resources>
            <ip address="10.1.1.3" monitor_link="1"/>
            <ip address="10.1.1.4" monitor_link="1"/>
            <ip address="10.110.0.3" monitor_link="1"/>
            <ip address="10.110.8.3" monitor_link="1"/>
        </resources>
        <service autostart="1" name="routing-a" recovery="restart">
            <ip ref="10.1.1.3"/>
            <ip ref="10.110.0.3"/>
        </service>
        <service autostart="1" name="routing-b" recovery="restart">
            <ip ref="10.1.1.4"/>
            <ip ref="10.110.8.3"/>
        </service>
    </rm>
</cluster>

However, if I use ccs_config_validate or try to start the cman service, it fails with:

Relax-NG validity error : Extra element rm in interleave
tempfile:10: element rm: Relax-NG validity error : Element cluster failed to validate content
Configuration fails to validate
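For reference, the same check can be reproduced by hand: ccs_config_validate essentially validates the configuration against the RelaxNG schema that ships with cman. A sketch, assuming the stock CentOS 6 paths:

```shell
# Validate cluster.conf against the RelaxNG schema used by cman
# (paths as on a stock CentOS 6 install; roughly the check that
# ccs_config_validate performs)
xmllint --noout --relaxng /var/lib/cluster/cluster.rng /etc/cluster/cluster.conf
```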

What's going on? This used to work!

Iain Hallam
  • Is your cluster up and running? Is `cman` started on the nodes, and what do `clustat` and `cman_tool status` say? (Asking because you say that you've recreated the config on a previously running cluster.) – Petter H Aug 08 '13 at 11:07
  • Same result whether the cluster is started or stopped. Doesn't ccs just modify a configuration ready for it to be pushed to the cluster? – Iain Hallam Aug 08 '13 at 11:26
  • Just ran `ccs_config_validate` on the old config, and got: `Relax-NG validity error : Extra element rm in interleave` / `tempfile:10: element rm: Relax-NG validity error : Element cluster failed to validate content` / `Configuration fails to validate` – Iain Hallam Aug 08 '13 at 11:36
  • Same when just adding `` to the newly generated config. – Iain Hallam Aug 08 '13 at 11:42
  • Does it work without the `` in ``? – Aug 22 '13 at 10:51
  • No; same failure to validate. – Iain Hallam Aug 22 '13 at 14:30
  • Just in case ... can you remove any standalone section? Like , , , – Nikolaidis Fotis Aug 27 '13 at 16:46
  • what version of cman do you have? what is in your `/var/lib/cluster/cluster.rng` ? – Petter H Aug 28 '13 at 12:39
  • Hi, Petter. Was away for a week; sorry. cman is version 3.0.12.1, release 49.el6_4.2; my cluster.rng is at http://pastebin.com/br4pQ5nS, though it's probably never changed from the one installed by yum. – Iain Hallam Sep 04 '13 at 16:50

2 Answers


I think you are missing the failover domains. To define a service on a Red Hat cluster, you first need to define a failover domain; a failover domain can be shared by many services, or you can use one per service.

For more information about failover domains, see `man clurgmgrd`:

A failover domain is an ordered subset of members to which a service may be bound. The following is a list of semantics governing the options as to how the different configuration options affect the behavior of a failover domain.
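If you want to try one, a failover domain can be added with the same ccs tooling already used in the question (the domain name below is made up for illustration):

```shell
# Hypothetical sketch: create an ordered failover domain and add both nodes,
# with ha-01 preferred (priority 1) over ha-02 (priority 2)
ccs -h ha-01 --addfailoverdomain routing-fd ordered
ccs -h ha-01 --addfailoverdomainnode routing-fd ha-01 1
ccs -h ha-01 --addfailoverdomainnode routing-fd ha-02 2
```

Services can then reference the domain via a `domain="routing-fd"` attribute on their `<service>` element.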

c4f4t0r
  • More info will make this an answer. Can you add to your answer? – Dave M Aug 27 '13 at 12:22
  • Thanks, unfortunately failoverdomains is an optional element, and this configuration used to work without any. – Iain Hallam Aug 27 '13 at 17:15
  • At this point I think you have a problem with your HA software version; I validated your XML cluster config and I can tell it works. I use `xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng cluster.xml` to validate. – c4f4t0r Aug 27 '13 at 18:46
  • I don't have a cluster.ng on my system - presumably because it's CentOS 6.4 and system-config-cluster isn't available. Using /var/lib/cluster/cluster.rng fails validation with the "extra element rm in interleave" error. – Iain Hallam Sep 04 '13 at 16:52
  • @c4f4t0r, would it be helpful to compare the contents of your /usr/share/system-config-cluster/misc/cluster.ng with /var/lib/cluster/cluster.rng on my system, or are they likely to be very different? – Iain Hallam Sep 10 '13 at 09:40
  • @Iain Hallam good idea, but in this moment i don't have access to my redhat cluster – c4f4t0r Sep 10 '13 at 22:06

It's just started working again, after more `yum update` dancing. I have compared the old and new /var/lib/cluster/cluster.rng and, surprise, surprise, there's a difference: the one on the systems that didn't work was missing any definitions for the <ip> element.

The current incarnation of the system was installed from the same minimal CD, and I have a step-by-step procedure of commands to cut and paste. It worked several times while I was developing it, then failed for nearly two months, and now it works again. I've built the box about half a dozen times, so I guess it's not the procedure.

A slip-up on Red Hat's part, perhaps, but I'm not sure how to find out what changes were checked into this file over the last two months.
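A few stock commands can help narrow down what changed and when; a sketch, using the paths and package from this question (the grep pattern is only a rough presence check, not a schema query):

```shell
# What does the changelog of the cman package say about recent updates?
rpm -q --changelog cman | head -20

# yum keeps a transaction history showing when cman was touched
# (fall back to plain 'yum history' if package-list is unsupported here)
yum history package-list cman

# Rough check that the schema mentions the ip resource at all
grep -c '"ip"' /var/lib/cluster/cluster.rng
```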

Iain Hallam