(You should be on the mailing list of any software you use, really, and look for and offer help there first. Have you tried there?)
Below is excerpted from the ClusterLabs mailing list; the answers are from Red Hat developers (although behaviour may vary from one cluster version to another).
When I call “pcs resource cleanup Res1”, this results in an
interruption of service on the Res2 side (i.e. Res2 is stopped …).
My (unconfirmed) assumption was that Pacemaker would first detect
the current state of the resource(s) by calling monitor and then
decide whether any actions need to be performed.
But from reading the log files, my interpretation is that Res1 is
temporarily removed from the CIB and then re-inserted, and this
stops Res2 until Res1 has confirmed the “started” state.
Correct, removing the resource's operation history is how Pacemaker
triggers a re-probe of the current status.
As I interpret the documentation, it should be possible to avoid this
behaviour by configuring the order constraint with kind=Optional.
But I am not sure whether this would have any other undesired side
effects (e.g. on the reverse order when stopping).
kind=Optional constraints only apply when both actions need to be done
in the same transition. I.e. if a single cluster check finds that both
Res1 and Res2 need to be started, Res1 will be started before Res2. But
it is entirely possible that Res2 can be started in an earlier
transition, with Res1 still stopped, and a later transition starts
Res1. Similarly when stopping, Res2 will be stopped first, if both need
to be stopped.
In your original scenario, if your master/slave resource will only bind
to the IP after it is up, kind=Optional won't be reliable. But if the
master/slave resource binds to the wildcard IP, then the order really
doesn't matter -- you could keep the colocation constraint and drop the
ordering.
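For reference, the approach described above might be configured with pcs roughly as follows. This is a sketch, not taken from the thread: Res1/Res2 stand in for your actual resource names, and the exact constraint-removal syntax can differ between pcs versions.

```shell
# Remove the existing (mandatory) order constraint between Res1 and Res2:
pcs constraint order remove Res1 Res2

# Re-add it as advisory only -- the ordering is then honoured only when
# both start (or stop) actions happen in the same transition:
pcs constraint order start Res1 then start Res2 kind=Optional

# If the master/slave resource binds to the wildcard IP, the ordering can
# be dropped entirely and only the colocation constraint kept:
pcs constraint colocation add Res2 with Res1
```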
Another work-around seems to be setting the dependent resource to
unmanaged, performing the cleanup, and then setting it back to managed.
This is what I would recommend, if you have to keep the mandatory
ordering.
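The unmanage/cleanup/manage workaround above would look something like this with pcs (Res1 and Res2 are placeholders for your own resource names):

```shell
# Prevent the cluster from acting on the dependent resource:
pcs resource unmanage Res2

# Clean up the failed resource. This wipes its operation history and
# triggers a re-probe, but Res2 is not stopped while it is unmanaged:
pcs resource cleanup Res1

# Once Res1 is confirmed started again, hand Res2 back to the cluster:
pcs resource manage Res2
```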
And I wonder if “pcs resource failcount reset” would do the trick
WITHOUT any actions being performed if no change in state is
necessary.
But I seem to remember that we tried this now and then, and
sometimes such a failed resource was not started after the failcount
reset. (I am not sure, though, and have not yet had time to try to reproduce it.)
No, in newer pacemaker versions, crm_failcount --delete is equivalent
to a crm_resource --cleanup. (pcs calls these to actually perform the
work)
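In other words, in newer Pacemaker versions the two pcs commands end up doing the same work under the hood (resource name is a placeholder; the exact low-level flags pcs passes may vary by version):

```shell
# Both of these remove the resource's operation history from the CIB
# and therefore trigger a re-probe:
pcs resource failcount reset Res1   # backed by crm_failcount --delete -r Res1
pcs resource cleanup Res1           # backed by crm_resource --cleanup -r Res1
```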
Is there any deeper insight which might help with a sound
understanding of this issue?
It's a side effect of the current CIB implementation. Pacemaker's
policy engine determines the current state of a resource by checking
its operation history in the CIB. Cleanups remove the operation
history, thus making the current state unknown, forcing a re-probe. As
a side effect, any dependencies no longer have their constraints
satisfied until the re-probe completes.
It would be theoretically possible to implement a "cleanup old
failures" option that would clear a resource's fail count and remove
only its operation history entries for failed operations, as long as
doing so does not change the current state determination. But that
would be quite complicated, and setting the resource unmanaged is an
easy workaround.