In my environment, we manage multiple services running on top of drbd devices (traditional services, lxc containers, docker containers, databases, ...). We use the opensvc stack (https://www.opensvc.com), which is free and open source, and provides automatic failover features. Below is a test service with a drbd device and a redis application (disabled in this example).
First, at the cluster level, the svcmon output shows that:
- it is a 2-node opensvc cluster (node-1-1 and node-1-2)
- service servdrbd is up (uppercase green O) on node-1-1, and standby (lowercase green o) on node-1-2
- node-1-1 is the preferred master node for this service (circumflex accent next to the uppercase O)
At the service level (svcmgr -s servdrbd print status), we can see:
- on the primary node (on the left): all resources are up (or standby up, meaning they must remain up even when the service is running on the other node), and the drbd device is reported as Primary
- on the secondary node (on the right): only the standby resources are up, and the drbd device is in the Secondary state
To simulate an issue, I disconnected the drbd device on the secondary node, which produced the following warnings.
It is important to see that the service availability status is still up, but the overall service status is degraded to warn, meaning "ok, production is still running fine, but something is going wrong, have a look".
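For reference, the kind of command I used to trigger this state on the secondary node is shown below ("servdrbd" is assumed to be the drbd resource name here; substitute the resource name used in your setup):

    # run on the secondary node: drop the replication link for the resource
    drbdadm disconnect servdrbd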
Once you know that all opensvc commands accept the json output selector (nodemgr daemon status --format json or svcmgr -s servdrbd print status --format json), it is easy to plug them into an NRPE script and just monitor the service states. And as you saw, any issue on the primary or the secondary node is trapped.
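As an illustration, a minimal NRPE-style check could look like the sketch below. It assumes the print status JSON exposes top-level "avail" and "overall" keys; verify the key names against the JSON actually produced by your opensvc agent version.

    #!/bin/bash
    # Sketch of an NRPE check for one opensvc service.
    # Assumption: the JSON has top-level "avail" and "overall" keys.
    SVC="servdrbd"

    if ! STATUS_JSON=$(svcmgr -s "$SVC" print status --format json 2>/dev/null); then
        echo "UNKNOWN: unable to query service $SVC"
        exit 3
    fi

    AVAIL=$(echo "$STATUS_JSON" | jq -r '.avail')
    OVERALL=$(echo "$STATUS_JSON" | jq -r '.overall')

    if [ "$AVAIL" != "up" ]; then
        echo "CRITICAL: $SVC avail=$AVAIL overall=$OVERALL"
        exit 2
    elif [ "$OVERALL" = "warn" ]; then
        echo "WARNING: $SVC is available but overall status is warn"
        exit 1
    else
        echo "OK: $SVC avail=$AVAIL overall=$OVERALL"
        exit 0
    fi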
The nodemgr daemon status command is better because its output is the same on all cluster nodes, and information about all opensvc services is displayed in a single command call.
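If you go that way, a one-liner like the following can list every service state from a single call. The jq path is an assumption about the daemon status JSON layout (per-service states grouped under .monitor.services); adjust it to what your agent actually returns.

    # Assumed layout: .monitor.services.<svcname>.{avail,overall}
    nodemgr daemon status --format json | \
        jq -r '.monitor.services | to_entries[] | "\(.key) avail=\(.value.avail) overall=\(.value.overall)"'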
If you are interested in the service configuration file for this setup, I posted it on pastebin here.
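For the record, a stripped-down configuration for this kind of service could look roughly like the sketch below. The section and keyword names are written from memory and the device/mount paths are made up, so treat it as an assumption and refer to the pastebin file and the opensvc documentation for the real syntax.

    [DEFAULT]
    nodes = node-1-1 node-1-2

    [disk#0]
    ; drbd resource, kept Secondary (standby) on the passive node
    type = drbd
    res = servdrbd
    standby = true

    [fs#0]
    ; filesystem on top of the drbd device, mounted on the active node only
    type = ext4
    dev = /dev/drbd0
    mnt = /srv/servdrbd

    [app#0]
    ; redis application, disabled in this example
    script = redis
    start = 10
    stop = 90
    disable = true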