
Are there any recommended values for "mon osd down out interval"? The cluster has 15 nodes and 300 OSDs. It's running Ceph Jewel.

The default value in Jewel is 300 seconds. I was wondering, are there any benefits or downsides to changing the default value?

1 Answer

The default has since been increased to 600 seconds; I just don't know in which release. Changing the value only makes sense if you have experienced problems with the current configuration. For planned maintenance you should set the noout flag anyway, which prevents rebalancing while a server (or a single OSD) is down, but plenty of other things can happen to a server. For example, an OSD can fail for whatever reason and get marked down. With your current config it then has 5 minutes to come back up; otherwise it is also marked out, which triggers a remapping of the PGs from that OSD, i.e. the OSD is drained. Bringing a single service back up within 5 minutes doesn't sound that bad, but bringing back a whole server with multiple OSDs can take much longer. A low mon_osd_down_out_interval value can also lead to "flapping OSDs", which cause way more trouble on the cluster. I assume this was the main reason for increasing the default to 10 minutes.
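To illustrate the noout part for planned maintenance, here is a minimal Python sketch that sets the flag before the work and always clears it afterwards. `ceph osd set noout` and `ceph osd unset noout` are the regular CLI commands; the helper names `run_ceph` and `noout` are my own, and the sketch assumes the `ceph` CLI and an admin keyring are available on the host.

```python
import subprocess
from contextlib import contextmanager


def run_ceph(args):
    """Run a ceph CLI command and raise CalledProcessError if it fails."""
    subprocess.run(["ceph", *args], check=True)


@contextmanager
def noout(run=run_ceph):
    """Keep the cluster-wide noout flag set for the duration of the block.

    `run` is injectable (hypothetical helper) so the logic can be tried
    without a real cluster.
    """
    run(["osd", "set", "noout"])
    try:
        yield
    finally:
        # Always clear the flag, even if the maintenance step raised,
        # so the cluster does not stay in noout by accident.
        run(["osd", "unset", "noout"])


# Usage sketch (needs a real cluster, so it is left commented out):
# with noout():
#     ...  # reboot the node, swap hardware, etc.
```

While the flag is set, down OSDs are not marked out regardless of mon_osd_down_out_interval, so remember that this only covers maintenance you knew about in advance.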

On the other hand, if your cluster is large enough and the CRUSH rules and replica counts are configured reasonably, you should be able to lose a few OSDs without the clients noticing. It's basically up to the cluster admin to get a feeling for the possible scenarios, because it also depends on the actual hardware. I have a customer whose servers need 15 minutes or more to reboot. Now imagine a kernel crash or something else that triggers a reboot: the server wouldn't come back within the interval, leading to massive remapping. Cases like this can be handled better with an adjusted mon_osd_down_out_interval.
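As a rough way to turn the reboot-time example into a number, here is a small Python sketch. The function name, the 1.5x safety margin and the choice of 600 seconds as a lower bound are assumptions of mine for illustration, not a Ceph recommendation; measure your own hardware before picking a value.

```python
import math


def suggest_down_out_interval(reboot_seconds, margin=1.5, floor=600):
    """Return an interval in seconds that covers a full reboot plus headroom.

    reboot_seconds: observed worst-case time for a node to come back.
    margin:         safety factor on top of that (assumed, not official).
    floor:          never go below this value (600 s, the newer default).
    """
    if reboot_seconds < 0 or margin < 1:
        raise ValueError("reboot_seconds must be >= 0 and margin >= 1")
    return max(floor, math.ceil(reboot_seconds * margin))


# A server that needs 15 minutes to reboot:
interval = suggest_down_out_interval(15 * 60)

# Line for the monitor section of ceph.conf:
print(f"mon osd down out interval = {interval}")
# prints: mon osd down out interval = 1350
```

The trade-off is that a longer interval also delays recovery when an OSD is really dead, so the cluster runs with reduced redundancy for that much longer.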

eblock