This question is more about etiquette than an actual server issue.
Two servers owned by a client of mine frequently stop responding (fast at first, then really sluggish, as in it takes 1 minute to execute ls, then they stop altogether). I proposed that we bring them down for maintenance, but he wants me to set up a script to reboot a server every time it locks up. What's worse is that the servers monitor each other, and the reboot commands have a mandatory 60-second delay (shutdown, wait 60 seconds, then startup). The problem with this is that there is a very good chance both of them lock up within 60 seconds of each other. Each server can then send the shutdown command to the other, so both shut down at the same time and no one is left to send the startup command. Just a few moments ago, we had a 2-hour downtime for this exact reason.
Now my client wants me to "set a flag" so the shutdown commands don't get sent repeatedly. But that doesn't eliminate the hang-at-the-same-time problem, and the servers will still go down together eventually.
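To make the failure mode concrete, here is a rough Python sketch. It is a simplified model, not the actual monitoring script: the names `both_stay_down` and `ShutdownFlag` are made up for illustration, hang times are given in seconds, and detection is assumed to be instant. It shows that a flag only suppresses duplicate shutdown commands; it has no effect on whether both servers end up down inside the same 60-second window.

```python
REBOOT_DELAY = 60  # seconds between shutdown and startup, as described above


def both_stay_down(hang_a: float, hang_b: float, delay: float = REBOOT_DELAY) -> bool:
    """Return True if the two hangs fall within one reboot window.

    Assumed model: when one server hangs, its peer shuts it down and must
    still be alive `delay` seconds later to send the startup command.
    If the peer hangs inside that window, nobody is left to start anything.
    """
    return abs(hang_a - hang_b) < delay


class ShutdownFlag:
    """Hypothetical 'flag': remembers which peers already got a shutdown command."""

    def __init__(self) -> None:
        self._sent: set[str] = set()

    def should_send(self, peer: str) -> bool:
        # Only the first request per peer goes through; repeats are suppressed.
        if peer in self._sent:
            return False
        self._sent.add(peer)
        return True

    def clear(self, peer: str) -> None:
        # Reset once the peer has been started again.
        self._sent.discard(peer)


if __name__ == "__main__":
    flag = ShutdownFlag()
    print(flag.should_send("server-b"))  # first shutdown is sent
    print(flag.should_send("server-b"))  # repeat is suppressed
    print(both_stay_down(0, 30))         # hangs 30 s apart: both stay down
    print(both_stay_down(0, 90))         # hangs 90 s apart: recovery is possible
```

The flag changes only how many shutdown commands are sent, while `both_stay_down` depends solely on the timing of the two hangs. That is why an independent monitor, or a hardware or kernel watchdog, is the usual way around this.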
Rebooting the servers isn't a good solution at all IMHO. I've suggested we find the root cause and fix it. I've also suggested he use watchdog, but he put that on hold. I even gave up and said he should fire up a small dedicated server for monitoring, but he still wants me to do it his way.
My dilemma right now is whether I should do what he asks (reboot a server every time it hangs) or simply log into his servers without permission and apply the needed fixes to get it over with. We can't move forward at all because of this. What do you guys suggest?