I have recently started monitoring some server instances at DigitalOcean, OVH and an independent provider, biznesshosting.
The application servers' stacks are based on CentOS, NGINX with Passenger, and Rails/Ruby. There are also three background job servers running Fedora, Sidekiq and a Rails stack.
One server acts as the CMS, two act as the core application, and a fourth acts as the API. There are also separate servers for search, the secured site and the database (based on PostgreSQL and Redis).
Now, on to the questions and suggestions I am asking for:
Several times, some of the dynamic websites hosted on the CMS and core application servers have gone down suddenly. The downtime usually lasted at most 5 minutes, and I was notified via Pingdom, cloudstats.me, etc.
Since the sites are mostly database driven, on one occasion I found that the cause was the database server being rebooted by the hosting company. In the other cases, I had a hard time finding out why the sites went down: I could log in to the server over SSH and ping it, and cloudstats.me did not report high CPU, memory or disk I/O usage.
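For context, the kind of outside check I have in mind when an alert fires looks roughly like the sketch below. It only attempts a TCP connection to each layer and classifies the result, so a database or Redis port that stopped answering can be told apart from a web server problem. The host names are placeholders, and the ports are just the usual defaults (80 for NGINX, 5432 for PostgreSQL, 6379 for Redis); the real setup may differ.

```python
import socket
import time

# Placeholder targets: host names are hypothetical, ports are the common defaults.
TARGETS = {
    "nginx": ("cms.example.com", 80),
    "postgresql": ("db.example.com", 5432),
    "redis": ("db.example.com", 6379),
}


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connect and return (outcome, seconds_taken)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "open"
    except socket.gaierror:
        outcome = "dns_error"      # the name did not resolve
    except socket.timeout:
        outcome = "timeout"        # no answer within the timeout
    except ConnectionRefusedError:
        outcome = "refused"        # connection was actively rejected
    except OSError:
        outcome = "unreachable"    # any other network-level failure
    return outcome, time.monotonic() - start


def check_stack(targets, timeout=3.0):
    """Probe every target and return {name: (outcome, seconds_taken)}."""
    return {
        name: probe_tcp(host, port, timeout)
        for name, (host, port) in targets.items()
    }
```

Calling `check_stack(TARGETS)` every minute from a machine outside the hosting network and logging the result would at least show which layer stopped answering first. As far as I understand, `timeout` usually points to dropped packets or a hung host, while `refused` usually means the host is up but nothing is accepting connections on that port.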
Sometimes I could not even SSH into the server, although I could still log in through DigitalOcean's web-based console. Those servers have firewall rules that block SSH access for everyone except an allowed IP list. So I need a solution for this too.
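To narrow down the SSH case, I could run something like the following from one of the allowed IPs. It is only a sketch: it connects to the SSH port and reads the identification line that an SSH server sends first (it starts with `SSH-`). The function name and the labels it returns are my own, and it assumes the identification line is the first thing the server sends.

```python
import socket


def ssh_banner(host, port=22, timeout=5.0):
    """Return the SSH identification line, or a short failure label."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(256)  # the identification line is short
    except socket.timeout:
        return "timeout"           # no reply: dropped packets or hung host
    except ConnectionRefusedError:
        return "refused"           # nothing listening, or actively rejected
    except OSError as exc:
        return "error: " + exc.__class__.__name__
    line = data.split(b"\n", 1)[0].strip().decode("ascii", errors="replace")
    return line if line.startswith("SSH-") else "no-banner"
```

If `ssh_banner("203.0.113.10")` (a documentation address used here as a placeholder) returns a line starting with `SSH-`, the daemon is accepting connections and the problem is more likely on the login side. A `timeout` from an allowed IP would make me suspect the firewall rules, the network path or an overloaded host instead.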
Later I came across this article:
What To Do When Your Website Goes Down
I am, however, looking for more robust, guided information on how to find the bottleneck on my servers, how to fix it, and why sites go down even when the servers themselves are fine.
I am also looking for help on how to find out whether bad code execution is causing the downtime. I am not a developer on my team; I work as a part-time sysadmin, so whenever things go wrong my development team rushes to my phone and inbox. They want me to fix it in the shortest possible time and also to report why the sites went down.
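One thing I could do on my own is count server errors in the NGINX access log around the time of an alert. The sketch below assumes the log uses NGINX's default `combined` format; the function names are mine, and a customized `log_format` would need a different pattern.

```python
import re
from collections import Counter

# Timestamp, request line and status as they appear in the "combined" format.
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def errors_per_minute(lines):
    """Count 5xx responses per minute, keyed by 'dd/Mon/yyyy:HH:MM'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            counts[match.group("ts")[:17]] += 1  # drop seconds and timezone
    return counts


def failing_paths(lines):
    """Count 5xx responses per request path."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            parts = match.group("request").split()
            if len(parts) >= 2:
                counts[parts[1]] += 1  # parts is [method, path, protocol]
    return counts
```

Feeding it the lines of the access log (often `/var/log/nginx/access.log`, though that path is an assumption about my servers) and comparing the busiest minutes with the Pingdom alert times should show whether the outage coincided with a burst of 5xx responses, and `failing_paths` would give the developers a list of URLs to look at.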
I hope to hear from some experts here.