Looking for some useful resources to find out site and server downtime reasons

Question

I have recently started monitoring some server instances at digitalocean, OVH and an independent provider at biznesshosting.

The application servers' stacks are based on CentOS, NGINX with Passenger, and Rails/Ruby. Three background job servers run a Fedora, sidekiq, Rails stack.

One server acts as the CMS, two act as the core application, and a fourth one acts as the API. There are also separate servers for search, the secured site and the database (based on postgresql and redis).

Now, on to the questions/suggestions I am asking for:

Several times, some of the dynamic websites hosted on the CMS and core application servers went down suddenly. The downtime usually lasted at most 5 minutes, and I got notifications via pingdom, cloudstats.me, etc.

Since the sites are mostly database driven, I once found that the cause was the db server being rebooted by the hosting company. In the other cases, I had a hard time finding out why the sites went down. I could easily get into the server through SSH and ping it, and cloudstats.me didn't report high CPU, memory or disk I/O usage.

Sometimes I couldn't even SSH into the server, even though I could log in from digitalocean's web based console. Those servers have firewall rules that block SSH access for everyone except an allowed IP list. So I need a solution for this one too.

Later I came across these sites:

What To Do When Your Website Goes Down

how to view log and use it

I am, however, looking for more robust, guided information on how to find the bottleneck on my servers, how to fix it, and why sites go down even when the servers are totally alright.

I am also looking for help on how to find out whether some bad code execution is causing the downtime. I am not a developer on my team; I work as a part-time sysadmin, so whenever things go wrong, my development team rushes to my phone and message box. They want me to fix it in the shortest possible time and also ask me to report why the sites went down.

Hope to hear from some experts here.

Scientific method is your friend. – user9517 Nov 19 '15 at 07:32

score 2 · Answer 1 · answered Nov 19 '15 at 05:28

A good monitoring setup will need to distinguish between connectivity problems and server issues, even though from an end-user perspective the result is the same: a site that is down.

That distinction matters because what you can do as a system administrator differs between the two cases: a network/routing/connectivity problem or hiccup that makes your site/services unavailable for part of or even the whole internet, and a problem with your actual servers/services.

A fairly typical first approach is to monitor the default gateway your servers use in addition to the servers themselves.
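As a rough illustration of that approach, the sketch below probes the gateway, the server and the site URL separately and maps the results to a likely cause. Everything named here is an assumption for illustration only: the addresses `192.0.2.1` and `192.0.2.10`, the URL `http://example.com/`, the function names, and the use of the Linux `ping` flags `-c` and `-W`. It also assumes the gateway answers ICMP echo requests from wherever the monitor runs, which is not always the case.

```python
import http.client
import socket
import subprocess
import urllib.error
import urllib.request


def ping(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request using the system `ping` (Linux flags assumed)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 3,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def tcp_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def classify(gateway_up: bool, server_up: bool, site_ok: bool) -> str:
    """Map the three probe results to a likely cause of the outage."""
    if site_ok:
        return "ok"
    if not gateway_up:
        return "network: gateway unreachable from the monitor"
    if not server_up:
        return "server: gateway answers, host does not"
    return "application: host reachable, site failing"


if __name__ == "__main__":
    # Hypothetical placeholder values; replace with your own gateway, host and URL.
    gateway = "192.0.2.1"
    server = "192.0.2.10"
    url = "http://example.com/"
    print(classify(ping(gateway), tcp_open(server, 80), http_ok(url)))
```

The last branch corresponds to the situation where the host still answers but the URL check fails, which points at the web or application layer rather than at connectivity.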

HBruijn

Thanks for the feedback, but I have already checked them, and they are good. I even used the netstat and iptraf tools. I usually monitor using cloudstats.me, which reports that the servers are fine, but the URL monitoring at cloudstats reports that the sites are down. And yes, as the other guy posted, the scientific method may help. By the way, I am not sure who is giving negative feedback on my every post on different stackexchange sub-sites! I cannot make any new posts, be it at serverfault or at Unix/Linux. – Manjurul Islam Nov 19 '15 at 13:57
