We had a circuit trip last night (Sunday evening, 22:00) which killed all external comms, so the alerting on our servers, inside the building, could not communicate out. Is there a solution to this? Perhaps a SaaS which monitors/pings our servers and then alerts if there is a comms failure (in addition to alerting for actual faults logged by our server monitoring)?

(We aren't a big company, so we are unlikely to want to spend money on a means of communicating when both primary and secondary internet connections go down at the same time, like last night.)

This event was unusual for us. We don't normally get Support out of bed on a Sunday night (e.g. for a single-point failure), but a total comms failure is a bit different: we have people wanting to connect at 05:00 (local time) on a Monday morning, and they could neither connect nor reach Support until IT staff arrived at the office at 08:00.

We have servers at 4 sites, so an option would be to have each site alert if it cannot communicate with one of the others. I'd prefer something a little more sophisticated, so that we can create a Critical Alert only if all 3 other sites fail to communicate with the 4th site (in fact, the key critical failure is "no subsidiary sites can communicate with the Primary HQ site").

We use Servers Alive for some of our monitoring, so one option would be to use Servers Alive to create a webpage at each site so that Support could view them to see the status, and the timestamp of any failure, as seen by each site. That would also give the ability to alert if a ping from Site-A to Site-B failed, but we are at a rural location and get quite a lot of intermittent single-site A-to-B ping failures, so alerting on every one of them would be too noisy.
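One way to keep intermittent failures from raising alerts is to treat a link as down only after several failed pings in a row. A minimal Python sketch of that idea follows; the `FlapFilter` class and the threshold of 3 are illustrative assumptions, not a Servers Alive feature or setting.

```python
from collections import deque


class FlapFilter:
    """Report a link as down only after `threshold` consecutive failed pings.

    Hypothetical helper for illustration; the threshold value is an assumption.
    """

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        # Keep only the most recent `threshold` ping results.
        self.recent = deque(maxlen=threshold)

    def record(self, ping_ok: bool) -> bool:
        """Store one ping result; return True if the link should be treated as down."""
        self.recent.append(ping_ok)
        window_full = len(self.recent) == self.threshold
        # Down only if the window is full and every result in it is a failure.
        return window_full and not any(self.recent)


# Usage: feed in one result per ping cycle for the Site-A to Site-B link.
link_a_to_b = FlapFilter(threshold=3)
for result in (False, True, False, False, False):
    is_down = link_a_to_b.record(result)
```

With a ping every minute, a threshold of 3 would delay a real alert by roughly three minutes, which is the price of ignoring single dropped pings.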

My ideal would be a remote monitoring service that could be configured to escalate to critical only when certain combinations of tests fail, e.g. all remote sites fail to ping the Primary HQ site.
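To make the rule concrete, here is a rough Python sketch of the escalation logic I have in mind. The function name, the site names and the "warning" middle level are my own placeholders, and it assumes the ping results from every site can be collected in one place that is still reachable.

```python
def escalation_level(reach: dict[tuple[str, str], bool], target: str) -> str:
    """Combine per-site ping results into one alert level for `target`.

    `reach` maps (observing_site, observed_site) to True if the ping succeeded.
    All names here are hypothetical, chosen for illustration only.
    """
    # Collect what every other site reports about the target site.
    observers = {src: ok for (src, dst), ok in reach.items() if dst == target}
    if not observers:
        return "unknown"  # nobody is reporting on this site
    failed = [src for src, ok in observers.items() if not ok]
    if len(failed) == len(observers):
        return "critical"  # every observing site has lost the target
    if failed:
        return "warning"  # only some sites have lost it, possibly a flaky link
    return "ok"


# Usage: all three subsidiary sites fail to ping the Primary HQ site.
results = {
    ("SiteB", "HQ"): False,
    ("SiteC", "HQ"): False,
    ("SiteD", "HQ"): False,
}
level = escalation_level(results, "HQ")
```

In the example above `level` comes out as "critical"; if any one of the three sites could still reach HQ it would only be a "warning".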

Kristen

2 Answers

I see you have a few valid ideas already, but here's another one:

A combination of something like https://datadoghq.com and https://pagerduty.com could probably solve this problem for a few dollars per month.

Mikael H

Your challenge here is that your monitoring solution (Servers Alive) is dependent upon the infrastructure that it's monitoring. You can approach this in a number of ways, one of which you've already suggested.

Set up a Servers Alive check at each site to check a component at each of the other sites (website, ping, etc.). Then set up an external monitor (Uptime Robot, etc.) to monitor a component at each site (website, ping, etc.). Then, based on the alerts you get, you should be able to determine whether the issue is internal, the internet connection, etc.
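As a rough illustration of how the two kinds of alert might be read together, here is a small Python sketch. The function and its four outcomes are one possible interpretation, not something Servers Alive or Uptime Robot produces, and the suggested causes are assumptions you would adjust to your own network layout.

```python
def diagnose(peer_sites_ok: bool, external_ok: bool) -> str:
    """Suggest where to look first for one site, given two independent views.

    peer_sites_ok: True if the checks from your other sites reach this site.
    external_ok:   True if the external monitor reaches this site.
    The returned hints are illustrative assumptions, not definitive causes.
    """
    if peer_sites_ok and external_ok:
        return "ok"
    if not peer_sites_ok and not external_ok:
        # Nobody can reach the site: suspect its internet connection or power.
        return "site-unreachable"
    if external_ok:
        # The outside world can reach it but your other sites cannot:
        # suspect the inter-site links or the observing sites themselves.
        return "inter-site-problem"
    # Your sites can reach it but the external monitor cannot:
    # suspect the public-facing component or the external monitor's path.
    return "external-path-problem"


# Usage: both the peer checks and the external monitor report HQ as down.
hint = diagnose(peer_sites_ok=False, external_ok=False)
```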

Another option would be to configure Servers Alive at each site to monitor all components at the other sites, so SiteA monitors SiteB, SiteB monitors SiteC, etc. That way your monitoring at each site isn't dependent upon the infrastructure that's being monitored.

joeqwerty

Thanks. My thinking was nearly at that point, so now I think I will have Servers Alive create a web page at each site and copy it to a shared folder at each other site (and/or to the cloud). I will also create some simple ping tests between sites to alert if primary comms go down. Armed with that alert, we can look at the web status pages at any site that is reachable, and from those determine what failed and when. – Kristen Dec 23 '19 at 16:15