
I manage a web service, and it is very important for my company to detect and be notified when any of the services is down, and also when any of its operations takes too long to respond. Until now there was a separate web application (frontend and backend) whose only job was to request random operations from those endpoints every 15 minutes. I find that convoluted, since it requires maintaining a whole webapp just for this purpose, and I know there are many free web services out there that should do the job.

I've set up AWS Healthchecks to replace the polling webapp, and it works perfectly for the uptime part. My question is about the response time part.

All those API healthcheck services seem to be designed for fairly simple requests. So, should it be the responsibility of the API to offer a "status" endpoint for the healthcheck services and include things like database latency in that "OK" response, or should the "healthchecker" be the one responsible for performing complex requests? Which approach is more correct?

Thanks!

namelivia

1 Answer


You probably shouldn't monitor database performance via the application's health check paths - some dangerous cases can occur. Say you use ASGs within AWS and use the LB health checks to determine whether the ASG should rotate machines. If you start to have database contention (unrelated to your app), the health checks will fail and your ASG will start removing nodes. So not only will you have a poorly performing database, but you will also have a depleted ASG.
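As a rough illustration of what a "shallow" health check could look like, here is a minimal sketch using only the Python standard library. The `/health` path, the JSON payload, and the helper names `start_health_server` and `check` are my own assumptions for the example, not anything AWS requires. The point is only that the handler answers from process state and never touches the database or cache.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers /health from process state only; no database or cache calls."""

    def do_GET(self):
        if self.path == "/health":  # hypothetical path, pick your own
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence per-request logging; health checks can be high volume.
        pass


def start_health_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check(server):
    """Call /health the way a load balancer would and return (status, payload)."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=2)
    try:
        conn.request("GET", "/health")
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read())
    finally:
        conn.close()
```

With something like this, a slow database does not flip the check to unhealthy, so the load balancer keeps the node in rotation while you investigate the real problem elsewhere.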

Typically, performance should be monitored out of band from health checks. We use statsd heavily and pump all of our metrics, application and database, into it so we can graph and alert based on that.
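To make the statsd idea a bit more concrete, here is a small sketch of timing a block of code and sending the duration as a StatsD timer over UDP, using the plain `<metric>:<value>|ms` datagram format. This is not our production code: the helper names `format_timing` and `timed`, the metric name, and the collector address `127.0.0.1:8125` are assumptions for illustration, and in practice you would more likely use an existing statsd client library.

```python
import socket
import time
from contextlib import contextmanager


def format_timing(metric, millis):
    """Build a StatsD timer datagram: <metric>:<value>|ms."""
    return "%s:%d|ms" % (metric, round(millis))


@contextmanager
def timed(metric, host="127.0.0.1", port=8125):
    """Time the wrapped block and send the duration to StatsD over UDP."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            # UDP is fire-and-forget: a missing collector does not block the app.
            sock.sendto(format_timing(metric, elapsed_ms).encode("ascii"), (host, port))
        except OSError:
            pass  # Never let metrics delivery break the request path.
        finally:
            sock.close()
```

You would wrap the real query, e.g. `with timed("db.query.users"): run_query()`, where `run_query` stands in for whatever your application already does. The timings then come from real user traffic rather than from a synthetic probe.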

Also keep in mind that as you scale, your health check volume will scale as well - we have some services that receive thousands of health check requests a second, and if each of those performed an expensive synthetic query, our data layer would go offline.

The logic also gets more complex as you add caching layers - what should the health check endpoint return if the database is healthy but your KV cache isn't?

Overall, while end-to-end monitoring is critical for an effective monitoring strategy, I would highly recommend out-of-band monitoring of the existing query metrics that are flowing to the database - those are representative of real user performance and will give you a quantifiable metric for how your application is actually performing.

Brennen Smith