
In addition to traditional application logging going into e.g. Elasticsearch, an organisation may have an alerting system such as "Sentry" that receives log messages/exception events sent by applications over HTTP and notifies developers of potential problems.
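
For context, a minimal sketch of what that HTTP reporting might look like with Sentry's official Python SDK; the DSN is a placeholder and the failing database call is a hypothetical stand-in:

    import sentry_sdk

    # Placeholder DSN; every Sentry project has its own.
    sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

    def connect_to_database():
        # Stand-in for a real application call that can fail.
        raise ConnectionError("could not reach the database")

    try:
        connect_to_database()
    except Exception as exc:
        # The SDK serialises the exception and POSTs it to Sentry over HTTP.
        sentry_sdk.capture_exception(exc)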

Suppose that Sentry now contains not only "actionable" events (e.g. an error connecting to the database, which devops should investigate), but has been polluted with a lot of "non-actionable" events (e.g. user input could not be processed; the user is expected to try again, and there is nothing for devops to do).

What are some options for going from a system full of mixed good and bad event data to a clean system with only good data, so that the alerts become meaningful again and don't get ignored?

Examples: 1) Gradually work through each event, starting with the low-hanging fruit/most common events, deciding whether or not it is actionable. 2) Create a new system and gradually transfer actionable events to it.
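
Whichever option is chosen, the triage decisions eventually have to be enforced somewhere. A hedged sketch using the Python SDK's before_send hook, which drops events already classified as non-actionable before they reach the (new or cleaned-up) project; the DSN and the filter patterns are placeholders:

    import sentry_sdk

    # Hypothetical patterns already triaged as non-actionable.
    NON_ACTIONABLE = ("user input could not be processed",)

    def before_send(event, hint):
        # capture_message() puts the text in "message"; the logging integration
        # uses "logentry" (field names can vary by SDK version), so check both.
        message = event.get("message") or (event.get("logentry") or {}).get("message") or ""
        if any(pattern in message for pattern in NON_ACTIONABLE):
            return None  # returning None drops the event before it is sent
        return event

    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
        before_send=before_send,
    )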

2 Answers


Every alert must require intelligent action. Alerts that require no action guarantee alert fatigue, and eventually real problems get missed. Real problems result in status reports about degraded services, or in open issues with software developers.

Creating sane alerting from a noisy system is toil. Most likely the backlog will not be worked fast enough.

Consider declaring alert bankruptcy and removing all alerts. Add back the most basic essentials, like the error ratio on your API servers and the median user response time. For inspiration, see the four golden signals from the Google SRE book.
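
As a rough sketch of what those basic signals look like, here is an error ratio and median latency computed over a recent window; the sample data and thresholds are illustrative, not prescribed:

    from statistics import median

    # Illustrative sample: (http_status, response_time_ms) for a recent window.
    recent_requests = [(200, 45), (200, 52), (500, 950), (200, 48), (503, 1200)]

    errors = sum(1 for status, _ in recent_requests if status >= 500)
    error_ratio = errors / len(recent_requests)
    median_latency_ms = median(rt for _, rt in recent_requests)

    # Thresholds are illustrative; tune them to your own service-level objectives.
    if error_ratio > 0.05 or median_latency_ms > 500:
        print(f"ALERT: error_ratio={error_ratio:.0%}, median latency={median_latency_ms} ms")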

Going forward, do a root cause analysis on unplanned events and near misses. Where you have data that predicts the problem, add an alert. Schedule the alert for removal when the root cause is resolved and the alert has not fired in a long time.

John Mahowald

If your event data has classification levels, you can work your way down from high severity to low. Generally the highest severity (e.g. Fatal) should produce far less output, and hopefully be more important.

You can then work your way down to the lower severities, and stop when you hit diminishing returns; see the sketch below.
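
A hedged sketch of that triage, assuming the events can be exported as (severity, message) pairs; the sample data and severity order are illustrative:

    from collections import Counter

    # Illustrative export of events as (severity, message) pairs.
    events = [
        ("fatal", "error connecting to the database"),
        ("error", "user input could not be processed"),
        ("error", "user input could not be processed"),
        ("warning", "cache miss rate elevated"),
    ]

    # Work down from the highest severity; within each level, the most
    # common messages are the low-hanging fruit to triage first.
    for severity in ("fatal", "error", "warning", "info", "debug"):
        common = Counter(msg for sev, msg in events if sev == severity).most_common(5)
        if common:
            print(severity, common)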

Another option, if the events group together in high volume, is to alert on time-series metrics derived from the logs.
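
For example, a minimal sketch of deriving such a metric, assuming the timestamps of a high-volume event group can be exported; the data and threshold are illustrative:

    from collections import Counter
    from datetime import datetime

    # Illustrative export: one ISO timestamp per occurrence of a high-volume event.
    timestamps = ["2019-03-15T12:41:03", "2019-03-15T12:41:45", "2019-03-15T12:42:10"]

    # Bucket occurrences per minute to get a time series instead of per-event alerts.
    per_minute = Counter(
        datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:%M") for ts in timestamps
    )

    THRESHOLD = 100  # illustrative: alert only when the rate is abnormal
    for minute, count in sorted(per_minute.items()):
        if count > THRESHOLD:
            print(f"ALERT: {count} occurrences in minute {minute}")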

Kyle Brandt
  • The problem I have with "work the red high-priority alerts first" is that all alerts should be actionable, even the low-priority ones. A storage system filling up in 5 weeks is actionable, but does not need to wake someone up. A data input error that tells the user what they did wrong is hardly of interest to software developers, let alone worth an event for IT ops to track down. Log everything, sure, but *alerts* are for when the system is at risk of not delivering its services. – John Mahowald Mar 15 '19 at 12:41
  • @JohnMahowald I agree. I'm not saying to make them all alerts; I'm saying you can sift through the data in logging-severity order to clean up logs and events, and then start making alerts on that data (in aggregate or on specific filtered events) if it makes sense. – Kyle Brandt Mar 15 '19 at 16:45