In the second SLO alerting example of the Site Reliability Engineering workbook, the following statement is made:

To keep the rate of alerts manageable, you decide to be notified only if an event consumes 5% of the 30-day error budget—a 36-hour window

They seem to imply that the 36-hour window is derived from 5% of the 30-day error budget. I see that 36 hours is 5% of 30 days, but why are these two things linked? An event could consume any fraction of an error budget over a window of any size; it depends entirely on what the error budget is.

In addition, it then states the following formula for detection time:

((1 − SLO) / error ratio) × alerting window size

Why is the detection time proportional to the alerting window size? If a sudden spike in errors triggers an alert, then as long as the alerting window covers the period over which the errors occurred, the detection time should be the same for any alerting window size.
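
To make my reading of the formula concrete, here is a minimal Python sketch of the arithmetic. The 99.9% SLO and the error ratios are values I picked for illustration, and `detection_time_hours` is a hypothetical helper name, not something taken from the workbook.

```python
def detection_time_hours(slo, error_ratio, window_hours):
    """Detection time as I read the formula: (1 - SLO) / error ratio * window."""
    return (1 - slo) / error_ratio * window_hours


# 5% of a 30-day period, expressed in hours (the "36-hour window").
window_hours = 0.05 * 30 * 24

# Assumed SLO of 99.9% and a few assumed error ratios, from total outage down to 1%.
for error_ratio in (1.0, 0.1, 0.01):
    seconds = detection_time_hours(0.999, error_ratio, window_hours) * 3600
    print(f"error ratio {error_ratio}: detected after {seconds:.1f} s")
```

With these assumed numbers, a total outage (error ratio of 1) comes out at roughly 130 seconds, and the result scales linearly with the window size, which is exactly the dependence I do not understand.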

I suspect that what I am missing is the same for both statements, which is why I am asking about them together.

dippynark