I am trying to configure Azure Monitor alerts for a Kubernetes cluster so the administrator is alerted when something is not running. The three conditions I need to monitor, in decreasing order of priority, are:

  • There are Services with no endpoints for more than x minutes (a query sketch for this one follows the list).
  • There are Pods that are not Ready for more than y minutes.
  • There are more than z Pods in Evicted state.

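For the first condition, the closest I can come up with is joining KubeServices against KubePodInventory and alerting on Services that had no Running pod behind them. I am not sure this is reliable: I am assuming KubePodInventory.ServiceName links pods to the Service selecting them, and PodStatus == 'Running' does not mean the pod is actually Ready, so this only approximates "has endpoints".

// Rough sketch only: Services (other than ExternalName) with no Running pod
// associated with them in the last x minutes. Readiness is not captured here,
// so a Running-but-not-Ready pod would still hide the problem.
let lookback = 15m;   // x minutes
KubeServices
| where TimeGenerated > ago(lookback)
| where ServiceType != 'ExternalName'
| distinct ServiceName, Namespace
| join kind=leftanti (
    KubePodInventory
    | where TimeGenerated > ago(lookback)
    | where PodStatus == 'Running' and isnotempty(ServiceName)
    | distinct ServiceName, Namespace
  ) on ServiceName, Namespace

The "for more than x minutes" part would presumably be handled by the alert rule's evaluation period and frequency rather than in the query itself.
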
What I actually tried was setting this up in the Log Analytics workspace with the query

KubeEvents
| where KubeEventType == 'Warning'

but the problems are that

  1. I don't see any events related to Services losing their last endpoint. But that should be the most important thing to watch for, because it means that something isn't running.

  2. The events arrive aggregated. When a pod is not ready, there is a message that looks something like

    Reason: Unhealthy
    Message: Liveness probe failed: …
    KubeEventType: Warning
    FirstSeen: 2022-06-06T17:07:52Z
    LastSeen: 2022-07-11T22:31:32Z
    Count: 12

    Which means the liveness probe failed 12 times … in the last month. So let's find the previous entry. Well, it looks like this

    Reason: Unhealthy
    Message: Liveness probe failed: …
    KubeEventType: Warning
    FirstSeen: 2022-06-06T17:07:52Z
    LastSeen: 2022-06-28T11:32:15Z
    Count: 10

    OK, so by searching through the history we can find the previous value. It is from long ago, so the 2 extra failures happened somewhere since then, i.e. “recently”. Unfortunately I still don't know over what time span that is. (A possible workaround is sketched below.)
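
    A possible workaround might be to diff Count between consecutive records of the same event, something along these lines (I have not verified that Count needs the cast, or that this behaves sensibly when an event record expires and the counter restarts):

    // Sketch: per-record increase of Count for the same Warning event, so
    // "2 new failures" shows up with a timestamp instead of a lifetime total.
    KubeEvents
    | where KubeEventType == 'Warning'
    | extend C = toint(Count)          // cast in case Count arrives as a string
    | sort by Namespace asc, Name asc, Reason asc, TimeGenerated asc
    | extend PrevC = iif(prev(Namespace) == Namespace and prev(Name) == Name and prev(Reason) == Reason, prev(C), 0)
    | extend NewOccurrences = C - PrevC
    | where TimeGenerated > ago(30m) and NewOccurrences > 0
    | project TimeGenerated, Namespace, Name, Reason, Message, NewOccurrences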

Is there a better way to monitor that there are no failing components in a Kubernetes cluster in Azure? Either via logs (we have some Kubernetes clusters outside Azure that log into an Azure Log Analytics workspace using a manually installed omsagent), or via anything else available for at least AKS?
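
For completeness, for the second and third conditions the best I can sketch so far is against KubePodInventory, but I am not sure pod readiness or the eviction reason are actually exported there; the PodStatus and ContainerStatusReason values below are guesses on my part.

// Condition 2 (rough proxy): pods whose latest record is not Running/Succeeded.
// Readiness itself does not seem to be exported, so this only catches pods
// stuck in Pending/Failed/Unknown, not Running-but-not-Ready ones.
KubePodInventory
| where TimeGenerated > ago(10m)        // y minutes
| summarize arg_max(TimeGenerated, PodStatus) by Name, Namespace
| where PodStatus !in ('Running', 'Succeeded')

// Condition 3: count of Evicted pods. I am guessing the reason lands in
// ContainerStatusReason; it may only be visible as PodStatus == 'Failed'.
KubePodInventory
| where TimeGenerated > ago(15m)
| where PodStatus == 'Failed'
| summarize arg_max(TimeGenerated, ContainerStatusReason) by Name, Namespace
| where ContainerStatusReason == 'Evicted'
| summarize EvictedPods = dcount(Name)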
