
I'm trying to write a general rule that fires an alert when a discovered target goes missing - in particular, Kubernetes pods annotated for scraping and auto-discovered using kubernetes_sd_configs.

Expressions of the form absent(up{job="kubernetes-pods"} == 1) do not return any of the additional labels that were available as part of the up time series. If a pod is deleted (say by mistake), it disappears as a target from Prometheus. An alert based on absent() fires, but I have no information about which pod has gone missing.
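
For reference, this is roughly the rule shape I've been testing (group and alert names are placeholders); the annotation has no pod-level labels to work with, which is exactly the problem:

    groups:
      - name: missing-pods
        rules:
          - alert: ScrapeTargetGone
            expr: absent(up{job="kubernetes-pods"} == 1)
            for: 5m
            annotations:
              # absent() does not carry the pod-level labels (kubernetes_pod_name,
              # app, ...) that the original up series had.
              summary: "A kubernetes-pods scrape target has disappeared"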

I think the same happens for auto-discovered Kubernetes services: if one is deleted by mistake, it just disappears as a monitored target. I'm not sure whether the behavior is the same for target_groups (https://prometheus.io/blog/2015/06/01/advanced-service-discovery/) with an IP range - that is, if the physical node is turned off, do its metrics just stop, with no up == 0 available?

What is the correct way to detect, in a general way, that an auto-discovered target is gone? Or do I need to hard-code rules for each service/node/pod explicitly, even though they were auto-discovered?

Budric

2 Answers


Or do I need to hard-code rules for each service/node/pod explicitly, even though they were auto-discovered?

Yes, you need a rule for every individual thing you want to alert on being missing, as Prometheus doesn't know their labels from anywhere else - once the target is gone, service discovery no longer returns them.

The usual alert is absent(up{job="kubernetes-pods"})
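
Spelled out as rules, that looks something like the following (the app values are placeholders for whatever you run; absent() will carry the job and app labels because they appear as equality matchers in the selector):

    groups:
      - name: missing-targets
        rules:
          # Fires if service discovery returns no kubernetes-pods targets at all.
          - alert: KubernetesPodsAbsent
            expr: absent(up{job="kubernetes-pods"})
            for: 10m
          # One hard-coded rule per thing you care about.
          - alert: FooAbsent
            expr: absent(up{job="kubernetes-pods", app="foo"})
            for: 10m
          - alert: BarAbsent
            expr: absent(up{job="kubernetes-pods", app="bar"})
            for: 10m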

brian-brazil
  • Thank you for your response. I'm not sure I understand the statement "Prometheus doesn't know about their labels". Querying and displaying a graph of up{job="kubernetes-pods"} does show legend entries with labels, for example `up{app="xx",instance="xx:xx",job="kubernetes-pods",kubernetes_namespace="xx",kubernetes_pod_name="xx"}`; the line then goes from 1 to nothing. Is there no way to express a query that finds those time series, as shown in the graph with their labels, when they disappear? If not possible, will it be in the future? I'm not sure what the purpose of auto discovery is when alerts are manual. – Budric Nov 01 '18 at 19:18
  • This is the only alert that's manual, as you're alerting on the auto-discovery not returning anything. You need to tell Prometheus what 'nothing' it is looking for. – brian-brazil Nov 02 '18 at 06:28
  • It seems I need an alert `absent(up{job="kubernetes-pods", app="foo"})`, another one `absent(up{job="kubernetes-pods", app="bar"})`, and so on for each pod/service/node, because the single alert `absent(up{job="kubernetes-pods"})` has none of the labels such as "app", "instance" or "namespace" to report in the alert text. – Budric Nov 02 '18 at 14:21

We've been solving something similar. Our setup: when some service starts somewhere, some metrics appear with a non-zero value. Then, if any of those metrics go missing, we want an alert.

In our case, the proper expression to achieve that is

count (our_metric offset 1h > 0) by (some_name) unless count(our_metric) by (some_name)

This returns a vector containing the metrics that were present an hour ago but aren't present now. The values of the result are the count(...) from the LHS (which can even be useful).

You can use any LHS/RHS. Read more about the unless operator.
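
As an illustration, wrapped into an alerting rule this might look roughly as follows (our_metric and some_name are the placeholders from above, and the group/alert names are arbitrary):

    groups:
      - name: disappeared-metrics
        rules:
          - alert: MetricDisappeared
            # Series that existed (with a non-zero value) an hour ago
            # but are not being reported now, grouped by some_name.
            expr: >
              count by (some_name) (our_metric offset 1h > 0)
              unless
              count by (some_name) (our_metric)
            for: 5m
            annotations:
              summary: "{{ $labels.some_name }} has stopped reporting our_metric"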

David
  • Thank you very much for this! I've been looking for quite some time to find a query that helps me track down Docker containers that disappeared. If anybody else needs the same: `count by (name) (container_cpu_user_seconds_total{image!=""} offset 1h > 0) unless count by (name) (container_cpu_user_seconds_total)` – Jan Grewe Aug 10 '19 at 16:14
  • Unfortunately, it seems the expression also makes the alert reset itself after some time. Thus, if the alert was not caught when it triggered, no one realizes the alert has fired. Is there a solution for the alert not to reset itself? – Alex F Oct 18 '19 at 08:08