Our product lives in a Kubernetes cluster on our server. It is not in production yet, so there are multiple instances running in the cluster for different purposes, each in its own namespace. I need to run some load tests on one of the namespaces and I need to monitor CPU usage meanwhile. We have Prometheus and Grafana for monitoring.
One of the objectives of these tests is to learn what load drives CPU usage to its maximum.

So I'm looking for a way to query the CPU usage of a namespace as a percentage.

Here is what I put together based on examples:

sum (rate (container_cpu_usage_seconds_total{namespace="$Namespace"}[1m])) / sum(kube_pod_container_resource_limits{resource="cpu", unit="core", namespace="$Namespace"}) * 100

However, something must be wrong with this solution because occasionally values over 100% show up on the dashboard. Thinking the units must be different, I tried to look up the exact specification of these metrics but I didn't succeed.

(Sadly, I don't even know much about how CPU usage is calculated and what a 100% actually means.)

I searched for metrics that could be used for this problem through a few exporters: cAdvisor, Node, kube-state-metrics and more. Even in this seemingly exhaustive article, which was brought to my attention, it is stated that the metric I'm looking for is an important one but no way is provided to query it.

Any help would be appreciated, thank you.

zslim

2 Answers

I found out why I couldn't use the metric I cited above. Usually only a few pods have a CPU limit set at all. Limits are not required in general, and setting them on every pod would make the cluster clumsy.

So

sum(kube_pod_container_resource_limits{resource="cpu", unit="core", namespace="$Namespace"})

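does sum all the existing limits on the pods of the namespace, but that's not the theoretical 100% CPU usage of the namespace: pods without a limit still add to the usage in the numerator while adding nothing to the denominator. This is why percentages over 100% appear sometimes.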
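To make the mismatch concrete, here is a small Python sketch with made-up numbers. The pod names, usage values and limits are all hypothetical; the point is only that usage is summed over every pod while the denominator covers just the pods that declare a limit.

```python
# Hypothetical per-pod CPU usage in cores.
usage_cores = {"api": 0.8, "worker": 1.5, "cache": 0.4}

# Hypothetical CPU limits in cores; only one pod declares a limit.
limit_cores = {"api": 1.0}


def percent_of_limits(usage, limits):
    """Mimic sum(usage) / sum(limits) * 100 over the whole namespace."""
    return sum(usage.values()) / sum(limits.values()) * 100


pct = percent_of_limits(usage_cores, limit_cores)
print(f"{pct:.0f}%")  # well above 100% with these numbers
```

With these numbers the namespace uses 2.7 cores in total but the only declared limit is 1 core, so the ratio lands far above 100% even though nothing is misbehaving.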

However, I learned that theoretically the namespace could use up all the resources delegated to the nodes of the cluster. I also learned that our product would likely run on machines very similar to this test server in production. So to get the CPU usage as a percentage, it is valid to calculate namespace CPU usage / available CPU in cluster in my lucky case.

Here is how I do that:

sum (rate (container_cpu_usage_seconds_total{namespace="$Namespace"}[1m])) / sum(machine_cpu_cores) * 100

where $Namespace is the name of the namespace.
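For anyone unsure about the units: container_cpu_usage_seconds_total is a counter of cumulative CPU seconds, so its rate is CPU-seconds per second, which is the same as the number of cores in use. The Python sketch below mimics the arithmetic of the query with hypothetical samples and a hypothetical 8-core cluster. It is a simplification, since the real rate() also handles counter resets and extrapolation.

```python
def rate_per_second(samples):
    """Rough stand-in for PromQL rate(): counter increase / elapsed seconds.

    samples: list of (unix_timestamp, cumulative_cpu_seconds), oldest first.
    Counter resets and extrapolation are ignored in this sketch.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples over a 60 s window for two containers in the namespace.
containers = [
    [(0, 100.0), (60, 130.0)],  # 30 CPU-seconds in 60 s -> 0.5 cores
    [(0, 400.0), (60, 490.0)],  # 90 CPU-seconds in 60 s -> 1.5 cores
]
cluster_cores = 8  # hypothetical value of sum(machine_cpu_cores)

used_cores = sum(rate_per_second(s) for s in containers)
percent = used_cores / cluster_cores * 100
print(used_cores, percent)
```

So 100% here means every core of every node in the cluster is fully busy with this namespace's containers.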

(The same applies to memory usage.)

So this is what I'm going to monitor while running load and stress tests.

zslim

You can check the CPU usage of a namespace by using arbitrary labels with Prometheus. That article fully describes what you need to do. The formula will look similar to this:

namespace:container_cpu_usage_seconds_total:sum_rate =
   sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (namespace)

namespace:container_memory_usage_bytes:sum =
   sum(container_memory_usage_bytes{image!=""}) by (namespace)
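As a rough illustration of what sum(...) by (namespace) does, here is a Python sketch over hypothetical series; the namespaces, pod names and values are invented. The values stand for per-container rates, which are in cores because the underlying counter is in CPU seconds. As far as I know, the image!="" filter is there to drop the aggregate cgroup series that cAdvisor also exports, which would otherwise be counted on top of the per-container ones.

```python
from collections import defaultdict

# Hypothetical per-container CPU rates in cores, each with its label set.
series = [
    ({"namespace": "staging", "pod": "api-1"}, 0.5),
    ({"namespace": "staging", "pod": "worker-1"}, 1.5),
    ({"namespace": "demo", "pod": "api-1"}, 0.25),
]


def sum_by(series, label):
    """Add up values that share the same value of the given label."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


by_namespace = sum_by(series, "namespace")
print(by_namespace)
```

The result is one value per namespace, in cores, not a percentage.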

Another approach is to use a Prometheus exporter that lets you easily get the CPU usage by namespace, node or node pool.

aga
  • Hey, thanks for your answer. Both queries you cited give the current CPU usage of the namespaces in cores or CPU time (would be nice to know which), but that's not what I need. I need CPU usage as the proportion of the maximum CPU usage. Thank you for the exporter recommendation, I think it has the thing that I need but unfortunately I have a fixed set of exporters and kube-eagle is not in it. :( – zslim Aug 13 '19 at 12:17