This question relates to Linux cgroups CPU accounting.

I noticed this at the container level, but it persists up to the top level. For instance:

# cat /sys/fs/cgroup/cpu/cpuacct.stat /sys/fs/cgroup/cpu/cpuacct.usage
user 34618
system 18038
743932863030

The first two values (cpuacct.stat) are in hundredths of a second and the third (cpuacct.usage) is in nanoseconds, i.e. 346.18 s, 180.38 s and 743.932863030 s.

My question is: why do the first two not add up to the third?

You might think "ah, they start from a different origin", so here are the same metrics a few minutes later:

# cat /sys/fs/cgroup/cpu/cpuacct.stat /sys/fs/cgroup/cpu/cpuacct.usage
user 40028
system 22098
818501029494

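The deltas are 54.10 s (user), 40.60 s (system) and 74.57 s (usage). User plus system grew by 94.70 s, while usage grew by only 74.57 s.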
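As a sanity check on the arithmetic, the deltas can be recomputed in whole nanoseconds. This is only a sketch: it assumes USER_HZ is 100 (so one cpuacct.stat tick is 10,000,000 ns), and the helper name ticks_to_ns is made up for illustration.

```shell
#!/bin/sh
# Worked check of the deltas quoted above, in integer nanoseconds.
# Assumption: USER_HZ is 100, so one cpuacct.stat tick is 10,000,000 ns.
USER_HZ=100
NS_PER_TICK=$((1000000000 / USER_HZ))

# ticks_to_ns TICKS: convert cpuacct.stat ticks to nanoseconds (hypothetical helper).
ticks_to_ns() {
    echo $(( $1 * NS_PER_TICK ))
}

# Samples copied from the two readings in the question.
user_delta_ns=$(ticks_to_ns $((40028 - 34618)))
system_delta_ns=$(ticks_to_ns $((22098 - 18038)))
usage_delta_ns=$((818501029494 - 743932863030))

stat_delta_ns=$((user_delta_ns + system_delta_ns))
gap_ns=$((stat_delta_ns - usage_delta_ns))

echo "stat delta (user+system): $stat_delta_ns ns"
echo "usage delta:              $usage_delta_ns ns"
echo "gap:                      $gap_ns ns"
```

With these samples the gap comes out at 20131833536 ns, i.e. roughly 20 seconds over a few minutes of wall time.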

Bryan
  • Maybe this disclaimer at the end of the respective Linux [kernel docs section](https://www.kernel.org/doc/Documentation/cgroup-v1/cpuacct.txt) is helpful: `It is possible to see slightly outdated values for user and system times due to the batch processing nature of percpu_counter`. – Michael Hausenblas Jan 05 '18 at 11:25
  • Yeah, I saw that, it just seems the differences I'm seeing are much larger than I'd expect from "slightly outdated values" – Bryan Jan 05 '18 at 11:32

2 Answers

I'm not a kernel developer, but from digging through the kernel source code, cpuacct.usage (updated via cgroup_account_cputime) and cpuacct.stat (updated via cgroup_account_cputime_field) seem to be calculated by different kernel components.

From what I understand, the output of cpuacct.stat seems to depend heavily on kernel configuration, in particular CONFIG_VIRT_CPU_ACCOUNTING_GEN, CONFIG_VIRT_CPU_ACCOUNTING_NATIVE and CONFIG_VIRT_CPU_ACCOUNTING. From their descriptions, these options seem to make the accounting more precise. A relevant file is kernel/sched/cputime.c, where timing updates seem to be triggered by kernel events (IRQs etc.).

The output of cpuacct.usage seems to be calculated by the scheduler when switching between tasks. For example, update_curr, which calls cgroup_account_cputime, is called from enqueue_entity and dequeue_entity, which seem to be part of task scheduling. This path does not seem to be as affected by kernel configuration.
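To illustrate why two accounting paths can disagree, here is a toy model, not kernel code. It assumes that one counter is fed by periodic tick sampling (which may or may not be the case, depending on the configuration options above) and the other by exact run times. The function simulate, its parameters and the 100 Hz tick are all made up for illustration.

```shell
#!/bin/sh
# Toy model only: this is not kernel code, and all names are hypothetical.
# A task runs BURST_US microseconds starting PHASE_US into every tick period;
# the timer tick fires TICK_PHASE_US into each period.
PERIOD_US=10000   # one tick period at an assumed 100 Hz

# simulate PHASE_US BURST_US TICK_PHASE_US PERIODS
# Prints "<tick-sampled us> <precisely measured us>".
simulate() {
    phase=$1 burst=$2 tick_phase=$3 periods=$4
    sampled=0 precise=0 i=0
    while [ "$i" -lt "$periods" ]; do
        # Precise accounting: charge exactly the time the task ran.
        precise=$((precise + burst))
        # Tick sampling: charge a whole period if the task is on the CPU
        # at the instant the tick fires, otherwise charge nothing.
        if [ "$tick_phase" -ge "$phase" ] && [ "$tick_phase" -lt $((phase + burst)) ]; then
            sampled=$((sampled + PERIOD_US))
        fi
        i=$((i + 1))
    done
    echo "$sampled $precise"
}

simulate 0 1000 500 100    # tick lands inside every burst
simulate 0 1000 5000 100   # tick always misses the burst
```

In this model the same workload (100 ms of real CPU time) is reported as 1000000 µs in the first case and 0 µs in the second, so a sampled counter can both overshoot and undershoot a precisely measured one.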

user690623

cpuacct.stat contains the CPU usage accumulated by the process(es) in the cgroup, expressed in USER_HZ ticks (typically 1/100th of a second), also called "user jiffies". It may not be as precise as the CPU times accounted in nanoseconds.

You can obtain USER_HZ from the shell (it is typically 100):

$ getconf CLK_TCK
100

This should be mapped to a number of scheduler ticks per second, unless you are on a real-time or tickless kernel.
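The conversion from ticks to seconds can be sketched as below. The helper stat_to_seconds is hypothetical; it takes the tick rate as its argument (for example the output of getconf CLK_TCK) and reads cpuacct.stat-style lines on stdin.

```shell
#!/bin/sh
# stat_to_seconds HZ: read "name ticks" lines on stdin, print "name seconds".
# HZ is the tick rate, e.g. the output of `getconf CLK_TCK`.
# Hypothetical helper, not part of any standard tool.
stat_to_seconds() {
    awk -v hz="$1" '{ printf "%s %.2f\n", $1, $2 / hz }'
}

# Sample values from the question, with an assumed tick rate of 100.
printf 'user 34618\nsystem 18038\n' | stat_to_seconds 100
```

On a cgroup v1 host this could be run as stat_to_seconds "$(getconf CLK_TCK)" < /sys/fs/cgroup/cpu/cpuacct.stat, so the tick rate is not hard-coded.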

cpuacct.usage gives the overall CPU time in nanoseconds, measured as precisely as the kernel can report usage times.

cpuacct.usage_all and cpuacct.usage_percpu report usage per CPU core (hardware thread), again measured in nanoseconds.
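One way to cross-check the per-CPU figures against the total is to add them up. The sketch below assumes that cpuacct.usage_percpu is a single line of whitespace-separated nanosecond counters and that their sum should be close to cpuacct.usage; both are my understanding rather than something stated above. The helper sum_fields is hypothetical.

```shell
#!/bin/sh
# sum_fields: read whitespace-separated integers on stdin and print their sum.
# Uses shell integer arithmetic so large nanosecond counters stay exact.
# Hypothetical helper for illustration.
sum_fields() {
    total=0
    for n in $(cat); do
        total=$((total + n))
    done
    echo "$total"
}

# Made-up sample values, not real measurements.
printf '100 200 300\n' | sum_fields
```

On a cgroup v1 host the input would be /sys/fs/cgroup/cpu/cpuacct.usage_percpu. The counters keep advancing between reads, so a small mismatch with cpuacct.usage is to be expected.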

Note that the cpuacct subsystem was originally written as a demonstration of cgroups capabilities. It wasn't meant for precise reporting.

Tombart