12

top shows an average CPU usage during peak times of about 20% while CloudWatch monitoring shows an average CPU usage of 40%. What causes this discrepancy?

1 Answer

17

A very good observation and we have run into this as well. Here's what I found:

Be careful measuring CPU usage from within an EC2 instance. It’s possible to see CPU usage well below 100%—and yet be completely maxed out. Trust me: been there, done that. (CloudWatch CPUUtilization, by the way, is measured from outside the instance and is always correct.)

There’s a very good description of the whole thing here: https://axibase.com/news/ec2-monitoring-the-case-of-stolen-cpu/

In the example above, the m1.small EC2 instance was allocated 0.4 processor units, so 40% CPU busy is the percentage usage of the underlying core. However, because 40% is the maximum CPU share that can be allocated to this VM, the effective CPU usage is 40%/40% = 100%, which is the number displayed by CloudWatch.

If you’re wondering where the 40% comes from, the math is pretty simple. The m1.small Linux system is entitled to 1 EC2 compute unit, which provides the equivalent CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor. Since the VM runs on a machine with a 2.6 GHz clock speed, it’s entitled to a 38.4% to 46.2% processor share on this particular Xen node. You can run the cat /proc/cpuinfo command to find out the CPU architecture behind your EC2 instances.
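The arithmetic in the quoted passage can be sketched in a few lines of Python. This is only an illustration of the numbers given above (a 1.0–1.2 GHz compute unit on a 2.6 GHz host); the function names `processor_share` and `effective_usage` are made up for this example and are not part of any AWS tool. Note that 1.0/2.6 rounds to 38.5%, so the 38.4% in the quote appears to be truncated rather than rounded.

```python
def processor_share(ecu_ghz_low, ecu_ghz_high, host_ghz):
    """Fraction of one host core that a 1-compute-unit instance is entitled to.

    Returns a (low, high) tuple because the compute unit is specified
    as a clock-speed range rather than a single value.
    """
    return ecu_ghz_low / host_ghz, ecu_ghz_high / host_ghz


def effective_usage(observed_busy_pct, share_pct):
    """Scale usage of the underlying core to usage of the instance's own share.

    observed_busy_pct: CPU busy relative to the whole physical core
    share_pct: maximum share of that core the instance may use
    The result is capped at 100 because the instance cannot exceed its share.
    """
    return min(100.0, observed_busy_pct / share_pct * 100.0)


low, high = processor_share(1.0, 1.2, 2.6)
print(f"entitled share: {low:.1%} to {high:.1%}")

# 40% of the core with a 40% entitlement means the instance is maxed out.
print(effective_usage(40.0, 40.0))  # 100.0
```

With these assumptions, an instance that shows 20% of the core while entitled to 40% is effectively at 50% of what it is allowed to use.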

Pay special attention to the hint about how to deal with tools that don’t know about the special math:

Another option that can be used to retrofit existing agent-based or SNMP-based monitoring tools that don’t integrate with CloudWatch is to use the CPU idle metric. All you need to do is rewrite the rules to measure CPU idle instead of CPU busy. E.g. if you have a >75% threshold defined for CPU busy, create a <25% rule for CPU idle. If CPU idle is 0, then your server is CPU bound.
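As a rough sketch of that idea, the snippet below converts a CPU-busy threshold into the equivalent CPU-idle threshold and computes the idle percentage from two samples of the aggregate `cpu` line in Linux's /proc/stat. The helper names (`idle_threshold`, `parse_cpu_line`, `idle_percent`) and the sample counter values are invented for illustration; a real monitoring agent will have its own way of collecting these numbers.

```python
def idle_threshold(busy_threshold_pct):
    """Convert an 'alert when busy > X%' rule into 'alert when idle < (100 - X)%'."""
    return 100.0 - busy_threshold_pct


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters.

    Field order: user nice system idle iowait irq softirq steal guest guest_nice
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def idle_percent(before_line, after_line):
    """Percentage of time spent idle between two /proc/stat samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    # Use only user..steal: guest time is already counted inside user/nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * deltas[3] / total  # index 3 is the idle counter


# Hypothetical samples: busy and steal ticks advance, idle does not.
sample_1 = "cpu  100 0 50 800 10 0 0 40 0 0"
sample_2 = "cpu  160 0 70 800 10 0 0 140 0 0"

idle = idle_percent(sample_1, sample_2)
print(idle < idle_threshold(75.0))  # True: the idle-based rule fires
```

In the made-up samples the busy counters (user plus system) account for well under half of the elapsed ticks, yet idle is 0, which is the CPU-bound situation the quote describes.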

Very simple. Very nice.

When you run top within the EC2 instance, it measures CPU usage relative to the physical core that runs your instance alongside others. That figure is misleading if you want the CPU usage of your instance alone (the EC2 compute unit assigned to your instance).

That is why the CloudWatch metric is the accurate one: it is measured from outside the instance, against the EC2 compute unit(s) assigned to your instance alone.

See here -- https://forums.aws.amazon.com/thread.jspa?threadID=99993

Johano Fierra
  • 175
  • 1
  • 5
Chida
  • 2,471
  • 1
  • 16
  • 29
  • In other words, they're both right but measuring different things. – bahamat Aug 22 '12 at 18:39
  • 1
    You could put it that way. However, the OP is concerned that what he thinks he sees is not what amazon says he sees. So, in his case, top data is incorrect for him. But, if you would measure the cpu usage of the underlying core to debug performance issues, it's very useful to run top. If you are concerned only about the usage of your instance, cloudwatch is the way to go. So, yes, they both measure different things. – Chida Aug 22 '12 at 18:42
  • 1
    I guess I should have followed my statement with "the former is what you *think* you want, the latter is what you *really* want", but I thought that had already been covered. – bahamat Aug 22 '12 at 19:44
  • +1 for what you just said :) – Chida Aug 22 '12 at 19:47
  • The Metamul link is dead, good task for someone is to find an archive and update the link. – Elijah Lynn Jun 27 '17 at 06:30
  • 1
    I retrieved the content of the dead link from wayback machine and added it to the post directly. – Johano Fierra Sep 13 '18 at 09:27