
Server: PowerEdge R620
OS: RHEL 6.4
Kernel: 2.6.32-358.18.1.el6.x86_64

I'm experiencing application alarms in my production environment. Critical CPU-hungry processes are being starved of resources, causing a processing backlog. The problem is happening on all of the 12th-generation Dell servers (R620s) in a recently deployed cluster. As near as I can tell, instances of this line up with peak CPU utilization, accompanied by massive amounts of "power limit notification" spam in dmesg. An excerpt of one of these events:

Nov  7 10:15:15 someserver [.crit] CPU12: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU0: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU6: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU14: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU18: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU2: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU4: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU16: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU0: Package power limit notification (total events = 11)
Nov  7 10:15:15 someserver [.crit] CPU6: Package power limit notification (total events = 13)
Nov  7 10:15:15 someserver [.crit] CPU14: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU18: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU20: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU8: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU2: Package power limit notification (total events = 12)
Nov  7 10:15:15 someserver [.crit] CPU10: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU22: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU4: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU16: Package power limit notification (total events = 13)
Nov  7 10:15:15 someserver [.crit] CPU20: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU8: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU10: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU22: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU15: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU3: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU1: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU5: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU17: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU13: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU15: Package power limit notification (total events = 375)
Nov  7 10:15:15 someserver [.crit] CPU3: Package power limit notification (total events = 374)
Nov  7 10:15:15 someserver [.crit] CPU1: Package power limit notification (total events = 376)
Nov  7 10:15:15 someserver [.crit] CPU5: Package power limit notification (total events = 376)
Nov  7 10:15:15 someserver [.crit] CPU7: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU19: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU17: Package power limit notification (total events = 377)
Nov  7 10:15:15 someserver [.crit] CPU9: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU21: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU23: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU11: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU13: Package power limit notification (total events = 376)
Nov  7 10:15:15 someserver [.crit] CPU7: Package power limit notification (total events = 375)
Nov  7 10:15:15 someserver [.crit] CPU19: Package power limit notification (total events = 375)
Nov  7 10:15:15 someserver [.crit] CPU9: Package power limit notification (total events = 374)
Nov  7 10:15:15 someserver [.crit] CPU21: Package power limit notification (total events = 375)
Nov  7 10:15:15 someserver [.crit] CPU23: Package power limit notification (total events = 374)
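
To quantify the flood while correlating against CPU utilization graphs, a one-liner like this counts notifications per minute (assuming syslog is writing to /var/log/messages, the RHEL 6 default):

# grep 'power limit notification' /var/log/messages | cut -c1-12 | uniq -c | sort -rn | head

The cut -c1-12 keeps just the month/day/hour:minute portion of the syslog timestamp, so uniq -c yields an events-per-minute count.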

A little Google-fu reveals that this is typically associated with the CPU running hot or voltage regulation kicking in. I don't think that's what is happening here, though. Temperature sensors for all servers in the cluster read normal, Power Cap Policy is disabled in the iDRAC, and the System Profile is set to "Performance" on all of these servers:

# omreport chassis biossetup | grep -A10 'System Profile'
System Profile Settings
------------------------------------------
System Profile                                    : Performance
CPU Power Management                              : Maximum Performance
Memory Frequency                                  : Maximum Performance
Turbo Boost                                       : Enabled
C1E                                               : Disabled
C States                                          : Disabled
Monitor/Mwait                                     : Enabled
Memory Patrol Scrub                               : Standard
Memory Refresh Rate                               : 1x
Memory Operating Voltage                          : Auto
Collaborative CPU Performance Control             : Disabled
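
For what it's worth, the thermal and power-cap claims above can be double-checked from the OS with OMSA as well (both subcommands assume Dell OpenManage Server Administrator is installed; exact output varies by OMSA version):

# omreport chassis temps           # temperature probe readings vs. thresholds
# omreport chassis pwrmanagement   # power budget/cap settings as OMSA sees them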

Here's what my research has turned up so far:

  • A Dell mailing list post describes the symptoms almost perfectly. Dell suggested that the author try the Performance profile, but that didn't help. He ended up applying some settings from Dell's guide for configuring a server for low-latency environments, and one of those settings (or a combination thereof) seems to have fixed the problem.
  • Kernel.org bug #36182 notes that power-limit interrupt debugging was enabled by default, causing performance degradation in scenarios where CPU voltage regulation kicks in.
  • A RHN KB article (RHN login required) mentions a problem impacting PE R620 and R720 servers not running the Performance profile, and recommends an update to a kernel released two weeks ago. ...Except we are running the Performance profile...

Everything I can find online is running me in circles here. What the heck is going on?

Andrew B
    FYI, this issue [has been corrected](http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6bb2ff846f24fa6efee756e5e2a2b8433d65671e) in mainline kernel 3.11. It is due to the kernel interrupt handler triggering for this "normal" non-critical event. The commit linked above disables this handler. – Totor Apr 17 '15 at 06:37

1 Answer


It's not the voltage regulation that causes the performance problem, but the debugging kernel interrupts that are being triggered by it.

Despite some misinformation on Red Hat's part, all of the linked pages refer to the same phenomenon. The voltage regulation happens with or without the Performance profile, likely because the Turbo Boost feature is enabled. Regardless of the reason, these voltage fluctuations interact poorly with the power-limit kernel interrupts that are enabled by default in kernel 2.6.32-358.18.1.el6.x86_64.
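
If you want to confirm that it's the interrupt storm (and not the throttling itself) that lines up with your backlog, the per-CPU thermal interrupt counters climb sharply during these events. A quick diagnostic sketch, sampling the TRM (thermal event interrupts) line from /proc/interrupts twice:

# grep TRM /proc/interrupts; sleep 10; grep TRM /proc/interrupts

If the counters jump by thousands over ten seconds during an alarm window, the interrupt handler is what's eating your CPU time.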

Confirmed Workarounds:

  • Upgrading to the most recently released Red Hat kernel (2.6.32-358.23.2.el6) disables this debugging and eliminates the performance problem.
  • Adding the following kernel parameter to grub.conf disables PLNs: clearcpuid=229 (see the example entry below).
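
For reference, the parameter goes on the kernel line of /boot/grub/grub.conf. In this kernel's CPUID feature table, bit 229 (word 7, bit 5) is X86_FEATURE_PLN, the Power Limit Notification flag, so clearing it prevents the PLN handler from ever being armed. An illustrative entry (the volume group and device names here are placeholders; yours will differ):

title Red Hat Enterprise Linux (2.6.32-358.18.1.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-358.18.1.el6.x86_64 ro root=/dev/mapper/vg_example-lv_root rhgb quiet clearcpuid=229
        initrd /initramfs-2.6.32-358.18.1.el6.x86_64.img

A reboot is required for the change to take effect.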

Flaky Workarounds:

  • Setting a System Profile of "Performance". This by itself was not enough to disable PLNs on our servers. Your mileage may vary.

Bad Workarounds:

  • Blacklisting ACPI-related modules. I've seen this suggested in a few forum threads. It's ill-advised, so don't.
Andrew B
  • Were you not running updates on newly-deployed systems? – ewwhite Nov 08 '13 at 07:38
  • @ewwhite These servers were deployed just before those kernel updates went live. The new RPM was made available on [October 16](http://rhn.redhat.com/errata/RHSA-2013-1436.html). – Andrew B Nov 08 '13 at 07:39
  • Grrr to Red Hat. Nice find. – ewwhite Nov 08 '13 at 08:01
  • Even after the update this issue resurfaced for me after a few weeks (on kernel 2.6.32-431.17.1.el6.x86_64). We had to disable PLNs using clearcpuid to get rid of it this time. This issue has caused me so many headaches already! And we only have one 12G Dell server (and it will remain the only one because of this). – Martijn May 12 '14 at 21:34
  • @Martijn We're currently up to `2.6.32-431.11.2.el6.x86_64` and not experiencing the problem. Many clusters, high loads, etc. It's possible that a regression crept in when Red Hat released that update five days ago. I will let you know what I find and update the answer if I discover that to be the case. – Andrew B May 12 '14 at 21:40
  • Forgot to update this: we did not see this bug manifest again. – Andrew B Oct 22 '14 at 09:05