6

We're a small shop, running a Dell T420 (dual CPU, only one present, 6 cores) w/32GB RAM as our main server. We have only 5 VMs, one of which is our WSE 2012 DC.

From time to time, and at a rate for which we've not been able to establish a reliable pattern, all of our VMs concurrently spike to 100% CPU. The host remains quiet at 4-5%. A warm boot of the host doesn't provide relief, but a cold boot at least puts things back in the box until the problem recurs.

Sometimes we can get a week or more of calm seas out of it; sometimes only a day. A loose pattern seems to be that it kicks off sometime during an extended idle period, e.g. overnight. An examination of the server's temperature logs first led us to suspect overheating, but further investigation into recent incidents has ruled out that lead.

We also found descriptions of similar problems on the Dell forums, with claims of resolution by installing the latest round of Dell updates. We recently engaged in a project to do just that (as an aside, it was quite an adventure getting ~700GB of VHDs safely off of and then back onto that machine), but to our utter dismay it didn't help.

We're absolutely befuddled. So is Microsoft support (or at least first tier support is, even though they try not to act like it). I'm including below our SystemInfo output.

Does anyone know where to start looking?

Thanks

===================================

Host Name:                 SERVER1
OS Name:                   Microsoft Hyper-V Server 2012 R2
OS Version:                6.3.9600 N/A Build 9600
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Server
OS Build Type:             Multiprocessor Free
Registered Owner:          Windows User
Registered Organization:   
Product ID:                06401-029-0000043-76293
Original Install Date:     4/3/2014, 4:07:15 PM
System Boot Time:          5/4/2014, 1:56:47 PM
System Manufacturer:       Dell Inc.
System Model:              PowerEdge T420
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 45 Stepping 7 GenuineIntel ~2200 Mhz
                           [Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20 GHz] (manually added)
BIOS Version:              Dell Inc. 2.1.2, 1/20/2014
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC-09:00) Alaska
Total Physical Memory:     32,723 MB
Available Physical Memory: 12,716 MB
Virtual Memory: Max Size:  37,587 MB
Virtual Memory: Available: 17,129 MB
Virtual Memory: In Use:    20,458 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    OIT
Logon Server:              \\SERVER1
Hotfix(s):                 31 Hotfix(s) Installed.
                           [01]: KB2843630
                           [02]: KB2862152
                           [03]: KB2868626
                           [04]: KB2876331
                           [05]: KB2883200
                           [06]: KB2884846
                           [07]: KB2887595
                           [08]: KB2892074
                           [09]: KB2893294
                           [10]: KB2894179
                           [11]: KB2898514
                           [12]: KB2898871
                           [13]: KB2901101
                           [14]: KB2901128
                           [15]: KB2903939
                           [16]: KB2904266
                           [17]: KB2908174
                           [18]: KB2909210
                           [19]: KB2911106
                           [20]: KB2913760
                           [21]: KB2916036
                           [22]: KB2917929
                           [23]: KB2919394
                           [24]: KB2919442
                           [25]: KB2922229
                           [26]: KB2923300
                           [27]: KB2923768
                           [28]: KB2928193
                           [29]: KB2928680
                           [30]: KB2930275
                           [31]: KB2939087
Network Card(s):           3 NIC(s) Installed.
                           [01]: Broadcom NetXtreme Gigabit Ethernet
                                 Connection Name: NIC1
                                 DHCP Enabled:    No
                                 IP address(es)
                           [02]: Broadcom NetXtreme Gigabit Ethernet
                                 Connection Name: NIC2
                                 DHCP Enabled:    Yes
                                 DHCP Server:     192.168.1.12
                                 IP address(es)
                                 [01]: 192.168.1.135
                                 [02]: fe80::915b:8de0:712e:29f1
                           [03]: Hyper-V Virtual Ethernet Adapter
                                 Connection Name: vEthernet (External NIC 1_Internal)
                                 DHCP Enabled:    No
                                 IP address(es)
                                 [01]: 192.168.1.11
                                 [02]: fe80::2d35:f582:4958:9eb2
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.

== EDIT ======================

I've found the solution to this issue; I waited for over a year to make sure we didn't encounter any more instances of the problem.

Moderators: I'd like to request a reopening of the question, so that I can post the answer.

InteXX
  • Would the reader who down-voted my question come out of hiding and explain why he did so? I tried very hard to provide a thorough explanation of the problem and to demonstrate details of our attempts to resolve it. A foundational tenet of American jurisprudence is the right to confront one's accusers; as an American I greatly value this right and I expect to be allowed to exercise it. Granted I'm new on this forum, but at this moment I'll boldly state that a down-vote without accompanying constructive criticism--especially against a newcomer--is cowardly. – InteXX May 08 '14 at 00:05
  • Despite the effort placed into troubleshooting the matter, this question falls into the *too broad* category. As presented, this is a wild goose chase that could very likely be a malware infection. Even that is a shot in the dark. Regardless of the cause, we can't guide you through the entire process of isolating this. If you can narrow things down a bit, there are plenty of people who would be happy to help you. – Andrew B May 08 '14 at 00:24
  • Regarding the downvote (which I did not leave), we can be overzealous in our voting due to the sheer number of minimal effort *do my research for me* questions that we get daily. You're a well-intentioned newcomer who at least put some effort in, so this probably warrants being put on hold for being "too broad" more than a downvote. – Andrew B May 08 '14 at 00:26
  • @AndrewB - Thank you, Andrew, for your candor (as well as for the reversing up-vote, if that was you). I had no idea. Also, if I knew how to narrow things down a bit, I would be absolutely pleased to do so. Alas, as it is I'm a reluctant SysAdmin who feels lucky to know even this little bit that he does. It isn't much farther into the woods than this that I get lost. – InteXX May 08 '14 at 01:40
  • One thing you can try is ruling out malware. It's not foolproof, but try deploying a brand new VM with no network interfaces (generally this is extreme, but your VM peers are suspect). Leave it running alongside the others. If its CPU does not spike along with the others, then you have at the very least isolated this to traits of the previously deployed VMs. – Andrew B May 08 '14 at 01:46
  • Also, I would deploy this VM from a CD/ISO, not any images you've previously created. – Andrew B May 08 '14 at 01:52
  • Very good suggestion! OK, I'll do exactly that. Tonight. – InteXX May 08 '14 at 01:55
  • It's a complete shot in the dark, but check that your CPU isn't randomly getting throttled down (use cpu-z or something). – devicenull May 08 '14 at 02:04
  • @devicenull - OK, thanks, I'll look into that as well. So far I'm not sure exactly what that means, but often Google can be our friend ;-) – InteXX May 08 '14 at 02:13
  • @devicenull - OK, got it. Will investigate further. – InteXX May 08 '14 at 02:51
  • Is perfmon showing any disk queuing? Do you have the OpenManage packages installed? – SpacemanSpiff May 08 '14 at 02:57
  • @SpacemanSpiff - If you're referring to PerfMon on the host, I've been trying to find that little gem for a LONG time now. Are you aware of how to find/install that on the host? I don't have OM installed, no, but thanks for reminding me--I've been meaning to do that. Is there some diagnostic in there that might help identify this? FWIW, it's PerfMon on the guests that's hogging the CPU (when it tailspins). – InteXX May 08 '14 at 03:11
  • @devicenull - Would a random CPU down-throttle be something that'd show up somewhere in WMI counters? If so I could use a [PS script](http://blogs.technet.com/b/heyscriptingguy/archive/2011/09/26/use-powershell-and-wmi-to-get-processor-information.aspx) to flush it out. – InteXX May 08 '14 at 04:24
  • It's possible the CPU speed is shown in a WMI counter, but I couldn't tell you what one. – devicenull May 08 '14 at 13:14
  • @SpacemanSpiff - I found a good standin for PerfMon on the host, [here](http://blogs.technet.com/b/bruce_adamczak/archive/2013/04/15/windows-2012-core-survival-guide-perfmon-capturing.aspx). – InteXX May 08 '14 at 20:47
  • @devicenull - OK, that tells me what I need to know. Thanks. – InteXX May 08 '14 at 23:21
  • @AndrewB: I want to wait another month or so before I completely shut the door on this issue, just to be sure, but I believe I've discovered the cause. At that time, will there be a possibility of temporarily reopening the question, so that I may elaborate on the solution as a proper answer rather than just place a comment buried deep in a list? My intent with this is for the benefit of future readers who may encounter the same problem. – InteXX Jul 16 '14 at 08:02
  • By all means. Reopen, contribute your answer, and leave it open for a week or so for vetting since open questions tend to see more votes. Accept when you're comfortable. – Andrew B Jul 16 '14 at 12:58
  • @AndrewB: Very good, thank you. As stated, I want to wait at least another month before I make my decision. – InteXX Jul 16 '14 at 19:02
  • Just remember, your edit will trigger the reopen vote. I suggest you flag this post for comment cleanup immediately prior to that, as a comment trail this long is likely to dissuade people from taking a deeper look after the vote. – Andrew B Jul 16 '14 at 20:01
  • @AndrewB: As it's been over a year now with not a single recurrence of the issue, I'd like to provide the solution at which I've arrived. I see your suggestion to simply edit the question, but on second thought wouldn't it be a more appropriate use of the StackExchange system to post it as a separate answer? Please correct me. – InteXX Jul 28 '15 at 01:11
  • Right now your question is closed. You can't answer a closed question. Editing your question pops it back into the reopen queue, but it is unlikely to be voted for re-opening if the question in its current form remains as broad as it is. If there are certain indicators that led you to this solution that were not in the original question, you will need to add them. Without anything to narrow the scope your question remains too broad. – Andrew B Jul 28 '15 at 01:16
  • @AndrewB: I must admit I'm having trouble seeing how the scope is too broad. It was a very specific problem I was having--which I described clearly, as well as explained what solutions I tried. Isn't that why the forum is here, so we can discuss technical problems and their solutions publicly, for the benefit of all? Why keep an environment so strict that it prevents helping others? As it turns out, this is a common problem with these servers--people need to be able to find and implement the fix. Let's not turn them away simply because my question is "too broad." What is that... "too broad?" – InteXX Jul 28 '15 at 01:29
  • If the question as worded has too many potential answers to narrow down (basically every answer is a shot in the dark), it's too broad in a SE context. On a more simple level: the community voted that it was too broad, and they're the ones you'll need to convince that it wasn't. It's generally unwise to enter into this situation expecting a different result and I'm trying to steer you toward a better one. I don't have any more time I can spend on this topic tonight and am nominating it for reopening; what happens happens. If you're confused by the result, inquire politely on meta.SF. – Andrew B Jul 28 '15 at 01:33
  • @AndrewB: Thanks for your vote to reopen. And also for your better explanation of the rules. The rules are the rules... "When in StackLand, do as the StackLanders do." But that doesn't mean I have to like it, nor that I have to stay on board. I'm not yet to the point of jumping ship, but if the rules are so harsh that I can't post a simple question and its answer... well, much more and I might be looking for more fertile and less stony ground. No offense please; you've been very helpful and gracious. I appreciate your candor and your patience with my complaints. – InteXX Jul 28 '15 at 01:38
  • It's not a simple Q&A from our perspective, is the thing. SE was designed for posting questions that can be guided toward an answer, not for questions where everyone throws things at random until they find what sticks. Beyond that, I really recommend that you give the meta site a shot and ask for input on this Q&A...keep it polite like you've been and you'll either get a better feel for this nuance or determine that the sites aren't right for you. – Andrew B Jul 28 '15 at 01:47
  • @AndrewB: "SE was designed for posting questions that can be guided toward an answer, not for questions where everyone throws things at random until they find what sticks." Gotcha. No need for meta, that's good enough for me... busy with other things, as you are. We'll see what the forum gods turn up and be happy with it one way or another. – InteXX Jul 28 '15 at 01:50
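On the throttling question raised in the comments above: the WMI class `Win32_Processor` exposes `CurrentClockSpeed` and `MaxClockSpeed` (both in MHz). The sketch below is only an illustration of how those two values could be compared. It assumes the text comes from something like `wmic cpu get CurrentClockSpeed,MaxClockSpeed /value`, which prints `Key=Value` lines; the function names and the 90% tolerance are hypothetical, and nothing in this thread establishes that these counters reflect real throttling or C-state behaviour on a Hyper-V host.

```python
def parse_clock_speeds(text: str) -> dict:
    """Extract CurrentClockSpeed and MaxClockSpeed (MHz) from Key=Value lines.

    Assumes one processor; with several, later values overwrite earlier ones.
    """
    speeds = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in ("CurrentClockSpeed", "MaxClockSpeed") and value.strip().isdigit():
            speeds[key] = int(value)
    return speeds


def is_throttled(speeds: dict, tolerance: float = 0.9) -> bool:
    """Return True if the current clock is below `tolerance` times the maximum.

    The 0.9 default is an arbitrary illustrative threshold, not a vendor figure.
    """
    return speeds["CurrentClockSpeed"] < tolerance * speeds["MaxClockSpeed"]
```

Logging the result every minute or so, with a timestamp, would at least show whether a clock drop lines up with the moment the VMs spike.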

3 Answers

9

After over a year of waiting so as to prove the solution as valid, I'm finally able to post this answer.

Dell's default BIOS settings have C-States enabled, which put the processor in a low-power mode during idle times. This is what causes the VMs to spiral into 100% CPU usage on a hypervisor host (VMware and Citrix included).

The solution is to set the System Profile setting in the BIOS to Performance, as opposed to Performance per watt [OS] or Performance per watt [DAPC] (the latter being the default).

The relevant Dell documentation (p. 3):

http://en.community.dell.com/techcenter/extras/m/white_papers/20161975/download

And this reply from one of the few Dell support engineers who's familiar with the issue:

The short version is: C-States disable additional processor cores during idle times. For VMs that are tied to a core (this is OS controlled, I do not believe it's configurable), this could result in them locking up, as they're attempting to perform actions with resources that no longer exist in their eyes.

Generally speaking, C-States are used on items like backup servers and secondary role servers (backup DNS, DHCP, domain controllers, etc.) so that the backup servers can remain on, but in a low-power mode to save energy.

Additional documentation can be found here:

http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface

In a nutshell, power idling on a Dell server should always be turned off (set to Performance) for Hypervisor hosts.

Thanks to Eddy Simons at Kitsap Bank for helping me to find this solution.

InteXX
  • In a nutshell, the C3 state (BIOS setup) should always be turned off on servers that host a hypervisor. This behaviour is not exclusive to Dell servers. – Pierre Sioui Oct 22 '15 at 19:49
1

It's unclear as to what the problem is; you already know that. We have no chance of telling you what the cause is.

However, you can run some tests:

  • Build VM 1

    • Run a CPU intensive task on this VM constantly
      (Perform millions of complex mathematical calculations per second)
  • Build VM 2

    • Run a RAM intensive task on this VM constantly
      (Create a giant array in memory, delete it, repeat)
  • Build VM 3

    • Run a DISK intensive task on this VM constantly
      (Read/write/delete millions of lines to/from a file)
  • Build VM 4

    • Run a NETWORK intensive task on this VM constantly
      (Copy files to/from a SMB share)

Wait until the problem occurs again, observe performance data on each of these servers.
Which was most affected?
Were any not affected at all?
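The CPU, RAM and disk tasks described above could be sketched roughly as follows. This is a minimal illustration, not a benchmark: the function names, sizes and durations are all assumptions chosen for the example, and the network test (copying to/from an SMB share) is left out because it depends on the share being used. For a real soak test, each function would be called in an endless loop on its own VM.

```python
import os
import tempfile
import time
from typing import Optional


def cpu_load(seconds: float) -> int:
    """Spin on floating-point math for roughly `seconds`; return iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    x = 0.0001
    while time.monotonic() < deadline:
        x = (x * x + 1.2345) ** 0.5  # arbitrary arithmetic to keep one core busy
        iterations += 1
    return iterations


def ram_load(rounds: int, size_bytes: int) -> int:
    """Allocate a buffer, touch it, free it, repeat; return total bytes allocated."""
    total = 0
    for _ in range(rounds):
        buf = bytearray(size_bytes)
        # Write one byte per 4 KiB step so the pages are actually touched.
        for offset in range(0, size_bytes, 4096):
            buf[offset] = 1
        total += len(buf)
        del buf
    return total


def disk_load(rounds: int, lines: int, directory: Optional[str] = None) -> int:
    """Write a text file, read it back, delete it, repeat; return lines read."""
    read_back = 0
    for _ in range(rounds):
        fd, path = tempfile.mkstemp(dir=directory, suffix=".txt")
        try:
            with os.fdopen(fd, "w") as f:
                for i in range(lines):
                    f.write(f"line {i}\n")
                f.flush()
                os.fsync(f.fileno())  # push the data to disk, not just the cache
            with open(path) as f:
                read_back += sum(1 for _ in f)
        finally:
            os.remove(path)
    return read_back
```

Keeping each load in its own function makes it easy to run exactly one kind of pressure per VM, which is the point of the test: when the spike hits, the VM that behaves differently narrows down the subsystem.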

My guess is that your disks suck and the CPU is waiting for IO operations to complete before continuing, which can cause some applications to flatline the CPU.

Vasili Syrakis
  • This sounds like quite a project. But I'm up to it. In my secret life I write software (don't tell anyone), so this shouldn't be too tough. Thanks. FWIW, the 8 2TB WD server-grade disks are running a single RAID 6 array. If that tells you anything. So you suspect a race condition? – InteXX May 08 '14 at 04:21
  • Ah, so that is the term for it. Yes, I think there may be a race condition at play. But again, that is my best *guess* :) RAID 6 is the same as what I run so I think you could rule out the actual RAID config itself, but maybe not the capacity of IO of the disks. – Vasili Syrakis May 08 '14 at 04:35
  • [> ...that is my best guess] Well it's good enough for me. I'm going to build and run these tests. Wonderful lead, thanks so much. – InteXX May 08 '14 at 04:37
0

Glad I found this. I have a 2012R2 server running Hyper-V on an AMD 6-core CPU. It had been running perfectly for over a year. Suddenly I started seeing VMs that could not be connected to, neither with RDP nor with Hyper-V connect. The only option was to TURN OFF the VM. Shut down did not get a response. So... pull the virtual plug out of the wall. Turn on.

The symptom was that the individual machine seemed to be using 100% of its allocated CPU (ex: a one-core VM on a six-core host was pegged at 16%).
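To spell out the arithmetic behind that figure: a VM that saturates all of its virtual cores shows up on the host as its share of the host's cores. A tiny sketch (the function name is just for illustration, and it assumes one virtual core maps to one host core's worth of time):

```python
def host_cpu_share(vm_vcpus: int, host_cores: int) -> float:
    """Percent of total host CPU used by a VM running all its vCPUs flat out."""
    return 100.0 * vm_vcpus / host_cores


# One vCPU on a six-core host: 100 * 1 / 6, i.e. roughly 16.7%
share = host_cpu_share(1, 6)
```

So a reading stuck at about 16% on a six-core host is what a single-core VM pegged at 100% looks like from outside.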

The problem was sporadic. No apparent rhyme or reason.

It finally occurred to me that this was coincident with my failed attempt to upgrade from 32 to 64GB on that mobo. THAT problem was that I could get 1, 2 or 3 sticks of 16GB memory to work for 16, 32 or 48GB, but not four sticks for 64GB. Lots of horsing around with BIOS settings, etc. No joy on that front. That's when I discovered the wonderful feature on the VM to Enable Dynamic Memory. Turns out I could survive without the 64 gig after all!!

I'm guessing that I turned on power management for the CPU in my tinkering, and then this issue appeared.

I have turned off APM in the BIOS. It'll take a couple of days before I'm 60% confident that this fixed it, and a couple of weeks to declare victory. But this FEELS like a good reason for the problem.

It's been 24 hours now and so far so good.

Fingers crossed.

Thanks for the information!!