25

We have two Windows Server 2008 SP2 (sadly not 2008 R2) Domain Controllers in a small 150 client domain that are exhibiting very "peaky" CPU usage. The Domain Controllers both exhibit the same behavior and are hosted on vSphere 5.5.0, 1331820. Every two or three seconds the CPU usage jumps up to 80-100% and then quickly drops, remains low for a second or two and then jumps up again.

DC3 Task Manager Performance


Looking at the historical performance data for the virtual machine indicates that this condition has been going on for at least a year but the frequency has increased since March.

DC3 Virtual Machine Performance



The offending process is SVChost.exe which is wrapping the DHCP Client (dhcpcsvc.dll), EventLog (wevtsvc.dll) and LMHOSTS (lmhsvc.dll) services. I'm certainly not a Windows internals expert but I could not seem to find anything especially amiss when viewing the process with Process Explorer other than it appears the EventLog is triggering a ton of RpcBindingUnbind calls.

DC3 Process Explorer for SVCHost.exe



At this point I'm out of coffee and ideas. How should I continue to troubleshoot this issue?

  • Just spitballing here: 1. Do you have a monitoring system that queries the event logs on the DC's? 2. Do you have any type of auditing enabled that may be leading to heavy Event Log activity on the DC's? – joeqwerty Apr 25 '14 at 19:32
  • 1
    Wanted to chime in as this thread popped up on a Google search for High CPU Event Log. This issue is still present on Server 2012. Just resolved the exact same issue on a Server 2012 DC. Check Log File sizes. Default log path is %SystemRoot%\System32\Winevt\Logs\ Overwrite radio option runs into trouble dealing with larger log file sizes. I set mine to Archive the log when full and rollover. – KraigM Nov 10 '14 at 19:20
  • For those coming here from Google, this Event Log service problem applies to non-controller Windows Server machines as well. In my case, having enough users with `mmc.exe` (probably the default "Server manager" window?) open achieved regular spikes as well. – Nickolay Jun 19 '17 at 05:56

4 Answers4

25

TL;DR: EventLog file was full. Overwriting entries is expensive and/or not implemented very well in Windows Server 2008.


At @pk. and @joeqwerty suggestion and after asking around, I decided that it seemed most likely that a forgotten monitoring implementation was scraping the event logs.

I installed Microsoft's Network Monitor on one of the Domain Controllers and started filtering for MSRPC using the ProtocolName == MSRPC filter. There was lots of traffic but it was all between our remote site's RODC and unfortunately did not use same destination port as the listening EventLog process. Darn! There goes that theory.

To simplify things and make it easier to run monitoring software I decided to unwrap the EventLog service from SVCHost. The following command and a reboot of the Domain Controller dedicates one SVCHost process to the EventLog service. This makes investigation a little easier since you do not have multiple services attached to that PID.

SC config EventLog Type= own

I then resorted to ProcMon and setup a filter to exclude everything that did not use that PID. I did not see tons of failed attempts by EventLog to open missing registry keys as indicated as a possible cause here (apparently crappy applications can register as a Event Sources in extremely poor ways). Predictably I saw lots of successful ReadFile entries of the Security Event Log (C:\Windows\System32\WinEvt\Logs\Security.evtx).

ReadFile Security.evtx

Here's a look at the Stack on one of those events: RpcBindingUnbind

You'll notice first the RPCBinding and then RPCBindingUnbind. There were a lot of these. Like thousands per second. Either the Security Log is really busy or something is not working right with the Security.evtx log.

In EventViewer the Security Log was only logging a between 50-100 events per minute which seemed appropriate for a domain of this size. Darn! There goes theory number two that we had some application with very verbose event auditing turned on left in a forgotten corner still dutifully chugging away. There were a still a lot (~250,000) of events recorded even though the rate of events being logged was low. Log size perhaps?

Security Logs - (Right Click) - Properties... and the maximum log size was set for 131,072 KB and log size was currently holding at 131,072 KB. The 'Overwrite events as needed' radio button was checked. I figured that constantly deleting and writing to the log file was probably hard work especially when it was so full so I opted to Clear the Log (I saved the old log just in case we need it for auditing later) and let the EventLog service create a new empty file. The result: CPU usage returned to a sane level around 5%.

  • Nice work. Also, move the TL;DR to the top of the answer? – Zlatko Apr 30 '14 at 19:42
  • Just FYI... this just hit a bunch of our domain controllers, most of which are 2012/2012 R2. So it looks to be equally not-well implemented in newer Windows Server versions. – HopelessN00b Jun 09 '14 at 12:13
  • So this IS my issue, BUT I have set to archive when full and do not over write. Max log size is 1 GB and current size is 639 MB. Stumped on what to do other than maybe clear the log as a test. This is on 2008 R2 Std and is affecting the PDC and secondary DC. Both are VM's. I had to allocate 2 sockets / 1 core for each DC or they would both peg out 1 / 1 allocations and not respond any more. Adding more RAM did nothing. It's constantly using between 60-100% CPU at this point. – Travis Jan 19 '16 at 19:04
  • Saved/cleared the Security log. Still running 74% CPU usage. – Travis Jan 19 '16 at 19:11
5

You may be able to chase this down by creating a small Data Collector Set.

  • Open Performance Monitor and create a new user-defined Data Collector Set.
  • Choose Manual (no template) and select Event trace data only.
  • Add the Active Directory Domain Service: Core data and save the set.
  • Change the Stop Condition under Properties to 1 minute.
  • Start the set and wait.
  • When completed, convert the saved .etl file to a .csv using tracerpt –l “file.etl” –of CSV
  • Analyze the summary.csv and dumpfile.csv data in Excel. You may want to download this Import-DC-Info.xlsm doc to help you with your analysis.

If my hunch is correct, you're going to see some devices (IP:port) hammering your DC.

pk.
  • 6,413
  • 1
  • 41
  • 63
1

Certainly a difficult one. Apart from just leaving it alone (1 CPU / 50% load.. who cares?), you could try to setup a new domain controller and see after some days if this one gives you the same behavior. If it does, you might want to try with a Wireshark trace (obviously, there's something from the Network causing this then)

The next thing which comes to mind is a simple call to microsoft

MichelZ
  • 11,008
  • 4
  • 30
  • 58
-2

Travis, "archive" did not help you. In fact, even clearing the event log when it was 2/3rd grown did not help you. But "archive" did help KraigM.

kce: cleared a 131MB "overwrite" file and saw performance drop from say 55% o 5% but QUESTION: perhaps you eventually you saw high utilization again since this may (a) only be triggered when the overwriting condition is reached or (b) it may get linearly worse as the cleared file increases from 0mb in size to 131MB in size.

Some see this for the security.evtx and one saw it for the Task Scheduler operational log. I suggest completely uninstalling your AV (which one are you using) and trying. Intruders need to hide their tracks and their tracks are made in scheduled tasks they set up or logins they do. So they will hide their tracks by breaking handles to these event logs and rewriting them to skip over their tracks. AVs may may be detecting this in a buggy Way since if it were Microsoft, more of this high utilization would have been reported but I am seeing just a scant few posts when Googling. I am also seeing this on server 2008 R2 for the security.evtx log. No event log subscribers, no external monitors. I observed a couple of AV services (McAfee) running and they had very low total utilization for a server up so many days so I suspect it was uninstalled and only partially so (likely needs McAfee's special uninstaller) and I wonder if there are hooks in the vestige (or even normally installed) McAfee service or McAfee filter drivers running that somehow take a normal write to the event log and decide in their filtering that they need to turn this into a full blown read of the entire event log. Trust me, third party filter drivers from some AV companies are buggy and certainly 10000x buggier than Microsoft's implementation of event logging, which is very likely perfect. In summary, 100% uninstall ALL OF YOUR av AND SEE IF issue resolves. If so, work with your AV company to fix their AV. It's ill-advised to make file exceptions for .EVTX files since you may be losing out on this intrusion detection, which believe me, is important to bad guys.

Also, when using procmon, pay attention to WriteFile calls because the Writefile is what will trigger the filter manager to read the entire file. In my case, the read was initiated approximately 30-seconds after the write completed which might be by design. But it was consistent and in my case the file was 4GB and the file read involved 64K Readfiles each 64KB in length and it utilized 35% of the CPU to accomplish this. Very sad.


Update 03/23/2016 I looked at the filter drivers on this machine after concluding this had to be caused by one of them (the event log mechanism could never be buggy on it's own or the number of reports of this kind would be staggering and it is not). I saw some filter drivers from an AV and from a well known 3rd party company which boosts virtual machine disk performance with look ahead reads and asked their chief architect (who was very kind and gracious) if his product might be over-aggressively reading the entire security event log (which was clearly happening per procmon). This would be helpful for smaller security logs but not the ones of the sizes reported here. No way he said. He agreed it might be the AV.

As I said to the Azure fellow below, we do not have a followup from the original Poster if the problem re-surfaced after clearing the event log because that is a common and mistaken solution since the performance degrades over time yet again. This is called "followup" and I see first-hand the original poster's solution can fool those who don't followup into believing they have solved the problem. I almost got fooled as well. I cleared the event log and performance improved -- but I used procmon and saw the issue will grow and grow slowly over time until it becomes problematic. For some reason, the Azure fellow harshly criticizes me when the original poster didn't followup (may have died, been fired, quit, or gotten busy). The Azure fellow below thinks if the original poster didn't followup, it must be a fixed problem. This is vexing and puzzling because I can't think of anyone who is so highly regarded technically who would take this position. I apologize if I pricked a nerve. Perhaps in my activism elsewhere on the Internet where I call people out I got on his nerve -- here (serverfault) I am simply being kind and sharing deep technical knowledge and the result from Mr. Azure is bullying about whether my technical contribution is even necessary or is for some blog of mine (I have no such blog). I don't yet intend to send this link to around half a dozen key cronies at Microsoft and ask them what the hey is going on with this type of bullying from a key MSFT employee because I am singularly focused on having the best of interest of the community in mind and the responses below from Mr. Azure are, in a few words, unbelievable, vitriolic, unnerving and bullying -- which I am sure some people enjoy doing to others. I was initially offended but am over it and know that, passive or active readers will appreciate what I am saying and appreciate my comments -- I stand 100% behind it without regard for legalistic reasons why it is subtly inappropriate here or not. M. Azure, please practice kindness and refrain from casting my comments in a poor light. Just get over it and show restraint and not comment yet again.

Harry

harry
  • 1
  • 1
  • It appears you are addressing people who have commented, and not the OP and the original question. And you're making suggestions like removing AV. The OP already solved their problem, and identified it as an Event Log issue. I don't see this as a valid answer. – David Makogon Mar 07 '16 at 19:44
  • This was unresolved if you read the posters carefully and my summary. You have to suffer from this issue to parse their words far more carefully then you did and see this. I am sorry you are unable to do so and judged me so harshly. For example, the OP said it returned to a sane 5% but it easily could have returned after clearing the log and he didn't follow up -- in fact this happened to another commenter. Therefore nothing was resolved since he didn't verify the results stayed at 5% permanently. – harry Mar 07 '16 at 21:36
  • Sorry Harry - this isn't an an answer; you're making claims of buggy software and telling the OP to work with their anti-virus company. This is great for your personal blog or an article, but an editorial isn't an answer, to a two-year-old question with an accepted answer, with a root cause unrelated to anti-virus. – David Makogon Mar 08 '16 at 15:24
  • @harry surprisingly I'm back here again trying to figure it out all over again :) No AV on the system. I did a few windows updates and changed the max log file to archive to 500 MB from 1 GB. Even at 1 GB it only rolled over once in 8 months whereas my other DC rolls over quite a bit more. I did follow the "SC config EventLog Type= own" suggestion to break out the log file. After rebooting the evenlog process has dropped to below 1%. The "dhcp and lmhosts" that was attached to the process also are below 1% CPU. I was only registering around 15 security events/second. – Travis Nov 17 '16 at 15:52
  • I suspected an SSO agent I had running had something to do with it as it had many errors but disabling the service didn't result in a drop in the CPU usage even after a reboot. The SSO agent is back up and CPU is still low so who knows. – Travis Nov 17 '16 at 15:52