How to control CPU Usage of ntoskrnl.exe!MiWalkPageTablesRecursively

4

3

Following the post on tracking high CPU usage by the kernel, I thought I had debugged an issue that had been plaguing me, namely 20-30% consistent CPU usage by the System process. See my previous post about it.

I set up Windows Performance Analyzer and was able to trace the CPU usage to this:

WPA Trace Log

Guessing by the function names, I thought it had to do with the page file, so I disabled my page file and restarted, but Windows insisted on having a page file and threw an error. So I created a small pagefile, about 100MB - 2048MB.

That seemed to have solved the problem for a few weeks, but now it's here again, even though the pagefile is only 2GB. It seems to happen after the system has been up for a while. Current uptime is 8 days.

If any kernel expert can give advice on what I should try next, I'd be happy to do it.

However Process Explorer shows a different thread under the system image. I don't know how to reconcile this difference:

InitAnsiStringEx

Process Explorer typically shows the above, though at other times it can show debug filter state, etc.

enter image description here

(It's always ThreadID 56 I believe) But the several trace logs always seem to show what we saw above as the issue.

EDIT

Added images as requested for RAM. This is after a fresh restart where the problem doesn't exist.

RAM Usage Process Details working set

The symbol paths, configured as recommended by the blog to speed up symbol loading:

Process Explorer

procexp symbols path

WPA

WPA symbols path

The file sizes of the cache folders

symbols info

Multiple versions of dbghelp.dll were found on the system. Process Explorer is currently pointed at the system one, but I don't know which one it should point to.

dbghelp.dll versions


UPDATE

After following the link for finding Zombie Processes, I discovered the following data (truncated to remove minor entries)

374 total zombie processes.
334 zombies held by explorer.exe(1768)
    298 zombies of Fences.exe
    9 zombies of LogonUI.exe
    7 zombies of chrome.exe
10 zombies held by ctfmon.exe(4568)
    2 zombies of chrome.exe
7 zombies held by dopus.exe(27672)
    3 zombies of AcroRd32.exe
2 zombies held by RuntimeBroker.exe(12184)
    2 zombies of WWAHost.exe
1 zombie held by SkypeHost.exe(190152)
    1 zombie of SkypeApp.exe
1 zombie held by SecurityHealthService.exe(4536)
    1 zombie of MsMpEng.exe
1 zombie held by svchost.exe(1988)
    1 zombie of userinit.exe
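
The report above has a simple shape: holder lines at the left margin, and indented lines for the zombies each holder keeps alive. Below is a minimal Python sketch, assuming exactly that text layout, that totals the zombies per image name so the worst offender stands out. The name `summarize_zombies` and the sample `REPORT` string are made up for illustration and are not part of any tool.

```python
import re
from collections import Counter

# Matches indented detail lines such as "    298 zombies of Fences.exe".
# Holder lines ("334 zombies held by ...") start at column 0 and are skipped.
ZOMBIE_LINE = re.compile(r"^\s+(\d+) zombies? of (\S+)\s*$")


def summarize_zombies(report: str) -> Counter:
    """Total zombie counts per image name across all holder processes."""
    totals = Counter()
    for line in report.splitlines():
        match = ZOMBIE_LINE.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


# Sample input: the first few lines of the report above.
REPORT = """\
334 zombies held by explorer.exe(1768)
    298 zombies of Fences.exe
    9 zombies of LogonUI.exe
    7 zombies of chrome.exe
10 zombies held by ctfmon.exe(4568)
    2 zombies of chrome.exe
"""

if __name__ == "__main__":
    for image, count in summarize_zombies(REPORT).most_common():
        print(f"{count:5d}  {image}")
```

Summing per image rather than per holder matters here because the same image (chrome.exe in the sample) can be held by more than one process.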

This implies that Fences.exe was the cause, so I've updated that program and will check again later. I also disabled Synergy to ensure that wasn't the cause.

Update 2

After a fresh restart and an update of Fences, the zombie process problem persists, so I will have to uninstall Fences to resolve the issue.

This is the version of fences

enter image description here

and the list of zombie processes after a fresh restart.

16 total zombie processes.
7 zombies held by explorer.exe(9484)
    5 zombies of Fences.exe
    1 zombie of GoogleUpdateCore.exe
    1 zombie of DropboxUpdate.exe
1 zombie held by svchost.exe(1788)
    1 zombie of userinit.exe

sidenote

Wouldn't it be cool if we had software AI that would be able to help with all these things?

Vijay

Posted 2018-09-21T04:41:31.657

Reputation: 842

you haven't posted your HW specs and your workload. I guess you simply need to install more RAM so that Windows doesn't trim working sets – magicandre1981 – 2018-09-21T14:03:53.493

@magicandre1981 I have 32GB RAM. I have updated the post with the details asked by the answer below by jamie hanrahan – Vijay – 2018-09-22T09:22:31.500

@magicandre Windows always runs the working set trimmer. It may not actually trim anything, depending on RAM pressure - but it's always looking to see if it should. – Jamie Hanrahan – 2018-09-22T10:56:58.680

2

the ETL indicates memory issues. I see access to D:\pagefile.sys. Look for Zombie Processes, because you run several synergys.exe over time and maybe they don't free memory.

– magicandre1981 – 2018-09-27T14:55:52.440

@magicandre1981 Thanks for the update. I've updated the post with the details found, and trying to see what can be done to eliminate the zombies! yaay – Vijay – 2018-09-28T05:56:14.433

ok, update/remove Fenses and look what happens – magicandre1981 – 2018-09-28T14:14:35.617

@magicandre1981 I believe I recall a previous Q in which synergy was found to leave zombie processes around. otoh, did you mean "Fences" by Stardock? If so, I run that and have seen no such issues. – Jamie Hanrahan – 2018-09-29T04:52:01.437

@JamieHanrahan Yes it is fences by stardock. I had stopped using it and disabled the fences. Maybe the disabling routine has an error? but this is an old version 3.0.5 I believe. I have updated it and will test and report – Vijay – 2018-09-29T04:58:53.300

new amount of ZombieProcesses is much better. Is the issue now gone? – magicandre1981 – 2018-10-01T14:25:49.563

The count is still steadily increasing. Have reported it to stardock. I will have to uninstall it and see – Vijay – 2018-10-02T13:41:11.303

Ok so now I have 220 zombie fences, but the cpu usage is normal. So I suspect synergy was the cause, as I'd disabled that as well. – Vijay – 2018-10-05T06:10:46.017

Answers

4

The quick answer: Give that routine less work to do. Which I think means either use less virtual address space at one time, or add more RAM.

Details: First, the routine you're seeing, MiWalkPageTablesRecursively, has little to do with the page file directly, but rather with page tables. Page tables are in-memory structures (and are present in all Windows systems regardless of pagefile configuration). Every process has a set of page tables, and there's a set for the OS's address space ("kernel space") as well.

Page tables are composed of page table entries (PTEs); there is one PTE for each page - 4K - of the process's defined virtual address space. By "defined", I mean it includes the process's mapped and private committed address space, and AWE regions if any; it doesn't include reserved or free address space - regions which would throw an access violation if you tried to read or write them.
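
To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. It uses the 4K page size mentioned above and assumes 8 bytes per PTE (the x64 size, which is an assumption of this sketch rather than something stated here). The function names are made up for illustration.

```python
PAGE_SIZE = 4 * 1024  # 4K pages, as described above
PTE_SIZE = 8          # bytes per PTE on x64 (assumption for this sketch)


def pte_count(defined_bytes: int) -> int:
    """Number of PTEs needed to map `defined_bytes` of defined address space."""
    # Round up: a partially used page still needs a whole PTE.
    return -(-defined_bytes // PAGE_SIZE)


def pte_bytes(defined_bytes: int) -> int:
    """Approximate memory taken by the lowest-level PTEs alone."""
    return pte_count(defined_bytes) * PTE_SIZE


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(pte_count(one_gib))  # 262144 PTEs for 1 GiB
    print(pte_bytes(one_gib))  # 2097152 bytes (2 MiB) of PTEs
```

So every gigabyte of defined address space means roughly a quarter of a million entries that any page table walk may have to visit.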

(By the way: Not only will you still have page tables even if you don't have a pagefile. You will also still have paging, and page faults to and from disk, even if you don't have a pagefile.)

The problem here is likely not inherent in MiWalkPageTablesRecursively. After all, this function (or an equivalent under another name) has been part of Windows since NT 3.1. The problem is that it's having to do a lot of work. This likely means that it's being invoked often.

A clue to why this is the case is seen in the routines that are earlier on the stack. (That is, closer to the top on the WPA display.) It looks like the caller of MiWalkPageTablesRecursively in this scenario is MiWalkPageTables, which in turn is being called by MiAgeWorkingSet, which in turn is being called by MiTrimOrAgeWorkingSet, which in turn is being called by MiProcessWorkingSets, which in turn is being called by ... that's as far as we need to go.

Every process in a Windows system has a structure called a "working set list". This is a list of all of the physical page numbers that have been faulted into RAM as a result of the process's page faults. A system thread (the "Balance Set Manager" thread) is awakened once every second to do cleanup and maintenance on every process's working set. So MiProcessWorkingSets iterates through the processes, dealing with each process's working set in turn.

For each process in the system, MiProcessWorkingSets calls MiTrimOrAgeWorkingSet. This routine name refers to "trimming" a working set (which means identifying long-disused pages and evicting them from the process to make room in RAM for other things), or "aging" the working set, which means incrementing the "age" counter on each working set list entry that hasn't been accessed since the last time it was scanned, or zeroing the counter if it has been. (The name refers to the "aging" task that's done in accounting, usually every month or every day.) The "age" counter is then used by the "trim" function to identify the most-disused pages.
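
The age-then-trim idea can be sketched as a toy model in Python. This is only an illustration of the bookkeeping described above, not how the kernel stores or scans working set entries; the dictionary layout and the names `age_pass` and `trim` are made up for this sketch.

```python
def age_pass(ages: dict, accessed: set) -> None:
    """One aging pass: zero the age of pages touched since the last scan,
    and increment the age of every page that was not touched."""
    for page in ages:
        ages[page] = 0 if page in accessed else ages[page] + 1


def trim(ages: dict, count: int) -> list:
    """Evict the `count` oldest (most-disused) pages and return them."""
    victims = sorted(ages, key=ages.get, reverse=True)[:count]
    for page in victims:
        del ages[page]
    return victims


if __name__ == "__main__":
    ages = {"A": 0, "B": 0, "C": 0}
    age_pass(ages, accessed={"A"})       # B and C age to 1
    age_pass(ages, accessed={"A", "B"})  # B resets to 0, C ages to 2
    print(ages)                          # {'A': 0, 'B': 0, 'C': 2}
    print(trim(ages, 1))                 # ['C']
```

Note that even the aging pass alone has to touch every entry, whether or not anything ends up being trimmed.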

From the fact that MiTrimOrAgeWorkingSet ends up in MiWalkPageTablesRecursively, we can infer that they are scanning the virtual address space as defined by the page tables to find the pages that are in the working set. Now consider: The time needed by MiTrimOrAgeWorkingSet to handle each process will be roughly proportional to the size of the process's virtual address space. And the total time needed for each pass through MiProcessWorkingSets will be roughly proportional to the number of processes.

Either this thing is dealing with a very large number of pages in one process's working set, or else it's having to deal with a lot of processes.
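
As a very rough model of that reasoning, the Python sketch below treats the cost of one pass as one unit of work per defined page in each process. The exact linearity and the numbers are assumptions for illustration only; `pass_cost` is a made-up name.

```python
def pass_cost(defined_pages_per_process: list) -> int:
    """Relative cost of one pass over all processes, assuming one unit of
    work per defined page in each process (a deliberate simplification)."""
    return sum(defined_pages_per_process)


if __name__ == "__main__":
    one_huge = [2_000_000]      # a single process with a very large address space
    many_small = [5_000] * 400  # hundreds of small processes
    print(pass_cost(one_huge))    # 2000000
    print(pass_cost(many_small))  # 2000000
```

Under this model the two scenarios cost the same, which is why the trace alone can't tell them apart; you need the per-process numbers.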

And... why would it be so busy? It doesn't "trim" working sets until they've been aged, and the amount by which it "trims" the working sets depends on RAM pressure - that is, how short you are on RAM.

Is your system short on RAM? Please post snaps of Task Manager's Performance tab | Memory page, plus the Details page sorted by the Working set column; plus Resource Monitor's Memory tab, sorted by the Hard Faults column; and RAMmap's Use Counts page.

Also, please post more of the WPA trace you have, showing more "depth" of the calls. Or post the .etl file on a sharing service somewhere and link to it here. (Zip it first - they compress really well.)

Aside: Why routine names don't match between WPA and Process Explorer

As for the routine names, the real question would be "why routine names displayed in Process Explorer are just plain wrong." There are two reasons for this in your case and you have to fix both of them.

The first problem is that it looks like you don't have symbols configured correctly for Process Explorer. Configuring them for Windows Performance Analyzer isn't enough.

A sure sign that you don't have this right is that all or nearly all of the threads in the "System" process show up with a module name (something.sys or something.exe, usually ntoskrnl.exe) followed by an offset, such as +0x245 - as in your screen cap. It's ok to see a few like that, but you should be seeing a whole bunch of ntoskrnl!routineName followed by no offset.

To fix this, see this page from the Windows Performance Analysis Field Guide. You need to set Process Explorer's symbol search path - you can use the same symbol file path you set up for WPA - and you need to point ProcExp at a DLL that comes with the Windows Debugging Tools. So you will need to have the Debugging Tools installed - not that you're using the debugger directly, but Process Explorer needs that DLL.

The second reason for the discrepancy is that even after you have the symbol files set correctly for Process Explorer, the routine names it displays won't often match the names of an inner-level routine identified by Performance Analyzer. You should find a match, though, on a routine name near the beginning of the stack (displayed at the top of the routine call tree as shown in WPA).

For example - in your case the first routine of interest is KeBalanceSetManager. (The two before that are the same for every thread in the system process, but KeBalanceSetManager is the routine that's the "top level" routine for this thread.) Once you have symbols configured correctly, Process Explorer should show you a thread with that as the "Start Address", as shown here:

here

Process Explorer can't show you MiWalkPageTablesRecursively because that is about six calls into the stack from what's recorded as the thread Start Address, and it isn't even the current innermost routine (i.e. it's not at the top of the stack). Such information (even if easily available, which it isn't) would change far too rapidly to be useful in a Process Explorer display, so it doesn't try.

Note: Even with correct symbols it is not uncommon to find a few of the threads in the system process showing "Start Address" of e.g. GemCCID.sys+0xd138, as you'll see in my example. The module in question (GemCCID.sys) is evidently not one for which Microsoft provides symbol files, so Process Explorer just has to say "the thread start address is at 0xd138 bytes from the start of the code in this file, and that's all I know about it."

Hope this helps! Please let me know if you have further questions.

Jamie Hanrahan

Posted 2018-09-21T04:41:31.657

Reputation: 19 777

Thanks a ton for the detailed answer @jamiehanrahan. I have updated the post with the details you sought. According to you it's another process that's the culprit, right? I would think it would be Google Chrome, as that is the most resource hungry app I run, with tons of open windows and tabs. How do I hook into analyzing system function calls for this? – Vijay – 2018-09-22T09:45:39.760

any suggestions for what to do next? Thinking of switching to ubuntu :P – Vijay – 2018-09-25T09:50:18.387

1

Can you share the .etl file? – Jamie Hanrahan – 2018-09-25T10:03:00.810

1

Oh - for Process Explorer's symbol settings: Assuming you're on a 64-bit machine, for the dbghelp.dll path you want C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\dbghelp.dll . For Process Explorer's symbol path you do not want to point to ...\NGenPdbs_Cache! That contains only cached PDBs from the modules that provide support for dot-net code, which is completely irrelevant to the kernel mode code running in the System process. Change it to srv*c:\symbols*http://msdl.microsoft.com/download/symbols . (contd.) – Jamie Hanrahan – 2018-09-26T08:24:19.407

1

(...) You could use srv*c:\symcache*http://msdl.microsoft.com/download/symbols if you wanted to. Just be aware that Process Explorer doesn't know about the new cached symbol format (.symcache files) described at that blog. However the two caches can coexist, so it all works. Don't worry about symbol load times in Process Explorer; the old-style folder-based symbol file cache is plenty fast enough for ProcExp as it has far fewer symbols to look up than WPA ever does. – Jamie Hanrahan – 2018-09-26T08:28:33.260

1

Oh, and speaking of that blog - in WPA's symbol config, ignore the advice to turn off the symbol path that refers to the MS symbol server (you've turned it off in your screen snap). All you have to do is make sure it's last in the list. That way WPA will always use the local caches before going to the web, and it will use the new fast symcache before using the old folder-based cache. – Jamie Hanrahan – 2018-09-26T08:32:07.753

After adding srv*c:\symcache*http://msdl.microsoft.com/download/symbols to Process Explorer, it now shows KeBalanceSetManager as TID 56, and consistent CPU usage around 7% after an uptime of 2 days. – Vijay – 2018-09-26T11:11:55.300

ETL File password is test – Vijay – 2018-09-26T11:23:02.603

1

As you can see from the call tree in your WPA screen snap, KeBalanceSetManager is indeed the top-level routine of that thread (not counting the two above it, which are the same for every thread in that process). I'll have to look at the ETL tomorrow. – Jamie Hanrahan – 2018-09-26T11:37:37.443

So do you think the Zombie process explains all the CPU usage that we see by the KeBalanceSetManager? – Vijay – 2018-10-01T11:04:58.163

Zombie processes plural. Very much plural... It's a possibility. I don't believe KeBalanceSetManager has any way to check if the process it's looking at is a zombie or not, so all those zombies are just giving it more work to do. Are you still getting zombies that are blamed on Fences, even though you've disabled it? Maybe you should just try removing it. – Jamie Hanrahan – 2018-10-01T11:42:09.097

@JamieHanrahan You seem to know a lot about windows internals. What do you know about the following details of the page file? https://imgur.com/a/UDoLZAd I've lost the post but I took a screenshot of it at the time. Nobody answered and well I don't really want to open a question about it and make a song and dance, so i'll slip it in here as it will notify you, can you confirm or add anything? Btw. I'm assuming that the PFN database entry is stored in an array in the page file as well as the page itself.

– Lewis Kelsey – 2019-04-29T00:22:57.893