21

I'm working with an unhealthy Windows 2008 R2 Terminal Server configured in a vSphere environment. It currently has 4 vCPUs and 32GB RAM. No overcommitment.

The concurrent user count on this server has risen sharply in recent months (to ~70) and is possibly over the recommended level. Because of the applications the users run on this system, splitting it into multiple servers will be a challenge beyond the scope of this question.

However, at certain points during the week (and now, almost daily), new user logons produce the following errors: Event ID 1500

Windows cannot log you on because your profile cannot be loaded. Check that you are connected to the network, and that your network is functioning correctly.

DETAIL - Insufficient system resources exist to complete the requested service.

This persists until some users log off, sessions are manually disconnected, or the system is rebooted entirely.

I'd like to know:

  • What resource(s) is this error message referring to? What's actually constrained?
  • Is there an OS-level tunable or configuration that can help with this?
  • Users are content with performance, except for the increased frequency of this error message. Is there something else at play here?
  • Is there an absolute limit to the number of users a terminal server can accommodate? I see 150+ users described in certain tuning guides for Terminal Servers.


ewwhite
  • Is [this your problem?](http://support.microsoft.com/kb/2567018). I can't say that I've experienced this on a Windows Server 2008 **R2** Server, but I ran into it a lot on 2003 and 2008, so maybe it still applies. – HopelessN00b Jan 17 '14 at 18:36
  • @HopelessN00b The **Event ID 1508** that is often referenced does not appear in this environment. Most of my research has led me to solutions geared towards Windows 2003 environments, but maybe my Google skills are off now... – ewwhite Jan 17 '14 at 18:39
  • This is for 2003, but you may want to look at if it seems relevant: http://support.microsoft.com/kb/935649 – ErikE Jan 17 '14 at 18:40
  • @HopelessN00b I checked `RegistrySizeLimit`, and it's not defined. – ewwhite Jan 17 '14 at 18:44
  • 1
    @ErikE Those registry entries are [ignored in 2008 R2](http://blogs.technet.com/b/askperf/archive/2008/02/01/ws2008-upgrade-paths-resource-limits-registry-values.aspx). – ewwhite Jan 17 '14 at 18:47
  • There isn't a specific user limit, but every interactive login does use some fixed resources. Page table entries, non-paged pool, GDI elements, desktop heap, etc. You've definitely got some digging to do. This was much more common on 32-bit terminal servers – mfinni Jan 17 '14 at 21:35

6 Answers

16

This has been solved.

I began to examine the registry because increasing CPU and RAM resources on the virtual machine did not resolve the issue.

I was pointed to Microsoft's dureg tool to estimate the registry's size. Browsing via regedit, I encountered issues opening the keys under HKEY_USERS\.Default\PRINTERS. Using dureg, I started probing under that hierarchy.
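As a rough first check alongside dureg, the on-disk size of the hive files can be listed. This is a minimal Python sketch, not a substitute for dureg: it assumes the hives sit in the default `C:\Windows\System32\config` directory (where the `DEFAULT` file backs `HKEY_USERS\.DEFAULT`) and that it runs elevated under 64-bit Python. The function name `hive_sizes` is made up for illustration.

```python
import os


def hive_sizes(config_dir):
    """Return (file_name, size_in_bytes) for each regular file, largest first."""
    entries = []
    for name in os.listdir(config_dir):
        path = os.path.join(config_dir, name)
        if os.path.isfile(path):  # skip subdirectories such as RegBack
            entries.append((name, os.path.getsize(path)))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    # Assumption: default hive location on the terminal server.
    config = r"C:\Windows\System32\config"
    if os.path.isdir(config):
        for name, size in hive_sizes(config):
            print(f"{name:<30} {size / 2**20:10.1f} MiB")
```

A `DEFAULT` file that is far larger than the other hives would point in the same direction as the dureg results.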


Printers were the problem. The cause and fix are detailed in:
The size of the "HKEY_USERS\.DEFAULT" registry hive continuously increases on a Windows Server 2008 R2 SP1-based server

Hotfix: http://support.microsoft.com/kb/2871131

This apparently stops the growth, but the keys and registry need to be compressed to reclaim space.

Compressing bloated registry: http://support.microsoft.com/kb/2498915

1)  Boot from a WinPE disk.
2)  Open regedit while booted in WinPE, load the bloated hive under HKLM. (e.g. HKLM\Bloated)
3)  Once the bloated hive has been loaded, export the loaded hive as a "Registry Hive" file with a unique name.
4) Unload the bloated hive from regedit.
5) Rename the hives so that you will boot with the compressed hive.
e.g.
c:\windows\system32\config> ren software software.old
c:\windows\system32\config> ren compressedhive software
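For reference, steps 2 to 4 can also be written as `reg.exe` commands instead of regedit clicks. The Python sketch below only builds and prints the command lines; it does not run them. It assumes `reg.exe` is present in the WinPE image and that `reg save` writes the same compacted file as regedit's "Registry Hive" export, which I have not verified. The paths and the helper name `compaction_commands` are placeholders.

```python
import subprocess


def compaction_commands(hive_path, compacted_path, mount_key=r"HKLM\Bloated"):
    """Build the reg.exe command lines for steps 2-4 (load, save, unload)."""
    return [
        ["reg", "load", mount_key, hive_path],             # step 2: mount the bloated hive
        ["reg", "save", mount_key, compacted_path, "/y"],  # step 3: write a fresh copy
        ["reg", "unload", mount_key],                      # step 4: unmount it again
    ]


if __name__ == "__main__":
    # Placeholder paths: the offline Windows volume may get a different
    # drive letter under WinPE.
    commands = compaction_commands(
        r"D:\Windows\System32\config\software",
        r"D:\Windows\System32\config\compressedhive",
    )
    for command in commands:
        print(subprocess.list2cmdline(command))
```

The rename in step 5 is left as a manual step, so the original hive is kept as a fallback.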

Hmm, a few steps... kinda tricky to do remotely during production hours. I tried to reach out to my resident Microsoft expert to complete, but he was busy chasing down some SCCM or SCVMM issue somewhere. Reading through some Citrix-related forums, I took note of a tool that could perform the above with fewer steps...

So I took a virtual machine snapshot, then downloaded and ran freeware registry compression software (Tweaking.com), despite the overwhelming sound of the collective groans of Microsoft systems engineers everywhere...

note the 1.4GB saved in the default Config... tucows

PLEASE REBOOT!

Following a reboot, all was well. The user count reached 86 with no ill effects and no profile-related errors. I've monitored the printer registry hive and it's held stable.

ewwhite
  • Could this have been prevented by disabling RDP Printer Redirection? Sometimes clients will have *terrible* print drivers that get copied up to whatever servers they RDP to. Of course, for a terminal server you might need RDP Printer Redirection... –  Jan 26 '14 at 22:28
  • 1
    @kce All clients in this environment were thin clients, except for maybe 2 or 3 PCs. There could also be an issue with the customer installing local printers on the TS instead of the GPO--distributed printers... but the bug mentioned in the hotfix was an issue regardless. – ewwhite Jan 26 '14 at 22:32
  • Thanks for the diagnosis, hotfix, and tool! I vaguely recall this issue happening to me once, but then an unrelated total corruption happened, so I just reinstalled everything. I'll certainly bookmark this in my Evernote in case I run into a similar problem in the future. Again, thanks! – pepoluan Jan 27 '14 at 05:13
  • For the record, I did the above and it resolved the issue, but now I'm facing another case of registry bloat: `HKU\.DEFAULT\Software\Hewlett-Packard` and `HKU\.DEFAULT\Software\Lexmark` together make up about 1.2GB of the DEFAULT registry file! – ETL Jan 15 '17 at 00:57
3

In Windows Server 2003 that error was a result of kernel memory exhaustion. Because you're dealing with Windows Server 2008 R2, I'm not sure how closely the cause of the problem matches the cause in W2K3, but I would bet that it is a memory issue due to the number of users and processes. I would look at Nonpaged Pool memory exhaustion as the probable cause. In addition, the number of processes is at almost 800, which is quite high. MS would probably tell you to reduce the number of processes, which can only be done by reducing the user load.

This article has some good information regarding memory usage in Windows and how you can view the Nonpaged Pool limit to see if that's the cause of the problem:

https://blogs.technet.com/b/markrussinovich/archive/2009/03/26/3211216.aspx

joeqwerty
  • 2
    800 processes is too high?!? ***But in Linux...*** :( – ewwhite Jan 17 '14 at 21:30
  • Before complaining about 800 processes being high versus Linux, add the "threads" column to process monitor and see how many of them you see... processes in Linux and Windows are different birds. Comparing them is unfair to both kernel designs. – Mark Jan 22 '14 at 17:35
2

Start up Windows Performance Monitor to monitor the various counters:

  • Context Switches
  • Page Table Entries
  • GDI elements
  • Handles
  • … (whatever you can find)

And see if one of these peaks when you get a failed login.
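One way to do that comparison after the fact is to log the counters to CSV and look for each counter's extremes. The Python sketch below assumes a CSV laid out like `typeperf` output (first column is the timestamp, the header row holds the counter paths). The sample rows, the host name `TS01` and the function name `counter_extremes` are invented for illustration. It reports both minimum and maximum because for a counter such as Free System Page Table Entries the low point is the interesting one.

```python
import csv
import io

# Made-up sample in the assumed typeperf CSV layout.
SAMPLE = r'''"(PDH-CSV 4.0)","\\TS01\Memory\Free System Page Table Entries","\\TS01\Process(_Total)\Handle Count"
"01/17/2014 09:00:00.000"," ","61234"
"01/17/2014 09:00:15.000","33510012","74810"
"01/17/2014 09:00:30.000","33498771","70122"
'''


def counter_extremes(csv_text):
    """Map each counter path to ((min, timestamp), (max, timestamp))."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, samples = rows[0], rows[1:]
    result = {}
    for row in samples:
        timestamp = row[0]
        for name, cell in zip(header[1:], row[1:]):
            try:
                value = float(cell)
            except ValueError:
                continue  # skip cells that are not numeric
            low, high = result.get(name, ((value, timestamp), (value, timestamp)))
            if value < low[0]:
                low = (value, timestamp)
            if value > high[0]:
                high = (value, timestamp)
            result[name] = (low, high)
    return result


if __name__ == "__main__":
    for name, (low, high) in counter_extremes(SAMPLE).items():
        print(f"{name}: min {low[0]:.0f} at {low[1]}, max {high[0]:.0f} at {high[1]}")
```

Lining those timestamps up against the Event ID 1500 entries would show whether any counter moves when logons start failing.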

Also: something is causing high kernel CPU% on your system - you should investigate that to see if it leads you to a related problem.


The User Profile Hive Cleanup service may help out here as it "helps to ensure user sessions are completely terminated when a user logs off".

MikeyB
  • Can I just add more vCPUs? – ewwhite Jan 20 '14 at 17:30
  • Adding more processing power won't fix the high kernel% usage, it'll just mask it. Also, it's not likely directly the source of your login failures. – MikeyB Jan 20 '14 at 18:24
  • Which I'm trying to get to the bottom of... – ewwhite Jan 20 '14 at 18:24
  • The UPHClean utility functionality is provided natively through the User Profile Cleanup Service from w2k8 and onward. – ErikE Jan 20 '14 at 20:09
  • @ewwhite [Here's a Microsoft article mentioning PTE exhaustion on W2k3 TS servers](http://blogs.technet.com/b/askperf/archive/2012/06/12/terminal-server-and-ptes.aspx). Might be worth throwing up some perfmon counters to check if that's what's happening to you. – HopelessN00b Jan 20 '14 at 21:18
1

I have very little time so I'll just do a sketchy answer and hopefully flesh it out later.

When I was doing spells in Citrix teams I recall us trying to level to 15-20 users per server, but those had some heavy apps running. These days of x64 we load more users, but 70+ does sound like a lot.

It was not unusual for context switching to be the perfmon counter that maxed out; it would floor a server while other counters such as RAM and CPU looked good. That could be a factor here (the server can't allocate resources before timing out due to excessive context switching). Here are two ways to monitor context switching:

The System\Context Switches/sec counter in System Monitor reports systemwide context switches.

The Thread(_Total)\Context Switches/sec counter reports the total number of context switches generated per second by all threads.
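Both counters can be logged from the command line with `typeperf` rather than watched live. The Python sketch below only builds and prints the command line. The 15-second interval, the 240 samples (one hour) and the output file name are arbitrary choices of mine, and `typeperf_command` is a hypothetical helper.

```python
import subprocess

# The two counter paths named above.
COUNTERS = [
    r"\System\Context Switches/sec",
    r"\Thread(_Total)\Context Switches/sec",
]


def typeperf_command(counters, interval_s=15, samples=240, out_file="ctxswitch.csv"):
    """Build a typeperf command line that logs the given counters to CSV."""
    return [
        "typeperf", *counters,
        "-si", str(interval_s),  # seconds between samples
        "-sc", str(samples),     # number of samples to collect
        "-f", "CSV",
        "-o", out_file,
        "-y",                    # answer yes to prompts, e.g. overwrite
    ]


if __name__ == "__main__":
    print(subprocess.list2cmdline(typeperf_command(COUNTERS)))
```

Collecting a baseline on a quiet day and again during a failure window gives two logs to compare.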

You might also find something of use in the capacity planning guide; you'll find a link to it in this blog post.

I'll expand this answer when I can find the time. For now, I'll add a caution about all time-based measurements taken inside a vSphere virtual machine.

Due to how the vCPU has been abstracted from the physical CPUs, the vCPU does not have a clue what time it is (one virtual second may be more or less than one real, or at least physical, second). As a consequence, all time-based perfmon counters (CPU time, context switches/sec and so on) are inaccurate, sometimes wildly so, even if they may serve as very coarse-grained indicators.

To verify this, compare any native time based CPU counter within the VM with its counterpart on the vSphere host for that VM. For this reason VMware publishes some counters for CPU (and Memory which also is inaccurate from the guest perspective) via VMware tools into two VMguest perfmon objects.

Thus the correct time based values are made available from within the guest perfmon, but only if one looks at the VMware published objects counters.

I just thought this basic info a bit relevant as the answers so far are focusing on time based measurements from within a vSphere virtual machine, where this is in some cases a crucial circumstance for correct analysis. It also of course relates directly to the theme of this particular (unfinished) answer and its comments. It may be of use to someone.

As soon as I get time I'll edit in links to the whitepapers etc which elaborate on this, and the exact counter paths\names. Naturally it is all googleable too.

ErikE
  • Are you suggesting that I need to reduce context-switching? The figures reported via procmon were far lower than other examples I saw online. But can't that be countered by additional hardware/CPU resources? – ewwhite Jan 19 '14 at 18:19
  • I'm suggesting you look at if it may be relevant to your issue. If you have measured it and the amount seems low according to your research it obviously is not. The tolerance level increases linearly for each processor added to the system. However I don't believe there is an absolute threshold level but in principle it needs to be baselined per (healthy) system. – ErikE Jan 19 '14 at 20:06
  • This blog post was just plain interesting from the virtualization perspective, even if probably not relevant: http://professionalvmware.com/2010/11/context-switching-some-resources/ And as seen in this linked doc, cost estimation of virtualized multicore context switching is tricky: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html?m=1 – ErikE Jan 19 '14 at 20:27
1

Well, from what I've read about RDS capacity planning in Server 2008 R2, you might just be running your poor terminal server on insufficient resources for the number of users you have using it. In particular, I notice that you have 80 users on 4 vCPUs, and MS recommends 1 core per 15 users.

From the technet blog titled RDS Sizing and Capacity Planning Guidance:

We always felt the need of Hardware capacity guidance and sizing information for Terminal Services or Remote Desktop services for Server 2008 R2, Whenever I am engaged in any architectural guidance discussion for RDS deployment i always get a question what needs to be taken into consideration while deciding the hardware configuration and to do capacity planning.

Here are some bullet points which I recommend to my partners and customers to consider:

  • 2GB Memory (RAM) is the optimum limit for each core of a CPU. E.g. If you have 4 GB RAM then for optimum performance there should be Dual core CPU.
  • 2 Dual Core CPU perform better then single Quad core processor.
  • Recommended bandwidth for LAN of 30 users and WAN of 20 users. Bandwidth (b) = 100 megabits per second (Mbps) with Latency (l) Less than 5 milliseconds.
  • On a Terminal Server 64 MB per user is the Ideal Memory (RAM) requirement for GP Only use + 2 GB for OS E.g. (100 users * 64) + 2000 = 8.4 GB i.e. 8GB RAM.
  • More applications used (i.e. Office, CAD Apps and etc.) will require more memory per user to be added to this calculation over the 64 MB base memory per user.
  • 15 TS session per CPU core is the optimum performance limit of a Terminal Server.
  • Network should not have more than 5 hops, and latency should be under 100ms.
  • 64 kbps is the Ideal Bandwidth per user session. (256 color, switched network, bitmap caching only)
  • CPU performance degrades if %processor time per core is constantly above 65%.
  • Terminal servers performance doubles when it is running on a X64 HW and OS.

In addition to that, Microsoft has just released a whitepaper on Capacity Planning in Windows Server 2008 R2.

Download it here
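To put numbers on the rules of thumb quoted above (64 MB per user plus 2 GB for the OS, and 15 sessions per core), here is a back-of-the-envelope Python sketch. The constants come straight from the quoted bullet points. The function name `estimate` is my own, and real applications will need more than the 64 MB base figure per user.

```python
import math

MB_PER_USER = 64        # base memory per user from the quoted guidance
OS_OVERHEAD_MB = 2000   # "+ 2 GB for OS", using the same rounding as the quoted example
SESSIONS_PER_CORE = 15  # "15 TS session per CPU core"


def estimate(users, mb_per_user=MB_PER_USER):
    """Return (ram_mb, cores) suggested by the quoted rules of thumb."""
    ram_mb = users * mb_per_user + OS_OVERHEAD_MB
    cores = math.ceil(users / SESSIONS_PER_CORE)
    return ram_mb, cores


if __name__ == "__main__":
    for users in (70, 80, 100):
        ram_mb, cores = estimate(users)
        print(f"{users} users: {ram_mb} MB RAM, {cores} cores")
```

By this arithmetic, 80 users call for 6 cores, so 4 vCPUs fall short of the guideline while 32GB of RAM is well above the base memory figure.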

HopelessN00b
0

I would suggest implementing WSRM (Windows System Resource Manager). When there are a ton of apps, connections, and services running on one host, the system doesn't know that everyone needs to play nice together. Windows Server naturally tries to use all of its resources to complete everything all the time unless it is made aware... enter WSRM.

By implementing WSRM you can set resource limits in all sorts of ways to make sure there is an even playing field for everything running and every user connected. From your notes this doesn't seem like an ESX/vSphere issue but rather too many connected users who are constantly competing for everything. You will have to test WSRM to find a happy medium that balances resources among everything without affecting the performance levels everyone has grown accustomed to.

WSRM Overview: http://technet.microsoft.com/en-us/library/cc732553.aspx

MethoteK
  • Thanks. I already have WSRM installed with the **Equal per session** profile. – ewwhite Jan 25 '14 at 03:51
  • I'm not sure that WSRM can alleviate the underlying problem, which my gut tells me is memory exhaustion of some type (and based on the same problem and error message in W2K3 is some type of kernel memory exhaustion). – joeqwerty Jan 25 '14 at 04:01