Random servers in Citrix farm suddenly bluescreen (mostly 0x0000008e and 0x0000007e)

Question

I'm responsible for a Citrix Presentation Server 4.5 farm. Starting Friday, 30 November, my servers started to crash randomly. So far we've experienced 80 crashes, so it's obviously becoming an increasingly big problem for us. I have 12+ years of experience in IT, so I know the difference between 0 and 1, but I'm having a hard time cracking this.

We've rolled back any recent changes I can think of for different groups of servers, but all groups still seem to crash. I don't have the skills to interpret the memory dumps to find the culprit.

Has anyone encountered the same or a similar problem? - might be a generic Windows issue
Other than executing "!analyze -v" in WinDbg, how do I work my way through the memory dumps to see what actually triggered the BSOD?
Any suggested steps in getting to the bottom of this?

Any help is greatly appreciated. I can also provide links to kernel memory dumps or WinDbg output if necessary.

Thanks!

Problem description

The majority of the STOP errors we encounter are:

0x0000008e KERNEL_MODE_EXCEPTION_NOT_HANDLED (50%)
0x0000007e SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (26%)
0x00000050 PAGE_FAULT_IN_NONPAGED_AREA (21%)

We also see a few 0x0000000a IRQL_NOT_LESS_OR_EQUAL (3%).

For both 0x0000008e and 0x0000007e bug checks, the exception code is 0xc0000005 (Access Violation). When opening dump files in WinDbg, most details are exactly the same, for all the 0x0000008e and 0x0000007e bug checks respectively:

0x0000008e

Exception address: 0x808bc9e3
Trap frame: [varies]
FAILURE_BUCKET_ID: 0x8E_nt!HvpGetCellMapped+97
Probably Caused by (IMAGE_NAME): ntkrpamp.exe

0x0000007e

Exception address: 0x808369b6
Exception record address: 0xf70d3be0
Context record address: 0xf70d38dc
FAILURE_BUCKET_ID: 0x7E_nt!MmPurgeSection+14
Probably Caused by: memory_corruption
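
With this many dumps, it can help to tally the failure buckets mechanically rather than by eye. The following is a minimal sketch, assuming the text output of "!analyze -v" has been saved for each dump and loaded as a string; the function names and the sample strings are hypothetical, and only the FAILURE_BUCKET_ID field quoted above is parsed.

```python
import re
from collections import Counter

# Matches the FAILURE_BUCKET_ID line as it appears in saved "!analyze -v" output.
BUCKET_RE = re.compile(r"^FAILURE_BUCKET_ID:\s*(\S+)", re.MULTILINE)


def failure_bucket(analyze_output):
    """Return the FAILURE_BUCKET_ID found in one saved analysis, or None."""
    match = BUCKET_RE.search(analyze_output)
    return match.group(1) if match else None


def tally_buckets(outputs):
    """Count how often each failure bucket occurs across many analyses."""
    buckets = (failure_bucket(text) for text in outputs)
    return Counter(b if b is not None else "<no bucket>" for b in buckets)


if __name__ == "__main__":
    # Hypothetical samples, shaped like the fields quoted in the question.
    samples = [
        "FAILURE_BUCKET_ID:  0x8E_nt!HvpGetCellMapped+97\n",
        "FAILURE_BUCKET_ID:  0x8E_nt!HvpGetCellMapped+97\n",
        "FAILURE_BUCKET_ID:  0x7E_nt!MmPurgeSection+14\n",
    ]
    for bucket, count in tally_buckets(samples).most_common():
        print(f"{count:3d}  {bucket}")
```

A bucket that dominates the tally points at where to look first; a long tail of unrelated buckets would instead fit the "memory_corruption" verdict.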

About 30% of the crashes happen between 17:00 and 19:00, which leads me to believe this tends to happen more often during logoffs. But then again, only ~15% occur between 15:00 and 17:00.
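
The time-of-day shares can be recomputed from the crash timestamps as the crash count grows. This is a rough sketch, assuming the crash times have already been collected as datetime values (for example from the dump file timestamps); the function name and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def share_by_window(crash_times, windows):
    """Return {(start_hour, end_hour): fraction of crashes with start <= hour < end}."""
    total = len(crash_times)
    counts = Counter()
    for ts in crash_times:
        for start, end in windows:
            if start <= ts.hour < end:
                counts[(start, end)] += 1
    return {w: (counts[w] / total if total else 0.0) for w in windows}


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    crashes = [
        datetime(2012, 12, 3, 10, 12),
        datetime(2012, 12, 3, 17, 41),
        datetime(2012, 12, 4, 18, 5),
        datetime(2012, 12, 4, 15, 30),
    ]
    for window, share in share_by_window(crashes, [(15, 17), (17, 19)]).items():
        print(f"{window[0]:02d}:00-{window[1]:02d}:00  {share:.0%}")
```

Raw shares do not account for how many sessions are active or logging off in each window, so they are only a first hint.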

Summary of farm

Citrix Presentation Server 4.5 R06 on Windows Server 2003 R2 SP2
All high-priority patches installed, at least as of October
Virtualized using VMware ESX/vSphere 4.1 on HP ProLiant BL460c G6 blade servers
About 53 Presentation Servers in production, divided into three silos - only one of which, the largest, is affected
2 vCPUs (5 GHz reserved), 8 GB RAM (all reserved) for each Presentation Server
Plenty of free disk space
Very few printer drivers - automated deletion of non-approved drivers every night
~1,000 peak concurrent users, which is reached at around 10:30 (on weekdays)
Number of sessions steadily declines between 15:00 and 19:00 to ~230

Are the affected servers isolated to a particular blade/blades or chassis? – joeqwerty, Dec 11 '12 at 03:02
@joeqwerty: More or less evenly spread across the 13 hosts. One host does, however, seem to attract more crashes, but no host is spared. – abstrask, Dec 11 '12 at 03:08
Did your antivirus software get an update or "enhancement" recently, or did anything else that installs filter drivers, like security or backup software? Or perhaps the actual *deletion* of a printer driver by your automated process could have left something behind? – mfinni, Dec 11 '12 at 03:23
@mfinni: Actually, we don't currently run AV protection, due to legacy problems (my guess is that reasonable exceptions and "aggressiveness" were never defined, but it was before my time). Forefront AV was, however, pushed out by SCCM to a few servers, but has since been removed. Some of the servers that did get AV installed haven't crashed yet, others have crashed a few times and others a lot, so I don't see a pattern here either. I've checked several servers: we only have the same type 3 printer drivers as specified on our whitelist. – abstrask, Dec 12 '12 at 00:47
And those printer drivers haven't been changed or updated? Could something have been left behind by a "deleted" driver? Dump the drivers list on a server that it happens to, and one that it doesn't, and diff them for any variances. – mfinni, Dec 12 '12 at 00:49
I think it's very unlikely, but I don't have any audit logs to support it. Honestly, I don't think the script has deleted a driver for months (or more). Previously, admins could inadvertently install print drivers by connecting through RDP, but that's not the case anymore. I guess we've just kept the script "just in case" someone gets wise, but as far as I know, it hasn't had to do anything for a long time. Will run a checksum compare of C:\WINDOWS\system32\spool\drivers between a good and bad server... – abstrask, Dec 12 '12 at 01:00
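
The checksum comparison mentioned in the last comment could be sketched roughly as follows. This is only an illustration, assuming both driver directories are readable from one machine (for example through an administrative share); the function names are hypothetical, and SHA-256 is an arbitrary choice of checksum.

```python
import hashlib
import sys
from pathlib import Path


def hash_tree(root):
    """Map each file path (relative to root, lowercased) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Lowercase because Windows file names are case-insensitive.
            rel = path.relative_to(root).as_posix().lower()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(good, bad):
    """Return (only_in_good, only_in_bad, changed) as sorted lists of paths."""
    only_good = sorted(good.keys() - bad.keys())
    only_bad = sorted(bad.keys() - good.keys())
    changed = sorted(p for p in good.keys() & bad.keys() if good[p] != bad[p])
    return only_good, only_bad, changed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: script.py <good server's drivers dir> <bad server's drivers dir>
    only_good, only_bad, changed = diff_trees(hash_tree(sys.argv[1]), hash_tree(sys.argv[2]))
    for label, paths in (("only on good", only_good), ("only on bad", only_bad), ("differs", changed)):
        for p in paths:
            print(f"{label}: {p}")
```

An empty result would suggest the driver files themselves are identical on both servers, which would make leftover driver files a less likely culprit.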

score 2 · Answer 1 · answered Dec 11 '12 at 09:01

We had a similar issue on an older version of Citrix (PS4) that was down to HP print drivers. I had to clear the whole lot off before reinstalling the appropriate ones, and that seemed to clear the blue screen issue. I'm also curious about the "automated deletion of non-approved drivers every night". If you clear non-approved ones down each night, why do you allow them to be installed in the first place? You can stop them being installed in the Citrix policies. I think it is under Printing -> Drivers -> Native printer driver auto-install (set to do not automatically install).

user114106

Actually, the majority of the printer drivers on our whitelist are HP drivers: CLJ 4500, CLJ PS, LJ 4, LJ 4000 PCL, LJ 4000 PCL6, LJ 5N, LJ II, UPD PCL5, UPD PCL6 and UPD PS, but that list hasn't changed for months. The cleanup script is there mostly for legacy reasons. More supporters than I care for have admin access, and previously, admins could inadvertently install print drivers by connecting through RDP. I don't think this happens anymore, but the script is still there, just in case someone thinks it's a good idea to install a print driver. – abstrask Dec 12 '12 at 00:55

score 0 · Accepted Answer · answered Mar 13 '13 at 07:43

We ended up applying PS 4.5 roll-up pack 7 (which wasn't installed, because it previously broke session reliability for us) and a number of post-R07 hotfixes.

Furthermore, we replaced the newest beta of UPHClean 2.0, which Microsoft has since abandoned as a separate component (it is still built into later versions of Windows), with the newer UPHClean 1.6g.

The farm has been stable since, but it's still a mystery why all hell suddenly broke loose when we hadn't made any major changes.
