
I'm in quite a bind here, hopefully someone can help.

Here's what I have: a Dell R710 with one Xeon 2.7 GHz processor, 18 GB RAM, and Server 2008 x64 SP2. I'm running Hyper-V with about 5 servers.

Starting in January, I've had problems with crashing.

First time it was one of the VMs (a Server 2003 SBS). It crashed with no error entries in the Event Log and no Crash Dump. The server came back on by itself.

Then the host server (the 2008 server) crashed twice: once last week and again today, about a week later. Again, there were no entries in the event log and no crash dump, and it came back on by itself.

I had made changes to the server at the beginning of January. I updated the network drivers (Broadcom), added the teaming software, and teamed two interfaces. I also upgraded Symantec Endpoint Protection on all of the servers to the latest version, 12. I also replaced the switch, but I'm not counting that as part of the problem.

I was thinking that this is a memory problem because one of the VMs crashed as well as the host. But it could also be the Symantec.

I don't have all of the crash dumps because the idiot that configured the server didn't leave enough room on the system drive to copy the DMP files.

Here is one of the DMP files:

Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Windows\Minidump\Mini012412-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: SRV*e:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
Windows Server 2008/Windows Vista Kernel Version 6002 (Service Pack 2) MP (8 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 6002.18484.amd64fre.vistasp2_gdr.110617-0336
Machine Name:
Kernel base = 0xfffff800`01c1d000 PsLoadedModuleList = 0xfffff800`01de1dd0
Debug session time: Tue Jan 24 18:58:02.334 2012 (UTC - 5:00)
System Uptime: 9 days 13:32:35.727
Loading Kernel Symbols
...............................................................
................................................................
.............................
Loading User Symbols
Loading unloaded module list
..................................................
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 7F, {8, 80050033, 6f8, fffff80001c70da4}

Probably caused by : NETIO.SYS ( NETIO!MatchValues+14e )

Followup: MachineOwner
---------

I have since disabled the teaming.

Here is another:

Windows Server 2008/Windows Vista Kernel Version 6002 (Service Pack 2) MP (8 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 6002.18484.amd64fre.vistasp2_gdr.110617-0336
Machine Name:
Kernel base = 0xfffff800`01c4b000 PsLoadedModuleList = 0xfffff800`01e0fdd0
Debug session time: Sat Jan 28 07:42:48.945 2012 (UTC - 5:00)
System Uptime: 0 days 21:36:52.143
Loading Kernel Symbols
...............................................................
................................................................
.............................
Loading User Symbols
Loading unloaded module list
...........
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 7F, {8, 80050033, 6f8, fffff80001ceeaa2}

Probably caused by : ntkrnlmp.exe ( nt!KiDoubleFaultAbort+b8 )

Followup: MachineOwner
---------

3: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

UNEXPECTED_KERNEL_MODE_TRAP (7f)
This means a trap occurred in kernel mode, and it's a trap of a kind
that the kernel isn't allowed to have/catch (bound trap) or that
is always instant death (double fault).  The first number in the
bugcheck params is the number of the trap (8 = double fault, etc)
Consult an Intel x86 family manual to learn more about what these
traps are. Here is a *portion* of those codes:
If kv shows a taskGate
        use .tss on the part before the colon, then kv.
Else if kv shows a trapframe
        use .trap on that value
Else
        .trap on the appropriate frame will show where the trap was taken
        (on x86, this will be the ebp that goes with the procedure KiTrap)
Endif
kb will then show the corrected stack.
Arguments:
Arg1: 0000000000000008, EXCEPTION_DOUBLE_FAULT
Arg2: 0000000080050033
Arg3: 00000000000006f8
Arg4: fffff80001ceeaa2

Debugging Details:
------------------


USER_LCID_STR:  ENU

OS_SKU:  7

BUGCHECK_STR:  0x7f_8

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  DRIVER_FAULT_SERVER_MINIDUMP

PROCESS_NAME:  System

CURRENT_IRQL:  d

LAST_CONTROL_TRANSFER:  from fffff80001ca522e to fffff80001ca5490

STACK_TEXT:  
fffffa60`019e9a68 fffff800`01ca522e : 00000000`0000007f 00000000`00000008 00000000`80050033 00000000`000006f8 : nt!KeBugCheckEx
fffffa60`019e9a70 fffff800`01ca3a78 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x6e
fffffa60`019e9bb0 fffff800`01ceeaa2 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiDoubleFaultAbort+0xb8
fffffa60`005a8000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!HvlEndSystemInterrupt+0x2


STACK_COMMAND:  kb

FOLLOWUP_IP: 
nt!KiDoubleFaultAbort+b8
fffff800`01ca3a78 90              nop

SYMBOL_STACK_INDEX:  2

SYMBOL_NAME:  nt!KiDoubleFaultAbort+b8

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nt

IMAGE_NAME:  ntkrnlmp.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  4dfb5a33

FAILURE_BUCKET_ID:  X64_0x7f_8_nt!KiDoubleFaultAbort+b8

BUCKET_ID:  X64_0x7f_8_nt!KiDoubleFaultAbort+b8

Followup: MachineOwner
---------
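To make it easier to compare the two dumps, here is a minimal Python sketch that parses the two `BugCheck` summary lines pasted above. The function name `parse_bugcheck` and the `TRAP_NAMES` table are my own illustration, not part of any debugger tooling; the only trap name used is the one the `!analyze -v` text spells out (8 = double fault).

```python
import re

BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")

# Only trap 8 is named in the debugger text above; others are left unnamed.
TRAP_NAMES = {0x8: "double fault"}


def parse_bugcheck(line):
    """Return (code, [arg1..argN]) parsed from a 'BugCheck ...' summary line."""
    match = BUGCHECK_RE.search(line)
    if match is None:
        raise ValueError(f"not a BugCheck line: {line!r}")
    code = int(match.group(1), 16)
    args = [int(part.strip(), 16) for part in match.group(2).split(",")]
    return code, args


first = parse_bugcheck("BugCheck 7F, {8, 80050033, 6f8, fffff80001c70da4}")
second = parse_bugcheck("BugCheck 7F, {8, 80050033, 6f8, fffff80001ceeaa2}")

# Same bugcheck code and same first three arguments means the same kind of trap.
same_kind = first[0] == second[0] and first[1][:3] == second[1][:3]
print(hex(first[0]), TRAP_NAMES.get(first[1][0], "unknown trap"), same_kind)
# prints: 0x7f double fault True
```

Both crashes are the same bugcheck (0x7F) with the same trap number; only the fourth argument differs between the two dumps.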

Hopefully I can get some much needed guidance here.

Thanks

hmallett
Chris
  • 1
    Were the BSODs occurring before you updated the network drivers? If they started afterwards, have you rolled back the driver? I'm willing to bet this is directly related to that driver or, as you suspected, a memory issue. I'd recommend burning in some new memory and swapping out the old to eliminate that as a possible culprit. – RomeNYRR Feb 17 '12 at 19:20
  • 1
    Are you running Dell OpenManage Server Administrator (OMSA)? In the Dell OMSA web interface, you can access the hardware logs and see if there are any log messages in there. – Mei Feb 17 '12 at 20:29
  • There are no hardware errors in the log. – Chris Mar 07 '12 at 08:17
  • I'm rolling back the Network Drivers now. I don't want to, and I don't think that's it, but maybe I'll be surprised.... – Chris Mar 07 '12 at 08:24
  • Nevermind, rolling back the drivers failed. I need to work on this another time. – Chris Mar 07 '12 at 08:49

3 Answers

1

When it comes to BSODs, 99% of the time it's a driver problem.

You can change the configuration to store only a kernel dump instead of a full RAM dump, so you can keep more of them.
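As a rough illustration of the dump setting mentioned above: the dump type is controlled by the `CrashDumpEnabled` value under the `CrashControl` registry key (0 = none, 1 = complete, 2 = kernel, 3 = small/minidump). The Python sketch below only builds the `reg add` command line so it can be reviewed first; the helper name `reg_command` is hypothetical, and nothing is written to the registry.

```python
# Values of the CrashDumpEnabled setting under the CrashControl key.
DUMP_TYPES = {"none": 0, "complete": 1, "kernel": 2, "small": 3}

CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"


def reg_command(dump_type):
    """Build (but do not run) a reg.exe command selecting the dump type."""
    value = DUMP_TYPES[dump_type]
    return (
        f'reg add "{CRASH_CONTROL_KEY}" '
        f"/v CrashDumpEnabled /t REG_DWORD /d {value} /f"
    )


print(reg_command("kernel"))
# prints: reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v CrashDumpEnabled /t REG_DWORD /d 2 /f
```

The same setting is available in the GUI under System Properties, Advanced, Startup and Recovery.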

What I would do:

  1. Upgrade the Broadcom driver. I know you say you did it, but check again, and get it from Broadcom, not Dell. Dell's packages are always about six months behind.
  2. Check the settings on the network card, such as the receive and send buffers. Reset them to factory defaults if in doubt.
  3. Temporarily disable Symantec to check. Also make sure your Endpoint driver is up to date. I have already seen a Symantec update leave an older driver version in place.
Mathieu Chateau
0

Have you tried the following hotfix? We had this with our set of R710s; I'm not sure whether it has already been included in an update by now, though.

http://support.microsoft.com/kb/975530

-1

Microsoft suggests not teaming NICs when using Hyper-V. I'm glad you turned it off. I figured that might have been the problem. If you feel it's a memory issue, try taking out the two awkward RAM modules. 18GB seems off. You should have at least 16GB. I would probably start there.

  • On a system with 3-channel memory, 18GB is probably three banks of 6G each (4G + 2G modules). My company has about a dozen servers with that exact configuration. Not an issue. We had a guy last year who thought the same thing and reconfigured a few of them with 4x4G modules totaling 16GB, but of course that configuration drops down to 2-channel access and NUMA was all jacked up. Performance on those servers dropped by a third. – ryandenki Feb 29 '12 at 01:01
  • The servers in my example above were R610s, but yes, an R710 has 3-channel memory. – ryandenki Feb 29 '12 at 01:04
  • I ran the Dell memory test a couple of times and it didn't show any problems. I can reproduce this problem every time if I run Windows Update, then open Network, and open Network and Sharing Center. BSOD every time. – Chris Mar 07 '12 at 08:17
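
The memory-channel arithmetic from the comments above can be sketched in a few lines of Python. This is a deliberately simplified model and an assumption on my part: it treats a layout as balanced when every module size can be spread evenly over the three channels, and it ignores rank, speed, and slot-population rules. The names `is_balanced` and `summarize` are hypothetical.

```python
from collections import Counter


def is_balanced(modules_gb, channels=3):
    """True if each module size occurs in a multiple of `channels`."""
    counts = Counter(modules_gb)
    return all(n % channels == 0 for n in counts.values())


def summarize(modules_gb, channels=3):
    """Return (total GB, balanced across channels?) for a DIMM layout."""
    return sum(modules_gb), is_balanced(modules_gb, channels)


print(summarize([4, 4, 4, 2, 2, 2]))  # prints: (18, True)
print(summarize([4, 4, 4, 4]))        # prints: (16, False)
```

Under this model, 18GB as three 4G + 2G pairs fills all three channels identically, while 4x4G cannot be split evenly across three channels.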