
I have a Windows Server 2012 R2 virtual machine; yes, with all the updates. Additional software includes Microsoft SQL Server 2014 (it was 2012 on a previous VM). The web hosting company includes xenpci.sys (EJBPV XenPCI Driver (Checked Build), James Harper) and Plesk as part of their default installation on all VMs.

Periodically, the OS either hangs, blue screens, or reboots. I do get mini dumps, though not all the time. The usual problem is:

Error: CRITICAL_STRUCTURE_CORRUPTION

The specific top level file, obviously not the cause, varies: win32k.sys, ntoskrnl.exe, xenpci.sys (the Xen driver, though only showed up a couple of times), and ndis.sys.

The OSR (Open System Resources) analyzer was not of much help. The WhoCrashed analyzer was a bit more helpful.

It stated:

17 crash dumps have been found and analyzed. Only 10 are included in this report. A third party driver has been identified to be causing system crashes on your computer. It is strongly suggested that you check for updates for these drivers on their company websites. Click on the links below to search with Google for updates for these drivers:

xenpci.sys (EJBPV XenPCI Driver (Checked Build), James Harper)

I tried to push the web hosting company to research the topic, but they came up empty-handed. I am not convinced that the Xen drivers are at fault. I presume WhoCrashed picked up on xenpci.sys merely because it was the last driver on the stack a couple of times and it is a third-party driver, which makes it look guilty. I did not write WhoCrashed, so it is hard to comment further.

My question is how to troubleshoot the problem.

The web hosting company already tried giving me two new virtual machines throughout the last couple of years. The problem migrates. I installed SQL Server, but the OS and Plesk came by default. Okay, there is the mail server software too. The web hosting company also told me that they do not have other clients similarly complaining. They ran disk tests multiple times. Disk health is good.

I did not check the registry's health, but the problem goes across installations and happens pretty routinely, so I would have to discount that. I am on my third or fourth VM now.

Again, I mention Xen because WhoCrashed mentioned it, but I am not convinced that it is the cause, and other clients really do use it. The system has adequate memory and storage, so that is not a problem.

UPDATE: Here are some answers from the web hosting company to my query.

In usual scenario, performance of the VM will get degraded once you uninstall the drivers. There might be some synchronization issues with the Hardware Node.

Am I using a checked or release build?

You are using a test-signed build, the same ones from developer's site.

How can I tell? The Xen PCI properties dialog in Device Manager did not say one way or the other. Is the entry in Device Manager the sole location? I checked in Programs and Features and saw nothing listed.

You can check the version under Add or Remove programs. Refer to the snapshot attached.

How/where can I find the latest version on their site?

Developer's site is not working - http://www.meadowcourt.org/downloads/ You can download the latest signed releases from here - http://wiki.univention.de/index.php?title=Installing-signed-GPLPV-drivers

How can I tell which Xen, 0.11.0.373 belongs to (Xen 4.6? 3.0? x.y?)

We are using Xen 3.4.4, you can't see it from your VM. It can only be viewed from hardware node.

Update 2: The hosting company installed two James Harper software packages:

GPL PV Drivers for Windows
EJB PV Drivers for Windows
Sarah Weinberger
  • A quick Google search for "Xen Windows BSOD" turns out *lots* of results, especially when old Xen drivers are used on recent Windows VMs; the culprit is quite likely to be there. – Massimo Sep 08 '15 at 22:33
  • The easiest way to determine if the Xen driver is the culprit would be to uninstall it. – joeqwerty Sep 08 '15 at 22:57
  • I thought of that option, sadly I do not think my web hosting company will go for that. They told me on several occasions that the system needs the hypervisor to gain access to the PCIe slots and function correctly. I would have to search my huge message archive with them to get the exact wording, but basically they were not very willing. – Sarah Weinberger Sep 08 '15 at 23:16
  • Uninstalling the hypervisor drivers from a VM is very likely to achieve unpleasant results, up to not being able to boot anymore. – Massimo Sep 08 '15 at 23:17
  • That was what the hosting company said, but I just a few seconds ago opened a ticket and asked them again. – Sarah Weinberger Sep 08 '15 at 23:19
  • @SarahWeinberger, when the web host gave you new VMs, were they on different hardware? A bunch of BSoDs showing different top level files that ultimately seem to be caused by the hypervisor driver sure seems like something that could be caused by bad memory. If your "new" VMs are on the same hardware, then it's possible this will keep manifesting. – briantist Sep 09 '15 at 02:56
  • @briantist I first ran into the problem on my first server at this company. Their response was to give me a new VM; I am not sure whether it was on the same hardware. The problem happened again, but that coincided with their move to the cloud, which caused yet another upgrade, this time to new hardware. The problem happened again, so I got a new VM, though probably on the same hardware. I did upgrade the memory and storage at some point. They ran memory and storage checks, though I guess that I could do that independently as well. – Sarah Weinberger Sep 09 '15 at 11:53
  • I got moved to a new hardware box, one with more RAM and fewer other VMs, because the hosting company saw a forum post by James Harper "in which he suggested that check the memory consumption of the hypervisor as opposed the memory health itself.". They have ruled out Microsoft Hyper-V for the moment. If the new box yields the same problem, that rules out RAM and points to Xen. The checked/tested build is a Xen thing, as that is what they publish. One has to think to ask hosting companies which hypervisor they use. I never thought to ask that. – Sarah Weinberger Sep 09 '15 at 22:15
  • *"The checked/tested build is a Xen thing as that is what they publish."* Sorry, still don't buy that. :( Check the Xen wiki itself: http://wiki.xen.org/wiki/Windows_PV_Drivers/Building `build.py checked` to create a **debug** build of the driver, `build.py free` to build a non-debug driver. – Ryan Ries Sep 10 '15 at 03:28
  • Did you ever solve this? I am having the exact same problem. – Louise Eggleton Mar 11 '16 at 16:36
  • @LouiseEggleton That depends on what the definition of is, is. ;-) Cirrus did not solve the problem and will not solve the problem. The solution is to move to another web hosting company and ditch Canada. The web hosting company uses a Xen based hypervisor. Their Windows Server VPS template comes with the Xen driver installed. The physical hardware and their backend stuff requires Xen. I spoke with them 2-days ago and they offered a Citrix based solution, but that offers no backup and limited or no other things. The solution seems to be avoid Cirrus Hosting for Windows based anything. – Sarah Weinberger Mar 11 '16 at 17:35
  • OMG, I am using the same host. Thanks very much. We were using a very stable US provider for 15 years. Switched to Cirrus because of currency fluctuations. Maybe it was a bad move. – Louise Eggleton Mar 11 '16 at 17:39
  • @LouiseEggleton I switched to Cirrus when my previous hosting company had issues and the U.S. vs. Canadian dollar was in my favor, but in the 2+ years that I have been with them they have denied, obfuscated, ignored, and made my life hell. My site crashes routinely, twice yesterday. They are in over their heads. They told me once: 1. I am the only person with the problem (a fat lie, it seems now). 2. They hired a consultant who set their system up, and that consultant is gone, hence the continued problem. The Citrix option is not an option. Why am I debugging THEIR PROBLEM! Cirrus is unreliable. – Sarah Weinberger Mar 11 '16 at 18:03
  • We are based in Canada, so my company prefers a Canadian provider, but there are not many Canadian cloud providers. When I was a freelancer I used Netsonic in Wisconsin and they were very stable. – Louise Eggleton Mar 11 '16 at 18:15
  • @LouiseEggleton If you want to continue the conversation, we should continue outside of this thread. I want to keep this thread technical and per SO guidelines. Connect with me on LinkedIn or via my site. On a technical note, leaving a VPS solution to a hosted solution causes issues with DevExpress, as hosted environments do not allow registration of the DevExpress ASP.Net assemblies in the GAC meaning that I have to copy the assemblies up to the bin folder for every site. – Sarah Weinberger Mar 11 '16 at 18:37

2 Answers


xenpci.sys (EJBPV XenPCI Driver (Checked Build), James Harper)

(Checked Build) is a huge red flag. You absolutely should not be using "checked" builds of anything in production. If your hosting company loaded this driver for you, then they absolutely made a mistake.

Checked builds include extraneous symbols and extra error checking that aid developers. They are not production builds.

To further elaborate, what this tells me is that whatever error is causing the machine to stop probably still occurs in the un-checked build of the driver; in the release build, however, it's probably just causing a non-fatal side effect such as a memory leak. In the checked build, because of the more stringent error checking, it stops the entire OS. That's the point of checked builds: to accentuate errors and shove them in the developers' faces before they ship the code to customers.

It also doesn't really matter if other VMs have that exact same driver loaded (the checked build) and don't seem to be crashing. Some component specific to that VM is invoking a certain behavior or state that triggers the bug in that driver. Drivers and applications interact in all sorts of ways. Maybe two machines have the same buggy driver loaded, but only one of the servers has SQL installed, and because it has SQL installed it does some unique memory page locking that the other server doesn't do, which causes the 3rd party driver bug to rear its ugly head. (Just an example.)

There's really nowhere else to place the blame here. You cannot run checked builds of drivers in production and expect to have a good time. They're only for development and testing purposes.

Lastly, the only other place to go from here would be to collect a full dump and run it through WinDBG. You can spend six hours of intense debugging, unwinding stacks, tracing threads, following IRPs to their completion ports... or you can just get rid of that checked build driver. :)

You might also try running the driver through Driver Verifier. In a test environment. Where checked builds should stay. ;)
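As a lighter first pass before a full WinDBG session, the bug check code can be read straight from a dump's header to confirm that every crash really is the same stop code. The following is a minimal sketch, not a replacement for a debugger. It assumes a 64-bit kernel dump whose header starts with the `PAGEDU64` signature and stores the bug check code as a 32-bit little-endian value at offset 0x38; that layout and the helper names are assumptions made for illustration, so verify the result against WinDBG on one dump first. The value 0x109 is the code for CRITICAL_STRUCTURE_CORRUPTION.

```python
import struct

# Assumed layout of a 64-bit kernel dump header (verify against WinDBG).
SIGNATURE_64 = b"PAGEDU64"   # "PAGE" signature followed by "DU64"
BUGCHECK_OFFSET = 0x38       # assumed offset of the 32-bit bug check code
HEADER_BYTES = 0x40          # enough bytes to cover the field above

# Small lookup table; extend it with other stop codes as needed.
KNOWN_CODES = {0x109: "CRITICAL_STRUCTURE_CORRUPTION"}


def read_bugcheck(header):
    """Return (code, name) parsed from the first bytes of a dump file."""
    if len(header) < BUGCHECK_OFFSET + 4:
        raise ValueError("header too short")
    if header[:8] != SIGNATURE_64:
        raise ValueError("not a 64-bit kernel dump header")
    (code,) = struct.unpack_from("<I", header, BUGCHECK_OFFSET)
    return code, KNOWN_CODES.get(code, "UNKNOWN")


def read_bugcheck_from_file(path):
    """Read only the start of the dump file and parse the bug check code."""
    with open(path, "rb") as dump:
        return read_bugcheck(dump.read(HEADER_BYTES))
```

Running `read_bugcheck_from_file` over each file in the minidump folder and tallying the results shows quickly whether the stop code is constant even though the reported top-level module varies.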

Ryan Ries
  • I will ask the hosting company whether I am using a checked build or not. I just checked Device Manager and all it says for "Xen PCI Device Driver" is "9/17/2014 | 0.11.0.373". I do not see checked or release anywhere in the properties dialog, but the hosting company will know more. Stay tuned. – Sarah Weinberger Sep 08 '15 at 23:23
  • Then where did you get the "Checked Build" from in your post? – Ryan Ries Sep 08 '15 at 23:44
  • I copied and pasted the WhoCrashed analysis summary block at the bottom of the report. OSR's report was not very helpful, which is why I focused on the WhoCrashed one, but the problem may not be Xen related and could be something else. At one point, one of the web hosting company's support agents tried to pin the blame on Microsoft and IIS, which I dismissed out of hand. I agree with WhoCrashed in suspecting a third-party agent. – Sarah Weinberger Sep 08 '15 at 23:50
  • @SarahWeinberger Check the file size, and at least compare between the servers you have access to; I don't know for certain, but the file sizes *should* be different for checked and non-checked builds (the differences are significant enough that it should show up in file size). If all else fails, compute a checksum; even MD5 will do just fine here. The checked and non-checked builds *will* be different there, even if the file version is the same. – user Sep 09 '15 at 07:16
  • @MichaelKjörling I have to go to the site and understand the downloads. Up until the host's post, I thought the download came from a different site, the one that I posted originally. My guess is that the site has one download, a released version, but checked builds can be digitally signed too. Sadly, the uninstall is on the hardware node, which seems to be an older version of Windows, as he talks "Add/Remove Programs" not "Programs and Features". Either way, uninstall is on their end. I already checked P&F and did not see Xen. – Sarah Weinberger Sep 09 '15 at 12:03
  • I found a list of hypervisor products (https://www.flexiant.com/2014/02/12/hypervisor-comparison-kvm-xen-vmware-hyper-v/). The list includes KVM, Xen, VMware, and Microsoft Windows Server Hyper-V. I asked the web hosting company about the Microsoft solution, but as I understand the product, the primary OS on the physical box needs Hyper-V. They appear to be using a free hypervisor, hence my problem. I hope that does not mean switching hosting companies or other grief. – Sarah Weinberger Sep 09 '15 at 18:04
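The file-size and checksum comparison suggested in the comments above can be sketched roughly as follows. This is only an illustration: the function names are made up for this example, and the path in the usage note is the usual driver location on Windows rather than something confirmed in this thread. SHA-256 is used here, although, as noted above, even MD5 would be enough to tell two builds apart.

```python
import hashlib
import os


def fingerprint(path, algorithm="sha256", chunk_size=1 << 20):
    """Return (size_in_bytes, hex_digest) for the file at path."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def same_build(path_a, path_b):
    """True when both files have identical size and checksum."""
    return fingerprint(path_a) == fingerprint(path_b)
```

For example, calling `fingerprint(r"C:\Windows\System32\drivers\xenpci.sys")` on the VM and on a copy of the driver extracted from a known release package, then comparing the two results, would show whether the installed file matches that release, even when both report the same file version.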

Xen 3.4.4 is too old. It dates from 13 March 2013.

Windows 2012 R2 got released on 18 October 2013.

To give you an actual comparison, Citrix added Windows Server 2012 R2 support to XenServer in version 6.2 SP1, which was released on 13 December 2013. (http://support.citrix.com/article/CTX139788)
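The timeline argument can be made concrete with a little date arithmetic. This is just a sketch using the dates exactly as given in this answer; the constant and function names are invented for the example.

```python
from datetime import date

# Dates as stated in this answer.
XEN_3_4_4 = date(2013, 3, 13)
WINDOWS_2012_R2_RELEASE = date(2013, 10, 18)
XENSERVER_6_2_SP1_RELEASE = date(2013, 12, 13)


def days_between(earlier, later):
    """Number of days from earlier to later (negative if reversed)."""
    return (later - earlier).days


# How long before the guest OS existed the hypervisor version dates from.
gap_before_guest = days_between(XEN_3_4_4, WINDOWS_2012_R2_RELEASE)
# How long after the guest OS release Citrix added support for it.
gap_until_support = days_between(WINDOWS_2012_R2_RELEASE, XENSERVER_6_2_SP1_RELEASE)

print(gap_before_guest, gap_until_support)  # prints "219 56"
```

In other words, the hypervisor version in use predates the guest OS by about seven months, while even a commercial Xen distribution needed roughly two months after the Windows release to add support.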

See the following for the GPLPV drivers; it only mentions the Xen 4.4.0 branch, and only for 2008 R2:

The signed drivers from ejbdigital work great on Xen 4.4.0. If you experience a bluescreen while installing these drivers, or after a reboot after installing them, please try adding device_model_version = "qemu-xen-traditional". I had an existing 2008 R2 x64 system that consistently failed with a BSOD after the gpl_pv installation. Switching to the 'qemu-xen-traditional' device model resolved the issue. However, on a clean 2008 R2 x64 system, I did not have to make this change, so please bear this in mind if you run into trouble. http://wiki.xen.org/wiki/Xen_Windows_GplPv

See the following for the Xen Project's official statement on Windows support. I don't mean that it can't run Windows, but you can see the level of support. Would you run a production server on that platform?

Does Xen Project support Microsoft Windows?

The paravirtualized approach we use to get such high performance has not been usable directly for Windows to date. However Xen 3.0 added Intel VT-x support to enable the running of unmodified guest operating systems, including Windows XP & 2003 Server, using hardware virtualization technology. Xen 3.0.2 and later support AMD Pacifica technology as well. Check to see if your CPU is among the list of HVM Compatible Processors, and if your motherboard is among the list of HVM Compatible Motherboards.

(Note: This does not necessarily mean, that just about any OS runs in HVM Mode! Reports vary about getting *BSD to work in HVM mode, on the Mailing List there's only 1 success for OpenBSD (http://www.openbsd-france.org/ml/archives/msg02494.html), no success message for the others, but some problem reports) http://wiki.xenproject.org/wiki/Xen_FAQ_Drivers,_Windows

yagmoth555
  • I marked the answer up (I am #4) and forwarded the answer along with this URL to my web hosting company. I had never even heard of Xen prior to this problem, so I am new to the whole thing. Your point on version dates is very valid. How could Xen 3.4.4 support Windows 2012 R2, when that was not even out at the time? I shall see what the hosting company says. Sadly, they ruled out Microsoft Hyper-V, not sure why. – Sarah Weinberger Sep 11 '15 at 14:16
  • This answer made a difference. The web hosting company, infamous for changing nothing, just wrote me this email: "Please be informed that we are planning to update the version of Xen server on which your cloud server is running, to the latest since the older versions seems to be vulnerable." Wow! – Sarah Weinberger Sep 12 '15 at 13:49
  • Thanks for the feedback. I am glad if my answer helped :) – yagmoth555 Sep 12 '15 at 16:14