I have this Windows Server 2008 R2 domain controller running on a physical Dell server, model PowerEdge R510.

There are some electrical problems around here, thus a black-out is, unfortunately, quite a common occurrence; there are UPSes, but they are not as reliable as they should be, and sometimes servers will experience unclean shutdowns.

For some reason I really am unable to understand, sometimes this specific DC will come up after an unclean shutdown and encounter a USN rollback, forcing us to demote and promote it back.

This doesn't make sense at all, as the server is a physical one and no snapshot, cloning and/or restore has ever been performed on it; also, no additional software is installed on it, it only performs DC duties; specifically, no cloning/recovery/whatever software is present.

A filesystem corruption would at least make some sense, but a USN rollback really doesn't, as there is no way the server could have been brought back to a previous state. However, this has happened at least three times in the last two months, so it was definitely not a one-time crazy event; but I'm completely unable to come up with an explanation.

What could be the reason for this issue?

Massimo
  • How exactly did you determine that it was in fact a USN rollback? – Mathias R. Jessen Oct 08 '13 at 13:10
  • `HKLM\System\CurrentControlSet\Services\NTDS\Parameters\DSA not writable` = 4 – Massimo Oct 08 '13 at 15:18
  • Very good question. I've been thinking about it for a couple hours now. I still don't know. But incidentally, since you anticipate the server to be experiencing power outages frequently, have you confirmed that write caching is still turned off on all volumes? I know that's the default once you dcpromo, but it can be overridden. Just want to make sure that you didn't turn write caching back on. – Ryan Ries Oct 08 '13 at 15:39
  • Good guess about write caching. Apart from the system cache, the server has a hardware RAID controller, so that should be checked too. I'll have a look tomorrow. – Massimo Oct 08 '13 at 17:29

1 Answer

I thought about this for a few hours today. It's a bit perplexing, but as I indicated in my comment, my best guess is one of two things. Either you have some sort of disk caching going on, and the cached writes are not getting committed to disk before the power outage/dirty shutdown wipes out the contents of the cache. Or, since ntds.dit lives on a RAID volume, the power outage might be causing that volume to temporarily break or become incoherent, even if only for a moment.

We know that the party line on USN rollbacks is that they happen when a DC is restored to a state it was in earlier in time, the classic example being restoring a virtualized DC from a snapshot. I know that doesn't apply to you exactly... but even in the case of a disk with a write cache, you can think of the data that is physically on the disk as containing a "previous state," while the write cache is what actually contains the most up-to-date state of the DC... even if the two states are only half a second apart.
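To make the "previous state" idea concrete, here is a toy Python sketch. It is not how Active Directory is implemented; `ToyDC`, `simulate` and `is_usn_rollback` are made-up names, and the model assumes a single replication partner that simply remembers the highest USN it has seen from this DC. It only illustrates how a lost write cache can leave the DC behind what its partner already knows.

```python
from dataclasses import dataclass


@dataclass
class ToyDC:
    """Hypothetical toy model of a DC: one USN counter, persisted to disk or not."""
    on_disk_usn: int = 0   # highest USN that actually reached the physical disk
    current_usn: int = 0   # highest USN the running DC has handed out

    def originate_change(self, write_through: bool) -> int:
        """Issue the next USN; persist it only if writes bypass the cache."""
        self.current_usn += 1
        if write_through:
            self.on_disk_usn = self.current_usn
        return self.current_usn

    def power_loss(self) -> None:
        """Volatile cache is lost; the DC restarts from whatever is on disk."""
        self.current_usn = self.on_disk_usn


def simulate(write_through: bool, changes: int = 5):
    """Return (DC's USN after reboot, highest USN the partner has seen)."""
    dc = ToyDC()
    partner_high_watermark = 0
    for _ in range(changes):
        # Assumption: the partner replicates every change before the outage.
        partner_high_watermark = dc.originate_change(write_through)
    dc.power_loss()
    return dc.current_usn, partner_high_watermark


def is_usn_rollback(local_usn: int, partner_high_watermark: int) -> bool:
    """In this toy model, a rollback means the DC is behind what its partner saw."""
    return local_usn < partner_high_watermark


if __name__ == "__main__":
    for mode in (False, True):
        local, partner = simulate(write_through=mode)
        print(f"write_through={mode}: local={local}, partner={partner}, "
              f"rollback={is_usn_rollback(local, partner)}")
```

With the cache in play (`write_through=False`) the rebooted DC would start issuing USNs its partner has already seen; with write-through the two stay in step.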

Ruminate on these comments from Microsoft:

Guidelines for virtualized domain controllers

Virtual SCSI disks provide increased performance compared to virtual IDE and they support Forced Unit Access (FUA). FUA ensures that the operating system writes and reads data directly from the media bypassing any and all caching mechanisms.

I know that your DC is not a VM, but the concept still applies. Disk caching and DCs do not mix. That is why installing Active Directory turns write caching off as a Windows policy; even so, you can still have caching mechanisms in your hardware RAID controller, etc.

Scenario B: Starting Active Directory from other drives in a broken mirror

  1. Promote a domain controller. Locate the Ntds.dit file on a mirrored drive.

  2. Break the mirror.

  3. Continue to inbound replicate and outbound replicate by using the Ntds.dit file on the first drive in the mirror.

  4. Start the domain controller by using the Ntds.dit file on the second drive in the mirror.

That's a replication killer that has bitten me a lot on physical DCs with RAID 1 volumes. I've never personally had an actual USN rollback caused by it, but it will kill replication on that DC. I mean, imagine a RAID 1 volume of 2 disks. 1 drive dies. You remove it, pop in a new drive... aaaaaaand DSA Not Writable.
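If you want to check for that state from a script, here is a minimal Python sketch. It assumes a Windows DC with Python available and reads the registry value the asker quoted in the comments on the question (`DSA not writable` under the NTDS `Parameters` key). The helper names are made up, and the only value it interprets is 4, the one reported there; anything else is just echoed back.

```python
import sys

# Key path and value name as quoted in the question's comments
# (registry value names are case-insensitive).
NTDS_PARAMETERS_KEY = r"System\CurrentControlSet\Services\NTDS\Parameters"
VALUE_NAME = "DSA not writable"
USN_ROLLBACK_VALUE = 4  # the value the asker reported after the rollback


def read_dsa_not_writable():
    """Return the registry value, or None if the key or value is absent (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, NTDS_PARAMETERS_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except FileNotFoundError:
        return None


def describe(value):
    """Turn the raw value into a short human-readable status."""
    if value is None:
        return "not set"
    if value == USN_ROLLBACK_VALUE:
        return "set to 4 (the value reported alongside the USN rollback)"
    return f"set to {value} (meaning not covered by this sketch)"


if __name__ == "__main__" and sys.platform == "win32":
    print("DSA not writable:", describe(read_dsa_not_writable()))
```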

From the AskDS blog:

If you do not have uninterruptable power supplies (UPS) for your VM hosts or the storage disk where the active directory database resides, then ensure write-caching is disabled on the virtual machine’s host computer. Please refer this link for additional guidance. Conversely, if the write caching needs to stay enabled for the VM host which hosts the DC, then install a UPS to avoid damage to the DC(s).

Again, it's talking about virtualized DCs, but the disk caching concept applies to physical DCs as well.

So there's my idea. I think it's got something to do with your storage system. You definitely want to disable any and all caching mechanisms, at least on the ntds.dit volume, especially if you're prone to power outages.

Ryan Ries
  • Exactly my thoughts. Write cache on array adapter, but not battery backed. Would bet 0.05 GBP on it :-) – Simon Catlin Oct 08 '13 at 18:17
  • Write cache was in fact enabled on the RAID controller, and the OS was unable to automatically disable it; I've manually disabled it and I hope this fixed the issue once and for all. This configuration was very likely its root cause. – Massimo Oct 09 '13 at 10:25
  • Nice! That should hold you over until you can get a better UPS! ;) – Ryan Ries Oct 09 '13 at 14:21
  • Confirmed: the problem never happened again after the (not battery-backed) write cache was disabled on the physical disk controller. – Massimo Aug 21 '17 at 19:32
  • @Massimo I love that you came back to confirm this after 4 years. :) – Ryan Ries Aug 21 '17 at 21:37