
I have a Win2k (mixed-mode) domain with 4 DCs. One of these also acts as an Exchange 2000 server, which uses 2 logical volumes from an MSA 2000 array. AD etc. is stored on local drives.

We experienced a problem last week when the RAID array failed over to its redundant controller. As a result, the two logical drives were not visible to the server for around 5 minutes and a couple of reboots. The log recorded events like this one:

Event Type: Warning
Event Source: Disk
Event Category: None
Event ID: 51
Date: 06/11/2009
Time: 11:46:23
User: N/A
Computer: server1
Description:
An error was detected on device \Device\Harddisk1\DR1 during a paging operation.

Following these problems, the server's “Kerberos Key Distribution Center” service refuses to start, with the error: A device attached to the system is not functioning. All other automatic-start services (including Net Logon) are running, and there are no DNS issues etc.

All devices are functioning, but the two logical MSA disks are now numbered 2 and 4 in the Windows Disk Management MMC. I suspect they were previously identified as disks 1 and 2, and that Windows perhaps still sees this as an ongoing failure.

Replication has not been affected, but there are many audit failures in the Security log relating to users and workstations, presumably linked to the Kerberos issue.

Attempting to start the Kerberos service manually generates the following in the System log.

Event Type: Error
Event Source: Service Control Manager
Event Category: None
Event ID: 7023
Date: 09/11/2009
Time: 09:46:55
User: N/A
Computer: Server1
Description:
The Kerberos Key Distribution Center service terminated with the following error:
A device attached to the system is not functioning.

DCDIAG passes all tests except “Advertising” and “Services”, which I believe fail only because the Kerberos service is not running.

Any advice would be appreciated.

Gryu

1 Answer


I'm wondering if the volume GUID changed somehow. The Active Directory database location is kept in the registry (see HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters\DS Drive Mappings). That's all I can come up with re: what might've happened, and that's not really a "user serviceable part".
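If you want to look at those values without clicking through regedit, here is a minimal Python sketch that dumps everything under the NTDS Parameters key, optionally from another machine. It assumes Python is available on an admin workstation, that the Remote Registry service is reachable on the DC, and that the mappings live under that Parameters key; the function names are my own and purely illustrative. It only reads, it changes nothing.

```python
# Sketch only: read-only dump of the NTDS Parameters registry values.
# Assumption: run from a Windows machine with Python installed; the
# helper names below are illustrative, not part of any Microsoft tool.

NTDS_PARAMETERS = r"SYSTEM\CurrentControlSet\Services\NTDS\Parameters"


def format_values(values):
    """Turn (name, data, type) tuples into printable 'name = data' lines."""
    lines = []
    for name, data, _type in values:
        if isinstance(data, list):  # REG_MULTI_SZ arrives as a list of strings
            data = "; ".join(data)
        lines.append(f"{name} = {data}")
    return lines


def read_ntds_parameters(computer=None):
    """Return all values under the NTDS Parameters key.

    computer is None for the local machine, or a UNC-style name such as
    r"\\server1" to read a remote registry.
    """
    import winreg  # Windows-only module, imported lazily on purpose

    hive = winreg.ConnectRegistry(computer, winreg.HKEY_LOCAL_MACHINE)
    try:
        with winreg.OpenKey(hive, NTDS_PARAMETERS) as key:
            _subkeys, value_count, _modified = winreg.QueryInfoKey(key)
            return [winreg.EnumValue(key, i) for i in range(value_count)]
    finally:
        winreg.CloseKey(hive)
```

Printing the lines from `format_values(read_ntds_parameters(r"\\server1"))` on the broken DC and again on a healthy DC would give you something to compare side by side, which is about as far as I'd go with it.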

What you're seeing there makes me feel a lack of trust in the machine's ability to ever operate properly again. (I wouldn't trust that disk array or RAID controller further than I could throw it if it's going to go off and renumber disks it presents to Windows, but that's another story.)

Restoring from backup may be problematic if you've continued to have users using Exchange on that machine during this outage, since more data is piling up on the box.

I'd bring up a temporary Exchange Server computer on another machine, move all the mailboxes to that secondary server, replicate your public folders, OAB, etc such that you can decommission Exchange on the failed server properly. You'll need to leave the failed server running long enough for all users to access their mailboxes in the new location once so that Outlook updates their MAPI profiles to refer to the temporary server's name.

Once you've done that, I'd rebuild the failed server from the ground up. If it won't demote back to a member server properly, perform an NTDS metadata cleanup (see http://support.microsoft.com/kb/216498).

After you've rebuilt the machine, you can reinstall Exchange and move the mailboxes back, replicate public folders, etc. Again, you'll need to leave both Exchange Server computers running together until all users have accessed their mailbox at least once so that their MAPI profiles are updated and you can decommission the temporary Exchange Server computer.

Evan Anderson