
Environment:
4 sites, connected by fully meshed VPN over T1s. Each site had one 2003 DC.

History:
About 18 months ago a DC failed. It was rebuilt with the same name and promoted to DC, but it didn't function properly. I wasn't there at the time, so I don't have any more details, but they ended up with a new server with a different name configured as the DC in that site.

I don't know how the failed DC was demoted, but I assume it wasn't done correctly.

Problem:
GPOs only work in one site. The other sites' DCs still share SYSVOL, but the data is now over 18 months out of date.

Steps taken: I've run FRSDiag.exe, but most of the output is beyond me. When I was selecting servers it offered to scan the original failed DC. This makes me think the FRS configuration wasn't updated when the failed DC was demoted, but I really don't know.

The summary report for each server from FRSDiag can be found here: http://pastebin.com/f458b50fb

Event logs from the servers: http://cid-fa226a18f1d8248d.skydrive.live.com/self.aspx/.Public/Johnsburg%20logs/Johnsburg%20Logs%20all.zip

For reference the active DCs are: HSServer2, JServer, RServer, BServer1.

I found this TechNet article, but I was hoping for a more subtle approach. http://support.microsoft.com/kb/315457

Eric Haskins

1 Answer

With things only working at one site, I would look for a more systemic communication issue than the one DC causing all this trouble.

Can you please post the output of your FRSDiag, dcdiag, and netdiag? (Possibly use http://pastebin.com/, as they are fairly long bits of output.)

Also what errors do you see in the File Replication Service and Directory Service Event logs?


OK, so yes, it looks like it can't talk to 'hsserver2' 'BServer'. I'm going to assume that this is the server that you had issues with. To clear that issue up, you will want to forcibly remove the server from the domain. To do this:
  1. Go to Active Directory Sites and Services
  2. Go to the site that DC was in, and drill down to the NTDS settings
  3. Delete all connections in there
  4. Delete the server object. When it prompts you, choose the "this server is offline and cannot be restored" option (or something similar)

That will once and for all get rid of that server so we can start working on the other problems.

That said, not being able to replicate to the one server should not have stopped replication to the rest of the servers in your domain. We'll need to see the other logs I asked for to diagnose what is going on there.


Those logs are a little old, but they show what I was expecting to see. FRS is actually not in too bad a shape, but you have pretty much constant NTDS KCC errors. After you remove BServer from AD, run netdiag and dcdiag with the /fix switch, on the FSMO master first. Then, for good measure, I would run the same commands on all of your DCs. Give it about 10 minutes and run netdiag/dcdiag (without any switches at first) on the DCs again to check their health. If you are lucky you won't have to reset FRS sync, but that isn't such a bad procedure, not like the one you linked to. Let me know if you start seeing errors in the FRS log after you get rid of the KCC errors and I'll dig up the directions to reset FRS.
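
The ordering described above (FSMO master first, then every other DC, once with /fix and later once without) can be sketched as a small planning helper. This is only an illustration: the function name `health_check_plan` is hypothetical, and using JServer as the FSMO holder is a placeholder, since the thread never says which DC holds the roles. The script only prints the commands to run locally on each DC; it does not execute anything.

```python
def health_check_plan(fsmo_master, all_dcs, fix=False):
    """Return (host, command) pairs: FSMO master first, then the other DCs.

    Hypothetical helper for illustration; it only builds the list of
    commands to run locally on each DC and never executes them.
    """
    suffix = " /fix" if fix else ""
    # Put the FSMO master at the front and skip it in the remaining list.
    others = [dc for dc in all_dcs if dc.lower() != fsmo_master.lower()]
    plan = []
    for host in [fsmo_master] + others:
        plan.append((host, "netdiag" + suffix))
        plan.append((host, "dcdiag" + suffix))
    return plan


if __name__ == "__main__":
    # Active DCs named in the question; the FSMO holder is an assumption.
    dcs = ["HSServer2", "JServer", "RServer", "BServer1"]
    for host, cmd in health_check_plan("JServer", dcs, fix=True):
        print(f"{host}: {cmd}")
```

Calling it again with `fix=False` gives the follow-up pass for checking health after the 10 minute wait.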

Zypher
  • I've edited the original post to include the link to the summary reports for each server from FRSDiag. – Eric Haskins Jan 03 '10 at 23:16
  • I won't have access to the server until tomorrow, and I don't have copies of the output from DCDiag or netdiag to post. If the FRSDiag reports aren't enough, I'll post the output from those tools tomorrow. – Eric Haskins Jan 03 '10 at 23:27
  • Actually HSServer2 is working, and it is in the site where GPOs are up to date. The server that failed was BServer; it was replaced by BServer1. Where should I start to troubleshoot the communication issue with HSServer? I'll post the other logs after I get access to the server tomorrow. – Eric Haskins Jan 04 '10 at 00:33
  • Ahh ok, remove BServer1 with the procedure above, run netdiag/dcdiag on hsserver – Zypher Jan 04 '10 at 00:49
  • I found a copy of the event logs from all the servers. It's linked in the question. I'll run those tools and post the results tomorrow. – Eric Haskins Jan 04 '10 at 00:57
  • +1 Great troubleshooting steps. Have had to do this exact same process before with the exact same problem setup. (missing DC) – Dayton Brown Jan 04 '10 at 03:02
  • I did that, but still no success. I did discover that HSServer2 is in jrnl_wrap_error, and the other 3 DCs appear to be replicating properly. I think I'm going to attempt a nonauthoritative (D2) restore on HSServer2. Do you see any issues with this plan? – Eric Haskins Jan 06 '10 at 00:09
  • Yep that's exactly the approach to take. – Zypher Jan 06 '10 at 00:55
  • Finished the restore, and everything seems good. Hopefully there aren't any underlying issues I didn't find. – Eric Haskins Jan 06 '10 at 23:08
  • If anyone else finds this question, here is the link for performing a nonauthoritative restore: http://support.microsoft.com/kb/290762 – Eric Haskins Jan 06 '10 at 23:09
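
For readers who want the gist of the nonauthoritative (D2) restore mentioned in the comments, here is a minimal sketch of the command sequence as I understand it from the KB article: stop FRS, set the BurFlags registry value to D2, and start FRS again. The helper name `d2_restore_commands` is hypothetical, and the script only prints the commands; check them against the KB article before running anything on the affected DC (HSServer2 in this thread).

```python
# Registry key that holds BurFlags for the NtFrs service.
BURFLAGS_KEY = (
    r"HKLM\SYSTEM\CurrentControlSet\Services\NtFrs"
    r"\Parameters\Backup/Restore\Process at Startup"
)

# D2 (hex) marks a nonauthoritative restore; D4 would be authoritative.
D2 = 0xD2


def d2_restore_commands():
    """Return the commands for a nonauthoritative FRS restore, in order.

    Hypothetical helper for illustration; nothing is executed here.
    """
    return [
        "net stop ntfrs",
        # reg add takes the DWORD data in decimal here (0xD2 == 210).
        f'reg add "{BURFLAGS_KEY}" /v BurFlags /t REG_DWORD /d {D2} /f',
        "net start ntfrs",
    ]


if __name__ == "__main__":
    for cmd in d2_restore_commands():
        print(cmd)
```

After FRS restarts, the DC should re-sync SYSVOL from its replication partners, so this is only appropriate on the member that is out of date, not on the one holding the good copy.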