
We have a two-domain environment. We were having issues with slow connections, authentication failures, and hung resources, but only during OFF-PEAK hours when very few users were logged on.

The issue occurs when a user from DOMAIN A accesses a resource located in DOMAIN B using NTLM authentication. There are no issues with users from DOMAIN A accessing resources in DOMAIN A, or with users from DOMAIN B accessing resources in DOMAIN B.

We were able to track the problem down to the secure channels that are used for Netlogon traffic. When a resource server in DOMAIN B had a secure channel with one particular DC (I'll call it DC-B1), everything worked fine. We could follow the traffic chain from client(A)->resource(B)->DC-B1(B)->DC-A1(A) (for authentication) and then back again. However, if the resource server in B had a secure channel with any of the other DCs in DOMAIN B, the authentication would hang and never complete.

So it looks like, with the exception of DC-B1, every DC in DOMAIN B is having trouble creating a domain trust secure channel with DOMAIN A. To test, we ran nltest /sc_verify:DOMAINA from each DC in DOMAIN B.

When run from DC-B1, the response was instantaneous. When run from any other DC in DOMAIN B, it hung for about 40 seconds before reporting success (it never showed an error, it just took a long time).

Any ideas on why some DCs would struggle to establish and use the domain trust secure channel while another DC in the same domain never has an issue?

For what it's worth, the DC that works runs Server 2008 and the ones that don't work run Server 2012 R2. However, the problem existed on some domain controllers before they were migrated to 2012 R2; we just didn't pinpoint the issue until after we were done migrating them.

Thanks for the help.

Edit: Additional Information...

Compared a weekend's worth of NetLogon.log files for each of the Domain Controllers...

Every

[LOGON] SamLogon: Transitive Network logon of DOMAINA\testuser Entered

record in the DC-B1 log (this is the good DC) had a corresponding

[LOGON] SamLogon: Transitive Network logon of DOMAINA\testuser Returns 0x0

however on the other DCs in Domain B each return had one of the following 3 errors:

[LOGON] ... DOMAINA\testuser ... Returns 0xC0020017
[LOGON] ... DOMAINA\testuser ... Returns 0xC0020050
[LOGON] ... DOMAINA\testuser ... Returns 0xC000005E

And here is how often each of the different errors occurred:

77% of errors were: 0xC0020017 RPC SERVER UNAVAILABLE
21% of errors were: 0xC0020050 RPC CALL CANCELED
 1% of errors were: 0xC000005E NO LOGON SERVERS AVAILABLE
 0% of returns were: 0x0 (no error)
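A tally like the one above can be reproduced with a short script. This is only a sketch: it assumes each result line contains `[LOGON]` followed later by `Returns 0x...`, as in the excerpts above, and the function names `tally_returns` and `summarize` are made up for illustration. The symbolic names in the table are the standard NTSTATUS names for these codes.

```python
import re
from collections import Counter

# Standard NTSTATUS names for the return codes seen in the logs above.
STATUS_NAMES = {
    0x0: "STATUS_SUCCESS",
    0xC0020017: "RPC_NT_SERVER_UNAVAILABLE",
    0xC0020050: "RPC_NT_CALL_CANCELLED",
    0xC000005E: "STATUS_NO_LOGON_SERVERS",
}

# Assumed line shape: "[LOGON] ... Returns 0x<hex>"
RETURN_RE = re.compile(r"\[LOGON\].*Returns (0x[0-9A-Fa-f]+)")


def tally_returns(lines):
    """Count how often each return code appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RETURN_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


def summarize(counts):
    """Return (code, name, count, percent) rows, most frequent first."""
    total = sum(counts.values())
    rows = []
    for code, n in counts.most_common():
        name = STATUS_NAMES.get(code, "UNKNOWN")
        rows.append((f"0x{code:X}", name, n, round(100 * n / total)))
    return rows


if __name__ == "__main__":
    sample = [
        r"[LOGON] SamLogon: Transitive Network logon of DOMAINA\testuser Entered",
        r"[LOGON] SamLogon: Transitive Network logon of DOMAINA\testuser Returns 0xC0020017",
    ]
    for row in summarize(tally_returns(sample)):
        print(row)
```

To run it against a real log, pass the open file object to `tally_returns`; the file's location and encoding depend on the DC and are not assumed here.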

We compared all of the security settings between the DCs that do not work and the one that does, but couldn't find anything that would cause the RPC issues. Any suggestions on where we could look next? We are confused as to why the 2008 domain controller in "B" would have no trouble talking to the 2012 DCs in "A", but the 2012 DCs in "B" cannot use pass-through authentication to "A".

Edit: Additional Requested Information...

Test run from DC-B2 & DC-B3 (same results) (pass through authentication originating here does not work)

C:\>nltest /dsgetdc:DOMAINA.local
           DC: \\DC-A3.DOMAINA.local
      Address: \\555.555.555.127
     Dom Guid: 9f3a0668-c245-4493-be03-0f7edf534d27
     Dom Name: DOMAINA.local
  Forest Name: DOMAINA.local
 Dc Site Name: Company
Our Site Name: Company
        Flags: GC DS LDAP KDC TIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST CLOSE_SITE FULL_SECRET WS DS_8 DS_9
The command completed successfully

Edit: Additional Information...

Results from PortQry from Domain B -> Domain A (GC DC)

TCP port 135  (epmap service):      LISTENING
TCP port 389  (ldap service):       LISTENING
UDP port 389  (unknown service):    LISTENING or FILTERED
TCP port 636  (ldaps service):      LISTENING
TCP port 3268 (msft-gc service):    FILTERED
TCP port 3269 (msft-gc-ssl service):    FILTERED
TCP port 53   (domain service):     NOT LISTENING
UDP port 53   (domain service):     NOT LISTENING
TCP port 88   (kerberos service):   LISTENING
UDP port 88   (kerberos service):   LISTENING or FILTERED
TCP port 445  (microsoft-ds service):   LISTENING
UDP port 137  (netbios-ns service):     LISTENING or FILTERED
UDP port 138  (netbios-dgm service):    LISTENING or FILTERED
TCP port 139  (netbios-ssn service):    LISTENING
TCP port 42   (nameserver service):     FILTERED
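For anyone comparing results like these across several DCs, the definitely-filtered ports can be pulled out with a small parser. This is a rough sketch that assumes the summary line layout pasted above (`TCP port N (name service): STATE`), which may differ from raw PortQry output; `parse_portqry` and `definitely_filtered` are hypothetical helper names.

```python
import re

# Assumed layout: "<TCP|UDP> port <n> (<service>): <STATE>"
LINE_RE = re.compile(r"^(TCP|UDP) port (\d+)\s*\(([^)]*)\):\s*(.+?)\s*$")


def parse_portqry(text):
    """Parse summary lines into (protocol, port, service, state) tuples."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            proto, port, service, state = match.groups()
            results.append((proto, int(port), service, state))
    return results


def definitely_filtered(results):
    """Keep only ports reported as FILTERED.

    The ambiguous UDP result "LISTENING or FILTERED" is deliberately excluded.
    """
    return [(proto, port) for proto, port, _, state in results
            if state == "FILTERED"]


if __name__ == "__main__":
    sample = """\
TCP port 135  (epmap service):      LISTENING
UDP port 389  (unknown service):    LISTENING or FILTERED
TCP port 3268 (msft-gc service):    FILTERED
"""
    print(definitely_filtered(parse_portqry(sample)))
```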
j-Geek
  • After reading the first paragraph - that this *only* happens during off-peak hours... I'd stop there before troubleshooting any further - Active Directory does not behave differently depending on what time of day it is. So what are you doing during off-peak hours that wrecks the network? Is it a scheduled backup job? A script that copies files? A vulnerability scanner that floods the network, triggering SYN attack protection on your router? – Ryan Ries Dec 22 '15 at 23:14
  • The problem surfaced any time there were few users on: nights, evenings, weekends, holidays. As explained above, we were able to track the problem down to the fact that only 1 DC was handling the pass-through authentication to Domain A; all other DCs in domain B were failing with the errors above. The reason it only caused issues at night/weekends is that during the day the high traffic kept the secure channels open to the DC that does work. At night they would time out, resources would create a SC to the other DCs, and it would take a while before they got back to DC-B1. – j-Geek Dec 23 '15 at 02:59
  • From the 2012 R2 DC's in domain B, run nltest /dsgetdc:domainA, and PortQueryUI targeting the 2012 R2 DC's in domain A. – Greg Askew Dec 23 '15 at 17:52
  • I added the dsgetdc test results above. We don't have portqryui on the DCs in question. I tried portqry on ports 135 and 389 and didn't see anything out of place. Is there anything in particular we should be looking for? If portqryUI is a better tool for the job I will get it on one of the DCs. – j-Geek Dec 23 '15 at 18:51
  • PortQueryUI does not require an installation, you can copy and run it. What you are checking for is if any of the advertised RPC services are inaccessible ("filtered"). This should be the first item to check if there are RPC connectivity issues. http://blogs.technet.com/b/askds/archive/2009/01/22/using-portqry-for-troubleshooting.aspx – Greg Askew Dec 23 '15 at 18:55
  • So PortQuery revealed that the GC Ports (3268 and 3269) are filtered from B->A. It looks like the port is open from A->A, A->B, and B->B. We have Global Catalogs in both domains. Is it required to have this port open across domains if both domains have a GC? Also is it possible that this would break NTLM pass through authentication in Server 2012 but not Server 2008? – j-Geek Jan 06 '16 at 13:37
  • Is the port open (use telnet)? Why would it be blocked? – Greg Askew Jan 07 '16 at 15:08
  • Will check to make sure there are no firewall rules blocking this. If not, could a configuration error cause the port to appear filtered/blocked? – j-Geek Jan 08 '16 at 16:51

1 Answer


After taking Greg's advice and focusing on the firewall we found the solution. At some point in the past, a firewall rule had changed and the dynamic port range (49152-65535) was being filtered. Once the network guys added the rule to allow dynamic ports from DOMAIN B to DOMAIN A the issue was completely resolved.
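A quick way to sanity-check a rule like this from one of the DOMAIN B DCs is to probe a few TCP ports in the dynamic range on a DOMAIN A DC. The following is a minimal sketch, not what we actually ran; `probe_tcp` and `sample_dynamic_range` are illustrative names. The idea is that a port with no listener answers with a reset ("closed") when packets get through, while a firewall that silently drops traffic makes the connect attempt time out ("filtered").

```python
import random
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify one TCP port as 'open', 'closed', 'filtered', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"       # handshake completed
    except ConnectionRefusedError:
        return "closed"         # RST came back, so the path is not filtered
    except socket.timeout:
        return "filtered"       # nothing came back before the timeout
    except OSError:
        return "error"          # e.g. name resolution or routing failure


def sample_dynamic_range(host, samples=5, low=49152, high=65535,
                         timeout=3.0, seed=None):
    """Probe a random sample of ports in the dynamic range on host."""
    rng = random.Random(seed)
    ports = rng.sample(range(low, high + 1), samples)
    return {port: probe_tcp(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # "127.0.0.1" is a placeholder; substitute a DC in the other domain.
    for port, state in sorted(sample_dynamic_range("127.0.0.1").items()):
        print(port, state)
```

If every sampled port comes back "filtered", that points to a drop rule somewhere on the path; a mix of "open" and "closed" suggests the range is passing through. Some firewalls reject with a reset instead of dropping, so "closed" is not absolute proof that the range is allowed.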

For some reason, in Server 2008 this would only cause issues at the time the secure channel was being created. It would take 21 seconds (or some multiple of 21 seconds) to create the secure channel. After the secure channel was established, authentication worked fine. The 21-second delay makes sense given how TCP retries a connection attempt that never gets a reply before giving up.
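The 21-second figure can be reproduced with a little arithmetic. This sketch assumes the commonly cited Windows defaults of a 3-second initial retransmission timeout and 2 SYN retransmissions, with the wait doubling after each attempt; those values are an assumption and were not measured on our servers.

```python
def syn_give_up_seconds(initial_rto=3.0, retransmissions=2):
    """Total time before a TCP connect gives up when no SYN is answered.

    The first SYN waits initial_rto seconds, and each retransmitted SYN
    waits twice as long as the one before it.
    """
    return sum(initial_rto * 2 ** i for i in range(retransmissions + 1))


if __name__ == "__main__":
    print(syn_give_up_seconds())  # 3 + 6 + 12 = 21.0
```

Several connection attempts made back to back would then account for the delays that were multiples of 21 seconds.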

In Server 2012 R2, the behavior was different. Regardless of whether the secure channel was established across domains, it would fail to authenticate and break the secure channel to go look for another available domain controller.

I'm not sure why this worked at all in Server 2008... maybe it was defaulting to another port somewhere when it failed to establish a connection in the ephemeral ports?

In any event we've learned a valuable lesson: "This (filtered ports) should be the first item to check if there are RPC connectivity issues" - Greg Askew

j-Geek