
I'm trying to track down a problem where the server stops responding to RDP and SMB connections after it has been idle for about 3 hours (i.e. all users have logged off and only background services are running). The two onboard 1 Gb NICs are teamed in switch-dependent LACP mode, connected to a MikroTik RB2011iL running an 802.3ad bond with the l2+l3 hash policy. Connectivity comes back in one of two ways, seemingly at random:

  1. I log in through the Supermicro KVM interface and enter my credentials. Connectivity restores itself and the server starts responding again, which seems odd to me.
  2. Same as 1, but connectivity returns only after I manually restart the team network interface.
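For context on the l2+l3 hash policy mentioned above: each flow is pinned to one member link by a hash of its addresses, so a fault confined to one member link would typically affect only the flows hashed onto that link rather than every client at once. The sketch below is modeled on the `layer2+3` formula in the Linux kernel bonding documentation. It is an illustration only: whether RouterOS (which is Linux-based) uses exactly this formula is an assumption, the Windows team hashes its own outbound traffic independently, and all MAC/IP values are hypothetical.

```python
import ipaddress


def l2l3_link(src_mac: str, dst_mac: str, src_ip: str, dst_ip: str,
              n_links: int = 2, ethertype: int = 0x0800) -> int:
    """Return the member-link index (0..n_links-1) a frame would be sent on.

    Modeled on the Linux bonding 'layer2+3' formula; the MikroTik
    implementation may differ (assumption).
    """
    # Only the last octet of each MAC address enters the hash.
    smac = int(src_mac.split(":")[-1], 16)
    dmac = int(dst_mac.split(":")[-1], 16)
    h = smac ^ dmac ^ ethertype  # 0x0800 = IPv4 packet type
    # Fold in the full 32-bit source and destination IPv4 addresses.
    h ^= int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    # Mix high bits down so they influence the low bits used by the modulo.
    h ^= h >> 16
    h ^= h >> 8
    return h % n_links


if __name__ == "__main__":
    server_mac, server_ip = "00:25:90:aa:bb:01", "192.168.88.10"  # hypothetical
    clients = [  # hypothetical client MAC/IP pairs
        ("aa:bb:cc:00:00:32", "192.168.88.50"),
        ("aa:bb:cc:00:00:33", "192.168.88.51"),
        ("aa:bb:cc:00:00:34", "192.168.88.52"),
    ]
    for mac, ip in clients:
        print(f"{ip} -> member link {l2l3_link(mac, server_mac, ip, server_ip)}")
```

Because the hash only XORs addresses together, a flow and its reply map to the same link index, and a given client always lands on the same link while the team membership is unchanged.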

It looks as if the server goes to sleep for some reason, yet some services stay up. For example, I can still establish an L2TP connection to the router, which sends a RADIUS request to the server; that request goes through, and the NPS log shows that I authenticated successfully. At that same point, though, RDP is still down.
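One way to pin down exactly when the server stops answering, and whether RDP and SMB drop at the same moment, is to probe their TCP ports from another machine on a schedule and log the results. A minimal sketch, assuming it runs on a separate always-on host; the hostname in the usage note is hypothetical, while 3389 (RDP) and 445 (SMB) are the standard ports. ICMP ping is left out because sending raw ICMP from Python usually needs elevated privileges.

```python
import socket
import time
from datetime import datetime

PORTS = {"rdp": 3389, "smb": 445}  # standard RDP and SMB TCP ports


def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, name resolution failure
        return False


def poll_once(host: str, ports: dict = PORTS) -> dict:
    """Probe every named port once and return {name: reachable}."""
    return {name: probe(host, port) for name, port in ports.items()}


def monitor(host: str, interval_s: float = 60.0, rounds: int = 720) -> None:
    """Print one timestamped status line per round (default: every minute for 12 h)."""
    for _ in range(rounds):
        status = poll_once(host)
        stamp = datetime.now().isoformat(timespec="seconds")
        line = " ".join(f"{k}={'up' if v else 'DOWN'}" for k, v in status.items())
        print(stamp, line, flush=True)
        time.sleep(interval_s)
```

For example, running `monitor("ts01.example.local")` (hypothetical name) overnight and redirecting the output to a file gives a minute-resolution record that can be lined up against the session-logoff and NTP events in Event Viewer.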

Another oddity is that the router doesn't notice when I restart the interface (normally it logs link up/down events). The problem also appeared once before teaming was set up, but it didn't persist for long, so I never found out what happened; now it's back. There are no Event Viewer entries about the interface going down or anything similar; only services like the NTP client start flooding the log with messages that they can't resolve their addresses.

What I've tried so far:

  1. Updating the network drivers to the latest versions available on the Supermicro site
  2. Setting "allow to go to sleep" to disabled
  3. Disabling Energy Efficient Ethernet on both NICs
  4. Rebooting the server

What else can I do to resolve this?

Edit: setting the GPO for session expiration time to "never" seems to be a temporary workaround. With sessions still active, the server didn't fall into its mysterious sleep and stayed accessible as normal. This isn't a full answer to the problem, though; it just changes the question to "why does Windows Server stop responding to RDP/SMB/ping (and probably more) once all user sessions have ended?"

SelfishCrawler
  • Sorry, can you confirm my understanding: NPS requests sent by the router are approved and logged on the server, but the server's internal services complain that they have no network connectivity? What does the port/LACP status on the switch say when the server is non-responsive? – RobbieCrash Aug 18 '21 at 07:12
  • @RobbieCrash Yes: the router lets me authenticate against NPS, and the Windows logs contain an entry saying the NPS server successfully granted the client access (so probably no caching is involved and the authentication is real). Neither the router nor Windows logs anything about port status. I'm not sure what the interface statistics show while the problem is occurring, but since there are no log events, I assume the router thinks the link is still up with no traffic. As I said, even manually restarting the team interface doesn't produce a link-down event on the router side. – SelfishCrawler Aug 18 '21 at 07:58
  • What happens if you unplug one of the teamed NICs? – RobbieCrash Aug 18 '21 at 08:17
  • You mean unplug it physically, or disable it in Windows? @RobbieCrash – SelfishCrawler Aug 18 '21 at 08:19
  • Either would help make sure you haven't got something wonky in your LACP configuration, I guess. But disabling it in Windows wouldn't rule out problems on the server itself. If you unplug one of the NICs while Windows is unresponsive and it comes back, you can focus on your team config. What's resource utilization like on the server when it's non-responsive (CPU specifically)? – RobbieCrash Aug 18 '21 at 08:23
  • @RobbieCrash The thing is that the server is fully responsive, with no drops, during the workday, but around 3 hours after the day ends (I think it happens when the last user session expires), it stops responding to RDP connection attempts, SMB, pings, etc. In the morning I log in through the KVM and it becomes responsive again for the whole day. – SelfishCrawler Aug 18 '21 at 08:30
  • I tried disabling the interfaces in Windows one at a time. The router reports that port's link as down, but the bonded interface stays up and the RDP connection remains responsive when either NIC is disabled. – SelfishCrawler Aug 18 '21 at 08:31
  • Check CPU usage, and what happens if you unplug one NIC when it happens next. – RobbieCrash Aug 18 '21 at 08:38
  • @RobbieCrash CPU usage is zero during that time, since it's a terminal server and no one is using it. Here's what I see in the logs now: the last session expired at 1:56 AM, and at 3:04 AM the NTP client logged a message about 8 failed time-sync attempts, so it takes around an hour for the server to start "sleeping". – SelfishCrawler Aug 18 '21 at 08:46
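The timeline in the last comment is easy to track across several nights to see whether the delay is consistent: from the last session expiring at 1:56 AM to the NTP failure message at 3:04 AM is 68 minutes, and since that message already reports 8 failed attempts, the loss of connectivity likely began some time before 3:04. A small helper, assuming the times are read off Event Viewer by hand in 24-hour "HH:MM" form:

```python
from datetime import datetime, timedelta


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes from start to end (24-hour "HH:MM"), allowing one midnight crossing."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M")
    if end < start:  # e.g. 23:30 -> 00:40 spans midnight
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    # (last session expired, first NTP sync-failure message), one pair per night.
    nights = [("01:56", "03:04")]  # values from the comment above; append more nights
    for logoff, ntp_error in nights:
        print(f"{logoff} -> {ntp_error}: {minutes_between(logoff, ntp_error)} min")
```

With the single night above it prints `01:56 -> 03:04: 68 min`.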
