Random SSH connection timeout on first try to CentOS servers

Question

I have a curious issue with SSH on CentOS 6 and haven't found a solution yet.

All of our CentOS 6 servers are joined to an Active Directory 2012 R2 domain using Winbind. Winbind is not used for file sharing, only for single sign-on and group-based access. Most of the time an initial SSH connection from Ansible or PuTTY works, whether it is key-based or password-based. Sometimes, however, the login fails and the connection times out. This does not target any one server; it happens randomly throughout the environment. All servers are patched to the same updates and rebooted each month, so they all run the same versions.

I've tried the obvious answers I could find related to this:

Set "UseDNS no", "AddressFamily inet", "GSSAPIAuthentication no" in /etc/ssh/sshd_config. No change in behavior.
Set "LogLevel DEBUG2" in /etc/ssh/sshd_config. I don't see any warnings or errors in the output.
Set "options single-request-reopen" in /etc/resolv.conf. No change in behavior.
For Winbind, "wbinfo -u", "wbinfo -g", "getent passwd", etc. all work just fine.
Increased the verbosity of the Samba logging, but haven't found anything in the logs that points me in the right direction.
Watched the Windows Event logs on the domain controllers, and don't see anything showing up there either.
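
The Winbind checks above show that lookups succeed, but not how long each one takes. A minimal sketch for timing a single passwd lookup through NSS (the same path "getent passwd" uses), assuming Python 3 is available on the server, which is not the CentOS 6 default. The user name "someaduser" is a placeholder, not a real account:

```python
import pwd
import time


def time_nss_lookup(username):
    """Return (elapsed_seconds, found) for one passwd lookup through NSS."""
    start = time.monotonic()
    try:
        pwd.getpwnam(username)  # resolved via nsswitch.conf, including winbind if configured
        found = True
    except KeyError:
        found = False
    return time.monotonic() - start, found


if __name__ == "__main__":
    # "someaduser" is a hypothetical AD account name; replace it with a real one.
    for name in ("root", "someaduser"):
        elapsed, found = time_nss_lookup(name)
        print(f"{name}: found={found} in {elapsed:.3f}s")
```

Run repeatedly (for example from cron), this might show whether an occasional slow lookup lines up with the failed logins. A local account such as root gives a baseline to compare against.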

Prior to Windows 2012 R2, the domain was Windows 2008 and we had the same problem. Users in AD have all the proper UNIX attributes set.

With PuTTY we experience the same problem as with Ansible. The Ansible server is on the same LAN as all the servers. Reverse DNS works just fine, and all servers are pingable. Services on the servers are always responding. It's just that the first connection sometimes doesn't work. No prompt is shown. It is almost like SSH is "sleeping". It has become more of a nuisance with Ansible and trying to automate various things.
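
One way to narrow down where the first connection stalls is to time the TCP connect and the arrival of the server's SSH identification string (the "SSH-2.0-..." line sent before any authentication) separately. The following is only a sketch under assumptions: it needs Python 3, the function name probe_ssh_banner is made up for illustration, and the host name in the usage section is a placeholder:

```python
import socket
import time


def probe_ssh_banner(host, port=22, timeout=10.0):
    """Time the TCP connect and the first bytes sent by the SSH server."""
    result = {"host": host, "connect_s": None, "banner_s": None,
              "banner": None, "error": None}
    start = time.monotonic()
    try:
        # create_connection applies the timeout to the connect and to later reads.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            result["connect_s"] = time.monotonic() - start
            data = sock.recv(256)  # server identification string, e.g. "SSH-2.0-..."
            result["banner_s"] = time.monotonic() - start
            result["banner"] = data.decode("ascii", "replace").strip()
    except OSError as exc:  # includes timeouts, refused connections, DNS failures
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


if __name__ == "__main__":
    # Placeholder host list; replace with the real inventory.
    for host in ["server01.company.local"]:
        print(probe_ssh_banner(host))
```

If the connect is fast but the identification string is late or never arrives, the stall would be in sshd before authentication starts. If both are fast, the delay is more likely later in the login, for example in PAM or Winbind. Looping this over every host on a schedule could also show which servers are affected and when.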

I'm at a loss here as to how to troubleshoot this further.

Does anyone have a suggestion that might be helpful? Posting my configs in case they prove useful.

/etc/samba/smb.conf

[global]
    workgroup = COMPANY
    netbios name = SERVER01
    password server = dc01.company.local dc02.company.local
    realm = COMPANY.LOCAL
    security = ads
    smb encrypt = yes
    template shell = /bin/bash
    template homedir = /home/%U
    winbind nss info = rfc2307
    winbind use default domain = true
    winbind offline logon = false
    winbind enum users = yes
    winbind enum groups = yes
    idmap config *:backend = tdb
    idmap config *:range = 1000000-1999999
    idmap config COMPANY:backend = ad
    idmap config COMPANY:default = yes
    idmap config COMPANY:range = 2048-999999
    idmap config COMPANY:schema_mode = rfc2307
    server string = Samba Server
    log file = /var/log/samba/log.%m
    max log size = 5000
    log level = 4
    passdb backend = tdbsam
    load printers = no
    printcap name = /dev/null
    disable spoolss = yes
[homes]
    comment = Home Directories
    browseable = no
    writable = no

/etc/ssh/sshd_config

AddressFamily inet
Protocol 2
SyslogFacility LOCAL6
LogLevel DEBUG3
LoginGraceTime 60
PermitRootLogin no
PermitEmptyPasswords no
PasswordAuthentication yes
ChallengeResponseAuthentication no
GSSAPIAuthentication no
UsePAM yes
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
AcceptEnv LC_IDENTIFICATION LC_ALL LANGUAGE
AcceptEnv XMODIFIERS
X11Forwarding no
UseDNS no
Banner /etc/ssh/sshd-banner
Subsystem       sftp    /usr/libexec/openssh/sftp-server
AllowGroups linuxadmins otheradmin 
MACS hmac-sha2-256,hmac-sha1,hmac-sha2-512

Thanks!

Update

All the CentOS servers and Domain Controllers are pointed at NTP servers, so time is synchronized across everything.

Just an idea: could it be that someone is trying to brute-force SSH access at the time you have those timeouts? Can you check your logs (/var/log/secure)? — Hrvoje Špoljar, Feb 22 '16 at 20:59
Good suggestion, but none of these servers are accessible from the Internet. The only failures in the logs are for users with valid accounts who experience the timeout. — titleistfour, Feb 22 '16 at 21:10
A server-side strace of the ssh process and all its forks, plus a tcpdump of all traffic between the two systems interacting, would be my starting point ... — Hrvoje Špoljar, Feb 29 '16 at 23:02
Would love to have that, except we never know which server this will affect, so it is difficult to know where to run the strace and tcpdump. — titleistfour, Mar 02 '16 at 23:24
