radius authentication -- spiking load every two hours on the hour

Question

We're using FreeRADIUS and winbindd to authenticate our eduroam Wi-Fi users against the Active Directory domain.

This works like a charm, but we get load spikes of 30 and more almost every two hours on the hour (during working hours, at 10:00, 12:00 and 14:00).

In our mschap config we're invoking a script:

    ntlm_auth = "wrap_ntlm_auth.pl challenge %{%{Stripped-User-Name}:-%{%{User-Name}:-None}} %{mschap:Challenge} %{mschap:NT-Response} %{%{Calling-Station-ID}:-none}"

This script basically implements blocking mechanisms and mapping of usernames. In the end it invokes ntlm_auth, which talks to winbindd:

    ntlm_auth --use-cached-creds --username='%s' --password='%s'

or

    ntlm_auth --use-cached-creds --request-nt-key --username='%s' --challenge='%s' --nt-response='%s'

depending on whether a password or challenge authentication is being used.
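A minimal sketch of the dispatch described above, written in Python rather than the Perl of the real wrapper. The function name `build_ntlm_auth_argv` and its parameters are hypothetical; only the `ntlm_auth` options are taken from the two command lines shown. The blocking and username-mapping logic is not reproduced here.

```python
def build_ntlm_auth_argv(username, password=None, challenge=None, nt_response=None):
    """Return the ntlm_auth argument vector for one authentication attempt.

    Hypothetical helper: it only mirrors the two invocations quoted above.
    """
    argv = ["ntlm_auth", "--use-cached-creds"]

    if password is not None:
        # Password authentication: pass the cleartext password.
        return argv + [f"--username={username}", f"--password={password}"]

    if challenge is not None and nt_response is not None:
        # Challenge authentication: pass the MS-CHAP challenge and NT response
        # and ask winbindd to return the NT key.
        return argv + [
            "--request-nt-key",
            f"--username={username}",
            f"--challenge={challenge}",
            f"--nt-response={nt_response}",
        ]

    raise ValueError("need either a password or a challenge/NT-response pair")
```

Handing such a list to `subprocess.run(argv, capture_output=True, text=True)` avoids going through a shell, so no `'%s'` quoting of the user-supplied values is needed.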

We're absolutely baffled as to why the authentication process would take considerably longer every two hours on the hour. Are there glaring errors or missing optimizations in our setup?

smb.conf for winbindd:

    [global]
    workgroup = DOMAIN
    server string = %h server
    dns proxy = no
    log file = /var/log/samba/log.%m
    max log size = 1000
    # winbind offline logon = yes
    panic action = /usr/share/samba/panic-action %d
    security = ads
    encrypt passwords = true
    passdb backend = tdbsam
    obey pam restrictions = yes
    realm = domain.de
    ntlm auth = no
    lanman auth = no
    client ntlmv2 auth = yes
    winbind max clients = 400
    winbind max domain connections = 400
    password server = *
    log level = 1 #winbind:3 auth:3

vmstat 1 output:

   procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
    10  0 197140 220856  47372 315532    0    1    14   118    8    5 15  4 81  0  0
     8  0 197124 247660  47372 315524    8    0     8     0  966 2305 77 19  4  0  0
    20  0 197124 185552  47380 315504    0    0     0    20  870 1814 81 19  0  0  0
    23  0 197112 171336  47384 315540    0    0     0    40  886 2120 81 19  0  0  0
    28  0 197108 162388  47384 315540    4    0     4     0  909 2105 80 20  0  0  0
    31  0 197100 133700  47384 315552    4    0     4    40  747 1825 81 19  0  0  0
    27  0 197096 126248  47388 315648    4    0     4    12  594 1642 86 14  0  0  0
    21  0 197088 144092  47396 315848    8    0     8    52  843 2398 81 19  0  0  0
    15  0 197084 188100  47404 315764    4    0     4    28  841 2176 82 18  0  0  0
     7  0 197084 227992  47404 315680    0    0     0     0  792 2226 82 18  0  0  0
     3  0 197084 253444  47404 315656    0    0     0    16  827 2033 63 18 19  0  0
     3  0 197060 258232  47404 315712    0    0     0     0  743 1764 57 13 30  0  0
     6  0 197060 234608  47412 315712    0    0     0    16  833 2009 83 17  0  0  0
    16  0 197056 205704  47420 315744    4    0     4    32  921 2069 86 14  0  0  0
     1  0 197052 260584  47420 315760    4    0     4     0  684 2086 78 19  3  0  0
     4  0 197052 236652  47420 315772    0    0     0     8  793 1792 67 12 21  0  0
     6  0 197052 228844  47420 315788    0    0     0     0  834 2094 78 22  0  0  0
     1  0 197052 262472  47420 315788    0    0     0     4  771 2055 71 19 10  0  0
     1  0 197052 268048  47432 315792    0    0     0    40  777 1874 59 16 25  0  0
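To compare spikes with quiet periods, output like the above can be condensed into a few numbers. The following is a rough sketch, assuming the 17-column `vmstat` layout shown here; the function name `summarize_vmstat` and the choice of summary fields are illustrative assumptions, not part of the setup.

```python
def summarize_vmstat(text):
    """Summarize run-queue length and CPU split from `vmstat 1` output.

    Assumes the column order shown above:
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows are exactly 17 integer columns; header lines are skipped.
        if len(fields) == 17 and all(f.isdigit() for f in fields):
            rows.append([int(f) for f in fields])

    if not rows:
        raise ValueError("no vmstat data rows found")

    n = len(rows)
    return {
        "samples": n,
        "max_runnable": max(r[0] for r in rows),                 # column r
        "avg_busy_pct": sum(r[12] + r[13] for r in rows) / n,    # us + sy
        "avg_iowait_pct": sum(r[15] for r in rows) / n,          # wa
    }
```

In the sample above, the `wa` column is 0 in every row and `r` peaks at 31, so a summary of this kind makes it easier to check whether the processes are runnable and using CPU or blocked on I/O.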

Are you sure the RADIUS authentication is causing the load spikes? — Lenniey, Jan 11 '17 at 12:23
Yes, definitely, there's nothing else running on the box. When the load spike occurs, htop displays quite a few wrap_ntlm_auth.pl processes (obviously waiting for winbindd to respond). — Ralf Hildebrandt, Jan 11 '17 at 12:29
Freeradius uses about 12% CPU, each wrap_ntlm_auth.pl process about 2.8% - so I think the system is mostly waiting. Disk IO isn't big, either. — Ralf Hildebrandt, Jan 11 '17 at 12:55
Could you monitor your processes at the time of the spikes via performance monitor or in realtime via Sysinternals ProcExp or something? — Lenniey, Jan 13 '17 at 08:54
Sorry, of course (my brain made the AD<->Windows connection...). — Lenniey, Jan 13 '17 at 09:58
I'd recommend troubleshooting further on the DC. Given that you're still using NTLM for authentication, the first thing to check would probably be semaphore timeouts, especially if you have quite a large environment or multiple child domains where authentication needs to be passed between DCs. Check [here](https://blogs.technet.microsoft.com/askds/2011/09/15/is-this-horse-dead-yet-ntlm-bottlenecks-and-the-rpc-runtime/) and [here](https://social.technet.microsoft.com/wiki/contents/articles/9759.configuring-maxconcurrentapi-for-ntlm-pass-through-authentication.aspx) for details. — Martin Lhotsky, Jan 28 '17 at 23:33
