QNAP TS-809U: Domain Users / Groups disappear and the server has to be rejoined to the AD Domain

We have a TS-809U that we have joined to the domain. Shares and access rights work as they should with the domain users, and everything is just the way it should be. But after a couple of weeks to a month, the domain users and groups disappear from the TS-809U, and I have to rejoin the domain manually. After rejoining, the process repeats itself within the same timeframe, and I have to rejoin the domain yet again.

There are no errors in the logs in the web interface, and they show the NAS joining the domain successfully. I updated the TS-809U to the latest firmware, 4.0.3 (from 3.x), in the hope that this would solve it, but the problem persists.

Has anyone encountered this before, and does anyone know what the issue could be or how to troubleshoot it further?

The only message I've been able to find in the Event Viewer that references the NAS is an event with ID 5722, which might point in the direction suggested in the comment below:

The session setup from the computer NASC473CD failed to authenticate. The name(s) of the account(s) referenced in the security database is NASC473CD$.
The following error occurred:
Access is denied.

The timing between when the entries disappeared and then re-appeared seems to be 14 days. Our domain is (still) based on Windows Server 2003.

Update

The problem has surfaced again, but the logs didn't really show anything interesting. wbinfo -t (testing the trust secret) did not work, and (unsurprisingly) neither did wbinfo -c (changing the trust secret). I did discover that the current Kerberos 5 ticket store hadn't been refreshed and the Kerberos tickets had expired, which might be connected. I've now added /sbin/update_krb5_ticket to the crontab to see if that helps (the tickets are now refreshed every hour).
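A minimal sketch of the hourly refresh, assuming the common QNAP convention that the persistent crontab is /etc/config/crontab and is reloaded with "crontab /etc/config/crontab" (check both on your firmware). It appends the entry only if it is not already present, so it can safely be run more than once.

```shell
#!/bin/sh
# Sketch: schedule /sbin/update_krb5_ticket hourly on a QNAP.
# Assumption: the persistent crontab is /etc/config/crontab and is
# reloaded with "crontab <file>"; override CRONTAB_FILE if yours differs.
CRONTAB_FILE="${CRONTAB_FILE:-/etc/config/crontab}"
# Minute 0 of every hour.
ENTRY="0 * * * * /sbin/update_krb5_ticket"

add_cron_entry() {
    # Append ENTRY to the crontab file in $1 unless an identical line exists.
    if ! grep -qxF "$ENTRY" "$1" 2>/dev/null; then
        printf '%s\n' "$ENTRY" >> "$1"
    fi
}

# Only touch the real crontab when explicitly asked to:
#   ./add_krb5_cron.sh --install
if [ "${1:-}" = "--install" ]; then
    add_cron_entry "$CRONTAB_FILE" && crontab "$CRONTAB_FILE"
fi
```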

Update 2014-02-25

Still no success. log.wb-DOMAINNAME shows that we're apparently being refused access, probably because of timed-out credentials or invalid secrets. I'm not sure how to proceed, as the Kerberos ticket list (klist) showed a valid ticket when it occurred.

log.wb-DOMAINNAME shows:

[2014/02/25 03:05:20.545176,  3] winbindd/winbindd_pam.c:1902(winbindd_dual_pam_auth_crap)
  could not open handle to NETLOGON pipe (error: NT_STATUS_ACCESS_DENIED)
[2014/02/25 03:05:20.545198,  2] winbindd/winbindd_pam.c:2003(winbindd_dual_pam_auth_crap)
  NTLM CRAP authentication for user [DOMAINNAME]\[MACHINE$] returned NT_STATUS_ACCESS_DENIED (PAM: 4)
[2014/02/25 03:05:20.548424,  3] winbindd/winbindd_pam.c:1841(winbindd_dual_pam_auth_crap)
  [20497]: pam auth crap domain: DOMAINNAME user: MACHINE$

(The same error messages occur when referring to users.) As far as I understand, the issue seems to be that the server responds with ACCESS_DENIED when Samba tries to use the NETLOGON resource. I did, however, discover that one of the DNS servers on the TS-809U was set to an external server rather than a server in the domain. I've updated both DNS servers to point to our AD DCs to see if that could be the reason (if it fails over to the external server, lookups for internal, domain-based hosts return "host not found" instead of timing out).
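A small sketch for catching the DNS misconfiguration described above: it lists every nameserver in a resolv.conf-style file (normally /etc/resolv.conf on Linux) that is not one of your domain controllers. The addresses in DC_IPS are placeholders from the 192.0.2.0/24 documentation range; replace them with your own DCs.

```shell
#!/bin/sh
# Sketch: report nameservers that are not domain controllers.
# DC_IPS is a hypothetical, space-separated list of your DC addresses.
DC_IPS="${DC_IPS:-192.0.2.10 192.0.2.11}"

foreign_nameservers() {
    # Print each "nameserver" address in file $1 that is not listed in DC_IPS.
    awk '$1 == "nameserver" { print $2 }' "$1" | while read -r ns; do
        case " $DC_IPS " in
            *" $ns "*) ;;              # one of our DCs: fine
            *) printf '%s\n' "$ns" ;;  # anything else, e.g. an external resolver
        esac
    done
}

# Usage: ./check_dns.sh /etc/resolv.conf   (prints nothing if all are DCs)
if [ -n "${1:-}" ]; then
    foreign_nameservers "$1"
fi
```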

Update 2015-03-04: automated rejoin script deployed as a workaround.

We're still no closer to a lasting solution, and we're currently seeing timeouts every week. This seems to coincide with how long a Kerberos ticket stays valid, but I've been unable to find any setting that changes it.
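To check whether the failures really follow a fixed weekly cycle, a sketch like the following lists the timestamps of the NETLOGON ACCESS_DENIED entries in log.wb-DOMAINNAME. It assumes the log layout shown in the 2014-02-25 update (a bracketed date/time header line followed by an indented message line); the log's location on the NAS is not stated, so the path is passed in.

```shell
#!/bin/sh
# Sketch: list when winbindd logged "could not open handle to NETLOGON pipe"
# with NT_STATUS_ACCESS_DENIED, using the header/message layout of log.wb-*.
netlogon_denials() {
    awk '
        # Header line "[2014/02/25 03:05:20.545176,  3] ...": keep date and time.
        /^\[/ { t = $2; sub(/,$/, "", t); stamp = substr($1, 2) " " t }
        # Matching message line: print the timestamp of its header.
        /could not open handle to NETLOGON pipe/ && /NT_STATUS_ACCESS_DENIED/ { print stamp }
    ' "$1"
}

# Usage: ./netlogon_denials.sh /path/to/log.wb-DOMAINNAME
if [ -n "${1:-}" ]; then
    # One line per day with a count, so the gaps between failures stand out.
    netlogon_denials "$1" | cut -d' ' -f1 | uniq -c
fi
```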

I have, however, created a small script that checks whether we've lost the domain groups and rejoins the server if needed (using Samba's net rpc join command). "username" is a domain user with permission to join computers to the domain (we created a user for the QNAP for this purpose only):

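#!/bin/sh
# Count the domain groups winbind can currently see (DOMAINNAME is a placeholder).
COUNT=$(wbinfo -g | grep -c DOMAINNAME)

# No domain groups visible means the trust is gone: rejoin the domain.
if [ "$COUNT" -lt 1 ]
then
    /usr/local/samba/bin/net rpc join -Uusername%password
fi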

This script is run on the QNAP from cron (search for "qnap cron" to find out how to set up cron properly). It has worked decently for the last few months.
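A variant of the rejoin script, sketched under assumptions: instead of counting groups it uses wbinfo -t (the trust-secret test that failed when the problem surfaced) as the trigger, and it appends a timestamp to a log file on every rejoin so the interval between failures can be measured. WBINFO, NET and REJOIN_LOG are hypothetical variable names introduced here; the net path is the one from the script above.

```shell
#!/bin/sh
# Sketch: rejoin the domain when the trust secret check fails.
# Assumptions: wbinfo is on PATH, net lives at the path used above, and
# REJOIN_LOG (a hypothetical file; /tmp may not survive a reboot) is writable.
WBINFO="${WBINFO:-wbinfo}"
NET="${NET:-/usr/local/samba/bin/net}"
REJOIN_LOG="${REJOIN_LOG:-/tmp/ad-rejoin.log}"

trust_ok() {
    # wbinfo -t checks the machine account's trust secret against a DC and
    # exits non-zero when that check fails.
    "$WBINFO" -t >/dev/null 2>&1
}

check_and_rejoin() {
    # $1 is the "username%password" credential string passed to net -U.
    if trust_ok; then
        return 0
    fi
    # Record when the failure was seen, then rejoin.
    echo "$(date '+%Y-%m-%d %H:%M:%S') trust check failed, rejoining" >> "$REJOIN_LOG"
    "$NET" rpc join -U"$1"
}

# Only act when credentials are given, e.g. from cron:
#   /path/to/rejoin.sh username%password
if [ -n "${1:-}" ]; then
    check_and_rejoin "$1"
fi
```

As with the original script, passing the password on the command line makes it visible in the process list while net runs.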

fiskfisk

Posted 2014-01-27T12:01:34.713

Reputation: 111

Are you actually joining the domain or just enabling authentication via RADIUS etc? What about logs on a primary DC reporting something to the effect of a failed trust relationship or failed authentications from that device? Another random thought is that it may be dropping the credentials due to your cached credential time (30 days by default). – Andrew M. – 2014-02-03T18:40:44.523

Actually joining the domain. The NAS' account can be seen under Computers in AD, and that's the only account created for the NAS. The NAS might be caching the credentials and not renewing them, although it should be a standard Samba installation. I've added the only error message I could find in the Event Viewer to the Q (5722, which might indicate you're on the right track). If someone has any experience with debugging this from the NAS-side it'd be helpful! – fiskfisk – 2014-02-04T11:32:10.237

Do you have any group policies beyond the standard ones (if unchanged) that apply to the OU the NAS is under? Also, are you registering the NAS with one account (your domain admin account) and then using a different account (one created for the NAS itself) for subsequent authentication, as per best practices? Another thought I have is that it might be trying to authenticate too often and/or timing out every now and then, triggering your account or device lockout limit (I had this happen on our domain with a device once). I wish I had a similar device to test with here... – Andrew M. – 2014-02-04T19:27:49.383

Just had another thought... We had RADIUS auth problems when one of our edge devices got out of time sync for some reason. And, more recently, a problem with DNS. Just some thoughts... – Andrew M. – 2014-02-04T19:32:11.057

Time is in sync, and DNS should be hosted by the same AD server as the PDC. If it's a 14-day interval, I'll find out whether it times out within the next 7 days. :-) I'm fairly certain the QNAP doesn't support separate AD credentials, and I'd think it doesn't need to re-authorize after it has joined the domain again? I'll try to find out whether it only happens on Mondays or something like that, e.g. timing out because there are no authentication requests on weekends. According to rsop, there are account policies applied to the computers group in the forest. Thank you for the help so far! – fiskfisk – 2014-02-05T09:28:12.483

Been thinking about it, and the last ideas I have are: maybe the domain is hanging on to an old registration and you need to purge it before rejoining? Maybe one of the controllers is getting out of sync? Maybe an update broke things? Maybe it's because you're running on a 2003-based domain (I am in the process of moving off one right now myself)? Did it ever function for longer than it does now? Did it have any other issues? Have you considered a firmware downgrade? Just trying to be helpful; you have probably already considered those options though... – Andrew M. – 2014-02-06T21:19:19.413

I've tried most of those options without any result; it has never worked, and we've tried several firmware versions. I have, however, dug a bit more into the Samba configuration and increased the logging level significantly. Hopefully it'll show something useful the next time it happens. For anyone wondering: max log size = 2000 and log level = 5 in /etc/config/smb.conf (a 2 MB log file; level 5 is quite a lot of debug information). – fiskfisk – 2014-02-10T10:16:41.110

I've updated the question twice with a bit more information during the last 14 days. Still no success, although I'm pretty sure it's down to authentication / trust relationships timing out. – fiskfisk – 2014-02-25T09:48:09.707

Answers

Seems like a problem with the machine account password to me. By design, in a 2003 domain the machine account password is reset every 30 days, but the client can trigger a reset of the machine account password whenever it wants.

Normally, the member first creates the new password and then pushes it to the DC.

For whatever reason, it seems that your QNAP is generating a new password after two weeks but is then unable to push it to the DC because of a broken secure channel.

I don't know the features offered by QNAP; can you log on via SSH? I think it's a Unix-based system. Maybe there's an option to disable machine account password changes; if you disable them, the trust won't stop working after those 30 days.
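On the Samba side, the relevant setting is likely the machine password timeout parameter in smb.conf: winbindd changes the machine account password after that many seconds, and the default is 604800 (one week). Setting it to 0 should disable the automatic change, but check the smb.conf man page for the Samba version your firmware ships, and be aware that the QNAP web interface may rewrite /etc/config/smb.conf. The helper below is a hypothetical sketch that sets a parameter in the [global] section of such a file.

```shell
#!/bin/sh
# Sketch: set (or replace) a parameter in the [global] section of an
# smb.conf-style file. Hypothetical helper: it assumes a [global] section
# exists and only matches the parameter spelled exactly as given (Samba
# itself also accepts other spellings, e.g. without spaces).
set_global_param() {
    # $1 = file, $2 = parameter name, $3 = value
    _new="$1.new.$$"
    awk -v name="$2" -v value="$3" '
        # Track whether we are inside [global].
        /^[ \t]*\[/ { in_global = ($0 ~ /^[ \t]*\[global\][ \t]*$/) }
        # Drop any existing setting of this parameter inside [global].
        in_global && $0 ~ ("^[ \t]*" name "[ \t]*=") { next }
        { print }
        # Write the new setting right after the [global] header.
        /^[ \t]*\[global\][ \t]*$/ { print "\t" name " = " value }
    ' "$1" > "$_new" && mv "$_new" "$1"
}

# Usage: ./set_param.sh /etc/config/smb.conf "machine password timeout" 0
if [ $# -eq 3 ]; then
    set_global_param "$1" "$2" "$3"
fi
```

After changing the file, the SMB service has to be restarted for winbindd to pick up the new value.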

Maybe interesting: Link collection:

Berndinox

Posted 2014-01-27T12:01:34.713

Reputation: 121

Yes, this lines up with my suspicions as well. It's Linux-based, using Samba as the SMB implementation. I've increased the logging verbosity to see if we can get more debug information (see my comment on the question if you're wondering which settings do this). – fiskfisk – 2014-02-10T10:17:40.300