We have a customer with a Windows Server 2016 domain controller. It's a small business so their server infrastructure consists of a Hyper-V host and this DC. The DC hosts file shares and Azure AD Connect for syncing identity with Office 365.
We monitor for event ID 4625 and have an alerting threshold to help us identify potential brute force attacks against the network.
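For readers unfamiliar with this kind of alert, here is a minimal sketch of a sliding-window threshold check. It is not our actual monitoring tool's logic; the function name `threshold_exceeded`, the limit of 10, the 5-minute window, and the sample timestamps are all assumptions made for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


def threshold_exceeded(timestamps, limit, window):
    """Return the first time more than `limit` failures fall inside `window`, else None."""
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop failures that have aged out of the window ending at ts.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) > limit:
            return ts
    return None


# Hypothetical burst: one failed logon every 10 seconds, 12 in total.
start = datetime(2024, 1, 1, 2, 0, 0)
burst = [start + timedelta(seconds=10 * i) for i in range(12)]

first_alert = threshold_exceeded(burst, limit=10, window=timedelta(minutes=5))
print(first_alert)
```

A short burst like the one a backup job or sync cycle produces is enough to trip a check of this shape, even though no attacker is involved.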
In October of last year we began receiving alerts that the failed logon alert threshold had been exceeded. Upon investigation we have the following description of the problem:
- The failed logon events occur organically whenever our VSS (Datto) backups run or whenever AADC syncs
- Backups succeed and AADC syncs normally with no errors
- There are two accounts that fail to log on:
- SERVERNAME$ (i.e. the SYSTEM account) whenever the backups run
- AAD_* whenever AADC runs
- The event STATUS is 0xC000006D - failed username or password
- The event SUB STATUS is 0x0 - Status OK
- The SYSTEM logon failure can be easily replicated by running
vssadmin list writers
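The Status and Sub Status values above can be decoded with a small lookup. This is a sketch only: the table holds a handful of well-known NTSTATUS codes rather than the full set, and the helper name `describe` is my own. Normally the sub status narrows down the reason (wrong password, no such user, and so on); a sub status of 0x0 gives no extra detail, which is part of what makes these events hard to diagnose.

```python
# A few well-known NTSTATUS values that appear in event 4625 (not exhaustive).
NTSTATUS_NAMES = {
    0x00000000: "STATUS_SUCCESS",
    0xC0000064: "STATUS_NO_SUCH_USER",
    0xC000006A: "STATUS_WRONG_PASSWORD",
    0xC000006D: "STATUS_LOGON_FAILURE",
    0xC000006F: "STATUS_INVALID_LOGON_HOURS",
    0xC0000070: "STATUS_INVALID_WORKSTATION",
    0xC0000071: "STATUS_PASSWORD_EXPIRED",
    0xC0000072: "STATUS_ACCOUNT_DISABLED",
    0xC0000193: "STATUS_ACCOUNT_EXPIRED",
    0xC0000234: "STATUS_ACCOUNT_LOCKED_OUT",
}


def describe(status, sub_status):
    """Map hex strings from the event to names; flag a sub status with no detail."""
    codes = [int(status, 16), int(sub_status, 16)]
    names = [NTSTATUS_NAMES.get(c, "unknown (0x%08X)" % c) for c in codes]
    no_detail = codes[1] == 0  # 0x0 means the sub status adds nothing
    return names[0], names[1], no_detail


print(describe("0xC000006D", "0x0"))
```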
The list of troubleshooting steps tried over the last several months is long. This is not a comprehensive list:
- Uninstall RRAS and WID (including deleting WID folders to ensure permissions are set correctly when roles are reinstalled)
- Clearing the SYSTEM credential cache (with psexec & rundll32 keymgr.dll,KRShowKeyMgr - no credentials were cached)
- Logging in to WID with SQL Server Management Studio and verifying database permissions (for LOCALSERVICE and NETWORKSERVICE accounts)
- Adjusting permissions on various registry keys (this did get rid of unrelated CAPI2 & WIDWRITER errors in the application event log)
- Running DCDIAG & reviewing application event logs and clearing up any errors & warnings (including DNS warnings, adding SPNs, and reregistering AD DNS entries, and running a D4 authoritative restore of DFSR to clear up warnings from the server migration)
- Monitoring with sysinternals ProcessMonitor to identify any access denied or other errors (this got me on to adjusting permissions on folders & registry entries to make sure that both LOCALSERVICE and NETWORKSERVICE (the service accounts running the WIDWRITER & other VSS services) had access)
- Verifying service startup type and logon accounts for VSS services & writers, AADC, etc.
- Stopping services and running tests (stopping AADC sync service resolves all logon failures. That's how I narrowed it down to AADC)
- Uninstalling AADC
- Run a repair on SQL Express Local DB
- Calling Microsoft Support using our MS partnership support contract - who said "there's no loss of functionality and you don't have an actual user who can't log on, so we can't help you, sign up for Premier support!" (I'm sorry, does SYSTEM not count as an important user????)
- Banging my head against several walls and many other things
A couple of useful things learned during all of this:
- When uninstalling and reinstalling AADC while running
vssadmin list writers
continually during the installation process, the errors begin immediately after the SQL components are installed, before the installer has even finished running.
- When SSMS is installed and I log in to a database, I also get failed logon events for the domain admin account I logged in with, though my SSMS session seems unaffected.
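To pin the first failure to an installer step, the timestamps of the 4625 events can be lined up against a log of installation milestones. A rough sketch of that correlation follows; the milestone labels and times are made up for illustration, and `last_milestone_before` is a hypothetical helper, not part of any tool.

```python
from bisect import bisect_right
from datetime import datetime


def last_milestone_before(milestones, when):
    """milestones: list of (timestamp, label) sorted by timestamp."""
    times = [t for t, _ in milestones]
    i = bisect_right(times, when)  # count of milestones at or before `when`
    return milestones[i - 1][1] if i else None


# Hypothetical installer timeline and first failed logon.
milestones = [
    (datetime(2024, 1, 1, 10, 0, 0), "installer started"),
    (datetime(2024, 1, 1, 10, 4, 0), "SQL components installed"),
    (datetime(2024, 1, 1, 10, 9, 0), "sync service installed"),
]
first_failure = datetime(2024, 1, 1, 10, 4, 30)

step = last_milestone_before(milestones, first_failure)
print(step)
```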
The problem is clearly related to AADC because I can stop the AAD sync service or uninstall AADC and all failed logon events go away. But uninstalling AADC, deleting the AADC folders, cleaning out the AADC user accounts, and clearing the AADC registry entries to try to get a truly fresh install has no effect: the errors return immediately when I reinstall AADC.
At this point I'm at my wits' end and I don't know what else to do or where else to even look. I'm hoping someone out there in the aether knows more than I do (likely) or has experienced this before and found a fix.
One final note - the server's DNS name is 9 characters long, meaning that it does not match its NETBIOS name. I don't think this is the cause, but if necessary I can rename the server. It's just a bit of a headache to do for an in-production DC & file server.