The following problem can be viewed as a CIFS/AD related problem (specific view) or as a question about service restarts, error handling and log parsing (general view). I will present both areas here, but would be happy to receive answers on any of them (just skip the parts you are not interested in).
Specific situation: idmap does not periodically rescan for domain controllers
In a Windows Server 2008 compatible Active Directory there are usually multiple domain controllers for high availability. If all of those servers are unavailable at the same time and an OmniOS (r151018) file server with active kernel SMB/CIFS server (that is successfully joined to the domain and works as expected) boots up, the following happens:
The idmap
service tries to reach a DC for 60 seconds and then gives up...
root@omnios:/root# tail -n 20 /var/svc/log/system-idmap:default.log
@ Tue Sep 6 10:19:42 2016
Global Catalog servers not configured/discoverable
Domain controller servers not configured/discoverable
created thread ID 3 - 1 threads currently active
[ Sep 6 10:19:42 Method "start" exited with status 0. ]
@ Tue Sep 6 10:19:53 2016
created thread ID 4 - 2 threads currently active
getdcname wait begin
@ Tue Sep 6 10:19:57 2016
DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
@ Tue Sep 6 10:20:08 2016
getdcname timeout
@ Tue Sep 6 10:20:12 2016
DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
@ Tue Sep 6 10:20:27 2016
DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
@ Tue Sep 6 10:20:42 2016
DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
Domain discovery took 60 sec.
Check the DNS configuration.
... but does not fail critically:
root@omnios:/root# svcs -xv idmap
svc:/system/idmap:default (Native Identity Mapping Service)
State: online since Tue Sep 6 10:19:42 2016
See: man -M /usr/share/man -s 1M idmapd
See: man -M /usr/share/man -s 1M idmap
See: /var/svc/log/system-idmap:default.log
Impact: None.
After that, smbd
complains (rightfully) in the syslog every minute that it cannot find the DC:
smbd[525]: [ID 510351 daemon.notice] smb_locate_dc status 0xc0000233
smbd[525]: [ID 199031 daemon.notice] smbd_dc_update: testdomain.intranet: locate failed
This persists even after the DC is back online and reachable. It is worked around instantly by restarting idmap
with svcadm restart idmap
. Of course, as those outages may happen without planning, it should not have to be done by hand.
- How could the
idmap
restart be scripted to automatically occur on those events? I did try to use the SMF, but it seems that this only works for crashed services, whileidmap
reports no problems (andsmbd
only reports notices). Another possibility would be to monitor the logfiles constantly and grep for the events, but this seems inefficient to me. I also tried to reduce theconfig/rediscovery_interval
value to 60 seconds but it seems it is ignored (or not applicable here). - Alternatively, what might be the solution that removes the problem itself? Unfortunately, I did not find anything usable except a posting that acknowledges how a full restart solves the problem (as
idmap
is also restarted there).
Edit: Output of svccfg -s idmap listprop
- only thing I did change was config/rediscovery_interval
(default 3600), IDs have been removed afterwards by hand.
config application
config/id_cache_timeout count 86400
config/list_size_limit count 0
config/name_cache_timeout count 604800
config/preferred_dc astring
config/stability astring Unstable
config/use_ads boolean true
config/use_lsa boolean true
config/value_authorization astring solaris.smf.value.idmap
config/machine_uuid astring [...]
config/machine_sid astring [...]
config/rediscovery_interval count 60
config/domain_name astring testdomain.intranet
debug application
debug/all integer 0
debug/config integer 0
debug/discovery integer 0
debug/dns integer 0
debug/ldap integer 0
debug/mapping integer 0
debug/stability astring Unstable
debug/value_authorization astring solaris.smf.value.idmap
rpcbind dependency
rpcbind/entities fmri svc:/network/rpc/bind
rpcbind/grouping astring require_all
rpcbind/restart_on astring restart
rpcbind/type astring service
filesystem-minimal dependency
filesystem-minimal/entities fmri svc:/system/filesystem/minimal
filesystem-minimal/grouping astring require_all
filesystem-minimal/restart_on astring error
filesystem-minimal/type astring service
manifestfiles framework
manifestfiles/lib_svc_manifest_system_idmap_xml astring /lib/svc/manifest/system/idmap.xml
general framework
general/action_authorization astring solaris.smf.manage.idmap
general/entity_stability astring Unstable
general/single_instance boolean true
general/value_authorization astring solaris.smf.manage.idmap
start method
start/exec astring /usr/lib/idmapd
start/timeout_seconds count 60
start/type astring method
stop method
stop/exec astring :kill
stop/timeout_seconds count 60
stop/type astring method
refresh method
refresh/exec astring ":kill -HUP"
refresh/timeout_seconds count 60
refresh/type astring method
tm_common_name template
tm_common_name/C ustring "Native Identity Mapping Service"
tm_man_idmapd1M template
tm_man_idmapd1M/manpath astring /usr/share/man
tm_man_idmapd1M/section astring 1M
tm_man_idmapd1M/title astring idmapd
tm_man_idmap1M template
tm_man_idmap1M/manpath astring /usr/share/man
tm_man_idmap1M/section astring 1M
tm_man_idmap1M/title astring idmap
General problem: how to efficiently react to syslog messages if the process seems to be running normally?
This problem could be generalized to the question of how to watch logfiles in the most efficient manner on Solaris. I've searched and came across several tools like swatch
, logsurfer
, logwatcher
, or the plain old cron job executed every minute and connected to a simple script that reads dmesg
output.
- Are these the only possible ways to do that or is there a better solution?
- Is there a way to tell the SMF that certain notices on some processes should be met with actions, even if no critical condition has occurred?
- I have stumbled upon the fault manager FMA, but it seems that it also only works on critical conditions, not on mere notices (or any user-specifiable strings). Is this correct?
- If it is the only way, what would you suggest using and why?