How to automatically restart Opensolaris/OmniOS service/daemon that logs notices, but not critical errors to syslog?

Question

The following problem can be viewed as a CIFS/AD related problem (specific view) or as a question about service restarts, error handling and log parsing (general view). I will present both areas here, but would be happy to receive answers on any of them (just skip the parts you are not interested in).

Specific situation: idmap does not periodically rescan for domain controllers

In a Windows Server 2008 compatible Active Directory there are usually multiple domain controllers for high availability. If all of those servers are unavailable at the same time and an OmniOS (r151018) file server with active kernel SMB/CIFS server (that is successfully joined to the domain and works as expected) boots up, the following happens:

The idmap service tries to reach a DC for 60 seconds and then gives up...

root@omnios:/root# tail -n 20 /var/svc/log/system-idmap:default.log
@ Tue Sep  6 10:19:42 2016
Global Catalog servers not configured/discoverable
Domain controller servers not configured/discoverable
created thread ID 3 - 1 threads currently active
[ Sep  6 10:19:42 Method "start" exited with status 0. ]
@ Tue Sep  6 10:19:53 2016
created thread ID 4 - 2 threads currently active
getdcname wait begin
@ Tue Sep  6 10:19:57 2016
DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
@ Tue Sep  6 10:20:08 2016
getdcname timeout
@ Tue Sep  6 10:20:12 2016
DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
@ Tue Sep  6 10:20:27 2016
DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
@ Tue Sep  6 10:20:42 2016
DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
Domain discovery took 60 sec.
Check the DNS configuration.

... but does not fail critically:

root@omnios:/root# svcs -xv idmap
svc:/system/idmap:default (Native Identity Mapping Service)
 State: online since Tue Sep  6 10:19:42 2016
   See: man -M /usr/share/man -s 1M idmapd
   See: man -M /usr/share/man -s 1M idmap
   See: /var/svc/log/system-idmap:default.log
Impact: None.

After that, smbd complains (rightfully) in the syslog every minute that it cannot find the DC:

smbd[525]: [ID 510351 daemon.notice] smb_locate_dc status 0xc0000233
smbd[525]: [ID 199031 daemon.notice] smbd_dc_update: testdomain.intranet: locate failed

This persists even after the DC is back online and reachable. It is worked around instantly by restarting idmap with svcadm restart idmap. Of course, as those outages may happen without planning, it should not have to be done by hand.

How could the idmap restart be scripted to automatically occur on those events? I did try to use the SMF, but it seems that this only works for crashed services, while idmap reports no problems (and smbd only reports notices). Another possibility would be to monitor the logfiles constantly and grep for the events, but this seems inefficient to me. I also tried to reduce the config/rediscovery_interval value to 60 seconds but it seems it is ignored (or not applicable here).
Alternatively, what might be the solution that removes the problem itself? Unfortunately, I did not find anything usable except a posting that acknowledges how a full restart solves the problem (as idmap is also restarted there).

Edit: Output of svccfg -s idmap listprop - only thing I did change was config/rediscovery_interval (default 3600), IDs have been removed afterwards by hand.

config                                           application
config/id_cache_timeout                          count    86400
config/list_size_limit                           count    0
config/name_cache_timeout                        count    604800
config/preferred_dc                              astring
config/stability                                 astring  Unstable
config/use_ads                                   boolean  true
config/use_lsa                                   boolean  true
config/value_authorization                       astring  solaris.smf.value.idmap
config/machine_uuid                              astring  [...]
config/machine_sid                               astring  [...]
config/rediscovery_interval                      count    60
config/domain_name                               astring  testdomain.intranet
debug                                            application
debug/all                                        integer  0
debug/config                                     integer  0
debug/discovery                                  integer  0
debug/dns                                        integer  0
debug/ldap                                       integer  0
debug/mapping                                    integer  0
debug/stability                                  astring  Unstable
debug/value_authorization                        astring  solaris.smf.value.idmap
rpcbind                                          dependency
rpcbind/entities                                 fmri     svc:/network/rpc/bind
rpcbind/grouping                                 astring  require_all
rpcbind/restart_on                               astring  restart
rpcbind/type                                     astring  service
filesystem-minimal                               dependency
filesystem-minimal/entities                      fmri     svc:/system/filesystem/minimal
filesystem-minimal/grouping                      astring  require_all
filesystem-minimal/restart_on                    astring  error
filesystem-minimal/type                          astring  service
manifestfiles                                    framework
manifestfiles/lib_svc_manifest_system_idmap_xml  astring  /lib/svc/manifest/system/idmap.xml
general                                          framework
general/action_authorization                     astring  solaris.smf.manage.idmap
general/entity_stability                         astring  Unstable
general/single_instance                          boolean  true
general/value_authorization                      astring  solaris.smf.manage.idmap
start                                            method
start/exec                                       astring  /usr/lib/idmapd
start/timeout_seconds                            count    60
start/type                                       astring  method
stop                                             method
stop/exec                                        astring  :kill
stop/timeout_seconds                             count    60
stop/type                                        astring  method
refresh                                          method
refresh/exec                                     astring  ":kill -HUP"
refresh/timeout_seconds                          count    60
refresh/type                                     astring  method
tm_common_name                                   template
tm_common_name/C                                 ustring  "Native Identity Mapping Service"
tm_man_idmapd1M                                  template
tm_man_idmapd1M/manpath                          astring  /usr/share/man
tm_man_idmapd1M/section                          astring  1M
tm_man_idmapd1M/title                            astring  idmapd
tm_man_idmap1M                                   template
tm_man_idmap1M/manpath                           astring  /usr/share/man
tm_man_idmap1M/section                           astring  1M
tm_man_idmap1M/title                             astring  idmap

General problem: how to efficiently react to syslog messages if the process seems to be running normally?

This problem could be generalized to the question of how to watch logfiles in the most efficient manner on Solaris. I've searched and came across several tools like swatch, logsurfer, logwatcher, or the plain old cron job executed every minute and connected to a simple script that reads dmesg output.

Are these the only possible ways to do that or is there a better solution?
- Is there a way to tell the SMF that certain notices on some processes should be met with actions, even if no critical condition has occurred?
- I have stumbled upon the fault manager FMA, but it seems that it also only works on critical conditions, not on mere notices (or any user-specifiable strings). Is this correct?
If it is the only way, what would you suggest using and why?

What properties exist for the `idmap` service on your install? Something in the output of `svccfg -s idmap listprop` might be useful, such as changing the retry time from 60 seconds to something that causes the service to retry indefinitely - or at least some *very* long time. — Andrew Henle, Sep 07 '16 at 09:23
@AndrewHenle thanks for your suggestion, I've updated the question. The values should be fairly default, as this was a new VM for testing purposes. — user121391, Sep 07 '16 at 10:11
The source code that's doing the discovery is here: http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/idmap/idmapd/idmap_config.c#1960 According to that, you should be getting a log file `/var/run/idmap/discovery.log`. The code that actually does the discovery is at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libadutils/common/addisc.c#1205 It looks like all the work is done in `libadutils.so`. — Andrew Henle, Sep 07 '16 at 12:15
Are you getting anything in `/var/run/idmap/discovery.log`? From browsing the source code, it appears that you should be getting log entries there. I also noted that there's a dependency on DNS. Are you using your domain controllers as your name servers also? So when the domain controllers are down you also don't have DNS servers? I'm guessing that `libadutils.so` is designed for internal-use only as it appear to be completely undocumented. — Andrew Henle, Sep 08 '16 at 18:32
@AndrewHenle Sorry for the late reply, I forgot. The log only mentions about 4 times that DNS discovery was unsuccessful while the DC was down (I did find the corresponding part in the source). Yes, DNS is on the DC (as is recommended by Microsoft), I will try with another DNS when I find time, but I believe the problem is that the discovery is not done again at regular intervals automatically (as it is possible that also additional DNS servers might be down or slower to restart). I've read that Windows Server does the scanning regularly. — user121391, Sep 09 '16 at 10:36

Jim Klimov · Answer 1 · 2020-08-08T22:03:17.737

Thanks for detailing this issue so well in the question. I've recently hit it too, with the caveat that the MS AD controller is a VM running on the (post-)OpenSolaris host, and as Win/AD support dwindles at that site, this is the only surviving replica. So when it is early in boot (and idmap is part of dependency tree leading up to multi-user-server) the VM is not yet running and idmap fails to contact it just like you posted, and smbd complains. I guess this is better than failing to boot the server completely because of a hiccup contacting the AD.

I think as a direct answer to your question, there are other logging daemons :) For example, rsyslog can trigger commands if a logged message matches some pattern you set.

With SMF svc logs being not quite syslogged, I guess one would have to make an ad-hoc script for this situation to check a particular log file in /var/svc/log/... and restart the service.

Also can do a delayed action (e.g. an init-script that would echo "svcadm restart idmap" | at now + 10min) to always kick this after boot-up, which I guess is what I'll do here.

Finally, one might add some activity into the VM startup script to kick idmap after the VM has booted (and so not rely on purely hardcoded timing) - but not as part of SMF dependency tree since idmap is needed so early in boot... or at least not as a strict dependency, but something like a restart_on...

How to automatically restart Opensolaris/OmniOS service/daemon that logs notices, but not critical errors to syslog?

Specific situation: idmap does not periodically rescan for domain controllers

General problem: how to efficiently react to syslog messages if the process seems to be running normally?

1 Answers1