Overview
We have a DAG with two nodes running Exchange 2016 CU12 on up-to-date Windows Server 2016.
Search fails on every mailbox in every database mounted on one of the nodes. The other node works properly.
Event 1012 is reported repeatedly by MSExchangeIS on the affected host, with the following content (this particular instance is generated repeatedly by the SearchQueryStxProbe health monitor):
Exchange Server Information Store has encountered an error while executing a full-text index query ("and(subject:string("SearchQueryStxProbe*", mode="and"), folderid:string("3753F38349D8A943AE346EACDBD8B91300000000010C0000"))"). Error information: System.ServiceModel.EndpointNotFoundException: The message could not be dispatched because the service at the endpoint address 'net.pipe://localhost/3867' is unavailable for the protocol of the address.
The problem appears to be completely unrelated to the content index. The event itself and further diagnostics suggest that some part of the search service is not running properly on this specific host.
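Since Event 1012 covers many different failures, it can help to pull the endpoint address out of each event message and see whether every occurrence points at the same one. Below is a minimal Python sketch of that, assuming the message text has been exported as a plain string; the function name and the regular expression are my own illustration, not part of any Exchange tooling.

```python
import re

# Captures the quoted endpoint address in the EndpointNotFoundException text.
# The scheme here is net.pipe, i.e. the WCF named-pipe transport.
ENDPOINT_RE = re.compile(
    r"endpoint address '(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://"
    r"(?P<host>[^/']+)/(?P<path>[^']*)'"
)


def extract_endpoint(event_message):
    """Return (scheme, host, path) of the endpoint named in the event, or None."""
    match = ENDPOINT_RE.search(event_message)
    if match is None:
        return None
    return match.group("scheme"), match.group("host"), match.group("path")


# Tail of the event text quoted above.
sample = (
    "Error information: System.ServiceModel.EndpointNotFoundException: "
    "The message could not be dispatched because the service at the endpoint "
    "address 'net.pipe://localhost/3867' is unavailable for the protocol of "
    "the address."
)

if __name__ == "__main__":
    print(extract_endpoint(sample))
```

Running this over all exported 1012 events and counting the distinct results would show whether the failures all involve the same endpoint or several.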
Checklist
- There are 48 databases. Symptoms are present on all of them equally, as long as they are mounted on the affected host.
- ContentIndexState is reported healthy on all databases by both hosts.
- Search probes SearchQueryFailureMonitor and SearchQueryStxMonitor return an unhealthy state on the affected host.
- Test-ExchangeSearch returns literally nothing on either of the hosts. No result objects, no errors, nothing but a progress bar for a while. I have never used this tool, so I don't know what to expect.
- Microsoft knowledgebase on the Search Health Set is a joke (in mild words).
- Problem is unaffected by service- or server-level restarts.
- Search works for all databases once they are moved to the second DAG node.
Google does return numerous posts on a wide variety of issues resulting in Event 1012. Unfortunately, Event 1012 apparently covers a wide range of problems. No post matches my event details or presents similar side symptoms while also providing a solution or a clue as to what to look for.
Comparative analysis
Lacking any reasonable documentation, I limited further steps to a comparative analysis of the two hosts: the healthy one and the failing one.
Following the event data, I checked for the TCP 3867 binding. On the failing host, the port is unbound. On the healthy host, the port is bound by an instance of the service-run noderunner.exe process, the one with the following arguments:
"C:\Program Files\Microsoft\Exchange Server\V15\Bin\Search\Ceres\Runtime\1.0\NodeRunner.exe"
--noderoot "C:\Program Files\Microsoft\Exchange Server\V15\Bin\Search\Ceres\HostController\Data\Nodes\Fsis\IndexNode1"
--addfrom "C:\Program Files\Microsoft\Exchange Server\V15\Bin\Search\Ceres\HostController\Data\Nodes\Fsis\IndexNode1\Configuration\Local\Node.ini"
--tracelog "C:\Program Files\Microsoft\Exchange Server\V15\Bin\Search\Ceres\HostController\Data\Nodes\Fsis\IndexNode1\Logs\NodeRunner.log"
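For anyone wanting to repeat the port check on both hosts, here is a minimal Python sketch. It only tests whether something accepts TCP connections on the loopback address, so it assumes, as I did above, that 3867 is a TCP listener; it does not tell you which process owns the port. The function name is hypothetical.

```python
import socket


def is_tcp_port_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise,
        # e.g. "connection refused" when nothing is bound to the port.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("TCP 3867 listening:", is_tcp_port_listening(3867))
```

Run locally on each DAG node; per the observation above it should report True on the healthy host and False on the failing one.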
I've compared the referenced files and paths on both hosts:
- The NodeRunner.log file is not being generated on either node.
- File structure is identical and average file sizes are similar.
- All plain text files show identical content, barring the host name references.
- File permissions are identical.
- File permissions are identical.
Thus, no obvious differences. Also, no significant difference between search catalogs on replicated databases.
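The file comparison above was done by hand. A rough way to automate it is sketched below in Python, assuming the node directory from each host has been copied somewhere both copies can be read, and that the text files are ASCII or UTF-8 encoded (a UTF-16 file would not have its host name masked). The function names and the `<HOST>` placeholder are my own; the sketch reads each file fully into memory, so it suits configuration folders rather than index data.

```python
import hashlib
import os


def snapshot_tree(root, host_name):
    """Map each relative file path under root to (size, digest).

    The digest is computed after replacing host_name with a placeholder,
    so files that differ only in host name references compare equal.
    """
    needle = host_name.encode("utf-8")
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as handle:
                data = handle.read()
            normalized = data.replace(needle, b"<HOST>")
            snapshot[rel] = (len(data), hashlib.sha256(normalized).hexdigest())
    return snapshot


def diff_snapshots(healthy, failing):
    """Return (only_on_healthy, only_on_failing, content_differs) path lists."""
    only_healthy = sorted(set(healthy) - set(failing))
    only_failing = sorted(set(failing) - set(healthy))
    changed = sorted(
        path
        for path in set(healthy) & set(failing)
        if healthy[path][1] != failing[path][1]
    )
    return only_healthy, only_failing, changed
```

Calling `snapshot_tree` once per host copy (each with its own host name) and passing both results to `diff_snapshots` lists files missing on either side and files whose content differs beyond the host name.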
Question
Has anyone had a similar problem? Has anyone solved it? Does anyone have a hint where to look? Are there any log files or diagnostic tools?