We had an issue on one of our production database nodes which caused it to fail over to another node. This caused a brief outage whilst the service rolled. This temporary glitch, however, was enough to cause some AOS servers to crash (i.e. the services were still showing as running, but we couldn't connect to them through AX until we'd restarted the AOS services).
NB: our application files are also hosted on our SQL cluster (on a separate disk). The idea is that rather than relying on a single share to hold these files, we have them on a clustered share, so that should the server (node) hosting the files fail, they remain available. I point this out as it may have been the brief loss of connection to these files which caused the issue, instead of or as well as the loss of the DB connection. We had toyed with the idea of having each AOS server hold a copy of the application files locally (i.e. so if one AOS goes down we only lose that one; the others aren't affected by the loss of its copy of the application files), but a number of consultants advised against this, quoting MS best practices.
However, it seems odd that a temporary connectivity glitch should cause such an issue; I'd have expected the error to simply be caught and for AX to retry connecting to these resources until the connection was re-established.
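To illustrate the kind of behaviour I'd have expected, here is a minimal sketch in Python. It is not X++ and has nothing to do with how the AOS is actually implemented; `connect_with_retry`, `connect` and the delay values are made-up names and numbers purely for illustration.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between attempts.

    `connect` is a hypothetical zero-argument callable that raises
    ConnectionError while the resource (DB or file share) is unavailable.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                # The outage outlasted the retry budget, so give up and
                # let the caller see the original error.
                raise
            # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults.
            sleep(base_delay * 2 ** (attempt - 1))
```

With something along these lines, a failover lasting a few seconds would show up as a short pause rather than a service that needs a manual restart.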
Does anyone know of a hotfix for this issue? Has anyone else had this problem & devised a workaround?
Thanks in advance.