0
I have four SQL Agent jobs that are running continuously. Each job is executing a stored procedure which queries recent ATM transactions, all via the same linked server. This happens in a loop within each job with a 5 second delay between proc executions (using WAITFOR DELAY).
Specifically, one of the four jobs is using one stored procedure and the other three are all using another. Each job execution has its own set of parameters related to specific patterns of transactions that will trigger a response. They are all querying the same linked server.
Most of the time, everything is working fine. But occasionally (almost*) any of the jobs will fail with this error:
TCP Provider: The specified network name is no longer available. [SQLSTATE 42000] (Error 64) OLE DB provider "SQLNCLI11" for linked server "ATMDB" returned message "Communication link failure". [SQLSTATE 01000] (Error 7412).
This only started happening after the jobs were updated to run continuously with the WAITFOR loop, rather than running once every minute as a separate SQL Agent job execution. This change was made to avoid excessive SQL job logging and also to monitor the ATM transactions much closer to real-time.
There is no consistency with how often they will fail, but it is at least a few times a day.
*I say "almost" above, because one of the jobs has never failed. This is the one that most often gets "hits" and when that occurs the job step is allowed to complete so we have explicit logging of that in the SQL job logs. (It then loops back to the first step again after a 1 second delay.)
All of the jobs are configured to run every minute. So after a failure the job will restart at the top of the next minute, therefore this isn't impacting things too badly. It is just very annoying!
I suspect that the SQL Native Client and how it implements connection pooling is the root of the issue. While testing, I updated the linked server in one of the two procedures to be different than the other. Two different host names, but really still the same server (I just used the hosts file to create a new name). This caused the one job to no longer fail, but two out of the three other jobs would still intermittently fail. After switching the other procedure to also use the same linked server again, now all jobs again would fail occasionally. Because connection pools are specific to the connection string, a different server name would cause a different pool to be used.
Using Windows resource monitor, I can see that there are usually three TCP connections open to the linked server. These stay open on the same ports over multiple minutes, so I'm guessing that it is using a connection pool of some kind. When the error happens, it coincides with one of the connections being closed and shortly after a new connection will open.
My current theory: because the processes are looping repeatedly while using the linked server, something about the connection pool causes a situation where it is trying to use a connection that has been closed for whatever reason. Perhaps the one job step that does "complete" every so often gets refreshed in some way and prevents the error from happening on that specific process.
I tried wrapping the stored procedure calls within sp_executesql as a workaround to "force" another process or execution context, but that didn't help.
Both servers are Windows 2012, running in VMWare.
Any ideas on how to further troubleshoot?